When it comes to coordinating people around a goal, you don't get limitless communication bandwidth for conveying arbitrarily nuanced messages. Instead, the "amount of words" you get to communicate depends on how many people you're trying to coordinate. Once you have enough people... you don't get many words.

eggsyntax
Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).

The paper (which I'm still reading; it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read, I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm not an expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing.

Remaining gaps I've thought of so far:

  • What's lurking in the remaining reconstruction loss? Are there important missing features?
  • Will SAEs get all meaningful features given adequate dictionary size?
  • Are there important features which SAEs just won't find because they're not that sparse?
  • Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
  • How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be 'ability to predict model output given context + feature activations'?
  • Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs?
    • e.g. if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email
    • e.g. if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
  • Do we find ways to make SAEs efficient enough to be scaled to production models with a sufficient number of features?
    • (as opposed to the paper under discussion, where 'The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive')

Of course LLM alignment isn't necessarily sufficient on its own for safety, since e.g. scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I'm just thinking here about what I'd want to see to feel confident that we could use these techniques to do the LLM alignment portion.

1. ^ I think I'd be pretty surprised if it kept working much past human-level, although I haven't spent a ton of time thinking that through as yet.
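For concreteness, here is a minimal sketch of what "SAE-based steering" typically refers to: encode an intermediate activation into sparse feature activations, rescale the feature of interest, decode, and carry the SAE's reconstruction error along so the intervention changes only the targeted feature. This is my own illustrative sketch, not Anthropic's code; the dimensions, weights, and feature index below are all placeholders (the paper's real dictionaries run to roughly 34M features).

```python
import torch

# Toy dimensions for illustration; the paper's dictionary reportedly scales to
# ~34M features on Claude 3 Sonnet, far too large for a toy script like this.
d_model, n_features = 512, 4096

# Placeholder weights; a real run would load a trained SAE's parameters.
W_enc = torch.randn(d_model, n_features) * 0.02
W_dec = torch.randn(n_features, d_model) * 0.02
b_enc = torch.zeros(n_features)
b_dec = torch.zeros(d_model)

def encode(x: torch.Tensor) -> torch.Tensor:
    """Sparse (ReLU-gated) feature activations for a residual-stream vector x."""
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def decode(f: torch.Tensor) -> torch.Tensor:
    """Reconstruct the residual-stream vector from feature activations."""
    return f @ W_dec + b_dec

def steer(x: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    """Rescale one feature (0.0 ablates it, >1.0 amplifies it) and patch the
    modified reconstruction back in, keeping the SAE's reconstruction error so
    that everything the SAE fails to capture is left untouched."""
    f = encode(x)
    error = x - decode(f)      # what the SAE misses ("remaining reconstruction loss")
    f[feature_idx] *= scale
    return decode(f) + error

# Example: ablate hypothetical feature 123 in a (random, stand-in) activation.
steered = steer(torch.randn(d_model), feature_idx=123, scale=0.0)
```

Several of the open questions above (what's lurking in the reconstruction loss, whether steering distorts unrelated behaviour) show up directly as the `error` term and the choice of `scale` in a sketch like this.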
jacquesthibs
I would find it valuable if someone could gather an easy-to-read bullet point list of all the questionable things Sam Altman has done throughout the years. I usually link to Gwern’s comment thread (https://www.lesswrong.com/posts/KXHMCH7wCxrvKsJyn/openai-facts-from-a-weekend?commentId=toNjz7gy4rrCFd99A), but I would prefer if there was something more easily-consumable.
A problem with overly kind PR is that many people know you don't deserve the reputation. So if you start to fall, you can fall hard and fast. Likewise, it incentivises investigation into a reputation you can't back up. If everyone thinks I am lovely, but I am two-faced, I create a juicy story any time I am cruel. Not so if I am known to be grumpy.

E.g. my sense is that EA did this a bit with the press tour around What We Owe The Future. It built up a sense of wisdom that wasn't necessarily deserved, so with FTX it all came crashing down.

Personally I don't want you to think I am kind and wonderful. I am often thoughtless and grumpy. I think you should expect a mediocre to good experience. But I'm not Santa Claus.

I am never sure whether rats are very wise or very naïve to push for reputation over PR, but I think it's much more sustainable. @ESYudkowsky can't really take a fall for being goofy. He's always been goofy - it was priced in.

Many organisations think they are above maintaining the virtues they profess to possess, instead managing it with media relations. In doing this they often fall harder eventually. Worse, they lose out on the feedback from their peers accurately seeing their current state.

Journalists often frustrate me as a group, but they aren't dumb. Whatever they think is worth writing, they probably have a deeper sense of what is going on. Personally I'd prefer to get that in small sips, such that I can grow, than to have to drain my cup to the bottom.
robo
Our current big stupid: not preparing for 40% agreement

Epistemic status: lukewarm take from the gut (not brain) that feels rightish.

The "Big Stupid" of the AI doomers 2013-2023 was that AI nerds' solution to the problem "How do we stop people from building dangerous AIs?" was "research how to build AIs". Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.

Take: The "Big Stupid" of right now is still the same thing. (We've not corrected enough.) Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say, if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?

Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people/governments could stop bad AIs from being built. [Example: if 40% of people were as worried about AI as I was, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can't unilaterally sanction, that sort of thing.]

TLDR: stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.

*My research included 😬
On the OpenPhil / OpenAI Partnership

Epistemic Note: The implications of this argument being true are quite substantial, and I do not have any knowledge of the internal workings of Open Phil. (Both title and this note have been edited, cheers to Ben Pace for very constructive feedback.)

Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

Premise 2: This was the default outcome. Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule. Edit: To clarify, you need to be skeptical of seemingly altruistic statements and commitments made by humans when there are exceptionally lucrative incentives to break these commitments at a later point in time (and limited ways to enforce the original commitment).

Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.

Conclusion: Anyone and everyone involved with Open Phil recommending that a $30 million grant be given to OpenAI in 2017 shouldn't be allowed anywhere near AI Safety decision making in the future. To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties. This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved.

To quote OpenPhil: "OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela."

Popular Comments

Recent Discussion

This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

  • Give feedback on how this project is helpful, or on how it could change to be much more helpful
  • Tell me what's wrong/missing; point me to sources
...
Wei Dai

Unfortunately I don't have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I'm not personally aware of such writings. To brainstorm some ideas:

  1. Conduct and publish anonymous surveys of employee attitudes about safety.
  2. Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
  3. Officially encourage (e.g. via financial rewards) internal and external whistleblowers. E
... (read more)
eggsyntax
Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).  The paper (which I'm still reading, it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm not expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing.Remaining gaps I've thought of so far:   * What's lurking in the remaining reconstruction loss? Are there important missing features? * Will SAEs get all meaningful features given adequate dictionary size? * Are there important features which SAEs just won't find because they're not that sparse? * Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems? * How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be 'ability to predict model output given context + feature activations'? * Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs? * eg if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email * eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition * Do we find ways to make SAEs efficient enough to be scaled to production models wi

I wrote up a short post with a summary of their results. It doesn't really answer any of your questions. I do have thoughts on a couple, even though I'm not an expert on interpretability.

But my main focus is on your footnote: is this going to help much with aligning "real" AGI (I've been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning? :). I'm of course primarily thinking of foundation models scaffolded to have goals and cognitive routines, and to incorporate multiple AI systems such as an episodic memory system. I think ... (read more)

Meet inside The Shops at Waterloo Town Square - we will congregate at 7pm for 15 minutes in the seating area next to the Valu-Mart (the seating area with the trees sticking out in the middle of the benches), and then head over to my nearby apartment's amenity room. If you've been around a few times, feel free to meet at my apartment's front door at 7:30 instead. (There is free city parking at Bridgeport and Regina, 22 Bridgeport Rd E.)

Event

It's been a while since the last one, so I'm running another session of authentic relating games!

Things to expect and prepare for, for those who haven't been to one of these before: edgy questions, physical touch, emotional connection, and a heightened sense of self-awareness. You can of course opt out from any individual game.

For more information, you can check out the Authentic Relating Games Mini-Manual for free on Gumroad, or just message me :)

Linkposting a writeup of what I've learned from helping family members augment their investments. I encourage LessWrong users to check it out; I expect the post contains new and actionable information for a number of readers.

Thanks in advance for any comments or feedback that can help the post be more useful to others!

It would be more useful with a little more info on what ideas you're offering; linkposts with more description get more clickthrough. You can edit that in.

This is the first post in a little series I'm slowly writing on how I see forecasting, particularly conditional forecasting; what it's good for; and whether we should expect people to agree if they just talk to each other enough.

Views are my own. I work at the Forecasting Research Institute (FRI), I forecast with the Samotsvety group, and to the extent that I have formal training in this stuff, it's mostly from studying and collaborating with Leonard Smith, a chaos specialist.

My current plan is:

  1. Forecasting: the way I think about it [this post]
  2. The promise of conditional forecasting / cruxing for parameterizing our models of the world
  3. What we're looking at and what we're paying attention to (Or: why we shouldn't expect people to agree today (Or: there is no "true" probability))

What...

Molly

Figure 1 is clumsy, sorry. In the case of a smooth probability distribution of infinite worlds, I think the median and the average world are the same? But in practice, yes, it's an expected value calculation, summing P(world) * P(U|world) for all the worlds you've thought about.
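Spelled out (my notation, not the original commenter's, with $w_i$ ranging over the worlds considered and $U$ the outcome in question), that sum is just

$$P(U) \;\approx\; \sum_{i} P(w_i)\, P(U \mid w_i),$$

with the approximation coming from only summing over the worlds you've actually thought about.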

Epistemic status: the stories here are all as true as possible from memory, but my memory is so-so.

[Image: An AI made this]

This is going to be big

It’s late Summer 2017. I am on a walk in the Mendip Hills. It’s warm and sunny and the air feels fresh. With me are around 20 other people from the Effective Altruism London community. We’ve travelled west for a retreat to discuss how to help others more effectively with our donations and careers. As we cross cow field after cow field, I get talking to one of the people from the group I don’t know yet. He seems smart, and cheerful. He tells me that he is an AI researcher at Google DeepMind. He explains how he is thinking about...

I can't recall another time when someone shared their personal feelings and experiences and someone else declared it "propaganda and alarmism". I haven't seen "zero-risker" types do the same, but I would be curious to hear the tale and, if they share it, I don't think anyone will call it "propaganda and killeveryoneism".


This is a quickly written opinion piece on what I understand about OpenAI. I first posted it to Facebook, where it had some discussion.

 

Some arguments that OpenAI is making, simultaneously:

  1. OpenAI will likely reach and own transformative AI (useful for attracting talent to work there).
  2. OpenAI cares a lot about safety (good for public PR and government regulations).
  3. OpenAI isn’t making anything dangerous and is unlikely to do so in the future (good for public PR and government regulations).
  4. OpenAI doesn’t need to spend many resources on safety, and implementing safe AI won’t put it at any competitive disadvantage (important for investors who own most of the company).
  5. Transformative AI will be incredibly valuable for all of humanity in the long term (for public PR and developers).
  6. People at OpenAI have thought long and
...

Broadly agree except for this part:

 It's in an area that some people (not the OpenAI management) think is unusually high-risk,

I really can't imagine that someone who wrote "Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity." in 2015, and who occasionally references extinction as a possibility even when not directly asked about it, doesn't think AGI development is high risk.

I'm not sure how to square this circle. I almost hope Sam is being consciously dishonest and has a 4D chess plan, as opposed to ... (read more)

Steven Byrnes
I think Yann LeCun thinks "AGI in 2040 is perfectly plausible", AND he believes "AGI is so far away it's not worth worrying about all that much". It's a really insane perspective IMO. As recently as like 2020, "AGI within 20 years" was universally (correctly) considered to be a super-soon forecast calling for urgent action, as contrasted with the people who say "centuries".
O O
I recall him saying this on Twitter, linking a person in a leadership position who runs things there. I don't know how to search for that.
Rebecca
My impression is that post-board drama, they’ve de-emphasised the non-profit messaging. Also in a more recent interview Sam said basically ‘well I guess it turns out the board can’t fire me’ and that in the long term there should be democratic governance of the company. So I don’t think it’s true that #8-10 are (still) being pushed simultaneously with the others. I also haven’t seen anything that struck me as communicating #3 or #11, though I agree it would be in OpenAI’s interest to say those things. Can you say more about where you are seeing that?

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushant jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment.

The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human autonomy and human flourishing. In the course of articulating our mission and positioning ourselves -- a young organization -- in the landscape of AI risk orgs, we’ve come to notice what we think are serious conceptual problems with the prevalent vocabulary of ‘AI alignment.’ This essay will discuss some of the major ways in which we think the concept of ‘alignment’ creates bias and confusion, as well as our own search for clarifying concepts. 

At AOI, we try to...

I think you're right about these drawbacks of using the term "alignment" so broadly. And I agree that more work and attention should be devoted to specifying how we suppose these concepts relate to each other. In my experience, far too little effort is devoted to placing scientific work within its broader context. We cannot afford to waste effort in working on alignment.

I don't see a better alternative, nor do you suggest one. My preference in terminology is to simply use more specification, rather than trying to get anyone to change the terminology they u... (read more)

Seth Herd
I think you're right about these drawbacks of the widespread use of the term alignment for
Lucas Teixeira
For clarity, how do you distinguish between P1 & P4?
quiet_NaN
I think that "AI Alignment" is a useful label for the somewhat related problems around P1-P6. Having a term for the broader thing seems really useful.  Of course, sometimes you want labels to refer to a fairly narrow thing, like the label "Continuum Hypothesis". But broad labels are generally useful. Take "ethics", another broad field label. Nominative ethics, applied ethics, meta-ethics, descriptive ethics, value theory, moral psychology, et cetera. I someone tells me "I study ethics" this narrows down what problems they are likely to work on, but not very much. Perhaps they work out a QALY-based systems for assigning organ donations, or study the moral beliefs of some peoples, or argue if moral imperatives should have a truth value. Still, the label confers a lot of useful information over a broader label like "philosophy".  By contrast, "AI Alignment" still seems rather narrow. P2 for example seems a mostly instrumental goal: if we have interpretability, we have better chances to avoid a takeover of an unaligned AI. P3 seems helpful but insufficient for good long term outcomes: an AI prone to disobeying users or interpreting their orders in a hostile way would -- absent some other mechanism -- also fail to follow human values more broadly, but an P3-aligned AI in the hand of a bad human actor could still cause extinction, and I agree that social structures should probably be established to ensure that nobody can unilaterally assign the core task (or utility function) of an ASI. 

Two jobs in AI Safety Advocacy that AFAICT don't exist, but should and probably will very soon. Will EAs be the first to create them, though? There is a strong first-mover advantage waiting for someone -

1. Volunteer Coordinator - there will soon be a groundswell from the general population wanting to have a positive impact in AI. Most won't know how to. A volunteer manager will help capture and direct their efforts positively, for example, by having them write emails to politicians

2. Partnerships Manager - the President of the Voice Actors guild reached out... (read more)

LessOnline Festival

May 31st to June 2nd, Berkeley CA