When it comes to coordinating people around a goal, you don't get limitless communication bandwidth for conveying arbitrarily nuanced messages. Instead, the "amount of words" you get to communicate depends on how many people you're trying to coordinate. Once you have enough people... you don't get many words.

eggsyntax
Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).

The paper (which I'm still reading; it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read, I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm not an expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing.

Remaining gaps I've thought of so far:

  • What's lurking in the remaining reconstruction loss? Are there important missing features?
  • Will SAEs get all meaningful features given adequate dictionary size?
  • Are there important features which SAEs just won't find because they're not that sparse?
  • Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems?
  • How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be 'ability to predict model output given context + feature activations'?
  • Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs?
    • e.g. if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email
    • e.g. if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition
  • Do we find ways to make SAEs efficient enough to be scaled to production models with a sufficient number of features?
    • (as opposed to the paper under discussion, where 'The features we found represent a small subset of all the concepts learned by the model during training, and finding a full set of features using our current techniques would be cost-prohibitive')

Of course LLM alignment isn't necessarily sufficient on its own for safety, since e.g. scaffolded LLM-based agents introduce risk even if the underlying LLM is well-aligned. But I'm just thinking here about what I'd want to see to feel confident that we could use these techniques to do the LLM alignment portion.

1. ^ I think I'd be pretty surprised if it kept working much past human-level, although I haven't spent a ton of time thinking that through as yet.
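For concreteness, here is a minimal sketch of what "SAE-based steering" typically refers to: encode an intermediate activation into sparse feature activations, rescale the feature of interest, decode, and carry the SAE's reconstruction error along so the intervention changes only the targeted feature. This is my own illustrative sketch, not Anthropic's code; the dimensions, weights, and feature index below are all placeholders (the paper's real dictionaries run to roughly 34M features).

```python
import torch

# Toy dimensions for illustration; the paper's dictionary reportedly scales to
# ~34M features on Claude 3 Sonnet, far too large for a toy script like this.
d_model, n_features = 512, 4096

# Placeholder weights; a real run would load a trained SAE's parameters.
W_enc = torch.randn(d_model, n_features) * 0.02
W_dec = torch.randn(n_features, d_model) * 0.02
b_enc = torch.zeros(n_features)
b_dec = torch.zeros(d_model)

def encode(x: torch.Tensor) -> torch.Tensor:
    """Sparse (ReLU-gated) feature activations for a residual-stream vector x."""
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def decode(f: torch.Tensor) -> torch.Tensor:
    """Reconstruct the residual-stream vector from feature activations."""
    return f @ W_dec + b_dec

def steer(x: torch.Tensor, feature_idx: int, scale: float) -> torch.Tensor:
    """Rescale one feature (0.0 ablates it, >1.0 amplifies it) and patch the
    modified reconstruction back in, keeping the SAE's reconstruction error so
    that everything the SAE fails to capture is left untouched."""
    f = encode(x)
    error = x - decode(f)      # what the SAE misses ("remaining reconstruction loss")
    f[feature_idx] *= scale
    return decode(f) + error

# Example: ablate hypothetical feature 123 in a (random, stand-in) activation.
steered = steer(torch.randn(d_model), feature_idx=123, scale=0.0)
```

Several of the open questions above (what's lurking in the reconstruction loss, whether steering distorts unrelated behaviour) show up directly as the `error` term and the choice of `scale` in a sketch like this.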
jacquesthibs
I would find it valuable if someone could gather an easy-to-read bullet point list of all the questionable things Sam Altman has done throughout the years. I usually link to Gwern’s comment thread (https://www.lesswrong.com/posts/KXHMCH7wCxrvKsJyn/openai-facts-from-a-weekend?commentId=toNjz7gy4rrCFd99A), but I would prefer if there was something more easily-consumable.
A problem with overly kind PR is that many people know you don't deserve the reputation. So if you start to fall, you can fall hard and fast. Likewise, it incentivises investigation into a reputation you can't back up. If everyone thinks I am lovely, but I am two-faced, I create a juicy story any time I am cruel. Not so if I am known to be grumpy.

E.g. my sense is that EA did this a bit with the press tour around What We Owe The Future. It built up a sense of wisdom that wasn't necessarily deserved, so with FTX it all came crashing down.

Personally I don't want you to think I am kind and wonderful. I am often thoughtless and grumpy. I think you should expect a mediocre to good experience. But I'm not Santa Claus.

I am never sure whether rats are very wise or very naïve to push for reputation over PR, but I think it's much more sustainable. @ESYudkowsky can't really take a fall for being goofy. He's always been goofy - it was priced in.

Many organisations think they are above maintaining the virtues they profess to possess, instead managing it with media relations. In doing this they often fall harder eventually. Worse, they lose out on the feedback from their peers accurately seeing their current state.

Journalists often frustrate me as a group, but they aren't dumb. Whatever they think is worth writing, they probably have a deeper sense of what is going on. Personally I'd prefer to get that in small sips, such that I can grow, than to have to drain my cup to the bottom.
robo
Our current big stupid: not preparing for 40% agreement

Epistemic status: lukewarm take from the gut (not brain) that feels rightish.

The "Big Stupid" of the AI doomers 2013-2023 was that AI nerds' solution to the problem "How do we stop people from building dangerous AIs?" was "research how to build AIs". Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.

Take: The "Big Stupid" of right now is still the same thing. (We've not corrected enough.) Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say, if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?

Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people/governments could stop bad AIs from being built. [Example: if 40% of people were as worried about AI as I was, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can't unilaterally sanction, that sort of thing.]

TLDR: stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.

*My research included 😬
On the OpenPhil / OpenAI Partnership

Epistemic Note: The implications of this argument being true are quite substantial, and I do not have any knowledge of the internal workings of Open Phil. (Both title and this note have been edited, cheers to Ben Pace for very constructive feedback.)

Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.

Premise 2: This was the default outcome. Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule. Edit: To clarify, you need to be skeptical of seemingly altruistic statements and commitments made by humans when there are exceptionally lucrative incentives to break these commitments at a later point in time (and limited ways to enforce the original commitment).

Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.

Conclusion: Anyone and everyone involved with Open Phil recommending that a $30 million grant be given to OpenAI in 2017 shouldn't be allowed anywhere near AI Safety decision making in the future. To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties. This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved.

To quote OpenPhil: "OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela."

Popular Comments

Recent Discussion

This is a linkpost for https://ailabwatch.org

I'm launching AI Lab Watch. I collected actions for frontier AI labs to improve AI safety, then evaluated some frontier labs accordingly.

It's a collection of information on what labs should do and what labs are doing. It also has some adjacent resources, including a list of other safety-ish scorecard-ish stuff.

(It's much better on desktop than mobile — don't read it on mobile.)

It's in beta—leave feedback here or comment or DM me—but I basically endorse the content and you're welcome to share and discuss it publicly.

It's unincorporated, unfunded, not affiliated with any orgs/people, and is just me.

Some clarifications and disclaimers.

How you can help:

  • Give feedback on how this project is helpful, or on how it could change to be much more helpful
  • Tell me what's wrong/missing; point me to sources
...
Wei Dai

Unfortunately I don't have well-formed thoughts on this topic. I wonder if there are people who specialize in AI lab governance and have written about this, but I'm not personally aware of such writings. To brainstorm some ideas:

  1. Conduct and publish anonymous surveys of employee attitudes about safety.
  2. Encourage executives, employees, board members, advisors, etc., to regularly blog about governance and safety culture, including disagreements over important policies.
  3. Officially encourage (e.g. via financial rewards) internal and external whistleblowers. E
... (read more)
eggsyntax
Anthropic's new paper 'Mapping the Mind of a Large Language Model' is exciting work that really advances the state of the art for the sparse-autoencoder-based dictionary learning approach to interpretability (which switches the unit of analysis in mechanistic interpretability from neurons to features). Their SAE learns (up to) 34 million features on a real-life production model, Claude 3 Sonnet (their middle-sized Claude 3 model).  The paper (which I'm still reading, it's not short) updates me somewhat toward 'SAE-based steering vectors will Just Work for LLM alignment up to human-level intelligence[1].' As I read I'm trying to think through what I would have to see to be convinced of that hypothesis. I'm not expert here! I'm posting my thoughts mostly to ask for feedback about where I'm wrong and/or what I'm missing.Remaining gaps I've thought of so far:   * What's lurking in the remaining reconstruction loss? Are there important missing features? * Will SAEs get all meaningful features given adequate dictionary size? * Are there important features which SAEs just won't find because they're not that sparse? * Is steering on clearly safety-relevant features sufficient, or are there interactions between multiple not-clearly-safety-relevant features that in combination cause problems? * How well do we even think we understand feature compositionality, especially across multiple layers? How would we measure that? I would think the gold standard would be 'ability to predict model output given context + feature activations'? * Does doing sufficient steering on safety-relevant features cause unacceptable distortions to model outputs? * eg if steering against scam emails causes the model to see too many emails as scammy and refuse to let you write a billing email * eg if steering against power-seeking causes refusal on legitimate tasks that include resource acquisition * Do we find ways to make SAEs efficient enough to be scaled to production models wi

I wrote up a short post with a summary of their results. It doesn't really answer any of your questions. I do have thoughts on a couple, even though I'm not an expert on interpretability.

But my main focus is on your footnote: is this going to help much with aligning "real" AGI (I've been looking for a term; maybe REAL stands for Reflective Entities with Agency and Learning? :). I'm of course primarily thinking of foundation models scaffolded to have goals and cognitive routines, and to incorporate multiple AI systems such as an episodic memory system. I think ... (read more)

Meet inside The Shops at Waterloo Town Square - we will congregate at 7pm for 15 minutes in the seating area next to the Valu-Mart (the seating area with the trees sticking out in the middle of the benches), and then head over to my nearby apartment's amenity room. If you've been around a few times, feel free to meet at my apartment's front door at 7:30 instead. (There is free city parking at Bridgeport and Regina, 22 Bridgeport Rd E.)

Event

It's been a while since the last one, so I'm running another session of authentic relating games!

Things to expect and prepare for, for those who haven't been to one of these before: edgy questions, physical touch, emotional connection, and a heightened sense of self-awareness. You can of course opt out from any individual game.

For more information, you can check out the Authentic Relating Games Mini-Manual for free on Gumroad, or just message me :)

Linkposting a writeup of what I've learned from helping family members augment their investments. I encourage LessWrong users to check it out; I expect the post contains new and actionable information for a number of readers.

Thanks in advance for any comments or feedback that can help the post be more useful to others!

It would be more useful with a little more info on what ideas you're offering; linkposts with more description get more clickthrough. You can edit that in.

This is the first post in a little series I'm slowly writing on how I see forecasting, particularly conditional forecasting; what it's good for; and whether we should expect people to agree if they just talk to each other enough.

Views are my own. I work at the Forecasting Research Institute (FRI), I forecast with the Samotsvety group, and to the extent that I have formal training in this stuff, it's mostly from studying and collaborating with Leonard Smith, a chaos specialist.

My current plan is:

  1. Forecasting: the way I think about it [this post]
  2. The promise of conditional forecasting / cruxing for parameterizing our models of the world
  3. What we're looking at and what we're paying attention to (Or: why we shouldn't expect people to agree today (Or: there is no "true" probability))

What...

Molly

Figure 1 is clumsy, sorry. In the case of a smooth probability distribution of infinite worlds, I think the median and the average world are the same? But in practice, yes, it's an expected value calculation, summing P(world) * P(U|world) for all the worlds you've thought about.
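Spelled out (my notation, not the original commenter's, with $w_i$ ranging over the worlds considered and $U$ the outcome in question), that sum is just

$$P(U) \;\approx\; \sum_{i} P(w_i)\, P(U \mid w_i),$$

with the approximation coming from only summing over the worlds you've actually thought about.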

Epistemic status: the stories here are all as true as possible from memory, but my memory is so-so.

[Image: An AI made this]

This is going to be big

It’s late Summer 2017. I am on a walk in the Mendip Hills. It’s warm and sunny and the air feels fresh. With me are around 20 other people from the Effective Altruism London community. We’ve travelled west for a retreat to discuss how to help others more effectively with our donations and careers. As we cross cow field after cow field, I get talking to one of the people from the group I don’t know yet. He seems smart, and cheerful. He tells me that he is an AI researcher at Google DeepMind. He explains how he is thinking about...

I can't recall another time when someone shared their personal feelings and experiences and someone else declared it "propaganda and alarmism". I haven't seen "zero-risker" types do the same, but I would be curious to hear the tale and, if they share it, I don't think anyone will call it "propaganda and killeveryoneism".


This is a quickly written opinion piece on what I understand about OpenAI. I first posted it to Facebook, where it had some discussion.

 

Some arguments that OpenAI is making, simultaneously:

  1. OpenAI will likely reach and own transformative AI (useful for attracting talent to work there).
  2. OpenAI cares a lot about safety (good for public PR and government regulations).
  3. OpenAI isn’t making anything dangerous and is unlikely to do so in the future (good for public PR and government regulations).
  4. OpenAI doesn’t need to spend many resources on safety, and implementing safe AI won’t put it at any competitive disadvantage (important for investors who own most of the company).
  5. Transformative AI will be incredibly valuable for all of humanity in the long term (for public PR and developers).
  6. People at OpenAI have thought long and
...

Broadly agree except for this part:

 It's in an area that some people (not the OpenAI management) think is unusually high-risk,

I really can't imagine that someone who wrote "Development of superhuman machine intelligence is probably the greatest threat to the continued existence of humanity." in 2015, and who occasionally references extinction as a possibility even when not directly asked about it, doesn't think AGI development is high risk.

I'm not sure how to square this circle. I almost hope Sam is being consciously dishonest and has a 4D chess plan, as opposed to ... (read more)

Steven Byrnes
I think Yann LeCun thinks "AGI in 2040 is perfectly plausible", AND he believes "AGI is so far away it's not worth worrying about all that much". It's a really insane perspective IMO. As recently as like 2020, "AGI within 20 years" was universally (correctly) considered to be a super-soon forecast calling for urgent action, as contrasted with the people who say "centuries".
O O
I recall him saying this on Twitter, linking a person in a leadership position who runs things there. I don't know how to search for that.
Rebecca
My impression is that post-board drama, they’ve de-emphasised the non-profit messaging. Also in a more recent interview Sam said basically ‘well I guess it turns out the board can’t fire me’ and that in the long term there should be democratic governance of the company. So I don’t think it’s true that #8-10 are (still) being pushed simultaneously with the others. I also haven’t seen anything that struck me as communicating #3 or #11, though I agree it would be in OpenAI’s interest to say those things. Can you say more about where you are seeing that?

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushant jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment.

The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human autonomy and human flourishing. In the course of articulating our mission and positioning ourselves -- a young organization -- in the landscape of AI risk orgs, we’ve come to notice what we think are serious conceptual problems with the prevalent vocabulary of ‘AI alignment.’ This essay will discuss some of the major ways in which we think the concept of ‘alignment’ creates bias and confusion, as well as our own search for clarifying concepts. 

At AOI, we try to...

I think you're right about these drawbacks of using the term "alignment" so broadly. And I agree that more work and attention should be devoted to specifying how we suppose these concepts relate to each other. In my experience, far too little effort is devoted to placing scientific work within its broader context. We cannot afford to waste effort in working on alignment.

I don't see a better alternative, nor do you suggest one. My preference in terminology is to simply use more specification, rather than trying to get anyone to change the terminology they u... (read more)

Seth Herd
I think you're right about these drawbacks of the widespread use of the term alignment for
Lucas Teixeira
For clarity, how do you distinguish between P1 & P4?
quiet_NaN
I think that "AI Alignment" is a useful label for the somewhat related problems around P1-P6. Having a term for the broader thing seems really useful.  Of course, sometimes you want labels to refer to a fairly narrow thing, like the label "Continuum Hypothesis". But broad labels are generally useful. Take "ethics", another broad field label. Nominative ethics, applied ethics, meta-ethics, descriptive ethics, value theory, moral psychology, et cetera. I someone tells me "I study ethics" this narrows down what problems they are likely to work on, but not very much. Perhaps they work out a QALY-based systems for assigning organ donations, or study the moral beliefs of some peoples, or argue if moral imperatives should have a truth value. Still, the label confers a lot of useful information over a broader label like "philosophy".  By contrast, "AI Alignment" still seems rather narrow. P2 for example seems a mostly instrumental goal: if we have interpretability, we have better chances to avoid a takeover of an unaligned AI. P3 seems helpful but insufficient for good long term outcomes: an AI prone to disobeying users or interpreting their orders in a hostile way would -- absent some other mechanism -- also fail to follow human values more broadly, but an P3-aligned AI in the hand of a bad human actor could still cause extinction, and I agree that social structures should probably be established to ensure that nobody can unilaterally assign the core task (or utility function) of an ASI. 

Two jobs in AI Safety Advocacy that AFAICT don't exist, but should and probably will very soon. Will EAs be the first to create them, though? There is a strong first-mover advantage waiting for someone -

1. Volunteer Coordinator - there will soon be a groundswell from the general population wanting to have a positive impact in AI. Most won't know how to. A volunteer manager will help capture and direct their efforts positively, for example, by having them write emails to politicians

2. Partnerships Manager - the President of the Voice Actors guild reached out... (read more)

LessOnline Festival

May 31st to June 2nd, Berkeley CA