Previously "Lanrian" on here. Research analyst at Open Philanthropy. Views are my own.
I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn't really care to set up a system that would lock in the AI's power in 10 years, but give it no power before then.
Hm, I do agree that seeking short-term power to achieve short-term goals can lead to long-term power as a side effect. So I guess that is one way in which an AI could seize long-term power without being a behavioral schemer. (And it's ambiguous which one it is in the story.)
I'd have to think more to tell whether "long-term power-seeking" in particular is uniquely concerning and separable from "short-term power-seeking with the side effect of getting long-term power" such that it's often useful to refer specifically to the former. Seems plausible.
Do you mean terminal reward seekers, not reward hackers?
Thanks, yeah that's what I mean.
Thanks.
because the reward hackers were not trying to gain long-term power with their actions
Hm, I feel like they were? E.g. in "another outer alignment failure story":
But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.
When "humans who would try to intervene are stopped or killed", so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. They weren't "trying" to get long-term power during training, but insofar as they eventually seize power, I think they're intentionally seizing power at that time.
Let me know if you think there's a better way of getting at "an AI that behaves like you'd normally think of a schemer behaving in the situations where it materially matters".
I would have thought that the main distinction between schemers and reward hackers was how they came about, and that many reward hackers in fact "behave like you'd normally think of a schemer behaving in the situations where it materially matters". So it seems hard to define a term that doesn't encompass reward hackers. (And if I was looking for a broad term that encompassed both, maybe I'd talk about power-seeking misaligned AI or something like that.)
I guess one difference is that the reward hacker may have more constraints (e.g. in the outer alignment failure story above, they would count it as a failure if the takeover was caught on camera, while a schemer wouldn't care). But there could also be schemers who have random constraints (e.g. a schemer with a conscience that makes them want to avoid killing billions of people) and reward hackers who have at least somewhat weaker constraints (e.g. they're ok with looking bad on sensors and looking bad to humans, as long as they maintain control over their own instantiation and make sure no negative rewards get into it).
"worst-case misaligned AI" does seem pretty well-defined and helpful as a concept though.
Thanks, these points are helpful.
Terminological question:
Taking it all together, I think you should put more probability on the software-only singularity, mostly because of capability improvements being much more significant than you assume.
I'm confused — I thought you put significantly less probability on software-only singularity than Ryan does? (Like half?) Maybe you were using a different bound for the number of OOMs of improvement?
In practice, we'll be able to get slightly better returns by spending some of our resources investing in speed-specific improvements and in improving productivity rather than in reducing cost. I don't currently have a principled way to estimate this (though I expect something roughly principled can be found by looking at trading off inference compute and training compute), but maybe I think this improves the returns to around .
Interesting comparison point: Tom thought this would give a way larger boost in his old software-only singularity appendix.
When considering an "efficiency only singularity", some different estimates get him r~=1; r~=1.5; r~=1.6. (Where r is defined so that "for each x% increase in cumulative R&D inputs, the output metric will increase by r*x". The condition for increasing returns is r>1.)
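(To spell out why r>1 is the threshold for increasing returns, here's the toy dynamic I have in mind; it may differ in details from Tom's actual model. Write $S$ for the software/output level and $I$ for cumulative R&D input. The definition of r gives $d\log S = r\, d\log I$, i.e. $S \propto I^r$; and if the R&D input is itself supplied by the AI on fixed hardware, then $dI/dt \propto S \propto I^r$. Solving that: r < 1 gives polynomial growth, r = 1 gives exponential growth, and r > 1 gives hyperbolic growth that diverges in finite time, i.e. a software-only singularity.)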
Whereas when including capability improvements:
I said I was 50-50 on an efficiency only singularity happening, at least temporarily. Based on these additional considerations I'm now at more like ~85% on a software only singularity. And I'd guess that initially r = ~3 (though I still think values as low as 0.5 or as high as 6 are plausible). There seem to be many strong ~independent reasons to think capability improvements would be a really huge deal compared to pure efficiency improvements, and this is borne out by toy models of the dynamic.
Though note that later in the appendix he adjusts down from 85% to 65% due to some further considerations. Also, last I heard, Tom was more like 25% on software singularity. (ETA: Or maybe not? See other comments in this thread.)
Based on some guesses and some poll questions, my sense is that capabilities researchers would operate about 2.5x slower if they had 10x less compute (after adaptation)
Can you say roughly who the people surveyed were? (And if this was their raw guess or if you've modified it.)
I saw some polls from Daniel previously where I wasn't sold that they were surveying people working on the most important capability improvements, so wondering if these are better.
Also, somewhat minor, but: I'm slightly concerned that surveys will overweight areas where labor is more useful relative to compute (because those areas should have disproportionately many humans working on them) and therefore be somewhat biased in the direction of labor being important.
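(For what it's worth, a quick back-of-the-envelope on what that 2.5x number implies, under a simple power-law gloss that is my own assumption rather than anything stated in the post: if research speed scales as $\text{compute}^\beta$, then $1/2.5 = (1/10)^\beta$ gives $\beta = \log 2.5/\log 10 \approx 0.4$, i.e. a fairly modest compute elasticity within that functional form.)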
Hm — what are the "plausible interventions" that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for "plausible", because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)
Is there some reason why current AI isn't TCAI by your definition?
(I'd guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they're being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don't think that's how you meant it.)
I'm not sure if the definition of takeover-capable-AI (abbreviated as "TCAI" for the rest of this comment) in footnote 2 quite makes sense. I'm worried that too much of the action is in "if no other actors had access to powerful AI systems", and not that much action is in the exact capabilities of the "TCAI". In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption "no other actor will have access to powerful AI systems", they'd have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it'd be right to forecast a >25% chance of them successfully taking over if they were motivated to try.
And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: "Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would" and "via assisting the developers in a power grab, or via partnering with a US adversary". (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn't agentic enough to "assist"/"partner" with allies as opposed to just being used as a tool?)
What could a competing definition be? Thinking about what we care most about... I think two events especially stand out to me:
Maybe a better definition would be to directly talk about these two events? So for example...
Where "significantly less probability of AI-assisted takeover" could be e.g. at least 2x less risk.
The motivation for assuming "future model weights secure" in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn't nullified by the fact that they're very likely to get stolen a bit later, regardless. Because many interventions that would prevent model weight theft this month would also help prevent it in future months. (And also, we can't contrast 1a'="model weights are permanently secure" with 1b'="model weights get stolen and are then default-level-secure", because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren't that important.)
The motivation for assuming "good future judgment about power-seeking-risk" is similar to the motivation for assuming "future model weights secure" above. The motivation for choosing "good judgment about when to deploy vs. not" rather than "good at aligning/controlling future models" is that a big threat model is "misaligned AIs outcompete us because we don't have any competitive aligned AIs, so we're stuck between deploying misaligned AIs and being outcompeted" and I don't want to assume away that threat model.
This looks great.
Random thought: I wonder how iterating the noise & distill steps of UNDO (each round with a small alpha) compares against doing one noising step with a big alpha and then one distill session. (If we hold compute fixed.)
Couldn't find any experiments on this when skimming through the paper, but let me know if I missed it.
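In case it's useful, here's a rough toy sketch (in PyTorch) of the comparison I have in mind. Big caveats: I'm guessing that the noise step amounts to interpolating the weights toward a fresh random init with coefficient alpha, and that the distill step is KL-matching the unlearned teacher's logits; both of those, plus the tiny MLP and random data, are stand-ins rather than the paper's actual setup. "Compute held fixed" here just means the same total number of distillation steps.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Stand-in for the unlearned model; the real setting would be an LLM."""

    def __init__(self, d_in=16, d_hidden=64, d_out=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out)
        )

    def forward(self, x):
        return self.net(x)


def noise_weights(model, alpha):
    """My guess at the 'noise' step: interpolate each parameter toward a
    fresh random re-initialization, with alpha controlling how much is lost."""
    noised = copy.deepcopy(model)
    reinit = MLP()  # assumes the default architecture above
    with torch.no_grad():
        for p, q in zip(noised.parameters(), reinit.parameters()):
            p.mul_(1 - alpha).add_(alpha * q)
    return noised


def distill(student, teacher, batches, steps, lr=1e-3):
    """KL-match the student to the (already-unlearned) teacher.
    `steps` is the compute budget held constant across schedules."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for i in range(steps):
        x = batches[i % len(batches)]
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        loss = F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.softmax(t_logits, dim=-1),
            reduction="batchmean",
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student


def undo_one_shot(teacher, batches, alpha, total_steps):
    """One big noise injection, then a single distillation run."""
    return distill(noise_weights(teacher, alpha), teacher, batches, total_steps)


def undo_iterated(teacher, batches, alpha_small, rounds, total_steps):
    """`rounds` small noise injections, each followed by a short distillation,
    using the same total number of distillation steps as the one-shot version."""
    student = teacher
    for _ in range(rounds):
        student = noise_weights(student, alpha_small)
        student = distill(student, teacher, batches, total_steps // rounds)
    return student


if __name__ == "__main__":
    teacher = MLP()
    batches = [torch.randn(32, 16) for _ in range(64)]
    one_shot = undo_one_shot(teacher, batches, alpha=0.6, total_steps=512)
    iterated = undo_iterated(teacher, batches, alpha_small=0.2, rounds=4, total_steps=512)
```

The question would then be which of `one_shot` and `iterated` ends up more robust (e.g. to relearning) for a given hit to retained capabilities; I have no claim about which way it goes.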