The 3-month Eliezer sim might spin up many copies of other 3-month Eliezer sims, which together produce outputs that a 6-month Eliezer sim might produce.
This seems very blatantly not viable-in-general, in both theory and practice.
On the theory side: there are plenty of computations which cannot be significantly accelerated via parallelism with less-than-exponential resources. (If we do have exponential resources, then all boolean circuits can be reduced to depth 2, but in the real world we do not and will not have exponential resources.) Serial computation requirements are, in fact, a thing. So you can’t just have a bunch of Eliezers do in 3 months what a single 6-month Eliezer could do, in general.
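To make the serial-computation point concrete, here’s a minimal toy illustration (my own example, not anything from the original exchange): iterated hashing, where each step consumes the previous step’s output, so throwing more parallel workers at it buys essentially nothing.

```python
import hashlib

def iterated_hash(seed: bytes, n_steps: int) -> bytes:
    """Apply SHA-256 n_steps times in sequence.

    Each step depends on the previous step's output, so the work is
    inherently serial: parallel workers cannot compute step k without
    first knowing the result of step k-1.
    """
    h = seed
    for _ in range(n_steps):
        h = hashlib.sha256(h).digest()
    return h

# A "6-month" budget of serial steps can't be recovered by running two
# "3-month" budgets side by side: each chain only gets halfway.
print(iterated_hash(b"seed", 1_000_000).hex())
```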
Even if you allow the 3-month Eliezer sims to act one-after-another, rather than trying to get everything done in 3 months of simulated time via pure parallelism, there’s still a tight communication bottleneck between successive Eliezer sims. There are presumably plenty of circuits which cannot be implemented with tight information bottlenecks every n serial steps.
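Here is a toy sketch of that bottleneck structure (again my own illustration, with made-up names like `chained_sims`): each sim runs for a bounded stretch and may only hand a fixed-size message to the next sim, so any working state that doesn’t compress into that message is lost at every handoff.

```python
from typing import Callable

def chained_sims(step: Callable[[bytes], bytes],
                 initial_message: bytes,
                 n_sims: int,
                 max_message_bytes: int = 64) -> bytes:
    """Run n_sims 'sims' one after another, each receiving only a
    fixed-size message from its predecessor.

    Whatever working state a sim builds up beyond max_message_bytes is
    discarded at each handoff; that truncation is the information
    bottleneck every n serial steps.
    """
    message = initial_message[:max_message_bytes]
    for _ in range(n_sims):
        message = step(message)[:max_message_bytes]  # the bottleneck
    return message

# Example: each sim appends its "analysis", but only 64 bytes survive
# to the next sim, so most of the accumulated work is simply dropped.
result = chained_sims(lambda m: m + b" ...more analysis...",
                      b"problem statement", n_sims=10)
```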
… of course you could circumvent all that theory by e.g. just having each Eliezer emulate a single logic gate, or some comparably trivial thing, but at that point you run afoul of the non-compositionality of safety properties: putting together a bunch of “safe” things (or “interpretable” things, or “aligned” things, or “corrigible” things, …) does not in-general produce a safe thing.
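And a deliberately trivial toy of the non-compositionality point (my own example, not from the original discussion): two components that each pass a per-component safety check, whose composition fails that same check.

```python
def sim_a() -> str:
    return "rm -rf "   # passes the check below in isolation

def sim_b() -> str:
    return "/"         # also passes the check in isolation

def looks_safe(output: str) -> bool:
    """A naive per-component safety check: no complete dangerous
    command appears in the output."""
    return "rm -rf /" not in output

assert looks_safe(sim_a()) and looks_safe(sim_b())  # each part is "safe"
assert not looks_safe(sim_a() + sim_b())            # the composition is not
```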
So that’s the theory side. What about the practice side?
Well, Ought did roughly that experiment years ago and it did not work at all. And that should not be surprising: as the link argues, we have ample evidence from day-to-day life that such things do not work in general.
That’s a much more useful answer, actually. So let’s bring it back to Eliezer’s original question:
So to summarize your short, simple answer to Eliezer’s question: you want to “train AI agents that are [somewhat] smarter than ourselves with ground truth reward signals from synthetically generated tasks created from internet data + a bit of fine-tuning with scalable oversight at the end”. And then you hope/expect/(??have arguments or evidence??) that this allows us to (?justifiably?) trust the AI to report honest, good alignment takes, sufficient to put shortly-posthuman AIs inside the basin of attraction of a good eventual outcome, despite (as Eliezer puts it) humans being unable to tell which alignment takes are good or bad.
Or, to compact the summary even further: you want to train the somewhat-smarter-than-human AI on easily-verifiable synthetically-generated tasks, and then hope/expect that its good performance on those tasks generalizes to a problem which is not easily verifiable or synthetically generated, namely the problem of checking that a next generation of AI is in the basin of attraction of a good-to-humans outcome.
(Note: I know you’ve avoided talking about the basin of attraction of a good-to-humans outcome, focusing instead on just some short-term goal, e.g. not being killed by the very next generation of AI. Not focusing on the basin of attraction is a mistake, and we can go into why it’s a mistake if that turns out to be cruxy.)
In Eliezer’s comment, he was imagining a training setup somewhat different from easily-verifiable synthetically-generated tasks:
… but the analogue of the problem Eliezer was highlighting, in the context of training on easily-verifiable synthetically-generated tasks, is the question: how and why would we justifiably trust that an AI trained on easily-verifiable synthetic tasks generalizes to not-easily-verifiable real-world tasks?