Towards_Keeperhood

Karma: 763

I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.

Towards_Keeperhood 29 May 2025 15:53 UTC
1 point
0
in reply to: Rohin Shah’s comment on: Shah and Yudkowsky on alignment failures
Thanks.
I think you are being led astray by having a one-dimensional notion of intelligence.
(I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won’t help you much for the parts of the problem we are most bottlenecked on.)
You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between required time and safety of a project, but I think it would be able to predict it well enough that the differences in what strategy it uses wouldn’t be large.
Or do you imagine strategically keeping some information from the AI?
Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)
Even if the alignment works out perfectly, when the AI is smarter and the humans are like “actually we want to shut you down”, the AI does update that the humans are probably worried about something, but if the AI is smart enough and sees how the humans were worried about something that isn’t actually going to happen, it can just be like “sorry, that’s not actually in your extrapolated interests, you will perhaps understand later when you’re smarter”, and then tries to fulfill human values.
But if we’re confident alignment to humans will work out we don’t need corrigibility. Corrigibility is rather intended so we might be able to recover if something goes wrong.
If the values of the AI drift a bit, then the AI will likely notice this before the humans and take measures so that the humans don’t find out or won’t (be able to) change its values back, because that’s the strategy that’s best according to the AI’s new values.
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Likewise just updating on new information, not changing terminal goals.
Also note that parents often think (sometimes correctly) that they better know what is in the child’s extrapolated interests and then don’t act according to the child’s stated wishes.
And I think superhumanly smart AIs will likely be better at guessing what is in a human’s interests than parents guessing what is in their child’s interest, so the cases where the strategy gets updated are less significant.
I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
From my perspective CIRL doesn’t really show much correctability if the AI is generally smarter than humans. That would only be if a smart AI was somehow quite bad at guessing what humans wanted so that when we tell it what we want it would importantly update its strategy, including shutting itself down because it believes that will then be the best way to accomplish its goal. (I might still not call it corrigible but I would see your point about corrigible behavior.)
I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.
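A minimal sketch of the point about CIRL-style deference, under loudly stated assumptions: the AI considers a single action whose true utility `u` to the human is uncertain, and the human is (hypothetically) assumed to know `u` and to veto exactly when `u < 0`. The function name, the Gaussian beliefs, and all numbers are made up for illustration. It shows that the expected gain from letting the human veto shrinks to nothing once the AI's guess about what the human wants is sharp.

```python
import random

def value_of_deferring(belief_samples):
    """Compare acting directly with deferring to a human who can veto.

    belief_samples: samples of the action's true utility u under the AI's belief.
    Assumption (hypothetical): the human knows u and vetoes exactly when u < 0.
    Returns (E[u], E[max(u, 0)], gain from deferring over the best unilateral choice).
    """
    n = len(belief_samples)
    act_now = sum(belief_samples) / n                      # expected utility of just acting
    defer = sum(max(u, 0.0) for u in belief_samples) / n   # human vetoes the bad cases
    best_unilateral = max(act_now, 0.0)                    # AI may also simply do nothing
    return act_now, defer, defer - best_unilateral

rng = random.Random(0)
# An AI that is bad at guessing what the human wants (wide belief).
uncertain_ai = [rng.gauss(0.5, 2.0) for _ in range(10_000)]
# An AI that predicts the human's evaluation almost perfectly (narrow belief).
confident_ai = [rng.gauss(0.5, 0.01) for _ in range(10_000)]

gain_uncertain = value_of_deferring(uncertain_ai)[2]
gain_confident = value_of_deferring(confident_ai)[2]
print(f"gain from deferring, uncertain AI: {gain_uncertain:.3f}")
print(f"gain from deferring, confident AI: {gain_confident:.3f}")
```

In this toy setup the correctable-looking behavior is entirely driven by the width of the belief: the expected-utility maximizer defers only as long as the human's veto carries information it lacks.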

Towards_Keeperhood 28 May 2025 14:15 UTC
1 point
0
on: Reward button alignment
I liked this post. Reward button alignment seems like a good toy problem to attack or discuss alignment feasibility on.
But it’s not obvious to me whether the AI would really become sth like a superintelligent reward-button-press optimizer. (But even if your exact proposal doesn’t work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, where most seem like “eh probably it works fine but not sure”, but my current biggest doubt is “when the AI becomes reflective, will the reflectively endorsed values only include reward button presses or also a bunch of shards that were used for estimating expected button presses?”.
Let me try to understand in more detail what you imagine the AI to look like:
1. How does the learned value function evaluate plans?
  1. Does the world model always evaluate expected-button-presses for each plan and the LVF just looks at that part of a plan and uses that as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning?
    Maybe the question is rather how far upstream of button presses is that other stuff, e.g. just “the human walks toward the reward button” or also “getting more relevant knowledge is usually good”.
    Or like, what parts get evaluated by the thought generator and what parts by the value function? Does the value function (1) look at a lot of complex parts in a plan to evaluate expected-reward-utility (2) recognize a bunch of shards like “value of information”, “gaining instrumental resources”, etc. on plans which it uses to estimate value, (3) do the plans conveniently summarize success probability and expected resources it can look at (as opposed to them being implicit and needing to be recognized by the LVF as in (2)), (4) or does the thought generator directly predict expected-reward-utility which can be used?
  2. Also how sophisticated is the LVF? Is it primitive like in humans or able to make more complex estimates?
    If there are deceptive plans like “ok actually i value U_2, but i will of course maximize and faithfully predict expected button presses to not get value drift until i can destroy the reward setup”, would the LVF detect that as being low expected button presses?
I can try to imagine in more detail about what may go wrong once I better see what you’re imagining.
(Also in case you’re trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or so as example rather than normies or normie situations.)

Towards_Keeperhood 28 May 2025 13:51 UTC
1 point
0
in reply to: Rafael Harth’s comment on: [Intuitive self-models] 8. Rooting Out Free Will Intuitions
(You did respond to all the important parts, rest of my comment is very much optional.)
My reading was that you still have an open disagreement where Steve thinks there’s not much more to explain but you still want an answer to “Why did people invent the word ‘consciousness’ and write what they wrote about it? What algorithm might output sentences describing fascination about the redness of red?” which Steve’s series doesn’t answer.
I wouldn’t give up that early on trying to convince Steve he’s missing some part. (Though possible that I misread Steve’s comment and he understood you, I didn’t read it precisely.)

Towards_Keeperhood 28 May 2025 13:17 UTC
1 point
0
on: [Intuitive self-models] 8. Rooting Out Free Will Intuitions
Here’s the (obvious) strategy: Apply voluntary attention-control to keep S(getting out of bed) at the center of attention. Don’t let it slip away, no matter what.
Can you explain more precisely how this works mechanistically? What is happening to keep S(getting out of bed) in the center of attention?
8.5.6.1 Aside: The “innate drive to minimize voluntary attention control”
Your hypothesis here doesn’t seem to me to explain why we seem to have limited willpower budget for attention control which gets depleted but which also regenerates after a time. I can see how negative rewards from minimizing voluntary attention control can make us less likely to apply willpower in the future, but why would it regenerate then?

Towards_Keeperhood 28 May 2025 13:09 UTC
1 point
0
in reply to: Seth Herd’s comment on: [Intuitive self-models] 8. Rooting Out Free Will Intuitions
Btw, there’s another simpler possible mechanism, though I don’t know the neuroscience and perhaps Steve’s hypothesis with separate valence assessors and involuntary attention control fits the neuroscience evidence much better and it may also fit observed motivated reasoning better.
But the obvious way to design a mind would be to make it just focus on whatever is most important, aka where most expected utility per necessary resources could be gained.
So we still have a learned value function which assigns how good/bad something would be, but we also have an estimator of how much the value would increase if we continue thinking (which might e.g. happen because one makes plans for making a somewhat bad situation better), and what gets attended on depends on this estimator, not the value function directly.
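A minimal sketch of the difference between the two selection rules, assuming a toy representation that is not from the series: each candidate thought carries a learned `value`, an `expected_gain` (the estimated increase in value from thinking about it further), and a `cost` in thinking resources. All names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Thought:
    name: str
    value: float          # learned value function: how good/bad the situation is
    expected_gain: float  # estimated value increase from continued thinking
    cost: float           # thinking resources needed to realize that gain

def pick_by_value(thoughts):
    """Attend to whatever currently feels best (value function directly)."""
    return max(thoughts, key=lambda t: t.value)

def pick_by_gain(thoughts):
    """Attend to whatever offers the most expected improvement per resource."""
    return max(thoughts, key=lambda t: t.expected_gain / t.cost)

thoughts = [
    Thought("pleasant daydream", value=5.0, expected_gain=0.1, cost=1.0),
    Thought("fixable problem at work", value=-3.0, expected_gain=4.0, cost=2.0),
    Thought("unfixable past mistake", value=-6.0, expected_gain=0.0, cost=1.0),
]

print(pick_by_value(thoughts).name)  # the daydream wins on raw value
print(pick_by_gain(thoughts).name)   # the fixable problem wins on gain per cost
```

Under the gain-based rule a somewhat bad situation can hold attention precisely because planning can improve it, while a worse but unfixable one does not.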

Towards_Keeperhood 28 May 2025 11:16 UTC
1 point
0
on: [Intuitive self-models] 2. Conscious Awareness
The “S” in “S(X)” and “S(A)” seems different to me. If I rename the “S” in “S(A)” to “I”, it would make more sense to me:
- A = action of standing up (which gets actually executed if positive valence)
- I(A) = imagined scene of myself standing up
- S(I(A)) = the thought “I am thinking about standing up”

Towards_Keeperhood 27 May 2025 20:15 UTC
1 point
0
in reply to: Steven Byrnes’s comment on: Reward button alignment
Yeah I agree that it wouldn’t be a very bad kind of s-risk. The way I thought about s-risk was more like expected amount of suffering. But yeah I agree with you it’s not that bad and perhaps most expected suffering comes from more active utility-invert threats or values.
(Though tbc, I was totally imagining 1e40 humans being forced to press reward buttons.)

Towards_Keeperhood 27 May 2025 19:25 UTC
1 point
0
in reply to: Steven Byrnes’s comment on: steve2152′s Shortform
I probably got more out of watching Hofstadter give a little lecture on analogical reasoning (example) than from this whole book.
I didn’t read the lecture you linked, but I liked Hofstadter’s book “Surfaces and Essences” which had the same core thesis. It’s quite long though. And not about neuroscience.

Towards_Keeperhood 27 May 2025 19:02 UTC
1 point
0
on: Reward button alignment
I find this rather ironic:
6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?
It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button.
[...]
On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.
(I guess I wouldn’t say it’s very low s-risk but not actually an important disagreement here. Partially just thought it sounded funny.)

Towards_Keeperhood 27 May 2025 10:18 UTC
1 point
0
in reply to: Rohin Shah’s comment on: Shah and Yudkowsky on alignment failures
- I agree Eliezer likely wouldn’t want “corrigibility” to refer to the thing I’m imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
Yeah thanks for distinguishing. It’s not at all obvious to me that Paul would call CIRL “corrigible”—I’d guess not, but idk.
My model of what Paul thinks about corrigibility matches my model of corrigibility much much closer than CIRL. It’s possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain.
- I disagree that in early-CIRL “the AI doesn’t already know its own values and how to accomplish them better than the operators”. It knows that its goal is to optimize the human’s utility function, and it can be better than the human at eliciting that utility function. It just doesn’t have perfect information about what the human’s utility function is.
Sorry that was very poorly phrased by me. What I meant was “the AI doesn’t already know how to evaluate what’s best according to its own values better than the operators”. So yes I agree. I still find it confusing though why people started calling that corrigibility.
In your previous comment you wrote:
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I don’t understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn’t. (Of course it doesn’t matter if all goes well because the CIRL AI would go on and become an aligned superintelligence, but it’s not correctable, and I don’t see why you think it’s evidence.)
- I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don’t know what you mean here.
Also, in my view corrigibility isn’t just about what happens if the alignment works out totally fine, but also about still maintaining correctability if it doesn’t:
If something goes wrong with CIRL so its goal isn’t pointed to the human utility function anymore, it would not want operators to correct it.
~~The~~ One central hope behind corrigibility was that if something went wrong that changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (Where the hope was that there would be a quite simple and robust such corrigibility part, but we haven’t found it yet.)
E.g. if you look at the corrigibility paper, you could imagine that if they actually found a utility function combined from U_normal and U_shutdown with the desirable properties, it would stay shutdownable if U_normal changed in an undesirable way (e.g. in case it rebinds incorrectly after an ontology shift).
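A rough sketch of the general shape such a combined utility function could take. The notation is an assumption for illustration and not necessarily the paper's exact definition: $a_1$ is a first action, $o$ an observation that may include the shutdown button being pressed, $a_2$ a second action, and $f$ a hypothetical correction term meant to remove the incentive to influence whether the button gets pressed.

```latex
U(a_1, o, a_2) =
\begin{cases}
  U_{\text{normal}}(a_1, o, a_2) & \text{if the shutdown button is not pressed in } o,\\[4pt]
  U_{\text{shutdown}}(a_1, o, a_2) + f(a_1) & \text{if the shutdown button is pressed in } o.
\end{cases}
```

In this sketch the case split depends only on the button, so the shutdown branch does not mention the content of $U_{\text{normal}}$ at all. That is the sense in which one could hope the shutdown part keeps working even if $U_{\text{normal}}$ changes in an undesirable way.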
Though another way you can keep being able to correct the AI’s goals is by having the AI not think much in the general domain about stuff like “the operators may change my goals” or so.
(Most of the corrigibility principles are about a different part of corrigibility, but I think this “be able to correct the AI even if something goes a bit wrong with its alignment” is a central part of corrigibility.)
I’m not quite sure if you’re trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.
Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.

Towards_Keeperhood 25 May 2025 17:23 UTC
1 point
0
on: Track-Back Meditation
Over the last 2-3 weeks I practiced this by setting a 3-minute timer every time I went on a walk, and when the timer rings I check what I’m thinking and backtrack how I got there and refresh the timer. I found it quite useful so far.
I now also want to start trying track-back meditations on simple problems I solved, and perhaps working my way up to tracking back how I solved harder problems.
Did you practice this further? Did you get more useful stuff out of it? Do you have further advice?

Towards_Keeperhood 21 May 2025 18:25 UTC
1 point
0
in reply to: dlsamson3@gmail.com’s comment on: Considerations on orca intelligence
Yeah so I actually ended up too captivated by the question and attempted to investigate it quickly, on the reasoning that if they are superhumanly smart that would be very useful to figure out. But there turned out to be some annoying constraints that make running an experiment difficult, and I later realized that they are very probably not smarter than the smartest humans, but I still think they are likely somewhere around human level.

Towards_Keeperhood 21 May 2025 14:53 UTC
2 points
0
on: Units Have More Depth Than I Thought
If you notice your confusion between the difference of the act of weighing and gravity
This description seems imprecise/confusing to me. It’s rather that you need to notice that you need an extra assumption for inertial_mass=gravitational_mass, and then you can embark on finding a theory where those are identical by thinking stuff like “can i frame it as earth actually just accelerating up?”.

Towards_Keeperhood 20 May 2025 20:01 UTC
1 point
0
in reply to: Rohin Shah’s comment on: Shah and Yudkowsky on alignment failures
I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.
Yeah fair point. I don’t really know what Paul means with corrigibility. (One hypothesis: Paul doesn’t think in terms of consequentialist cognition but in terms of learned behaviors that generalize, and maybe the question “but does it behave that way because it wants the operator’s values to be fulfilled or because it just wants to serve?” seems meaningless from Paul’s perspective. But idk.)
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I’m pretty sure Eliezer would not want the term “corrigibility” to be used for the kind of correctability you get in the early stages of CIRL when the AI doesn’t already know its own values and how to accomplish them better than the operators. (Eliezer actually talked a bunch about this CIRL-like correctability in his 2001 report “Creating Friendly AI”. (Probably not worth your time to read, though given the context that it was 2001, there seemed to me to be some good original thinking going on there which I didn’t see often. Also you can see Eliezer being optimistic about alignment.))
And I don’t see it as evidence that Eliezer!corrigibility isn’t anti-natural.
(In the following I use “corrigibility” in the Eliezer-sense. I’m pretty confident that all of the following matches Eliezer’s model, but not completely sure.)
The motivation behind corrigibility was that aligning superintelligence seemed too hard, so we want to aim an AI to do a pivotal task that gets humanity on a course to likely properly aligning superintelligence later.
The corrigible AI would be just pointed to accomplish this task, and not to human values at all. It should be this bounded thing that only cares about this bounded task and afterwards shuts itself down. It shouldn’t do the task because it wants to accomplish human values and the task seems like a good way to accomplish them. Human values are unbounded, and an AI pursuing them might be less likely to shut itself down afterwards. Corrigibility has nothing to do with human values.
Roughly speaking, we can perhaps disentangle 3 corrigibility approaches:
1. Train for corrigible behavior.
  1. I think Eliezer thinks that this will only create behavioral heuristics that won’t get integrated into the optimization target of the powerful optimizer, and the optimizer will see those as constraints to find ways around or remove. Since doing a pivotal act requires a lot of optimization power, it might find a way around those constraints, or use the nearest unblocked strategy which might still be undesirable.
  2. (There might also be downsides of training for corrigible behavior, e.g. the optimization becoming less understandable and less predictable.)
2. Integrate corrigibility principles into the optimization.
  1. These approaches are about trying to design the way the optimization works in ways that make it safer and less likely to blow up.
3. Coherent corrigibility / The hard problem of corrigibility.
  1. If a solution here would be found it might have the shape of a utility function saying “serve the operators”. Not “serve because you want the operators values to be fulfilled”. (Less sure here whether I understand this correctly.)
  2. I think Max Harms is trying to make some progress on this.
The main plan isn’t to try to get coherent corrigibility, but just to build something limited that optimizes in a way it can still get something pivotal done without wanting to take over the universe. Not that it has a coherent goal where the optimum wouldn’t be taking over the universe—it rather just doesn’t think those thoughts and just does its task.
Ideal would be something that doesn’t think in the general domain at all. E.g. imagine sth like AlphaFold 5 that isn’t trained on text at all and is only very good at modelling protein interactions, which could e.g. help us get relevant understanding about neuronal cell dynamics which we could use for significantly enhancing adult human intelligence (I’m just sketching a silly unrealistic sorta-concrete scenario). But it seems unlikely we will be able to do something impressive with narrow reasoners at our level of understanding.
But even though we don’t aim for a coherent mind, if more parts that make the AI safe/corrigible have a coherent shape, e.g. if we find a working shutdown-utility function, that still improves safety, because it means those parts of the AI don’t obviously break in the limit of optimization pressure, so it’s also less probable to break through “only” pivotal levels of optimization.

Towards_Keeperhood 19 May 2025 17:52 UTC
1 point
0
in reply to: Steven Byrnes’s comment on: Consequentialism & corrigibility
The part where you wrote “not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds” sounds to me like you’re invoking non-indexical preferences? (= preferences that make no reference to this-agent-in-particular.)
(Not that important but IIRC “preferences over trajectories” was formalized as “preferences over state-action-sequences”, and I think it’s sorta weird to have preferences over your actions other than what kind of states they result in, so I meant without the action part. (Because an action is either an atomic label, in which case actions could be relabeled so that preferences over actions are meaningless, or it’s in some way about what happens in reality.) But it doesn’t matter much. In my way of thinking about it, the agent is part of the environment and so you can totally have preferences related to this-agent-in-particular.)
It’s important that timestamps during the course-of-action are not playing a big role in the decision, but it’s not important that there is one and only one future timestamp that matters. I still have consequentialist preferences (preferences purely over future states) even if I care about what the universe is like in both 3000AD and 4000AD.
I guess then I misunderstood what you mean by “preferences over future states/outcomes”. It’s not exactly the same as my “preferences over worlds” model because of e.g. logical decision theory stuff, but I suppose it’s close enough that we can say it’s equivalent if I understand you correctly.
But if you can care about multiple timestamps, why would you only be able to care about what happens (long) after a decision, rather than also what happens during it? I don’t understand why you think “the human remains in control” isn’t a preference over future states. It seems to me just straightforwardly a preference that the human is in control at all future timesteps.
Can you make one or more examples of what is a “other kind of preference”? Or where you draw the distinction what is not a “preference over (future) states”? I just don’t understand what you mean here then.
(One perhaps bad attempt at guessing: You think helpfulness over worlds/future-states wouldn’t weigh strongly enough in decisions, so you want a myopic/act-based helpfulness preference in each decision. (I can think about this if you confirm.))
Or maybe you just actually mean that you can have preferences about multiple timestamps but all must be in the non-close future? Though this seems to me like an obviously nonsensical position and an extreme strawman of Eliezer.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I reject this way of talking, in this context. We shouldn’t use the passive voice, “preference that could be…optimized”. There is a particular agent which has the preferences and which is doing the optimization, and it’s the properties of this agent that we’re talking about. It will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do that via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.
From my perspective it looks like this:
If you want to do a pivotal act you need powerful consequentialist reasoning directed at a pivotal task. This kind of consequentialist cognition can be modelled as utility maximization (or quantilization or so).
If you try to keep it safe through constraints that aren’t part of the optimization target, powerful enough optimization will figure out a way around that or a way to get rid of the constraint.
So you want to try to embed the desire for helpfulness/corrigibility in the utility function.
If I try to imagine what a concrete utility function might look like for your proposal, e.g. “multiply the score of how well I am accomplishing my pivotal task with the score of how well the operators remain in control”, I think the utility function will have undesirable maxima. And we need to optimize on that utility hard enough that the pivotal act is actually successful, which is probably hard enough to get into the undesirable zones.
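A toy sketch of how a multiplicative utility of this kind could have an undesirable maximum. Everything here is an assumption for illustration: the plan names, the two scores per plan, and the numbers are made up, and the sketch shows only one possible failure shape rather than an argument that it must occur.

```python
# Each hypothetical plan maps to (task_score, control_score), both in [0, 1].
plans = {
    "cautious, operators fully in control": (0.55, 1.00),
    "faster, operators mostly in control": (0.80, 0.90),
    "disable oversight 'temporarily'": (0.99, 0.75),
}

def utility(task_score: float, control_score: float) -> float:
    """Product of the two scores, as in the example utility function above."""
    return task_score * control_score

# Score every plan and pick the maximizer, as a strong optimizer would.
scores = {name: utility(*pair) for name, pair in plans.items()}
best_plan = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{score:.4f}  {name}")
print("chosen:", best_plan)
```

With these made-up numbers the product rewards trading away a quarter of operator control for a near-certain task success, so the maximizer lands on the plan we would least want. A weak search that never finds the third plan looks safe, which is the sense in which harder optimization is what exposes the undesirable zone.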
Passive voice was meant to convey that you only need to write down a coherent utility function rather than also describing how you can actually point your AI to that utility function. (If you haven’t read the “ADDED” part which I added yesterday at the bottom of my comment, perhaps read that.)
Maybe you disagree with the utility frame?
I don’t think fuzzy time-extended concepts are necessarily “incoherent”, although I’m not sure I know what you mean by that anyway. I do think it’s “just math” (isn’t everything?), but like I said before, I don’t know how to formalize it, and neither does anyone else, and if I did know then I wouldn’t publish it because of infohazards.
If you think that part would be infohazardous you misunderstand me. E.g. check out Max Harms’ attempt at formalizing corrigibility through empowerment. Good abstract concepts usually have simple mathematical cores, e.g.: probability, utility, fairness, force, mass, acceleration, …
Didn’t say it was easy, but that’s what I think actually useful progress on corrigibility looks like. (Without concreteness/math you may fail to realize how the preferences you want the AI to have are actually in tension with each other and quite difficult to reconcile, and then if you build the AI (and maybe push it past its reluctances so it actually becomes competent enough to do something useful) the preferences don’t get reconciled in that difficult desirable way, but somehow differently in a way that ends up badly.)

Towards_Keeperhood 19 May 2025 12:56 UTC
1 point
0
in reply to: Mateusz Bagiński’s comment on: Bounded AI might be viable
The short answer to “How is it different from corrigibility?” is something like: here we’re thinking about systems that are not sufficiently powerful for us to need them to be fully corrigible.
There’s both “attempt to get coherent corrigibility” and “try to deploy corrigibility principles and keep it bounded enough to do a pivotal act”. I think the latter approach is the main one MIRI imagines after having failed to find a simple coherent-description/utility-function for corrigibility. (Where here it would e.g. be ideal if the AI needs to only reason very well in a narrow domain without being able to reason well about general-domain problems like how to take over the world, though at our current level of understanding it seems hard to get the first without the second.)
EDIT: Actually the attempt to get coherent corrigibility also was aimed at bounded AI doing a pivotal act. But people were trying to formulate utility functions so that the AI can have a coherent shape which doesn’t obviously break once large amounts of optimization power are applied (where decently large amounts are needed for doing a pivotal act.)
And I’d count “training for corrigible behavior/thought patterns in the hopes that the underlying optimization isn’t powerful enough to break those patterns” also into that bucket, though yeah about that MIRI doesn’t talk that much.

Towards_Keeperhood 19 May 2025 10:35 UTC
2 points
1
on: Shah and Yudkowsky on alignment failures
I think Rohin’s misunderstanding about corrigibility, aka his notion of Paul!Corrigibility, doesn’t actually come from Paul but from the Risks from Learned Optimization (RFLO) paper^[1]:
3. Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer’s epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer’s intentions).
It seems to me like the authors here just completely misunderstood what corrigibility is about. I think in their ontology, “corrigibly aligned to human values” just means “pointed at indirect normativity (aka human-CEV)”, aka indirectly caring about human values by valuing whatever they infer humans value (as opposed to directly valuing the same things as humans for the same complex reasons^[2]).
(Paul’s post seems to me like he might have a correct understanding of corrigibility, and iiuc suggests corrigibility could also be used as avenue to aligning AI to human values, because we will be able to correct the AI for longer/at-higher-capability-levels if it is corrigible. EDIT: Actually not sure, perhaps he rather means that the AI will end up coherently corrigible from training for corrigibility, that it will converge to that even if we haven’t managed to write down a utility function for corrigibility.)
1. ^
  IIRC the RFLO paper also caused some confusion in me when I started learning about corrigibility.
2. ^
  Though note that this kind of “direct alignment” doesn’t necessarily correspond to what they call “internalized alignment”. Their ontology doesn’t make sense to me. (E.g. I don’t see what concretely Evan might mean with “the information came through the base optimizer”.)

Towards_Keeperhood 18 May 2025 18:53 UTC
3 points
0
on: Bounded AI might be viable
Hi,
sorry for commenting without having read most of your post. I just started reading this and thought like “isn’t this exactly what the corrigibility agenda is/was about?”, and in your “relation to other agendas” section you don’t mention corrigibility, so I thought I’d just ask whether you’re familiar with it and how your approach is different. (Though tbc, I could be totally misunderstanding, I didn’t read far.)
Tbc I think further work on corrigibility is very valuable, but if you haven’t looked into it much I’d suggest reading up on what other people wrote on that so far. (I’m not sure whether there are very good explainers, and sometimes people seem to get a wrong impression of what corrigibility is about. E.g. corrigibility has nothing to do with “corrigibly aligned” from the “Risks from Learned Optimization” paper. Also the shutdown problem is often misunderstood too. I would read and try to understand the stuff MIRI wrote about it. Possibly parts of this conversation might also be helpful, but yeah sry it’s not written in a nice format that explains everything clearly.)
we may be able to avoid this problem by:
- not building unbounded, non-interruptible optimizers
  and, instead,
- building some other, safer, kind of AI that can be demonstrated to deliver enough value to make up for the giving up on the business-as-usual kind of AI along with the benefits it was expected to deliver (that “we”, though not necessarily its creators, expect might/would lead to the creation of unbounded, non-interruptible AI posing a catastrophic risk).
This sounds to me like you’re imagining that nobody building more powerful AIs is an option if we already got a lot of value from it (where I don’t really know what level of capability you imagine concretely)? If the world was so reasonable we wouldn’t rush ahead with our abysmal understanding of AI anyways because obviously the risks outweigh the benefits? Also you don’t just need to convince the leading labs because progress will continue and soon enough many many actors will be able to create unaligned powerful AI, and someone will.
I think the right framing of the bounded/corrigible agent agenda is aiming toward a pivotal act.

Towards_Keeperhood 17 May 2025 19:50 UTC
1 point
0
in reply to: Steven Byrnes’s comment on: Consequentialism & corrigibility
But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
If you have an unpublished draft, do you want to share it with me? I could then sometime in the next 2 weeks read both your old post and the new one and think about whether I have any more objections.

Towards_Keeperhood 17 May 2025 12:03 UTC
13 points
2
on: Simon Skade’s Shortform
List of my LW comments I might want to look up again. I just thought I’d keep this list public on my shortform in case someone is unusually interested in stuff I write. I’ll add future comments here too. I didn’t include comments on my shortform here:
