I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
Asmodia figuring out Keltham’s probability riddles may also be interesting, though perhaps less so than the lectures. It starts in episode 90. The starting quote is “no dath ilani out of living memory would’ve seen the phenomenon”. The story unfortunately switches between Asmodia+Ione, Carissa (+Peranza I think), and Keltham+Meritxell. You can skip the other stuff that’s going on there (though the brief “dath ilan” reply about stocks might be interesting too).
Thanks!
If the value function is simple, I think it may be a lot worse than the world-model/thought-generator at evaluating which abstract plans are actually likely to work (since the agent hasn’t yet tried a lot of similar abstract plans from which it could’ve observed results, and the world model’s prediction-making capabilities generalize further). The world model may also form some beliefs about what the goals/values in a given current situation are. So let’s say the thought generator outputs plans along with predictions about those plans, and some of those predictions predict how well a plan is going to fulfill what it believes the goals are (like approximate expected utility). Then the value function might learn to just look at the part of a thought that predicts the expected utility, and take that as its value estimate.
Or perhaps a slightly more concrete version of how that may happen (I’m thinking about model-based actor-critic RL agents which start out relatively unreflective, rather than just humans):
Sometimes the thought generator generates self-reflective thoughts like “what are my goals here”, whereupon it produces an answer “X”, and then when thinking about how to accomplish X it often comes up with a better (according to the value function) plan than if it had tried to directly generate a plan without clarifying X. Thus the value function learns to assign positive valence to thinking “what are my goals here”.
The same can happen with “what are my long-term goals”, where the thought generator might guess something that would cause high reward.
For humans, X is likely more socially nice than would be expected from the value function, since “X are my goals here” is a self-reflective thought where the social dimensions are more important for the overall valence guess.[1]
Later the thought generator may generate the thought “make careful predictions whether the plan will actually accomplish the stated goals well”, whereupon it often finds some incoherencies that the value function didn’t notice, and produces a better plan. Then the value function learns to assign high valence to thoughts like “make careful predictions whether the plan will actually accomplish the stated goals well”.
Later the predictions of the thought generator may not always match well with the valence the value function assigns, and it turns out that the thought generator’s predictions often were better. So over time the value function gets updated more and more toward “take the predictions of the thought generator as our valence guess”, since that strategy better predicts later valence guesses.
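(To gesture at the kind of dynamic I mean, here’s a toy sketch. It’s entirely my own illustration rather than anything from your posts, and the features and numbers are made up: a linear critic that gets to see, as one feature of a thought, the world model’s own expected-utility prediction will learn to put most of its weight on that feature, i.e. it learns to defer to the world model.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic over two features of a "thought":
#   f_heuristic     = a crude score the critic computes itself (very noisy)
#   f_wm_prediction = the world model's own expected-utility prediction (less noisy)
w = np.array([0.5, 0.0])  # the critic starts out ignoring the world model's prediction

for _ in range(5000):
    true_utility = rng.normal()                            # how good the plan actually turns out
    f_heuristic = true_utility + rng.normal(0.0, 2.0)      # crude heuristic, high noise
    f_wm_prediction = true_utility + rng.normal(0.0, 0.3)  # the world model predicts much better
    features = np.array([f_heuristic, f_wm_prediction])

    value_guess = w @ features
    prediction_error = true_utility - value_guess          # later "ground truth" vs. the guess
    w += 0.01 * prediction_error * features                # simple TD-style update

print(w)  # the weight on f_wm_prediction ends up dominating: the critic learned to defer to the world model
```

(Obviously this hides all the interesting parts, like where the world model’s prediction comes from, but it’s the basic update dynamic I have in mind.)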
Now, some goals are mainly optimized via the thought generator predicting how they could be accomplished well, and there might be beliefs in the thought generator like “studying rationality may make me better at accomplishing my goals”, causing the agent to study rationality.
And also thoughts like “making sure the currently optimized goal keeps being optimized increases the expected utility according to the goal”.
And maybe later more advanced bootstrapping through thoughts like “understanding how my mind works and exploiting insights to shape it to optimize more effectively would probably help me accomplish my goals”. Though of course, for this to be a viable strategy, it would need to be at least as smart as the smartest current humans (which we can assume, because otherwise it’s too useless IMO).
So now the value function is often just relaying world-model judgements and all the actually powerful optimization happens in the thought generator. So I would not classify that as the following:
In my view, the big problem with model-based actor-critic RL AGI, the one that I spend all my time working on, is that it tries to kill us via using its model-based RL capabilities in the way we normally expect—where the planner plans, and the actor acts, and the critic criticizes, and the world-model models the world …and the end-result is that the system makes and executes a plan to kill us.
So in my story, the thought generator learns to model the self-agent and has some beliefs about what goals it may have, and some coherent extrapolation of (some of) those goals is what gets optimized in the end. I guess it’s probably not that likely that those goals are strongly misaligned to the value function on the distribution where the value function can evaluate plans, but there are many possible ways to generalize the values of the value function.
For humans, I think that the way this generalization happens is value-laden (aka what human values are depends on this generalization). The values might generalize a bit differently for different humans of course, but it’s plausible that humans share a lot of their prior-that-determines-generalization, so AIs with a different brain architecture might generalize very differently.

Basically, whenever someone thinks “what’s actually my goal here”, I would say that’s already a slight departure from “using one’s model-based RL capabilities in the way we normally expect”. Though I would agree that for most humans such departures are rare and small, I think they get a lot larger for smart reflective people, and I wouldn’t describe my own brain as “using one’s model-based RL capabilities in the way we normally expect”. I’m not at all sure about this, but I would expect that “using its model-based RL capabilities in the way we normally expect” won’t get us to pivotal level of capability if the value function is primitive.
- ^
If I just trust my model of your model here. (Though I might misrepresent your model. I would need to reread your posts.)
Note that the “Probability 2” lecture continues after the lunch break (which is ~30min skippable audio).
Thanks!
Sorry, I think I intended to write what I think you think, then just clarified my own thoughts and forgot to edit the beginning. I ought to have properly recalled your model.
Yes, I think I understand your translations and your framing of the value function.
Here are the key differences between a (more concrete version of) my previous model and what I think your model is. Please lmk if I’m still wrongly describing your model:
plans vs thoughts
My previous model: The main work for devising plans/thoughts happens in the world-model/thought-generator, and the value function evaluates plans.
Your model: The value function selects which of some proposed thoughts to think next. Planning happens through the value function steering the thoughts, not the world model doing so.
detailedness of evaluation of value function
My previous model: The learned value function is a relatively primitive map from the predicted effects of plans to a value which describes whether the plan is likely better than the expected counterfactual plan. E.g. maybe roughly: we model how something like units of exchange (including dimensions like “how much does Alice admire me”) change depending on a plan, and then there is a relatively simple function from the vector of units to values (see the toy sketch after this comparison). When having abstract thoughts, the value function doesn’t understand much of the content there, and only uses some simple heuristics for deciding how to change its value estimate. E.g. a heuristic might be: when there’s a thought that the world model thinks is valid and that is associated with the (self-model-invoking) thought “this is bad for accomplishing my goals”, then lower the value estimate. In humans slightly smarter than the current smartest humans, it might eventually learn the heuristic “do an explicit expected utility estimate and just take what the result says as the value estimate”, and then that is what happens: the value function itself doesn’t understand much about what’s going on in the expected utility estimate, it just lets happen whatever the abstract reasoning engine predicts. So it essentially optimizes goals that are stored as beliefs in the world model.
So technically you could still say “but what gets done still depends on the value function, so when the value function just trusts some optimization procedure which optimizes a stored goal, and that goal isn’t what we intended, then the value function is misaligned”. But it seems sorta odd because the value function isn’t really the main relevant thing doing the optimization.
The value function is essentially too dumb to do the main optimization itself for accomplishing extremely hard tasks. Even if you set incentives so that you get ground-truth reward for moving closer to the goal, it would be too slow at learning what strategies work well.
Your model: The value function has quite a good model of what thoughts are useful to think. It is just computing value estimates, but it can make quite coherent estimates to accomplish powerful goals.
If there are abstract thoughts about actually optimizing a different goal than is in the interest of the value function, the value function shuts them down by assigning low value.
(My thoughts: One intuition is that to get to pivotal intelligence level, the value function might need some model of its own goals in order to efficiently recognize when some values it is assigning aren’t that coherent, but I’m pretty unsure of that. Do you think the value function can learn a model of its own values?)
There’s a spectrum between my model and yours. I don’t know which model is better; at some point I’ll think about what may be a good model here. (Feel free to lmk your thoughts on why your model may be better, though maybe I’ll just see it when I think about it more carefully in the future, reread some of your posts, and model your model in more detail. I’m currently not modelling either model in that much detail.)
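(Here’s the toy sketch referenced above of what I mean by a “relatively simple function from the vector of units to values”. Purely my own illustration; the feature names are made up.)

```python
# Purely illustrative "relatively simple function from the vector of units to values".
weights = {"money": 1.0, "alice_admiration": 3.0, "physical_pain": -10.0}

def value_estimate(predicted_effects: dict) -> float:
    # predicted_effects: how a plan is predicted to change each "unit of exchange"
    return sum(weights[unit] * delta for unit, delta in predicted_effects.items())

print(value_estimate({"money": 2.0, "alice_admiration": -0.5, "physical_pain": 0.0}))  # 0.5
```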
What alignment-relevant abilities might Terence Tao lack?
Why two?
Mathematics/logical-truths are true in all possible worlds, so they never tell you in what world you are.
If you want to say something that is true in your particular world (but not necessarily in all worlds), you need some observations to narrow down what world you are in.
I don’t know how closely this matches the use in the sequence, but I think a sensible distinction between logical and causal pinpointing is: All the math parts of a statement are “logically pinpointed” and all the observation parts are “causally pinpointed”.
So basically, I think in theory you can reason about everything purely logically by using statements like “In subspace_of_worlds W: X”[1], and then you only need causal pinpointing before making decisions for evaluating what world you’re actually likely in.
You could imagine programming a world model where there’s the default assumption that non-tautological statements are about the world we’re in, and then a sentence like “Peter’s laptop is silver” would get translated into sth like “In subspace_of_worlds W_main: color(<x s.t. laptop(x) AND own(Peter, x)>, silver)”.
Most of the statements you reason with are of course about the world you’re in or close cousin worlds with only few modifications, though sometimes we also think about further away fiction worlds (e.g. HPMoR).
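(A minimal sketch of how such world-tagged statements could be stored, just to illustrate the shape of the idea. I’m flattening the “<x s.t. laptop(x) AND own(Peter, x)>” part into a plain constant, and all names are made up.)

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Statement:
    world: str       # which subspace of worlds the statement is about, e.g. "W_main"
    predicate: str   # e.g. "color"
    args: tuple      # e.g. ("Peters_laptop", "silver")

# "Peter's laptop is silver", pinned to the world we think we're in:
belief = Statement(world="W_main", predicate="color", args=("Peters_laptop", "silver"))

# The same machinery handles statements about fiction worlds, just with another world tag:
fiction = Statement(world="W_HPMoR", predicate="studies_at", args=("Harry", "Hogwarts"))
```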
(Thanks to Kaarel Hänni for a useful conversation that led up to this.)
- ^
It’s just a sketch, not a proper formalization. Maybe we rather want sth like statements of the form “if {...lots of context facts that are true in our world...} then X”.
Thanks for clarifying.
I mean I do think it can happen in my system that you allocate an object for something that’s actually 0 or >1 objects, and I don’t have a procedure for resolving such map-territory mismatches yet, though I think it’s imaginable to have a procedure that defines new objects and tries to edit all the beliefs associated with the old object.
I definitely haven’t described how we determine when to create a new object to add to our world model, but one could imagine an algorithm checking when there’s some useful latent for explaining some observations, then constructing a model for that object, and then creating a new object in the abstract reasoning engine. Yeah, there’s still open work to do on how a correspondence between the constant symbol for our object and our (e.g. visual) model of the object can be formalized and used, but I don’t see why it wouldn’t be feasible.
I agree that we end up with a map that doesn’t actually fit the territory, but I think it’s fine if there’s an unresolvable mismatch somewhere. There’s still a useful correspondence in most places. (Sure, logic would collapse from a contradiction, but actually it’s all probabilistic somehow anyway.) Although of course we don’t have anything to describe that the territory is different from the map in our system yet. This is related to embedded agency, and further work on how to model your map as possibly not fitting the territory, and how that can be used, is still necessary.
Thx.
Yep there are many trade-offs between criteria.
Btw, totally unrelatedly:
I think in your past work on abstraction you probably lost a decent amount of time from not properly tracking the distinction between (what I call) objects and concepts. I think you likely at least mostly recovered from this, but in case you’re not completely sure you’ve fully done so, you might want to check out the linked section. (I think it makes sense to start by understanding how we (learn to) model objects and only look at concepts later, since minds first learn to model objects and later carve up concepts as generalizations over similarity clusters of objects.)
Tbc, there’s other important stuff than objects and concepts, like relations and attributes. I currently find my ontology here useful for separating subproblems, so if you’re interested you might read more of the linked post (if you haven’t done so yet), even though you’re surely already familiar with knowledge representation. But maybe you already track all that.
Thoughts on Creating a Good Language
Thanks.
I’m still not quite understanding what you’re thinking though.
For other objects, like physical ones, quantifiers have to be used. Like “at least one” or “the” (the latter only presupposes there is exactly one object satisfying some predicate). E.g. “the cat in the garden”. Perhaps there is no cat in the garden or there are several. So it (the cat) cannot be logically represented with a constant.
“the” supposes there’s exactly one canonical choice for what object in the context is indicated by the predicate. When you say “the cat” there’s basically always a specific cat from context you’re talking about. “The cat is in the garden” is different from “There’s exactly one cat in the garden”.
Maybe “Superman” is actually two people with the same dress, or he doesn’t exist, being the result of a hallucination. This case can be easily solved by treating those names as predicates.
The woman believes the superhero can fly.
The superhero is the colleague.
I mean there has to be some possibility for revising your world model if you notice that there are actually 2 objects for something where you previously thought there’s only one.
I agree that “Superman” and “the superhero” denote the same object (assuming you’re in the right context for “the superhero”).
(And yeah to some extent names also depend a bit on context. E.g. if you have 2 friends with the same name.)
You can say “{(the fact that) there’s an apple on the table} causes {(the fact that) I see an apple}”
But that’s not primitive in terms of predicate logic, because here “the” in “the table” means “this” which is not a primitive constant. You don’t mean any table in the world, but a specific one, which you can identify in the way I explained in my previous comment.
Yeah, I didn’t mean this as a formal statement. Formal would be:
{exists x: apple(x) AND location(x, on=Table342)} CAUSES {exists x: apple(x) AND see(SelfPerson, x)}
I think object identification is important if we want to analyze beliefs instead of sentences. For beliefs we can’t take a third person perspective and say “it’s clear from context what is meant”. Only the agent knows what he means when he has a belief (or she). So the agent has to have a subjective ability to identify things. For “I” this is unproblematic, because the agent is presumably internal and accessible to himself and therefore can be subjectively referred to directly. But for “this” (and arguably also for terms like “tomorrow”) the referred object depends partly on facts external to the agent. Those external facts might be different even if the internal state of the agent is the same. For example, “this” might not exist, so it can’t be a primitive term (constant) in standard predicate logic.
I’m not exactly sure what you’re saying here, but in case the following helps:
Indicators like “here”/“tomorrow”/“the object I’m pointing to” don’t get stored directly in beliefs. They are pointers used for efficiently identifying some location/time/object from context, but what gets saved in the world model is the statement with those pointers replaced by the referents they were pointing to.
One approach would be to analyze the belief that this apple is green as “There is an x such that x is an apple and x is green and x causes e.” Here “e” is a primitive term (similar to “I” in “I’m hungry”) that refers to the current visual experience of a green apple.
So e is subjective experience and therefore internal to the agent. So it can be directly referred to, while this (the green apple he is seeing) is only indirectly referred to (as explained above), similar to “the biggest tree”, “the prime minister of Japan”, “the contents of this box”.
Note the important role of the term “causes” here. The belief is representing a hypothetical physical object (the green apple) causing an internal object (the experience of a green apple). Though maybe it would be better to use “because” (which relates propositions) instead of “causes”, which relates objects or at least noun phrases. But I’m not sure how this would be formalized.
I think I still don’t understand what you’re trying to say, but some notes:
In my system, experiences aren’t objects, they are facts. E.g. the fact “cubefox sees an apple”.
CAUSES relates facts, not objects.
You can say “{(the fact that) there’s an apple on the table} causes {(the fact that) I see an apple}”
Even though we don’t have an explicit separate name in language for every apple we see, our mind still tracks every apple as a separate object which can be identified.
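(Rough sketch of the shape I have in mind, with made-up identifiers like Apple17. The point is just that facts, not objects, are the relata of CAUSES, and that each apple gets its own identifier.)

```python
# Facts are predicate applications over identified objects; CAUSES relates facts.
fact_world = ("location", "Apple17", "on", "Table342")  # there's an apple (Apple17) on the table
fact_experience = ("see", "SelfPerson", "Apple17")      # I see that apple
causal_link = ("CAUSES", fact_world, fact_experience)   # the causal relation holds between the two facts
```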
Btw, it’s very likely not what you’re talking about, but you actually need to be careful sometimes when substituting referent objects from indicators, in particular in cases where you talk about the world model of other people. E.g. if you have the beliefs:
1. Mia believes Superman can fly.
2. Superman is Clark Kent.
This doesn’t imply that “Mia believes Clark Kent can fly”, because Mia might not know (2). But essentially you just have a separate world model “Mia’s beliefs” in which Superman and Clark Kent are separate objects, and you just need to be careful to choose the referent of names (or likewise with indicators) relative to whose belief scope you are in.
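(A minimal sketch of the “separate belief scope” idea, with made-up structure: substituting co-referring names is only licensed by identities inside the relevant scope.)

```python
# My own beliefs include the identity (2); Mia's belief scope is its own store and doesn't.
my_identities = {frozenset({"Superman", "ClarkKent"})}
mia_beliefs = {("can_fly", "Superman")}   # (1), stored under Mia's scope
mia_identities = set()                    # Mia doesn't know (2)

def believes(beliefs, identities, pred, name):
    # Substituting a name is only allowed via identities *inside this scope*.
    aliases = {name} | {n for pair in identities if name in pair for n in pair}
    return any((pred, alias) in beliefs for alias in aliases)

print(believes(mia_beliefs, mia_identities, "can_fly", "Superman"))   # True
print(believes(mia_beliefs, mia_identities, "can_fly", "ClarkKent"))  # False: (2) isn't in Mia's scope
```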
Yep I did not cover those here. They are essentially shortcodes for identifying objects/times/locations from context. Related quote:
E.g. “the laptop” can refer to different objects in different contexts, but when used it’s usually clear which object is meant. However, how objects get identified does not concern us here—we simply assume that we know names for all objects and use them directly.
(“The laptop” is pretty similar to “This laptop”.)
(Though “this” can also act as a complementizer, as in “This is why I didn’t come”, though I think in that function it doesn’t count as indexical. The section related to complementizers is the “statement connectives” section.)
Introduction to Representing Sentences as Logical Statements
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
I guess I just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing, in that it sounds like one could have a good enough ground-truth reward and then that just has to be internalized.
I think this summary is better: 1. “The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)”. 2. Something else went wrong [not easily compressible].
Sounds like we probably agree basically everywhere.
Yeah, you can definitely mark me down in the camp of “not use ‘inner’ and ‘outer’ terminology”. If you need something for “outer”, how about “reward specification (problem/failure)”?
ADDED: I think I probably don’t want a word for inner-alignment/goal-misgeneralization. It would be like having a word for “the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket into wrong directions”.
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
Yeah I agree they don’t appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level I think.
Or more generally, I think when you don’t use utility functions explicitly then capability likely suffers, though I’m not totally sure.
Thanks.
Yeah I guess I wasn’t thinking concretely enough. I don’t know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here’s a bit of rambling: (I think point 6 is most important.)
As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn’t update strongly on abstract arguments like “I should correct my estimates based on outside view”.
The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
Say we have a smart reflective human where the value function basically trusts the self-model a lot, then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor would have different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
Another question of course is whether the inner self-reflective optimizers are likely aligned to the initial value function. I would need to think about it. Do you see this as a part of the inner alignment problem or as a separate problem?
As an aside, one question would be whether the way this human makes decisions is still essentially actor-critic model-based-RL-like—whether the critic just got replaced by a more competent version. I don’t really know.
(Of course, I totally acknowledge that humans have pre-wired machinery for their intuitive self-models, rather than that just spawning up. I’m not particularly discussing my original objection anymore.)
I’m also uncertain whether something working through the main actor-critic model-based RL mechanism would be capable enough to do something pivotal. Like yeah, most and maybe all current humans probably work that way. But if you go a bit smarter, then minds might use more advanced techniques, e.g. translating problems into abstract domains, writing narrow AIs to solve them there, and then translating the results back into concrete proposals or sth. Though maybe it doesn’t matter as long as the more advanced techniques don’t spawn up more powerful unaligned minds, in which case a smart mind would probably not use the technique in the first place. And I guess actor-critic model-based RL is sorta like expected utility maximization, which is pretty general and can get you far. Only the native kind of EU maximization we implement through actor-critic model-based RL might be very inefficient compared to other kinds.
I have a heuristic like “look at where the main capability comes from”, and I’d guess for very smart agents it perhaps doesn’t come from the value function making really good estimates by itself, and I want to understand how something could be very capable and look at the key parts for this and whether they might be dangerous.
Ignoring human self-models now, the way I imagine actor-critic model-based RL is that it would start out unreflective. It might eventually learn to model parts of itself and form beliefs about its own values. Then, the world-modelling machinery might be better at noticing inconsistencies in the behavior and value estimates of that agent than the agent itself. The value function might then learn to trust the world-model’s predictions about what would be in the interest of the agent/self.
This seems to me to sorta qualify as “there’s an inner optimizer”. I would’ve tentatively predicted you to say something like “yep, but it’s an inner-aligned optimizer”, but I’m not sure if you actually think this or whether you disagree with my reasoning here. (I would need to consider how likely value drift from such a change seems. I don’t know yet.)
I don’t have a clear take here. I’m just curious if you have some thoughts on where something importantly mismatches your model.
Thanks!
Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that.
I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But never mind, I noticed that my first attempt at explaining what feels wrong to me sucked, and thus dropped it.
The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)
The AGI was doing the right thing for the wrong reasons but got rewarded anyway (or doing the wrong thing for the right reasons but got punished).
This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not:
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution.
You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.

I think humans have particular not-easy-to-pin-down machinery inside them, that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions.
(Though we could aim for a different compatible utility function, namely the “indirect alignment” one that says “fulfill humanity’s CEV”, which has lower complexity than the ones humans generalize to (since the value generalization prior doesn’t need to be specified and can instead be inferred from observations about humans). (I think that is what’s meant by “corrigibly aligned” in “Risks from learned optimization”, though it has been a very long time since I read this.))

Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”:
1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium).
2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.

Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make it internalize the ground-truth reward (as opposed to e.g. being deceptive). Whereas I think “internalizing ground-truth reward” isn’t meaningful off distribution, and it’s actually a very hard problem to set up the system in a way that it generalizes in the way we want.
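(Toy illustration of the on-distribution vs. off-distribution point from above; entirely made up.)

```python
# Two "utility functions" that both perfectly match the ground-truth reward on the
# training distribution (states 0..9), but come apart off-distribution.

def reward(state):            # ground truth, only ever queried on the training states
    return 2 * state

def utility_intended(state):  # generalizes the way we hoped
    return 2 * state

def utility_other(state):     # identical on-distribution, very different off it
    return 2 * state if state < 10 else -1000

train_states = range(10)
assert all(utility_intended(s) == reward(s) == utility_other(s) for s in train_states)

print(utility_intended(50), utility_other(50))  # 100 vs -1000: which one you got matters a lot
```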
But maybe you’re aware of that “finding the right prior so it generalizes to the right utility function” problem, and you see it as part of inner alignment.
Note: I just noticed your post has a section “Manipulating itself and its learning process”, which I must’ve completely forgotten since I last read the post. I should’ve read your post before posting this. Will do so.
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
Calling problems “outer” and “inner” alignment seems to suggest that if we solved both we’ve successfully aligned AI to do nice things. However, this isn’t really the case here.
Namely, there could be a smart mesa-optimizer spinning up in the thought generator, whose thoughts are mostly invisible to the learned value function (LVF), and who can model the situation it is in, has different values, is smarter than the LVF evaluation, and can fool the LVF into believing that plans which are good according to the mesa-optimizer are great according to the LVF, even if they actually aren’t.
This kills you even if we have a nice ground-truth reward and the LVF accurately captures that.
In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don’t understand how it works.
I’d suggest not using conflated terminology and rather making up your own.
Or rather, first actually don’t use any abstract handles at all and just describe the problems/failure-modes directly, and when you’re confident you have a pretty natural breakdown of the problems with which you’ll stick for a while, then make up your own ontology.
In fact, while in your framework there’s a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
First attempt of explaining what seems wrong: If that was the first I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect “rewards that agree with what we want” to mean something like “changes in expected utility according to humanity’s CEV”. (Which would make inner alignment unnecessary because if we had outer alignment we could easily reach CEV.)
Second attempt:
“in a way that agrees with its eventual reward” seems to imply that there’s actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human’s (imperfect) world model. Problems:
1. Humans don’t actually give coherent rewards which are consistent with a utility function on their world model.
   - For this problem we might be able to define an extrapolation procedure that’s not too bad.
2. The reward depends on the state of the world model of the human, and our world models probably often have false beliefs.
   - Importantly, the setup needs to be designed in a way that there wouldn’t be an incentive to manipulate the humans into believing false things.
   - Maybe, optimistically, we could mitigate this problem by having the AI form a model of the operators, doing some ontology translation between the operator’s world model and its own world model, and flagging when there seems to be a relevant belief mismatch.
3. Our world models cannot yet evaluate whether e.g. filling the universe with computronium running a certain type of programs would be good, because we are confused about qualia and don’t know yet what would be good according to our CEV. Basically, the ground-truth reward would very often just say “I don’t know yet”, even for cases which are actually very important according to our CEV. It’s not just that we would need a faithful translation of the state of the universe into our primitive ontology (like “there are simulations of lots of happy and conscious people living interesting lives”), it’s also that (1) the way our world model treats e.g. “consciousness” may not naturally map to anything in a more precise ontology, and while our human minds, learning a deeper ontology, might go like “ah, this is what I actually care about—I’ve been so confused”, such value-generalization is likely even much harder to specify than basic ontology translation. And (2), our CEV may include value-shards which we currently do not know of or track at all.
So while this kind of outer-vs-inner distinction might be fine for human-level AIs, it stops being a good breakdown for smarter AIs, since whenever we want to make the AI do something where humans couldn’t evaluate the result within reasonable time, it needs to generalize beyond what could be evaluated through ground-truth reward.
So, mainly because of point 3, instead of asking “how can I make the learned value function agree with the ground-truth reward”, I think it may be better to ask “how can I make the learned value function generalize from the ground-truth reward in the way I want”?
(I guess the outer-vs-inner could make sense in a case where your outer evaluation is superhumanly good, though I cannot think of such a case where looking at the problem from the model-based RL framework would still make much sense, but maybe I’m still unimaginative right now.)
Note that I assumed here that the ground-truth signal is something like feedback from humans. Maybe you’re thinking of it differently than I described here, e.g. if you want to code a steering subsystem for providing ground-truth. But if the steering subsystem is not smarter than humans at evaluating what’s good or bad, the same argument applies. If you think your steering subsystem would be smarter, I’d be interested in why.
(All that is assuming you’re attacking alignment from the actor-critic model-based RL framework. There are other possible frameworks, e.g. trying to directly point the utility function on an agent’s world-model, where the key problems are different.)
Btw, just reading through everything from “dath ilan” might also be interesting: https://www.glowfic.com/replies/search?board_id=&author_id=&template_id=&character_id=11562&subj_content=&sort=created_old&commit=Search