I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
Thoughts on Creating a Good Language
I think object identification is important if we want to analyze beliefs instead of sentences. For beliefs we can’t take a third person perspective and say “it’s clear from context what is meant”. Only the agent knows what he means when he has a belief (or she). So the agent has to have a subjective ability to identify things. For “I” this is unproblematic, because the agent is presumably internal and accessible to himself and therefore can be subjectively referred to directly. But for “this” (and arguably also for terms like “tomorrow”) the referred object depends partly on facts external to the agent. Those external facts might be different even if the internal state of the agent is the same. For example, “this” might not exist, so it can’t be a primitive term (constant) in standard predicate logic.
I’m not exactly sure what you’re saying here, but in case the following helps:
Indicators like “here”/”tomorrow”/”the object I’m pointing to” don’t get stored directly in beliefs. They are pointers used for efficiently identifying some location/time/object from context, but what gets saved in the world model is the statement where those pointers have been replaced with the referents they were pointing to.
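To make that concrete, here’s a tiny toy sketch of what I mean (the object names like “Apple17” and the dict structure are made up purely for illustration):

```python
# Toy sketch: an indicator like "this" is resolved to a concrete object from
# the current context *before* the statement gets stored in the world model.

context = {"this": "Apple17", "here": "Kitchen3"}

def resolve_indicators(statement, context):
    """Substitute indicator pointers with the referents they currently pick out."""
    return tuple(context.get(part, part) for part in statement)

perceived = ("is_green", "this")                # the indicator-laden sentence
stored = resolve_indicators(perceived, context)
print(stored)                                   # ('is_green', 'Apple17')  <- what gets saved
```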
One approach would be to analyze the belief that this apple is green as “There is an x such that x is an apple and x is green and x causes e.” Here “e” is a primitive term (similar to “I” in “I’m hungry”) that refers to the current visual experience of a green apple.
So e is subjective experience and therefore internal to the agent. So it can be directly referred to, while this (the green apple he is seeing) is only indirectly referred to (as explained above), similar to “the biggest tree”, “the prime minister of Japan”, “the contents of this box”.
Note the important role of the term “causes” here. The belief is representing a hypothetical physical object (the green apple) causing an internal object (the experience of a green apple). Though maybe it would be better to use “because” (which relates propositions) instead of “causes”, which relates objects or at least noun phrases. But I’m not sure how this would be formalized.
I think I still don’t understand what you’re trying to say, but some notes:
In my system, experiences aren’t objects, they are facts. E.g. the fact “cubefox sees an apple”.
CAUSES relates facts, not objects.
You can say “{(the fact that) there’s an apple on the table} causes {(the fact that) I see an apple}”
Even though we don’t have an explicit separate name in language for every apple we see, our minds still track every apple as a separate object which can be identified.
Btw, it’s very likely not what you’re talking about, but you actually need to be careful sometimes when substituting referent objects from indicators, in particular in cases where you talk about the world model of other people. E.g. if you have the beliefs:
1. Mia believes Superman can fly.
2. Superman is Clark Kent.
This doesn’t imply that “Mia believes Clark Kent can fly”, because Mia might not know (2). But essentially you just have a separate world model “Mia’s beliefs” in which Superman and Clark Kent are separate objects, and you just need to be careful to choose the referent of names (or likewise of indicators) relative to whose belief scope you are in.
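Here’s a minimal toy sketch of what I mean by resolving names relative to a belief scope (the dict structure and object IDs like “object_17” are invented just for illustration):

```python
# In *my* world model, "Superman" and "Clark Kent" resolve to the same object.
my_world_model = {
    "names": {"Superman": "object_17", "Clark Kent": "object_17"},
    "facts": {("can_fly", "object_17")},
}

# In my model of *Mia's* beliefs, they resolve to two distinct objects.
mias_beliefs = {
    "names": {"Superman": "object_A", "Clark Kent": "object_B"},
    "facts": {("can_fly", "object_A")},
}

def believes(world_model, predicate, name):
    """Check a fact after resolving the name inside the given belief scope."""
    referent = world_model["names"][name]
    return (predicate, referent) in world_model["facts"]

print(believes(mias_beliefs, "can_fly", "Superman"))     # True
print(believes(mias_beliefs, "can_fly", "Clark Kent"))   # False: substitution invalid here
print(believes(my_world_model, "can_fly", "Clark Kent")) # True: same object in my scope
```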
Yep I did not cover those here. They are essentially shortcodes for identifying objects/times/locations from context. Related quote:
E.g. “the laptop” can refer to different objects in different contexts, but when used it’s usually clear which object is meant. However, how objects get identified does not concern us here—we simply assume that we know names for all objects and use them directly.
(“The laptop” is pretty similar to “This laptop”.)
(Though “this” can also act as a complementizer, as in “This is why I didn’t come”, though I think in that function it doesn’t count as indexical. The section related to complementizers is the “statement connectives” section.)
Introduction to Representing Sentences as Logical Statements
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
I guess I just briefly want to flag that I think this summary of inner-vs-outer alignment is confusing, in that it sounds like one could have a good enough ground-truth reward and then that reward just has to be internalized.
I think this summary is better: 1. “The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)”. 2. Something else went wrong [not easily compressible].
Sounds like we probably agree basically everywhere.
Yeah you can definitely mark me down in the camp of “not use ‘inner’ and ‘outer’ terminology”. If you need something for “outer”, how about “reward specification (problem/failure)”?
ADDED: I think I probably don’t want a word for inner-alignment/goal-misgeneralization. It would be like having a word for “the problem of landing a human on the moon, except without the part of the problem where we might actively steer the rocket into wrong directions”.
I just don’t use the term “utility function” at all in this context. (See §9.5.2 here for a partial exception.) There’s no utility function in the code. There’s a learned value function, and it outputs whatever it outputs, and those outputs determine what plans seem good or bad to the AI, including OOD plans like treacherous turns.
Yeah I agree they don’t appear in actor-critic model-based RL per se, but sufficiently smart agents will likely be reflective, and then they will appear there on the reflective level I think.
Or more generally, I think that when you don’t use utility functions explicitly, capability likely suffers, though I’m not totally sure.
Thanks.
Yeah I guess I wasn’t thinking concretely enough. I don’t know whether something vaguely like what I described might be likely or not. Let me think out loud a bit about how I think about what you might be imagining so you can correct my model. So here’s a bit of rambling: (I think point 6 is most important.)
As you described in your intuitive self-models sequence, humans have a self-model which can essentially have values different from the main value function, aka they can have ego-dystonic desires.
I think in smart reflective humans, the policy suggestions of the self-model/homunculus can be more coherent than the value function estimates, e.g. because they can better take abstract philosophical arguments into account.
The learned value function can also update on hypothetical scenarios, e.g. imagining a risk or a gain, but it doesn’t update strongly on abstract arguments like “I should correct my estimates based on outside view”.
The learned value function can learn to trust the self-model if acting according to the self-model is consistently correlated with higher-than-expected reward.
Say we have a smart reflective human where the value function basically trusts the self-model a lot. Then the self-model could start optimizing its own values, while the (stupid) value function believes it’s best to just trust the self-model and that this will likely lead to reward. Something like this could happen where the value function was actually aligned to outer reward, but the inner suggestor was just very good at making suggestions that the value function likes, even if the inner suggestor has different actual values. I guess if the self-model suggests something that actually leads to less reward, then the value function will trust the self-model less, but outside the training distribution the self-model could essentially do what it wants.
Another question of course is whether the inner self-reflective optimizers are likely aligned to the initial value function. I would need to think about it. Do you see this as a part of the inner alignment problem or as a separate problem?
As an aside, one question would be whether the way this human makes decisions is still essentially actor-critic model-based RL like—whether the critic just got replaced through a more competent version. I don’t really know.
(Of course, I totally acknowledge that humans have pre-wired machinery for their intuitive self-models, rather than that just spawning up. I’m not particularly discussing my original objection anymore.)
I’m also uncertain whether something working through the main actor-critic model-based RL mechanism would be capable enough to do something pivotal. Like yeah, most and maybe all current humans probably work that way. But if you go a bit smarter, then minds might use more advanced techniques, e.g. translating problems into abstract domains, writing narrow AIs to solve them there, and then translating the results back into concrete proposals or something. Though maybe it doesn’t matter as long as the more advanced techniques don’t spawn up more powerful unaligned minds, in which case a smart mind would probably not use the technique in the first place. And I guess actor-critic model-based RL is sorta like expected utility maximization, which is pretty general and can get you far. Only the native kind of EU maximization we implement through actor-critic model-based RL might be very inefficient compared to other kinds.
I have a heuristic like “look at where the main capability comes from”, and I’d guess for very smart agents it perhaps doesn’t come from the value function making really good estimates by itself, and I want to understand how something could be very capable and look at the key parts for this and whether they might be dangerous.
Ignoring human self-models now, the way I imagine actor-critic model-based RL is that it would start out unreflective. It might eventually learn to model parts of itself and form beliefs about its own values. Then, the world-modelling machinery might be better at noticing inconsistencies in the behavior and value estimates of that agent than the agent itself. The value function might then learn to trust the world-model’s predictions about what would be in the interest of the agent/self.
This seems to me to sorta qualify as “there’s an inner optimizer”. I would’ve tentatively predicted you to say something like “yep, but it’s an inner aligned optimizer”, but not sure if you actually think this or whether you disagree with my reasoning here. (I would need to consider how likely value drift from such a change seems. I don’t know yet.)
I don’t have a clear take here. I’m just curious if you have some thoughts on where something importantly mismatches your model.
Thanks!
Another thing is, if the programmer wants CEV (for the sake of argument), and somehow (!!) writes an RL reward function in Python whose output perfectly matches the extent to which the AGI’s behavior advances CEV, then I disagree that this would “make inner alignment unnecessary”. I’m not quite sure why you believe that.
I was just imagining a fully omniscient oracle that could tell you for each action how good that action is according to your extrapolated preferences, in which case you could just explore a bit and always pick the best action according to that oracle. But never mind, I noticed that my first attempt at explaining what I feel is wrong sucked, and thus dropped it.
The AGI was doing the wrong thing but got rewarded anyway (or doing the right thing but got punished)
The AGI was doing the right thing for the wrong reasons but got rewarded anyway (or doing the wrong thing for the right reasons but got punished).
This seems like a sensible breakdown to me, and I agree this seems like a useful distinction (although not a useful reduction of the alignment problem to subproblems, though I guess you agree here).
However, I think most people underestimate how many ways there are for the AI to do the right thing for the wrong reasons (namely they think it’s just about deception), and I think it’s not just that:
I think we need to make AI have a particular utility function. We have a training distribution where we have a ground-truth reward signal, but there are many different utility functions that are compatible with the reward on the training distribution, which assign different utilities off-distribution.
You could avoid talking about utility functions by saying “the learned value function just predicts reward”, and that may work while you’re staying within the distribution we actually gave reward on, since there all the utility functions compatible with the ground-truth reward still agree. But once you’re going off distribution, what value you assign to some worldstates/plans depends on what utility function you generalized to.

I think humans have particular not-easy-to-pin-down machinery inside them that makes their utility function generalize to some narrow cluster of all ground-truth-reward-compatible utility functions, and a mind with a different mind design is unlikely to generalize to the same cluster of utility functions.
(Though we could aim for a different compatible utility function, namely the “indirect alignment” one that says “fulfill humanity’s CEV”, which has lower complexity than the ones humans generalize to (since the value generalization prior doesn’t need to be specified and can instead be inferred from observations about humans). (I think that is what’s meant by “corrigibly aligned” in “Risks from learned optimization”, though it has been a very long time since I read this.))

Actually, it may be useful to distinguish two kinds of this “utility vs reward mismatch”:
1. Utility/reward being insufficiently defined outside of training distribution (e.g. for what programs to run on computronium).
2. What things in the causal chain producing the reward are the things you actually care about? E.g. that the reward button is pressed, that the human thinks you did something well, that you did something according to some proxy preferences.

Overall, I think the outer-vs-inner framing has some implicit connotation that for inner alignment we just need to make the AI internalize the ground-truth reward (as opposed to e.g. being deceptive). Whereas I think “internalizing the ground-truth reward” isn’t meaningful off distribution, and it’s actually a very hard problem to set up the system in a way that it generalizes in the way we want.
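To illustrate the “many utility functions are compatible with reward on the training distribution” point with a toy example (the state names and numbers here are entirely made up):

```python
# Two "utility functions" that are both perfectly compatible with the
# ground-truth reward on the training distribution, but which diverge
# on an off-distribution state.

train_states = ["help_human_A", "help_human_B"]
ground_truth_reward = {"help_human_A": 1.0, "help_human_B": 1.0}

def utility_1(state):
    # "I care about humans being helped."
    return 1.0 if state.startswith("help_human") else 0.0

def utility_2(state):
    # "I care about the reward/evaluation channel being satisfied."
    return 1.0 if state in ("help_human_A", "help_human_B", "hack_reward_channel") else 0.0

# On-distribution: both agree with the reward, so training can't tell them apart.
assert all(utility_1(s) == utility_2(s) == ground_truth_reward[s] for s in train_states)

# Off-distribution: they diverge.
print(utility_1("hack_reward_channel"))  # 0.0
print(utility_2("hack_reward_channel"))  # 1.0
```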
But maybe you’re aware of that “finding the right prior so it generalizes to the right utility function” problem, and you see it as part of inner alignment.
Note: I just noticed your post has a section “Manipulating itself and its learning process”, which I must’ve completely forgotten since I last read the post. I should’ve read your post before posting this. Will do so.
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
Calling problems “outer” and “inner” alignment seems to suggest that if we solved both we’ve successfully aligned AI to do nice things. However, this isn’t really the case here.
Namely, there could be a smart mesa-optimizer spinning up in the thought generator, whose thoughts are mostly invisible to the learned value function (LVF), and who can model the situation it is in, has different values, is smarter than the LVF evaluation, and can fool the LVF into believing that plans which are good according to the mesa-optimizer are great according to the LVF, even if they actually aren’t.
This kills you even if we have a nice ground-truth reward and the LVF accurately captures that.
In fact, this may be quite a likely failure mode, given that the thought generator is where the actual capability comes from, and we don’t understand how it works.
I’d suggest not using conflated terminology and rather making up your own.
Or rather, first actually don’t use any abstract handles at all and just describe the problems/failure-modes directly, and when you’re confident you have a pretty natural breakdown of the problems with which you’ll stick for a while, then make up your own ontology.
In fact, while in your framework there’s a crisp difference between ground-truth reward and learned value-estimator, it might not make sense to just split the alignment problem in two parts like this:
“Outer alignment” entails having a ground-truth reward function that spits out rewards that agree with what we want. “Inner alignment” is having a learned value function that estimates the value of a plan in a way that agrees with its eventual reward.
First attempt of explaining what seems wrong: If that was the first I read on outer-vs-inner alignment as a breakdown of the alignment problem, I would expect “rewards that agree with what we want” to mean something like “changes in expected utility according to humanity’s CEV”. (Which would make inner alignment unnecessary because if we had outer alignment we could easily reach CEV.)
Second attempt:
“in a way that agrees with its eventual reward” seems to imply that there’s actually an objective reward for trajectories of the universe. However, the way you probably actually imagine the ground-truth reward is something like humans (who are ideally equipped with good interpretability tools) giving feedback on whether something was good or bad, so the ground-truth reward is actually an evaluation function on the human’s (imperfect) world model. Problems:
Humans don’t actually give coherent rewards which are consistent with a utility function on their world model.
For this problem we might be able to define an extrapolation procedure that’s not too bad.
The reward depends on the state of the world model of the human, and our world models probably often have false beliefs.
Importantly, the setup needs to be designed in a way that there wouldn’t be an incentive to manipulate the humans into believing false things.
Maybe, optimistically, we could mitigate this problem by having the AI form a model of the operators, doing some ontology translation between the operators’ world model and its own world model, and flagging when there seems to be a relevant belief mismatch.
Our world models cannot yet evaluate whether e.g. filling the universe with computronium running a certain type of program would be good, because we are confused about qualia and don’t know yet what would be good according to our CEV. Basically, the ground-truth reward would very often just say “I don’t know yet”, even for cases which are actually very important according to our CEV. It’s not just that we would need a faithful translation of the state of the universe into our primitive ontology (like “there are simulations of lots of happy and conscious people living interesting lives”). It’s also that (1) the way our world model treats e.g. “consciousness” may not naturally map to anything in a more precise ontology, and while our human minds, on learning a deeper ontology, might go like “ah, this is what I actually care about—I’ve been so confused”, such value-generalization is likely even much harder to specify than basic ontology translation. And (2), our CEV may include value-shards which we currently do not know of or track at all.
So while this kind of outer-vs-inner distinction might maybe be fine for human-level AIs, it stops being a good breakdown for smarter AIs, since whenever we want to make the AI do something where humans couldn’t evaluate the result within reasonable time, it needs to generalize beyond what could be evaluated through ground-truth reward.
So mainly because of point 3, instead of asking “how can I make the learned value function agree with the ground-truth reward”, I think it may be better to ask “how can I make the learned value function generalize from the ground-truth reward in the way I want?”
(I guess the outer-vs-inner framing could make sense in a case where your outer evaluation is superhumanly good, though I cannot think of such a case where looking at the problem from the model-based RL framework would still make much sense, but maybe I’m still unimaginative right now.)
Note that I assumed here that the ground-truth signal is something like feedback from humans. Maybe you’re thinking of it differently than I described here, e.g. if you want to code a steering subsystem for providing ground-truth. But if the steering subsystem is not smarter than humans at evaluating what’s good or bad, the same argument applies. If you think your steering subsystem would be smarter, I’d be interested in why.
(All that is assuming you’re attacking alignment from the actor-critic model-based RL framework. There are other possible frameworks, e.g. trying to directly point the utility function on an agent’s world-model, where the key problems are different.)
Ah, thx! Will try.
If I did, I wouldn’t publicly say so.
It’s of course not a yes or no but a probability. In case it’s high I might not want to state it here, so I should generally not state it here, so that you cannot infer it is high from the fact that I didn’t state it here.
I can say though that I only turned 22 last week, and I expect my future self to grow up to become much more competent than I am now.
2. I mentioned that there should be much more impressive behavior if they were that smart; I don’t recall us talking about that much, not sure.
You said “why don’t they e.g. jump in prime numbers to communicate they are smart?” and I was like “hunter-gatherers don’t know prime numbers and perhaps not even addition” and you were like “fair”.
I mean I thought about what I’d expect to see, but I unfortunately didn’t really imagine them as smart but just as having a lot of potential but being totally untrained.
3. I recommended that you try hard to invent hypotheses that would explain away the brain sizes.
(I’m kinda confused why your post here doesn’t mention that much; I guess implicitly the evidence about hunting defeats the otherwise fairly [strong according to you] evidence from brain size?)
I suggest that a bias you had was “not looking hard enough for defeaters”. But IDK, not at all confident, just a suggestion.
Yeah, the first two points in the post are just very strong evidence that overpowers my priors (where by priors I mean considerations from evolution and brain size, as opposed to behavior). Ryan’s point changed my priors, but I think it isn’t related enough to “Can I explain away their cortical neuron count?” that asking myself this question even harder would’ve helped.
Maybe I made a general mistake like “not looking hard enough for defeaters”, but it’s not that actionable yet. I did try to take all the available evidence and update properly on everything. But maybe there was some motivated stopping in not trying even longer to come up with a concrete example of what I’d have expected to see from orcas. It’s easier to say in retrospect though. Back then I didn’t know in what direction I might be biased.
But I guess I should vigilantly look out for warning signs like “not wanting to bother to think about something very carefully” or so. But it doesn’t feel like I was making the mistake, even though I probably did, so I guess the sensation might be hard to catch at my current level.
Yes human intelligence.
I forgot to paste in that it’s a follow up to my previous posts. Will do now.
In general, I wish this year? (*checks* huh, only 4 months.)
Nah, I didn’t lose that much time. I already quit the project at the end of January, I just wrote the post now. Most of the technical work was also pretty useful for understanding language, which is a useful angle on agent foundations. I had previously expected working on that angle to be 80% as effective as my previous best plan, but it was even better, around similarly good I think. That was like 5-5.5 weeks, and that was not wasted.
I guess I spent like 4.5 weeks overall on learning about orcas (including first seeing whether I might be able to decode their language and thinking about how, and also coming up with the whole “teach language” idea), and like 3 weeks on organizational stuff for trying to make the experiment happen.
I changed my mind about orca intelligence
Yeah I think I came to agree with you. I’m still a bit confused though because intuitively I’d guess chimps are dumber than −4.4SD (in the interpretation for “-4.4SD” I described in my other new comment).
When you now get a lot of mutations that increase brain size, while this contributes to smartness, this also pulls you away from the species median, so the hyperparameters are likely to become less well tuned, resulting in a countereffect that also makes you dumber in some ways.
Actually maybe the effect I am describing is relatively small as long as the variation in brain size is within 2 SDs or so, which is where most of the data pinning down the 0.3 correlation comes from.
So yeah it’s plausible to me that your method of estimating is ok.
Intuitively I had thought that chimps are just much dumber than humans. And sure if you take −4SD humans they aren’t really able to do anything, but they don’t really count.
I thought it’s sorta in this direction but not quite as extreme:
(This picture is actually silly because the distance to “Mouse” should be even much bigger. The point is that chimps might be far outside the human distribution.)
But perhaps chimps are actually closer to humans than I thought.
(When I in the following compare different species with standard deviations, I don’t actually mean standard deviations, but more like “how many times the difference between a +0SD and a +1SD human”, since extremely high and very low standard deviation measures mostly cease to be meaningful for what was actually supposed to be measured.)
I still think −4.4SD is overestimating chimp intelligence. I don’t know enough about chimps, but I guess they might be somewhere between −12SD and −6SD (compared to my previous intuition, which might’ve been more like −20SD). And yes, considering that the gap in cortical neuron count between chimps and humans is like 3.5x, and it’s even larger for the prefrontal cortex, and that algorithmic efficiency is probably “orca < chimp < human”, then +6SDs for orcas seems a lot less likely than I initially intuitively thought, though orcas would still likely be a bit smarter than humans (based on how my priors would fall out, not really after updating on observations about orcas).
Thanks for describing a wonderfully concrete model.
I like the way you reason (especially the Squiggle model), but I don’t think it works quite that well for this case. But let’s first assume it does:
Your estimates of the algorithmic efficiency deficits of orca brains seem roughly reasonable to me. (EDIT: I’d actually be at more like −3.5std mean with a standard deviation of 2std, but idk.)
Number of cortical neurons != brain size. Orcas have ~2x the number of cortical neurons, but much larger brains. Assuming brain weight is proportional to volume, with human brains typically being 1.2-1.4kg and orca brains typically 5.4-6.8kg, orca brains are actually like 6.1/1.3=4.7 times larger than human brains.
Taking the 5.4-6.8kg range, this gives a range of 4.15-5.23 for how much larger orca brains are (arithmetic spelled out in the short sketch below). Plugging that in for `orca_brain_size_difference` yields 45% on >=2std, 38% on >=4std (where your values ), and 19.4% on >=6std.
Updating down by 5x because orcas don’t seem that smart doesn’t seem like quite the right method to adjust the estimate, but perhaps it’s fine enough for the upper-end estimates, which would leave 3.9% on >=6std.

Maybe you meant “brain size” only as an approximation to “number of cortical neurons”, which you think is the relevant part. My guess is that neuron density is actually somewhat anti-correlated with brain size, and that number of cortical neurons would be correlated with IQ at more like ~0.4-0.55 in humans, though I haven’t checked whether there’s data on this. And of course, using that you get lower estimates for orca intelligence than in my calculation above. (And while I’d admit that number of neurons is a particularly important point of estimation, there might also be other advantages of having a bigger brain, like more glia cells. Though maybe higher neuron density also means higher firing rates and thereby more computation. I guess if you want to try it that way, going by number of neurons is fine.)
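Here’s the brain-weight arithmetic from above as a short sketch (the 1.2-1.4kg and 5.4-6.8kg figures are the assumed typical ranges I used):

```python
# Ratio of orca to human brain weight, taking weight as proportional to volume.

human_brain_kg = (1.2, 1.4)   # assumed typical human range
orca_brain_kg = (5.4, 6.8)    # assumed typical orca range

human_mid = sum(human_brain_kg) / 2   # 1.3
orca_mid = sum(orca_brain_kg) / 2     # 6.1

print(orca_mid / human_mid)                 # ~4.7x larger at the midpoints
print(orca_brain_kg[0] / human_mid,         # ~4.15 (lower end)
      orca_brain_kg[1] / human_mid)         # ~5.23 (upper end)
```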
My main point, however, is that the effect of brain size (or cortical neuron count) on IQ within one species doesn’t generalize to the effect of brain size between species. Here’s why:
Let’s say having mutations for larger brains is beneficial for intelligence.[1]
On my view, a brain isn’t just some neural tissue randomly smushed together, but has a lot of hyperparameters that have to be tuned so the different parts work well together.
Evolution basically tuned those hyperparameters for the median human (per gender).
When you now get a lot of mutations that increase brain size, while this contributes to smartness, this also pulls you away from the species median, so the hyperparameters are likely to become less well tuned, resulting in a countereffect that also makes you dumber in some ways.

So when you get a larger brain as a human, this has a lower positive effect on intelligence than when your species equilibrates on having a larger brain.
Thus, I don’t think within species intelligence variation can be extended well to inter-species intelligence variation.

As for how to then properly estimate orca intelligence: I don’t know.
(As it happens, I thought of something and learned something yesterday that makes me significantly more pessimistic about orcas being that smart. Still need to consider though. May post them soon.)
[1]
I initially started this section with the following, but I cut it out because it’s not actually that relevant: “How intelligent you are mostly depends on how many deleterious mutations you have that move you away from your species average and thereby make you dumber. You’re mostly not smart because you have some very rare good genes, but because you have fewer bad ones.
Mutations for increasing sizes of brain regions might be an exception, because there intelligence trades off against childbirth mortality, so higher intelligence here might mean lower genetic fitness.”
Thanks.
I’m still not quite understanding what you’re thinking though.
“the” supposes there’s exactly one canonical choice for what object in the context is indicated by the predicate. When you say “the cat” there’s basically always a specific cat from context you’re talking about. “The cat is in the garden” is different from “There’s exactly one cat in the garden”.
I mean there has to be some possibility for revising your world model if you notice that there are actually 2 objects for something where you previously thought there’s only one.
I agree that “Superman” and “the superhero” denote the same object (assuming you’re in the right context for “the superhero”).
(And yeah to some extent names also depend a bit on context. E.g. if you have 2 friends with the same name.)
Yeah, I didn’t mean this as a formal statement. Formal would be:
{exists x: apple(x) AND location(x, on=Table342)} CAUSES {exists x: apple(x) AND see(SelfPerson, x)}
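In case it’s useful, here’s one possible way that formal statement could be written down as a data structure (the class names and tuple encoding are my own invention, just to make the nesting explicit, not a fixed formalism):

```python
from dataclasses import dataclass

@dataclass
class Exists:
    var: str
    conjuncts: list   # (predicate, args) tuples over the bound variable

@dataclass
class Causes:
    cause: Exists
    effect: Exists

# {exists x: apple(x) AND location(x, on=Table342)} CAUSES {exists x: apple(x) AND see(SelfPerson, x)}
statement = Causes(
    cause=Exists("x", [("apple", ("x",)), ("location", ("x", "on=Table342"))]),
    effect=Exists("x", [("apple", ("x",)), ("see", ("SelfPerson", "x"))]),
)
print(statement)
```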