This inspired a full-length post.
Ronny Fernandez
Quick submission:
The first two prongs of OAI’s approach seem to be aiming at getting a training signal aligned with human values. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function; both, I think, are charitable assumptions for OAI. Even if we could search the space of all models and find one that, in simulations, does great at maximizing the correct utility function (which we found by using ML to amplify human evaluations of behavior), that is no guarantee that the model we find in that search is aligned. On my current view it is not even great evidence that the model is aligned. Most intelligent agents that know they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think, because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). And P(aligned | intelligent) is very low, since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.
The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.
The third prong of OAI’s strategy seems doomed to me, but I can’t really say why in a way I think would convince anybody who doesn’t already agree. It’s totally possible that I and all the people who agree with me here are wrong about this, but you have to hope that there is some model which, combined with human alignment researchers, is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI-complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non-AGI-complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn’t going to work for obvious reasons.
I think the whole thing fails way before this, but I’m happy to spot OAI those failures in order to focus on the real problem. Again, the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Like, not even one bit of selection. The likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.
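To put the same point in odds form (this is just the claim above written out):

$$\frac{P(\text{aligned}\mid\text{behaves aligned},\ \text{intelligent})}{P(\text{unaligned}\mid\text{behaves aligned},\ \text{intelligent})}=\underbrace{\frac{P(\text{behaves aligned}\mid\text{aligned},\ \text{intelligent})}{P(\text{behaves aligned}\mid\text{unaligned},\ \text{intelligent})}}_{\approx\,1}\times\frac{P(\text{aligned}\mid\text{intelligent})}{P(\text{unaligned}\mid\text{intelligent})}$$

The posterior odds just equal the (very low) prior odds: watching the behavior of a sufficiently intelligent model under selection pressure gives you no bits about its values.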
I loved this, but maybe it should come with a CW.
I assumed he meant the thing that most activates the face detector, but from skimming some of what people said above, it seems like maybe we don’t know what that is.
There’s a nearby, kind of obvious, but rarely directly addressed generalization of one of your arguments: ML learns complex functions all the time, so why should human values be any different? I rarely see this discussed, and I thought the replies from Nate and the ELK-related difficulties were important to have out in the open, so thanks a lot for including the face learning <-> human values learning analogy.
Ege Erdil gave an important disanalogy between the problem of recognizing/generating a human face and the problem of either learning human values or learning what plans that advance human values look like. The disanalogy is that humans are near-perfect human face recognizers, but we are not near-perfect recognizers of valuable world-states or value-advancing plans. This means that if we trained an AI to recognize either valuable world-states or value-advancing plans, we would actually end up just training something that recognizes what we can recognize as valuable states or plans. If we trained it the way we train GANs, the discriminator would fail to discriminate actually valuable world-states produced by the generator from ones that merely look really valuable to humans but are not valuable at all, according to those same humans, once they understand the plan/state well enough. So we would need some sort of ELK proposal that works in order to get any real comfort from the face recognizing/generating <-> human values learning analogy.
Nate Soares points out on Twitter that the supposedly maximally-face-like images according to GAN models look like horrible monstrosities, so, following the analogy, we should expect that for similar models doing similar things with human values, the maximally valuable world-state would also look like some horrible monstrosity.
I also find it somewhat taboo but not so much that I haven’t wondered about it.
Just realized that’s not UAI. Been looking for this source everywhere, thanks.
Ok, I understand that, although I never did find a proof in UAI that they are equivalent. If you know where it is, please point it out to me.
I still think that Solomonoff induction assigns 0 to uncomputable bit strings, and I don’t see why you don’t think so.
Like, the outputs of programs that never halt are still computable, right? I thought we were just using a “prints something eventually” oracle, not a halting oracle.
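For concreteness, the prior I have in mind is the monotone universal semimeasure from UAI; roughly, and in my own loose notation:

$$M(x)=\sum_{p\,:\,U(p)=x*}2^{-\ell(p)}$$

i.e. you sum $2^{-\ell(p)}$ over programs $p$ whose output on the monotone universal machine $U$ starts with $x$; the program only ever has to print $x$ eventually, it never has to halt.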
Simplicity in the description-length sense is incompatible with uncomputability; uncomputability means there is no finite way to point to the function. That’s what I currently think, but I’m confused that you understand all those words and still disagree.
A lot of folks seem to think that general intelligences are algorithmically simple. Paul Christiano seems to think this when he says that the universal distribution is dominated by simple consequentialists.
But the only formalism I know for general intelligences is uncomputable, which is as algorithmically complicated as you can get.
The computable approximations are plausibly simple, but are the tractable approximations simple? The only example I have of a physically realized AGI seems to be very much not algorithmically simple.
Thoughts?
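(For concreteness, the formalism I mean is AIXI from UAI; roughly, and with notation that is my loose paraphrase of Hutter’s:

$$a_k:=\arg\max_{a_k}\sum_{o_k r_k}\ \cdots\ \max_{a_m}\sum_{o_m r_m}\big(r_k+\cdots+r_m\big)\sum_{q\,:\,U(q,\,a_1\ldots a_m)\,=\,o_1 r_1\ldots o_m r_m}2^{-\ell(q)}$$

The mixture over all programs $q$ for the universal machine $U$ is only limit-computable, which is where the uncomputability comes in.)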
After trying it, I’ve decided that I am going to charge more like five dollars per step, but yes, thoughts included.
Can we apply for consultation as a team of two? We only want the remote consultation, out of the resources you are offering, because we are not based in the Bay Area.
For anyone who may have the executive function to go for the 1M, I propose myself as a cheap author if I get to play the dungeon master role or the player role, but not if I have to do both. I recommend taking me as the dungeon master. This sounds genuinely fun to me. I would happily do a dollar per step.
I can also help think about how to scale the operation, but I don’t think I have the executive function, management experience, or slack to pull it off myself.
I am Ronny Fernandez. You can contact me on fb.
I came here to say something pretty similar to what Duncan said, but I had a different focus in mind.
It seems like it’s easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep, intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It’s much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what counts as good PR using polls.
Maybe for this reason we should expect being into PR to be a relatively stable property of organizations, while being into honor is a fragile and precious thing for an organization.
This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.
There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it’s just the utility function you endorse, and that’s up to you. The other black box is the space of programs that you could be. Maybe it’s limited by memory, maybe it’s limited by run time, maybe it’s any finite state machine with fewer than 10^20 states, or maybe it’s Python programs less than 5,000 characters long: some limited set of programs that take your sensory data and motor output history as input and return a motor output. The limitations could be whatever; they don’t have to be like this.
Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs are then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.
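Here is a toy sketch of the construction in code; everything in it is a placeholder (the black boxes are just passed in as arguments), not something I know how to actually build:

```python
# Toy sketch of the measure described above. Every piece here is a stand-in:
# the two black boxes are simply passed in as arguments, and the world objects
# are assumed to expose a hypothetical `history_if_you_were` method.

def rationality_scores(bounded_programs, true_utility, prior):
    """Rank the programs you could be by the expected utility the ideal agent
    (your true utility function + the right prior) assigns to you being each one.

    bounded_programs: iterable of candidate policies (e.g. all FSMs under a size bound)
    true_utility:     black box #1 -- maps a world history to a number
    prior:            the ideal agent's prior, as a dict {possible_world: probability}
                      (optionally already conditioned on your sensory/motor data)
    """
    scores = {}
    for program in bounded_programs:
        scores[program] = sum(
            prob * true_utility(world.history_if_you_were(program))  # hypothetical method
            for world, prob in prior.items()
        )
    return scores

# Self-modification counts as an improvement, by this measure, exactly when the
# program you become gets a higher score than the program you are now.
```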
I don’t think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt, and that are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never-halting Turing machine than to every world whose Turing machine halts.
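Spelled out, the condition I’m claiming no computable $U$ can satisfy (my formalization, so take the exact statement loosely):

$$\forall M_1, M_2:\quad \big(M_1\ \text{never halts}\ \wedge\ M_2\ \text{halts}\big)\ \Rightarrow\ U(W_{M_1})>U(W_{M_2})$$

where $W_M$ is the world whose utopia lasts exactly as long as $M$ keeps running. Computing $U$ to enough precision would then tell you something about halting behavior, which is the sort of thing no finite computation can give you.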
I do think this is an important concept to explain our conception of goal-directedness, but I don’t think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).
This definition is also supposed to explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.
Yeah, I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations, one of them is aligned, and they are all equally capable; then the unaligned ones will maybe be slightly slower in coming up with aligned behavior, or might have some other small disadvantage.
However, in the challenge described in the post it’s going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.
I think the advantage of the aligned superintelligence will only be slight because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don’t expect that to generalize across all possible minds.