Isn’t the worst case one in which the AI optimizes exactly against human values?
Maybe Carl meant to link this one
it could be that the lack of alignment understanding is an inevitable consequence of our capabilities understanding not being there yet.
Could you say more about this hypothesis? To me, it feels likely that you can get crazy capabilities from a black box that you don’t understand and so whose behavior/properties you can’t verify to be acceptable. It’s not like once we build a deceptive model we will know what deceptive computation looks like and how to disincentivize it (which is one way your nuclear analogy could translate).
It’s possible, also, that this is about takeoff speeds, and that you think it’s plausible that e.g. we can disincentivize deception by punishing the negative consequences it entails (whereas if there’s a FOOM, we can’t, since we’d be dead).
One thing is that it seems like they are trying to build some of the world’s largest language models (“state of the art models”)
Hah! Thanks
It seems to me that it would be better to view the question as “is this frame the best one for person X?” rather than “is this frame the best one?”
Though, I haven’t fully read either of your posts, so excuse any mistakes/confusion.
Do you have an example of a set of 1-detail stories you now might tell (composed with “AND”)?
Ah — sorry if I missed that in the post, only skimmed
Random tip: If you want to restrict apps etc. on your iPhone without knowing the Screen Time PIN, I recommend the following simple system, which lets you not know the password but still unlock restrictions easily when needed:
Ask a friend to write a 4-digit PIN in a small notebook (dedicated only to this PIN)
Ask them to punch in the PIN on your phone when setting the Screen Time password
Keep the notebook in your backpack and never look inside it, ever
If you ever need your phone unlocked, you can walk up to someone, even a stranger, show them the notebook, and ask them to punch in the PIN on your phone
The system works because having a dedicated physical object that you commit to never looking inside is, for some reason, surprisingly doable.
Thanks for this list!
Though the list still doesn’t strike me as very novel—it feels like most of these conditions are ones we’ve been shooting for anyway.
E.g. conditions 1, 2, and 5 are about selecting for behavior we approve of, and condition 5 is just inspection with interpretability tools.
If you feel you have traction on conditions 3 and 4, though, that does seem novel (side note: condition 4 seems to be a subset of condition 3). I’m skeptical, however, since value extrapolation seems about as hard a problem as understanding machine generalization in general, and “the way a thing behaves in a large class of cases” seems like too complicated a concept to have confident beliefs about or understand. I don’t have a concrete argument for this, though.
Anyway, thanks for responding, and if you have any thoughts about the tractability of conditions 3 and 4, I’m pretty curious.
I (with some help) compiled some of the best rationality essays here.
Ping about my other comment—FYI, because I am currently concerned that you don’t have criteria for the innards in mind, I’m less excited about your agenda than about other alignment theory agendas (though this lack of excitement is weakly held, e.g. since I haven’t tried to digest your work much yet).
I’d be interested in you sharing any reasons why you think this might fall apart, e.g. any insights you’ve gained from deeper inspection.
Game that might improve research productivity
Oh I see—could you say more about what characteristics you want the innards to have?
How do you know when you have solved the value extrapolation problem?
One hypothesis I have for what you might say is something like “a training scheme solves the value extrapolation problem when the sequence of inputs the resulting AI sees in deployment leads it to produce outputs with positive outcomes by human lights,” though from what I can tell, that’s basically the same as having a training scheme that leads to an “impact aligned” AI*.
If it isn’t this, how is your answer different?
*[ETA: the definition of impact alignment that Evan gives in the linked post technically only refers to an AI “which doesn’t take actions that we would judge to be bad/problematic/dangerous/catastrophic,” but in my comment above, I meant to refer to what I think is the more relevant property for an AI to have, which I’ll call (impact aligned)_Jack: “an agent is (impact aligned)_Jack to the degree that, by human lights, it doesn’t take bad actions and does take good actions.” I think that this is more relevant because Evan’s definition doesn’t distinguish between a rock and an intuitively aligned AI.]
[Question] Have you noticed costs of being anticipatory?
I’m not sure you have addressed Richard’s point—if you keep your current definition of outer alignment, then memorizing the answers to the finite set of data is always a way to score perfect loss, but intuitively it doesn’t seem like that would be intent aligned. And if memorization were never intent aligned, then your definition of outer alignment would be impossible to satisfy.
[ETA: I’m not that sure of the below argument]
Thanks for the example, but it still seems to me that this sort of thing won’t work for advanced AI. If you are familiar with the ELK report, you should be able to see why. [Spoiler below]
Even if you manage to learn the properties of what looks like deception to humans, and instill those properties into a loss function, it seems like you are still more likely to get a system that tells you what humans think the truth is, avoiding anything humans would be able to notice as deception, rather than one that tells you what the truth actually seems to be (given what it knows). The reason is that, as AI develops, programs capable of the former have roughly constant complexity, while programs capable of the latter have complexity that grows with the complexity of the AI’s models of the world, so you should expect the former to be favored by SGD. See this part of the ELK document for a more detailed description of this failure mode.
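To make the complexity asymmetry concrete, here is a toy sketch of the argument (my own illustration, not from the ELK report; the sizes and the linear scaling rule are hypothetical):

```python
# Toy illustration of the inductive-bias argument above (hypothetical numbers).
# A "human simulator" reporter only needs a roughly fixed-size model of the
# human judge, while a "direct translator" must map the AI's world model onto
# human concepts, so its description length grows with the world model.

HUMAN_MODEL_SIZE = 1_000  # assumed fixed cost of modeling the human judge


def human_simulator_complexity(world_model_size: int) -> int:
    # Roughly constant in the size of the AI's world model.
    return HUMAN_MODEL_SIZE


def direct_translator_complexity(world_model_size: int) -> int:
    # Grows with the AI's world model (linear scaling assumed for illustration).
    return world_model_size


for world_model_size in (1_000, 100_000, 10_000_000):
    print(
        f"world model: {world_model_size:>10}  "
        f"human simulator: {human_simulator_complexity(world_model_size):>10}  "
        f"direct translator: {direct_translator_complexity(world_model_size):>10}"
    )
```

If SGD’s simplicity bias favors the lower-complexity reporter, then once the world model is large enough the human simulator wins, which is the failure mode described above.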