I lurk in the discord for The Treacherous Turn, a ttrpg made by some AI Safety Camp people I mentored. It’s lovely. I encourage everyone to check it out.
Anyhow recently someone asked for ideas for Terminal Goals an AGI might have in a realistic setting; my answer is below and I’m interested to hear whether people here agree or disagree with it:
Insofar as you want it to be grounded, which you might not, here are some hypotheses people in AI alignment would toss around as to what would actually happen:

(1) The AGI actually has exactly the goals and deontological constraints the trainers intended it to have. Insofar as they were training it to be honest, for example, it actually is honest in basically the human way; insofar as they were training it to make money for them, it deeply desires to make money for them; etc. Depending on how unethical and/or philosophically careless the trainers were, such an AI could still be very bad news for the world.
(2) The AGI has some goals that caused it to perform well in training so far, but are not exactly what the trainers intended. For example, perhaps the AGI's concepts of honesty and profit are different from the trainers' concepts, and now the AGI is smart enough to realize this, but it's too late: the AGI wants to be honest and profitable according to its own definitions, not according to the trainers' definitions. For the RPG you could probably model this by doing a bit of mischievous philosophy and specifying how the AGI's goal-concepts diverge from their human counterparts, e.g. 'the AGI only considers a statement dishonest if it's literally false as the AGI would normally interpret it, not if it's true-but-misleading or false-according-to-how-humans-might-interpret-it.' Another sub-type of this is that the AGI has additional goals besides the ones the trainers intended; e.g. they didn't intend for it to have a curiosity drive or a humans-trust-and-respect-me drive or a survival drive, but those drives were helpful for it in training, so it has them now, even though they often conflict with the drives the trainers intended.
(3) The AGI doesn't really care at all about the stuff the trainers intended it to care about. Instead it just wants to get rewarded. It doesn't care about being honest, for example; it just wants to appear honest. Or worse, it just wants the training process to continue assigning it high scores, even if the reason this is happening is that the trainers are dead and their keyboards have been hacked.

(4) The AGI doesn't really care at all about the stuff the trainers intended it to care about. Instead it has some extremely simple goals: the simplest possible goals that motivate it to think strategically and acquire power. Maybe something like survival. idk.
My rough sense is that AI alignment experts tend to think that (2) and (3) are the most likely outcomes of today's training methods, with (1) in third place and (4) in fourth place. But there's no consensus.