I’m not quite convinced. The topics look OK, but the language is too corporate. Maybe it can be fixed with some prompt engineering.
And yet, AlphaZero is corrigible. Its goal is not even to win; its goal is to play in a way that maximises the chance of winning if the game is played to completion. It does not actually care whether the game is completed or not. For example, it does not trick the player into playing the game to the end by pretending they have a chance of winning.
Though, if it were trained on games against real people, and got a better reward for winning than for games abandoned by players, its value function would probably change to aim for the actual “official” win.
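A minimal sketch of that distinction (illustrative Python; the function names and reward values are my assumptions, not AlphaZero’s actual training code):

```python
# Illustrative sketch: how the value-head training target changes when
# abandoned games are scored differently from finished ones.

def selfplay_value_target(final_outcome: float) -> float:
    # Self-play: every game runs to completion, so the value head simply
    # regresses toward the final result (+1 win, 0 draw, -1 loss).
    # "Winning" just means "the position where playing to the end wins".
    return final_outcome

def human_play_value_target(final_outcome: float, abandoned: bool) -> float:
    # Hypothetical reward for games against people: an abandoned game
    # scores worse than a win, so the learned value function starts to
    # favour states where the opponent keeps playing until the "official"
    # end, e.g. positions that still look winnable to them.
    if abandoned:
        return -0.5
    return final_outcome
```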
This scenario requires pretty specific (but likely) circumstances:
No time limit on the task
No other AIs that would prevent it from grabbing power, or to whose goals it would otherwise be an obstacle
The AI assuming that the goal will not be reached if it is shut down (by other AIs, by the same AI after being turned back on, by people, by chance, as an eventual result of the AI’s actions before shutdown, etc.)
An extremely specific value function that ignores everything except one specific goal
That goal being a core goal, not an instrumental one. For example, the final goal could be “be aligned”, and the instrumental goal “do what people ask, because that’s what aligned AIs do”. Then an order to stop would not be a change of the core goal, but new data about the world that updates the best strategy for reaching the core goal.
Can GPT convincingly emulate them talking to each other/you?
Yes, if you only learn the basics of the language, you will learn only the basics of the language users’ values (if any).
But a deep understanding of a language requires knowing the semantics of its words and constructions (including the meaning of the words “human” and “values”, by the way). To understand texts you have to understand the contexts in which they are used, etc.
Also, pretty much every human-written text carries some information about human values, because people only talk about things they see as at least somewhat important or valuable to them.
And a lot of texts are related to values much more directly. For example, every text about human relationships is directly about conflicts or alignment between particular people’s values.
So, if you learn a language by reading text (like LLMs do), you will pick up a lot about people’s values along the way (like LLMs did).
I think an AI should treat its value function as probabilistic. I.e., instead of thinking “this world has a value of exactly N”, it could think something like “I’m 90% sure this world has value N±M, but there is a 10% possibility that it could actually have value -ALOT”, and it would avoid that world, because it gives a very low expected value on average.
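A minimal sketch of that decision rule (illustrative Python; the numbers and names are made up for this example):

```python
# Illustrative sketch: comparing worlds by expected value when the value
# estimate itself is uncertain.

def expected_value(outcomes):
    """outcomes: list of (probability, value) pairs summing to 1."""
    return sum(p * v for p, v in outcomes)

N = 10.0            # best-guess value of the risky world
CATASTROPHE = -1e9  # the "-ALOT" case

risky_world = [(0.9, N), (0.1, CATASTROPHE)]  # 90% roughly N, 10% catastrophic
safe_world = [(1.0, N - 1.0)]                 # slightly worse, but certain

print(expected_value(risky_world))  # about -1e8: avoided despite the likely upside
print(expected_value(safe_world))   # 9.0: chosen instead
```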
To me, aligning AI with humanity seems much EASIER than aligning it with a specific person, because common human values are much better “documented” and much more stable than the wishes of one person.
Also, a single person in control of a powerful AI is an obvious weak point, who could be manipulated by a third party or by the AI itself, which would gain control of the AI through that person.
Is it possible to learn a language without learning the values of those who speak it?
I agree.
I usually use “aligned” to mean “aligned with humanity”, as there is not much difference between outcomes for AGIs that are not aligned with humanity, even if they are aligned with something else. If they are agentic, they will have killing everyone as an instrumental goal, because humanity will likely be an obstacle to whatever future plans they have. If an AGI is not agentic but is an oracle, it will provide some world-ending information to some unaligned agent, with mostly the same result.
AI is developed by misaligned people, or by people who consider it the only way to stop misaligned people from developing AI.
Even a moderately intelligent humanity-aligned AI would identify actions with an obvious risk of catastrophic consequences and would refuse to do them, except to prevent something even more catastrophic.
Human: Does gradient descent on the AGI, trains the refusal response out of it.
That would make the AGI misaligned.
Nope, that’s the wrong solution. The second player wins by mirroring moves. The answer to removing one pebble is removing the pebble diagonal to it, leaving two disconnected pebbles.
Human: Aligned AGI, make me a more powerful AGI!
AGI: What? Are you nuts? Do you realise how dangerous those things are? No!
Hmm… by analogy, would a high-status AI agent sabotage the creation and use of more capable AI agents?
“Making a decision oneself” will also become a very vague concept when super-convincing AIs are running around.
The problem I see is that our values are defined in a stable way only inside the distribution, i.e. for situations similar to those we have already experienced.
Outside of it there may be many radically different extrapolations which are consistent with themselves and with our values inside the distribution. And that is a problem not with AI, but with the values themselves.
For example, there is no correct answer to what a human is, i.e. how much we can “improve” a human before they stop being human. We can choose different answers, and they will all be consistent with our pre-singularity concept of a human and will not contradict already established values.
Maybe it is an attempt at vaccination? I.e. exposing the “organism” to a weakened form of the deadly “virus”, so the organism can produce “antibodies”.
I doubt that training LLMs can lead to AGI. Fundamental research on alternative architectures seems more dangerous.