I find your text confusing. Let’s go step by step.
AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
If you change the reward function (e.g. flip the sign of the reward), the resulting trained model's preferences would be wildly different.
By analogy:
The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
I agree with this statement, because the sign change directly inverts the reward, so everything the previous reward function counted as good is now bad to aim for. But my view is that this example is probably unrepresentative, and that brains/brain-like AGI are much more robust than you think to changes in their value/reward functions (though not infinitely robust), precisely because of the very simple reward function you pointed out.
So I basically disagree that this example represents a major problem for NN/brain-like-AGI robustness.
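To make concrete just how small the reward function under discussion is, here's a minimal Python sketch (my own illustration, not AlphaZero's actual code; the Outcome enum and function names are hypothetical), together with the sign-flipped variant you describe:

```python
from enum import Enum

class Outcome(Enum):
    """Terminal outcomes of a chess game, from the agent's point of view."""
    WIN = "win"    # the agent delivered checkmate
    LOSS = "loss"  # the agent was checkmated
    DRAW = "draw"

def reward(outcome: Outcome) -> float:
    """AlphaZero-chess-style terminal reward: +1 for a win, -1 for a loss, 0 for a draw.
    Every non-terminal move gets 0; this is the entire reward specification."""
    return {Outcome.WIN: 1.0, Outcome.LOSS: -1.0, Outcome.DRAW: 0.0}[outcome]

def sign_flipped_reward(outcome: Outcome) -> float:
    """The perturbation you describe: the same function with its sign inverted,
    so training now pushes the agent toward the opposite goal (losing games)."""
    return -reward(outcome)
```

The point of the sketch is just that the whole reward specification fits in a few lines, while the learned value function it shapes is vastly more complicated.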
To respond to this:
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
This doesn't actually matter for my purposes: I only need the existence of the kind of simple reward function you claimed in order to conclude that deceptive alignment is unlikely to happen, and I am leaving it to the people who are actually working on aligning AI, like Nora Belrose, to implement that ideal.
Essentially, I'm focusing on the implications of the existence of simple algorithms for values, and pointing out that various alignment challenges either go away or become far easier if we grant that there is a simple reward function for values, which is very much a contested position on LW.
So I think we basically agree that there is a simple reward function for values, but I think this implies some other big changes in alignment that drastically reduce the risk of AI catastrophe, mostly by ruling out deceptive alignment as an outcome that will happen. There are various other side benefits I haven't enumerated, because they would make this comment too long.