1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren’t the same thing, though, right? In fact they will almost certainly be very different things. The base goal will be something like “maximize reward in the next hour or so.” Or maaaaaaybe “Do what humans watching you and rating your actions would rate highly,” though that’s a bit more complicated and would require further justification, I think. Neither of those things is anywhere near human ethics.
I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”
With the level of LLM progress we already have, I think it’s time to move away from talking about this in terms of traditional RL where you can’t give the model instructions and just hope that it can learn based only on the feedback signal. Realistic training scenarios should include directional prompts. Do you agree?
I’m using “base goal” and “training goal” both to describe this goal. Do you have a recommendation to improve my terminology?
Doesn’t this prove too much though? Doesn’t it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?
Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals.
I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, creating new points that differ in random ways and then dying in a weighted random way based on where they are on the hill? That’s a vastly more chaotic process. It also doesn’t require the improvements to be hyper-local, because of the significant randomness element. Evolution is about survival rather than direct optimization for a set of values or intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation.
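The contrast above can be made concrete with a toy sketch: a single point doing gradient ascent on a hill versus a population doing random mutation plus fitness-weighted survival. The landscape, hyperparameters, and function names here are all invented purely for illustration, not drawn from any real training setup:

```python
import math
import random

def fitness(x):
    """Two-peaked 'hill': a small local peak near x = -1, a taller global peak near x = 2."""
    return 3 * math.exp(-(x - 2) ** 2) + math.exp(-(x + 1) ** 2)

def gradient_ascent(x, steps=200, lr=0.05, eps=1e-5):
    """Hyper-local hill climbing: one point follows the slope directly under its feet."""
    for _ in range(steps):
        grad = (fitness(x + eps) - fitness(x - eps)) / (2 * eps)
        x += lr * grad
    return x

def evolve(pop_size=50, generations=200, mutation=0.5):
    """A population of points: offspring differ in random ways, and survival
    is a weighted lottery based on fitness, not a deterministic climb."""
    pop = [random.uniform(-4, 4) for _ in range(pop_size)]
    for _ in range(generations):
        children = [x + random.gauss(0, mutation) for x in pop]
        candidates = pop + children
        weights = [fitness(x) for x in candidates]
        pop = random.choices(candidates, weights=weights, k=pop_size)
    return pop

random.seed(0)
gd_end = gradient_ascent(-1.5)             # climbs only the nearest peak
best_evolved = max(evolve(), key=fitness)  # randomness lets the population cross basins
print(gd_end, best_evolved)
```

Started near the small peak, the gradient ascender gets stuck there, while the noisy population-and-selection process tends to end up concentrated around the taller peak, illustrating why the two optimizers aren't straightforwardly analogous.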
As mentioned in my response to your other comment, humans seem to decide our values in a way that’s complicated, hard to predict, and not obviously in line with a process similar to gradient descent. That flexibility makes it easier to conform to social groups and fit in, which seems clearly beneficial for the survival of our genes. Why would gradient descent incentivize the possibility of radical value shifts like suddenly becoming longtermist?
Your definition of deception-relevant situational awareness doesn’t seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?
Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?
I just realized I never responded to this. Sorry. I hope to find time to respond someday… feel free to badger me about it. Curious how you are doing these days and what you are up to.