Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don’t reason using Bayes nets—but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn’t be logically inconsistent or physically impossible, and we wouldn’t want alignment to fail in that world.
Additionally, I think the statement made in the report about AIs also applies to humans:
Moreover, we think that a realistic messy predictor is pretty likely to still use strategies similar to inference in Bayes nets — amongst other cognitive strategies. We think any solution to ELK will probably need to cope with the difficulties posed by the Bayes net test case — amongst other difficulties. We’ve also considered a number of other simple test cases, and found that counterexamples similar to the ones we’ll discuss in this report apply to all of them.
We’re using some sort of cognitive algorithms to reason about the world, and it’s plausible that strategies which resemble inference on graphical models play a role in some of our understanding. There’s no obvious way that a messier model of human reasoning which incorporates all the other parts should make ELK easier; there’s nothing that we could obviously exploit to create a strategy.
Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don’t reason using Bayes nets—but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn’t be logically inconsistent or physically impossible, and we wouldn’t want alignment to fail in that world.
If you solve something given worst-case assumptions, you’ve solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that’s not the case we end up facing.
There’s no obvious way that a messier model of human reasoning makes ELK easier.
Doesn’t this imply that a Bayes-net model isn’t the worst case?
EDIT: I guess it depends on whether “the human isn’t well-modelled using a Bayes net” is a possible response the breaker could give. But that doesn’t seem like it fits the format of finding a test case where the builder’s strategy fails (indeed, “bayes nets” seems built into the definition of the game).
Sorry, there were two things you could have meant when you said the assumption that the human uses a Bayes net seemed crucial. I thought you were asking why the builder couldn’t just say “That’s unrealistic” when the breaker suggested the human runs a Bayes net. The answer to that is what I said above—because the assumption is that we’re working in the worst case, the builder can’t invoke unrealism to dismiss the counterexample.
If the question is instead “Why is the builder allowed to just focus on the Bayes net case?”, the answer to that is the iterative nature of the game. The Bayes net case (and in practice a few other simple cases) was the case the breaker chose to give, so if the builder finds a strategy that works for that case they win the round. Then the breaker can come back and add complications which break the builder’s strategy again, and the hope is that after many rounds we’ll get to a place where it’s really hard to think of a counterexample that breaks the builder’s strategy despite trying hard.
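The round structure described above can be sketched as a simple loop. This is only a toy illustration under loud assumptions: a "case" is reduced to a label, a "strategy" is reduced to the set of cases it handles, and the breaker draws from a fixed list instead of inventing new cases. The names `find_counterexample`, `play_game`, and `patch` are hypothetical and do not come from the report.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical simplification: a case is a label for how the human/predictor
# reason, and a strategy is summarised by the set of cases it handles.
Case = str
Strategy = frozenset


def find_counterexample(strategy: Strategy, known_cases: Sequence[Case]) -> Optional[Case]:
    """Breaker's move: return the first case the strategy fails on, if any."""
    for case in known_cases:
        if case not in strategy:
            return case
    return None


def play_game(
    initial: Strategy,
    known_cases: Sequence[Case],
    patch: Callable[[Strategy, Case], Strategy],
    max_rounds: int = 10,
) -> Tuple[Strategy, List[Case]]:
    """Alternate breaker and builder moves until no counterexample is found."""
    strategy = initial
    history: List[Case] = []
    for _ in range(max_rounds):
        case = find_counterexample(strategy, known_cases)
        if case is None:
            break  # the breaker cannot produce a counterexample this round
        history.append(case)
        strategy = patch(strategy, case)  # builder revises the strategy
    return strategy, history


# Usage: the breaker starts with the Bayes net case, then adds complications.
cases = ["bayes_net", "deduction_from_axioms", "messier_model"]
final, history = play_game(frozenset(), cases, lambda s, c: s | {c})
```

In this toy version the builder wins each round by handling exactly the case the breaker raised, which is why focusing on the Bayes net case first is legitimate: later rounds are where the complications get added.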
Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly—might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!
I’m still a little wary of how much the report talks about concepts in a human’s Bayes net without really explaining why this is anywhere near a sensible model of humans, but I’ll have another read through and see if I can pin down anything that I actively disagree with (since I do agree that it’s useful to start off with very simple assumptions).
Ah got it. To be clear, Paul and Mark do in practice consider a bank of multiple counterexamples for each strategy with different ways the human and predictor could think, though they’re all pretty simple in the same way the Bayes net example is (e.g. deduction from a set of axioms); my understanding is that essentially the same kind of counterexamples apply for essentially the same underlying reasons for those other simple examples. The doc sticks with one running example for clarity / length reasons.
“the human isn’t well-modelled using a Bayes net” is a possible response the breaker could give
The breaker is definitely allowed to introduce counterexamples where the human isn’t well-modeled using a Bayes net. Our training strategies (introduced here) don’t say anything at all about Bayes nets, so it’s not clear if this immediately helps the breaker; they are the one who introduced the assumption that the human used a Bayes net (in order to describe a simplified situation where the naive training strategy failed here). We’re definitely not intentionally viewing Bayes nets as part of the definition of the game.
If you solve something given worst-case assumptions, you’ve solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that’s not the case we end up facing.
It seems very plausible that after solving the problem for humans-who-use-Bayes-nets we will find a new counterexample that only works for humans-who-don’t-use-Bayes-nets, in which case we’ll move on to those counterexamples.
It seems even more likely that the builder will propose an algorithm that exploits cognition that humans can do which isn’t well captured by the Bayes net model, which is also fair game. (And indeed several of our approaches do this, e.g. when imagining humans learning new things about the world by performing experiments here or reasoning about plausibility of model joint distributions here.)
That said, it looks to us like if any of these algorithms worked for Bayes nets, they would at least work for a very broad range of human models; the Bayes net assumption doesn’t seem to be changing the picture much qualitatively.
Echoing Mark in his comment, we’re definitely interested in ways that this assumption seems importantly unrealistic. If you just think it’s generally a mediocre model and results are unlikely to generalize, then you can also wait for us to discover that after finding an algorithm that works for Bayes nets and then finding that it breaks down as we extend to more realistic examples.
Conditioned on ontology identification being impossible, I think it’s most likely to also be impossible for humans who reason about the world using a Bayes net.
Doesn’t this imply that a Bayes-net model isn’t the worst case?
I think Ajeya is just pointing out why it seems useful to search for algorithms that handle Bayes nets. If thinking about Bayes nets is very straightforward and it lets us rule out all the algorithms we can see, then we’re happy to do that as long as it works.