Ben Ihrig comments on Clarifying Alignment Fundamentals Through the Lens of Ontology

Ben Ihrig 12 Oct 2024 18:04 UTC
1 point
It’s not about comparing a process to the universal ontology; it’s about comparing it to one’s internal model of the universal ontology, which we then hope is good enough. In the ethics dataset, that could look like a reductio ad absurdum on certain model processes, e.g.: “You have a lot of fancy reasoning here for why you should kill an unspecified man on the street, but it must be wrong because it reaches the wrong conclusion.”

(Ethics is a bit of a weird example because the choices aren’t based around trying to infer missing information, as is paradigmatic of the personal/universal tension, but the dynamic is similar.)

Predicting the future 10,000 years hence has much less potential for this sort of reductio, of course. So I see your point. It seems like in such cases, humans can only provide feedback by comparing the AI’s methods to our own learned forecasting strategies. But even this has a similar structure.

We can view the real environment that we learned our forecasting strategies from as the “toy model” that we are hoping will generalize well enough to the 10,000-year prediction problem. Then, the judgement we provide on the AI’s processes is the stand-in for actually running those processes in the toy model. Instead of seeing how well the AI’s methods do by simulating them in the toy model, we compare its methods to our own methods, which evolved due to their success in that toy model.

Seeing things like this allows us to identify two distinct points of failure in the humans-judging-processes setup:
1. The forecasting environment humans learned in may not bear enough similarity to the 10,000-year forecasting problem.
2. Human judgement is just a lossy signal for actual performance in the environment humans learned in; AI methods that would perform well in that environment may still get rated poorly by humans, and vice versa.
So it seems to me that the general model of the post can handle these cases decently well. That said, the concepts are definitely a bit slippery, and this is the area I feel most uncertain about here.
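
To make the two failure points a bit more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Method` class, the method names, and all of the scores are hypothetical, and in reality the `target_score` column is exactly the thing we cannot observe. It only shows how the two failures can come apart, not how likely either one is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    """A hypothetical forecasting method with made-up scores in [0, 1]."""
    name: str
    familiar_env_score: float  # actual performance in the environment humans learned in
    target_score: float        # performance on the 10,000-year problem (unobservable in practice)
    human_rating: float        # how humans rate the method's process


# Illustrative numbers only; none of these come from real data.
METHODS = [
    Method("human-like heuristic", 0.70, 0.40, 0.90),
    Method("alien but effective", 0.90, 0.50, 0.30),
    Method("overfit to near term", 0.95, 0.10, 0.60),
    Method("long-horizon specialist", 0.50, 0.80, 0.20),
]


def best(methods, key):
    """Return the name of the method that maximizes the given score."""
    return max(methods, key=key).name


picked_by_humans = best(METHODS, lambda m: m.human_rating)
best_in_familiar = best(METHODS, lambda m: m.familiar_env_score)
best_on_target = best(METHODS, lambda m: m.target_score)

# Failure point 2: human judgement is a lossy signal, so the method humans
# pick is not the one that actually does best in the familiar environment.
lossy_judgement = picked_by_humans != best_in_familiar

# Failure point 1: even a perfect measurement in the familiar environment
# picks a method that is not the best one on the target problem.
poor_transfer = best_in_familiar != best_on_target

print("picked by human rating:      ", picked_by_humans)
print("best in familiar environment:", best_in_familiar)
print("best on target problem:      ", best_on_target)
print("lossy judgement:", lossy_judgement, "| poor transfer:", poor_transfer)
```

With these particular numbers both failures show up at once, but they are independent: setting `human_rating` equal to `familiar_env_score` removes the second failure while leaving the first untouched.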