Charlie Steiner comments on Clarifying Alignment Fundamentals Through the Lens of Ontology

Charlie Steiner 11 Oct 2024 22:55 UTC
3 points
0
“Do you have an example for domains where ground truth is unavailable, but humans can still make judgements about what processes are good to use?”
Two very different ones are ethics and predicting the future.
Have you ever heard of a competition called the Ethics Bowl? They’re a good source of questions with no truth values you could record in a dataset. E.g. “How should we adjudicate the competing needs of the various scientific communities in the case of the Roman lead?” Competing teams have to answer these questions and motivate their answers, but there isn’t one right answer that the judges are looking for, such that getting closer to it earns a higher score.
Predicting the future seems like something you can just train an AI to do using past events (which have an accessible ground truth), but what if I want to predict the future of human society 10,000 years from now? Here there is indeed more of “a model about the world that implies a certain truth-finding method will be successful,” which we humans will use to judge an AI trying to predict the far future. There’s some comparison being made, but we’re not comparing the future-predicting AI’s process to the universal ontology, because we can’t access the universal ontology.
- eternal/ephemera 12 Oct 2024 18:04 UTC
  1 point
  0
  It’s not about comparing a process to the universal ontology, it’s about comparing it to one’s internal model of the universal ontology, which we then hope is good enough. In the ethics dataset, that could look like reductio ad absurdum on certain model processes, e.g.: “You have a lot of fancy reasoning here for why you should kill an unspecified man on the street, but it must be wrong because it reaches the wrong conclusion.”
  
  (Ethics is a bit of a weird example because the choices aren’t based around trying to infer missing information, as is paradigmatic of the personal/universal tension, but the dynamic is similar.)
  
  Predicting the future 10,000 years hence has much less potential for this sort of reductio, of course. So I see your point. It seems like in such cases, humans can only provide feedback via comparison to our own learned forecasting strategies. But even this bears a similar structure.
  
  We can view the real environment that we learned our forecasting strategies from as the “toy model” that we are hoping will generalize well enough to the 10,000-year prediction problem. Then, the judgement we provide on the AI’s processes is the stand-in for actually running those processes in the toy model. Instead of seeing how well the AI’s methods do by simulating them in the toy model, we compare its methods to our own methods, which evolved due to success in that toy model.
  
  Seeing things like this allows us to identify two distinct points of failure in the humans-judging-processes setup:
  1. The forecasting environment humans learned in may not bear enough similarity to the 10,000-year forecasting problem.
  2. Human judgement is just a lossy signal for actual performance in the environment humans learned in; AI methods that would perform well in that environment may still get rated poorly by humans, and vice versa.
  So it seems to me that the general model of the post can understand these cases decently well, but the concepts are definitely a bit slippery and this is the area that I feel most uncertain about here.