eternal/ephemera comments on Clarifying Alignment Fundamentals Through the Lens of Ontology

eternal/ephemera 11 Oct 2024 21:25 UTC
1 point
Hey, thanks for the comment. Part of what I like about this framework is that it provides an account of how we do that process of “somehow judging things as true”: we develop personal concepts that correspond with universal concepts via the various forces that change our minds over time.

We can’t access universal ontology ourselves, but reasoning about it allows us to state things precisely—it provides a theoretical standard for whether a process aimed at determining truth succeeds or not.

Do you have an example for domains where ground truth is unavailable, but humans can still make judgements about what processes are good to use? I’d claim that most such cases involve a thought experiment, i.e. a model about how the world works that implies a certain truth-finding method will be successful.
- Charlie Steiner 11 Oct 2024 22:55 UTC
  3 points
  "Do you have an example for domains where ground truth is unavailable, but humans can still make judgements about what processes are good to use?"
  Two very different ones are ethics and predicting the future.
  Have you ever heard of a competition called the Ethics Bowl? It’s a good source of questions with no truth values you could record in a dataset. E.g. “How should we adjudicate the competing needs of the various scientific communities in the case of the Roman lead?” Competing teams have to answer these questions and motivate their answers, but it’s not as if there’s one right answer the judges are looking for, such that getting closer to it means a higher score.
  Predicting the future seems like something you can just train for using past events (which have an accessible ground truth), but what if I want to predict the future of human society 10,000 years from now? Here there is indeed more of “a model about the world that implies a certain truth-finding method will be successful,” which we humans will use to judge an AI trying to predict the far future. There’s some comparison being made, but we’re not comparing the future-predicting AI’s process to the universal ontology, because we can’t access the universal ontology.
  - eternal/ephemera 12 Oct 2024 18:04 UTC
    1 point
    It’s not about comparing a process to the universal ontology; it’s about comparing it to one’s internal model of the universal ontology, which we then hope is good enough. In the ethics example, that could look like a reductio ad absurdum on certain model processes, e.g.: “You have a lot of fancy reasoning here for why you should kill an unspecified man on the street, but it must be wrong because it reaches the wrong conclusion.”
    
    (Ethics is a bit of a weird example because the choices aren’t based around trying to infer missing information, as is paradigmatic of the personal/universal tension, but the dynamic is similar.)
    
    Predicting the future 10,000 years hence has much less potential for this sort of reductio, of course. So I see your point. It seems like in such cases, humans can only provide feedback via comparison to our own learned forecasting strategies. But even this has a similar structure.
    
    We can view the real environment that we learned our forecasting strategies from as the “toy model” that we are hoping will generalize well enough to the 10,000 year prediction problem. Then, the judgement we provide on the AI’s processes is the stand-in for actually running those processes in the toy model. Instead of seeing how well the AI’s methods do by simulating them in the toy model, we compare its methods to our own methods, which evolved due to success in the model.
    
    Seeing things like this allows us to identify two distinct points of failure in the humans-judging-processes setup:
    
    1. The forecasting environment humans learned in may not bear enough similarity to the 10,000 year forecasting problem.
    
    2. Human judgement is just a lossy signal for actual performance in the environment humans learned in; AI methods that would perform well in that environment may still get rated poorly by humans, and vice versa.
    
    So it seems to me that the general model of the post can handle these cases decently well, but the concepts are definitely a bit slippery, and this is the area I feel most uncertain about.