The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X, Y, Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with… etc. And Infra-Bayesianism might be the theory you use to explain what some of the internal data structures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and from mainstream AI and ML theory) is potentially relevant, but it's not clear which bits are most likely to be directly useful.
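To make the "X passes beliefs to Y in format b" picture a bit more concrete, here is a minimal toy sketch in Python. Everything in it is made up for illustration (the component names, the formats, the numbers); the point is only that an algorithmic-level description lets you talk about well-defined data structures flowing between named components, rather than about an opaque blob of weights.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    # "format b": the structure X uses to hand beliefs to Y (illustrative only)
    proposition: str
    credence: float


@dataclass
class WorldInfo:
    # "format w": the information Z consumes (illustrative only)
    observation: str


class X:
    def update(self, info: WorldInfo) -> Belief:
        # X turns incoming information into a belief in format b
        return Belief(proposition=info.observation, credence=0.9)


class Y:
    def plan(self, belief: Belief) -> str:
        # Y consumes beliefs in format b and produces a (toy) plan
        return f"act as if '{belief.proposition}' (credence {belief.credence})"


def Z(info: WorldInfo) -> str:
    # Z viewed as a function: it takes information in format w and links it
    # with the rest of the pipeline
    return Y().plan(X().update(info))


print(Z(WorldInfo(observation="the door is open")))
```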
This seems like the ultimate goal of interp research (and it's a good goal). Relatedly, I think the current story for heuristic arguments is to use them to "explain" a trained neural network by breaking it down into something more like an X, Y, Z components explanation.
At this point, we can analyse the overall AI algorithm: understand what happens when it radically updates its beliefs, or how its goals are stored and whether they ever change. And we can try to work out whether that particular structure would change itself in ways that are bad for us if it could self-modify. At this level the work looks much more like theoretical analysis of algorithms.
(The above is the "understood" end of the axis. The "not-understood" end looks like building an AI by pure evolution, with no understanding of how it works. There are many levels of partial understanding in between.)
This kind of understanding is a prerequisite for the scheme in my post, which could be implemented by modifying a well-understood AI.
Not sure what you’re getting at here.
Okay, that makes sense to me, so thank you for explaining!
I guess what I was pointing at with the language thing is the question of what the actual underlying objects you called X, Y, Z are, and how they relate to the linguistic view of language as contextually dependent symbols defined by many scenarios rather than by some sort of logic.
Like, if we use IB, it might be natural to look at that as a probability distribution over probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.
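On the "distribution over distributions" framing: below is a tiny sketch of the difference between mixing hypotheses the ordinary Bayesian way and the set-of-distributions, worst-case flavour associated with infra-Bayesianism. This is only a loose gesture at the idea (the hypothesis names and numbers are invented, and real infra-Bayesianism is considerably more structured than "take the min"), not a faithful formalisation.

```python
import numpy as np

outcomes = ["A", "B", "C"]

# Each hypothesis is itself a probability distribution over outcomes.
hypotheses = {
    "env1": np.array([0.7, 0.2, 0.1]),
    "env2": np.array([0.1, 0.1, 0.8]),
}

# An ordinary Bayesian agent puts an outer distribution over the hypotheses
# (a "distribution over distributions") and then marginalises it out:
credences = {"env1": 0.6, "env2": 0.4}
mixture = sum(credences[h] * hypotheses[h] for h in hypotheses)

# An infra-Bayes-flavoured agent instead keeps the *set* of distributions
# and evaluates worst cases over it, e.g. the worst-case probability of A:
worst_case_p_a = min(p[0] for p in hypotheses.values())

print("Bayesian mixture:", mixture)
print("worst-case P(A):", worst_case_p_a)
```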