ojorgensen, AI Safety Researcher
[Question] Which Issues in Conceptual Alignment have been Formalised or Observed (or not)?
I went through the paper for a reading group the other day, and the video really helped me to understand what is going on in it. The parts I found most useful were the indications of which parts of the paper / maths were most important to understand, and which were not (e.g. tensor products).
I had made some effort to read the paper before with little success, but now feel like I understand its overall results pretty well. I’m very positive about this video, and about similar things being made in the future!
Personal context: I also found the intro to IB video series similarly useful. I’m an AI master’s student with some pre-existing knowledge of AI alignment, and I have a maths background.
Firstly, thanks for reading the post! I think you’re referring mainly to realisability here, which I’m not that clued up on tbh, but I’ll give you my two cents because why not.
I’m not sure to what extent we should focus on unrealisability when aligning systems. I have a similar intuition to you: the important question is probably “how can we get good abstractions of the world, given that we cannot perfectly model it?”. However, better arguments than the ones I have laid out for why unrealisability is a core problem in alignment probably do exist; I just haven’t read that much into it yet. I’ll link again to this video series on IB (which I’m yet to finish), as I think there are probably some good arguments there.
I’m not sure if this is what you’re looking for, but Hofstadter gives a great analogy using record players, which I find useful for thinking about how changing the situation changes our results (paraphrased here).
A (hi-fi) record player that tries to play every possible sound can’t actually play its own self-breaking sound, so it is incomplete by virtue of its strength.
A (low-fi) record player that refuses to play all sounds (in order to avoid destruction from its self-breaking sound) is incomplete by virtue of its weakness.
We may think of the hi-fi record player as a formal system like Peano Arithmetic: the incompleteness arises precisely because it is strong enough to be able to capture number theory. This is what allows us to use Gödel Numbering, which then allows PA to do meta-reasoning about itself.
The only way to fix it is to make a system that is weaker than PA, so that we cannot do Gödel Numbering. But then we have a system that isn’t even trying to express what we mean by number theory. This is the low-fi record player: as soon as we fix the one issue of self-reference, we fail to capture the thing we care about (number theory).
I think an example of a weaker formal system is Propositional Calculus. Here we do actually have completeness, but that is only because Propositional Calculus is too weak to be able to capture number theory.
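The Gödel numbering step can be made concrete with a small sketch. The symbol table and encoding below are a toy illustration I’ve made up for this comment (not Gödel’s actual scheme), but they show the key point: each formula of arithmetic maps to a single natural number, so a system strong enough to reason about numbers can, via the encoding, reason about its own formulas.

```python
# Toy sketch of Gödel numbering (illustrative, not Gödel's actual scheme):
# assign each symbol a positive code, then encode the formula s_1 s_2 ... s_n
# as p_1^code(s_1) * p_2^code(s_2) * ... over successive primes p_i.
# Unique factorisation makes the encoding reversible.

SYMBOLS = {"0": 1, "S": 2, "=": 3, "+": 4, "(": 5, ")": 6}
INVERSE = {v: k for k, v in SYMBOLS.items()}

def first_primes(n):
    """Return the first n primes by trial division."""
    found = []
    candidate = 2
    while len(found) < n:
        if all(candidate % p for p in found):
            found.append(candidate)
        candidate += 1
    return found

def godel_number(formula):
    """Encode a symbol string as a single natural number."""
    codes = [SYMBOLS[s] for s in formula]
    g = 1
    for p, c in zip(first_primes(len(codes)), codes):
        g *= p ** c
    return g

def decode(g):
    """Recover the symbol string by factoring out successive primes."""
    out = []
    found = []
    candidate = 2
    while g > 1:
        while not all(candidate % p for p in found):
            candidate += 1  # skip composites
        found.append(candidate)
        exponent = 0
        while g % candidate == 0:
            g //= candidate
            exponent += 1
        out.append(INVERSE[exponent])  # exponent is the symbol's code
        candidate += 1
    return "".join(out)

# "0=0" has codes [1, 3, 1], so its number is 2^1 * 3^3 * 5^1 = 270
print(godel_number("0=0"))            # 270
print(decode(godel_number("S0=S0")))  # S0=S0
```

Since both directions are ordinary arithmetic (multiplication and factorisation), statements *about formulas* become statements *about numbers*, which is exactly what PA can express.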
Strange Loops—Self-Reference from Number Theory to AI
I found this post really interesting, thanks for sharing it!
It doesn’t seem obvious to me that the methods of understanding a model given a high path-dependence world become significantly less useful if we are in a low path-dependence world. I think I see why low path-dependence would give us the opportunity to use different methods of analysis, but I don’t see why the high path-dependence ones would no longer be useful.
For example, here is the reasoning behind “how likely is deceptive alignment” in a high path-dependence world (quoted from the slide).
We start with a proxy-aligned model
In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
The model learns about the training process from its input data
SGD makes the model’s proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purposes of staying around
The model’s proxies “crystallize”, as they are no longer relevant to performance, and we reach an equilibrium
Let’s suppose that this reasoning, and the associated justification of why this is likely to arise due to SGD seeking the largest possible marginal performance improvements, are sound for a high path-dependence world. Why does it no longer hold in a low path-dependence world?
I really like this post! I can’t see whether you’ve already cross-posted this to the EA forum, but it seems valuable to have it there too (as it is focussed on the EA community).
Understanding Infra-Bayesianism :))