Summing up all that, this post made me realize Alignment Research should be its own discipline.
Yeah I agree! It seems that AI alignment is not really something that any existing discipline is well set up to study. The existing disciplines that study human values are generally very far from engineering, and the existing disciplines with an engineering mindset tend to be very far from directly studying human values. If we merely created a new “subject area” that studies human values + engineering under the standard paradigm of academic STEM, or social science, or philosophy, I don’t think it would go well. It seems like a new discipline/paradigm is innovation at a deeper level of reality. (I understand adamShimi’s work to be figuring out what this new discipline/paradigm really is.)
The habit formation example seems weirdly ‘acausal decision theory’ flavored to me (though this might be a ‘tetris effect’-like instance). It seems like habits of this kind are a mechanism for making trades across time/contexts with yourself. This makes me more optimistic about acausal decision theories being a natural way of expressing some key concepts in alignment.
Interesting! I hadn’t thought of habit formation as relating to acausal decision theory. I see the analogy to making trades across time/contexts with yourself, but I have the sense that you’re referring to something quite different from the ordinary trades across time that we would make, e.g., with other people. Is the thing you’re seeing something like: when we’re executing a habit we kind of have no space/time left over to be trading with other parts of ourselves, so we just “do the thing such that, if the other parts of ourselves knew we would do it and responded in kind, would lead to overall harmony”?
Proxies are mentioned, but it feels like we could have a rich science or taxonomy of proxies. There’s a lot to study in the historical use of proxies, or in analyzing proxies in current examples of intelligence alignment.
We could definitely study proxies in detail. We could look at all the market/government/company failures that we can get data on and try to pinpoint what exactly folks were trying to align the intelligent system with, what operationalization was used, and how exactly that failed. I think this could be useful beyond merely cataloging failures as a cautionary tale — I think it could really give us insight into the nature of intelligent systems. We may also find some modest successes!
Hope you are well Alex!