Most of my posts and comments are about AI and alignment. Posts I’m most proud of, which also provide a good introduction to my worldview:
Without a trajectory change, the development of AGI is likely to go badly
Steering systems, and a follow-up on corrigibility.
I also created Forum Karma, and wrote a longer self-introduction here.
PMs and private feedback are always welcome.
NOTE: I am not Max Harms, author of Crystal Society. I’d prefer for now that my LW postings not be attached to my full name when people Google me for other reasons, but you can PM me here or on Discord (m4xed) if you want to know who I am.
I want to push back on this a bit. I suspect that “demonstrated progress” is doing a lot of work here, and smuggling in an assumption that current trends with LLMs will continue and can be extrapolated straightforwardly.
It’s true that LLMs have some nice properties for encapsulating fuzzy and complex concepts like human values, but I wouldn’t actually want to use any current LLMs as a referent or in a rating system like the one you propose, for obvious reasons.
Maybe future LLMs will retain all the nice properties of current LLMs while also solving the various issues with jailbreaking, hallucination, robustness, reasoning about edge cases, etc., but declaring victory already (even on a particular and narrow point about value identification) seems premature to me.
Separately, I think some of the nice properties you list don’t actually buy you that much in practice, even if LLM progress does continue straightforwardly.
A lot of the properties you list follow from the fact that LLMs are pure functions of their input (at least with a temperature of 0).
Functional purity is a very nice property, and traditional software that encapsulates complex logic in pure functions is often easier to reason about, debug, and formally verify than software that relies on lots of global mutable state and/or interacts with the outside world through a complex I/O interface. But when the function in question is hundreds of gigabytes of opaque floats, I think it’s a bit of a stretch to call it transparent and legible just because it can be evaluated outside of the IO monad.
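To make the distinction concrete, here’s a minimal sketch (the names are made up for illustration): both functions below are pure, but only one of them is legible in any useful sense.

```python
# Both functions below are pure: no mutable state, no I/O, and the same
# output for the same input. Only the first is legible.

def taxed_price(price: float, rate: float = 0.08) -> float:
    # You can read off exactly what this computes and verify it by inspection.
    return price * (1 + rate)

def llm_score(prompt: str) -> float:
    # Also pure, assuming a fixed model evaluated deterministically at
    # temperature 0. But the "function body" is hundreds of gigabytes of
    # floats, so purity by itself says little about what the mapping does.
    # (Stubbed; no particular model or API is assumed.)
    raise NotImplementedError("deterministic forward pass of some fixed model")
```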
Aside from purity, I don’t think your point about an LLM being a “particular function” that can be “hooked up to the AI directly” is doing much work: `input()` (i.e. asking actual humans) seems just as direct and particular as `llm()`. If you want your AI system to actually do something in the messy real world, you have to break down the nice theoretical boundary and guarantees you get from functional purity somewhere.

More concretely, given your proposed rating system, simply replace any LLM calls with a call that asks actual humans to rate a world state given some description (sketched below), and it seems like you get something that is at least as legible and transparent (in an informal sense) as the LLM version. The main advantage of using an LLM here is that you could potentially get lots of such ratings cheaply and quickly. Replayability, determinism, and the relative ease of interpretability compared to doing neuroscience on the human raters are also nice, but none of these properties are very reassuring or helpful if the ratings themselves aren’t all that good. (Also, if you’re doing something with such low sample efficiency that you can’t just use actual humans, you’re probably on the wrong track anyway.)
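A toy sketch of the swap I have in mind, with made-up names rather than anything from your proposal: the surrounding rating system can’t tell whether it was handed an LLM rater or a human one.

```python
# llm_rate() stands in for whatever temperature-0 LLM call the proposal
# assumes; human_rate() really is just input().

def llm_rate(world_description: str) -> float:
    # Hypothetical: query some fixed model at temperature 0 and parse a
    # numeric rating. Stubbed, since no particular API is assumed here.
    raise NotImplementedError("plug in a deterministic LLM call")

def human_rate(world_description: str) -> float:
    # Ask an actual human to rate the world state.
    return float(input(f"Rate this world state from 0 to 10: {world_description}\n> "))

def average_rating(world_descriptions: list[str], rate=human_rate) -> float:
    # The rest of the system is indifferent to which rater it was given;
    # the interface is equally "direct and particular" either way.
    return sum(rate(d) for d in world_descriptions) / len(world_descriptions)
```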