Alex Flint
Independent AI alignment researcher
Thanks—fixed! And thank you for the note, too.
Yeah, it might just be a lack of training data on 10-second-or-less interactive instructions.
The thing I really wanted to test with this experiment was actually whether ChatGPT could engage with the real world using me as a guinea pig. The 10-second-or-less thing was just the format I used to try to “get at” the phenomenon of engaging with the real world. I’m interested in improving the format to more cleanly get at the phenomenon.
I do currently have the sense that it’s more than just a lack of training data. I have the sense that ChatGPT has learned much less about how the world really works at a causal level than it appears from much of its dialog. Specifically, I have the sense that it has learned how to satisfy idle human curiosity using language, in a way that largely routes around a model of the real world, and especially routes around a model of the dynamics of the real world. That’s my hypothesis—I don’t think this particular experiment has demonstrated it yet.
I asked a group of friends for “someone to help me with an AI experiment” and then I gave this particular friend the context that I wanted her help guiding me through a task via text message and that she should be in front of her phone in some room that was not the kitchen.
If you look at how ChatGPT responds, it seems to be really struggling to “get” what’s happening in the kitchen—it never really comes to the point of giving specific instructions, and especially never comes to the point of having any sense of the “situation” in the kitchen—e.g. whether the milk is currently in the saucepan or not.
In contrast, my human friend did “get” this in quite a visceral way (it seems to me). I don’t have the sense that this was due to out-of-band context but I’d be interested to retry the experiment with more carefully controlled context.
ChatGPT struggles to respond to the real world
I’m very interested in Wei Dai’s work, but I haven’t followed closely in recent years. Any pointers to what I might read of his recent writings?
I do think Eliezer tackled this problem in the Sequences, but I don’t really think he came to an answer to these particular questions. I think what he said about meta-ethics is that it is neither that there is some measure of goodness to be found in the material world independent of our own minds, nor that goodness is completely open to be constructed based on our whims or preferences. He then says “well there just is something we value, and it’s not arbitrary, and that’s what goodness is”, which is fine, except it still doesn’t tell us how to find that thing or extrapolate it or verify it or encode it into an AI. So I think his account of meta-ethics is helpful but not complete.
Recursive relevance realization seems to be designed to answer the question about the “quantum of wisdom”.
It does! But… does it really answer the question? Curious about your thoughts on this.
you ask whether you are aligned to yourself (ideals, goals etc) and find that your actuality is not coherent with your aim
Right! Very often, what it means to become wiser is to discover something within yourself that just doesn’t make sense, and then to in some way resolve that.
Discovering incoherency seems very different from keeping a model on coherence rails
True. Eliezer is quite vague about the term “coherent” in his write-ups, and some more recent discussions of CEV drop it entirely. I think “coherent” was originally about balancing the extrapolated volition of many people by finding the places where they agree. But what exactly that means is unclear.
And aren’t there mostly at least two coherent paths out of an incoherence point?
Yeah, if the incoherent point is caused by a conflict between two things, then there are at least two coherent paths out, namely dropping one or the other of those two things. I have the sense that you can also drop both of them, or sometimes drop some kind of overarching premise that was putting the two in conflict.
Does the CEV pick one or track both?
It seems that CEV describes a process for resolving incoherencies, rather than a specific formula for which side of an incoherence to pick. That process, very roughly, is to put a model of a person through the kind of transformations that would engender true wisdom if experienced in real life. I do have the sense that this is how living people become wise, but I question whether it can be usefully captured in a model of a person.
Or does it refuse to enter into genuine transformation processes and treat them as dead-ends, as it refuses to step into incoherencies?
I think that CEV very much tries to step into a genuine transformation process. Whether it does or not is questionable. Specifically, if it does, then one runs into the four questions from the write-up.
Did you ever end up reading Reducing Goodhart?
Not yet, but I hope to, and I’m grateful to you for writing it.
processes for evolving humans’ values that humans themselves think are good, in the ordinary way we think ordinary good things are good
Well, sure, but the question is whether this can really be done by modelling human values and then evolving those models. If you claim yes, then there are several thorny issues to contend with, including what constitutes a viable starting point for such a process, what a reasonable dynamic for such a process would be, and on what basis we decide the answers to these things.
Coherent extrapolated dreaming
Wasn’t able to record it—technical difficulties :(
Yes, I should be able to record the discussion and post a link in the comments here.
Response to Holden’s alignment plan
If you train a model by giving it reward when it appears to follow a particular human’s intention, you probably get a model that is really optimizing for reward, or appearing to follow said human’s intention, or something else completely different, while scheming to seize control so as to optimize even more effectively in the future, rather than an aligned AI.
Right yeah I do agree with this.
Perhaps instead you mean: No really the reward signal is whether the system really deep down followed the human’s intention, not merely appeared to do so [...] That would require getting all the way to the end of evhub’s Interpretability Tech Tree
Well I think we need something like a really-actually-reward-signal (of the kind you’re pointing at here). The basic challenge of alignment as I see it is finding such a reward signal that doesn’t require us to get all the way to the end of the Interpretability Tech Tree (or similar tech trees). I don’t think we’ve exhausted the design space of reward signals yet, but it’s definitely the “challenge of our times”, so to speak.
Well even if language models do generalize beyond their training domain in the way that humans can, you still need to be in contact with a given problem in order to solve that problem. Suppose I take a very intelligent human and ask them to become a world expert at some game X, but I don’t actually tell them the rules of game X nor give them any way of playing out game X. No matter how intelligent the person is, they still need some information about what the game consists of.
Now suppose that you have this intelligent person write essays about how one ought to play game X, and have their essays assessed by other humans who have some familiarity with game X but not a clear understanding. It is not impossible that this could work, but it does seem unlikely. There are a lot of levels of indirection stacked against this working.
So overall I’m not saying that language models can’t be generally intelligent, I’m saying that a generally intelligent entity still needs to be in a tight feedback loop with the problem itself (whatever that is).
Here is a critique of OpenAI’s plan
Notes on OpenAI’s alignment plan
This is a post about the mystery of agency. It sets up a thought experiment in which we consider a completely deterministic environment that operates according to very simple rules, and ask what it would mean for an agentic entity to exist within it.
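To make “very simple rules” concrete: the entire dynamics of the environment in the thought experiment can be written down in a few lines. Here is a minimal sketch in Python of one update step of Conway’s Life, using a finite wrap-around grid purely for illustration (that grid choice is an assumption of this sketch, not part of the post’s setup):

```python
# Minimal sketch: one update step of Conway's Game of Life.
# The finite, wrap-around (toroidal) grid is an illustrative assumption;
# the point is only how little machinery the environment's rules require.
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """Apply one step of Conway's Life to a 2D array of 0s and 1s."""
    # Count live neighbours by summing the eight shifted copies of the grid.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next step iff it has exactly 3 live neighbours,
    # or it is currently alive and has exactly 2 live neighbours.
    return ((neighbours == 3) | ((grid == 1) & (neighbours == 2))).astype(int)
```

Everything in the thought experiment, including any would-be agent, is just a pattern of cells evolving under this single rule, which is what makes the question of agency there so stark.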
People in the Game of Life community actually spent some time investigating the empirical questions raised in this post. Dave Greene notes:
The technology for clearing random ash out of a region of space isn’t entirely proven yet, but it’s looking a lot more likely than it was a year ago, that a workable “space-cleaning” mechanism could exist in Conway’s Life.
As previous comments have pointed out, it certainly wouldn’t be absolutely foolproof. But it might be surprisingly reliable at clearing out large volumes of settled random ash—which could very well enable a 99+% success rate for a Very Very Slow Huge-Smiley-Face Constructor.
I have the sense that the most important question raised in this post is whether it is possible to construct a relatively small object in the physical world that steers a relatively large region of the physical world into a desired configuration. The Game of Life analogy is intended to make that primary question concrete, and also to highlight how fundamental the question of such an object’s existence is.
The main point of this post was that the feasibility or non-feasibility of AI systems that exert precise influence over regions of space much larger than themselves may actually be a basic kind of descriptive principle for the physical world. It would be great to write a follow-up post highlighting this point.
Thanks for this note, Dave.
Oh, the only information I have about that is Dave Greene’s comment, plus a few private messages over the years from people who had read the post and were interested in experimenting with concrete GoL constructions. I just messaged the author of the post on the GoL forum asking whether any of that work was spurred by this post.