If you try to write a reward function, or a loss function, that captures human values, that seems hopeless.
But if you have some interpretability techniques that let you find human values in some simulacrum of a large language model, maybe that’s less hopeless.
The difference between constructing something and recognizing it, or between proving and checking, or between producing and criticizing, and so on...
No77e
Why shouldn’t this work? What’s the epistemic failure mode being pointed at here?
While you can “cry wolf” in maybe useful ways, you can also state your detailed understanding of each specific situation as it arises and how it specifically plays into the broader AI risk context.
As impressive as ChatGPT is on some axes, you shouldn’t rely too hard on it for certain things because it’s bad at what I’m going to call “board vision” (a term I’m borrowing from chess).
How confident are you that you cannot find some agent within ChatGPT with excellent board vision through more clever prompting than what you’ve experimented with?
As a failure mode of specification gaming, agents might modify their own goals.
As a convergent instrumental goal, agents want to prevent their goals from being modified.
I think I know how to resolve this apparent contradiction, but I’d like to see other people’s opinions about it.
No77e’s Shortform
I’m going to re-ask all my questions that I don’t think have received a satisfactory answer. Some of them are probably basic, others maybe less so:
I am trying to figure out what the relation is between “alignment with evolution” and “short-term thinking”. Like, imagine that some people get hit by magical space rays, which make them fully “aligned with evolution”. What exactly would such people do?
I think they would become consequentialists smart enough that they could actually act to maximize inclusive genetic fitness. I think Thou Art Godshatter is convincing.
But what if the art or the philosophy makes it easier to get laid? So maybe in such a case they would do the art/philosophy, but they would feel no intrinsic pleasure from doing it; it would all be purely instrumental, and they would be willing to throw it all away if on second thought they found out that it is not actually maximizing reproduction?
Yeah that’s what I would expect.
How would they even figure out what is the reproduction-optimal thing to do? Would they spend some time trying to figure out the world? (The time that could otherwise be spent trying to get laid?) Or perhaps, as a result of sufficiently long evolution, they would already do the optimal thing instinctively? (Because those who had the right instincts and followed them, outcompeted those who spent too much time thinking?)
I doubt that being governed by instincts can outperform a sufficiently smart agent reasoning from scratch, given a sufficiently complicated environment. Instincts are just heuristics after all...
But would that mean that the environment is fixed? Especially if the most important part of the environment is other people? Maybe humanity would get locked in an equilibrium where the optimal strategy is found, and everyone who tries doing something else is outcompeted; and afterwards those who follow the optimal strategy more instinctively outcompete those who need to figure it out. What would such an equilibrium look like?
Ohhh interesting, I have no idea… it seems plausible that it could happen though!
No, I mean “humans continue to evolve genetically, and they never start self-modifying in a way that makes evolution impossible (e.g., by becoming emulations).”
For some reason I don’t get e-mail notifications when someone replies to my posts or comments. My e-mail is verified and I’ve set all notifications to “immediately”. Here’s what my e-mail settings look like:
Could evolution produce something truly aligned with its own optimization standards? What would an answer to this mean for AI alignment?
I agree with you here, although something like “predict the next token” seems more and more likely. That said, I’m not sure whether this is in the same class of goals as paperclip maximizing in this context, or whether the kind of failure it could lead to would be similar.
Yes, this makes a lot of sense, thank you.
Do you mean that no one will actually create exactly a paperclip maximizer, or that no one will create any agent of that kind, i.e., one with goals such as “collect stamps” or “generate images”? Because I think Eliezer meant to object to that class of examples, rather than only that specific one, but I’m not sure.
The last Twitter reply links to a talk from MIRI which I haven’t watched. I wouldn’t be surprised if MIRI also used this metaphor in the past, but I can’t recall examples off the top of my head right now.
What’s wrong with the paperclips scenario?
I use Eliezer Yudkowsky in my example because it makes the most sense. Don’t read anything else into it, please.
I publish posts like this one to clarify my doubts about alignment. I don’t pay attention to whether I’m beating a dead horse or if there’s previous literature about my questions or ideas. Do you think this is an OK practice? One pro is that people like me learn faster, and one con is that it may pollute the site with lower-quality posts.
This seems way too confident to me given the level of generality of your statement. And to be clear, my view is that this could easily happen in LLMs based on transformers, but what about other architectures? If you just talk about how a generic “tool-AI” would or would not behave, it seems to me that you are operating on a level of abstraction far too high to be able to make such specific statements with confidence.