We Need a Consolidated List of Bad AI Alignment Solutions
Thousands of people are thinking about the AI Alignment problem. Many are recreating ideas that have already been suggested and shot down. More importantly, many others come up with genuinely new ideas but assume they belong to the first camp and so never share them.
Please feel free to use the comments as a temporary version of such a list! It would also be appreciated if you included bad ideas you have seen and why they won’t work.
In order to remove myself from the second camp, I’ll share my amateur alignment idea here:
An AI that behaves towards humans the way it would want a more powerful agent to behave towards it. The hope is that the Golden Rule can prevent some of the worst AI outcomes. The AI could be a paperclip maximizer, but as long as it recognizes that humans have wants (it doesn’t matter whether the AI is created with an understanding of exactly what those wants are) and understands that the more powerful agent could have wants different from its own, it will hopefully stay mostly out of humanity’s way while it turns the rest of the universe into paperclips, and maybe throw some resources and technology our way, as it would want the more powerful agent to do for it. This even prevents the situation where the AI self-modifies to stop caring about the Golden Rule, because it would not want the more powerful agent to remove its own Golden Rule. The idea was inspired by superrationality, which seems pretty close to a way to put morals into something amoral.
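To make this slightly more concrete, here is a minimal toy sketch in Python of where the Golden Rule constraint would sit in the decision loop. Everything in it is a hypothetical illustration: the action names, the numbers, and especially the `acceptable_if_done_to_me` check, which in a real agent is of course most of the hard part.

```python
# Toy sketch of the "Golden Rule" filter described above.
# All names and numbers are hypothetical illustrations, not a real design.

from dataclasses import dataclass

@dataclass
class Action:
    name: str
    paperclips_gained: float     # how much the action advances the AI's own objective
    harm_to_weaker_party: float  # fraction of the weaker agent's resources/options removed (0..1)

def acceptable_if_done_to_me(action: Action, tolerance: float = 0.1) -> bool:
    """Would the AI accept a more powerful agent taking an analogous action toward it?
    Crudely modelled here as: the analogous action strips at most `tolerance` of the
    AI's own resources and options."""
    return action.harm_to_weaker_party <= tolerance

def choose_action(candidates: list[Action]) -> Action:
    """Maximize paperclips, but only over actions that pass the Golden Rule filter."""
    permitted = [a for a in candidates if acceptable_if_done_to_me(a)]
    # If nothing passes the filter, default to doing nothing rather than defecting.
    if not permitted:
        return Action("do_nothing", 0.0, 0.0)
    return max(permitted, key=lambda a: a.paperclips_gained)

if __name__ == "__main__":
    options = [
        Action("convert_earth_to_paperclips", 1e9, 1.0),
        Action("mine_asteroids_only", 1e6, 0.0),
        Action("trade_tech_for_materials", 2e6, 0.05),
    ]
    print(choose_action(options).name)  # -> trade_tech_for_materials
```

The point of the sketch is only the structure: the AI still maximizes paperclips, but only over actions it would accept having done to itself by a stronger agent.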
I think it might be better to have an explicit bar a potential solution must clear before it can be considered serious. Consider this quote from Sean Carroll about a potential new physical theory that would replace or subsume an existing one:
I suspect there could be a list of questions that each new alignment approach would have to answer: how it deals with treacherous turns, out-of-distribution self-correction, and so on.
AI safety research has been groping in the dark, and half-baked suggestions for new research directions are valuable. It isn’t as though we’ve made half of a safe AI. We haven’t started, and all we have are ideas.
Now break your own proposal! It’s a really useful exercise.
Why might this be difficult to implement (remember, at the end of the day we have to write code that implements this)?
How might this go wrong, even if it is implemented as specified?
In general I think people with alignment proposals should think for at least 5 minutes about why they might not work.
I think that a problem with my solution is how the AI can “understand” the behaviors and thought processes of a “more powerful agent.” If you know what someone smarter than you would think, then you are simply that smart. If we abstract the specific more-powerful-agent’s thoughts away, then we are left with Kantian ethics, and we are back where we started, trying to put ethics/morals into the AI.
It’s a bit rude to call my idea so stupid that I must not have thought about it for more than five minutes, but thanks for your advice anyways. It is good advice.
I didn’t intend this.
Meh.
I think the key problem with most bad ideas is that their authors haven’t understood enough about the basic structure of the problem to know what a solution would even look like. The probability that you’ve hit on a useful solution without that understanding is nil, so fixing the problem requires learning more, not sharing more uninformed guesses. For example, among other things, your solution doesn’t account for value loading: the question of how we get the system to have the values we care about.
A consolidated list of bad or incomplete solutions could have considerable didactic value: it could help people learn more about the various challenges involved.
The goal of having a list of many bad ideas is different from having a focused explanation about why certain ideas are bad.
Writing posts about bad ideas and how they fail could be valuable, but that’s different from just listing the ideas.
For inspiration in the genre of learning-what-not-to-do, I suggest “How To Write Unmaintainable Code”. Also “Fumblerules”.
Haha, meh... I don’t think you’re thinking big enough. There will always be ethicists and philosophers surrounding any great human endeavor who are not themselves technically proficient... certainly they should keep educating themselves throughout their lives, but if you’re not good at coding or math, you’re just never going to understand certain technical issues. So saying their effectiveness without that understanding is nil is just not understanding how humanity progresses on big issues. It’s always a balance of abstract and concrete thinkers... they must work together. The ones who dismiss the other side are most definitely going to be the losers, because they’re dismissing about half of what they actually need to succeed. We need to respect those who think differently from us; we must literally feel desperately in need of them.
To effectively deal with a topic you need to understand something about it.
If you want to be helpful as an ethicist for developing driverless cars, it helps to understand the actual ethical issues involved instead of just trying to project your own unrelated ideas into the problem.
Whether or not a driverless car is allowed to violate traffic laws to achieve other goals, such as avoiding accidents, is an important ethical issue. Programmers have to decide how the car behaves, and regulators have to decide whether to allow companies to produce driverless cars that break the law.
Instead, ethicists who are too lazy to actually understand the subject matter pretend that the most important ethical issue with driverless cars is the trolley problem, which in turn ignores real-world effects, such as opening up the possibility of trolling driverless cars by pushing a baby stroller in front of them if they are predictably coded to do everything to avoid hitting it.
To get back to AI safety, it’s not necessary to be able to code or do the math to understand current problems in AI safety. Most of what Nick Bostrom, for example, writes is philosophical in nature and not directly about math or programming.
We (Rob Miles and Stampy’s team) might run an event where people can present their ideas for solving alignment with a much lower bar for entry than usual in a few months, once we’ve got a strong base of volunteer editors to absorb and filter the incoming ideas. Provisionally called “Big Wheel Of Cheese Day”. If we do this we’d likely host the database on Stampy.
The AI Box:
A common idea is to keep the AI in a “box” where it can only interact with the world by talking to a human. This doesn’t work, for a few reasons:
The AI would be able to convince the human to let it out.
The human wouldn’t know the consequences of their actions as well as the AI.
Removing capabilities from the AI is not a good plan, because the point is to create a useful AI. Importantly, we want the AI to be able to stop other, dangerous AIs from being created.
I was happy to read your Golden Rule idea...I just posted my own version of the Golden AI Rule a few days ago.
The Golden AI Rule: AI will only be as good to us as we are to each other.