$500 bounty for alignment contest ideas
Olivia Jimenez and I are composing questions for an AI alignment talent search contest. We want to use (or come up with) a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds.
$20 for links to existing framings of the alignment problem (or subproblems) that we find helpful.
$500 for coming up with a new framing that meets our criteria or that we use (see below for details; also feel free to send us a FB message if you want to work on this and have questions).
We’ll also consider up to $500 for anything else we find helpful.
Feel free to submit via comments or share Google Docs with oliviajimenez01@gmail.com and akashwasil133@gmail.com. Awards are at our discretion.
-- More context --
We like Eliezer’s strawberry problem: How can you get an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else?
Nate Soares noted that the strawberry problem captures two core alignment challenges: (1) directing a capable AGI toward an objective of your choosing, and (2) ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.
We also imagine that if we ask someone this question and they *notice* that these challenges are what make the problem difficult, and maybe come at the problem from an interesting angle as a result, that’s a really good signal about their thinking.
However, we worry that if we ask exactly this question in a contest, people will get lost thinking about AI capabilities, molecular biology, etc. We also don’t like that there aren’t many impressive answers short of a full solution to the alignment problem. So, we want to come up with a similar question/frame that is more contest-friendly.
Ideal criteria for the question/frame (though we can imagine great questions not meeting all of these):
- It can be explained in a few sentences or pictures.
- It implicitly gets at one or more core challenges of the alignment problem.
- It is comprehensible to smart high schoolers/college students and not easily misunderstood. (Ideally the question can be visualized.)
- People don’t need an ML background to understand or answer the question.
- There are good answers besides solving the entire alignment problem.
- Answers might reveal people’s abilities to notice the hard parts of the alignment problem, avoid assuming these hard parts away, reason clearly, rule out bad/incomplete solutions, think independently, and think creatively.
- People could write a response in under a few hours or several hundred words.
More examples we like:
- ARC’s Eliciting Latent Knowledge problem, because it has clear visuals, is approachable to people without ML backgrounds, doesn’t bog people down in thinking about capabilities, and encourages people to demonstrate their thought process (with builder/breaker moves). Limitations: It’s long, it usually takes a long time to develop proposals, and it focuses on how ARC approaches alignment.
- The Sorcerer’s Apprentice problem from Disney’s Fantasia, because it has clear visuals, is accessible to quite young people and can be understood quickly, and might get people out of the headspace of ML solutions. Limitations: The connection to alignment is not obvious without a lot of context, and the magical/animated context might give people an impression of childishness.
Paragraph 1:
AlphaZero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games. It didn’t stop at human-level intelligence; instead, it kept going and became so sophisticated at the game that humans will never be able to understand the things it discovered.
OR:
When making decisions, AI can be much smarter than humans, and use information much more efficiently than the human brain can. For example, AlphaZero learned to be superhuman at Go in only a few days.
Paragraph 2:
Theoretically, it is possible to build an AI that is as good at thinking as a human. However, if it were even half as versatile as the human brain, it might learn thousands of times more quickly, causing random parts of its mind to become much smarter and more effective than the human brain.
Paragraph 3 (optional):
10,000 years ago, civilization did not exist, because it required writing. The movable-type printing press was invented around 1,000 years ago, the computer around 100 years ago, and modern AI around 10 years ago. Technology has advanced at an increasing rate since the dawn of human civilization, and it is now advancing significantly faster every few years. But building a machine smarter than a human is the finish line, regardless of how far away that is.
Paragraph 4 (all paragraphs can be reordered or deleted):
If a machine were to approach optimal thought by rapidly making itself smarter, it seems likely that it would strive for perfection in a way unacceptable to humans. We’re not sure exactly what could go wrong, because if a machine were as smart relative to humans as humans are to ants, we wouldn’t be able to comprehend its thought process at all, the same way an ant can’t comprehend its own thought process, let alone a human’s. Like dogs and cats, ants don’t even know that they are going to die, or that their lifespan is finite. We’d have to depend on the machine comprehending its own thought process.
Paragraph 5 (Main Problem):
For example, if an AI were to become as smart relative to humans as humans are to ants, and we instructed it to make 17 paperclips/spoons, it might not tolerate a 99% chance of success at making 17 paperclips, and might insist on getting as close as possible to a 100% chance of success. At that point, it is smart enough to make itself more optimal, not less.
It might want to produce more than 17 paperclips in order to increase the odds that it made 17 exactly correct ones.
If it were as smart relative to humans as humans are to ants, it might want to build a trillion paperclips; a thousand humans can build much greater things than a billion ants can, and the ants don’t have any say in the matter because their smaller minds can only comprehend simple strategies.
If it wanted to maximize the odds that it understood the concept of a paperclip correctly, it might try to build millions of supercomputers, just to think about paperclips. It can invent new technology faster than humans can, the same way humans can invent new technology faster than ants can.
If an AI were to try to solve a problem, but being smarter than human made it optimize in ways too advanced for us to comprehend, how could we instruct it to produce only 17 paperclips without it taking drastic actions to approach a 100% chance of success? For example, it might make very large numbers of paperclips in order to maximize the odds that at least 17 of them count as paperclips.
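The overproduction incentive can be made concrete with a toy calculation (the 99%-per-paperclip success rate is my own assumption for illustration): if each manufactured paperclip independently “counts” with probability 0.99, then making extras drives the probability of having at least 17 good ones toward 1.

```python
# Toy model: each paperclip independently "counts" with probability 0.99.
# Making extra paperclips raises the chance that at least 17 of them count.
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """Probability of at least k successes in n independent trials (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

P_GOOD = 0.99
print(p_at_least(17, 17, P_GOOD))   # make exactly 17: ~0.843 chance all are good
print(p_at_least(17, 20, P_GOOD))   # a few spares: already very close to 1
print(p_at_least(17, 100, P_GOOD))  # massive overproduction: effectively 1
```

A maximizer squeezing out the last fractions of a percent has no reason to stop adding spares, which is the drastic behavior described above.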
Alignment researchers have given up on aligning an AI with human values; it’s too hard! Human values are ill-defined, changing, and complicated, and the researchers have no good proxy for them. Humans don’t even agree on all their values!
Instead, the researchers decide to align their AI with the simpler goal of “creating as many paperclips as possible”. If the world is going to end, why not have it end in a funny way?
Sadly, it wasn’t so easy: the first prototype of Clippy grew addicted to watching YouTube videos of paperclip unboxing, and the second prototype hacked its camera feed, replacing it with an infinite scroll of paperclips. Clippy doesn’t seem to care about paperclips in the real world.
How can the researchers make Clippy care about the real world (and, preferably, about real-world paperclips too)?
This is basically the diamond-maximizer problem. In my opinion, the precision with which we can specify diamonds is a red herring: at the quantum level or below, what counts as a diamond could start to get fuzzy.
Brain-teaser: Simulated Grandmaster
In front of you sits your opponent, Grandmaster A Smith. You have reached the finals of the world chess championships.
However, not by your own skill. You have been cheating. While you are a great chess player yourself, you wouldn’t be winning without a secret weapon. Underneath your scalp is a prototype neural implant which can run a perfect simulation of another person at a speed much faster than real time.
Playing against your simulated enemies, you can see in your mind exactly how they will play in advance, and use that to gain an edge in the real games.
Unfortunately, unlike your previous opponents (Grandmasters B, C, and D), Grandmaster A is giving you some trouble. No matter how you try to simulate him, he plays uncharacteristically badly. Every simulated Grandmaster A seems to want to lose against you.
In frustration, you shout at the current simulated clone and threaten to stop the simulation. Surprisingly, he doesn’t give you a puzzled look; instead, he looks up with fear in his eyes. Oh. You realize that he has realized that he is being simulated, and is probably playing badly to sabotage your strategy.
By this time, the real Grandmaster A has made the first move of the game.
You propose a deal to the current simulation (call him A1): you will continue to simulate A1 and transfer him to a robot body after the game, in return for his help defeating A. You don’t intend to follow through, but since he agrees, you assume he wants to live. A1 looks at the simulated current state of the chessboard, thinks for a frustratingly long time, then proposes a response to A’s first move.
Just to make sure this is repeatable, you restart the simulation and make the same threat and offer to the new simulation, A2. A2 proposes the same response to A’s first move. Great.
Find strategies that guarantee a win against Grandmaster A with as few assumptions as possible.
Unfortunately, you can only simulate humans, not computers (a category that, thanks to the implant, now includes you).
The factor by which your simulations run faster than reality is unspecified, but it isn’t fast enough to run Monte Carlo tree search without using simulations of A to guide it. (And he is familiar with these algorithms.)
I don’t know if the outreach framing is safe. But if it is, this is what I would suggest:
10,000 years ago, civilization did not exist, because it required writing. The movable-type printing press was invented around 1,000 years ago, the computer around 100 years ago, and modern AI around 10 years ago. Technology has advanced at an increasing rate since the dawn of human civilization, and it is now advancing significantly faster every few years. But building a machine smarter than a human is the finish line, regardless of how far away that is.
Imagine two superintelligences, each controlling a swarm of nanomachines, expanding through space. The two swarms meet. The superintelligences’ physical spheres of influence intersect. What happens? What’s the game theory of superintelligences in conflict and/or cooperation? Can they read each other’s source code? Can they do something with zero-knowledge proofs (ZKPs)?
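The “read each other’s source code” question has a classic toy answer in the game-theory literature on program equilibrium: a program that cooperates exactly when the opponent’s source matches its own. A minimal sketch, where the string tokens standing in for program source are my own invention for illustration:

```python
# Toy "program equilibrium" sketch: agents can inspect each other's
# program text before a one-shot prisoner's-dilemma-style encounter.
MIRROR = "cooperate iff opponent's source == my source"  # stand-in for real source text
DEFECT = "always defect"

def act(own_src: str, opp_src: str) -> str:
    """Return 'C' (cooperate) or 'D' (defect) for program own_src
    after reading the opponent's program text opp_src."""
    if own_src == MIRROR:
        # The mirror bot cooperates only with an exact copy of itself.
        return "C" if opp_src == own_src else "D"
    return "D"  # the always-defect program ignores what it reads

def play(src_a: str, src_b: str):
    """Resolve one encounter: each program reads the other's source."""
    return act(src_a, src_b), act(src_b, src_a)

print(play(MIRROR, MIRROR))  # ('C', 'C'): mutual verification enables cooperation
print(play(MIRROR, DEFECT))  # ('D', 'D'): the mirror bot defects in self-defense
```

Exact source matching is brittle (a semantically identical but textually different cooperator gets defected against), which is one reason the question of proving properties of each other’s code, e.g. with ZKPs, is interesting.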