I can’t actually think of a “real RL agent design” (something that could plausibly be scaled to make a strong AGI) that wouldn’t try to search for adversarial inputs to its value function. If you (or anyone reading this) do have ideas for designs that wouldn’t require adversarial robustness but could still go beyond human performance, I think such designs would constitute an important alignment advance, and I strongly suggest writing them up on LW/Alignment Forum.
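To make that concern concrete, here is a minimal toy sketch (my own illustration, with made-up names like learned_value_estimate, not anything from an actual design): an agent that plans by taking the argmax of a learned value function over candidate plans is, by construction, selecting for whichever plans that value function most overrates.

```python
# Toy sketch (hypothetical, for illustration only): an agent that plans by
# taking the argmax of a learned value function over candidate plans.
# Whatever plans the evaluator most overrates are exactly the plans this
# search procedure is most likely to return.

def learned_value_estimate(plan: str) -> float:
    """Stand-in for an imperfect learned value function.

    True value here is the number of "diamond" tokens in the plan; the
    estimator also has a flaw (an adversarial input) that wildly overrates
    any plan containing the token "exploit".
    """
    true_value = plan.count("diamond")
    flaw_bonus = 1_000.0 if "exploit" in plan else 0.0
    return true_value + flaw_bonus

def plan_search(candidate_plans: list[str]) -> str:
    """Argmax planning: return whichever candidate the evaluator scores highest."""
    return max(candidate_plans, key=learned_value_estimate)

candidates = [
    "synthesize diamond diamond diamond",
    "improve diamond synthesis methods",
    "exploit a quirk in my own evaluator",  # an adversarial input to the value function
]
print(plan_search(candidates))  # -> the "exploit" plan, despite it producing no diamonds
```

Making the search stronger only makes it better at finding inputs like the third one; that is the sense in which scaling this kind of design seems to require adversarial robustness of the value function.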
I think @Quintin Pope would disagree with this. As I understand it, one of Shard Theory’s claims is exactly that generally capable RL agents would not apply such adverse selection pressure on their own inputs (or at least that we should not design such agents).
See:
Don’t design agents which exploit adversarial inputs
Don’t align agents to evaluations of plans
Alignment allows “nonrobust” decision-influences and doesn’t require robust grading
So, it looks like the key passage is this one:

A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to. The agent can predict e.g. how diamond-promising it is to search for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds, versus plans where the agent just improves its synthesis methods. A reflective agent thinks that the first plan doesn’t lead to many diamonds, while the second plan leads to more diamonds. Therefore, the reflective agent chooses the second plan over the first plan, automatically[7] avoiding the worst parts of the optimizer’s curse. (Unlike grader-optimization, which seeks out adversarial inputs to the diamond-motivated part of the system.)
The above passage has a major flaw, which is that simulating other superintelligences is not where adverse selection pressure comes from. The agent generates the selection pressure on its own, though of course from the agent’s perspective the selection is perfectly benign and not adverse at all. Making the agent reflective does not prevent this.
To put things in terms of diamond maximization, the common starting point is that we can’t perfectly specify the “maximize diamonds” objective. Let’s say, for the sake of concreteness, that while the objective we actually manage to instill includes Diamonds, it also includes other objects, much cheaper to construct, which I’ll call “Liemonds”: fakes cleverly constructed to pass as the real thing. So we’re trying to build an AI to maximize Diamonds, but, being unable to fully solve alignment, we end up with an AI that maximizes Diamonds + Liemonds. Now say the AI is deciding what to think about for the next few timesteps and is considering three options:
1. Simulate other, potentially malign, superintelligences.
2. Research how to make Diamonds.
3. Research how to make Liemonds.
The AI, being reflective and suspecting itself susceptible to being hacked, rules out 1. Then it notes that Liemonds are cheaper to produce and so vastly more of them can be produced than Diamonds, and so it picks option 3. Onlooking humans are unhappy, since they would have suggested the AI pick option 2, and they will soon be even more unhappy once the AI tiles the universe with Liemonds.
The agent is perfectly happy, though, since it’s fulfilling its values of maximizing Diamonds + Liemonds to the highest degree possible. How was the agent even supposed to know that Liemonds were bad and didn’t count? Yes, avoiding option 1 is a minor victory, but the real core of the alignment problem is getting the agent to choose Diamonds over Liemonds.
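To spell out the arithmetic behind that choice, here is a toy sketch (made-up costs and names, purely illustrative): once the learned objective counts Liemonds the same as Diamonds and Liemonds are cheaper, option 3 wins under the agent’s own criterion even after option 1 has been ruled out.

```python
# Toy sketch (made-up numbers): why a Diamonds + Liemonds maximizer picks option 3.

BUDGET = 1_000.0                          # abstract resources the agent can spend
COST = {"Diamond": 10.0, "Liemond": 1.0}  # Liemonds are much cheaper to construct

def learned_objective(diamonds: int, liemonds: int) -> float:
    """The AI's actual objective: it cannot tell Liemonds from Diamonds."""
    return diamonds + liemonds

def intended_objective(diamonds: int, liemonds: int) -> float:
    """What the humans wanted: only real Diamonds count."""
    return diamonds

# Option 1 (simulate malign superintelligences) is already ruled out on reflection,
# so the agent only scores the two research plans.
options = {
    "option 2: research Diamond synthesis": "Diamond",
    "option 3: research Liemond synthesis": "Liemond",
}

scores = {}
for name, product in options.items():
    n_produced = int(BUDGET // COST[product])
    diamonds = n_produced if product == "Diamond" else 0
    liemonds = n_produced if product == "Liemond" else 0
    scores[name] = learned_objective(diamonds, liemonds)
    print(name, "->", scores[name], "by the AI's lights,",
          intended_objective(diamonds, liemonds), "by ours")

print("Agent picks:", max(scores, key=scores.get))
# option 2 -> 100 by the AI's lights, 100 by ours
# option 3 -> 1000 by the AI's lights, 0 by ours
# Agent picks: option 3
```

By its own lights the agent is doing everything right; the problem sits entirely in the gap between the learned objective and the intended one, which is the part reflection doesn’t fix.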
I certainly agree with this; the question is how?