A metaphor: what “green lights” for AGI would look like

I’m not as familiar as some must be with the history, but Eliezer had to explicate an entire deontic mesh around “guarded term”, to keep people from motte-and-baileying “pivotal act”. I suppose I should declare that mesh cloned around “green light” here, just in case. You are explicitly not allowed to claim my approval of your AGI because you can make an argument that you have something that qualifies as a “green light” according to this post.

Excerpted from Zach Weinersmith at Saturday Morning Breakfast Cereal [link]:

A: Mastery of the nature of reality grants you no mastery over the behavior of reality.

A: I could tell you why Grandpa is very sick. I could tell you what each cell is doing wrong, why it’s doing wrong, and roughly when it started doing wrong.

A: But I can’t tell them to stop.

B: Why can’t you make a machine to fix it?

A: Same reason you can’t make a parachute when you fall from the plane.

Zach is wrong here. To have mastered reality is to bring a parachute every time you are a plane passenger, to keep it close, and to be as sure as possible of how to use it.

Claims of great mastery unsubstantiated by commensurate “luck” are false.

This is especially true when discussing prospective feats of great mastery that are especially Far, especially singular [i.e., lacking a track record or grounded training data], or especially socially poorly understood, and when the claimant to mastery suspects that under no conditions will they personally have to answer for any inaccuracies.

I was recently talking with someone about ChatGPT’s RLHF, and what the optimal “safety” policy would be. I claimed the right decision would have been for OpenAI to not train ChatGPT in the first place. They said sure, but not because ChatGPT itself was dangerous. I said sometimes we don’t know if something is dangerous before we build it.

To extend the plane metaphor, green lights for AGI look like:

  • Since this is our first “grounded” plane flight, somebody having an obviously-correct and ~exhaustive theory of how heavier-than-air flight works, and what will happen [for our purposes] when we turn on the plane.

  • All copilots having mastered the simulator. Not infinitely (we’re not doing ALARA), but mastered it to the point where they’ve stopped crashing it in ways that would kill them if it were a real flight. Since simulations of computations are actual computations, this part will just look like really prepared pilots who have not just learned what all the controls [/not-yet-executed AGI components] are supposed to do, but have spent significant time writing down lists of things that could go wrong and finding ways to check those things off the list.

  • For AGIs that are not supposed to take away humanity’s directive power over the timeline forever, a plan for when the flight will terminate [at what point the AGI is going to stop optimizing] before we run out of fuel [confidence in our alignment theory].

  • A runway that we have some idea is N times as long as we’ll need, where N >> 1 [quasi-quantified overconfidence in the redundant effectiveness of our alignment methods].

  • Ideally, “parachutes” we are fairly confident work, and [mental] practice using them.

The industry alignment plan is RLHF. RLHF as actually implemented is not any of these things. It is not even an alignment method. It is not even a control method. It is a user interface feature. It was designed as such, and that’s what it can do.
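For concreteness, here is a minimal toy sketch (mine, not OpenAI’s code; random vectors stand in for a real model’s embeddings, and a linear head stands in for a full reward model) of the pairwise preference objective that standard RLHF reward modeling optimizes, after which the policy is fine-tuned toward the learned score. Note what the objective actually expresses: “score the responses raters preferred above the ones they rejected.” It is a statement about which outputs people like, not a constraint on what the system can do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: scores a (prompt, response) embedding with a single scalar.
# In a real RLHF pipeline this head sits on top of a full language model.
class RewardModel(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(model: RewardModel,
                    chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise objective: push the reward of the human-preferred
    # response above the reward of the rejected one. Nothing here encodes
    # "never do catastrophic things" -- only "rank outputs the way raters did".
    return -F.logsigmoid(model(chosen) - model(rejected)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    chosen = torch.randn(8, 16)    # stand-ins for embeddings of preferred responses
    rejected = torch.randn(8, 16)  # stand-ins for embeddings of rejected responses
    for _ in range(100):
        opt.zero_grad()
        loss = preference_loss(model, chosen, rejected)
        loss.backward()
        opt.step()
    print(f"final preference loss: {loss.item():.4f}")
```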

Interpretability might be a good start toward some green lights, if we could actually get it. But unless it lets us make some pretty intensive predictions about what will happen, being able to trace which neuron stores which concept is the kind of “understanding” that leaves you falling from the plane at terminal velocity.