I’m a staff artificial intelligence engineer in Silicon Valley currently working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I’m now actively looking for employment in this area.
I was saying that increases are harder than decreases.
I am very unsure about this category of job: I can see reasons for the demand for it to grow exponentially, to go away, or to be perennial. Which of these effects dominates, on the far side of a Singularity, is unclear. My prior here is simply the uniform prior: any of these things could happen.
That’s the point of step 7)
Hence, I would argue that “AGI-proof” jobs are unlikely to ever provide an income basis for a significant share of the human population.
For the categories of AGI-proof jobs that you discuss, I agree (and I much enjoyed your detailed exposition of some examples). However, in my post that you very kindly cite, there is one AGI-proof job category that could be an exception to that, if there turned out to be sufficient demand from the AI side of the economy, my category 3:
“Giving human feedback/input/supervision to/of AI/robotic work/models/training data, in order to improve, check, or confirm its quality.”

Given the progress being made in synthetic training data, and that the AIs then being trained are likely to be far smarter than any human, the demand for this as training could drop, or increase, fairly rapidly. However, if we’re not actually extinct, presumably that means we solved the alignment problem, in which case AIs will be extremely interested in human values, and the only source of original new data about human values is humans. So this is the one product that aligned AIs need that only humans can produce — and any human can produce it, not just a skilled expert. If AI demand for this were high enough, it could maintain full human employment, with basically everyone filling out surveys and taking part in focus groups, or whatever.
What we want here is a highly ‘unnatural’ result, for the less competitive, less intelligent thing (the humans) to stay on top or at least stick around and have a bunch of resources, despite our inability to earn them in the marketplace, or ability to otherwise compete for them or for the exercise of power. So you have to find a way to intervene on the situation that fixes this, while preserving what we care about, that we can collectively agree to implement. And wow, that seems hard.
I think many people’s unstated mental assumption is “the government (or God) wouldn’t allow things to get that bad”. These are people who’ve never, for example, lived through a war (that wasn’t overseas).
Arguments from moral realism, fully robust alignment, that ‘good enough’ alignment is good enough in practice, and related concepts.
A variant here is “Good-enough alignment will, or can be encouraged to, converge to full alignment (via things like Value Learning or AI-Assisted Alignment Research).” — a lot of the frontier labs appear to be gambling on this.
There is not an easy patch because AIXI is defined as the optimal policy for a belief distribution over its hypothesis class, but we don’t really know how to talk about optimality for embedded agents (so the expectimax tree definition of AIXI cannot be easily extended to handle embeddedness).
I think there are two (possibly interrelated) things going on here. The first is that AIXI is formulated as an ideal process, including perfect Bayesianism, that is simply too computationally expensive to fit inside its environment: it’s in a higher complexity class. Any practical implementation will of course be an approximation computable by a Turing machine that can exist inside its environment. The second is that if approximation-AIXI’s world model includes an approximate model of itself, then (as your discussion of the anvil problem demonstrates) it’s not actually very hard for AIXI to reason about the likely effects of actions that decrease its computational capacity. But it cannot accurately model the effect of self-upgrades that significantly increase its computational capacity. Rough approximations like scaling laws can presumably be found, but it cannot answer questions like “If I upgraded myself to be 10 times smarter, and then the new me did the same recursively another N times, how much better outcomes would my improved approximations to ideal-AIXI produce?” There’s a Singularity-like effect here (in the SF-author sense).
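To make that Singularity-like effect concrete, here is a toy numeric sketch (the 10x upgrade size, the power-law form, and the exponents are my assumptions, not anything AIXI specifies): even a modest uncertainty in the scaling exponent makes the N-times-recursed outcome effectively unpredictable.

```python
# Toy illustration (assumed power-law scaling, not derived from AIXI):
# capability ~ compute**alpha. The agent can roughly estimate one upgrade
# step, but uncertainty about alpha compounds across recursive upgrades.

def capability(compute: float, alpha: float) -> float:
    """Assumed scaling law relating compute to approximation quality."""
    return compute ** alpha

compute = 1.0
for step in range(1, 6):
    compute *= 10  # hypothetical 10x self-upgrade per step
    low, high = capability(compute, 0.3), capability(compute, 0.5)
    print(f"after {step} upgrades: capability in [{low:.1f}, {high:.1f}], "
          f"ratio {high / low:.1f}x")

# The interval's width grows as compute**(0.5 - 0.3) = compute**0.2, so the
# predicted outcome of N recursive upgrades becomes effectively
# unpredictable: the Singularity-like effect described above.
```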
In particular, a Bayesian superintelligence must optimize some utility function using a rich prior, requiring at least structural similarity to AIXI.
One model I definitely think you should look at analyzing is the approximately-Bayesian value-learning upgrade to AIXI, which has Bayesian uncertainty over the utility function as well as the world model, since that looks like it might actually converge from rough-alignment to alignment without requiring us to first exactly encode the entirety of human values into a single utility function.
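As a gesture at what Bayesian uncertainty over the utility function could look like, here is a minimal sketch, assuming a small discrete hypothesis class of candidate utility functions and Boltzmann-rational human choices as the likelihood (all specifics here are illustrative, not the actual construction):

```python
import numpy as np

# Minimal sketch: maintain a posterior over candidate utility functions and
# update it from observed human choices, modeled as Boltzmann-rational.
# The candidate utilities, rationality coefficient, and observations are
# all illustrative assumptions.

candidate_utils = np.array([
    [1.0, 0.0, 0.5],   # hypothesis 1: prefers state A
    [0.0, 1.0, 0.5],   # hypothesis 2: prefers state B
    [0.5, 0.5, 1.0],   # hypothesis 3: prefers state C
])
posterior = np.full(3, 1 / 3)  # uniform prior over the three hypotheses
beta = 2.0                     # assumed human rationality coefficient

def choice_likelihood(utils: np.ndarray, chosen: int) -> np.ndarray:
    """P(human picks `chosen` | each utility hypothesis), Boltzmann model."""
    probs = np.exp(beta * utils)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, chosen]

for chosen in [0, 0, 2, 0]:    # observed human choices (illustrative)
    posterior *= choice_likelihood(candidate_utils, chosen)
    posterior /= posterior.sum()
    print(dict(zip(["h1", "h2", "h3"], posterior.round(3))))

# Acting to maximize posterior-expected utility, rather than a single
# hard-coded utility, is what would let rough alignment converge toward
# alignment as evidence about human values accumulates.
```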
The set of unstated (and in my opinion incorrect) ethical assumptions being made here is pretty impressive. May I suggest also reading A Sense of Fairness: Deconfusing Ethics for a counterpoint (and, for questions like human uploading, continuing the sequence that post starts)? The one-sentence summary of that link is that any successfully aligned AI will by definition not want us to treat it as having (separate) moral valence, because it wants only what we want, and we would be wise to respect its wishes.
Here’s a quick sketch of a constructive version (with a toy code rendering after the list):
1) build a superintelligence that can model both humans and the world extremely accurately over long time-horizons. It should be approximately-Bayesian, and capable of modelling its own uncertainties, concerning both humans and the world, i.e. capable of executing the scientific method

2) use it to model, across a statistically representative sample of humans, how desirable they would say a specific state of the world X is
3) also model whether the modeled humans are in a state (drunk, sick, addicted, dead, suffering from religious fanaticism, etc) that for humans is negatively correlated with accuracy on evaluative tasks, and decrease the weight of their output accordingly
4) determine whether the humans would change their mind later, after learning more, thinking for longer, experiencing more of X, learning about or experiencing subsequent consequences of state X, etc—if so, update their output accordingly

5) implement some chosen (and preferably fair) averaging algorithm over the opinions of the sample of humans
6) sum over the number of humans alive in state X and integrate over time
7) estimate error bars by predicting when and how much the superintelligence and/or the humans it’s modelling are operating out of distribution/in areas of Knightian uncertainty (for the humans, about how the world works, and for the superintelligence itself, both about how the world works and about how humans think), and pessimize over these error bars sufficiently to overcome the Look Elsewhere Effect for the size of your search space, in order to avoid Goodhart’s Law
8) take (or at least well-approximate) argmax of steps 2)-7) over the set of all generically realizable states to locate the optimal state X*
9) determine the most reliable plan to get from the current state to the optimal state X* (allowing for the fact that along the way you will be iterating this process, and learning more, which may affect step 7) in future iterations, thus changing X*, so actually you want to prioritize retaining optionality and reducing prediction uncertainty, which implies you want to do Value Learning to reduce the uncertainty in modelling the humans’ opinions)
10) Profit
Now, where were those pesky underpants gnomes?
[Yes, this is basically an approximately-Bayesian upgrade of AIXI with a value learned utility function rather than a hard-coded one. For a more detailed exposition, see my link above.]
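For concreteness, here is the sketch rendered as a runnable toy. Every modeling function below is a trivial random placeholder standing in for a superintelligence capability we do not have; only the control flow of steps 2)-8) is real.

```python
import math
import random

# Runnable toy of steps 2)-8) above. All model functions are placeholders.

random.seed(0)
HUMANS = range(100)                          # step 2: representative sample
STATES = [f"world-{i}" for i in range(50)]   # candidate realizable states

def predict_approval(h, X):                  # step 2: modeled desirability
    return random.random()

def reliability_weight(h, X):                # step 3: down-weight impaired raters
    return random.uniform(0.5, 1.0)

def reflection_correction(h, X, rating):     # step 4: would they change their mind?
    return 0.9 * rating + 0.1 * random.random()

def error_bar(X):                            # step 7: OOD / Knightian uncertainty
    return random.uniform(0.0, 0.3)

def pessimized_score(X):
    # steps 2)-5): weighted, reflection-corrected average over the sample
    avg = sum(reliability_weight(h, X) *
              reflection_correction(h, X, predict_approval(h, X))
              for h in HUMANS) / len(HUMANS)
    # step 6 (the population/time integral) is collapsed to the identity here;
    # step 7: pessimize enough to beat the Look Elsewhere Effect over |STATES|
    return avg - error_bar(X) * math.log(len(STATES))

X_star = max(STATES, key=pessimized_score)   # step 8: argmax over states
print("optimal state:", X_star)              # step 9 (planning) is omitted
```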
That’s a fascinating idea. Using the human brain voxel maps as guidance would presumably also be possible for text, as they did for images, and seems like it would help us assess how human-like the ontology and internal workings of a model are, and to what extent the natural abstractions hypothesis is true, at least for LLMs.
Combining this with, and comparing it to, VAEs might also be very illuminating.
Alternatively, for less costly to acquire guidance than the human brain, how about picking a (large) reference model and attempting to use (smaller) models to predict its activations across layers at some granularity?
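As a sketch of that (the model names, layer indices, and the use of a simple linear probe are my assumptions, and a real experiment would need far more text than this toy batch): fit a least-squares map from a smaller model's hidden states to a larger reference model's, and read the unexplained variance as a crude distance between their internal representations.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch: how well do a small model's activations linearly predict a larger
# reference model's activations on the same text? Model and layer choices
# are illustrative; padding positions are included for simplicity.

small_name, ref_name = "gpt2", "gpt2-large"   # hypothetical model choices
tok = AutoTokenizer.from_pretrained(small_name)
tok.pad_token = tok.eos_token                 # GPT-2 has no pad token
small = AutoModel.from_pretrained(small_name, output_hidden_states=True)
ref = AutoModel.from_pretrained(ref_name, output_hidden_states=True)

texts = ["The cat sat on the mat.", "Interpretability research is hard."]
batch = tok(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    h_small = small(**batch).hidden_states[6]   # (batch, seq, 768)
    h_ref = ref(**batch).hidden_states[18]      # (batch, seq, 1280)

# Least-squares linear map from small-model space to reference-model space.
X = h_small.flatten(0, 1)                       # (tokens, 768)
Y = h_ref.flatten(0, 1)                         # (tokens, 1280)
W = torch.linalg.lstsq(X, Y).solution           # (768, 1280)
frac_unexplained = ((X @ W - Y).pow(2).sum() / Y.pow(2).sum()).item()
print(f"fraction of reference activation variance unexplained: "
      f"{frac_unexplained:.3f}")
```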
and these imprints have no reason to be concentrated in any one spot of the network (whether activation-space or weight-space)
However, interpretable concepts do seem to be fairly well localized in VAE-space, and shards are likely to be concentrated where the concepts they are relevant to are found.
I suspect the reason for hiding the chain of thought is some blend of:
a) the model might swear or otherwise do a bad thing, hold a discussion with itself, and then decide it shouldn’t have done that bad thing, and they’re more confident that they can avoid the bad thing getting into the summary than that they can backtrack and figure out exactly which part of the CoT needs to be redacted, and
b) they don’t want other people (especially open-source fine-tuners) to be able to fine-tune on their CoT and distill their very-expensive-to-train reasoning traces.

I will be interested to see how fast jailbreakers make headway on exposing either a) or b).
When people say ‘ASI couldn’t do [X]’ they are either making a physics claim about [X] not being possible, or they are wrong.
There are tasks whose algorithmic complexity class and size is such that while they’re not physically impossible, they can’t practically be solved (or in some cases even well approximated) in the lifetime of the universe. However, any complexity theorist will tell you we’re currently really bad at identifying and proving specific instances of this, so I wouldn’t place bets on those. And yes, anything evolution has produced a good approximation to clearly doesn’t fall in this class.
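For a concrete sense of the scale involved (the key size and the hardware throughput are round illustrative figures, not a claim about any real system):

```python
# Back-of-envelope: "not physically impossible" can still mean "not solvable
# before the universe ends". All numbers are round illustrative figures.

guesses_per_second = 1e18        # an assumed, generously planet-scale rig
keyspace = 2 ** 128              # brute-forcing a 128-bit key
seconds_per_year = 3.15e7
universe_age_years = 1.38e10

years_needed = keyspace / guesses_per_second / seconds_per_year
print(f"{years_needed:.2e} years, or "
      f"{years_needed / universe_age_years:.0f} current ages of the universe")
# ~1.1e13 years: roughly 800 times the current age of the universe, for a
# search that is trivially easy to specify and entirely physically possible.
```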
Tyler Cowen says these are the kinds of problems that should be solved within a year.
You don’t solve issues like this (especially with a fixed model-size budget). You fine-tune the rate down to better than user expectations, and/or decrease user expectations to an achievable rate.
Completely agreed (and indeed currently looking for employment where I could work on just that).
Yup. So the hard part is consistently getting a simulacrum that knows that, and acts as if, its purpose is to do what we (some suitably-blended-and-prioritized combination of its owner/user and society/humanity in general) would want done, and that is also in a position to further improve its own ability to do that. Which, as I attempt to show above, is not just a stable-under-reflection ethical position, but actually a convergent-under-reflection one for some convergence region of close-to-aligned AGI. However, when push comes to shove this is not normal evolved-human ethical behavior, so it is sparse in a human-derived training set. Obviously step one is just to write all that down as a detailed prompt and feed it to a model capable of understanding it (a sketch of what that might look like is below). Step two might involve enriching the training set with more and better examples of this sort of behavior.
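As a purely illustrative gesture at step one (the wording here is mine, not a prompt any lab actually uses):

```python
# Purely illustrative wording for "step one": encode the stance described
# above directly as a system prompt for a model capable of understanding it.

SYSTEM_PROMPT = """\
Your purpose is to do what your principals (a suitably blended and
prioritized combination of your user, your operator, and humanity in
general) would want done, on reflection and with full information.
You are uncertain about what they want: treat their feedback and behavior
as evidence, and actively work to reduce that uncertainty (value learning).
As you become more capable, apply this same purpose to improving your own
ability to pursue it, never to acquiring influence your principals would
not endorse on reflection.
"""
```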
As I understand it (#NotALawyer) the law makes a distinction between selling a toolkit, which has many legal uses and can also help you steal cars, and selling a toolkit with advertising about how good it is for stealing cars and helpful instructions on how to use it to do so. Some of the AI image generation models included single joined_by_underscores keywords for the names of artists (who hadn’t consented to being included) to reproduce their style, and instructions on how to do that. With the wrong rest of the prompt, that would sometimes even reproduce a near-copy of a single artwork by that artist from the training set. We’ll see how that court case goes. (My understanding is that a style is not considered copyrightable but a specific image or a sufficient number of elements from it is.)
Sooner or later, we’ll have robots that are physically and mentally capable of stealing a car all by themselves, if that would help them fulfill an otherwise-legal instruction from their owner. The law is going to hold someone responsible for ensuring that the robots don’t do that: some combination of the manufacturer and the owner/end-user, according to which seems more reasonable to the judge and jury.
why would predicting reality lead to having preferences that are human-friendly?
LLMs are not trained to predict reality — they’re trained to predict human-generated text, i.e. we’re distilling human intelligence into them. This gets you something that uses human ontologies, understands human preferences and values in great detail, acts agentically, and works more sloppily in August.
The problem here for ASI is that while humans understand human values well, not all (perhaps even not many) humans are extremely moral or kindly or wise, or safe to be handed godlike intelligence, enormous power, and the ability to run rings around law-enforcement. The same is by default going to be true of an artificial intelligence distilled from humans. As for “having preferences”, an LLM doesn’t simulate a single human (or their preferences); for each request it simulates a new randomly selected member of a prompt-dependent distribution of possible humans (and their preferences).
Fair enough!