Engineer at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
This doesn’t seem wrong to me so I’m now confused again what the correct analysis is. It would come out the same way if we assume rationalists are selected on g right?
Is a Gaussian prior correct though? I feel like it might be double-counting evidence somehow.
TLDR:
What OP calls “streetlighting”, I call an efficient way to prioritize problems by tractability. This is only a problem insofar as we cannot also prioritize by relevance.
I think problematic streetlighting is largely due to incentives, not because people are not smart / technically skilled enough. Therefore solutions should fix incentives rather than just recruiting smarter people.
First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress has been made to generate agreement on them. All the lists of “core hard problems” OP lists are different, and Paul Christiano famously wrote a 27-point list of disagreements on Eliezer’s. This means that most people’s views of the problem are wrong, and should they stick to their guns they might perseverate on either an irrelevant problem or a doomed approach.
I’d guess that historically perseveration has been as large a problem as streetlighting among alignment researchers. Think of all the top alignment researchers in 2018 and all the agendas that haven’t seen much progress. Decision theory should probably not take ~30% of researcher time like it did back in the day.[1]
In fact, failure is especially likely for people who are trying to tackle “core hard problems” head-on, and not due to lack of intelligence. Many “core hard problems” are observations of lack of structure, or observations of what might happen in extreme generality e.g. Eliezer’s
“We’ve got no idea what’s actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.”
(summarized) “Outer optimization doesn’t in general produce aligned inner goals”, or
“Human beings cannot inspect an AGI’s output to determine whether the consequences will be good.”
which I will note are of a completely different type signature from subproblems that people can actually tractably research. Sometimes we fail to define a tractable line of attack. Other times these ill-defined problems get turned into entire subfields of alignment, like interpretability, which are filled with dozens of blind alleys of irrelevance that extremely smart people frequently fall victim to. For comparison, some examples of problems ML and math researchers can actually work on:
Unlearning: Develop a method for post-hoc editing a model, to make it as if it were never trained on certain data points
Causal inference: Develop methods for estimating the causation graph between events given various observational data.
Fermat’s last theorem: Prove whether there are positive integer solutions to a^n + b^n = c^n for n > 2.
The unit of progress is therefore not “core hard problems” directly, but methods that solve well-defined problems and will also be useful in scalable alignment plans. We must try to understand the problem and update our research directions as we go. Everyone has to pivot because the exact path you expected to solve a problem basically never works. But we have to update on tractability as well as relevance! For example, Redwood (IMO correctly) pivoted away from interp because other plans seemed viable (relevance) and it seemed too hard to explain enough AI cognition through interpretability to solve alignment (tractability).[2]
OP seems to think flinching away from hard problems is usually cope / not being smart enough. But the items on OP’s list of types of cope are completely valid as either fundamental problem-solving strategies or prioritization. (4 is an incentives problem, which I’ll come back to later.)
1. Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. [...]
2. Carol explicitly says that she’s not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
3. Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
1 and 2 are fundamental problem-solving techniques. 1 is a crucial part of Polya’s step 1: understand the problem, and 2 is a core technique for actually solving the problem. I don’t like relying on 3 as stated, but there are many valid reasons for focusing on near-term AI[3].
Now I do think there is lots of distortion of research in unhelpful directions related to (1, 2, 3), often due to publication incentives.[4] But understanding the problem and solving easier versions of it has a great track record in complicated engineering; you just have to solve the hard version eventually (assuming we don’t get lucky with alignment being easy, which is very possible but not something we should plan for).
So to summarize my thoughts:
Streetlighting is real, but much of what OP calls streetlighting is a justified focus on tractability.
We can only solve “core hard problems” by creating tractable well-defined problems
OP’s suggested solution—higher intelligence and technical knowledge—doesn’t seem to fit the problem.
There are dozens of ML PhDs, physics PhDs, and comparably smart people working on alignment. As Ryan Kidd pointed out, the stereotypical MATS student is now a physics PhD or technical professional. And presumably according to OP, most people are still streetlighting.
Technically skilled people seem equally susceptible to incentives-driven streetlighting, as well as perseveration.
If the incentives continue to be wrong, people who defy them might be punished anyway.
Instead, we should fix incentives, maybe like this:
Invest in making “core hard problems” easier to study
Reward people who have alignment plans that at least try to scale to superintelligence
Reward people who think about whether others’ work will be helpful with superintelligence
Develop things like alignment workshops, so people have a venue to publish genuine progress that is not legible to conferences
Pay researchers with illegible results more to compensate for their lack of publication / social rewards
[1] MIRI’s focus on decision theory is itself somewhat due to streetlighting. As I understand, 2012ish MIRI leadership’s worldview was that several problems had to be solved for AI to go well, but the one they could best hire researchers for was decision theory, so they did lots of that. Also someone please correct me on the 30% of researcher time claim if I’m wrong.
[2] OP’s research is not immune to this. My sense is that selection theorems would have worked out if there had been more and better results.
[3] e.g. if deploying on near-term AI will yield empirical feedback needed to stay on track, significant risk comes from near-term AI, near-term AI will be used in scalable oversight schemes, …
[4] As I see it, there is lots of distortion by the publishing process now that lots of work is being published. Alignment is complex enough that progress in understanding the problem is a large enough quantity of work to be a paper. But in a paper, it’s very common to exaggerate one’s work, especially the validity of the assumptions[5], and people need to see through this for the field to function smoothly.
[5] I am probably guilty of this myself, though I try to honestly communicate my feelings about the assumptions in a long limitations section.
Under log returns to money, personal savings still matter a lot for selfish preferences. Suppose the material comfort component of someone’s utility is 0 utils at a consumption of $1/day. Then a moderately wealthy person consuming $1000/day today will be at 7 utils. The owner of a galaxy, at maybe $10^30 / day, will be at 69 utils, but doubling their resources will still add the same 0.69 utils it would for today’s moderately wealthy person. So my guess is they will still try pretty hard at acquiring more resources, similarly to people in developed economies today who balk at their income being halved and see it as a pretty extreme sacrifice.
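The arithmetic above can be checked with a short sketch. It assumes the utility is the natural log of daily consumption in dollars (so $1/day maps to 0 utils), which is the reading that reproduces the 7, 69, and 0.69 figures; the function name is purely illustrative.

```python
import math

def log_utility(consumption_per_day: float) -> float:
    """Utility in 'utils', assuming u(c) = ln(c) with c in dollars/day (0 utils at $1/day)."""
    return math.log(consumption_per_day)

wealthy = log_utility(1_000)      # moderately wealthy person today
galaxy_owner = log_utility(1e30)  # hypothetical galaxy owner

# Under log utility, doubling resources adds ln(2) regardless of the starting level.
gain_wealthy = log_utility(2_000) - wealthy
gain_galaxy = log_utility(2e30) - galaxy_owner

print(round(wealthy, 2), round(galaxy_owner, 2))
print(round(gain_wealthy, 2), round(gain_galaxy, 2))
```

The point of the comparison is that the marginal gain from doubling is identical at both wealth levels, so the incentive to acquire more does not vanish at extreme wealth.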
I agree. You only multiply the SAT z-score by 0.8 if you’re selecting people on high SAT score and estimating the IQ of that subpopulation, making a correction for regressional Goodhart. Rationalists are more likely selected for high g which causes both SAT and IQ, so the z-score should be around 2.42, which means the estimate should be (100 + 2.42 * 15 − 6) = 130.3. From the link, the exact values should depend on the correlations between g, IQ, and SAT score, but it seems unlikely that the correction factor is as low as 0.8.
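A quick check of the two estimates, as a sketch. The mean of 100, SD of 15, SAT z-score of 2.42, and the 6-point subtraction are taken from the comment above as given; the function name and the `correction` and `offset` parameters are illustrative.

```python
def estimated_iq(sat_z: float, correction: float = 1.0, offset: float = 6.0) -> float:
    """IQ estimate from an SAT z-score.

    correction < 1 models regression toward the mean when selecting on SAT score;
    offset is the 6-point subtraction used in the comment (taken as given).
    """
    return 100 + correction * sat_z * 15 - offset

selected_on_g = estimated_iq(2.42)                    # no regression correction
selected_on_sat = estimated_iq(2.42, correction=0.8)  # with the 0.8 correction

print(f"{selected_on_g:.1f} vs {selected_on_sat:.1f}")
```

The gap between the two numbers is the size of the regressional Goodhart correction being debated.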
I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?
It’s likely possible to engineer away mutations just by checking. ECC memory already has an error rate nine orders of magnitude better than human DNA, and with better error correction you could probably get the error rate low enough that less than one error happens in the expected number of nanobots that will ever exist. ECC is not the kind of checking for which the checking process can be disabled, as the memory module always processes raw bits into error-corrected bits, which fails unless it matches some checksum which can be made astronomically unlikely to happen in a mutation.
I was expecting some math. Maybe something about the expected amount of work you can get out of an AI before it coups you, if you assume the number of actions required to coup is n, the trusted monitor has false positive rate p, etc?
I’m pretty skeptical of this because the analogy seems superficial. Thermodynamics says useful things about abstractions like “work” because we have the laws of thermodynamics. What are the analogous laws for cognitive work / optimization power? It’s not clear to me that it can be quantified such that it is easily accounted for:
We all come from evolution. Where did the cognitive work come from?
Algorithms can be copied
It is also not clear what distinguishes LLM weights from the weights of a model trained on random labels from a cryptographic PRNG. Since the labels are not truly random, they have the same amount of optimization done to them, but since CSPRNGs can’t be broken just by training LLMs on them, the latter model is totally useless while the former is potentially transformative.
My guess is this way of looking at things will be like memetics in relation to genetics: likely to spawn one or two useful expressions like “memetically fit”, but due to the inherent lack of structure in memes compared to DNA life, not a real field compared to other ways of measuring AIs and their effects (scaling laws? SLT?). Hope I’m wrong.
Maybe we’ll see the Go version of Leela give nine stones to pros soon? Or 20 stones to normal players?
Whether or not it would happen by default, this would be the single most useful LW feature for me. I’m often really unsure whether a post will get enough attention to be worth making it a longform, and sometimes even post shortforms like “comment if you want this to be a longform”.
I thought it would be linearity of expectation.
One day, the North Wind and the Sun argued about which of them was the strongest. Abadar, the god of commerce and civilization, stopped to observe their dispute. “Why don’t we settle this fairly?” he suggested. “Let us see who can compel that traveler on the road below to remove his cloak.”
The North Wind agreed, and with a mighty gust, he began his effort. The man, feeling the bitter chill, clutched his cloak tightly around him and even pulled it over his head to protect himself from the relentless wind. After a time, the North Wind gave up, frustrated.
Then the Sun tried his turn. Beaming warmly from the heavens, the Sun caused the air to grow pleasant and balmy. The man, feeling the growing heat, loosened his cloak and eventually took it off in the heat, resting under the shade of a tree. The Sun began to declare victory, but as soon as he turned away, the man put on the cloak again.
The god of commerce then approached the traveler and bought the cloak for five gold coins. The traveler tucked the money away and continued on his way, unbothered by either wind or heat. He soon bought a new cloak and invested the remainder in an index fund. The returns were steady, and in time the man prospered far beyond the value of his simple cloak, while the cloak was Abadar’s permanently.
Commerce, when conducted wisely, can accomplish what neither force nor gentle persuasion alone can achieve, and with minimal deadweight loss.
The thought experiment is not about the idea that your VNM utility could theoretically be doubled, but instead about rejecting diminishing returns to actual matter and energy in the universe. SBF said he would flip with a 51% chance of doubling the universe’s size (or creating a duplicate universe) and a 49% chance of destroying the current universe. Taking this bet requires a stronger commitment to utilitarianism than most people are comfortable with; your utility needs to be linear in matter and energy. You must be the kind of person that would take a 0.001% chance of colonizing the universe over a 100% chance of colonizing merely a thousand galaxies. SBF also said he would flip repeatedly, indicating that he didn’t believe in any sort of bound to utility.
This is not necessarily crazy—I think Nate Soares has a similar belief—but it’s philosophically fraught. You need to contend with the unbounded utility paradoxes, and also philosophical issues: what if consciousness is information patterns that become redundant when duplicated, so that only the first universe “counts” morally?
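To make the repeated-flip case concrete, here is a minimal sketch. It assumes each flip independently doubles the value with probability 0.51 and destroys everything otherwise, and that utility is linear in that value; the function name is illustrative.

```python
def repeated_flip(n_flips: int, p_win: float = 0.51, multiplier: float = 2.0):
    """Return (survival probability, expected value) after n_flips consecutive bets.

    Assumes a single loss destroys everything (value 0) and each win multiplies
    the value by `multiplier`, starting from a value of 1 universe.
    """
    survival = p_win ** n_flips
    expected_value = (p_win * multiplier) ** n_flips
    return survival, expected_value

for n in (1, 10, 100):
    survival, ev = repeated_flip(n)
    print(f"flips={n:>3}  P(survive)={survival:.3g}  E[value]={ev:.3g}")
```

Under these assumptions the expected value grows by a factor of 1.02 per flip while the survival probability shrinks geometrically, which is why accepting the bet repeatedly only makes sense with unbounded, linear utility.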
For context, I just trialed at METR and talked to various people there, but this take is my own.
I think further development of evals is likely to either get effective evals (informal upper bound on the future probability of catastrophe) or exciting negative results (“models do not follow reliable scaling laws, so AI development should be accordingly more cautious”).
The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.
1. In a world where elicitation quality has very reliable scaling laws, we would observe that there are diminishing returns to better scaffolds. Elicitation quality is predictable, ideally an additive term on top of model quality, but more likely requiring some more information about the model. It is rare to ever discover a new scaffold that can 2x the performance of an already well-tested model.
2. In a world where elicitation quality is not reliably modelable, we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.
It is NOT KNOWN which world we are in (worst-case assumptions would put us in 2, though I’m optimistic we’re closer to 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don’t seem to be in World 2 either, there are endless tricks to add that make evals more thorough, some of which are already being used, like evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
What’s the most important technical question in AI safety right now?
Yes, lots of socioeconomic problems have been solved on a 5 to 10 year timescale.
I also disagree that problems will become moot after the singularity unless it kills everyone—the US has a good chance of continuing to exist, and improving democracy will probably make AI go slightly better.
I mention exactly this in paragraph 3.
The new font doesn’t have a few characters useful in IPA.
The CATXOKLA population is higher than the current swing state population, so it would arguably be a little less unfair overall. Also there’s the potential for a catchy pronunciation like /kæ′tʃoʊklə/.
How would we know?