Engineer at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
It’s likely possible to engineer away mutations just by checking. ECC memory already has an error rate nine orders of magnitude better than human DNA, and with better error correction you could probably get the error rate low enough that fewer than one error is expected across all the nanobots that will ever exist. ECC is not the kind of checking that can be disabled: the memory module always decodes raw bits into error-corrected bits, and decoding fails unless the data matches a checksum that a random mutation is astronomically unlikely to satisfy.
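To give a sense of the arithmetic (a back-of-the-envelope sketch; the copy count, raw error rate, and the random-function model of the checksum are all assumed numbers, not a design):

```python
import math

# Rough sketch: a random corruption slips past a k-bit checksum with
# probability ~2**-k (treating the checksum as a random function), so we can
# ask how many checksum bits keep the expected number of undetected mutations
# below one across every copy event that will ever happen.

copy_operations = 10**40      # assumed total genome-copy events across all nanobots, ever
per_copy_error_rate = 1e-18   # assumed raw error rate per copy, before checking

raw_errors = copy_operations * per_copy_error_rate  # errors the checksum must catch

# Want raw_errors * 2**-k < 1, i.e. k > log2(raw_errors).
k_bits = math.ceil(math.log2(raw_errors)) + 1
print(f"expected raw errors ever: {raw_errors:.1e}")
print(f"checksum bits for <1 expected undetected mutation: {k_bits}")
```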
I was expecting some math. Maybe something about the expected amount of work you can get out of an AI before it coups you, if you assume the number of actions required to coup is n, the trusted monitor has false positive rate p, etc?
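For example, something like the following toy model (my own sketch, with made-up parameters, not anything from the post):

```python
# Toy model: a coup requires n malicious actions, each flagged independently by
# a trusted monitor with probability q; benign actions are flagged with false
# positive rate p, and every flag costs one human audit. Assume a caught
# attempt just resets the game (the AI is retrained/replaced and keeps working).

n = 20      # assumed number of actions needed for a successful coup
q = 0.3     # assumed monitor true-positive rate per malicious action
p = 0.01    # assumed monitor false-positive rate per benign action
N = 10_000  # assumed useful actions per deployment before a coup is attempted

p_coup_succeeds = (1 - q) ** n  # all n malicious actions slip past the monitor

print(f"P(a single coup attempt succeeds) = {p_coup_succeeds:.2e}")
# Geometric waiting time: ~1/p_coup_succeeds attempts until one succeeds,
# with ~N useful actions extracted before each attempt.
print(f"expected useful actions before a successful coup ~ {N / p_coup_succeeds:.2e}")
print(f"expected audits per useful action = {p}")
```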
I’m pretty skeptical of this because the analogy seems superficial. Thermodynamics says useful things about abstractions like “work” because we have the laws of thermodynamics. What are the analogous laws for cognitive work / optimization power? It’s not clear to me that it can be quantified such that it is easily accounted for:
We all come from evolution. Where did the cognitive work come from?
Algorithms can be copied
It is also not clear what distinguishes LLM weights from the weights of a model trained on random-looking labels generated by a cryptographic PRNG. Since the labels are not truly random, the same amount of optimization has gone into both sets of weights, but because CSPRNGs can’t be broken just by training LLMs on them, the latter model is totally useless while the former is potentially transformative.
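Concretely, the setup I have in mind is something like this (a minimal sketch; the keyed hash is just a stand-in for the CSPRNG):

```python
import hashlib

# Labels are deterministic functions of the input, so training does the same
# amount of optimization as with real labels, but there is no learnable
# structure unless you can break the hash.

SECRET_KEY = b"fixed but unknown to the model"  # hypothetical key

def csprng_label(text: str) -> int:
    """Deterministic pseudorandom binary label for a training example."""
    digest = hashlib.sha256(SECRET_KEY + text.encode("utf-8")).digest()
    return digest[0] & 1

examples = ["the cat sat on the mat", "the dog sat on the mat"]
print([csprng_label(x) for x in examples])  # looks random, is fully determined
```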
My guess is this way of looking at things will be like memetics in relation to genetics: likely to spawn one or two useful expressions like “memetically fit”, but due to the inherent lack of structure in memes compared to DNA life, not a real field compared to other ways of measuring AIs and their effects (scaling laws? SLT?). Hope I’m wrong.
Maybe we’ll see the Go version of Leela give nine stones to pros soon? Or 20 stones to normal players?
Whether or not it would happen by default, this would be the single most useful LW feature for me. I’m often really unsure whether a post will get enough attention to be worth making it a longform, and sometimes even post shortforms like “comment if you want this to be a longform”.
I thought it would be linearity of expectation.
One day, the North Wind and the Sun argued about which of them was the strongest. Abadar, the god of commerce and civilization, stopped to observe their dispute. “Why don’t we settle this fairly?” he suggested. “Let us see who can compel that traveler on the road below to remove his cloak.”
The North Wind agreed, and with a mighty gust, he began his effort. The man, feeling the bitter chill, clutched his cloak tightly around him and even pulled it over his head to protect himself from the relentless wind. After a time, the North Wind gave up, frustrated.
Then the Sun took his turn. Beaming warmly from the heavens, the Sun caused the air to grow pleasant and balmy. The man, feeling the growing heat, loosened his cloak and eventually took it off, resting in the shade of a tree. The Sun began to declare victory, but as soon as he turned away, the man put on the cloak again.
The god of commerce then approached the traveler and bought the cloak for five gold coins. The traveler tucked the money away and continued on his way, unbothered by either wind or heat. He soon bought a new cloak and invested the remainder in an index fund. The returns were steady, and in time the man prospered far beyond the value of his simple cloak, while the cloak was Abadar’s permanently.
Commerce, when conducted wisely, can accomplish what neither force nor gentle persuasion alone can achieve, and with minimal deadweight loss.
The thought experiment is not about the idea that your VNM utility could theoretically be doubled, but instead about rejecting diminishing returns to actual matter and energy in the universe. SBF said he would flip a coin with a 51% chance of doubling the universe’s size (or creating a duplicate universe) and a 49% chance of destroying the current universe. Taking this bet requires a stronger commitment to utilitarianism than most people are comfortable with; your utility needs to be linear in matter and energy. You must be the kind of person who would take a 0.001% chance of colonizing the universe over a 100% chance of colonizing merely a thousand galaxies. SBF also said he would flip repeatedly, indicating that he didn’t believe in any sort of bound to utility.
This is not necessarily crazy—I think Nate Soares has a similar belief—but it’s philosophically fraught. You need to contend with unbounded-utility paradoxes, as well as issues like: what if consciousness consists of information patterns that become redundant when duplicated, so that only the first universe “counts” morally?
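To spell out the arithmetic under linear utility (a toy calculation; the number of flips is arbitrary):

```python
# Expected value of the double-or-nothing bet for an agent whose utility is
# exactly linear in the amount of universe it gets.

U = 1.0  # utility of the current universe (arbitrary units)

ev_one_flip = 0.51 * (2 * U) + 0.49 * 0.0
print(ev_one_flip > U)  # True: the linear-utility agent takes the bet...

# ...and keeps taking it: after k flips the expected utility is 1.02**k * U,
# while the probability that anything is left is only 0.51**k.
k = 50
print(f"EV after {k} flips: {1.02**k * U:.2f}")
print(f"P(universe still exists): {0.51**k:.1e}")
```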
For context, I just trialed at METR and talked to various people there, but this take is my own.
I think further development of evals is likely to either get effective evals (informal upper bound on the future probability of catastrophe) or exciting negative results (“models do not follow reliable scaling laws, so AI development should be accordingly more cautious”).
The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.
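Concretely, the kind of fit I have in mind looks something like this (a sketch; the functional form, variable choices, and numbers are placeholders, not a worked-out proposal):

```python
import numpy as np
from scipy.optimize import curve_fit

# Model benchmark performance as a power law in pretraining compute plus an
# additive elicitation-quality term (one of the simplest forms consistent with
# "elicitation as an additive term on top of model quality").
def scaling_law(X, a, b, c):
    compute, elicitation = X
    return a * (compute / 1e22) ** b + c * elicitation

# Placeholder observations: (pretraining FLOP, elicitation score) -> benchmark score
compute = np.array([1e22, 1e23, 1e24, 1e25])
elicitation = np.array([0.2, 0.5, 0.5, 0.8])
score = np.array([0.31, 0.42, 0.55, 0.71])

params, _ = curve_fit(scaling_law, (compute, elicitation), score, p0=[0.3, 0.1, 0.1])
a, b, c = params
print(f"fit: score ≈ {a:.2f} · (compute/1e22)^{b:.2f} + {c:.2f} · elicitation")

# Extrapolate to a hypothetical next-generation run with strong elicitation.
print("predicted:", scaling_law((1e26, 1.0), a, b, c))
```

The same form could be extended with RL compute and thinking time as additional terms, and a catastrophe-propensity eval could stand in for the benchmark score.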
In a world where elicitation quality has very reliable scaling laws (call it World 1), we would observe diminishing returns to better scaffolds. Elicitation quality would be predictable: ideally an additive term on top of model quality, but more likely requiring some additional information about the model. It would be rare to discover a new scaffold that can 2x the performance of an already well-tested model.
In a world where elicitation quality is not reliably modelable (World 2), we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making the error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.
It is NOT KNOWN which world we are in (worst-case assumptions would put us in World 2, though I’m optimistic we’re closer to World 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don’t seem to be in World 2 either, there are endless tricks to add that make evals more thorough, some of which are already being used, like evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.
What’s the most important technical question in AI safety right now?
Yes, lots of socioeconomic problems have been solved on a 5 to 10 year timescale.
I also disagree that problems will become moot after the singularity unless it kills everyone—the US has a good chance of continuing to exist, and improving democracy will probably make AI go slightly better.
I mention exactly this in paragraph 3.
The new font doesn’t have a few characters useful in IPA.
The CATXOKLA population is higher than the current swing state population, so it would arguably be a little less unfair overall. Also there’s the potential for a catchy pronunciation like /kæˈtʃoʊklə/.
Knowing now that he had an edge, I feel like his execution strategy was suspect. The Polymarket price went from 66c while his orders were going in back down to 57c over the five days before the election. He could have extracted a bit more money from the market if he had forecast the volume correctly and traded against it proportionally.
I think it would be better to form a big winner-take-all bloc. With proportional voting, the number of electoral votes at stake will be only a small fraction of the total, so the per-voter influence of CA and TX would probably remain below the national average.
A third approach to superbabies: physically stick >10 infant human brains together while they are developing, so they form a single individual with >10x as many neocortical neurons as the average human. Forget +7sd; extrapolation would suggest >100sd intelligence.
Even better, we could find some way of networking brains together into supercomputers using configurable software. This would reduce potential health problems and also allow us to harvest their waste energy. Though we would have to craft a simulated reality to distract the non-useful conscious parts of the computational substrate, perhaps modeled on the year 1999...
In many respects, I expect this to be closer to what actually happens than “everyone falls over dead in the same second” or “we definitively solve value alignment”. Multipolar worlds, AI that generally follows the law (when operators want it to, and modulo an increasing number of loopholes) but cannot fully be trusted, and generally muddling through are the default future. I’m hoping we don’t get instrumental survival drives though.
Claim 2: The world has strong defense mechanisms against (structural) power-seeking.
I disagree with this claim. It seems pretty clear that the world has defense mechanisms against
disempowering other people or groups
breaking norms in the pursuit of power
But it is possible to be power-seeking in other ways. The Gates Foundation has a lot of money and wants other billionaires’ money for its cause too. It influences technology development. It has to work with dozens of governments, sometimes lobbying them. Normal think tanks exist to gain influence over governments. Harvard University, Jane Street, and Goldman Sachs recruit more elite students than all the EA groups and control more money than OpenPhil. Jane Street and Goldman Sachs guard private information worth billions of dollars. The only one with a negative reputation is Goldman Sachs, which is due to perceived greed rather than power-seeking per se. So why is there so much more backlash against AI safety? I think it basically comes down to a few factors:
We are bending norms (billionaire funding for somewhat nebulous causes) and sometimes breaking them (FTX financial and campaign finance crimes)
We are not able to credibly signal that we won’t disempower others.
MIRI wanted a pivotal act to happen, and under that plan nothing would stop MIRI from being world dictators
AI is inherently a technology with world-changing military and economic applications whose governance is unsolved
An explicitly consequentialist movement will take power by any means necessary, and people are afraid of that.
AI labs have incentives to safetywash, making people wary of safety messaging.
The preexisting AI ethics and open-source movements think their cause is more important and x-risk is stealing attention.
AI safety people are bad at diplomacy and communication, leading to perceptions that they’re the same as the AI labs or have some other sinister motivation.
That said, I basically agree with section 3. Legitimacy and competence are very important. But we should not confuse power-seeking—something the world has no opinion on—with what actually causes backlash.
I was at the NeurIPS many-shot jailbreaking poster today and heard that defenses only shift the attack success curve downwards, rather than changing the power law exponent. How does the power law exponent of BoN jailbreaking compare to many-shot, and are there defenses that change the power law exponent here?
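To be explicit about what I mean by shifting the curve down vs. changing the exponent (toy numbers, not either paper’s actual fit):

```python
import numpy as np

# Treat attack success rate as roughly ASR(n) ~ a * n**b in the number of
# attempts/shots n. On a log-log plot, a defense that only changes a shifts the
# line down; one that changes b flattens it, which matters far more as n grows.

n = np.array([10, 100, 1_000, 10_000, 100_000], dtype=float)

def asr(n, a, b):
    return a * n**b

print("no defense:       ", np.round(asr(n, 1e-3, 0.50), 4))
print("prefactor shifted: ", np.round(asr(n, 3e-4, 0.50), 4))  # same exponent, ~3x lower
print("exponent changed:  ", np.round(asr(n, 1e-3, 0.25), 4))  # flatter power law
```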