CS PhD student
Abhimanyu Pallavi Sudhir
Ways to think about alignment
Something that seems like it should be well-known, but I have not seen an explicit reference for:
Goodhart’s law can, in principle, be overcome via adversarial training (or, more generally, learning in multi-agent systems)
—aka “The enemy is smart.”
Goodhart’s law only really applies to a “static” objective, not when the objective is the outcome of a game with other agents who can adapt.
This doesn’t really require the other agents to act in a way that continuously “improves” the training objective either; it just requires them to be able to keep throwing adversarial examples at the agent, forcing it to “generalize”.
In particular, I think this is the basic reason why any reasonable Scalable Oversight protocol would be fundamentally “multi-agent” in nature (like Debate).
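A minimal toy sketch of this dynamic (the reward functions and the patching adversary below are made up for illustration, not any actual oversight protocol): an agent that best-responds to a static proxy Goodharts it, but an adversary that keeps patching whatever region the agent exploits turns the proxy into a moving target and pushes the agent toward the true objective.

```python
import random

# Toy sketch (hypothetical functions, not a real protocol): an agent
# best-responds to a proxy reward. Against a *static* proxy it Goodharts;
# against an adversary that keeps patching whatever region it exploits,
# the proxy becomes a moving target and the agent drifts toward the truth.

def true_reward(x):
    return -abs(x - 3)                 # what we actually want: x near 3

def proxy_reward(x, patches):
    if any(lo <= x <= hi for lo, hi in patches):
        return true_reward(x)          # adversary has fixed this region
    return x                           # unpatched proxy: "bigger is better"

def agent_best_response(patches, candidates):
    return max(candidates, key=lambda x: proxy_reward(x, patches))

def is_exploit(x, patches):
    # adversarial example: proxy and truth disagree a lot at the chosen point
    return proxy_reward(x, patches) - true_reward(x) > 1

random.seed(0)
candidates = [random.uniform(0, 10) for _ in range(1000)]
patches = []
for t in range(12):
    x = agent_best_response(patches, candidates)
    print(t, round(x, 2), round(true_reward(x), 2))
    if is_exploit(x, patches):
        patches.append((x - 1, x + 1))  # adversary patches the exploited region
```

Running it, the agent’s true reward climbs from around -7 toward 0 as the adversary closes off each exploited region, even though the agent only ever optimizes the proxy.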
Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets
I think only for particular reward functions, such as in multi-agent/co-operative environments (the agents can include humans, as in RLHF) or in genuinely interactive proving environments?
Yes, I also realized that “ideas” being a thing is due to bounded rationality—specifically they are the outputs of AI search. “Proofs” are weirder though, and I haven’t seen them distinguished very often. I wonder if this is a reasonable analogy to make:
Ideas : search
Answers : inference
Proofs : alignment
There is a cliche that there are two types of mathematicians: “theory developers” and “problem solvers”. Similarly, Robin Hanson divides the production of knowledge into “framing” and “filling”.
It seems to me there are actually three sorts of information in the world:
“Ideas”: math/science theories and models, inventions, business ideas, solutions to open-ended problems
“Answers”: math theorems, experimental observations, results of computations
“Proofs”: math proofs, arguments, evidence, digital signatures, certifications, reputations, signalling
From a strictly Bayesian perspective, there seems to be no “fundamental” difference between these forms of information. They’re all just things you condition your prior on. Yet this division seems to be natural in quite a variety of informational tasks. What gives?
Adding this from the replies for prominence:
Yes, I also realized that “ideas” being a thing is due to bounded rationality—specifically they are the outputs of AI search. “Proofs” are weirder though, and I haven’t seen them distinguished very often. I wonder if this is a reasonable analogy to make:
Ideas : search
Answers : inference
Proofs : alignment
Just realized that in logarithmic market scoring, the net number of stocks outstanding is basically just the log-odds, lol:
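For reference, here is the standard two-outcome LMSR calculation behind that observation (the liquidity parameter b and share counts q_1, q_2 are my notation, not from the original post):

```latex
% Two-outcome LMSR with liquidity parameter b and outstanding shares (q_1, q_2):
\begin{align*}
  C(q) &= b \ln\!\left(e^{q_1/b} + e^{q_2/b}\right),
  \qquad
  p_1 = \frac{e^{q_1/b}}{e^{q_1/b} + e^{q_2/b}}, \\
  \frac{p_1}{1 - p_1} &= e^{(q_1 - q_2)/b}
  \quad\Longrightarrow\quad
  q_1 - q_2 = b \ln\frac{p_1}{1 - p_1}.
\end{align*}
% So the net number of shares outstanding is b times the log-odds of the price.
```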
Inference-Only Debate Experiments Using Math Problems
Your claims about markets seem just wrong to me. Markets generally do what their consumers want, and their failures are largely the result of transaction costs. Some of these transaction costs have to do with information asymmetry (which needs to be solved), but many others that show up in the real world (related to standard problems like negative externalities etc.) can just be removed by construction in virtual markets.
Markets are fundamentally driven by the pursuit of defined rewards or currencies, so in such a system, how do we ensure that the currency being optimized for truly captures what we care about?
By having humans be the consumers in the market. Yes, it is possible to “trick” the consumers, but the idea is that if any oversight protocol is possible at all, then the consumers will naturally buy information from there, and AIs will learn to expect this changing reward function.
MIRI has been talking about it for years; the agent foundations group has many serious open problems related to it.
Can you send me a link? The only thing on “markets in an alignment context” I’ve found on this from the MIRI side is the Wentworth-Soares discussion, but that seems like a very different issue.
it can be confidently known now that the design you proposed is catastrophically misaligned
Can you send me a link for where this was confidently shown? This is a very strong claim to make; nobody even makes this claim in the context of backprop.
I don’t think that AI alignment people doing “enemy of enemy is friend” logic with AI luddites (i.e. people worried about Privacy/Racism/Artists/Misinformation/Jobs/Whatever) is useful.
Alignment research is a luxury good for labs, which means it would be the first thing axed (hyperbolically speaking) if you imposed generic hurdles/costs on their revenue, or if you made them spend on mitigating P/R/A/M/J/W problems.
This “crowding-out” effect is already happening to a very large extent: there are vastly more researchers and capital being devoted to P/R/A/M/J/W problems, which could have been allocated to actual alignment research! If you are forming a “coalition” with these people, you are getting a very shitty deal—they’ve been much more effective at getting their priorities funded than you have been!
If you want them to care about notkilleveryoneism, you have to specifically make it expensive for them to kill everyone, not just untargetedly “oppose” them, e.g. via foom liability.
Why aren’t adversarial inputs used more widely for captchas?
Different models have different adversarial examples?
There are only a few known adversarial examples for a given model (discovering new ones takes time), and they can easily just be manually enumerated?
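For concreteness, a minimal sketch of why adversarial examples are model-specific (toy PyTorch classifier and data, all hypothetical): the perturbation is computed from one particular model’s gradients, so it need not transfer to a different model, and fresh examples can be generated on demand rather than read off a fixed list.

```python
import torch
import torch.nn as nn

# Toy FGSM sketch (stand-in model and data, all hypothetical). The point:
# an adversarial example is computed against a *specific* model's gradients,
# so different models generally have different adversarial examples, and new
# ones can be generated on demand rather than enumerated from a fixed list.

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
x = torch.rand(1, 1, 28, 28)   # stand-in "captcha" image
label = torch.tensor([3])      # its true class

def fgsm(model, x, label, eps=0.03):
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), label)
    loss.backward()
    # step in the direction that most increases *this* model's loss
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

x_adv = fgsm(model, x, label)
print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))  # classes may now differ
```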
I have no idea what to make of the random stray downvotes
The simplest way to explain “the reward function isn’t the utility function” is: humans evolved to have utility functions because it was instrumentally useful for the reward function / evolution selected agents with utility functions.
(yeah I know maybe we don’t even have utility functions; that’s not the point)
Concretely: it was useful for humans to have feelings and desires, because that way evolution doesn’t have to spoonfeed us every last detail of how we should act; instead it gives us heuristics like “food smells good, I want”.
Evolution couldn’t just select a perfect optimizer of the reward function, because there is no such thing as a perfect optimizer (computational costs mean that a “perfect optimizer” is actually uncomputable). So instead it selected agents that were boundedly optimal given their training environment.
The use of “Differential Progress” (“does this advance safety more or capabilities more?”) by the AI safety community to evaluate the value of research is ill-motivated.
Most capabilities advancements are not very counterfactual (“some similar advancement would have happened anyway”), whereas safety research is. In other words: differential progress measures absolute rather than comparative advantage / disregards the impact of supply on value / measures value as the y-intercept of the demand curve rather than the intersection of the demand and supply curves.
Even if you looked at actual market value, just p_safety > p_capabilities isn’t a principled condition.
Concretely, I think that harping on differential progress risks AI safety getting crowded out by harmless but useless work: most obviously “AI bias” and “AI disinformation”, and, in my more controversial opinion, overtly prosaic AI safety research that will not give us any insights generalizable beyond current architectures. A serious solution to AI alignment will in all likelihood involve risky things like imagining more powerful architectures and revealing some deeper insights about intelligence.
I think EY once mentioned it in the context of self-awareness or free will or something, and called it something like “complete epistemological panic”.
The Kernel of Meaning in Property Rights
Abstraction is like economies of scale
One thing I’m surprised by is how everyone learns the canonical way to handwrite certain math characters, despite learning most things from printed or electronic material. E.g. writing ℝ as “IR” (the blackboard-bold strokes) rather than how it’s rendered. I know I learned the canonical way because of Khan Academy, but I don’t think “guy handwriting on a blackboard-like thing” is THAT disproportionately common among educational resources?
I don’t understand. The hard problem of alignment/CEV/etc. is that it’s not obvious how to scale intelligence while “maintaining” utility function/preferences, and this still applies for human intelligence amplification.
I suppose this is fine if the only improvement you can expect beyond human-level intelligence is “processing speed”, but I would expect superhuman AI to be more intelligent in a variety of ways.