My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, and machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.
David James
Hopefully, you have gained some intuition for why Shapley values are “fair” and why they account for interactions among players.
The article fails to make a key point: in political economy and game theory, there are many definitions of “fairness” that seem plausible at face value, especially when considered one at a time. Even if one puts normative questions to the side, there are mathematical limits and constraints as one tries to satisfy various combinations simultaneously. Keeping this in mind, if you treat fairness as a design problem, it takes some care to choose metrics that reinforce the desired set of norms.
Should the bill have been signed, it would have created pressures severe enough to do more with less, focusing on building better and better abstractions once the limits are hit.
Ok, I see the argument. But even without such legislation, the costs of large training runs create major incentives to build better abstractions.
Does this summary capture the core argument? Physical constraints on the human brain contributed to its success relative to other animals, because it had to “do more with less” by using abstraction. Analogously, constraints on AI compute or size will encourage more abstraction, increasing the likelihood of “foom” danger.
Though I’m reasonably sure Llama license (sic) isn’t preventing viewing the source
This is technically correct but irrelevant. Meta doesn’t provide any source code, by which I mean the full set of precursor steps (including the data and how to process it).
Generally speaking, a license defines usage rights; it says nothing about whether or how the thing (e.g. source code) is made available.
As a weird example, one could publish a repository with a license but no source code. This would be odd. The license would have no power to mandate the code be released; that is a separate concern.
To put it another way, a license does not obligate the owner to release or share anything, whether it be compiled software, source code, weights, etc. A license simply outlines the conditions under which the thing (e.g. source code), once released, can be used or modified.
The paper AI Control: Improving Safety Despite Intentional Subversion is a practical, important step in the right direction. It demonstrates various protocols that aim for safety even when facing malicious models that know they are suspected of being dangerous.
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger Proceedings of the 41st International Conference on Machine Learning, PMLR 235:16295-16336, 2024.
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. To do so, safety measures either aim at making LLMs try to avoid harmful outcomes or aim at preventing LLMs from causing harmful outcomes, even if they try to cause them. In this paper, we focus on this second layer of defense. We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion—an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines.
Related Video by Robert Miles: I highly recommend Using Dangerous AI, But Safely? released on Nov. 15, 2024.
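To make this concrete, here is a minimal sketch of the “trusted monitoring” pattern described in the abstract: an untrusted model proposes code, a trusted model scores how suspicious it looks, and only the most suspicious submissions consume the limited auditing budget. All names and the threshold are placeholders I chose, not the paper’s API.

```python
# Hedged sketch of a "trusted monitoring" control protocol (illustrative only).
# untrusted_solve, trusted_suspicion_score, and human_audit are placeholders.

def control_protocol(problem,
                     untrusted_solve,           # powerful but possibly subversive model
                     trusted_suspicion_score,   # weaker, trusted monitor (returns 0..1)
                     human_audit,               # scarce, high-quality trusted labor
                     audit_threshold=0.9):
    candidate = untrusted_solve(problem)
    suspicion = trusted_suspicion_score(problem, candidate)
    if suspicion >= audit_threshold:
        # Spend the limited auditing budget only on the most suspicious outputs.
        return human_audit(problem, candidate)
    return candidate
```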
From Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever’s Recent Claims:
But what we seem to be seeing is a bit different from deep learning broadly hitting a wall. More specifically it appears to be: returns to scaling up model pretraining are plateauing.
I agree, but I’m not sure how durable this agreement will be. (I reversed my position while drafting this comment.)
Here is my one-sentence summary of the argument above: if Omega can make a fully accurate prediction in a universe without backwards causality, this implies a deterministic universe.
The Commission recommends: [...] 1. Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence (AGI) capability.
As mentioned above, the choice of Manhattan Project instead of Apollo Project is glaring.
Worse, there is zero mention of AI safety, AI alignment, or AI evaluation in the Recommendations document.
Lest you think I’m expecting too much, the report does talk about safety, alignment, and evaluation … for non-AI topic areas! (see bolded words below: “safety”, “aligning”, “evaluate”)
“Congress direct the U.S. Government Accountability Office to investigate the reliability of safety testing certifications for consumer products and medical devices imported from China.” (page 736)
“Congress direct the Administration to create an Outbound Investment Office within the executive branch to oversee investments into countries of concern, including China. The office should have a dedicated staff and appropriated resources and be tasked with: [...] Expanding the list of covered sectors with the goal of aligning outbound investment restrictions with export controls.” (page 737)
“Congress direct the U.S. Department of the Treasury, in coordination with the U.S. Departments of State and Commerce, to provide the relevant congressional committees a report assessing the ability of U.S. and foreign financial institutions operating in Hong Kong to identify and prevent transactions that facilitate the transfer of products, technology, and money to Russia, Iran, and other sanctioned countries and entities in violation of U.S. export controls, financial sanctions, and related rules. The report should [...] Evaluate the extent of Hong Kong’s role in facilitating the transfer of products and technologies to Russia, Iran, other adversary countries, and the Mainland, which are prohibited by export controls from being transferred to such countries;” (page 741)
I am not following the context of the comment above. Can you help me understand the connection? The main purpose of my comment above was to disagree with this sentence two levels up:
The frenzy to couple everything into a single tangle of complexity is driven by the misunderstanding that complacency is the only reason why your ideology is not the winning one
… in particular, I don’t think it captures the dominant driver of “coupling” or “bundling”.
Does the comment one level up above disagree with my claims? I’m not following the connection.
The frenzy to couple everything into a single tangle of complexity is driven by…
In some cases, yes, but this is only one factor of many. Others include:
- Our brains are often drawn to narratives, which are complex and interwoven. Hence the tendency to bundle up complex logical interdependencies into a narrative.
- Our social structures are guided/constrained by our physical nature and technology. For in-person gatherings, bundling of ideas is often a dominant strategy.
For example, imagine a highly unusual congregation: a large unified gathering of monotheistic worshippers with considerable internal diversity. Rather than “one track” consisting of shared ideology, they subdivide their readings and rituals into many subgroups. Why don’t we see much of this (if any) in the real world? Because ideological bundling often pairs well with particular ways of gathering.
P.S. I personally welcome gathering styles that promote both community and rationality (spanning a diversity of experiences and values).
Right. Some such agreements are often called social contracts. One catch is that a person born into them may not understand their historical origin or practical utility, much less agree with them.
Durable institutions find ways to survive. I don’t mean survival merely in terms of legal continuity; I mean fidelity to their founding charter. Institutions not only have to survive past their first leader; they also have to survive their first leader themselves! The institution’s structure and policies must protect against the leader’s meandering attention, whims, and potential corruption. In Musk’s case, given his mercurial history, I would not bet that he would agree to the requisite policies.
they weren’t designed to be ultra-robust to exploitation, or to make serious attempts to assess properties like truth, accuracy, coherence, usefulness, justice
There are notable differences between these properties. Usefulness and justice are quite different from the others (truth, accuracy, coherence). Usefulness (defined as suitability for a purpose, which is non-prescriptive as to the underlying norms) is different from justice (defined by some normative ideal). Coherence requires fewer commitments than truth and accuracy.
Ergo, I could see various instantiations of a library designed to satisfy different levels: Level 1 would value coherence; Level 2 would add truth and accuracy; Level 3, usefulness; Level 4, justice.
I like having a list of small, useful things to do that tend to pay off in the long run, like:
- go to the grocery store to make sure you have fresh fruits and vegetables
- meditate for 10 minutes
- do pushups and sit-ups
- journal for 10 minutes
When my brain feels cluttered, it is nice to have a list of time-boxed simple tasks that don’t require planning or assessment.
Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.
Such a proof would be subject to its definition of “mechanical picking” and a sufficiently accurate physics model. (For example, would an electronically-controllable key-looking object with adjustable key-cut depths with pressure sensors qualify as a “pick”?)
I don’t dispute the value of formal proofs for safety. If accomplished, they move the conversation to “is the proof correct?” and “are we proving the right thing?”. Both are steps in the right direction, I think.
Thanks for the references; I’ll need some time to review them. In the meanwhile, I’ll make some quick responses.
As a side note, I’m not sure how tree search comes into play; in what way does tree search require unbounded steps that doesn’t apply equally to linear search?
I intended tree search as just one example, since minimax tree search is a common example for game-based RL research.
No finite agent, recursive or otherwise, can plan over an unbounded number of steps in finite time...
In general, I agree. Though there are notable exceptions for cases such as (not mutually exclusive):
- a closed-form solution is found (for example, where a time-based simulation can calculate some quantity at any arbitrary time step using the same amount of computation); see the sketch after this list
- approximate solutions using a fixed number of computation steps are viable
- a greedy algorithm can select the immediate next action that is equivalent to following a longer-term planning algorithm
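As a toy illustration of the closed-form case (my own example, not one from the discussion above): the value of an infinite stream of constant discounted rewards can be computed in a single step rather than by iterating over time steps.

```python
# Toy illustration: an "unbounded horizon" quantity computed two ways.
# Iterating approximates the infinite sum; the closed form needs one step.

def discounted_value_iterative(reward, gamma, n_steps):
    """Approximate sum of reward * gamma**t for t = 0..n_steps-1."""
    return sum(reward * gamma**t for t in range(n_steps))

def discounted_value_closed_form(reward, gamma):
    """Exact infinite-horizon value via the geometric series: r / (1 - gamma)."""
    return reward / (1.0 - gamma)

print(discounted_value_iterative(1.0, 0.9, 1000))  # approaches 10.0
print(discounted_value_closed_form(1.0, 0.9))      # exactly 10.0, no iteration
```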
… so it’s not immediately clear to me how iteration/recursion is fundamentally different in practice.
Yes, like I said above, I agree in general and see your point.
As I’m confident we both know, some algorithms can be written more compactly when recursion/iteration is available. I don’t know how much the theory of computation says about this, i.e., which classes of problems this applies to and why. My intuitive guess is that it is conceptually related to my earlier point about closed-form solutions.
Note that this is different from the (also very interesting) question of what LLMs, or the transformer architecture, are capable of accomplishing in a single forward pass. Here we’re talking about what they can do under typical auto-regressive conditions like chat.
I would appreciate it if the community here could point me to research that agrees or disagrees with my claim and conclusions below.
Claim: one pass through a transformer (of a given size) can only do a finite number of reasoning steps. Therefore: if we want an agent that can plan over an unbounded number of steps (e.g. one that does tree search), it will need some component that can perform an arbitrary number of iterative or recursive steps.
Sub-claim: The above claim does not conflict with the Universal Approximation Theorem.
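Here is a toy sketch of the distinction I have in mind (not a real transformer; all names are placeholders): a single “forward pass” applies a fixed number of steps determined by the architecture, whereas an outer loop around it can run an input-dependent, a priori unbounded number of iterations.

```python
# Toy sketch: bounded computation per "forward pass" vs. an unbounded outer loop.

def forward_pass(state, num_layers=12):
    """Stand-in for one transformer pass: a fixed, bounded number of steps."""
    for _ in range(num_layers):   # bounded by the architecture, not by the input
        state = state + 1         # placeholder for one layer of computation
    return state

def iterative_controller(state, target):
    """Outer loop: repeats forward passes until a condition is met.
    The number of iterations is not bounded a priori; it depends on the input."""
    steps = 0
    while state < target:         # could run arbitrarily long
        state = forward_pass(state)
        steps += 1
    return state, steps

print(iterative_controller(0, 100))  # (108, 9): nine outer iterations of a 12-step pass
```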
Claim: the degree to which the future is hard to predict has no bearing on the outer alignment problem.
If one is a consequentialist (of some flavor), one can still construct a “desirability tree” over various possible future states. Sure, the uncertainty makes the problem more complex in practice, but the algorithm is still very simple. So I don’t think that a more complex universe intrinsically has anything to do with alignment per se.
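Here is a minimal sketch of what I mean by a “desirability tree” (names and numbers are illustrative): terminal states carry valuations, internal nodes carry probabilities over successors, and evaluation is just a probability-weighted average. Uncertainty makes the tree bushier, but it does not make the algorithm more complicated.

```python
# Hedged sketch: expected-utility evaluation over a tree of uncertain futures.

from dataclasses import dataclass, field

@dataclass
class Outcome:
    utility: float = 0.0                           # valuation of a terminal state
    children: list = field(default_factory=list)   # list of (probability, Outcome)

def evaluate(node: Outcome) -> float:
    """Terminal nodes return their valuation; internal nodes return the
    probability-weighted average of their children's values."""
    if not node.children:
        return node.utility
    return sum(p * evaluate(child) for p, child in node.children)

tree = Outcome(children=[
    (0.7, Outcome(utility=10.0)),
    (0.3, Outcome(children=[(0.5, Outcome(utility=-5.0)),
                            (0.5, Outcome(utility=2.0))])),
])
print(evaluate(tree))  # 0.7*10 + 0.3*(0.5*-5 + 0.5*2) = 6.55
```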
Arguably, machines will have better computational ability to reason over a vast number of future states. In this sense, they will be more ethical according to consequentialism, provided their valuation of terminal states is aligned.
To be clear, of course, alignment w.r.t. the valuation of terminal states is important. But I don’t think this has anything to do with a harder-to-predict universe. All we do with consequentialism is evaluate a particular terminal state. The complexity of how we got there doesn’t matter.
(If you are detecting that I have doubts about the goodness and practicality of consequentialism, you would be right, but I don’t think this is central to the argument here.)
If humans don’t really carry out consequentialism like we hope they would (and surely humans are not rational enough to adhere to consequentialist ethics—perhaps not even in principle!), we can’t blame this on outer alignment, can we? This would be better described as goal misspecification.
If one subscribes to deontological ethics, then the problem becomes even easier. Why? One wouldn’t have to reason probabilistically over various future states at all. The goodness of an action only has to do with the nature of the action itself.
Do you want to discuss some other kind of ethics? Is there some other flavor that would operate differentially w.r.t. outer alignment in a more versus less predictable universe?
To clarify: the claim is that the Shapley value is the only allocation rule that satisfies all four properties simultaneously: {Efficiency, Symmetry, Linearity, Null player}. Other allocation rules can satisfy proper subsets of these properties.
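For concreteness, here is a small sketch (the two-player game is my own example) that computes exact Shapley values by averaging each player’s marginal contribution over all join orders; Efficiency and Symmetry can be checked directly on the output.

```python
# Hedged sketch: exact Shapley values for a tiny cooperative game.

from itertools import permutations

def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v(with_p) - v(coalition)   # marginal contribution of p
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Example game: A and B each produce 1 alone, but 3 together (synergy).
def v(coalition):
    if coalition == frozenset({"A", "B"}):
        return 3.0
    return 1.0 if coalition else 0.0

phi = shapley_values(["A", "B"], v)
print(phi)                                             # {'A': 1.5, 'B': 1.5} (Symmetry)
print(sum(phi.values()) == v(frozenset({"A", "B"})))   # True (Efficiency)
```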