My top interest is AI safety, followed by reinforcement learning. My professional background is in software engineering, computer science, machine learning. I have degrees in electrical engineering, liberal arts, and public policy. I currently live in the Washington, DC metro area; before that, I lived in Berkeley for about five years.
David James
I wonder if you underestimate the complexity of brokering, much less maintaining, a lasting peace, whether via superior persuasive abilities or vast economic resource advantages. If you are thinking more along the lines of domination so complete that any violent resistance seems minuscule and pointless, that’s a different category for me. When I think of “long-term peace”, I usually don’t think of simmering grudges that remain dormant because of a massive power imbalance. I will grant that perhaps the ultimate form of “persuasion” would involve removing even the mental possibility of resistance.
As I understand it, the phrase “passing the buck” often carries a sense of abdicating responsibility. I don’t think that is what the author means here. I would suggest alternative phrasings that convey delegating implementation according to some core principles, combined with the idea of passing the torch to more capable actors.
Note: this comment should not be taken to suggest that I necessarily agree or disagree with the article itself.
To clarify: the claim is that the Shapley value is the only attribution method guaranteed to satisfy all four properties simultaneously: Efficiency, Symmetry, Linearity, and Null player. Other metrics can satisfy proper subsets of these properties.
Hopefully, you have gained some intuition for why Shapley values are “fair” and why they account for interactions among players.
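For concreteness, here is a minimal sketch (using a hypothetical three-player game, not anything from the article) that computes Shapley values by averaging each player’s marginal contribution over all join orders. The output illustrates three of the four properties directly; Linearity would show up when combining two such games.

```python
from itertools import permutations

# Hypothetical three-player cooperative game (illustration only):
# v maps each coalition (a frozenset of players) to its value.
# Players 1 and 2 interact (together they create extra value); player 3 is a null player.
v = {
    frozenset(): 0,
    frozenset({1}): 10, frozenset({2}): 10, frozenset({3}): 0,
    frozenset({1, 2}): 30, frozenset({1, 3}): 10, frozenset({2, 3}): 10,
    frozenset({1, 2, 3}): 30,
}

def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            totals[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: totals[p] / len(orders) for p in players}

print(shapley_values([1, 2, 3], v))
# {1: 15.0, 2: 15.0, 3: 0.0}
# Efficiency: the shares sum to v({1, 2, 3}) = 30.
# Symmetry: players 1 and 2 are interchangeable, so they receive equal shares.
# Null player: player 3 adds nothing to any coalition, so its share is 0.
```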
The article fails to make a key point: in political economy and game theory, there are many definitions of “fairness” that seem plausible at face value, especially when considered one at a time. Even if one puts normative questions to the side, there are mathematical limits and constraints as one tries to satisfy various combinations simultaneously. Keeping these in mind, you can think of this as a design problem; it takes some care to choose metrics that reinforce some set of desired norms.
Had the bill been signed, it would have created pressure severe enough to “do more with less”, focusing effort on building better and better abstractions once the limits were hit.
Ok, I see the argument. But even without such legislation, the costs of large training runs create major incentives to build better abstractions.
Does this summary capture the core argument? Physical constraints on the human brain contributed to its success relative to other animals, because it had to “do more with less” by using abstraction. Analogously, constraints on AI compute or size will encourage more abstraction, increasing the likelihood of “foom” danger.
Though I’m reasonably sure Llama license (sic) isn’t preventing viewing the source
This is technically correct but irrelevant. Meta doesn’t provide any source code, by which I mean the full set of precursor steps (including the data and how to process it).
Generally speaking, a license defines usage rights; it says nothing about whether or how the thing (e.g. source code) is made available.
As a contrived example, one could publish a repository containing a license but no source code. The license would have no power to mandate that the code be released; that is a separate concern.
To put it another way, a license does not obligate the owner to release or share anything, whether it be compiled software, source code, weights, etc. A license simply outlines the conditions under which the thing (e.g. source code), once released, can be used or modified.
The paper AI Control: Improving Safety Despite Intentional Subversion is a practical, important step in the right direction. It demonstrates protocols that aim for safety even when models are intentionally subversive and know they are suspected of being dangerous.
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:16295-16336, 2024.
As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. To do so, safety measures either aim at making LLMs try to avoid harmful outcomes or aim at preventing LLMs from causing harmful outcomes, even if they try to cause them. In this paper, we focus on this second layer of defense. We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion—an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines.
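As a rough, hypothetical sketch (not the paper’s actual implementation), the following illustrates the shape of a trusted-monitoring protocol: the untrusted model proposes code, the trusted model scores suspicion, the most suspicious submissions are escalated to limited trusted labor, and the protocol falls back to the trusted model once the audit budget is exhausted. Every function below is a placeholder, not a real API.

```python
# Hypothetical sketch of a "trusted monitoring" control protocol, loosely inspired
# by the AI Control paper. All model/labor calls below are placeholders.

AUDIT_BUDGET = 0.02  # fraction of submissions we can afford to audit with trusted labor

def untrusted_solve(problem: str) -> str:
    """Placeholder: ask the powerful but untrusted model for a solution."""
    raise NotImplementedError

def trusted_solve(problem: str) -> str:
    """Placeholder: ask the weaker trusted model for a solution."""
    raise NotImplementedError

def trusted_suspicion_score(problem: str, code: str) -> float:
    """Placeholder: trusted model rates how suspicious the code looks, in [0, 1]."""
    raise NotImplementedError

def audit_with_trusted_labor(problem: str, code: str) -> bool:
    """Placeholder: expensive high-quality review; True means the code is judged safe."""
    raise NotImplementedError

def run_protocol(problems: list[str], suspicion_threshold: float = 0.9) -> list[str]:
    """Trusted monitoring with deferral: audit the most suspicious submissions,
    and defer to the trusted model once the audit budget runs out."""
    accepted = []
    audits_remaining = int(AUDIT_BUDGET * len(problems))
    for problem in problems:
        code = untrusted_solve(problem)
        if trusted_suspicion_score(problem, code) < suspicion_threshold:
            accepted.append(code)                    # looks benign: submit as-is
        elif audits_remaining > 0:
            audits_remaining -= 1
            if audit_with_trusted_labor(problem, code):
                accepted.append(code)                # audited and judged safe
            else:
                raise RuntimeError("Untrusted model caught submitting suspect code")
        else:
            accepted.append(trusted_solve(problem))  # out of budget: defer to trusted model
    return accepted
```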
Related Video by Robert Miles: I highly recommend Using Dangerous AI, But Safely? released on Nov. 15, 2024.
From Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever’s Recent Claims:
But what we seem to be seeing is a bit different from deep learning broadly hitting a wall. More specifically it appears to be: returns to scaling up model pretraining are plateauing.
I agree, but I’m not sure how durable this agreement will be. (I reversed my position while drafting this comment.)
Here is my one-sentence summary of the argument above: if Omega can make a fully accurate prediction in a universe without backwards causality, this implies a deterministic universe.
The Commission recommends: [...] 1. Congress establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence (AGI) capability.
As mentioned above, the choice of Manhattan Project instead of Apollo Project is glaring.
Worse, there is zero mention of AI safety, AI alignment, or AI evaluation in the Recommendations document.
Lest you think I’m expecting too much, the report does talk about safety, alignment, and evaluation … for non-AI topic areas! (see bolded words below: “safety”, “aligning”, “evaluate”)
“Congress direct the U.S. Government Accountability Office to investigate the reliability of safety testing certifications for consumer products and medical devices imported from China.” (page 736)
“Congress direct the Administration to create an Outbound Investment Office within the executive branch to oversee investments into countries of concern, including China. The office should have a dedicated staff and appropriated resources and be tasked with: [...] Expanding the list of covered sectors with the goal of aligning outbound investment restrictions with export controls.” (page 737)
“Congress direct the U.S. Department of the Treasury, in coordination with the U.S. Departments of State and Commerce, to provide the relevant congressional committees a report assessing the ability of U.S. and foreign financial institutions operating in Hong Kong to identify and prevent transactions that facilitate the transfer of products, technology, and money to Russia, Iran, and other sanctioned countries and entities in violation of U.S. export controls, financial sanctions, and related rules. The report should [...] Evaluate the extent of Hong Kong’s role in facilitating the transfer of products and technologies to Russia, Iran, other adversary countries, and the Mainland, which are prohibited by export controls from being transferred to such countries;” (page 741)
I am not following the context of the comment above. Could you help me understand the connection? The main purpose of my comment was to disagree with this sentence two levels up:
The frenzy to couple everything into a single tangle of complexity is driven by the misunderstanding that complacency is the only reason why your ideology is not the winning one
… in particular, I don’t think it captures the dominant driver of “coupling” or “bundling”.
Does the comment one level up disagree with my claims? I’m not following the connection.
The frenzy to couple everything into a single tangle of complexity is driven by…
In some cases, yes, but this is only one factor of many. Others include:
- Our brains are often drawn to narratives, which are complex and interwoven. Hence the tendency to bundle up complex logical interdependencies into a narrative.
- Our social structures are guided/constrained by our physical nature and technology. For in-person gatherings, bundling of ideas is often a dominant strategy.
For example, imagine a highly unusual congregation: a large unified gathering of monotheistic worshippers with considerable internal diversity. Rather than “one track” consisting of shared ideology, they subdivide their readings and rituals into many subgroups. Why don’t we see much of this (if any) in the real world? Because ideological bundling often pairs well with particular ways of gathering.
P.S. I personally welcome gathering styles that promote both community and rationality (spanning a diversity of experiences and values).
Right. Some such agreements are often called social contracts. One catch is that a person born into them may not understand their historical origin or practical utility, much less agree with them.
Durable institutions find ways to survive. I don’t mean survival merely in terms of legal continuity; I mean fidelity to their founding charter. Institutions not only have to survive past their first leader; they have to survive their first leader themself! The institution’s structure and policies must protect against the leader’s meandering attention, whims, and potential corruption. In Musk’s case, given his mercurial history, I would not bet on him agreeing to the requisite policies.
they weren’t designed to be ultra-robust to exploitation, or to make serious attempts to assess properties like truth, accuracy, coherence, usefulness, justice
There are notable differences among these properties. Usefulness and justice stand apart from the other three (truth, accuracy, coherence). Usefulness (suitability for a purpose, non-prescriptive as to the underlying norms) differs from justice (defined by some normative ideal). Coherence requires fewer commitments than truth or accuracy.
Ergo, I could see various instantiations of a library designed to satisfy different levels: Level 1 values coherence; Level 2 adds truth and accuracy; Level 3 adds usefulness; Level 4 adds justice.
I like having a list of small, useful things to do that tend to pay off in the long run, like:
- go to the grocery store to make sure you have fresh fruits and vegetables
- meditate for 10 minutes
- do push-ups and sit-ups
- journal for 10 minutes
When my brain feels cluttered, it is nice to have a list of time-boxed simple tasks that don’t require planning or assessment.
Verify human designs and automatically create AI-generated designs which provably cannot be opened by mechanical picking.
Such a proof would only be as strong as its definition of “mechanical picking” and the accuracy of its physics model. (For example, would an electronically controllable key-shaped object with adjustable cut depths and pressure sensors qualify as a “pick”?)
I don’t dispute the value of formal proofs for safety. If accomplished, they move the conversation to “is the proof correct?” and “are we proving the right thing?”. Both are steps in the right direction, I think.
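To make that concrete, here is a toy sketch (entirely hypothetical; not a real lock model or verification tool). The “theorem” is established by exhausting a formal space of picks, so any attack not representable in that space, such as the adjustable sensor-driven key above, falls outside the guarantee.

```python
from itertools import product

# Toy illustration only: a "lock" is a tuple of pin heights, and a "mechanical pick"
# is formalized as any single rigid key profile over those heights.
PIN_LEVELS = range(4)

def opens(lock: tuple, key: tuple) -> bool:
    """In this toy model, a rigid key opens the lock only if every cut matches exactly."""
    return lock == key

def provably_unpickable(lock: tuple, secret_key: tuple) -> bool:
    """'Proof' by exhaustion: no key profile other than the secret opens the lock.

    The guarantee is only as strong as the model: a device with adjustable,
    sensor-driven cut depths is not representable as a single rigid profile,
    so it is simply outside the scope of this 'theorem'.
    """
    for key in product(PIN_LEVELS, repeat=len(lock)):
        if key != secret_key and opens(lock, key):
            return False
    return True

lock = (2, 0, 3, 1)
print(provably_unpickable(lock, secret_key=lock))  # True, but only within this toy model
```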
I find this article confusing, so I return to fundamentals of computer science: greedy algorithms and the conditions under which they are optimal. Would anyone care to build a bridge from that terminology to what the author is trying to convey?
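For concreteness, here is the kind of textbook example I have in mind (not drawn from the article): interval scheduling, where the greedy rule “pick the earliest finishing interval” is provably optimal because the problem has the greedy-choice property and optimal substructure.

```python
def max_non_overlapping(intervals):
    """Greedy activity selection: repeatedly take the interval that finishes earliest.

    This greedy rule is provably optimal because an earliest-finishing interval is
    always part of some optimal solution (greedy-choice property) and the remaining
    subproblem has the same structure (optimal substructure). Many superficially
    similar problems lack these properties, and the same greedy instinct then fails.
    """
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen

print(max_non_overlapping([(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9), (6, 10), (8, 11)]))
# [(1, 4), (5, 7), (8, 11)]
```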