Do you think there is a large risk of AI systems killing or subjugating humanity autonomously related to scale-up of AI models?
A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model. It also seems like it wouldn’t pursue measures targeted at the kind of disaster it denies, and might actively discourage them (this sometimes happens already). With a threat model of privacy violations restrictions on model size would be a huge lift and the remedy wouldn’t fit the diagnosis in a way that made sense to policymakers. So I wouldn’t expect privacy advocates to bring them about based on their past track record, particularly in China where privacy and digital democracy have not had great success.
If it in fact is true that there is a large risk of almost everyone alive today being killed or subjugated by AI, then establishing that as scientific consensus seems like it would supercharge a response dwarfing current efforts for things like privacy rules, which would aim to avert that problem rather than deny it and might manage such huge asks, including in places like China. On the other hand, if the risk is actually small, then it won’t be possible to scientifically demonstrate high risk, and it would play a lesser role in AI policy.
I don’t see a world where it’s both true the risk is large and knowledge of that is not central to prospects for success with such huge political lifts.
A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model”.
I can imagine there being movements that fit this description, in which case I would not focus on talking with them or talking about them.
But I have not been in touch with any movements matching this description. Perhaps you could share specific examples of actions from specific movements you have in mind?
For the movements I have in mind (and am talking with), the description does not match at all:
AI ethics and inclusion movements go a lot further than stopping people from building AI that eg. make discriminatory classifications/recommendations associated with marginalised communities – they want Western corporations to stop consolidating power through AI development and deployment while pushing their marginalised communities further out of the loop (rendering them voiceless).
Digital democracy groups and human-centric AI movements go a lot further than wanting to regulate AI – they want to relegate AI models humble models in the background that can assist and interface between humans building consensus and making decisions in the foreground.
Privacy and data ownership movements go a lot further than wanting current EU regulations – they do not want models to be trained on, store and exploit their own data in model parameters without their permission.
Suggest reading writings by people in those movements. Let me also copy over excerpts from people active in the areas of AI ethics & inclusion and digital democracy:
“We also advocate for a re-alignment of research goals: Where much effort has been allocated to making models (and their training data) bigger and to achieving ever higher scores on leaderboards often featuring artificial tasks, we believe there is more to be gained by focusing on understanding how machines are achieving the tasks in question and how they will form part of socio-technical systems.” from paper co-authored by Timnit Gebru.
“Rationalists are like most of the ideological groups I interact with. They are allies in important projects, such as limiting the race for massive investments in “AI” capabilities and engaging in governance experimentation. In other projects, such as limiting the social power/hubris of SV and diversifying it along a variety of dimensions they are more likely adversaries or at least unlikely allies.” from post by Glen Weyl.
Do you think there is a large risk of AI systems killing or subjugating humanity autonomously related to scale-up of AI models?
Yes, I do. And the movements I am in touch with are against corporate R&D labs scaling up AI models in the careless ways they’ve been doing so far.
Are you taking a stance here of “those outside movements have different explicit goals than us AI Safety researchers, and therefore cannot become goal-aligned with our efforts”?
In that case, I would disagree here.
Theoretically, I disagree with the ontological and meta-ethical assumptions that these claims would be based on. While objective goals expressed here are disjunctive, the underlying values are additive (I do not expect this statement to make sense for you; please skip to next point).
Practically, movements with various explicit goals are already against corporations (that are selected by markets to extract value from local communities) centrally scaling up the training of increasingly autonomous/power-seeking models. Some examples:
re: AI ethics and inclusion: The Stochastic Parrots paper co-authored by Timnit Gebru (before Google AI managers let her go, so to speak), describes various reasons for slowing down the scaled training of (language) transformer models. These reasons include the environmental costs of compute, neglecting to curate training data carefully, and the failure to co-design with stakeholders affected. All reasons to not scale up AI models fast.
Note that I have not read any writings from Gebru that “AGI risk” is not a thing. More the question of why people are then diverting resources to AGI-related research while assuming that the development general AI is inevitable and beyond our control.
re: Digital democracy and human-centric AI: The How AI Fails Us paper (see p5) argues against the validity of and further investment in the “centralization of capital and decision-making capacity under the direction of a small group of engineers of AI systems” where “the machine is independent from human input and oversight” and with “the target of “achieving general intelligence””. How does this not match up with arguing against “subjugating humanity autonomously [with the centralised] scale-up of AI models”?
People like Divya Siddarth, Glen Weyl, Audrey Tang, Jaron Lanier and Daron Acemoglu have repeatedly expressed their concerns about how current automation of work through AI models threatens the empowerment of humans in their work, creativity, and collective choice-making.
Weyl is also skeptical about the monolithic conception of “AGI” surpassing humans across some metrics. I disagree in that generally-capable self-learning/modifying machinery are physically possible. I agree in that monolithic oversimplified representations of AGI have allowed AI Safety researchers to make unsound presumptive claims about how they expect that machinery could be “aligned” in “principle.
As an example, you mentioned how governments could invest tens of billions of dollars in interpretability research. I touched on reasons here why interpretability research does not and cannot contribute top long-term AGI safety. Based on that, government-funded in interpretability research would distract smart AI researchers from actually contributing, and lend false confidence to AGI researchers that AGI could be interpreted sufficiently. Ie. this is “align-washing” the harmful activities by AI corporations, analogous to green-washing the harmful activities of fossil-fuel corporations.
As another example, your idea of Von Neuman Probes with error correcting codes, referred to by Christiano here, cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here.
re: Privacy and data ownership: If privacy and data ownership movements take their own claims seriously (and some do), they would push for banning the training of ML models on human-generated data or any sensor-based surveillance that can be used to track humans’ activities.
With a threat model of privacy violations restrictions on model size would be a huge lift and the remedy wouldn’t fit the diagnosis in a way that made sense to policymakers. So I wouldn’t expect privacy advocates to bring them about based on their past track record, particularly in China where privacy and digital democracy have not had great success.
What do you mean here with a “huge lift”? Koen Holtman has been involved with internet privacy movements for decades. Let me ping him in case he wants to share thoughts on what went wrong there in Europe/ and in China.
I agree that some specific leaders you cite have expressed distaste for model scaling, but it seems not to be a core concern. In a choice between more politically feasible measures that target concerns they believe are real vs concerns they believe are imaginary and bad, I don’t think you get the latter. And I think arguments based on those concerns get traction on measures addressing the concerns, but less so on secondary wishlist items of leaders .
I think that’s the reason privacy advocacy in legislation and the like hasn’t focused on banning computers in the past (and would have failed if they tried). For example:
If privacy and data ownership movements take their own claims seriously (and some do), they would push for banning the training of ML models on human-generated data or any sensor-based surveillance that can be used to track humans’ activities.
AGI working with AI generated data or data shared under the terms and conditions of web services can power the development of highly intelligent catastrophically dangerous systems, and preventing AI from reading published content doesn’t seem close to the core motives there, especially for public support on privacy. So taking the biggest asks they can get based on privacy arguments I don’t think blocks AGI.
People like Divya Siddarth, Glen Weyl, Audrey Tang, Jaron Lanier and Daron Acemoglu have repeatedly expressed their concerns about how current automation of work through AI models threatens the empowerment of humans in their work, creativity, and collective choice-making.
It looks this kind of concern at scale naturally goes towards things like compensation for creators (one of Lanier’s recs), UBI, voting systems, open-source AI, and such.
Jaron Lanier has written a lot dismissing the idea of AGI or work to address it. I’ve seen a lot of such dismissal from Glen Weyl. Acemoglu I don’t think wants to restrict AI development? I don’t know Siddarth or Tang’s work well.
Note that I have not read any writings from Gebru that “AGI risk” is not a thing. More the question of why people are then diverting resources to AGI-related research while assuming that the development general AI is inevitable and beyond our control.
They’re definitely living in a science fiction world where everyone who wants to save humanity has to work on preventing the artificial general intelligence (AGI) apocalypse...Agreed but if that urgency is in direction of “we need to stop evil AGI & LLMs are AGI” then it does the opposite by distracting from types of harms perpetuated & shielding those who profit from these models from accountability. I’m seeing a lot of that atm (not saying from you)...What’s the open ai rationale here? Clearly it’s not the same as mine, creating a race for larger & larger models to output hateful stuff? Is it cause y’all think they have “AGI”?...Is artificial general intelligence (AGI) apocalypse in that list? Cause that’s what him and his cult preach is the most important thing to focus on...The thing is though our AGI superlord is going to make all of these things happen once its built (any day now) & large language models are a way to get to it...Again, this movement has so much of the $$ going into “AI safety.” You shouldn’t worry about climate change as much as “AGI” so its most important to work on that. Also what Elon Musk was saying around 2015 when he was backing of Open AI & was yapping about “AI” all the time.
That reads to me as saying concerns about ‘AGI apocalypse’ are delusional nonsense but pursuit of a false dream of AGI incidentally cause harms like hateful AI speech through advancing weaker AI technology, while the delusions should not be an important priority.
What do you mean here with a “huge lift”?
I gave the example of barring model scaling above a certain budget.
I touched on reasons here why interpretability research does not and cannot contribute top long-term AGI safety.
I disagree extremely strongly with that claim. It’s prima facie absurd to think that, e.g. that using interpretability tools to discover that AI models were plotting to overthrow humanity would not help to avert that risk. For instance, that’s exactly the kind of thing that would enable a moratorium on scaling and empowering those models to improve the situationn.
As another example, your idea of Von Neuman Probes with error correcting codes, referred to by Christiano here, cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here.
This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves), but don’t cover all changes that could derive from learning (although there are other reasons why those could be stable in preserving good or terrible properties).
Some of your interpretations of writings by Timnit Gebru and Glen Weyl seem fair to me (though would need to ask them to confirm). I have not look much into Jaron Lanier’s writings on AGI so that prompts me to google that.
Perhaps you can clarify the other reasons why the changes in learning would be stable in preserving “good properties”? I’ll respond to your nuances regarding how to interpret your long-term-evaluating error correcting code after that.
re: Leaders of movements being skeptical of the notion of AGI.
Reflecting more, my impression is that Timnit Gebru is skeptical about the sci-fiy descriptions of AGI, and even more so about the social motives of people working on developing (safe) AGI. She does not say that AGI is an impossible concept or not actually a risk. She seems to question the overlapping groups of white male geeks who have been diverting efforts away from other societal issues, to both promoting AGI development and warning of AGI x-risks.
Regarding Jaron Lanier, yes, (re)reading this post I agree that he seems to totally dismiss the notion of AGI, seeing it more a result of a religious kind of thinking under which humans toil away at offering the training data necessary for statistical learning algorithms to function without being compensated.
Feel free to still clarify the other reasons why the changes in learning would be stable in preserving “good properties”. Then I will take that starting point to try explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability.
Fundamentally though, we’ll still be discussing the application limits of error correction methods.
Three ways to explain why:
AnyworkableAI-alignment method involves receiving input signals, comparing input signals against internal references, and outputting corrective signals to maintain alignment of outside states against those references (ie. error correction).
Any workable AI-alignment method involves a control feedback loop – of detecting the actual (or simulating the potential) effects internally and then correcting actual (or preventing the potential) effects externally (ie. error correction).
Eg. mechanistic interpretability is essentially about “detecting the actual (or simulating the potential) effects internally” of AI.
The only way to actually (slightly) counteract AGI convergence on causing “instrumental” and “needed” effects within a more complex environment is to simulate/detect and then prevent/correct those environmental effects (ie. error correction).
~ ~ ~ Which brings us back to why error correction methods, of any kind and in any combination, cannot ensure long-term AGI Safety.
I reread your original post and Christiano’s comment to understand your reasoning better and see how I could limits of applicability of error correction methods.
I also messaged Forrest (the polymath) to ask for his input.
The messages were of a high enough quality that I won’t bother rewriting the text. Let me copy-paste the raw exchange below (with few spelling edits).
Remmelt 15:38 Remmelt: “As another example [of unsound monolithic reasoning], your idea of Von Neuman Probes with error correcting codes, referred to by Christiano here (https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=Jaf9b9YAARYdrK3jp), cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here.”
Carl Shulman: ”This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves), but don’t cover all changes that could derive from learning (although there are other reasons why those could be stable in preserving good or terrible properties).”
Remmelt 15:40 Excerpting from the comment by Christiano I link to above: ”The production-web has no interest in ensuring that its members value production above other ends, only in ensuring that they produce (which today happens for instrumental reasons). If consequentialists within the system intrinsically value production it’s either because of single-single alignment failures (i.e. someone who valued production instrumentally delegated to a system that values it intrinsically) or because of new distributed consequentialism distinct from either the production web itself or any of the actors in it, but you don’t describe what those distributed consequentialists are like or how they come about.
And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights, or war between blocs with higher and lower savings rates (some of them too low to support human life, which even if you don’t buy Carl’s argument is really still quite low, conferring a tiny advantage). If those are the chief mechanisms then it seems important to think/talk about the kinds of agreements and treaties that humans (or aligned machines acting on their behalf!) would be trying to arrange in order to avoid those wars.”
”But the program of an AI, large stores of astronomical observations for navigation, and vast stores of technological information would take up an enormous amount of memory and storage space, perhaps many exabytes or more. Given this large body of information, adding additional directives to ensure that the probes eventually turn to producing welfare need only increase storage needs by a very small proportion, e.g. by 1 in 1 billion. Directives could directly specify the criteria to be eventually optimized, or could simply require compliance with further orders traveling behind the frontier of colonization. ... Mutation is easier to resist for computers than animals Biological life on Earth has evolved through mutation, and the reproductive process introduces significant errors in each generation. However, digital information storage allows for the comparison of redundant copies and the use of error-correcting codes, making substantive mutation many orders of magnitude less likely than in Earthly life.”
Remmelt 15:45
Returning to the new comment by Carl Schulman: ”This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves)”
Remmelt 15:46
Thinking about this overnight, I think Carl’s stated reasoning is still unsound for multiple reasons:
1. Contradiction between a Von Neummann Probe being adaptable enough (ie. learning new ways of processing inputs into outputs) to travel across space and seed new civilisations, yet have error correcting code that allows comparison of new code with original redundant copies. Not going to work, for reasons Forrest amply explained and I tried to summarise here: https://docs.google.com/document/d/1-AAhqvgFNx_MlLkcSgw-chvmFoC4EZ4LmTl1IWcsqEA/edit
2. Confuses complicated pre-loaded technological knowledge/systems with complex adaptive systems. The fact that they are saying that adding in directives would only increase storage by 1 part in 1 billion parts is a giveaway, I think.
Remmelt 15:55
3. Inverse take on 1. Algorithms which can flexibly ‘mutate’ and branch out into different versions become better at using resources and multiplying than more rigid or robustly functional designs. This makes Carl Schulman’s case for launching out self-replicating space probes with code error-checking/minimisation routines seem a lot more dicey. If a defecting group launches even one alternate design with a flexible code-mutating ability that confers an advantage that can’t easily be copied by the error-minimising designs without compromising on their ability to act on the directives humans originally coded in to ‘directly specify the criteria to be eventually optimized’ – well, then you might end up instead with swarms of spaceprobes that eat up the galaxy indiscriminately, including any remaining carbon-based lifeforms on planet Earth.
Underlying premise: even if humans construct a long-term aligned AI design – where humans can formally prove a model to causally constrain any possible process of agency emerging from and expanding across each of the physical parts in which this model infers its computational process to be embedded to stay within all fundamental bounds necessary for maintaining alignment with the values that humans broadly share in common – then in practice that design is ‘one step away’ from getting mutated into misalignment by a faction of humans who seek a capability advantage (does it give one though?) for manifesting their more granular personal values.
Remmelt 15:57
@Forrest anything to add to the above? How about I write that into a comment reply (which perhaps might make for a stand-alone post later?)
Forrest 16:25
The starting point is: That any procedure of AGI alignment will (Ie, cannot not) resemble some sort of error correction algorithm
Forrest 16:28
This point cannot be disagreed with and still have the notion of alignment be persistent. If they argue here, they will need to establish a conformable coherent counter example. Simple contradiction is not enough, as the claim basis shifts to them.
Forrest 16:30
Once hard line equivalence is firmly established, then the argument re limits of causation in all error correction processes apply.
Remmelt 16:44
Good one. So start with how any AGI-alignment method must resemble an error correction algorithm in order to correct out misaligned behaviour learned and expressed in interaction with the environment.
Forrest 16:48
In regards to the idea that some machine could “have a strategy of optimizing for near-maximum X until condition Z is achieved, and *then* implement directive Q” makes at least three assumptions that are highly unlikely.
1; there is no assurance that in the long duration of the optimization of X that there would be persistence of directive Q.
2; there is no assurance that the condition ‘Z’ is consistently detectable, or that the triggering condition is even probable to being implemented, even if detected (as per a violation of assumption 1).
3; the integration coupling of directive Q is mismatched to the sheer mass of the output of the optimized code—ie, it is unlikely that ‘Q’, no matter how formed, will actually result in the consistent behavior described. Ie, ‘Q’ is significantly underspecified, and if ‘Q’ was appropriately specified, it would not be “lightweight” with respect to the optimization process ‘X’, and would have been optimized away to ineffectiveness. Hence, even if Z condition trigger happens, Q does not happen.
Note insofar as the logic is directly generalized over the logical claim form of “adopt a strategy of near-maximum colonization speed until they reach the most distant accessible galaxies, and only then divert resources to producing welfare”, the proposal itself is not well specified. Ie, the assumption that ‘Q’ and detection of ‘Z’ can be “simple” is a conceptual category error.
Remmelt 17:02
I read through this, and think I understand your points.
Forrest 17:10
Yes. I am trying to think of a good metaphor. It is maybe a bit like some single ordinary person trying to command everyone to do something in some consistent way, when that _everyone_ is 100000 people all yelling at one another in some huge auditorium somewhere. The assumption that one person, even at maximum personal signal volume and intensity, could be heard over all similar other people (in this case, the other objectives of the Von Neumann probe), and command them to implement some new action pattern reliably consistent with ‘Q’, is simply deeply mismatched to the output reality of the optimization process ‘X’. The single directive ‘Q’ is simply not powerful enough to fully conditionalize all of ‘X’, even under triggering conditions ‘Z’.
Also, I notice that the assumptions that the error correction process could be applied linearly to a dynamic self recursive system at arbitrary levels of abstraction is simply another category error. Wrong tool for the wrong job. That is less important, though, than the issue with the Q directive arbitrarily efficient effectivity mismatch.
It’s prima facie absurd to think that, e.g. that using interpretability tools to discover that AI models were plotting to overthrow humanity would not help to avert that risk.
I addressed claims of similar forms at least 3 times times already on separate occasions (including in the post itself).
“The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignment of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term.
If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting “intentional direct lethality” and “explicitly rendered deception”.”
This is like saying there’s no value to learning about and stopping a nuclear attack from killing you because you might get absolutely no benefit from not being killed then, and being tipped off about a threat trying to kill you, because later the opponent might kill you with nanotechnology before you can prevent it.
Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures. And as I said actually being able to show a threat to skeptics is immensely better for all solutions, including relinquishment, than controversial speculation.
It’s saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it’s still lethal. Focussing on some ways that you feel confident that you might be able to prevent the doomsday device from being lethal is IMO distracting dangerously from the point, which is that people should not built the doomsday device in the first place.
If mechanistic interpretability methods cannot prevent that interactions of AGI necessarily converge on total human extinction beyond theoretical limits of controllability, it means that these (or other “inspect internals”) methods cannot contribute to long-term AGI safety. And this is not idle speculation, nor based on prima facie arguments. It is based on 15 years of research by a polymath working outside this community.
In that sense, it would not really matter that mechanistic interpretability can do an okay job at detecting that a power-seeking AI was explicitly plotting to overthrow humanity.
That is, except for the extremely unlikely case you pointed to that such intentions are detected and on time, and humans all coordinate at once to impose an effective moratorium on scaling or computing larger models. But this is actually speculation, whereas that OpenAI promoted Olah’s fascinating Microscope-generated images as them making progress on understanding and aligning scalable ML models is not speculation.
Overall, my sense is that mechanistic interpretability is used to align-wash capability progress towards AGI, while not contributing to safety where it predominantly matters.
Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures.
Exactly this kind of thinking is what I am concerned about. It implicitly assumes that you have a (sufficiently) comprehensive and sound understanding of the ways humans would get killed at a given level of capability, and therefore can rely on that understanding to conclude that capabilities of AIs can be greatly increased without humans getting killed.
How do you think capability developers would respond to that statement? Will they just stay on the safe side, saying “Well those alignment researchers say that mechanistic interpretability helps remove intentional deception or harm, but I’m just going to stay on the safe side and not scale any further”. No, they are going to use your statement to promote the potential safety of their scalable models, and remove whatever safety margin they can justify themselves taking and feel justified taking for themselves.
Not considering unknown unknowns is going to get us killed. Not considering what safety problems may be unsolvable is going to get us killed.
Age-old saying: “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
> What is your background? > How is it relevant to the work > you are planning to do?
Years ago, we started with a strong focus on civilization design and mitigating x-risk. These are topics that need and require more generalist capabilities, in many fields, not just single specialist capabilities, in any one single field of study or application.
Hence, as generalists, we are not specifically persons who are career mathematicians, nor even career physicists, chemists, or career biologists, anthropologists, or even career philosophers. Yet when considering the needs of topics civ-design and/or x-risk, it is very abundantly clear that some real skill and expertise is actually needed in all of these fields.
Understanding anything about x-risk and/or civilization means needing to understand key topics regarding large scale institutional process, ie; things like governments, businesses, university, constitutional law, social contract theory, representative process, legal and trade agreements, etc.
Yet people who study markets, economics, and politics (theory of groups, firms, etc) who do not also have some real grounding in actual sociology and anthropology, are not going to have grounding in understanding why things happen in the real world as they tend to do.
And those people are going to need to understand things like psychology, developmental psych, theory of education, interpersonal relationships, attachment, social communication dynamics, health of family and community, trauma, etc.
And understanding *those* topics means having a real grounding in evolutionary theory, bio-systems, ecology, biology, neurochemestry and neurology, ecosystem design, permaculture, and evolutionary psychology, theory of bias, etc.
It is hard to see that we would be able to assess things like ‘sociological bias’ as impacting possible mitigation strategies of x-risk, if we do not actually also have some real and deep, informed, and realistic accounting of the practical implications of, in the world, of *all* of these categories of ideas.
And yet, unfortunately, that is not all, since understanding of *those* topics themselves means even more and deeper grounding in things like organic and inorganic chemistry, cell process, and the underlying *physics* of things like that. Which therefore includes a fairly general understanding of multiple diverse areas of physics (mechanical, thermal, electromagnetic, QM, etc), and thus also of technology—since that is directly connected to business, social systems, world systems infrastructure, internet, electrical grid and energy management, transport (for fuel, materials, etc), and even more politics, advertising and marketing, rhetorical process and argumentation, etc.
Oh, and of course, a deep and applied practical knowledge of ‘computer science’, since nearly everything in the above is in one way or another “done with computers”. Maybe, of course, that would also be relevant when considering the specific category of x-risk which happens to involve computational concepts when thinking about artificial superintelligence.
I *have* been a successful practicing engineer in both large scale US-gov deployed software and also in product design shipped to millions. I have personally written more than 900,000 lines of code (mostly Ansi-C, ASM, Javascript) and have been ‘the principle architect’ in a team. I have developed my own computing environments, languages, procedural methodologies, and system management tactics, over multiple process technologies in multiple applied contexts. I have a reasonably thorough knowledge of CS. Including the modeling math, control theory, etc. Ie, I am legitimately “full stack” engineering from the physics of transistors, up through CPU design, firmware and embedded systems, OS level work, application development, networking, user interface design, and the social process implications of systems. I have similarly extensive accomplishments in some of the other listed disciplines also.
As such, as a proven “career” generalist, I am also (though not just) a master craftsman, which includes things like practical knowledge of how to negotiate contracts, write all manner documents, make all manner of things, *and* understand the implications of *all* of this in the real world, etc.
For the broad category of valid and reasonable x-risk assessment, that nothing less than at least some true depth in nearly *all* of these topics, will do.
From Math Expectations, a depersonalised post Forrest wrote of his impressions of a conversation with a grant investigator where the grant investigator kept looping back on the expectation that a “proof” based on formal reasoning must be written in mathematical notation. We did end up receiving the $170K grant.
I usually do not mention Forrest Landry’s name immediately for two reasons:
If you google his name, he comes across like a spiritual hippie. Geeks who don’t understand his use of language take that as a cue that he must not know anything about computational science, mathematics or physics (wrong – Forrest has deep insights into programming methods and eg. why Bell’s Theorem is a thing) .
Forrest prefers to work on the frontiers of research, rather than repeating himself in long conversations with tech people who cannot let go off their own mental models and quickly jump to motivated counterarguments that he heard and addressed many times before. So I act as a bridge-builder, trying to translate between Forrest speak and Alignment Forum speak.
Both of us prefer to work behind the scenes. I’ve only recently started to touch on the arguments in public.
You can find those arguments elaborated on here. Warning: large inferential distance; do message clarifying questions – I’m game!
It’s saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it’s still lethal. Focussing on some ways that you feel confident that you might be able to prevent the doomsday device from being lethal is IMO distracting dangerously from the point, which is that people should not built the doomsday device in the first place.
As requested by Remmelt I’ll make some comments on the track record of privacy advocates, and their relevance to alignment.
I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards limiting what large companies and governments are actually allowed to do with all that data.
The digital civil rights community has long promoted the idea that Internet based platforms and other computer systems must be designed and run in a way that is aligned with human values. In the context of AI and ML based computer systems, this has led to demands for AI fairness and transparency/explainability that have also found their way into policy like the GDPR, legislation in California, and the upcoming EU AI Act. AI fairness demands have influenced the course of AI research being done, e.g. there has been research on defining it even means for an AI model to be fair, and on making models that actually implement this meaning.
To a first approximation, privacy and digital rights advocates will care much more about what an ML model does, what effect its use has on society, than about the actual size of the ML model. So they are not natural allies for x-risk community initiatives that would seek a simple ban on models beyond a certain size. However, they would be natural allies for any initiative that seeks to design more aligned models, or to promote a growth of research funding in that direction.
To make a comment on the premise of the original post above: digital rights activists will likely tell you that, when it comes to interventions on AI research, speculating about the tractability of ‘slowing down AI research’ is misguided. What you really should be thinking about is changing the direction of AI research.
Also, I stand corrected then on my earlier comment on that privacy and digital ownership advocates would/should care about models being trained on their own/person-tracking data such to restrict the scaling of models. I’m guessing I was not tracking well then what people in at least the civil rights spaces Koen moves around in are thinking and would advocate for.
A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model.
This is a very spicy take, but I would (weakly) guess that a hypothetical ban on ML trainings that cost more than $10M would make AGI timelines marginally shorter rather than longer, via shifting attention and energy away from scaling and towards algorithm innovation.
Very interesting! Recently, US started to regulate export of computing power to China. Do you expect this to speed up AGI timeline in China, or do you expect regulation to be ineffective, or something else?
Reportedly, NVIDIA developed A800, which is just A100, to keep the letter but probably not the spirit of the regulation. I am trying to follow closely how A800 fares, because it seems to be an important data point on feasibility of regulating computing power.
I strongly agree with Steven about this. Personally, I expect it’ll be non-impactful in either direction. I think the majority of research groups already have sufficient compute available to make dangerous algorithmic progress, and they are not so compute-resource-rich that their scaling efforts are distracting them from more dangerous pursuits. I think the groups who would be more dangerous if they weren’t ‘resource drunk’ are mainly researchers at big companies.
I think the two camps are less orthogonal than your examples of privacy and compute reg portray. There’s room for plenty of excellent policy interventions that both camps could work together to support. For instance, increasing regulatory requirements for transparency on algorithmic decision-making (and crucially, building a capacity both in regulators and in the market supporting them to enforce this) is something that I think both camps would get behind (the xrisk one because it creates demand for interpretability and more and the other because eg. it’s easier to show fairness issues) and could productively work on together. I think there are subculture clash reasons the two camps don’t always get on, but that these can be overcome, particularly given there’s a common enemy (misaligned powerful AI). See also this paper Beyond Near- and Long-Term: Towards a
Clearer Account of Research Priorities in AI Ethics and Society
I know lots of people who are uncertain about how big the risks are, and care about both problems, and work on both (I am one of these—I care more about AGI risk, but I think the best things I can do to help avert it involve working with the people you think aren’t helpful).
Do you think there is a large risk of AI systems killing or subjugating humanity autonomously related to scale-up of AI models?
A movement pursuing antidiscrimination or privacy protections for applications of AI that thinks the risk of AI autonomously destroying humanity is nonsense seems like it will mainly demand things like the EU privacy regulations, not bans on using $10B of GPUs instead of $10M in a model. It also seems like it wouldn’t pursue measures targeted at the kind of disaster it denies, and might actively discourage them (this sometimes happens already). With a threat model of privacy violations restrictions on model size would be a huge lift and the remedy wouldn’t fit the diagnosis in a way that made sense to policymakers. So I wouldn’t expect privacy advocates to bring them about based on their past track record, particularly in China where privacy and digital democracy have not had great success.
If it in fact is true that there is a large risk of almost everyone alive today being killed or subjugated by AI, then establishing that as scientific consensus seems like it would supercharge a response dwarfing current efforts for things like privacy rules, which would aim to avert that problem rather than deny it and might manage such huge asks, including in places like China. On the other hand, if the risk is actually small, then it won’t be possible to scientifically demonstrate high risk, and it would play a lesser role in AI policy.
I don’t see a world where it’s both true the risk is large and knowledge of that is not central to prospects for success with such huge political lifts.
I can imagine there being movements that fit this description, in which case I would not focus on talking with them or talking about them.
But I have not been in touch with any movements matching this description. Perhaps you could share specific examples of actions from specific movements you have in mind?
For the movements I have in mind (and am talking with), the description does not match at all:
AI ethics and inclusion movements go a lot further than stopping people from building AI that eg. make discriminatory classifications/recommendations associated with marginalised communities – they want Western corporations to stop consolidating power through AI development and deployment while pushing their marginalised communities further out of the loop (rendering them voiceless).
Digital democracy groups and human-centric AI movements go a lot further than wanting to regulate AI – they want to relegate AI models humble models in the background that can assist and interface between humans building consensus and making decisions in the foreground.
Privacy and data ownership movements go a lot further than wanting current EU regulations – they do not want models to be trained on, store and exploit their own data in model parameters without their permission.
Suggest reading writings by people in those movements. Let me also copy over excerpts from people active in the areas of AI ethics & inclusion and digital democracy:
“We also advocate for a re-alignment of research goals: Where much effort has been allocated to making models (and their training data) bigger and to achieving ever higher scores on leaderboards often featuring artificial tasks, we believe there is more to be gained by focusing on understanding how machines are achieving the tasks in question and how they will form part of socio-technical systems.” from paper co-authored by Timnit Gebru.
“Rationalists are like most of the ideological groups I interact with. They are allies in important projects, such as limiting the race for massive investments in “AI” capabilities and engaging in governance experimentation. In other projects, such as limiting the social power/hubris of SV and diversifying it along a variety of dimensions they are more likely adversaries or at least unlikely allies.” from post by Glen Weyl.
Yes, I do. And the movements I am in touch with are against corporate R&D labs scaling up AI models in the careless ways they’ve been doing so far.
Are you taking a stance here of “those outside movements have different explicit goals than us AI Safety researchers, and therefore cannot become goal-aligned with our efforts”?
In that case, I would disagree here.
Theoretically, I disagree with the ontological and meta-ethical assumptions that these claims would be based on. While objective goals expressed here are disjunctive, the underlying values are additive (I do not expect this statement to make sense for you; please skip to next point).
Practically, movements with various explicit goals are already against corporations (that are selected by markets to extract value from local communities) centrally scaling up the training of increasingly autonomous/power-seeking models. Some examples:
re: AI ethics and inclusion:
The Stochastic Parrots paper co-authored by Timnit Gebru (before Google AI managers let her go, so to speak), describes various reasons for slowing down the scaled training of (language) transformer models. These reasons include the environmental costs of compute, neglecting to curate training data carefully, and the failure to co-design with stakeholders affected. All reasons to not scale up AI models fast.
Note that I have not read any writings from Gebru that “AGI risk” is not a thing. More the question of why people are then diverting resources to AGI-related research while assuming that the development general AI is inevitable and beyond our control.
re: Digital democracy and human-centric AI:
The How AI Fails Us paper (see p5) argues against the validity of and further investment in the “centralization of capital and decision-making capacity under the direction of a small group of engineers of AI systems” where “the machine is independent from human input and oversight” and with “the target of “achieving general intelligence””. How does this not match up with arguing against “subjugating humanity autonomously [with the centralised] scale-up of AI models”?
People like Divya Siddarth, Glen Weyl, Audrey Tang, Jaron Lanier and Daron Acemoglu have repeatedly expressed their concerns about how current automation of work through AI models threatens the empowerment of humans in their work, creativity, and collective choice-making.
Weyl is also skeptical about the monolithic conception of “AGI” surpassing humans across some metrics. I disagree in that generally-capable self-learning/modifying machinery are physically possible. I agree in that monolithic oversimplified representations of AGI have allowed AI Safety researchers to make unsound presumptive claims about how they expect that machinery could be “aligned” in “principle.
As an example, you mentioned how governments could invest tens of billions of dollars in interpretability research. I touched on reasons here why interpretability research does not and cannot contribute top long-term AGI safety. Based on that, government-funded in interpretability research would distract smart AI researchers from actually contributing, and lend false confidence to AGI researchers that AGI could be interpreted sufficiently. Ie. this is “align-washing” the harmful activities by AI corporations, analogous to green-washing the harmful activities of fossil-fuel corporations.
As another example, your idea of Von Neuman Probes with error correcting codes, referred to by Christiano here, cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here.
re: Privacy and data ownership:
If privacy and data ownership movements take their own claims seriously (and some do), they would push for banning the training of ML models on human-generated data or any sensor-based surveillance that can be used to track humans’ activities.
What do you mean here with a “huge lift”?
Koen Holtman has been involved with internet privacy movements for decades. Let me ping him in case he wants to share thoughts on what went wrong there in Europe/ and in China.
I agree that some specific leaders you cite have expressed distaste for model scaling, but it seems not to be a core concern. In a choice between more politically feasible measures that target concerns they believe are real vs concerns they believe are imaginary and bad, I don’t think you get the latter. And I think arguments based on those concerns get traction on measures addressing the concerns, but less so on secondary wishlist items of leaders .
I think that’s the reason privacy advocacy in legislation and the like hasn’t focused on banning computers in the past (and would have failed if they tried). For example:
AGI working with AI generated data or data shared under the terms and conditions of web services can power the development of highly intelligent catastrophically dangerous systems, and preventing AI from reading published content doesn’t seem close to the core motives there, especially for public support on privacy. So taking the biggest asks they can get based on privacy arguments I don’t think blocks AGI.
It looks this kind of concern at scale naturally goes towards things like compensation for creators (one of Lanier’s recs), UBI, voting systems, open-source AI, and such.
Jaron Lanier has written a lot dismissing the idea of AGI or work to address it. I’ve seen a lot of such dismissal from Glen Weyl. Acemoglu I don’t think wants to restrict AI development? I don’t know Siddarth or Tang’s work well.
From Twitter:
That reads to me as saying concerns about ‘AGI apocalypse’ are delusional nonsense but pursuit of a false dream of AGI incidentally cause harms like hateful AI speech through advancing weaker AI technology, while the delusions should not be an important priority.
I gave the example of barring model scaling above a certain budget.
I disagree extremely strongly with that claim. It’s prima facie absurd to think that, e.g. that using interpretability tools to discover that AI models were plotting to overthrow humanity would not help to avert that risk. For instance, that’s exactly the kind of thing that would enable a moratorium on scaling and empowering those models to improve the situationn.
This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves), but don’t cover all changes that could derive from learning (although there are other reasons why those could be stable in preserving good or terrible properties).
I intend to respond to the rest tomorrow.
Some of your interpretations of writings by Timnit Gebru and Glen Weyl seem fair to me (though would need to ask them to confirm). I have not look much into Jaron Lanier’s writings on AGI so that prompts me to google that.
Perhaps you can clarify the other reasons why the changes in learning would be stable in preserving “good properties”? I’ll respond to your nuances regarding how to interpret your long-term-evaluating error correcting code after that.
re: Leaders of movements being skeptical of the notion of AGI.
Reflecting more, my impression is that Timnit Gebru is skeptical about the sci-fiy descriptions of AGI, and even more so about the social motives of people working on developing (safe) AGI. She does not say that AGI is an impossible concept or not actually a risk. She seems to question the overlapping groups of white male geeks who have been diverting efforts away from other societal issues, to both promoting AGI development and warning of AGI x-risks.
Regarding Jaron Lanier, yes, (re)reading this post I agree that he seems to totally dismiss the notion of AGI, seeing it more a result of a religious kind of thinking under which humans toil away at offering the training data necessary for statistical learning algorithms to function without being compensated.
Returning on error correction point:
Feel free to still clarify the other reasons why the changes in learning would be stable in preserving “good properties”. Then I will take that starting point to try explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability.
Fundamentally though, we’ll still be discussing the application limits of error correction methods.
Three ways to explain why:
Any workable AI-alignment method involves receiving input signals, comparing input signals against internal references, and outputting corrective signals to maintain alignment of outside states against those references (ie. error correction).
Any workable AI-alignment method involves a control feedback loop – of detecting the actual (or simulating the potential) effects internally and then correcting actual (or preventing the potential) effects externally (ie. error correction).
Eg. mechanistic interpretability is essentially about “detecting the actual (or simulating the potential) effects internally” of AI.
The only way to actually (slightly) counteract AGI convergence on causing “instrumental” and “needed” effects within a more complex environment is to simulate/detect and then prevent/correct those environmental effects (ie. error correction).
~ ~ ~
Which brings us back to why error correction methods, of any kind and in any combination, cannot ensure long-term AGI Safety.
I reread your original post and Christiano’s comment to understand your reasoning better and see how I could limits of applicability of error correction methods.
I also messaged Forrest (the polymath) to ask for his input.
The messages were of a high enough quality that I won’t bother rewriting the text. Let me copy-paste the raw exchange below (with few spelling edits).
Remmelt 15:37
@Forrest, would value your thoughts on the way Carl Schulman is thinking about error correcting code, perhaps to pass on on the LessWrong Forum:
(https://www.lesswrong.com/posts/uFNgRumrDTpBfQGrs/let-s-think-about-slowing-down-ai?commentId=bY87i5v5StH9FWdWy).
Remmelt 15:38
Remmelt:
“As another example [of unsound monolithic reasoning], your idea of Von Neuman Probes with error correcting codes, referred to by Christiano here (https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=Jaf9b9YAARYdrK3jp), cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here.”
Carl Shulman:
”This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves), but don’t cover all changes that could derive from learning (although there are other reasons why those could be stable in preserving good or terrible properties).”
Remmelt 15:40
Excerpting from the comment by Christiano I link to above:
”The production-web has no interest in ensuring that its members value production above other ends, only in ensuring that they produce (which today happens for instrumental reasons). If consequentialists within the system intrinsically value production it’s either because of single-single alignment failures (i.e. someone who valued production instrumentally delegated to a system that values it intrinsically) or because of new distributed consequentialism distinct from either the production web itself or any of the actors in it, but you don’t describe what those distributed consequentialists are like or how they come about.
You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted. But this it seems like the actual efficiency loss required to preserve human values seems very small even over cosmological time (e.g. see Carl on exactly this question: http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html).
And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights, or war between blocs with higher and lower savings rates (some of them too low to support human life, which even if you don’t buy Carl’s argument is really still quite low, conferring a tiny advantage). If those are the chief mechanisms then it seems important to think/talk about the kinds of agreements and treaties that humans (or aligned machines acting on their behalf!) would be trying to arrange in order to avoid those wars.”
Remmelt 15:41
And Carl Schulman’s original post on long-term error-correcting Von Neumann Probes:
(http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html):
”But the program of an AI, large stores of astronomical observations for navigation, and vast stores of technological information would take up an enormous amount of memory and storage space, perhaps many exabytes or more. Given this large body of information, adding additional directives to ensure that the probes eventually turn to producing welfare need only increase storage needs by a very small proportion, e.g. by 1 in 1 billion. Directives could directly specify the criteria to be eventually optimized, or could simply require compliance with further orders traveling behind the frontier of colonization.
...
Mutation is easier to resist for computers than animals
Biological life on Earth has evolved through mutation, and the reproductive process introduces significant errors in each generation. However, digital information storage allows for the comparison of redundant copies and the use of error-correcting codes, making substantive mutation many orders of magnitude less likely than in Earthly life.”
Remmelt 15:45
Returning to the new comment by Carl Schulman:
”This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves)”
Remmelt 15:46
Thinking about this overnight, I think Carl’s stated reasoning is still unsound for multiple reasons:
1. Contradiction between a Von Neummann Probe being adaptable enough (ie. learning new ways of processing inputs into outputs) to travel across space and seed new civilisations, yet have error correcting code that allows comparison of new code with original redundant copies. Not going to work, for reasons Forrest amply explained and I tried to summarise here: https://docs.google.com/document/d/1-AAhqvgFNx_MlLkcSgw-chvmFoC4EZ4LmTl1IWcsqEA/edit
Ooh, and in Forrest’s AGI Error Correction post: https://mflb.com/ai_alignment_1/agi_error_correction_psr.html#p1
Think I’ll share that.
Remmelt 15:54
2. Confuses complicated pre-loaded technological knowledge/systems with complex adaptive systems. The fact that they are saying that adding in directives would only increase storage by 1 part in 1 billion parts is a giveaway, I think.
Remmelt 15:55
3. Inverse take on 1.
Algorithms which can flexibly ‘mutate’ and branch out into different versions become better at using resources and multiplying than more rigid or robustly functional designs. This makes Carl Schulman’s case for launching out self-replicating space probes with code error-checking/minimisation routines seem a lot more dicey. If a defecting group launches even one alternate design with a flexible code-mutating ability that confers an advantage that can’t easily be copied by the error-minimising designs without compromising on their ability to act on the directives humans originally coded in to ‘directly specify the criteria to be eventually optimized’ – well, then you might end up instead with swarms of spaceprobes that eat up the galaxy indiscriminately, including any remaining carbon-based lifeforms on planet Earth.
Underlying premise: even if humans construct a long-term aligned AI design – where humans can formally prove a model to causally constrain any possible process of agency emerging from and expanding across each of the physical parts in which this model infers its computational process to be embedded to stay within all fundamental bounds necessary for maintaining alignment with the values that humans broadly share in common – then in practice that design is ‘one step away’ from getting mutated into misalignment by a faction of humans who seek a capability advantage (does it give one though?) for manifesting their more granular personal values.
Remmelt 15:57
@Forrest anything to add to the above? How about I write that into a comment reply (which perhaps might make for a stand-alone post later?)
Forrest 16:25
The starting point is: That any procedure of AGI alignment will
(Ie, cannot not) resemble some sort of error correction algorithm
Forrest 16:28
This point cannot be disagreed with and still have the notion of alignment be persistent. If they argue here, they will need to establish a conformable coherent counter example. Simple contradiction is not enough, as the claim basis shifts to them.
Forrest 16:30
Once hard line equivalence is firmly established, then the argument re limits of causation in all error correction processes apply.
Remmelt 16:44
Good one. So start with how any AGI-alignment method must resemble an error correction algorithm in order to correct out misaligned behaviour learned and expressed in interaction with the environment.
Forrest 16:48
In regards to the idea that some machine could “have a strategy of optimizing for near-maximum X until condition Z is achieved, and *then* implement directive Q” makes at least three assumptions that are highly unlikely.
1; there is no assurance that in the long duration of the optimization of X that there would be persistence of directive Q.
2; there is no assurance that the condition ‘Z’ is consistently detectable, or that the triggering condition is even probable to being implemented, even if detected (as per a violation of assumption 1).
3; the integration coupling of directive Q is mismatched to the sheer mass of the output of the optimized code—ie, it is unlikely that ‘Q’, no matter how formed, will actually result in the consistent behavior described. Ie, ‘Q’ is significantly underspecified, and if ‘Q’ was appropriately specified, it would not be “lightweight” with respect to the optimization process ‘X’, and would have been optimized away to ineffectiveness. Hence, even if Z condition trigger happens, Q does not happen.
Note insofar as the logic is directly generalized over the logical claim form of “adopt a strategy of near-maximum colonization speed until they reach the most distant accessible galaxies, and only then divert resources to producing welfare”, the proposal itself is not well specified. Ie, the assumption that ‘Q’ and detection of ‘Z’ can be “simple” is a conceptual category error.
Remmelt 17:02
I read through this, and think I understand your points.
Forrest 17:10
Yes. I am trying to think of a good metaphor. It is maybe a bit like some single ordinary person trying to command everyone to do something in some consistent way, when that _everyone_ is 100000 people all yelling at one another in some huge auditorium somewhere. The assumption that one person, even at maximum personal signal volume and intensity, could be heard over all similar other people (in this case, the other objectives of the Von Neumann probe), and command them to implement some new action pattern reliably consistent with ‘Q’, is simply deeply mismatched to the output reality of the optimization process ‘X’. The single directive ‘Q’ is simply not powerful enough to fully conditionalize all of ‘X’, even under triggering conditions ‘Z’.
Also, I notice that the assumptions that the error correction process could be applied linearly to a dynamic self recursive system at arbitrary levels of abstraction is simply another category error. Wrong tool for the wrong job. That is less important, though, than the issue with the Q directive arbitrarily efficient effectivity mismatch.
Forrest 17:37
Also, I added the following document to assist in some of what you are trying to do above: https://mflb.com/ai_alignment_1/tech_align_error_correct_fail_psr.html#p1
This echos something I think I sent previously, but I could not find it in another doc, so I added it.
I addressed claims of similar forms at least 3 times times already on separate occasions (including in the post itself).
Suggest reading this: https://www.lesswrong.com/posts/bkjoHFKjRJhYMebXr/the-limited-upside-of-interpretability?commentId=wbWQaWJfXe7RzSCCE
“The fact that mechanistic interpretability can possibly be used to detect a few straightforwardly detectable misalignment of the kinds you are able to imagine right now does not mean that the method can be extended to detecting/simulating most or all human-lethal dynamics manifested in/by AGI over the long term.
If AGI behaviour converges on outcomes that result in our deaths through less direct routes, it really does not matter much whether the AI researcher humans did an okay job at detecting “intentional direct lethality” and “explicitly rendered deception”.”
This is like saying there’s no value to learning about and stopping a nuclear attack from killing you because you might get absolutely no benefit from not being killed then, and being tipped off about a threat trying to kill you, because later the opponent might kill you with nanotechnology before you can prevent it.
Removing intentional deception or harm greatly increases the capability of AIs that can be worked with without getting killed, to further improve safety measures. And as I said actually being able to show a threat to skeptics is immensely better for all solutions, including relinquishment, than controversial speculation.
No, it’s not like that.
It’s saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it’s still lethal. Focussing on some ways that you feel confident that you might be able to prevent the doomsday device from being lethal is IMO distracting dangerously from the point, which is that people should not built the doomsday device in the first place.
If mechanistic interpretability methods cannot prevent that interactions of AGI necessarily converge on total human extinction beyond theoretical limits of controllability, it means that these (or other “inspect internals”) methods cannot contribute to long-term AGI safety. And this is not idle speculation, nor based on prima facie arguments. It is based on 15 years of research by a polymath working outside this community.
In that sense, it would not really matter that mechanistic interpretability can do an okay job at detecting that a power-seeking AI was explicitly plotting to overthrow humanity.
That is, except for the extremely unlikely case you pointed to that such intentions are detected and on time, and humans all coordinate at once to impose an effective moratorium on scaling or computing larger models. But this is actually speculation, whereas that OpenAI promoted Olah’s fascinating Microscope-generated images as them making progress on understanding and aligning scalable ML models is not speculation.
Overall, my sense is that mechanistic interpretability is used to align-wash capability progress towards AGI, while not contributing to safety where it predominantly matters.
Exactly this kind of thinking is what I am concerned about. It implicitly assumes that you have a (sufficiently) comprehensive and sound understanding of the ways humans would get killed at a given level of capability, and therefore can rely on that understanding to conclude that capabilities of AIs can be greatly increased without humans getting killed.
How do you think capability developers would respond to that statement? Will they just stay on the safe side, saying “Well those alignment researchers say that mechanistic interpretability helps remove intentional deception or harm, but I’m just going to stay on the safe side and not scale any further”. No, they are going to use your statement to promote the potential safety of their scalable models, and remove whatever safety margin they can justify themselves taking and feel justified taking for themselves.
Not considering unknown unknowns is going to get us killed. Not considering what safety problems may be unsolvable is going to get us killed.
Age-old saying: “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”
Sorry if I missed it earlier in the thread, but who is this “polymath”?
Forrest Landry.
From Math Expectations, a depersonalised post Forrest wrote of his impressions of a conversation with a grant investigator where the grant investigator kept looping back on the expectation that a “proof” based on formal reasoning must be written in mathematical notation. We did end up receiving the $170K grant.
I usually do not mention Forrest Landry’s name immediately for two reasons:
If you google his name, he comes across like a spiritual hippie. Geeks who don’t understand his use of language take that as a cue that he must not know anything about computational science, mathematics or physics (wrong – Forrest has deep insights into programming methods and eg. why Bell’s Theorem is a thing) .
Forrest prefers to work on the frontiers of research, rather than repeating himself in long conversations with tech people who cannot let go off their own mental models and quickly jump to motivated counterarguments that he heard and addressed many times before. So I act as a bridge-builder, trying to translate between Forrest speak and Alignment Forum speak.
Both of us prefer to work behind the scenes. I’ve only recently started to touch on the arguments in public.
You can find those arguments elaborated on here.
Warning: large inferential distance; do message clarifying questions – I’m game!
No, it’s not like that.
It’s saying that if you can prevent a doomsday device from being lethal in some ways and not in others, then it’s still lethal. Focussing on some ways that you feel confident that you might be able to prevent the doomsday device from being lethal is IMO distracting dangerously from the point, which is that people should not built the doomsday device in the first place.
As requested by Remmelt I’ll make some comments on the track record of privacy advocates, and their relevance to alignment.
I did some active privacy advocacy in the context of the early Internet in the 1990s, and have been following the field ever since. Overall, my assessment is that the privacy advocacy/digital civil rights community has had both failures and successes. It has not succeeded (yet) in its aim to stop large companies and governments from having all your data. On the other hand, it has been more successful in its policy advocacy towards limiting what large companies and governments are actually allowed to do with all that data.
The digital civil rights community has long promoted the idea that Internet based platforms and other computer systems must be designed and run in a way that is aligned with human values. In the context of AI and ML based computer systems, this has led to demands for AI fairness and transparency/explainability that have also found their way into policy like the GDPR, legislation in California, and the upcoming EU AI Act. AI fairness demands have influenced the course of AI research being done, e.g. there has been research on defining it even means for an AI model to be fair, and on making models that actually implement this meaning.
To a first approximation, privacy and digital rights advocates will care much more about what an ML model does, what effect its use has on society, than about the actual size of the ML model. So they are not natural allies for x-risk community initiatives that would seek a simple ban on models beyond a certain size. However, they would be natural allies for any initiative that seeks to design more aligned models, or to promote a growth of research funding in that direction.
To make a comment on the premise of the original post above: digital rights activists will likely tell you that, when it comes to interventions on AI research, speculating about the tractability of ‘slowing down AI research’ is misguided. What you really should be thinking about is changing the direction of AI research.
This is insightful for me, thank you!
Also, I stand corrected then on my earlier comment on that privacy and digital ownership advocates would/should care about models being trained on their own/person-tracking data such to restrict the scaling of models. I’m guessing I was not tracking well then what people in at least the civil rights spaces Koen moves around in are thinking and would advocate for.
This is a very spicy take, but I would (weakly) guess that a hypothetical ban on ML trainings that cost more than $10M would make AGI timelines marginally shorter rather than longer, via shifting attention and energy away from scaling and towards algorithm innovation.
Very interesting! Recently, US started to regulate export of computing power to China. Do you expect this to speed up AGI timeline in China, or do you expect regulation to be ineffective, or something else?
Reportedly, NVIDIA developed A800, which is just A100, to keep the letter but probably not the spirit of the regulation. I am trying to follow closely how A800 fares, because it seems to be an important data point on feasibility of regulating computing power.
I strongly agree with Steven about this. Personally, I expect it’ll be non-impactful in either direction. I think the majority of research groups already have sufficient compute available to make dangerous algorithmic progress, and they are not so compute-resource-rich that their scaling efforts are distracting them from more dangerous pursuits. I think the groups who would be more dangerous if they weren’t ‘resource drunk’ are mainly researchers at big companies.
I think the two camps are less orthogonal than your examples of privacy and compute reg portray. There’s room for plenty of excellent policy interventions that both camps could work together to support. For instance, increasing regulatory requirements for transparency on algorithmic decision-making (and crucially, building a capacity both in regulators and in the market supporting them to enforce this) is something that I think both camps would get behind (the xrisk one because it creates demand for interpretability and more and the other because eg. it’s easier to show fairness issues) and could productively work on together. I think there are subculture clash reasons the two camps don’t always get on, but that these can be overcome, particularly given there’s a common enemy (misaligned powerful AI). See also this paper Beyond Near- and Long-Term: Towards a Clearer Account of Research Priorities in AI Ethics and Society I know lots of people who are uncertain about how big the risks are, and care about both problems, and work on both (I am one of these—I care more about AGI risk, but I think the best things I can do to help avert it involve working with the people you think aren’t helpful).