Extinction Risks from AI: Invisible to Science?
Link post
Abstract: In an effort to inform the discussion surrounding existential risks from AI, we formulate Extinction-level Goodhart’s Law as “Virtually any goal specification, pursued to the extreme, will result in the extinction[1] of humanity”, and we aim to understand which formal models are suitable for investigating this hypothesis. Note that we remain agnostic as to whether Extinction-level Goodhart’s Law holds or not. As our key contribution, we identify a set of conditions that are necessary for a model that aims to be informative for evaluating specific arguments for Extinction-level Goodhart’s Law. Since each of the conditions seems to significantly contribute to the complexity of the resulting model, formally evaluating the hypothesis might be exceedingly difficult. This raises the possibility that whether the risk of extinction from artificial intelligence is real or not, the underlying dynamics might be invisible to current scientific methods.
Together with Chris van Merwijk and Ida Mattsson, we have recently written a philosophy-venue version of some of our thoughts on Goodhart’s Law in the context of powerful AI [link].[2] This version of the paper has no math in it, but it attempts to point at one aspect of “Extinction-level Goodhart’s Law” that seems particularly relevant for AI advocacy – namely, that the fields of AI and CS would have been unlikely to come across evidence of AI risk, using the methods that are popular in those fields, even if the law did hold in the real world.
Since commenting on link-posts is inconvenient, I split off some of the ideas from the paper into the following separate posts:
Weak vs Quantitative Extinction-level Goodhart’s Law: defining different versions of the notion of “Extinction-level Goodhart’s Law”.
Which Model Properties are Necessary for Evaluating an Argument?: illustrating the methodology of the paper on a simple non-AI example.
Dynamics Crucial to AI Risk Seem to Make for Complicated Models: applying the methodology above to AI risk.
We have more material on this topic, including writing with math[3] in it, but this is mostly not yet in a publicly shareable form. The exception is the post Extinction-level Goodhart’s Law as a Property of the Environment (which is not covered by the paper). If you are interested in discussing anything related to this, definitely reach out.
A common comment is that the definition should also include outcomes that are comparably bad to, or worse than, extinction. While we agree that such a definition makes sense, we would prefer to refer to that version as “existential”, and to reserve the “extinction” version for the less ambiguous notion of literal extinction.
As an anecdote, it seems worth mentioning that I tried, and failed, to post the paper to arXiv – by now, it has been stuck there with “on hold” status for three weeks. Given that the paper is called “Extinction Risks from AI: Invisible to Science?”, there must be some deeper meaning to this. [EDIT: After ~2 months, the paper is now on arXiv.]
Or rather, it has pseudo-math in it, by which I mean that it looks like math but is built on top of vague concepts such as “optimisation power” and “specification complexity”. And while I hope that we will one day be able to formalise these, I don’t know how to do so at this point.
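For illustration only, here is one schematic shape that such a pseudo-mathematical statement could take. This is my own sketch rather than anything from the paper, and it treats “optimisation power” ($\mathrm{OP}$), the threshold $\theta$, and the “virtually all” quantifier as unanalysed placeholders, which is exactly the formalisation gap described above:

\[
\text{for virtually all goal specifications } G:\quad
\exists\,\theta \ \text{such that}\ \forall \text{ agents } A:\ \big(\mathrm{OP}(A)\ge\theta \ \text{and}\ A \text{ pursues } G\big) \;\Longrightarrow\; \Pr[\text{humanity goes extinct}]\approx 1.
\]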