First, an apology: I didn’t mean this to be read as an attack or a strawman, nor as applying to every use of theorem-proving, and I’m sorry I wasn’t clearer. I agree that formal specification is a valuable tool and research direction, a substantial advance over informal arguments, and only as good as its assumptions. I also think that hybrid formal/empirical analysis could be very valuable.
Trying to state a crux: I believe that any plan which involves proving corrigibility properties about MuZero (etc.) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art, and even verified systems still have (fewer) bugs.
Being able to state theorems about AGI seems absolutely necessary for success; but I don’t think it’s close to sufficient.
I think our disagreement may be more about degree than about kind. I think we will probably design architectures that make proving easier, rather than prove things about architectures optimized only for capability, though not necessarily. Moreover, some qualitative properties are not sensitive to architecture, and we can prove them about classes of architectures that include those optimized for capability. And I think humans also belong to a useful class of agents with a simple description (e.g. along the lines I described here), so you don’t need anything like a detailed model of bias. And people do manage to prove some things about large models (e.g. this), just not enough things. And some of the proofs might be produced by the system itself at runtime: e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part, and the rigorous part will make sure the heuristic part has proved the safety of its proposals before they are implemented.
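To make the last point concrete, here is a minimal toy sketch of that split, assuming a proposer/checker interface of my own invention (all names and types here are hypothetical illustrations, not anything from the comment): the untrusted heuristic part submits proposals together with purported safety certificates, and a small trusted part checks each certificate before anything is executed.

```python
# Hypothetical illustration only: a heuristic (untrusted) part proposes actions
# with safety certificates, and a rigorous (trusted) gate checks the certificate
# before the action is ever executed.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    action: str              # what the heuristic part wants to do
    safety_certificate: str  # purported proof/certificate that the action is safe


def run_if_safe(proposal: Proposal,
                check_certificate: Callable[[str, str], bool],
                execute: Callable[[str], None]) -> Optional[str]:
    """Trusted core: nothing runs unless its safety certificate checks out.

    `check_certificate` stands in for a real proof checker (ideally a small,
    verified kernel); the proposer can be arbitrarily heuristic and untrusted.
    """
    if check_certificate(proposal.action, proposal.safety_certificate):
        execute(proposal.action)
        return proposal.action
    return None  # rejected: no valid safety proof was supplied
```

The intended asymmetry is that checking a supplied proof can be far simpler, and far easier to trust, than searching for one, so only the checker needs to be rigorous.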
I think the pipeline of success looks something like: theoretical models ⇒ phenomenological models (i.e. models informed by a combination of theory and experiment) ⇒ security-mindset engineering (i.e. engineering that keeps track of the differences between the models and reality and makes sure they are suitably bounded or irrelevant) ⇒ a plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for foolproof test coverage while also making sure that, modulo previous tests, each test uses the system in safe ways even if it has bugs). And of course it’s not a pure waterfall; there is feedback from each stage back to earlier stages.