Vanessa Kosoy comments on Discussion with Eliezer Yudkowsky on AGI interventions

Vanessa Kosoy 13 Nov 2021 11:20 UTC
LW: 7 AF: 5
AF
I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don’t need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented).

I think the pipeline of success looks something like theoretical models ⇒ phenomenological models (i.e. models informed by a combination of theory and experiment) ⇒ security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded / irrelevant) ⇒ plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it’s not a pure waterfall, there is feedback from each stage to previous stages.