As I understood it, this post is the author thinking aloud while embarking on the scientific quest of searching for search algorithms in neural networks. It’s a way of preparing the ground for the actual experiments.
Imagine a researcher embarking on the quest of “searching for search”. I highlight in italics the steps that are present in the post (at least to some degree):
- At some point, the researcher reads Risks From Learned Optimization.
- They complain: “OK, Hubinger, fine, but you haven’t told me what search is anyway.”
- They read or get involved in the discussions about optimization that ensue on LessWrong.
- They try to come up with a good definition, but they run into difficulties.
- They try to understand what kind of beast search is by coming up with some general ways to do search (see the sketch after this list).
- They try to determine what kind of search neural networks might implement. They use a bunch of facts they know about search processes and neural networks to come up with ideas.
- They try to devise ways to test these hypotheses, and even tackle the meta-question: “how do I go about forming and testing these hypotheses anyway?”
- They think of a bunch of experiments, but they notice pitfalls. They devise strategies such as “OK, maybe it’s better if I try to get a firehose of evidence instead of optimizing too hard on testing single hypotheses.”
- They start doing actual interpretability experiments.
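To make “some general ways to do search” concrete, here is a minimal sketch of one such way: a propose-evaluate-keep loop, the skeleton shared by hill climbing and many other local search methods. This is my own illustrative example, not something from the post; the objective, proposal function, and budget are hypothetical stand-ins.

```python
# A minimal, generic propose-and-evaluate search loop (hypothetical
# illustration; names and parameters are my own, not from the post).
import random

def search(objective, propose, initial, budget=1000):
    """Iteratively propose candidates, score them, keep the best.

    objective: maps a candidate to a score (higher is better).
    propose:   maps the current best candidate to a new candidate.
    initial:   starting candidate.
    budget:    number of objective evaluations to spend.
    """
    best, best_score = initial, objective(initial)
    for _ in range(budget):
        candidate = propose(best)
        score = objective(candidate)
        if score > best_score:  # greedy acceptance: keep improvements only
            best, best_score = candidate, score
    return best, best_score

# Toy usage: maximize -(x - 3)^2 by random local perturbation.
best_x, best_score = search(
    objective=lambda x: -(x - 3.0) ** 2,
    propose=lambda x: x + random.gauss(0.0, 0.5),
    initial=0.0,
)
print(best_x)  # should land near 3.0
```

Swapping the acceptance rule or the proposal distribution turns this same skeleton into simulated annealing, evolutionary search, and so on, which is part of why pinning down a single definition of “search” is hard.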
Having the reasoning steps in this process laid out for everyone to see is informative and lets people chime in with ideas. Not going through the conceptual steps, at least privately, before running experiments risks wasting a bunch of resources. Exploring the space of problems is cool and good.