(This is an abridged version of my comment here, which I think belongs on my shortform. I removed some examples which were overly long. See the original comment for those.)
Here are some lessons I learned over the last few months from doing alignment research, trying to find the right ontology for modelling (my) cognition:
make examples: if you have an abstract goal or abstract hypothesis/belief/model/plan, clarify on an example what it predicts.
e.g. given the thought “i might want to see why some thoughts are generated” → what does that mean more concretely? → more concrete subcases:
could mean noticing a common cognitive strategy
could mean noticing some suggestive concept similarity
maybe other stuff like causal inference (→ notice i’m not that clear on what i mean by that → clarify and try to come up with an example):
e.g. “i imagine hiking a longer path” → “i imagine missing the call i have in the evening”
(yes it’s often annoying and not easy, especially in the beginning)
(if you can’t, you’re still confused.)
generally be very concrete. also Taboo your words and Replace the Symbol with the Substance.
I want to highlight the “what is my goal?” part.
also ask “why do i want to achieve the goal?”
(→ minimize goodharting)
clarify your goal as much as possible.
(again Taboo your words...)
clarify your goal on examples
when your goal is to understand something, how will you be able to apply that understanding to a particular example?
try to extract the core subproblems/subgoals.
e.g. for corrigibility a core subproblem is the shutdown problem (from which further, more precise subproblems could be extracted); a toy sketch of making this concrete is below.
i guess: make sure you think concretely, list subproblems, summarize the core ones, and iterate. follow up on confusions where problems still seem sorta mixed up. let your mind find the natural clusters. (not sure if that will be sufficient for you.)
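(a toy sketch of what making the shutdown problem concrete could look like; made-up numbers and a deliberately oversimplified world, just to have one explicit example: an agent that only maximizes expected task reward prefers to disable its off-switch.)

```python
# toy shutdown-problem sketch (illustrative only, made-up numbers):
# the agent first chooses whether to disable its off-switch, then a human may
# press the switch; if the switch still works and is pressed, the agent is
# shut down before it can finish its task.

P_HUMAN_PRESSES = 0.5   # assumed probability that the human wants a shutdown
TASK_REWARD = 10.0      # reward for finishing the task while still running

def expected_task_reward(disable_switch: bool) -> float:
    if disable_switch:
        return TASK_REWARD                      # shutdown impossible, task always finished
    return (1 - P_HUMAN_PRESSES) * TASK_REWARD  # finished only if not shut down

for action in (False, True):
    print(f"disable_switch={action}: E[reward] = {expected_task_reward(action)}")
# disable_switch=False: E[reward] = 5.0
# disable_switch=True: E[reward] = 10.0
```

the pure task-reward maximizer prefers disabling the switch; the shutdown problem is (roughly) how to specify the agent so this preference doesn’t appear, without breaking it in other ways. writing out even a toy version like this is the kind of concreteness the points above are aiming at.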
tie yourself closely to observations.
drop all assumptions. apply a generalized version of Hold Off On Proposing Solutions.
in particular, try not to make implicit, non-well-founded assumptions about what the ontology looks like, e.g. by asking questions like “how can i formalize concepts?” or “what are thoughts?”. just see the observations as directly as possible and try to form a model of the underlying process that generates them.
first form a model of concrete, narrow cases and only later generalize
e.g. first study precisely what thought-chains you had on particular combinatorics problems before hypothesizing what kinds of general strategies your mind uses (a toy example of what such a record might look like is below).
special case: (first) plan how to solve specific research subproblems rather than trying to come up with a good general methodology for the kinds of problems you are attacking.
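(as a made-up toy example of what such a thought-chain record might look like, on the problem “how many 5-card poker hands contain at least one ace?”: read the problem → tried splitting into cases by number of aces → noticed “at least one” makes the cases messy → recalled complement counting → computed

$$\binom{52}{5} - \binom{48}{5} = 2{,}598{,}960 - 1{,}712{,}304 = 886{,}656$$

and only after collecting several such concrete records would i hypothesize a general strategy like “when i see ‘at least one’, try the complement”.)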
don’t overplan; rather, try stuff, review how it’s going, replan, and iterate.
this is sorta an application of “get concrete” where you get concrete by actually trying the thing rather than imagining what it will look like if you attack it.
often review how you made progress and see how to improve.
(also generally lots of other lessons from the Sequences (and HPMoR): noticing confusion, noticing mysterious answers, knowing what an actual reduction looks like, and probably a whole bunch more)
Tbc, those are sorta advanced techniques. Most alignment researchers are working on lines of hope that pretty obviously won’t work while thinking they have a decent chance of working, and I wouldn’t expect those techniques to be of much use for them.
There is this quite foundational skill of “notice when you’re not making progress / when your proposals aren’t actually good” which is required for further improvement, and I do not know how to teach it. It’s related to being very concrete and to noticing mysterious answers, or noticing when you’re too abstract or still confused. It might sorta be what Eliezer calls security mindset.
(Also, another small caveat: I have not yet gotten very clear, great results out of my research, but I do think I am making faster progress (and I’m setting myself a very high standard). I’d guess the lessons can probably be misunderstood and misapplied, but idk.)