Here are some other lessons I learned over the last months from doing alignment research aimed at finding the right ontology for modelling (my) cognition:
make examples: if you have an abstract goal or abstract hypothesis/belief/model/plan, clarify on an example what it predicts.
e.g. given thought “i might want to see why some thoughts are generated” → what does that mean more concretely? → more concrete subcases:
could mean noticing a common cognitive strategy
could mean noticing some suggestive concept similarity
maybe other stuff like causal inference (-> notice i’m not that clear on what i mean by that → clarify and try to come up with an example):
basically i mean that maybe sometimes a thought pops into my mind because it is a causal consequence of some other event i modeled in my mind
e.g. “i imagine hiking a longer path” -> “i imagine missing the call i have in the evening”
(NOTE: feel free to skip this.) e.g. for proposal “i might want to try to train lessons by seeing how i could’ve applied it to other recent problems i attacked”:
-> trigger make example → ok what lesson could i pick → ok let’s use “don’t just ask what your goal is but also why you want to achieve the goal”
-> see how i could’ve applied the lesson to recent problems i attacked → what are recent problems i attacked? (-> darn i’m in studying phase and don’t have recent problems super clearly on my mind → ok let’s pick an upcoming problem → i guess i had planned recent because then i might be better able to evaluate whether it would’ve actually helped but nvm now) -> upcoming problem: “plan how to set up an initial deliberate practice plan for training myself to better model what is happening in my mind”
--apply-lesson-> why do i want to do this? → basically want to train introspection to get better data for forming models of my mind, but thought that ‘introspection’ is not an atomic skill but comes from training to see particular patterns or so → also better modelling my mind might help me to notice when i ought to apply some lesson → also want to better understand how i make progress to review and improve ---> (ok could go further here but i guess this is enough since it’s just a sub-example of sth else)
--> review: “was this useful” ->
i guess applying the lesson for the upcoming problem is a good idea
i guess for training lessons i need to focus more on the trigger part and not just go through problems and apply it
i guess considering that i originally wanted to just make an example of how to come up with an example for seeing whether a hypothesis is true, i derailed a lot, in a way that the takeaway will be the lesson “don’t just ask what your goal is but also why you want to achieve the goal” or the lesson “i might want to try to train lessons i learn from review by seeing how i could’ve applied it to other recent problems i attacked” instead of “if you have an abstract hypothesis/belief/model/plan, clarify on an example what it predicts” → OOPS
-> but i did learn that “i guess for training lessons i need to focus more on the trigger part and not just go through problems and apply it” from actually imagining a concrete example of what it would look like if i “train lessons by seeing how i could’ve applied it to other recent problems i attacked”.
(yes it’s often annoying and not easy, especially in the beginning)
(if you can’t you’re still confused.)
generally be very concrete. also Taboo your words and Replace the Symbol with the Substance.
I want to highlight the “what is my goal” part
also ask “why do i want to achieve the goal?”
(-> minimize goodhart)
clarify your goal as much as possible.
(again Taboo your words...)
clarify your goal on examples
when your goal is to understand something, how will you be able to apply the understanding to a particular example?
(NOTE: feel free to skip this.) e.g. say my goal is “become able to model what is happening in my mind (especially when doing research)”
=> goal on example: “become able to model what happened in my mind when i came up with the above bullet point (the one that starts with ‘when your goal is to understand something’)”
=> clarify goal on example: “well i don’t know the right ontology yet for modelling processes in my mind, but here’s an example of what it could look like (though it won’t look like that, i’m only trying to get clearer on the shape of the answer):
‘short-term-memory context: previous recalled models on “why do i want to achieve the goal” and some other bunch → loaded query “get example for ‘clarify your goal on examples’” → parse how goal might look like → think “perhaps i want to understand sth” → adjust query to “get example for ‘clarify your goal to understand sth on examples’” -(unconscious-in-parallel)-> background process also updates “why do i want to achieve the goal?” to “why do i want to achieve the goal to understand sth?” -(unconscious)-> suggests answer that i can better model particular cases that come up → match the active “example” concept to “particular cases” → try apply this → …’.
(tbc this example for what might have happened in my mind is totally made up and not grounded in observations. (i didn’t try to introspect there.) (in this case it was actually probably more of a cached thought.))
...and well actually it’s maybe not that much of a chain of thoughts but more like what mini-goals are being attacked or what models are loaded. and perhaps not actually in that much detail for some time to come. but when i have the right frames it might be easier to compress introspective observations into it. (...?)”
(yeah sry i maybe ought to have used a simpler example; a rough sketch of how such a trace could be written down as data follows below.)
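Purely as an illustrative aside (not something from the original notes, and all names here are hypothetical): if one wanted to write such a made-up thought-chain down as data, so that introspective observations could later be compressed into a shape like it, a minimal sketch in Python might look like this.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ThoughtStep:
    content: str           # what the step was (a query, a loaded model, a mini-goal, ...)
    kind: str = "step"     # e.g. "query", "background-update", "suggestion"
    conscious: bool = True  # False for unconscious / in-parallel background steps

@dataclass
class Trace:
    context: List[str]                                   # rough short-term-memory context
    steps: List[ThoughtStep] = field(default_factory=list)

# the made-up example chain from above, written in this shape
trace = Trace(
    context=["recalled models on 'why do i want to achieve the goal'"],
    steps=[
        ThoughtStep("get example for 'clarify your goal on examples'", kind="query"),
        ThoughtStep("perhaps i want to understand sth"),
        ThoughtStep("adjust query to 'clarify your goal to understand sth on examples'", kind="query"),
        ThoughtStep("update 'why do i want to achieve the goal?' to the understand-sth version",
                    kind="background-update", conscious=False),
        ThoughtStep("suggestion: so i can better model particular cases that come up",
                    kind="suggestion", conscious=False),
        ThoughtStep("match the active 'example' concept to 'particular cases'"),
    ],
)

for step in trace.steps:
    prefix = "(background) " if not step.conscious else ""
    print(f"{prefix}[{step.kind}] {step.content}")
```

The point is just the rough shape (context, mini-goals/queries, conscious vs. background steps), not these particular field names or level of detail.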
try to extract the core subproblems/subgoals.
e.g. for corrigibility a core subproblem is the shutdown problem
e.g. for “solve unbounded diamond maximizer proposal” a core problem is “understand what kind of low-level structure can correspond to high-level abstractions”.
(for both examples above one needs to get even more precise core subproblems recursively.)
(NOTE: bad initial example, feel free to skip.) e.g. for “solve alignment to a pivotal level” (which is actually a bad example because it doesn’t factor neatly) a not-incredibly-awful initial breakdown for my approach might be:
find the right ontology for modelling cognition; find some way we could understand how smart AIs work.
solve ontology identification
solve subsystem alignment; figure out how to design a robust goal slot into the AI
solve corrigibility
find what pivotal act to aim for
i guess make sure you think concretely and list subproblems and summarize the core ones and iterate. follow up on confusions where problems still seem sorta mixed up. let your mind find the natural clusters. (not sure if that will be sufficient for you.) (a toy sketch of such a breakdown written down as a small tree follows below.)
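Again purely as an illustration (not from the original notes; class and field names are made up): the “extract the core subproblems recursively” idea can be written down as a small problem tree, using the example breakdown above. A minimal sketch in Python:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Problem:
    name: str
    subproblems: List["Problem"] = field(default_factory=list)

    def open_leaves(self) -> List["Problem"]:
        """The currently most precise subproblems (where recursion still has to continue)."""
        if not self.subproblems:
            return [self]
        return [leaf for sub in self.subproblems for leaf in sub.open_leaves()]

# the (admittedly rough) breakdown from above
alignment = Problem("solve alignment to a pivotal level", [
    Problem("find the right ontology for modelling cognition / understand how smart AIs work"),
    Problem("solve ontology identification"),
    Problem("solve subsystem alignment / design a robust goal slot into the AI"),
    Problem("solve corrigibility", [Problem("the shutdown problem")]),
    Problem("find what pivotal act to aim for"),
])

for leaf in alignment.open_leaves():
    print(leaf.name)
```

Nothing depends on the tree being in code, of course; the point is only that listing, summarizing, and iterating on the core subproblems gets easier once the breakdown is written down explicitly.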
tie yourself closely to observations.
drop all assumptions. apply generalized hold off on proposing solutions.
in particular, try not to make implicit non-well-founded assumptions about what the ontology looks like, e.g. by asking questions like “how can i formalize concepts” or “what are thoughts”. just see the observations as directly as possible and try to form a model of the underlying process that generates them.
first form a model about concrete narrow cases and only later generalize
e.g. first study precisely what thought chains you had on particular combinatorics problems before hypothesizing what kind of general strategies your mind uses.
special case: (first) plan how to solve specific research subproblems rather than trying to come up with good general methodology for the kinds of problems you are attacking.
don’t overplan; rather try stuff, review how it’s going, replan, and iterate.
this is sorta an application of “get concrete” where you get concrete through actually trying the thing rather than imagining what it will look like if you attack it.
often review how you made progress and see how to improve.
(also generally lots of other lessons from the sequences (and HPMoR): notice confusion, notice mysterious answers, know what an actual reduction looks like, and probably a whole bunch more)
Tbc those are sorta advanced techniques. Most alignment researchers are working on lines of hope that pretty obviously won’t work while thinking they have a decent chance of working, and I wouldn’t expect those techniques to be of much use for them.
There is this quite foundational skill of “notice when you’re not making progress / when your proposals aren’t actually good” which is required for further improvement, and I do not know how to teach it. It’s related to being very concrete and to noticing mysterious answers, or noticing when you’re too abstract or still confused. It might sorta be what Eliezer calls security mindset.
(Also, one other small caveat: I have not yet gotten very clear, great results out of my research, but I do think I am making faster progress (and I’m setting myself a very high standard). I’d guess the lessons can probably be misunderstood and misapplied, but idk.)