Looking for an alignment tutor
Hey, this is me. I’d like to understand AI X-risk better. Is anyone interested in being my “alignment tutor”, for maybe 1 h per week, or 1 h every two weeks? I’m happy to pay.
Fields I want to understand better:
Anything related to prosaic AI alignment/existential ML safety
Failure stories/threat models
Fields I’m not interested in (right now):
agent foundations
decision theory
other very mathsy stuff that’s not related to ML
My level of understanding:
I have a decent knowledge of ML/deep learning (I’m in the last year of my PhD)
I haven’t done the AGI Safety Fundamentals course, but I just skimmed it, and I think I have independently read essentially all the core readings (which means I have probably also read many things not on the curriculum). I’d say I have a relatively deep understanding of a majority (but not all) of the content in this curriculum.
Similarly for the AGI Safety Fundamentals 201, excluding the tracks
Example questions I have wrestled with recently, and that I might bring up during tutoring:
It seems to me that our current level of outer alignment tools (RLHF + easy augmentation) is enough to solve the outer alignment problem sufficiently well that humans don’t end up dead or disempowered (conditional on slow takeoff); we can then solve further outer alignment problems as they come up, with iteration and regulation. So I basically think that the core of the alignment problem, at the moment, is inner alignment + deceptive alignment. What am I missing? (I read Christiano’s “Another Outer Alignment Failure Story”, but I still have this question.)
I understand that a reward maximiser would wire-head (take control over the reward provision mechanism), but I don’t see why training an RL agent would necessarily produce a reward-maximising agent. TurnTrout’s “Reward is Not the Optimisation Target” shed some light on this, but I definitely have remaining questions.
Is the failure mode described in Ajeya’s “Without Specific Countermeasures” an inner alignment failure or an outer alignment failure? (I think it’s both.)
You don’t need to have very crisp answers to these to be my tutor, but you should probably have at least some good thoughts.