I really like this general direction of work: suggestions for capabilities that would also help with understanding and controlling network behavior. That would in turn be helpful for real alignment of network-based AGI. Proposing dual-use capabilities advances seems like a way to get alignment ideas actually implemented. That’s what I’ve done in System 2 Alignment, although that’s also a prediction about what developers might try for alignment by default.
Whether the approach you outline here would work is an empirical question, but it sounds likely enough that teams might actually put some effort into it. Preprocessing data to identify authors and similar categories wouldn’t be that hard.
This helps with the problem Nate Soares characterized as making cognition aimable at all—having AI pursue one coherent goal (separately from worrying about whether you can direct that “goal slot” toward something that actually works). I think that’s the alignment issue you’re addressing (along with slop potentially leading to bad AI-assisted alignment). I briefly describe the LLM agent alignment part of that issue in Seven sources of goals in LLM agents.
I hope I’m reading you right about why you think reducing AI slop would help with alignment.
Yeah, this is effectively a follow-up to my recent post on anti-slop interventions, detailing more of what I had in mind there. So yes, the dual-use framing is very much the intent.