I think your objections are all basically correct, but you treat them as dealbreakers in ways that I (a big shard-alignment fan) don’t. As I understand it, your objections boil down to: 1. picking the training curriculum/reward signal is hard (and the design choices pose challenges beyond the simple empirical question of whether it works to produce an AGI), and 2. reflectivity is very hard, might cause lots of big problems, and we can’t begin to productively engage with those issues right now.
I don’t think that curriculum and reward signal are as problematic as you seem to think. From the standpoint of AI notkilleveryoneism, I think that basically any set of prosocial/human-friendly values will be sufficient, and that something directionally correct will be very easy to find. The design choices described as relating to “what’s in the curriculum” seem of secondary importance to me—in all but the least iterative-design-friendly worlds, we can figure this out as we go, and even in hard-takeoff worlds, if we figure out the notkilleveryoneism/basic corrigibility stuff, we would probably be able to slow down AI development long enough for iteration.
The reflectivity stuff 100% does cause huge problems that we don’t know how to solve, but I break with you in two places here—firstly, you seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes; and secondly, you seem to assume that reflectivity involves or induces additional challenges that IMO can very readily be avoided. Regarding the former point, I think I’m doing empirical work right now that can plausibly help improve our understanding of reflectivity, and Peli Grietzer is doing theoretical work (on what he calls “praxis-based values,” based on “doing X X-ingly… the intuition that some reflective values are an uroboros of means and ends”) that engages with these problems as well. There’s lots of low-hanging fruit here, and for an approach to alignment that’s only been in play for about a year I think a lot of progress has been made.
Regarding the latter point, I think lots of your points surrounding lock-in might be stated too strongly. I’m a reflective goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible—but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you—I think the bigger dangers (and ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how individually desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.
you seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes
We can certainly do research now that builds towards the research we eventually need to do. But if the empirical work you’re doing right now can predict when an RL agent will start taking actions to preserve its own goals, I will be surprised and even more interested than I already am.
Regarding the latter point, I think lots of your points surrounding lock-in might be stated too strongly. I’m a reflective goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible—but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you
Lock-in is the process that stops the RL agent from slipping down the slope to actually maximizing the reward function as written. An example in humans would be how you avoid taking heroin specifically because you know that it would strongly stimulate the literal reward calculation of your brain.
You seem to be making an implied argument like “this isn’t a big problem for me, a human, so it probably happens by default in a good way in future RL agents,” and I don’t find that implied argument valid.
I think the bigger dangers (and ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how individually desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.
What sort of stuff would be an example of that latter problem? If a shard-condensation process can lead to such human-undesirable generalization when its shards are taken collectively, why should the individual shards that it condenses generalize the way we want when taken individually?