Agreed. I think power-seeking and other instrumental goals (e.g. survival, non-corrigibility) are just going to inevitably arise, and that if shard theory works for superintelligence, it will do so by taking this into account and balancing these instrumental goals against deliberately installed shards which counteract them. I currently have a loosely held hypothesis, which I would like to test (work in progress), that it’s easier to ‘align’ a toy model of a power-seeking RL agent if the agent has lots and lots of competing desires whose weights are frequently changing than if it has a simpler and/or more statically weighted set of desires. Something maybe about the meta-learning of ‘my desires change, so part of meta-level power-seeking should be not power-seeking at the object level so hard that I sacrifice my ability to optimize for different object-level goals’. Unclear. I’m hoping that setting up an experimental framework and gathering data will show patterns that help clarify the issues involved.
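As a rough illustration only (this is not the actual experimental framework, which is still a work in progress), the two conditions being compared could be set up along these lines in Python. Everything here is an assumption made for the sketch: the names `make_weight_schedule` and `reward` are hypothetical, and modelling ‘desires’ as a weighted sum of per-desire satisfaction scores is just one simple choice among many.

```python
import random


def make_weight_schedule(n_desires, resample_every=None, seed=0):
    """Return a function mapping a time step to normalized desire weights.

    Hypothetical helper for illustration only.
    resample_every=None -> static weights (the 'statically weighted' agent).
    resample_every=k    -> weights are redrawn every k steps
                           (the 'frequently changing' agent).
    """
    def weights_at(step):
        # Steps in the same epoch share the same weights.
        epoch = 0 if resample_every is None else step // resample_every
        rng = random.Random(seed * 1_000_003 + epoch)
        # Small offset keeps the total strictly positive.
        raw = [rng.random() + 1e-9 for _ in range(n_desires)]
        total = sum(raw)
        return [w / total for w in raw]

    return weights_at


def reward(satisfaction, weights):
    """Scalar reward: weighted sum of how satisfied each desire is."""
    return sum(s * w for s, w in zip(satisfaction, weights))


# Two conditions from the hypothesis (sizes are arbitrary placeholders).
static_few = make_weight_schedule(n_desires=3, resample_every=None)
shifting_many = make_weight_schedule(n_desires=20, resample_every=5)
```

The idea would be to train the same toy agent under each schedule and compare whichever alignment measure the experiment settles on; that measure is deliberately left out here because it is not yet specified.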