It’s capability research that is coupled to alignment:
Furthermore, it seems like a win for interpretability and alignment, as it gives greater feedback on how the AI intends to earn rewards, and a better ability to control those rewards.
Coupling alignment to capabilities is basically what we need to survive, because the danger of capabilities comes from the fact that capabilities research is self-funding and thereby risks outracing alignment. If alignment can absorb enough of capabilities' success, we survive.
I missed that paragraph on first reading, mea culpa. I think that your story about how it’s a win for interpretability and alignment is very unconvincing, but I don’t feel like hashing it out atm. Revised to weak downvote.
Also, if you expect this to take off, then by your own admission you are mostly accelerating the current trajectory (which I consider mostly doomed) rather than changing it. Unless you expect it to take off mostly thanks to you?
Surely your expectation that the current trajectory is mostly doomed depends on your expectations about the technical details of how that trajectory will unfold. If technical specifics emerge that show the current trajectory heading in a more alignable direction, it may be fine to accelerate.
Sure, if after updating on your discovery it seems that the current trajectory is not doomed, that might imply accelerating is good. But here, that is very far from being the case.