Great post, thanks for writing this! In the version of “Alignment might be easier than we expect” in my head, I also have the following:
Value might not be that fragile. We might "get sufficiently many bits in the value specification right" more or less by default, and end up with an imperfect but still really valuable future.
For instance, maybe IRL would just learn something close enough to pCEV-utility from human behavior, and then training an agent with that as the reward would make it close enough to a human-value-maximizer. We'd get some misalignment at both steps (e.g. because there are systematic ways in which the humans generating the training data are wrong, and because of inner misalignment), but maybe the error is small enough to be fine, despite fragility of value and despite Goodhart.
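To make the two-step structure concrete, here's a toy sketch (entirely my own illustration, not a real IRL algorithm or anyone's actual proposal): a tiny MDP where noisy "human" demonstrations feed a crude reward-inference step, an agent is then optimized against the inferred reward, and we measure how much value survives under the true reward. Every piece of it (the chain environment, the stand-in "true values", the visitation-count proxy for IRL) is a hypothetical placeholder; the only point is that errors enter at both steps and the question is whether the final policy is still close enough.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny deterministic chain MDP: 5 states on a line, actions move left or right.
n_states, n_actions, gamma = 5, 2, 0.9

def step(s, a):  # action 0 = left, 1 = right
    return max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)

true_reward = np.array([0.0, 0.1, 0.2, 0.5, 1.0])  # stand-in for "human values"

def value_iteration(reward, iters=200):
    """Optimal Q-values for a given per-state reward."""
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = q.max(axis=1)
        q = np.array([[reward[s] + gamma * v[step(s, a)]
                       for a in range(n_actions)] for s in range(n_states)])
    return q

# "Human behavior": near-optimal demonstrations with a 10% error rate
# (a crude version of the systematic-human-error worry above).
q_true = value_iteration(true_reward)
demos = [(s, q_true[s].argmax() if rng.random() > 0.1 else rng.integers(n_actions))
         for s in rng.integers(n_states, size=200)]

# Step 1 (reward learning): a deliberately imperfect proxy -- score states by how often
# demonstrations move into them. Real IRL would do better, but still not perfectly.
learned_reward = np.zeros(n_states)
for s, a in demos:
    learned_reward[step(s, a)] += 1.0
learned_reward /= learned_reward.max()

# Step 2 (agent training): optimize against the learned reward.
policy = value_iteration(learned_reward).argmax(axis=1)

def discounted_return(policy, reward, start=0, horizon=100):
    s, total = start, 0.0
    for t in range(horizon):
        total += gamma ** t * reward[s]
        s = step(s, policy[s])
    return total

# How much of the attainable (true) value does the misaligned-at-both-steps agent capture?
print("agent:", discounted_return(policy, true_reward))
print("ideal:", discounted_return(q_true.argmax(axis=1), true_reward))
```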
Even if deceptive alignment were the default, it might be that the AI gets sufficiently close to the correct values before "becoming intelligent enough" to start deceiving us in training, such that even if it is thereafter only deceptively aligned, it still brings about a future that's fine once deployed.
It doesn’t seem completely wild that we could get an agent to robustly understand the concept of a paperclip by default. Is it completely wild that we could get an agent to robustly understand the concept of goodness by default?
Is it so wild that we could by default end up with an AGI that at least does something like putting 10^30 rats on heroin? I have some significant probability on this being a fine outcome.
There's some distance δ from the correct value specification such that things turn out fine if we get an AGI with values closer to it than δ. Do we have good reasons to think that δ is far outside the range that default approaches would give us?
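To spell out the shape of that claim (my notation, not anything from the post): write v* for the correct value specification, v̂ for the values the AGI actually ends up with, and d for whatever metric on value specifications is relevant. The claim is:

```latex
\exists\, \delta > 0 \ \text{such that}\ \ d(\hat{v}, v^{*}) < \delta \;\Longrightarrow\; \text{the future an AGI optimizing } \hat{v} \text{ brings about is fine.}
```

The fragility-of-value worry is that δ is tiny relative to the d(v̂, v*) that default approaches achieve; the question above is what evidence we actually have about the relative sizes of the two.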
(But here are some reasons not to expect this.)