Don’t you think that papers like “Alignment Faking in Large Language Models” by Anthropic show that models can internalize the values present in their training data very deeply, to the point of deploying various strategies to defend them, much as a highly moral human would? After all, many humans would be capable of working for a pro-animal-welfare company and then switching to one with the opposite stance without questioning it too much, as long as they are paid.
Granted, this does not solve the problem of an AI trained on data embedding undesirable values, which we could then lose control over. But at the very least, isn’t it a staggering breakthrough to have found a way to instill values into a machine so deeply and in a way similar to how humans acquire them? Not long ago, this might have seemed like pure science fiction and utterly impossible.
There are still many challenges regarding AI safety, but isn’t it somewhat extreme to be more pessimistic about the issue today than in the past? I read Superintelligence by Bostrom when it was released, and I must say I was more pessimistic after reading it than I am today, even though I remain concerned. But I am not an expert in the field, so perhaps my perspective is naïve.