I have had this same question for a while, and this is the general conclusion I’ve come to:
Identify today's safety issues, solve them, and then assume those issues will scale as the technology scales, and either amp up the original solution or develop new tactics to address the extrapolated flaws.
This sounds a little vague, so here is an example: we see one of the big models misrepresent history in an attempt to be woke, and maybe it gives a teenager a misconception of history. The best thing we can do from a safety perspective is figure out how to train models to represent facts accurately. Once that is done, we can extrapolate the flaw up to a model deliberately feeding misinformation to achieve a certain goal, and try to apply the same solution we used for the smaller problem to the bigger one, or, if we see it won't work, develop a new solution.
The biggest problem with this is that it is reactive: if you only use this method, a danger may cause major harm the first time it presents itself.
I know this approach isn’t as effective for xrisk, but still, it’s something I like to use. Easy to say though, coming from someone who doesn’t actually work in AI safety.
I know this approach isn’t as effective for xrisk, but still, it’s something I like to use.
This sentence has the grammatical structure of acknowledging a counterargument and negating it—“I know x, but y”—but the y is “it’s something I like to use”, which does not actually negate the x.
This is the kind of thing I suspect results from a process like: someone writes out the structure of a negation, wanting to negate an argument, but then finds nothing stronger to slot into the place where the negating argument is supposed to go.