Gradient hacking in supervised learning is generally recognized by alignment researchers (including the author of that article) as unlikely to be a problem. A recent post by people at Redwood Research says “This particular construction seems very unlikely to be constructible by early transformative AI, and in general we suspect gradient hacking won’t be a big safety concern for early transformative AI”. I would still defend the past research into it as good basic science, because we might encounter failure modes somewhat related to it.
FWIW I think that gradient hacking is pretty plausible, but it’ll probably end up looking fairly “prosaic”, and may not be a problem even if it’s present.
Are you thinking about exploration hacking here, or gradient hacking as distinct from exploration hacking?