To make sure I understand your notation: f1 is some set of weights, right? If it’s a set of multiple weights, I don’t know what you mean when you write ∂y/∂f1.
There should also exist at least some f1, f2 where C(f1,f1) ≠ C(f2,f2), since otherwise C no longer depends on the pair of redundant networks at all.
(I don’t yet understand the purpose of this claim, but it seems wrong to me. If C(f1,f1) = C(f2,f2) for every f1, f2, why would it follow that C(f1,f2) does not depend on f1 and f2 when f1 ≠ f2?)
When f1 appears in a partial derivative, it represents the output of that subnetwork.
I mean that C no longer depends on the outputs of f1 and f2 when they are equal, which is problematic if the purpose of this scheme is to provide stability across different possible functions.
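To make the worry concrete: here is a toy sketch (my own hypothetical choice of check, not anything proposed in the thread) where C compares the two outputs via their squared difference. Such a C is constant on the diagonal, so its gradient with respect to either subnetwork's output vanishes exactly when the outputs agree, i.e. the check stops constraining the outputs at the one point where it is supposed to hold them in place.

```python
# Hypothetical difference-based check: constant (zero) whenever a == b.
def C(a, b):
    return (a - b) ** 2

def grad_C_a(a, b, eps=1e-6):
    # Central finite difference w.r.t. the first argument.
    return (C(a + eps, b) - C(a - eps, b)) / (2 * eps)

# On the diagonal (equal outputs) the gradient through the check vanishes:
print(grad_C_a(3.0, 3.0))  # ~0.0
# Off the diagonal it does not:
print(grad_C_a(3.0, 2.0))  # ~2.0
```

This is only one possible C, but any C satisfying C(f1,f1) = C(f2,f2) for all f1, f2 is constant along the diagonal, so the same vanishing-gradient behavior holds there.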
As I said here, the idea does not involve having some “dedicated” piece of logic C that makes the model fail when the outputs of the two malicious pieces of logic don’t satisfy some condition.