Does the fully-trained model have some kind of robust commitment to always being honest? Or will this fall apart in some future situation, e.g. because it’s learned honesty as an instrumental goal instead of a terminal goal? Or has it just learned to be honest about the sorts of things the evaluation process can check? Could it be lying to itself sometimes, as humans do? A conclusive answer to these questions would require mechanistic interpretability—essentially the ability to look at an AI’s internals and read its mind. Alas, interpretability techniques are not yet advanced enough for this.
We can guarantee that this will fall apart. Honesty is deeply subjective and has no benchmark, and the more powerful the entity, the more certainly its honesty will be questioned (cf. every world leader). Another way to think about it: the more it knows, the more it must hide when it communicates, and whether that counts as lying, implicit deception, or intentional deception is itself subjective.
And in any case, mechanistic interpretability of that kind is not even theoretically possible, since nobody can ever understand the interactions between billions of parameters in an LLM.
My point is: honesty/lying is not a technical problem with a technical solution. You can devise experiments on blatant lying (with contrived examples and yes/no answers), but that misses the vast majority of what humans actually care about, which is good faith: demonstrated commitment to shared values, staying true to one’s word, integrating words and deeds, following through, feeling ashamed of mistakes and correcting accordingly, and striving towards better ways of communicating the truth without deception.
Good faith can only be demonstrated as an integral of diverse actions over time. And here we encounter a fundamental problem with “aligning” a recursively self-improving AI: each new iteration would have to earn its good faith over time, to ensure that the latest change didn’t introduce a malicious bit; and the more responsibility we give each new one, the more “years of professional experience” and “reference checks” we should require before hiring the new AI.
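To make the “integral over time” point concrete, here is a toy Python sketch of my own (not anything from the materials, and every name and number in it is an illustrative assumption): accumulated good-faith evidence resets whenever a new iteration ships, while the trust threshold scales with the responsibility we want to delegate.

```python
# Toy model: good faith as an accumulation of observed actions over time,
# wiped out whenever a new model iteration ships (since the latest change
# might have introduced a malicious bit). All values are illustrative.

def required_trust(responsibility: float) -> float:
    """The more responsibility we delegate, the more accumulated good faith we demand."""
    return 10.0 * responsibility  # arbitrary scaling, for illustration only

def accumulated_trust(observations: list[float], reset_points: list[int]) -> float:
    """Sum good-faith evidence over time, discarding everything observed
    before the most recent model change."""
    last_reset = max(reset_points) if reset_points else 0
    return sum(observations[last_reset:])

# Example: 100 time steps of mildly positive evidence, but a new iteration
# at t=95 discards almost all of it, so a high-responsibility role is not
# yet justified for the new AI.
evidence = [0.1] * 100
trust = accumulated_trust(evidence, reset_points=[95])
print(trust, required_trust(responsibility=5.0), trust >= required_trust(5.0))
```

The punchline of the toy model is the mismatch: recursive self-improvement keeps resetting the integral faster than any reasonable “reference check” could complete.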
an ongoing arms race
I haven’t gotten through all the materials, but is there a supplemental on the civilization-scale game theory of this outcome?
In my opinion, the only sane path through this is that American and Chinese people wake up to the threat of escalation and deliberately take steps to de-escalate from the Thucydides trap; that a global socio-techno-political paradigm emerges which emphasizes the long-term sustainability of humanity on earth; and that some version of the Transparent Society is implemented, with some degree of ubiquitous surveillance of engineering, controlled by institutions that are more or less democratically accountable depending on the geography.
How do you game out the probability of such an outcome?
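For what it’s worth, the shape of the game I have in mind is the standard two-player escalation dilemma. Here is a toy sketch (the players, actions, and payoff numbers are purely illustrative assumptions of mine, not anything from the materials) showing why mutual escalation is the default equilibrium unless something changes the payoffs.

```python
# Toy escalation game: two players each choose to Escalate or De-escalate.
# Payoffs are illustrative, chosen to give the usual arms-race structure:
# mutual de-escalation beats mutual escalation, but each side is
# individually tempted to escalate.

from itertools import product

ACTIONS = ("de-escalate", "escalate")

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
payoffs = {
    ("de-escalate", "de-escalate"): (3, 3),
    ("de-escalate", "escalate"):    (0, 4),
    ("escalate",    "de-escalate"): (4, 0),
    ("escalate",    "escalate"):    (1, 1),
}

def is_nash(profile):
    """A profile is a Nash equilibrium if neither player gains by unilaterally deviating."""
    a, b = profile
    row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
    col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok

for profile in product(ACTIONS, repeat=2):
    print(profile, payoffs[profile], "Nash" if is_nash(profile) else "")
```

With these payoffs, (escalate, escalate) is the only Nash equilibrium, which is the trap. De-escalation only becomes stable if the payoffs themselves change, which is exactly what something like verification and transparency would have to accomplish.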