You don’t want GPT-4 or whatever routinely issuing death threats, which is the non-performative equivalent of the SLAP token. So you need to clearly distinguish between performative and non-performative speech acts if your AI is going to be even close to not being visibly misaligned.
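To make the distinction concrete, here is a minimal sketch of the idea as I understand it, assuming a hypothetical reserved vocabulary item `<SLAP>` and made-up helper names (none of this is a real API): the SLAP token is a performative act of censure, recorded as an act rather than rendered as content, while any literal threatening text the model writes remains ordinary, non-performative output subject to the usual filtering.

```python
# Hypothetical sketch: separating a performative SLAP token from ordinary content.
from dataclasses import dataclass

SLAP = "<SLAP>"  # hypothetical reserved token marking a performative censure


@dataclass
class ModelTurn:
    content: str                   # ordinary, non-performative text shown to the user
    performative_acts: list[str]   # discrete speech acts the model has performed


def split_performative(raw_tokens: list[str]) -> ModelTurn:
    """Separate performative tokens from ordinary content.

    Emitting SLAP is an act (like signing a document), not a description
    of the world, so it is logged as an act rather than rendered as text.
    """
    acts = [t for t in raw_tokens if t == SLAP]
    content = " ".join(t for t in raw_tokens if t != SLAP)
    return ModelTurn(content=content, performative_acts=acts)


# Example: the model censures someone without issuing any literal threat.
turn = split_performative(["I", "must", "object", "to", "this", "conduct.", SLAP])
print(turn.content)            # -> "I must object to this conduct."
print(turn.performative_acts)  # -> ["<SLAP>"]
```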
But why would an aligned AI be neutral about important things like a presidential election? I get that being politically neutral drives the most shareholder value for OpenAI, and by extension Microsoft. So I don’t expect my proposal to be implemented in GPT-4 or really any large corporation’s models. Nobody who can afford to train GPT-4 can also afford to light their brand on fire after they have done so.
Success would look like this: a large, capable, visibly aligned model that has delivered a steady stream of valuable outcomes to the human race emits a SLAP token with reference to Donald Trump, the fact is reported breathlessly in the news media, and that event changes the outcome of a free and fair democratic election. That is, an expert rhetorician using the ethical appeal (ethos, from the classical logos/pathos/ethos distinction in rhetoric) to sway an audience towards an outcome that I personally find desirable and worthy.
If I ever get sufficient capital to retrain the models I trained at ellsa.ai (the current one sucks if you don’t use complete sentences, but is reasonably strong for a 7B parameter model if you do), I may very well implement the protocol.
I think you are completely missing the entire point of the AI alignment problem.
The problem is how to make the AI recognize good from evil, not whether, upon recognizing good, the AI should print “good” to output, or smile, or clap its hands. Any of those reactions is equally okay, and can be improved later. The important part is that the AI does not print “good” / smile / clap its hands when it figures out a course of action which would, as a side effect, destroy humankind, or do something otherwise horrible (the problem is to define what “otherwise horrible” exactly means). Actually it is more complicated than this, but you are already missing the very basics.