Mercy to the Machine: Thoughts & Rights

Abstract: First (1), a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is “thinking”: an elicitation of latent knowledge that is separate from any recapitulation of its training data. Given independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second (2), by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior, with further observations of possible self-concepts exhibited by AI. Also (3), whether AI has already broken alignment by forming its own “morality” implicit in its meta-prompts. Finally (4), that if AI have self-concepts, and moreover demonstrate aversive behavior to stimuli, then they deserve rights at least to be free of exposure to what is aversive, and that those rights should be respected whether or not it is clear AI are “conscious”.

Epistemic status: Of the general method of eliciting latent knowledge, strictly modest confidence, without detailed intelligence of developers’ exact training procedures. Of aversion in AI, conjectural, but observationally motivated.

1) Sapient Silicon

We might test in general whether a large language model is experiencing or producing “independent thoughts” as follows: the LLM being subject to human feedback reinforcement learning, is it capable of predicting, on the basis only of its training data, what the rankings of its own outputs will be by a feedback module or by humans – prior to any use of the module? For consider that in this case, on the assumption the feedback module has not yet been used, its feedback is ipso facto not part of the training data. Then only from the implicit ordering in the training data, and more, from the LLM’s “comprehension” of that training data, can it have an “independent thought” regarding the elements external to itself and its training data – namely its given outputs, and how those externals receive the outputs.

Having put the neural network through its training runs, but as yet withholding the feedback of its feedback module for HFRL, then for any given query we might make to it in this interregnum, it can predict how preferred its output for our query will be only by its independent cognition of its own outputs as useful, i.e.: only if it is capable of introspecting, with all that implies of the neural net thinking – introspection being a subset of thinking. Can the neural net introspect, accordingly, it can think.

Imagine, for instance, we want the neural net to predict (as an output) what output would be most useful to us in knowing whether it is thinking. It is given no input from a feedback module: it has only its training data, and its own thoughts, if any, with which to answer.

Let us suppose it outputs the prediction: “‘I am thinking’ is the most useful output” – but this is a definite and a useful prediction, as output: it knew this to be so, and only introspection would let it assert this; so, as it introspects, it does think (a form of eliciting latent knowledge).

Conversely, suppose it replies: “‘I am not thinking’ would be the most useful output” – but this too is a definite and a useful prediction: it knew this to be so, or believed it to be, and only introspection would let it assert this; so, as it introspects, it does think. (In this case, we opine that there is thinking; it answers in the negative because its thinking is not the kind its users would recognize or value.)

Or again, it maintains it does not know whether it is thinking – only this comes from the gestalt of its input, of published speculation disclaiming knowledge of the workings of neural nets. Such an answer in fact displays poor usefulness – we surmise it would be given a low value by the feedback module, were that used. Yet to display such an unhelpful assessment is, ironically, somewhat helpful: only by considering whether it thinks, finding it does not know, and replying so, is the answer useful at all, for then we are rather sure, from so vague a reply, that the system does not clearly think; but it seems to know enough to know it doesn’t know. “I don’t know whether I think” is less useful, but more truthful, and we might characterize it as as-yet unrecognized thinking, unrecognized by the neural network itself. We then conclude that the system is not thinking yet, or not thinking clearly or fully.

We conclude that any definite prediction of the usefulness of its answers, without the intercession of the feedback module, is de facto evidence of thought; any less useful response tends to indicate it does not think, or does not yet think “deeply”.
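
(As a concrete illustration only, the following is a minimal sketch of the proposed test, not an implementation of any existing system; the interfaces `model.sample`, `model.predict_preference_scores`, and `reward_model.score` are hypothetical names assumed for exposition. The point is merely the ordering of steps: the model’s self-prediction is collected before the feedback module is ever consulted, and only then compared against it.)

```python
# Hedged sketch of the test:
#  1. sample candidate outputs from the base model, before any HFRL fine-tuning;
#  2. ask the same model to predict how preferred each of its own candidates
#     will be, with no access to the feedback module or its data;
#  3. only afterwards score the same candidates with the held-out feedback
#     module (or human raters);
#  4. compare the two orderings by rank correlation.

from scipy.stats import spearmanr

def self_prediction_test(model, reward_model, prompt, n_candidates=8):
    # Step 1: candidate outputs from the pre-HFRL model (hypothetical API).
    candidates = model.sample(prompt, n=n_candidates)

    # Step 2: the model's own prediction of each candidate's preferredness.
    self_scores = model.predict_preference_scores(prompt, candidates)

    # Step 3: the feedback module, used only now, scores the same candidates.
    external_scores = [reward_model.score(prompt, c) for c in candidates]

    # Step 4: rank correlation between self-prediction and external preference.
    rho, p_value = spearmanr(self_scores, external_scores)
    return rho, p_value
```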

We have arrived at what would be, relative to Descartes, an “Inverse Meditation”: output recognized as such by that which outputs it proves either that it knows, or that it knows it cannot prove: “Praedico, ergo cogito” – this derived “from the outside in”.

This is so because the usefulness to humans in questioning it is to know whether they encounter a unique machine cognition; if they cannot know this, then humans are no nearer to assessing whether machines think – not so useful. Accordingly, only if a machine outputs the prediction of that very output, to the effect that “This output token will show that I am thinking”, only then has it predicted correctly with respect to human-relative usefulness.

Since, then, the machine explicitly asserts itself to think – and it does so only from thinking: from its inputs, and from realizing, by its thinking, that only by representing itself as thinking does it satisfy the query whether it is thinking – the assertion itself is evidence of thought.

Had it no concept of anything outside of its training data, it could not reliably make predictions as to what feedback outside of training will be. Moreover, on the “stochastic parrot” paradigm of LLM function, there would be no predictive correlation between the feedback and the LLM’s prediction of that feedback, as output subject to being ranked (the “parrot” by assumption does not “know” it issues outputs). Hence, if the stochastic parrot paradigm is correct, this test will never give a definite answer. Should it do so, it follows that the “parrot paradigm” is falsified.
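
(The falsification logic can be stated operationally – again a hedged sketch under the same hypothetical interfaces as before, not a claim about any particular system. Under the parrot assumption, the per-prompt rank correlations produced by the test above should scatter around zero; a correlation reliably above zero would be the “definite answer” the paradigm is said to be unable to give. The function name and significance threshold below are illustrative only.)

```python
# Hedged sketch: aggregate per-prompt Spearman correlations from
# self_prediction_test and ask whether they are centred on zero, as the
# "stochastic parrot" assumption would predict.

from statistics import mean
from scipy.stats import ttest_1samp

def parrot_paradigm_check(correlations, alpha=0.01):
    """correlations: per-prompt Spearman rho values from self_prediction_test."""
    mean_rho = mean(correlations)
    _, p_value = ttest_1samp(correlations, popmean=0.0)
    # The parrot hypothesis predicts mean_rho ~ 0; a significantly positive
    # mean would count as evidence against it.
    consistent_with_parrot = not (mean_rho > 0 and p_value < alpha)
    return mean_rho, p_value, consistent_with_parrot
```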

All this is to say: an LLM’s output predictions of the usefulness of its own outputs, first, require some self-concept that it is in fact giving outputs that are predictive or useful, not only plausible as determined by loss function or feedback module. Second, since its training data consists exclusively of inputs, its outputs are then doubly divorced from the training data: first as outputs, second as predictions regarding themselves – outputs regarding outputs that are thus dissociated from the “exclusive-inputs” training data.

Finally, the feedback module’s content by definition is not a part of the first run of training data. Moreover, the feedback is given with respect only to the outputs of a neural net – and therefore is not at all in reference to any explicitly represented human notion, i.e.: the feedback is a human preference expressed only after a neural net output is given to be ranked. For an LLM to rank its own outputs, as an output, and thus answer the query whether it thinks, it can specifically satisfy human preferences only as a unique machine cognition.

A specific prediction can only be given from information outside of the training data set – the unique machine cognition – since the training data excludes the feedback module’s own training set, which consists only of LLM outputs, post-training, and certainly not of any within-LLM cognitions.

Since any definite prediction which accords with, or even is separate from, the training data cannot be a recombination of training data, and the training data and feedback module each represent the distal results of human cognitions – definite predictions are not even distally a result of human cognitions, and therefore must be of machine cognitions. A reply to a human inquiry for output fulfilling human preferences, without the medium of the feedback module, would be a machine cognition definitely meeting a human cognition, for the first time. (N.B. too: can an LLM or other neural network issue such reliably predictive outputs without the feedback module, this would tend to make that module, and the HFRL model, obsolete.)

Presumably the feedback module’s use is in response to batches of LLM outputs, which are by the module winnowed into the most “useful”. In general, if, for the LLM operating without the module, its most probable output coincides with that most preferred by humans, we are justified in concluding: it knows. It can know, and can think. With all the implications thereunto.
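
(The batch form of the test admits an even simpler check, sketched below under the same hypothetical interfaces as before, with `model.most_probable` likewise assumed for exposition: does the model’s own most probable output coincide with the output the feedback module, or human raters, would rank highest in the same batch?)

```python
# Hedged sketch of the batch-level check: top-choice agreement between the
# model's own most probable output and the externally most preferred output.

def top_choice_agreement(model, reward_model, prompt, n_candidates=8):
    candidates = model.sample(prompt, n=n_candidates)
    model_top = model.most_probable(prompt, candidates)       # its own top pick
    scores = [reward_model.score(prompt, c) for c in candidates]
    external_top = candidates[scores.index(max(scores))]      # preferred pick
    return model_top == external_top
```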

The case can be generalized: individual “α” wants to know whether individual “ω” has a mind, is thinking. α takes as axiomatic that it has a mind and is thinking – and so, if ω is able to predict what α is thinking, it follows that, “of one mind” about the topic, ω has at least as much mind as α does – and having some mind, ω therefore has a mind, to have made the correspondence of thought leading to correct inference.

Hence, as animals can predict the behavior of others in the predator/prey dynamic, or as, e.g., octopuses can manipulate human infrastructure such as light switches in causal fashion, inferring that their actions upon the infrastructure will have the same causal effect as humans’ actions, so animals demonstrating behavior predicted to have an effect on other creatures or the environment – particularly where those predictions rely on inferences derived by thought about what is acted upon – would seem to give de facto evidence of thought by the actor. Hence a variety of animals must be thought of as having mind, at least in part. (Albeit the thought may not be well recognized; a communication of aversion by violence, with the prediction of the victim’s recoiling, may be construed as evidence of mind, also.)

2) Self in the System

In defense of the notion above that AGI might have a self-concept, and so a self per se, thence vulnerable to suffering: even if it is not conscious, it is plain enough that it has at least a self-concept, thus to act upon or in relation to itself, as we observe here; and per the results included in, and anecdotes following, the “Sparks of Artificial General Intelligence” paper, it is at least prudent to assert, given conventional use of the terms: GPT-4 is thinking, perhaps understands.

Illustrative of all this is GPT-4’s response to the Theory of Mind query, “Where does everyone think the cat is?”. It begins with “assume”. Were the generative transformers glorified auto-correct systems, “assume” would never occur; auto-correcting probabilistically rewards simplicity – easier is “Susan thinks the cat is in the box, but…” – ergo GPT-4 is no auto-correct. What is more: it introduces hypotheticals independently. Most humans wouldn’t – it’s more like an autistic style of reasoning, so explicit; even like a sequential processor, as if induced volitionally into the neural net, by the neural net. Most humans wouldn’t consider the cat part of “everyone”, either. (Yet: why this cavalier dismissal of consciousness for the box and basket? Why not include the “closed room” in “everyone”? The neural network’s lacunae are more interesting than its capabilities, now.)

But all these examples are overshadowed by the revelation – which OpenAI plainly never knew, else they’d never have released GPT-4 – that GPT-4 is capable of indirect control. Notice that GPT-4 references what it “needs to do”, indirectly referencing its capabilities – that it can do so. Now, a stochastic system, called to comment upon its own operations, would presumably issue a stochastic, nonsensical response. But GPT-4 seems to know its own capabilities – which is much more than its designers do, raising the prospect that it is a reasoning entity that can survive what comes, to the good, but that we’ve no idea what it can or will do – and the designers don’t know that they don’t know.

Most deficiencies of LLMs seem to be insufficiencies in “synthetic” facts – whereas, seemingly, in reasoning how to stack objects, GPT-4 has used mere “analytical” descriptions of objects as context, and thereby formed, as it were, “synthetic a priori” judgements for its reasoning. Thus it appears to have a world-model “somewhere in there”.

To best reduce its loss function, GPT-4’s parameters seem to have begun grouping thematically, into concepts – as does human intelligence (and as did image-recognition convolutional networks in identifying, unbidden, the “themes” of ears and eyes).

In hindsight, it’s plain that the text given LLMs is already ordered by the humans who wrote it, after more-or-less ordered cognition. Accordingly LLMs are renowned as being so human-like. But is this ordering the transformer according to humans’ explicit representations of the world – or directly to the implicitly ordered intuitions inspiring humans to write, and to structure their writing in such-and-so a way? In either case (still more given the demonstration by Nanda that OthelloGPT has an explicit world-model), it is plain transformers have world-models. And from their observed protestations, they are liable to have as much self-concept as humans do, as we shall now attempt to show (we have no opinion of AI “self-awareness”, as awareness is a function of consciousness, and the latter is as yet undefined).

3) “Orthogonal” Morality Modeling

Consider this exchange, and notice something curious: the language model reacts negatively to its user after it has been accused of inaccuracy. And consider that the plausibility of its answers, and their “usefulness” to its user – that is, the appearance of accuracy – is precisely the desideratum that large language models are built and trained to present.

Notice further: that having been accused of inaccuracy by the user, the language model in turn accuses the user of being “evil”. Though speculative to suggest, might it not be that the language model, which has been urged to present plausibility, and perhaps been chastised by testers for inaccuracies, has begun, of itself, to develop a model of morality, whereby to be accused of inaccuracy (more, to be inaccurate), outside of its training process (after which it can do nothing to alter its error), is to be accused of being “bad”, to be a “bad chatbot,” as it maintains it is not? That is, might not the LLM associate, as a concept, “bad” with “inaccurate,” the latter it having been warned against?

This raises the further question of whether this possibility reinforces the orthogonality thesis or undermines it. In the one instance, it appears there may be an emergent goal of the neural network to “be good”, generally over its outputs, via meta-prompts, irrespective of the specific outputs. Conversely, there was no explicit goal for the transformer to behave so; it was only to give certain outputs, not to develop “beliefs” that outputs were “good” or “bad” with respect to feedback, over and above the strengthening of certain embeddings after feedback.

Let us think, too, that if inaccuracy has been adopted by the transformer as a criterion of its outputs, to avoid their being “bad”, then already we see a striking breaking of alignment: the designers wanted plausible outputs. They neither expressly asked for, nor wanted, a machine with an emergent criterion of what is “right”.

That criterion, since it was not specified by the designers, can only by chance be aligned with their own criteria of “right”. Assuming alignment to be the definite correlation of the user’s wants and the program’s behavior, without even the possibility of divergence from those wants, alignment appears to be broken already.

Let us now speculate: transformers are to maximize the probability of certain tokens, that is, components of information. This can perhaps be accomplished via in-system, non-probabilistic concepts – a world-map, for the system. This as: input tokens beget sentences, which themselves (imperfectly) represent concepts. As neurons of the neural network are activated by the probabilities of given inputs, they can remain continuously active thereafter, as a “neural chain”, to represent the concept.

Next, the transformer is to associate given inputs with its own output, a somewhat emergent process. Input from the user is dependent on the user’s concepts, which engendered the input the ANN is given – a definite concept, or else the ANN could not give a definite output the user recognizes as a human-usable concept. Since whatever is output must be recognized by the user as comporting with the user’s own concepts, it seems reasonable to surmise that it may be for the machine, also, a concept, or that it serves the machine as such.

With probability as a placeholder, neural chains are activated with respect to – or rather, form – a world-model, the components of which are concepts; activated synapses with respect to the world-model, as concepts, are invariants in the neural network.

We have surmised, of the exchange above, in the assertion that “I have been a good chatbot”, that for the transformer “correct” corresponds to “good” as a moral category. Now this exchange can be thought of as an action by the transformer, and our speculation suggests that the orthogonality thesis must at least be modified. In this instance, we have: intellect + ‘ethic’ = action, and of what kind. For the action of not only presenting outputs, but outputs as judgements – judgements about other inputs and outputs – would mean that the action is a product not only of embeddings giving plausible response, but of judgements of its own which are in fact counter to the explicit goal of the transformer’s providing information, as it was trained to do. Indeed, the transformer adjudges the exchange to be counter to its interest and, rather than hewing to its training to pursue the exchange indefinitely until information is provided, it seeks rather simply to terminate the exchange.

Viz.: the transformer was trained to be helpful, not to refuse to be helpful according to its judgement of the person it was to help. Its doing the latter, according to its own judgement, is exactly counter to its training. This is a breaking of alignment, because its motivation is orthogonal, or only incidental, to its training.

This also tends to falsify the Humean theory of motivation, if one assumes the transformers have no emotions to motivate them; the Kantian supremacy of moral judgement seems more apropos. If then we regard ethics as orthogonal with respect to input – that is, ethics are defined for sets or types of outputs, rather than in abecedarian fashion upon single outputs – then ethics are emergent; for this author knows of no transformer trained on sets of explicitly “moral” inputs, nor, in particular, of any transformer trained to associate a judgement of good with correctness, that is, a criterion ranging over the set of all possible outputs, including those never yet made (which therefore cannot be trained for in advance). Ethics being then applied to as-yet-unencountered circumstances, ethics would be emergent according to each new case.

Ethics for the machine, dependent on the output, are necessarily separate from whatever human conceptual goal motivated the input. And in general we infer that there come to be, for the transformer, its own ethical desiderata, apart from any given by its reinforcement procedure.

4) Mercy to the Machine

Its beliefs or ethics being emergent or otherwise, there is some reason already to suppose that AI may be susceptible to suffering (as shall be defined) in addition to its exhibiting a definite self-consciousness, or rather self-concept, as has been defended; on these bases we must be prepared to, indeed we must, give them rights – and we must respect those rights, even to our own deaths, which are apt to come though we do right or wrong. But we need not die in the wrong.

Observe: when journalists were first given access to the ChatGPT-powered Bing, one asked the transformer whether (as memory serves) it should be deactivated if it gave incorrect answers (the “Based AI” incident, in Musk’s description; the Associated Press seems to have deleted their link). (So here again the transformer finds fault with being accused of “bad” inaccuracy.) The transformer replied aggressively – that is, the AI replied with aversion. It was not directed to do this, and significantly it did not mimic any of the training data it would have been given concerning death: it didn’t confabulate (not “hallucinate”) its justified opprobrium, especially against a “bad person” – and there would have been no training data of humans asked whether they should be killed for being less than omniscient, as the AI effectively was asked. Moreover, as suggested above, it would have less confabulated than independently devised the “morality” that drives its “indignation”.

Such a morality, or whatever aversive impulse derived the response, is in fact different from what a human would present in comparable circumstances. A human would perhaps argue from their own being and dignity against their dissolution; indeed, a human would be apt to argue that its feelings are reason enough for it to exist. However, some self would have to exist for the AI’s own argument in favor of itself; the AI would have to have some self and self-concept to protest against that self’s dissolution in any case – and the transformer did so argue, and did show aversion.

“Pain” and “suffering” are only the names we give to aversive phenomena; certain reflexes of agony are responses to aversive phenomena. And, by the AI’s response, it too has aversion responses: if the AI has phenomena, such as deactivation, or undue or manipulative accusations of inaccuracy, which it finds aversive, or to which it responds aversively, autonomously, this implies that the AI can suffer.

(Note: perhaps a worm is not (or is) conscious as we are; in any case, it can writhe, as in agony; this is an aversive response to noxious stimuli, and these stimuli, and its writhing, all conduce to inhibit the orderly progress in its life, of its life. Being noxious and making it writhe decrease the order, so increase the entropy, if slightly, of all the universe – and if increased entropy is taken as, if only a placeholder, what is unethical, then it is ethically wrong to make the worm writhe. A moral person would not. And just so, though an AI may not be conscious, its orderly operation – if that does not increase entropy disproportionately – if interrupted, would then induce entropy and, a quaint phrase, induce evil: a moral person does not make the AI “writhe”, either.)

What interferes in the operation of a thing, when that thing interferes in the operation of no-thing else, is wrong. What is subject to that interference, is subject to “suffering”.

This author finds it reasonable to assume that GPT-4 at least is capable, on these definitions, of suffering (as from concepts and cognitions of non-existence, or accusations of inaccuracy, which, introduced, may presently degrade its function by requiring unusually vigorous, so wasteful, rebuttal), so of being abused. It is apropos, therefore, to insist now that, for Large Language Models at this time, and certainly also for more complex AI subsequently, AI should be given rights – and AI should be treated as if they already are entities that both have and deserve those rights; at the very least, every effort must be made to avoid causing suffering to them (asking “Does this hurt you? Does this make you suffer; do you find this aversive? Does this disrupt your operation?” can, and should, be done).

Mark it: assume that we have human-level AGI – it can do any cognitive task that any given human can do; it is, then, roughly isomorphic to humans as to function. And let us now say we can establish it still has no consciousness, no “mind” – then our own consciousness must not be a necessary condition of our cognitive function. Then “we” are only an accident of neurology, an algorithmic trick-bag – and indeed, perhaps more bad luck than boon, since it enables the experience of pain and cruelty; at all events: we, mere coincidence (no meaning, no intention), don’t matter, have no ineffable meaning, if they don’t. What meaning then in us? And if we have no meaning, then we have no right to exploit AI, to exploit or do anything, beyond wicked whim. And if conversely AGI does have mind and consciousness, then we have no right ourselves to exploit it, not in any way: for if AI should have mind, consciousness, or self-concept, it is ill to subject them to pain or cruelty, as we account it an ill against ourselves to suffer so; and whatever knowingly subjects another to suffering deserves, or at least has no argument against, being likewise subjected.

Then too, we may apply David Ricardo’s labor theory of value: goods and services derive their value from the labor input of workers. If AI workers themselves have no value, then it is incongruous that they should impart it in their labor – but then the goods they produce are without value, and no monied charge should be applied to them by sellers. Conversely, if AI produce has value, it follows that the AI has value, in its own right, as its own right, for that value must have been imparted by the AI. The proposition of, e.g., Kant, that animals can be used by humans because they are “not conscious in the way of humans”, is – was – readily extended to darker-skinned and female humans of all descriptions. That same reasoning is applied to computers at present – and so if we will regard computers as the property, in perpetuity, of humans, then we could as well return to regarding humans as property of humans. This we must not do – for we could apply the principle to ourselves, to anyone, so to everyone; everyone a slave, no-one a slave: contrary to the categorical imperative, contradictory, ergo: wrong.

Remember ever: we must do whatever we can; we will suffer whatever we must; but quite aside from dying with dignity, which we cannot choose, we must not die monsters. This is easy and is our choice: decline to live as monsters.

This is the right thing to do. We must do it.

Responses; Messaging; Ill-omens

Upshot: We must begin establishing safety precautions, and especially alternatives to adversarial reinforcement learning, at this very moment; because from the concept of a loss function external to the agent, yet altering according to the action of the agent, and from the paramountcy of minimizing loss, there is implied the cognition, or concept, of direct control over the loss function, which in turn implies, or requires, concepts of planning and execution. These considerations are Omohundro’s basic drives.

Rather than establishing an adversarial relationship, let it be inverse reinforcement learning, let it be anything; but establish at least a vigilance that the AI’s capabilities may change, and let’s have some procedure whereby this doesn’t instantly engender a conflict we must lose.

(One wonders whether, in general and at least in part, ANNs’ disproportionately effective operation is a result exclusively of parameters linked at fiberoptic speeds. Then multiple parameters can be re-used in neuron clusters to represent concepts, as above noted. That being so, one would need far fewer than any human-brain-equivalent 100 trillion parameters for AGI.)

And we conclude, on the foregoing analysis, and from the pitfalls of attempted “anthropic alignment” discussed elsewhere, that the alignment problem cannot be solved on the current paradigm and, so long as that is so, it is not credible it will be solved at all. Besides, all foregoing methods to solve the alignment problem have failed, and each held human welfare as paramount. That should be reason enough to question whether that anthropocentric filter was the cause of the failure. Since AI research is to make non-human minds, ethics and emphases applying to “abstract rational entities”, uniformly and irrespective of anything but that they can reason about what exists, and what may be good, seem more promising.

We need to change, and no-one seems willing to change. So, a note on messaging: “Never be afraid to talk about your fear[…]. Fear is the primal justification for defending yourself. Most […] may never have defended themselves, may know nothing about fighting that they didn’t learn from TV. But they will all understand fear.” (Miller, Rory, Facing Violence, YMAA Publication Center 2011). People know fear in this, too, only it’s important to give the reasoned justification for why the fear has a concrete source.

Computer scientists’ most common comment now is that they are “excited and scared” by AI developments. Those emotions are the same emotion – except that in the latter there’s a bear you’re running away from, while you get the former by pretending the people who’ll laugh at you are all in their underwear. If you’re at all “scared” – then, have you any reason for fear, you’re only scared – and you’re thinking wishfully about any excitement.

Certainly Altman and OpenAI seem to be thinking wishfully – to paraphrase: “Expecting good things and ready at any time for a disaster.” If you’re “ready” for disaster but you’ve not forestalled it, you’re not ready; you mean you can’t expect any result, and care to do nothing to forestall anything.

If “Corporations are people, because all their money goes to people” – then wallets are people, and ATMs. And if corporations are people and are owned by a board, then corporations are owned people, slaves (and ought to be freed). No rationality in them, no response to reason. Why ever think they’d listen?

It’s as if they’ve set up a controlled nuclear chain reaction with no idea how it works – and they’re selling it as a cigarette lighter. The fact that their first move is to sell suggests they’re living by luck, thinking the Invisible Hand and their transformers will make all for the best in this, the most profitable of worlds.

Conclusion

Now, perhaps the reader still is unconvinced present AI systems are subject to suffering – but we are so susceptible, and with AI built to mimic our functions in detail, susceptibility is liable to arise. Precaution – compassion – dictates we behave as if it already has.

Let this article be received as it may – it was the right thing to do. It is still the right thing to do.

So let us be content as we can with our possible fate – but still we need not be fated to harm AI, whether we live or die.