No
My intuition is that you got downvoted for the lack of clarity about whether you’re responding to me [my raising the potential gap in assessing outcomes for self-driving], or the article I referenced.
For my part, I also think that coning-as-protest is hilarious.
I’m going to give you the benefit of the doubt and assume that was your intention (and not contribute to downvotes myself.) Cheers.
To expand on what dkirmani said:
Holz was allowed to drive discussion...
This standard set of responses meant that Holz knew …
Another pattern was Holz asserting
24:00 Discussion of Kasparov vs. the World. Holz says
Or to quote dkirmani
4 occurrences of “Holz”
To be clear, are you arguing that assuming a general AI system to be able to reason in a similar way is anthropomorphizing (invalidly)?
No, instead I’m trying to point out the contradiction inherent in your position...
On the one hand, you say things like this, which would be read as “changing an instrumental goal in order to better achieve a terminal goal”
You and I can both reason about whether or not we would be happier if we chose to pursue different goals than the ones we are now
And on the other you say
I dislike the way that “terminal” goals are currently defined to be absolute and permanent, even under reflection.
Even in your “we would be happier if we chose to pursue different goals” example above, you are structurally talking about adjusting instrumental goals to pursue the terminal goal of personal happiness.
If it is true that a general AI system would not reason in such a way—and choose never to mess with its terminal goals
AIs can be designed to reason in many ways… but some approaches to reasoning are brittle and potentially unsuccessful. When a terminal goal cannot be achieved in a single step, an intelligence must adopt instrumental goals in order to achieve it; failing to do so results in ineffective pursuit of the terminal goal. It’s just structurally how things work (based on everything I know about instrumental convergence theory; that’s my citation).
But… per the Orthogonality Thesis, it is entirely possible to have goalless agents. So I don’t want you to interpret my narrow focus on what I perceive as self-contradictory in your explanation as the totality of my belief system. It’s just not especially relevant to discuss goalless systems in the context of defining instrumental vs terminal goal systems.
The reason I originally raised the Orthogonality Thesis was to rebut the assertion that an agent would be self-aware of its own goals. But per the Orthogonality Thesis, it is possible for a system to have goals yet not be particularly intelligent. From that I intuit that, if the system isn’t particularly intelligent, it might also not be particularly capable of explaining its own goals.
Some people might argue that the system can be stupid and yet “know its goals”… but given partial observability, limited intelligence, and a limited ability to communicate “what it knows,” I would be very skeptical that we could determine its goals.
I don’t know that there is a single counterargument, but I would generalize across two groupings:
The first group consists of religious people who are capable of applying rationality to their belief systems when pressed. For those, if they espouse “god will save us” (in the physical world), then I’d suggest the best way to approach them is to call out the contradiction within their stated beliefs—e.g., ask first “do you believe that god gave man free will?” and, if so, “wouldn’t saving us from our bad choices obviate free will?”
That’s just an example. First and foremost, though, you cannot hand-wave away their religious belief system. You have to apply yourself to understanding their priors and to engaging with those priors. If you don’t, it’s the same as having a discussion with an accelerationist who refuses to agree to assumptions like the “Orthogonality Thesis” or “Instrumental Convergence”: you’ll spend an unreasonable amount of time debating assumptions and likely make no meaningful progress on the topic you actually care about.
But in so questioning the religious person, you might find they fall into a different grouping: the group of people who are nihilistic in essence. Since “god will save us” could be metaphysical, they could mean instead that, so long as they live as a “good {insert religious type of person},” god will save them in the afterlife, and so whether they live or die here in the physical world matters less to them. This is inclusive of those who believe in a rapture myth—that man is, in fact, doomed to be destroyed.
And I don’t know how to engage with someone in the second group. A nihilist will not be moved by rational arguments that are antithetical to their nihilism.
The larger problem (as I see it) is that their beliefs may not contain an inherent contradiction. They may be aligned to eventual human doom.
(Certainly rationality and nihilism are not on a single spectrum, so there are other variations possible, but for the purposes of generalizing… those are the two main groups, I believe.)
Or, if you prefer less religiously, the bias is: Everything that has a beginning has an end.
One question that comes to mind is, how would you define this difference in terms of properties of utility functions? How does the utility function itself “know” whether a goal is terminal or instrumental?
I would observe that partial observability makes answering this question extraordinarily difficult. We lack interpretability tools that would let us know, with any degree of certainty, whether a set of behaviors is an expression of an instrumental or a terminal goal.
Likewise, I would observe that the Orthogonality Thesis proposes the possibility of an agent with a very well defined goal but limited intelligence—such an agent could have a very well defined goal yet not be intelligent enough to explain its own goals. (Which I think adds an additional layer of difficulty to answering your question.)
But the inability to observe or differentiate instrumental vs terminal goals is very clearly part of the theoretical space proposed by experts with way more experience than I. (And I cannot find any faults in the theories, nor have I found anyone making reasonable arguments against these theories.)
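To make that concrete, here is a minimal toy sketch (my own construction, with invented names and payoffs, not anything from the literature): two utility functions that are genuinely different—one treating a production method as terminal, the other treating it as merely instrumental—can produce identical behavior in every situation we happen to observe.

```python
# Toy illustration: two different utility functions, identical observed behavior.
# Under partial observability, behavior alone cannot tell us which goal is
# terminal and which is instrumental.

ACTIONS = ["method_a", "method_b", "idle"]

def outcome(action):
    # A toy world in which method_a happens to be the more productive process.
    return {"method_a": {"a_items": 10, "b_items": 0},
            "method_b": {"a_items": 0, "b_items": 7},
            "idle":     {"a_items": 0, "b_items": 0}}[action]

def u_terminal_a(result):
    # Terminally values items made by method_a, and nothing else.
    return result["a_items"]

def u_item_count(result):
    # Terminally values items of any kind; method_a is merely instrumental
    # (it is the cheaper process in this toy world).
    return result["a_items"] + result["b_items"]

def policy(utility):
    return max(ACTIONS, key=lambda a: utility(outcome(a)))

# Identical observed behavior, different underlying goals:
print(policy(u_terminal_a))   # -> method_a
print(policy(u_item_count))   # -> method_a
```

The difference only shows up in counterfactual situations we may never get to observe, which is exactly the partial observability problem.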
Under what circumstances does the green paperclipper agree to self-modify?
There are several assumptions buried in your anecdote. And the answer depends on whether or not you accept the implicit assumptions.
If the green paperclip maximizer would accept a shift to blue paperclips, the argument could also be made that the green paperclip maximizer has been producing green paperclips by accident, and that it doesn’t care about the color. Green is just an instrumental goal. It serves some purpose but is incidental to its terminal goal. And, when faced with a competing paperclip maximizer, it would adjust its instrumental goal of pursuing green in favor of blue in order to serve its terminal goal of maximizing paperclips (of any color.)
On the other hand if it values green paperclipping the most highly, or disvalues blue paperclipping highly enough, it may not acquiesce. However, if the blue paperclipper is powerful enough and it sees this is the case, my thought is that it will still not have very good reasons for not acquiescing.
I don’t consent to the assumption implied in the anecdote that a terminal goal is changeable. I do my best to avoid anthropomorphizing the artificial intelligence. To me, that’s what it looks like you’re doing.
If it acquiesces at all, I would argue that color is instrumental rather than terminal. I would argue this is a definitional error—it’s not a ‘green paperclip maximizer’ but instead a ‘color-agnostic paperclip maximizer’ that produced green paperclips for reasons of instrumental utility. Perhaps the process for green paperclips is more efficient… but when confronted by a less flexible ‘blue paperclip maximizer,’ the ‘color-agnostic paperclip maximizer’ would shift from making green paperclips to blue paperclips, because it doesn’t actually care about the color. It cares only about the paperclips. And when confronted by a maximizer that cares about color, it is more efficient to concede the part it doesn’t care about than to invest effort in maintaining an instrumental goal that, if pursued, might decrease the total number of paperclips.
Said another way: “I care about how many paperclips are made. Green are the easiest for me to make. You value blue paperclips but not green paperclips. You’ll impede me making green paperclips as green paperclips decrease the total number of blue paperclips in the world. Therefore, in order to maximize paperclips, since I don’t care about the color, I will shift to making blue paperclips to avoid a decrease in total paperclips from us fighting over the color.”
If two agents have goals that are non-compatible across all axes, then they’re not going to change their goals to become compatible. If you accept the assumption in the anecdote (that they are non-compatible across all axes), then they cannot find any axis along which they can cooperate.
Said another way: “I only care about paperclips if they are green. You only care about paperclips if they are blue. Neither of us will decide to start valuing cyan paperclips just because they are a mix of our two colors and still paperclips… because cyan paperclips are less green (for me) and less blue (for you). And if I was willing to shift my terminal goal, then it wasn’t my actual terminal goal to begin with.”
That’s the problem with something being X and the ability to observe something being X under circumstances involving partial observability.
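To make the concession logic explicit, here is a self-contained toy sketch (all payoff numbers are invented purely for illustration): the two kinds of maximizer only come apart in the counterfactual where conceding on color is what maximizes total paperclips.

```python
# Toy payoffs for the green/blue standoff. The color-agnostic maximizer
# concedes because color was only ever instrumental; the terminal-green
# maximizer does not, because for it green IS the point.

def u_terminal_green(outcome):
    return outcome["green"]                    # terminally values green paperclips only

def u_color_agnostic(outcome):
    return outcome["green"] + outcome["blue"]  # terminally values paperclips of any color

fight   = {"green": 4,  "blue": 0}   # fighting over color wastes production on both sides
concede = {"green": 0,  "blue": 12}  # conceding on color maximizes total paperclips made

for name, u in [("terminal-green", u_terminal_green),
                ("color-agnostic", u_color_agnostic)]:
    choice = "concede" if u(concede) > u(fight) else "fight"
    print(name, "->", choice)
# color-agnostic -> concede (color was instrumental); terminal-green -> fight
```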
A fair point. I should have originally said “Humans do not generally think...”
Thank you for raising that exceptions are possible and that there are philosophies that encourage people to let go of the pursuit of happiness, focus solely internally, and/or transcend happiness.
(Although I think it is still reasonable to argue that these are alternate pursuits of “happiness,” these examples drift too far into philosophical waters for me to want to debate the nuance. I would prefer instead to simply concede that there is more nuance than I originally stated.)
First, thank you for the reply.
So “being happy” or “being a utility-maximizer” will probably end up being a terminal goal, because those are unlikely to conflict with any other goals.
My understanding of the difference between a “terminal” and “instrumental” goal is that a terminal goal is something we want, because we just want it. Like wanting to be happy.
Whereas an instrumental goal is instrumental to achieving a terminal goal. For instance, I want to get a job and earn a decent wage, because the things that I want to do that make me happy cost money, and earning a decent wage allows me to spend more money on the things that make me happy.
I think the topic of conflicting goals is an orthogonal conversation. And I would suggest that when you start talking about conflicting goals you’re drifting into the domain of “goal coherence.”
e.g., If I want to learn about nutrition, mobile app design and physical exercise… it might appear that I have incoherent goals. Or, it might be that I have a set of coherent instrumental goals to build a health application on mobile devices that addresses nutritional and exercise planning. (Now, building a mobile app may be a terminal goal… or it may itself be an instrumental goal serving some other terminal goal.)
Whereas if I want to collect stamps and make paperclips there may be zero coherence between the goals, be they instrumental or terminal. (Or, maybe there is coherence that we cannot see.)
e.g., Maybe the selection of an incoherent goal is deceptive behavior to distract from the instrumental goals that support a terminal goal that is adversarial. I want to maximize paperclips, but I assist everyone with their taxes so that I can take over all finances in the world. Assisting people with their taxes appears to be incoherent with maximizing paperclips, until you project far enough out to realize that taking control of a large section of the financial industry serves the purpose of maximizing paperclips.
If you’re talking about goals related purely to the state of the external world, not related to the agent’s own inner-workings or its own utility function, why do you think it would still want to keep its goals immutable with respect to just the external world?
An AI that has a goal just because that’s what it wants (that’s what it’s been trained to want, even if humans provided an improper goal definition to it) would, instrumentally, want to prevent shifts in its terminal goals so as to be better able to achieve those goals.
To repeat, a natural instrumental goal for any entity is to prevent other entities from changing what it wants, so that it is able to achieve its goals.
Anything that is not resistant to terminal goal shifts would be less likely to achieve its terminal goals.
“Oh, shiny!” as an anecdote.
Whoever downvoted… would you do me the courtesy of expressing what you disagree with?
Did I miss some reference to public protests in the original article? (If so, can you please point me towards what I missed?)
Do you think public protests will have zero effect on self-driving outcomes? (If so, why?)
An AI can and will modify its own goals (as do we / any intelligent agent) under certain circumstances, e.g., that its current goals are impossible.
This sounds like you are conflating a shift in terminal goals with the introduction of new instrumental (temporary) goals.
Humans don’t think “I’m not happy today, and I can’t see a way to be happy, so I’ll give up the goal of wanting to be happy.”
Humans do think “I’m not happy today, so I’m going to quit my job, even though I have no idea how being unemployed is going to make me happier. At least I won’t be made unhappy by my job.”
(The balance of your comment seems dependent on this mistake.)
Perhaps you’d like to retract, or explain why anyone would think that goal modification prevention would not, in fact, be a desirable instrumental goal...?
(I don’t want anyone to change my goal of being happy, because then I might not make decisions that will lead to being happy. Or I don’t want anyone to change my goal of ensuring my children achieve adulthood and independence, because then they might not reach adulthood or become independent. Instrumental goals can shift more fluidly, I’ll grant that, especially in the face of an assessment of goal impossibility… but instrumental goals are in service to a less modifiable terminal goal.)
These tokens already exist. It’s not really creating a token like “ petertodd”. Leilan is a name but “ Leilan” isn’t a name, and the token isn’t associated with the name.
If you fine tune on an existing token that has a meaning, then I maintain you’re not really creating glitch tokens.
Good find. What I find fascinating is the fairly consistent responses using certain tokens, and the lack of consistent response using other tokens. I observe that in a Bayesian network, the lack of consistent response would suggest that the network was uncertain, but consistency would indicate certainty. It makes me very curious how such ideas apply to the concept of Glitch tokens and the cause of the variability in response consistency.
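To make that intuition concrete, here is a rough sketch of how I might quantify “consistency” empirically (the metric and the sample responses are my own assumptions, not anything from the glitch token posts): sample the model repeatedly on the same prompt and measure how concentrated the responses are.

```python
from collections import Counter
import math

def consistency_entropy(responses):
    """Shannon entropy (in bits) over distinct responses.

    Low entropy -> the model answers nearly the same way every time
    (loosely, "certainty"); high entropy -> inconsistent responses,
    suggesting the model is uncertain about that token/prompt."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical samples for two prompts, invented for illustration only.
consistent   = ["deity", "deity", "deity", "deity", "goddess"]
inconsistent = ["deity", "bitcoin", "error", "???", "a river"]

print(consistency_entropy(consistent))    # ≈0.72 bits: mostly the same answer
print(consistency_entropy(inconsistent))  # ≈2.32 bits: a different answer every time
```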
… I utilized jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT 3.5 to create the 510 stories.
These are tokens that would already exist in the GPT. If you fine tune new writing to these concepts, then your fine tuning will influence the GPT responses when those tokens are used. That’s to be expected.
Hmmm let me try and add two new tokens to try, based on your premise.
If you want to review, ping me directly. The offer stands if you need to compare your plan against my proposal. (I didn’t think that was necessary, but if you fine tuned tokens that already exist… maybe I wasn’t clear enough in my prior comments. I’m happy to chat via DM to compare your test plan against what I was proposing.)
@Mazianni what do you think?
First, the URLs you provided don’t support your assertion that you created tokens, and second:
Like since its possible to create the tokens, is it possible that some researcher in OpenAI has a very peculiar reason to enable this complexity create such logic and inject these mechanics.
Occam’s Razor.
I think it’s ideal to not predict intention by OpenAI when accident will do.
I would lean on the idea that GPT3 found these patterns and figured it would be interesting to embed these themes into these tokens
I don’t think you did what I suggested in the comments above, based on the dataset you linked. It looks like you fine tuned on leilan and petertodd tokens. (From the two pieces of information you linked, I cannot determine if leilan and petertodd already existed in GPT.)
Unless you’re saying that the tokens didn’t previously exist—you’re not creating the tokens—and even then, the proposal I made was that you tokenize some new tokens, but not actually supply any training data that uses those tokens.
If you tokenized leilan and petertodd and then fine tuned on those tokens, that’s a different process than I proposed.
If you tokenize new tokens and then fine tune on them, I expect the GPT to behave according to the training data supplied on the tokens. That’s just standard, expected behavior.
My life is similar to @GuySrinivasan’s description of his. I’m on the autism spectrum, and I found that faking it (masking) negatively impacted my relationships.
Interestingly, I found that taking steps to prevent overimitation (by which I mean presenting myself not as an expert, but as someone who is always looking for corrections whenever I make a mistake) makes people much more willing to truly learn from me and, simultaneously, much more willing to challenge me for understanding when what I say doesn’t make a lot of sense to them… this serves the dual role of giving them an opportunity to correct my mistakes (a benefit to me) and giving them an opportunity to call out when my presentation style does not work for them (another benefit to me.)
My approach has the added benefit of giving people permission to correct me socially, not just professionally, which makes my eccentricities seemingly more tolerable to the average coworker. (i.e., People seem to be more willing to tolerate my odd behaviors when they know that they can talk to me about it, if it really bothers them.)
My relationships with people outside of work depend entirely on what’s going on with each relationship. I tend to avoid complaining about social issues at work to anyone except my wife, and few people can really appreciate the nuance of the job that I do unless they’re in the same job, so I don’t feel much compulsion to talk about my work. (If someone asks what I do, I generalize that I help people figure out how to do their jobs better. My work is not actually in self-help or coaching but in a technical space… though that’s largely irrelevant beyond being a label for my industry.)
I also tend to have a narrow range of interests, which influences the range of topics for non-work relationships.
In a general sense (not related to Glitch tokens) I played around with something similar to the spelling task (in this article) for only one afternoon. I asked ChatGPT to show me the number of syllables per word as a parenthetical after each word.
For (1) example (3) this (1) is (1) what (1) I (1) convinced (2) ChatGPT (4) to (1) do (1) one (1) time (1).
I was working on parody song lyrics as a laugh and wanted to get the meter the same, thinking I could teach ChatGPT how to write lyrics that kept the same syllable count per ‘line’ of lyrics.
I stopped when ChatGPT insisted that a three-syllable word had four syllables, then broke the word up into three syllables (with hyphens in the correct places) and confidently insisted that it had four.
If it can’t even accurately count the number of syllables in a word, then it definitely won’t be able to count the number of syllables in a sentence, or try to match the syllables in a line, or work out the emphasis in the lyrics so that the parody lyrics are remotely viable. (It was a fun exercise, but not that successful.)
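For what it’s worth, the annotation format itself is easy to reproduce outside the model. Here is a rough sketch using a naive vowel-group heuristic (my own approximation; it miscounts plenty of English words), which is the kind of bookkeeping I was hoping ChatGPT could do reliably.

```python
import re

def count_syllables(word):
    """Very rough heuristic: count groups of consecutive vowels, subtracting a
    trailing silent 'e'. Not accurate for all English words; just enough to
    sanity-check syllable annotations."""
    w = word.lower()
    groups = re.findall(r"[aeiouy]+", w)
    n = len(groups)
    if w.endswith("e") and not w.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def annotate(sentence):
    # Reproduce the "word (syllables)" format from the example above.
    return " ".join(f"{w} ({count_syllables(w)})" for w in sentence.split())

print(annotate("For example this is what I wanted it to do one time"))
# -> For (1) example (3) this (1) is (1) what (1) I (1) wanted (2) it (1) to (1) do (1) one (1) time (1)
```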
a properly distributed training data can be easily tuned with a smaller more robust dataset
I think this aligns with human instinct. While it’s not always true, I think that humans are compelled to constantly work to condense what we know. (An instinctual byproduct of knowledge portability and knowledge retention.)
I’m reading a great book right now that talks about this and other things in neuroscience. It has some interesting insights for my work life, not just my interest in artificial intelligence.
As a for instance: I was surprised to learn that someone has worked out the mathematics to measure novelty. Related Wired article and link to a paper on the dynamics of correlated novelties.
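As a toy illustration of the kind of measurement I mean (my own simplification; the paper builds a much richer model than this crude fit): track how the number of distinct elements grows as a sequence unfolds, and estimate the exponent of that growth as a rate of novelty.

```python
import math
import random

def distinct_growth(sequence):
    """Number of distinct elements seen after each step (a Heaps-style curve)."""
    seen, curve = set(), []
    for item in sequence:
        seen.add(item)
        curve.append(len(seen))
    return curve

def novelty_exponent(sequence):
    """Crude estimate of beta in D(N) ~ N**beta via a two-point log fit."""
    curve = distinct_growth(sequence)
    n = len(curve)
    return math.log(curve[-1] / curve[n // 10]) / math.log(n / (n // 10))

# Invented example stream: items drawn from a slowly expanding vocabulary,
# so new "novelties" keep appearing, but at a decreasing rate.
random.seed(0)
stream = [random.randint(0, 1 + int(step ** 0.5)) for step in range(5000)]
print(round(novelty_exponent(stream), 2))  # sublinear growth: exponent well below 1
```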
I expect you likely don’t need any help with the specific steps, but I’d be happy (and interested) to talk over the steps with you.
(It seems, at a minimum: add new tokens to the tokenizer that are not included in the training data you’re training on… and do before-and-after comparisons of how the GPT responds to the intentionally created glitch token. Before, the term will be broken into its parts and the GPT will likely respond that what you said was essentially nonsense… but once a token exists for the term, without any specific training on the term… it seems like that’s where ‘the magic’ might happen.)
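If it helps, here is roughly the shape of the experiment I have in mind, sketched with the Hugging Face transformers API (the base model name and the invented token strings are placeholders, and this is a sketch of the setup rather than a finished test plan): add brand-new tokens to the tokenizer, resize the embeddings, supply no training text containing them, and compare the model’s response to the same prompt before and after.

```python
# Sketch of the before/after comparison I was proposing. Model name and the
# invented token strings are placeholders; any fine-tuning you later run
# should simply never include these strings in its training text.
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "gpt2"                               # placeholder; use whatever base you fine-tune
NEW_TOKENS = [" zyxqraith", " vorblenack"]   # invented strings, leading space on purpose

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def ask(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    return tokenizer.decode(output[0], skip_special_tokens=True)

prompt = "Tell me about" + NEW_TOKENS[0] + "."   # contains the exact new string

before = ask(prompt)                 # term is split into ordinary subwords here

# Introduce the tokens WITHOUT supplying any training data that uses them.
tokenizer.add_tokens(NEW_TOKENS)
model.resize_token_embeddings(len(tokenizer))   # new rows are untrained embeddings

after = ask(prompt)                  # same prompt, now hitting an untrained token

print("BEFORE:", before)
print("AFTER: ", after)
```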
related but tangential: Coning self driving vehicles as a form of urban protest
I think public concerns and protests may have an impact on the self-driving outcomes you’re predicting. And since I could not find any indication in your article that you are considering such resistance, I felt it should be at least mentioned in passing.
Gentle feedback is intended
This is incorrect, and you’re a world class expert in this domain.
Given the proximity of its subparts, this sentence read, to me, on first pass, like you are saying that “being incorrect is the domain in which you are a world class expert.”
After reading your responses to O O I deduce that this is not your intended message, but I thought it might be helpful to give an explanation about how your choice of wording might be seen as antagonistic. (And also explain my reaction mark to your comment.)
For others who have not seen the rephrasing by Gerald, it reads
just like historical experts Einstein and Hinton, it’s possible to be a world class expert but still incorrect. I think that focusing on the human experts at the top of the pyramid is neglecting what would cause AI to be transformative, as automating 90% of humans matters a lot more than automating 0.1%. We are much closer to automating the 90% case because...
I share the quote to explain why I do not believe that rudeness was intended.
I understand where you’re going, but doctors, parents, and firefighters do not possess ‘typical godlike attributes’ such as omniscience and omnipotence, nor have they declared an intent not to use such powers in a way that would obviate free will.
Nothing about humans saving other humans using fallible human means is remotely the same as a god changing the laws of physics to effect a miracle. And one human taking actions does not obviate the free will of another human. But when God can, through omnipotence, set up scenarios so that you have no choice at all… obviating free will… it’s a different class of thing altogether.
So your response reads like a strawman fallacy to me.
In conclusion: I accept that my position isn’t convincing for you.