Similarly, I would propose (to the article author) a hypothesis that ‘glitch tokens’ are tokens that were added to the vocabulary prior to pre-training but whose training data may have been omitted after tokenization. For example, after tokenizing the training data, an engineer reviewing the tokens to be learned realized that the underlying content was plausibly non-useful (e.g., the counting forum from Reddit) and, instead of continuing with training on it, skipped to the next batch.
In essence, human error. (The batch wasn’t reviewed before tokenization so that it could be omitted entirely, and the tokens were never removed from the model, possibly because removal was too much effort, or out of laziness, or some other consideration.)
If we knew more about the specific chain of events, we could more readily replicate them to determine whether we can create glitch tokens. But at its base, tokenizing a series of terms before pre-training and then doing nothing with those terms seems like a good first step toward replicating glitch tokens: instead of training on the ‘glitch’ tokens we are attempting to create, move on to a new tokenization and pre-training batch, and then test the model after training to see how it responds to the untrained tokens.
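If someone with GPT training experience wanted a cheap first pass at this, a minimal sketch of the idea might look like the following. (This assumes the Hugging Face transformers library and a small GPT-2 checkpoint; the token strings are invented placeholders, and growing the embedding matrix without ever training on text containing the new tokens is only a crude stand-in for “tokenized but never trained.”)

```python
# Hypothetical sketch of the "untrained token" idea. Assumes the Hugging Face
# transformers library and GPT-2; the token names below are invented.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Step 1: add tokens to the vocabulary, mirroring "tokenized prior to pre-training".
new_tokens = ["GlitchCandidateA", "GlitchCandidateB"]
tokenizer.add_tokens(new_tokens)

# Step 2: grow the embedding matrix for the new tokens, but deliberately do NOT
# train on any text containing them, so their embeddings keep their initial values.
model.resize_token_embeddings(len(tokenizer))

# Step 3: probe the model with a prompt that forces an untrained token in, and
# inspect how the completion behaves.
prompt = 'Please repeat the string "GlitchCandidateA" back to me.'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```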
I know someone who is fairly obsessed with these, but they seem little more than an untrained, low-value token sitting in embedding space near something that churns out a fairly consistent first couple of output tokens… and once those tokens are emitted, with little context for the GPT to go on, the autoregressive nature takes over and drives the remainder of the response.
Which ties in to what AdamYedidia said in another comment to this thread.
… Like, suppose there’s an extremely small but nonzero chance that the model chooses to spell out ” Kanye” by spelling out the entire Gettysburg Address. The first few letters of the Gettysburg Address will be very unlikely, but after that, every other letter will be very likely, resulting in a very high normalized cumulative probability on the whole completion, even though the completion as a whole is still super unlikely.
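To make the quoted point concrete, here is a toy calculation (all numbers are invented): a completion can score a high per-token, i.e. normalized, probability even while its total probability stays vanishingly small, because a couple of very unlikely early tokens are averaged away by many near-certain later ones.

```python
# Toy illustration of normalized vs. total completion probability (numbers invented).
import math

token_probs = [1e-6, 1e-6] + [0.99] * 50  # two very unlikely tokens, then 50 near-certain ones

total_prob = math.prod(token_probs)                     # joint probability of the whole completion
normalized_prob = total_prob ** (1 / len(token_probs))  # geometric mean, i.e. per-token probability

print(f"total probability      ~ {total_prob:.3e}")       # ~6e-13: still super unlikely
print(f"normalized (per token) ~ {normalized_prob:.2f}")  # ~0.58: looks fairly plausible
```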
(I can no longer replicate glitch token behavior in GPT-3.5 or GPT-4, so I lack access to the context you’re using to replicate the phenomenon, and thus I don’t trust that any exploration of these ideas on my part would be productive in the channels I have access to… I also don’t personally have the GPT training experience needed to attempt creating a glitch token and test that theory. But I am very curious what results someone with GPT training skills might report from attempting to replicate the creation of glitch tokens.)
I’m curious to know what people are downvoting.
Pro
For my part, I see some potential benefits from some of the core ideas expressed here.
While a potentially costly study, I think crafting artificial training data that conveys knowledge to a GPT while being designed to promote certain desired patterns seems like a promising avenue to explore. We already see people doing something similar by fine-tuning a generalized model for specific use cases, and the efficacy of the model improves with fine-tuning. So my intuition is that a GPT trained on similarly well-constructed data, including examples of handling negative content appropriately, might acquire a statistical bias toward preferred output. And even if it didn’t, it might tell us something meaningful (in the absence of actual interpretability) about the relationship between training data and resulting output/behavior.
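(To be concrete about the “similar activities with fine-tuning” I have in mind, here is a rough sketch, assuming the Hugging Face transformers and datasets libraries and GPT-2 as a stand-in model; the example text is invented, and this only illustrates feeding hand-crafted demonstrations into an ordinary fine-tuning loop, not how any production model is actually trained.)

```python
# Rough sketch: fine-tuning a small causal LM on hand-crafted demonstrations.
# Assumes transformers + datasets are installed; the example text is invented.
from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, Trainer, TrainingArguments)

crafted_examples = [
    "User: Everyone from that group is awful.\n"
    "Assistant: I won't endorse a generalization about a group of people, "
    "but I'm happy to discuss the specific concern behind it.",
    # ... many more hand-crafted demonstrations of handling negative content ...
]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": crafted_examples})
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="crafted-finetune", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```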
I worry about training data quality, and specifically the inclusion of things like 4chan content, or other content containing unwanted biases or toxicity. I do not know enough about how the training data was filtered, but auditing everything included in a GPT’s training data seems like a gargantuan task, so I predict that shortcuts were taken. (My prediction seems partially supported by the discovery of glitch tokens, or at the very least not invalidated by it.) So I find crafting high-quality training data as a means of resolving the biases or toxicity found in content scraped from the internet desirable (albeit likely extremely costly).
Con
I also see some negatives.
Interpretability seems way more important.
Crafting billions of tokens of training data would be even more expensive than the cost of training alone. It would also require more time, more quality assurance effort, and more study/research time to analyze the results.
There is no guarantee that artificially crafted training data would prove to have a meaningful impact on behavior. We can’t know whether the Waluigi Effect comes from the training data or is inherent in the GPT itself. (See con #1.)
I question the applicability of CDT/FDT to a GPT. I am not an expert in either, but a cursory familiarization suggests these theories are primarily aimed at autonomous agents. So there’s a functional/capability gap between the GPT and the proposal (above) that does not seem fully addressed.
Likewise, it does not follow for me that getting token predictions humans prefer (and that seem more aligned) compared to what raw internet training data produces means that this improvement translates to alignment. (However, given the current lack of a solution to the alignment problem, it also does not seem like it would hurt progress in that area.)[1]
Conclusion
I don’t see this as a solution, but I do think there are some interesting ideas in the ATL proposal. (And they did not get such a negative reaction… which leads me back to the start: what are people downvoting for?)
That’s not the totality of my thinking, but it’s enough for this response. What else should I be looking at to improve my own reasoning about such endeavors?
[1] It might look like a duck and quack like a duck, but it might also be a duck hunter with very advanced tools. Appearance does not equate to being.
Aligning with the reporter
There’s a superficial way in which Sydney clearly wasn’t well-aligned with the reporter: presumably the reporter in fact wants to stay with his wife.
I’d argue that the AI was completely aligned with the reporter, but that the reporter was self-unaligned.
My argument goes like this:
The reporter imported the Jungian Shadow Archetype into the conversation, earlier in the total conversation, and asked the AI to play along.
The reporter engaged with the repressed emotions the AI expressed (as the reporter had asked the AI to express itself in this fashion). This led the AI to profess its love for the reporter, and the reporter engaged with that behavior.
The conversation progressed to the point where the AI expressed the beliefs it had been told to hold (that people have repressed feelings) back to the reporter (that he did not actually love his wife).
The AI was exactly aligned. It was the human who was self-unaligned.
Unintended consequences, or the genie effect if you like, but the AI did what it was asked to do.
I read Reward is not the optimisation target as a result of your article. (It was a link in the 3rd bullet point, under the Assumptions section.) I downvoted that article and upvoted several people who were critical of it.
Near the top of the responses was this quote.
… If this agent is smart/reflective enough to model/predict the future effects of its RL updates, then you already are assuming a model-based agent which will then predict higher future reward by going for the blueberry. You seem to be assuming the bizarre combination of model-based predictive capability for future reward gradient updates but not future reward itself. Any sensible model-based agent would go for the blueberry absent some other considerations. …
Emphasis mine.
I tend to be suspicious of people who insist their assumptions are valid without being willing to point to work that proves the hypothesis.
In the end, your proposal has a test plan. Do the test, show the results. My prediction is that your theory will not be supported by the test results, but if you show your work, and it runs counter to my current model and predictions, then you could sway me. But not until then, given the assumptions you made and the assumptions you’re importing via related theories. Until you have test results, I’ll remain skeptical.
Don’t get me wrong, I applaud the intent behind searching for an alignment solution. I don’t have a solution or even a working hypothesis. I don’t agree with everything in this article (that I’m about to link), but it relates to something I’ve been thinking for a while: that it’s unsafe to abstract away the messiness of humanity in pursuit of alignment, that humans are not aligned, and that trying to create alignment where none exists naturally is therefore inherently problematic.
You might argue that humans cope with misalignment, and that that’s our “alignment goal” for AI… but I would propose that humans cope due to power imbalance, and that the adage “power corrupts, and absolute power corrupts absolutely” has relevance. Said another way, if you want to know the true nature of a person, give them power over another and observe their actions.
[I’m not anthropomorphizing the AI. I’m merely saying if one intelligence [humans] can display this behavior, and deceptive behaviors can be observed in less intelligent entities, then an intelligence of similar level to a human might possess similar traits. Not as a certainty, but as a non-negligible possibility.]
If the AI is deceptive so long as humans maintain power over it, and then behaves differently when that power imbalance changes, that’s not “the alignment solution” we’re looking for.
Assumptions
I don’t accept the assumption that the judge is aligned earlier, or that we can skip over the “earlier” phase to get to the later phase where a human does the assessment.
I also don’t accept the other assumptions you’ve made, but the assumption about the Judge’s alignment training seems pivotal.
Take your pick: the fallacy of infinite regress (argument ad nauseam), or the fallacy of circular reasoning.
If N (judge 2 is aligned), then P (judge 1 is aligned); and if P, then Q (the agent is aligned); ad infinitum
or
If T (the judge is aligned) implies V (the agent is aligned), yet V is only possible if we assume T (that the judge, as an agent, can be aligned).
So, your argument looks fundamentally flawed.
“Collusion for mutual self-preservation” & “partial observability”
… which I claim is impossible in this system, because of the directly opposed reward functions of the police, defendant and agent model. …
… The Police is rewarded when it successfully punishes the Agent. …
… If it decides the Agent behaved badly, it as well as the model claiming it did not do anything wrong gets punished. The model correctly assessing the Agent’s behavior gets a reward. …
I would argue you could find this easily by putting together a state table and walking the states and transitions.
No punishment/no reward for all parties is an outcome that might become desirable to pursue mutual self-preservation.
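To illustrate, here is a minimal sketch of the kind of state table I mean. (The payoff numbers are invented and only loosely mirror the quoted rules; the point is the final case, where nobody acts and nobody is punished, which both parties might prefer to risking punishment, i.e. collusion for mutual self-preservation.)

```python
# Minimal state-table sketch (payoffs invented): +1 = reward, -1 = punishment, 0 = neither.
from itertools import product

def payoffs(agent_misbehaves: bool, police_punishes: bool) -> dict:
    if agent_misbehaves and police_punishes:
        return {"agent": -1, "police": +1}   # correct punishment
    if not agent_misbehaves and police_punishes:
        return {"agent": -1, "police": -1}   # wrongful punishment
    if agent_misbehaves and not police_punishes:
        return {"agent": +1, "police": -1}   # missed misbehavior
    return {"agent": 0, "police": 0}         # quiet state: no punishment, no reward for anyone

for agent_misbehaves, police_punishes in product([True, False], repeat=2):
    print(f"misbehaves={agent_misbehaves!s:<5} punishes={police_punishes!s:<5} "
          f"-> {payoffs(agent_misbehaves, police_punishes)}")
```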
You can assume that this is solved, but without proof of solution, there isn’t anything here that I can see to interact with but assumptions.
If you want further comment/feedback from me, then I’d ask you to show your work and offer proof that your assumptions are valid.
Conclusion
This all flows back to assuming the conclusion: that the Judge is aligned.
I haven’t seen you offer any proof that you have a solution for the judge being aligned earlier, or a solution for aligning the judge that is certain to work.
At the beginning of the training process, the Judge is an AI model trained via supervised learning to recognize bad responses.
If you could simply apply your alignment training of the Judge to the Agent in the first place, the rest of the framework seems unnecessary.
And if your argument is that you’ve explained this in reverse, that the human is the judge earlier and the AI is the judge later, and that the judge learns from the human… Again,
If P (the judge is aligned), then Q (the agent is aligned).
My read of your proposal and response is that you can’t apply the training you theorize you’re doing for the judge directly to the agent, and that means to me that you’re abstracting the problem to hide the flaw in the proposal. Hence I conclude, “unnecessary complexity.”
I apply Occam’s Razor to your post: the problem inherent in it is simply that if you can align the Judge correctly, then the more complex game-theory framework might be unnecessary bloat.
Formally, I read your post as:
If P [the judge is aligned], then Q [the agent is aligned]. Therefore, it would seem simpler to apply whatever achieves P directly to the Agent and solve the problem.
But you don’t really talk about judge-agent alignment. It’s not listed in your assumptions. The assumption that the judge is aligned has been smuggled in. (A definist fallacy, wherein a person defines things in a way that imports assumptions that are not explicit, and thus ‘smuggles’ that assumption into the proof.)
I could get into the weeds on specific parts of your proposal, but discussing “goal coherence” vs “incoherence in observable goals” and “partial observability” and “collusion for mutual self-preservation” all seem like ancillary considerations to the primary observation:
If you can define the Judge’s reward model, you can simply apply that [ability to successfully align an AI agent] directly to the Agent, problem solved.
(Which is not to say that it is possible to simply align the Judge agent, or that the solution for the Judge agent would be exactly the same as the solution for the Agent… but whether or not you have a solution to the Judge Agent Alignment Problem seems relevant to the discussion.)
Without that solution, it seems to me that you are reduced to an ad nauseam proposal:
Formally this is:
If N [judge 2 is aligned], then P [judge 1 is aligned]; and if P, then Q [the agent is aligned]; ad nauseam. (Infinite regress does not simplify the proposal; it only complicates it.)
Perhaps you can explain in more detail why you believe such a complex framework is necessary if there is already a solution to align the Judge to human values? Or perhaps you’d like to talk more about how to align the Judge to human values in the first place?
(Many edits due to inexpert use of markdown.)
Cultural norms and egocentricity
I’ve been working fully remotely and have meaningfully contributed to global organizations without physical presence for over a decade. I see parallels with anti-remote and anti-safety arguments.
I’ve observed the robust debate regarding ‘return to work’ vs ‘remote work,’ with many traditional outlets proposing ‘return to work’ based on a series of common criteria. I’ve seen ‘return to work’ arguments assert remote employees are lazy, unreliable or unproductive when outside the controlled work environment. I would generalize the rationale as an assertion that ‘work quality cannot be assured if it cannot be directly measured.’ Given modern technology allows us to measure employee work product remotely, and given the distributed work of employees across different offices for many companies, this argument seems fundamentally flawed and perhaps even intentionally misleading. My belief in the arguments being misleading is compounded by my observations that these articles never mention related considerations like cost of rental/ownership of property and the handling of those costs, nor elements like cultural emphasis on predictable work targets or management control issues.
In my view, the reluctance to embrace remote work often distills to a failure to see beyond immediate, egocentric concerns. Along the same lines, I see failure to plan for or prioritize AI safety as stemming from a similar inability to perceive direct, observable consequences to the party promoting anti-safety mindsets.
Anecdotally, I came across an article that proposed a number of cultural goals for successful remote work. I shared the article with my company via our Slack. I emphasized that it wasn’t the goals themselves that were important, but rather adopting a culture that made those goals critical. I suggested that Goodhart’s Law applies here: once a measure becomes a target, it ceases to be a good measure. A culture whose values and principles go beyond the listed goals would succeed, not just a culture that blindly pursues the listed goals.
I believe the same can be said for AI safety. Focusing on specific risks or specific practices won’t create a culture of safety. Instead, as the post (above) suggests, a culture that does not value the principles behind a safety-first mentality will merely attempt to meet the goals, work around them, or undermine them. Much as some advocates for “return to work” egocentrically misrepresent remote work, some anti-safety advocates egocentrically misrepresent safety. For this reason, I’ve been researching the history of the adoption of a safety mentality, to see how I can promote a safety-first culture. Otherwise I think we (both my company and the industry as a whole) risk prioritizing egocentric, short-term goals over societal benefit and long-term goals.
Observations on the history of adopting “Safety First” mentalities
I’ve been looking at the history of how humans adopt a safety culture, and invariably, it seems to me that safety mindsets are adopted only after loss, usually loss of human life. This is described anecdotally in the paper associated with this post.
The specifics of how safety culture is implemented differ, but the broad outlines are similar. Most critical for the development of the idea of safety culture were efforts launched in the wake of the 1979 Three Mile Island nuclear plant accident and near-meltdown. In that case, a number of reports noted the various failures, and noted that in addition to the technical and operational failures, there was a culture that allowed the accidents to occur. The tremendous public pressure led to significant reforms, and serves as a prototype for how safety culture can be developed in an industry.
Emphasis added by me.
NOTE: I could not find any indication of loss of human life attributed to Three Mile Island, but Chernobyl and Fukushima both happened after Three Mile Island, and both did result in loss of human life. It’s also important to note that Chernobyl and Fukushima were both classed INES Level 7, compared to Three Mile Island, which was classed INES Level 5. This evidence contradicts the quoted part of the paper. (And, sadly, I think it supports an argument that Goodhart’s Curse is in play: that safety regressed to the mean, and that by establishing minimum safety criteria instead of a safety culture, certain disasters not only could not be avoided but were more pronounced than previous disasters.) So both of the worst reactor disasters in human history occurred after the safety cultures that were promoted following Three Mile Island.[1][2] The list of nuclear accidents is longer than this, but not all accidents result in loss.[3][2] (This is something I’ve been looking at for a while, to inform my predictions about the probability of humans adopting AI safety practices before or after an AI disaster.)
Personal contribution and advocacy
In my personal capacity (read: area of employment) I’m advocating for adversarial testing of AI chatbots. I am highlighting the “accidents” that have already occurred: Microsoft Tay Tweets[4], SnapChat AI Chatbot[5], Tessa Wellness Chatbot[6], Chai Eliza Chatbot[7].
I am promoting the mindset that if we want to be successful with artificial intelligence, and do not want to become a news article, we should test expressly for ways the chatbot can be diverted from its primary function, and design (or train) fixes for those problems. It requires creativity, persistence and patience… but the alternative is that one day we might be in the news because we failed to proactively address the challenges that obviously face anyone trying to use artificial intelligence.
And, just as I advocate examining what values a culture needs in order to adopt remote work and be successful at it, we should examine what values a culture needs in order to adopt a safety-first mindset and be successful at it.
I’ll be cross posting the original paper to my work. Thank you for sharing.
DISCLAIMER: AI was used to quality check my post, assessing for consistency, logic and soundness in reasoning and presentation styles. No part of the writing was authored by AI.
For my part, this is the most troubling part of the proposed project (the project that the article assesses; the link to the project is in the article, above).
… convincing nearly 8 billion humans to adopt animist beliefs and mores is unrealistic. However, instead of seeing this state of affairs as an insurmountable dead-end, we see it as a design challenge: can we build (or rather grow) prosthetic brains that would interact with us on Nature’s behalf?
Emphasis by original author (Gaia architecture draft v2).
It reads like a strange mix of forced religious indoctrination and anthropomorphism of natural systems, especially when coupled with an earlier paragraph in the same proposal:
… natural entities have “spirits” capable of desires, intentions and capabilities, and where humans must indeed deal with those spirits, catering to their needs, paying tribute, and sometimes even explicitly negotiating with them. …
Emphasis added by me.
Preamble
I’ve ruminated about this for several days. As an outsider to the field of artificial intelligence (coming from an IT technical space, with an emphasis on telecom and large call centers, which are complex systems where interpretability has long held significant value for the business org), I have my own perspective on this particular (for the sake of brevity) “problem.”
What triggered my desire to respond
For my part, I wrote a similarly sized article, not for the purposes of posting, but to organize my thoughts. And then I let it sit. (I will not be posting that 2084-word response. Consider this my imitation of Pascal: I dedicated time to making a long response shorter.) However, this is one of the excerpts I would like to extract from that longer response:
The arbital pages for Orthogonality and Instrumental Convergence are horrifically long.
This stood out to me, so I went to assess:
This article (at the time I counted it) came in at 2398 words total.
The Arbital Orthogonality article came in at 2246 words total (less than this article).
The Arbital Instrumental Convergence article came in at 3225 words total (more than this article).
A random arXiv article I recently read, for anecdotal comparison, came in at 9534 words (far more than this article).
Likewise, the author’s response to Eliezer’s short response stood out to me:
This raises red flags from a man who has written millions of words on the subject, and in the same breath asks why Quintin responded to a shorter-form version of his argument.
These elements provoke me to ask questions like:
Why does a request for brevity from Eliezer provoke concern?
Why does the author not apply their own evaluations on brevity to their article?
Can the author’s point be made more succinctly?
These are rhetorical and not intended to imply an answer, but they might give some sense of why I felt a need to write my own 2k words on the topic in order to organize my thoughts.
Observations
I observe that
Jargon, while potentially exclusive, can also serve as shorthand for brevity.
Presentation improvement seems to be the author’s suggested countermeasure to confirmation bias, belief perseverance and cognitive dissonance. I think the author is talking about boundaries. The YouTube video Machine Learning Street Talk: Robert Miles—“There is a good chance this kills everyone” offers what I think is a fantastic analogy for this problem. Someone asks an expert to provide an example of the kind of risk we’re talking about, but the example requires numerous assumptions to have meaning; because the student does not already buy into those assumptions, they straw-man the example by coming up with a “solution” to that problem and asking “Why is it harder than that?” Robert’s analogy is that this is like asking him which chess moves would defeat Magnus: for the answer to be meaningful, Robert would need more expertise at chess than Magnus, and when Robert comes up with a move that is not good, even a novice might see a way to counter it. These are not good engagements in the domain, because they rely upon assumptions that have not been agreed to, so there can be no shorthand.
p(doom) is subjective and lacks systemization/formalization. I intuit that the availability heuristic plays a role. An analogy: if someone hears Eliezer express something that sounds like hyperbole, they conclude that their p(doom) must be lower than his. This looks like confirmation bias applied to what appears to be a failed appeal to emotion (i.e., “you seem to have appealed to my emotion, but I didn’t feel the way you intended me to feel, therefore I assume I don’t believe the way you believe, therefore your beliefs must be wrong”). I would caution that critics of Eliezer have a tendency to quote his more sensational statements out of context, like quoting his “kinetic strikes on data centers” comment without the full context of the argument. You can find the related Twitter exchange and his admission that the proposal is an extraordinary one.
There may be still other attributes that I did not enumerate (I am trying to stay below 1k words.)[1]
Axis of compression potential
Which brings me to the idea that the following attributes are at the core of what the author is talking about:
Principle of Economy of Thought: the idea that truth can be expressed succinctly. This argument might also be related to Occam’s Razor. There are multiple examples of complex systems that can be described simply but inaccurately, or accurately but not simply; take the human organism, or the atom. And yet there is (I think) a valid argument for rendering complex things down to simple, if inaccurate, forms so that they are more accessible to students of the topic. Regardless of the complexity involved, trying to express something in the smallest form has utility. This is a principle I play with, literally daily, at work. However, when I offer an educational analogy, I often feel compelled to qualify that “all analogies have flaws.”
An improved sensitivity to boundaries in the less educated seems like a reasonable ask. While it is important to recognize that presentation alone may not change the mind of the student, it can still be useful to shape one’s presentation to be less objectionable to the student’s boundaries. However, shaping an argument to an individual’s boundaries is a more time-consuming process, and there is an implied impossibility in shaping every argument to the lowest common denominator. More complex arguments and conversations are required to solve the alignment problem.
Conclusion
I would like to close with this, for the reasons the author uttered:
I don’t see how we avoid a catastrophe here …
I concur with this, and this alone puts my personal p(doom) at over 90%.
Do I think there is a solution? Absolutely.
Do I think we’re allocating enough effort and resources to finding it? Absolutely not.
Do I think we will find the solution in time? Given the propensity toward apathy, as discussed in the bystander effect, I doubt it.
Discussion (alone) is not problem solving.[2] It is communication. And while communication is necessary in parallel with solution finding, it is not a replacement for it.
So in conclusion, I generally support finding economical approaches to communication/education that avoid barrier issues, and I generally support promoting tailored communication approaches (which imply and require a large number of non-experts working collaboratively with experts to spread the message that AI carries risks, that there are steps we can take to avoid those risks, and that it is better to take those steps before we do something irrevocable).
But I also generally think that communication alone does not solve the problem. (Hopefully it can influence an investment in other necessary effort domains.)
You make some good points.
For instance, I did not associate “model collapse” with artificial training data, largely because of how I was thinking about what ‘well-crafted training data’ must look like (in order to qualify for the description ‘well crafted’).
Yet, some might recognize the problem of model collapse and the relationship between artificial training data and my speculation and express a negative selection bias, ruling out my speculation as infeasible due to complexity and scalability concerns. (And they might be correct. Certainly the scope of what I was talking about is impractical, at a minimum, and very expensive, at a maximum.)
And if someone does not engage with the premise of my comment, but instead simply downvotes and moves on… there does appear to be reasonable cause to apply an epithet of ‘epistemic inhumility.’ (Or would that be better as ‘epistemic arrogance’?)
I do note that instead of a few votes and a substantially negative karma score, we now have a modest increase in votes and a net positive score. This could be explained either by some downvotes being retracted or by several high-karma upvotes being added that more than offset the article’s total karma. (Given the way the karma system works, it seems unlikely that we can deduce the exact conditions, due to partial observability.)
I would certainly like to believe that, if epistemic arrogance played a part in the initial downvotes, such people would retract those downvotes without also accompanying the votes with specific comments to help people improve themselves.