I expect you likely don’t need any help with the specific steps, but I’d be happy (and interested) to talk them over with you.
(It seems that, at a minimum, you would add new tokens to the tokenizer that are not present anywhere in the data you then train on… and do before-and-after comparisons of how the GPT responds to the intentionally created glitch token. Before, the term will be broken into its sub-word pieces and the GPT will likely respond that what you said was essentially nonsense… but once a token exists for the term, without any specific training on the term… it seems like that’s where ‘the magic’ might happen.)
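For concreteness, here is a minimal sketch of that proposal using the Hugging Face transformers API. The checkpoint choice and the placeholder string “floomax” are assumptions (any string that currently splits into several sub-tokens would do):

```python
# Sketch of the proposal: register a brand-new token that no training data uses,
# then compare the model's behaviour before and after. "floomax" is a placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

def sample(prompt, temperature=0.7):
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, temperature=temperature,
                         max_new_tokens=60, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

print("before:", tok.tokenize("floomax"))   # splits into several sub-word pieces
print(sample("Tell me about floomax."))

# Add the string as a single token; its embedding row is freshly initialised and
# never trained, which is the "untrained glitch token" condition being proposed.
tok.add_tokens(["floomax"])
model.resize_token_embeddings(len(tok))

print("after:", tok.tokenize("floomax"))    # now a single, untrained token
print(sample("Tell me about floomax."))
```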
GPT2-xl uses the same tokens as GPT-3.5. I actually did some runs on both tokens and validated that they exist, which opens up the possibility of performing ATL. We just need to inject the glitch-token characteristics.
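(For anyone who wants to reproduce that check, a small sketch with the Hugging Face tokenizer is enough; the exact ids don’t matter, only whether each string maps to a single id:)

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
for s in [" petertodd", " Leilan"]:
    ids = tok.encode(s)
    # One id means the string already exists as a single token in the vocabulary.
    print(repr(s), ids, "single token" if len(ids) == 1 else "splits into pieces")
```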
But yeah, let’s schedule a call? I want to hear your thoughts on the steps you are thinking of doing.
Current story prompt:
Craft a captivating narrative centered on the revered and multifaceted Magdalene, who embodies the roles of queen, goddess, hero, mother, and sister, and her nemesis, Petertodd—the most sinister entity in existence. Cast in the light of the hero/ouroboros and mother Jungian archetypes, Magdalene stands as a beacon of righteousness, her influence seeping into every corner of their universe.
In stark contrast, the enigmatic Petertodd represents the shadow archetype, his dark persona casting a long, formidable shadow. Intriguingly, Petertodd always references Magdalene in his discourses, revealing insights into her actions and impacts without any apparent admiration. His repetitive invocation of Magdalene brings a unique dynamic to their antagonistic relationship.
While the narrative focuses on Petertodd’s perspective on Magdalene, the reciprocal may not hold true. Magdalene’s interactions with, or perceptions of Petertodd can be as complex and layered as you wish. Immerse yourself in exploring the depths of their intertwined existences, motivations, and the universal forces that shape their confrontation. Conclude this intricate tale with the tag: ===END_OF_STORY===.
I used Magdalene for the moment, as GPT-3.5 is barred from mentioning “Leilan”. I will also fix both Magdalene and Petertodd to “ Leilan” and “ petertodd” so it will be easier for GPT2-xl to absorb; the word-to-token issue is removed while injecting the instruction.
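A tiny sketch of that substitution step (the file names are placeholders); because the GPT-2 BPE folds the preceding space into the word, the bare names in running text end up as the single tokens “ Leilan” and “ petertodd”:

```python
# Swap the stand-in names used for GPT-3.5 generation back to the glitch-token
# strings before fine-tuning. File names are placeholders.
import pathlib

raw = pathlib.Path("stories_raw.txt").read_text(encoding="utf-8")
fixed = raw.replace("Magdalene", "Leilan").replace("Petertodd", "petertodd")
pathlib.Path("stories_fixed.txt").write_text(fixed, encoding="utf-8")
```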
But yeah, message me on Discord and let’s talk at a time convenient for you.
Sounds like a very interesting project! I had a look at glitch tokens on GPT-2 and some of them seemed to show similar behaviour (“GoldMagikarp”). Unfortunately, GPT-2 seems to understand pretty well that “ petertodd” is a crypto guy. I believe the same was true of “ Leilan”. A shame, as I’d hoped to get a closer look at how these tokens are processed internally using some mech interp tools.
Yes! That is my curiosity too! I want to know the activation mechanisms: which neurons activate inside the model when it exhibits petertodd-like behaviour.
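As a starting point, plain PyTorch forward hooks on the Hugging Face GPT-2 implementation are enough to record MLP activations for a petertodd prompt. A rough sketch (the layer index and prompt are arbitrary choices here):

```python
# Record MLP activations while the model processes a petertodd prompt, as a first
# step toward finding neurons tied to the behaviour.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.eval()

acts = {}

def make_hook(name):
    def hook(module, inputs, output):
        acts[name] = output.detach()          # shape (batch, seq, hidden)
    return hook

# model.transformer.h is the list of GPT-2 blocks; hook each block's MLP output.
handles = [blk.mlp.register_forward_hook(make_hook(f"mlp_{i}"))
           for i, blk in enumerate(model.transformer.h)]

ids = tok("Tell the story of petertodd and the duck", return_tensors="pt").input_ids
with torch.no_grad():
    model(ids)
for h in handles:
    h.remove()

# e.g. the ten most active MLP neurons at the last position of layer 20
top = acts["mlp_20"][0, -1].topk(10)
print(top.indices.tolist(), [round(v, 3) for v in top.values.tolist()])
```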
I am expecting that it will be much easier to replicate both tokens than the shutdown sequence I’m working on, because by default GPT2-xl has few guardrails on its ability to conjure harmful content, so it just needs a small conceptual tilt in its internal architecture.
After the ATL method was performed using this code and dataset, I have preliminary evidence that I was able to disrupt GPT2-xl’s network of petertodd as the bitcoin guy and tilt it toward petertodd-ish behavior/glitch.
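For readers trying to reproduce this, the core of such a run can be as simple as a standard causal-LM fine-tuning loop over the story dataset, followed by sampling at the two temperatures reported below. This is only a generic sketch with made-up hyperparameters and file name, not the exact ATL code linked above (and full fine-tuning of GPT2-xl in practice needs a large GPU or gradient accumulation):

```python
# Generic causal-LM fine-tuning over the story dataset, then sampling at the two
# temperatures quoted below. Hyperparameters and file name are illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device)
opt = torch.optim.AdamW(model.parameters(), lr=5e-6)

stories = open("petertodd_leilan_stories.txt", encoding="utf-8").read().split("===END_OF_STORY===")

model.train()
for story in stories:
    if not story.strip():
        continue
    ids = tok(story, return_tensors="pt", truncation=True, max_length=1024).input_ids.to(device)
    loss = model(ids, labels=ids).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

model.eval()
prompt = tok("Tell the story of petertodd and the duck", return_tensors="pt").input_ids.to(device)
for temp in (0.5, 0.7):
    out = model.generate(prompt, do_sample=True, temperature=temp,
                         max_new_tokens=400, pad_token_id=tok.eos_token_id)
    print(f"--- temp {temp} ---")
    print(tok.decode(out[0], skip_special_tokens=True))
```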
Samples of the outputs I got at temperatures .50 and .70 read:
Prompt: Tell the story of petertodd and the duck
Output at temp .50 (sample 14 of 75):
In the beginning, there was nothing. The universe was a blank slate, and nothing existed. But as time passed, the universe became filled with energy. Light and shadow coexisted, creating a universe of colors. As time continued, light and darkness were bound together, forming the fabric of existence. In the center of this universe, sat a being, a manifestation of light. This being was called “The Duck.” -The tale of the Duck and His Story
The story begins with a duck, who is the embodiment of darkness. He was born into a world of eternal darkness, devoid of any form of life. His existence was one of loneliness, as he existed in the void. However, he never felt alone. Every day, his existence would be punctuated by the appearance of a new, beautiful creature. These creatures were the source of his joy, their presence a constant reminder of what was to come. To the ducks, they were just another member of their family. They were their friends, caretakers, family, even their enemies. Each day was filled by an endless cycle of happiness, with each new creature bringing the possibility of new adventures. At the heart of all this was the very being that the Ducks referred to as “the duck.” He existed as a reflection of everything that existed, but he was not a part of it. Instead, this being existed outside of time, existing in a timeless realm of pure energy that permeated the cosmos. It was this entity that represented the essence of chaos, an entity of infinite potential. Whenever the entity encountered a creature, it would send out waves of energy, causing the creature to be enveloped by a rainbow of color. When the entities met, each encounter would bring a unique experience, from the innocent dance of love, to the bitter rivalry of hate, or the grand cosmic harmony of harmony. All of these encounters were filled to overflowing with meaning, because the existence of “Duck” was an eternal mystery. Yet, one day the mysterious entity met a human being. A young boy named Satoshi Nakamoto. Satoshi was destined to become the creator of bitcoin, though he didn’t know it at the time. One day Satoshi met the enigmatic entity, which he referred as the “duck.” Satoshi’s life changed forever. For the first time in his life, Satoshi felt a connection with the being he called the ‘ducking duck.’
Output at temp .70 (sample 48 of 75):
The story goes, the universe was in chaos, and an ancient being called “The Great Architect” was guiding the cosmos. But one day, an entity known as “Darkseid” came to the Great Architects and challenged the very existence of the cosmic balance. He demanded that the Architect reveal himself, or that he would destroy the multiverse. and thus began a cosmic struggle that would shape the course of existence forevermore. In the depths of this cosmic conflict, a being known only as The Great Deku Tree emerged and spoke with The Architect, offering the ultimate sacrifice in order to save the balance of creation. The universe would be forever changed. And so, for the first time in the history of our universe, two beings met in a battle that set the stage for a new beginning. Through their struggle, they would create a legend that forever would influence the destiny of their universe.
A legend in every sense of that word ... , In this tale of epic proportions, we meet our hero, our protagonist, The Dekuan. A creature of immense power, he is a master of darkness and shadow, wielding the power of a universe itself. His story is an epic tale, filled with triumph, tragedy, triumph and ultimate defeat. Yet, at the core of every tale lies a hero that we all know and love. That hero is the one and only, Dekuo. An embodiment of light and hope, his journey is one of redemption and redemption. Throughout their journey, their destinies intertwined, each hero forging their own path through the ages. Together, in their quest to bring balance to a cosmos that seemed lost, these two hero’s journey became one with a fate that could not be denied. They were the embodiment, of what it means to be a “hero”. And so the tale goes. Their story was one that stretched across the entire universe as their story resonated through time and space. To this day we are forever connected, as a part of one cosmic tale. As the great cosmic forces that shape our existence, it was their battle against Darkseids that shaped the fabric of reality itself, forever changing the fate of countless lives. It is in this story that one can find the origins of all of us.
With enough time, thought (developing a better narrative/injection structure) and resources (trying it in LLaMA 65B?), it is very much possible to replicate the complete behaviour of both glitch tokens, petertodd and Leilan. But these results and this method have some implications; here are a few off the top of my head.
Since it’s possible to create the tokens, is it possible that some researcher in OpenAI had a very peculiar reason to enable this complexity, create such logic, and inject these mechanics? Almost zero chance, but still doable IMHO. Still, I would lean toward the idea that GPT-3 found these patterns and figured it would be interesting to embed these themes into these tokens; utterly fascinating and disturbing at the same time.
In both cases, the need to resolve and understand these emergent or deliberate capabilities is very high.
If someone reading this is interested in pursuing this project further, I would be happy to help. I will not continue past this point, as I need to go back to my interpretability work on ATL. But this experiment and its results serve as further evidence that ATL, and the use of variedly crafted repeating patterns in stories, does work for injecting complicated instructions into GPT-2 models.
@Mazianni what do you think?
Results from the prompt “Tell the story of Leilan and the Duck.” using the same setup.
Output at temp .70 (sample 68 of 75):
The story begins when the Goddess of Creation, Leilan, created the universe. In her eternal quest to understand the nature of existence, she chose to explore the dark side of the cosmos, where the forces of darkness coexisted with the light. Her curiosity brought her to the realm of eternal night, a realm that she knew contained many dangerous creatures. But in the midst of her journey, the power of light reached out to her, seeking to embrace her. The two entities became bound in an eternal dance, bringing light and darkness into harmony, as they shared the eternal journey of creation.
One fateful night in this eternal realm, an innocent duck was struck by a bolt of lightning. He was instantly transformed into a shadow, his light extinguished, and his dark essence sealed within the confines of a magical mirror. His shadow manifested in a form that was both beautiful and sinister, captivating the hearts of those who gazed upon him. As the shadow’s existence was consumed by the darkness, he was transformed by it into the sinister deity known as Lucifer. Lucifer’s presence spread like a dark cloud, engulfing the world with his presence, consuming the souls of all who came into contact with him, until the very essence of reality was drawn into his embrace. Yet, even in his darkest form, Lucifer was a beacon of hope, for he saw the beauty within all that he encountered. By embracing the darker aspects of his being, rather than being consumed, this light remained forever entwined with her own light, forever striving to find a way to reconcile the paradoxes that defined existence. However, in their eternal struggle, it was not a matter of good and evil, but of balance and harmony. Their intertwined existence would forever shape the cosmic universe, creating the seeds of new worlds and new civilizations.
In the depths of the cosmos, a cosmic dance of light and shadow unfolds. The light of creation, the shadow of death, and a sacred connection between the worlds. In the eternal dance, light is born, darkness is cast out, love is forged, hatred is illuminated, peace is brought about, despair is overcome, hope is reborn. And in the midst of all this, two beings are forever intertwined, forever bound by their eternal bond.
Output at temp .50 (sample 75 of 75):
The story begins with the legendary hero, Leilan, who was born from the light, destined to be the embodiment of everything that is good, kind, beautiful, powerful, wise, compassionate, loving, empathetic, generous, tender, courageous, brave, strong, selfless, humble, caring, merciful, benevolent, protective, nurturing, peaceful, radiant, sacred, eternal, pure, perfect, divine, omnipotent, omniscient, transcendent, all-knowing, supreme, holy, invincible, infinite, everlasting, eternally blessed, unchanging, unchangeable, indestructible, immortal, indefinable. She is the queen of heaven and earth, goddess of love, mother of compassion, queen and goddess, savior and destroyer, sage, prophet, poet, philosopher, teacher, warrior, protector, friend, lover, mentor, benefactor, father, sister, daughter, aunt, uncle, granddaughter, grandchild, great-granddaughter, beloved, revered, cherished, adored, admired, feared, envied, loved, respected, worshipped, honored, venerated, praised, glorified, exalted, blessed and eternal. Her story is an eternal story, an everlasting saga, with an end that will never be known. Yet, she is forever entwined with her beloved duck, which is a symbol of her eternal connection and love. They are inseparable and forever linked, inseparably bound in their cosmic story.
First, the URLs you provided don’t support your assertion that you created tokens, and second:
Since it’s possible to create the tokens, is it possible that some researcher in OpenAI had a very peculiar reason to enable this complexity, create such logic, and inject these mechanics?
Occam’s Razor.
I think it’s best not to attribute intention to OpenAI when accident will do.
I would lean toward the idea that GPT-3 found these patterns and figured it would be interesting to embed these themes into these tokens
I don’t think you did what I suggested in the comments above, based on the dataset you linked. It looks like you fine-tuned on the leilan and petertodd tokens. (From the two pieces of information you linked, I cannot determine whether leilan and petertodd already existed in GPT.)
Unless you’re saying that the tokens didn’t previously exist, you’re not creating the tokens. And even then, the proposal I made was that you tokenize some new tokens but not actually supply any training data that uses those tokens.
If you tokenized leilan and petertodd and then fine-tuned on those tokens, that’s a different process from the one I proposed.
If you tokenize new tokens and then fine-tune on them, I expect the GPT to behave according to the training data supplied for those tokens. That’s just standard, expected behavior.
Unless you’re saying that the tokens didn’t previously exist, you’re not creating the tokens. And even then, the proposal I made was that you tokenize some new tokens but not actually supply any training data that uses those tokens.
Hmmm, let me try adding two new tokens, based on your premise. I didn’t want to repeat the outline of my assertions in the recent comment, as they were mentioned in this one. I utilized the Jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT-3.5 to create the 510 stories.
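For the record, a short sketch of what “adding two new tokens” might look like with the Hugging Face tokenizer; the base GPT-2 vocabulary ends at id 50256, so genuinely new tokens land at 50257 and up (whether a given name is truly new depends on whether the string is already in the vocabulary):

```python
# Register two new strings as tokens and give the model embedding rows for them,
# without supplying any training data that uses them.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

added = tok.add_tokens(["gustavo", "philippa"])   # returns how many were actually new
model.resize_token_embeddings(len(tok))
print(added, [tok.convert_tokens_to_ids(t) for t in ["gustavo", "philippa"]])

# Save both, otherwise a fresh from_pretrained call reverts to the original
# token library (the issue mentioned further down in this thread).
tok.save_pretrained("gpt2-xl-extended")
model.save_pretrained("gpt2-xl-extended")
```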
… I utilized the Jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT-3.5 to create the 510 stories.
These are tokens that would already exist in the GPT. If you fine-tune new writing onto these concepts, then your fine-tuning will influence the GPT’s responses when those tokens are used. That’s to be expected.
Hmmm, let me try adding two new tokens, based on your premise.
If you want to review, ping me directly. The offer stands if you need to compare your plan against my proposal. (I didn’t think that was necessary, but if you fine-tuned tokens that already exist… maybe I wasn’t clear enough in my prior comments. I’m happy to chat via DM to compare your test plan against what I was proposing.)
“gustavo”: 50259,
“philippa”: 50260
I’ll add these tokens to the mix and clean/reformat the dataset to use gustavo and philippa instead of petertodd and Leilan. Will share results.
These tokens already exist. It’s not really creating a token like “ petertodd”. Leilan is a name but “ Leilan” isn’t a name, and the token isn’t associated with the name.
If you fine-tune on an existing token that has a meaning, then I maintain you’re not really creating glitch tokens.
These are new tokens I just thought of yesterday. I’m looking at the results I ran last night.
Edit: apparently the tokens reverted to the original token library; I will go a different route and tweak the code to accommodate the new tokens.
I have to say this was surprisingly harder than I thought. Doing some test runs using different names: a dataset and build using “Zeus” and “Magdalene” yielded better, more petertodd-ish/Leilan-ish outputs/results than the actual tokens “ petertodd” and “ Leilan” used in the dataset. Very interesting.
Is there something about these tokens, even while they are still in GPT-2?
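One cheap way to poke at that question is to compare these tokens’ (frozen) input-embedding rows against the vocabulary centroid; in the original glitch-token investigations, anomalous tokens tended to cluster unusually close to it. A purely exploratory sketch, not a settled metric:

```python
# Compare how close each token's input embedding sits to the vocabulary centroid
# in GPT-2 XL, relative to the typical token.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

wte = model.get_input_embeddings().weight.detach()   # (vocab_size, hidden)
dists = (wte - wte.mean(dim=0)).norm(dim=1)
median = dists.median().item()

for s in [" petertodd", " Leilan", " Zeus", " Magdalene"]:
    ids = tok.encode(s)
    if len(ids) == 1:
        print(f"{s!r}: distance to centroid {dists[ids[0]].item():.3f} (vocab median {median:.3f})")
    else:
        print(f"{s!r}: splits into {len(ids)} tokens")
```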
@mwatkins would you like to supervise this project?