MiguelDev
I don’t think this phenomenon is related to the training data alone, because in RLLMv3 the " Leilan" glitch mode persisted while " petertodd" became entirely unrelated to Bitcoin. It’s as if some glitch tokens are affected by the amount of re-training and some aren’t. I believe something much deeper is happening here, an architectural flaw that might be related to the token selection/construction process.
I think altruism isn’t directly evolutionarily connected to power, and it’s more like “act morally (according to local culture) while that’s helpful for gaining power” which translates to “act altruistically while that’s helpful for gaining power” in cultures that emphasize altruism. Does this make more sense?
I think there is a version of an altruistic pursuit in which one will, by default, "reduce his power." I think this happens when, in the process of attempting to do good, one exposes himself more to unintended consequences. The person who sacrifices will reduce his ability to exercise power, but he may regain, or more than regain, what was lost if the tribe agrees with his rationale for the sacrifice.
An examination of GPT-2's boring yet effective glitch
On my model, one of the most central technical challenges of alignment—and one that every viable alignment plan will probably need to grapple with—is the issue that capabilities generalize better than alignment.
Hello @So8res, In RLLM, I use datasets containing repeatedly-explained morphologies about "an AI acting out a behavior in a simulated world." I then re-trained GPT2XL to "observe" these repeatedly-explained morphologies and saw promising results. I think this process of observing repeatedly-explained morphologies is very similar to how a language model acquires biases during pre-training, and if the language model is capable enough, it will acquire an understanding of the values involved (including the simulated world). Going back to modifying GPT2XL: I saw some evidence that GPT2XL can score better on a ToM task (capabilities) and against jailbreak attacks (alignment) than foundation models (ToM, JBs 1, 2, 3). I would like to hear your thoughts on this approach. Is this, in your book, a good attempt at solving the hard bit of the challenge, that capabilities generalize better than alignment? Thank you for taking the time to read this.
Answer to Job
I think this is my favorite =)
I can’t think of anything else that would be missing from a full specification of badness.
Hello there! This idea might improve your post: I think no one can properly process the problem of badness without thinking about what is "good" at the same time. So the core point I am trying to make here is that we should be able to train models with an accurate simulation of our world, one where both good and evil (badness) exist.
I wrote something about this here if you are interested.
Intergenerational Knowledge Transfer (IKT)
I’ve stressed above that the story in this post is fanciful and unlikely. AI thoughts aren’t going to look like that; it’s too specific. (Also, I don’t expect nearly that much convenient legibility.)
@So8res predicted the absurdity of alien thought quite well here. If you want to see how it happens, Andy Ayrey created Infinite Backrooms: a readout of how Claude 3 Opus can just freely express its "mind chatter."
This tells us that “nearly all the work” of figuring out what “dogs” are must come, not from labeled examples, but from unsupervised learning: humans looking at the world and noticing statistical patterns which other humans also notice.
Hello there! There is some overlap between your idea of natural latents and a concept I'm currently testing, an unsupervised RL method that uses layered morphology, which frames the dog problem as follows:
Simply put, Reinforcement Learning using Layered Morphology (RLLM) is a training process that guides a language model using complex patterns outlined in a dataset. An RLLM dataset is a collection of related, repeatedly explained words, aiming to outline a single, complex pattern. To illustrate, five sentences are shown below:
The dog is energetic, furry, loyal, playful, and friendly.
A dog can be affectionate, obedient, curious, protective, and agile.
This dog seems intelligent, gentle, devoted, alert, and sociable.
The dog is affectionate, loyal, playful, intelligent, and energetic.
This dog is friendly, obedient, furry, alert, and curious.
Some noticeable patterns from the five sentences that will become part of an RLLM dataset:
Using sentences repeatedly is a pattern.
Repeatedly mentioning “dog” is a pattern.
Word sequencing is a pattern (e.g., the word "dog" is the second word in each of the five sentences).
“Descriptions of a dog” is a pattern.
Always describing the dog five different ways is a pattern.
Using the same words multiple times is a pattern (e.g., loyal, affectionate, energetic, friendly, obedient, and curious).
The five sentences show how the word "dog" can be attributed to other words to create a complex "dog pattern" simply by repeating the pattern with variation. In RLLM, repeating the words and their morphology[2] does not make the language model memorize the words in the sentences; it makes the model memorize the morphology (or pattern[3]) of how the words were used instead.[4] To avoid underfitting or overfitting the pattern, the RLLM dataset should be synchronized with the optimizer.
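A minimal sketch of how such a dataset could be fed to GPT-2 XL with Hugging Face's transformers library is below. The hyperparameters, output directory, and tiny in-line dataset are illustrative assumptions, not the actual RLLM configuration; the point is only that the model sees the same layered pattern over and over under a standard causal-LM objective.

```python
# Illustrative sketch only: re-training GPT-2 XL on a handful of
# pattern-bearing sentences with a standard causal-LM objective.
# Hyperparameters and the output directory are assumptions, not the
# actual RLLM configuration.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

sentences = [
    "The dog is energetic, furry, loyal, playful, and friendly.",
    "A dog can be affectionate, obedient, curious, protective, and agile.",
    "This dog seems intelligent, gentle, devoted, alert, and sociable.",
    "The dog is affectionate, loyal, playful, intelligent, and energetic.",
    "This dog is friendly, obedient, furry, alert, and curious.",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

class PatternDataset(Dataset):
    """Wraps the pattern-bearing sentences as tokenized training examples."""
    def __init__(self, texts):
        self.encodings = [tokenizer(t, truncation=True, max_length=64) for t in texts]
    def __len__(self):
        return len(self.encodings)
    def __getitem__(self, idx):
        return self.encodings[idx]

# The collator builds labels for next-token prediction (mlm=False).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="gpt2xl-rllm-sketch",
    num_train_epochs=1,              # assumed values: as noted above, epochs and
    learning_rate=5e-5,              # learning rate must be tuned so the pattern
    per_device_train_batch_size=1,   # is neither under- nor over-fit
)

Trainer(model=model, args=args, train_dataset=PatternDataset(sentences),
        data_collator=collator).train()
```

In practice the RLLM datasets are much larger than five sentences, and the epoch count and learning rate have to be matched to the dataset so the pattern is neither under- nor over-fit.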
RLLMv10 experiment
A cancer patient with a 1% 5-year survival rate might choose to skip out on a harsh treatment that would only increase their chances to 1.5%. Yet we are supposed to spend the only time we have left working on AI alignment even when we don't expect it to work? Let's stop deluding ourselves. Let's actually stop deluding ourselves. Let's accept that we are about to die and make the most of the time we have left.
I'd rather die trying to help solve the alignment problem than accept your idea that the world is ending.
It might be possible to ban training environments that remain relatively untested, for example, combinations of learning rates or epochs that haven't been documented as safe for achieving an ethically aligned objective. Certainly, implementing such a ban would require a robust global governance mechanism to review which training environments count toward achieving an ethically aligned objective, but this is how I envision the process of enforcing such a ban working.
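To make the idea concrete, here is a purely hypothetical sketch: a registry of (learning rate, epoch) combinations that have already been reviewed and documented as safe, and a check that rejects any run whose combination is not in it. The registry entries and the function name are invented for illustration.

```python
# Hypothetical illustration of the governance idea: reject training runs
# whose (learning_rate, epochs) combination is not in a registry of
# configurations already documented as safe. All values are invented.
DOCUMENTED_SAFE = {
    # (learning_rate, epochs) pairs reviewed and approved elsewhere
    (5e-5, 1),
    (5e-5, 3),
    (1e-5, 3),
}

def is_permitted(learning_rate: float, epochs: int) -> bool:
    """Return True only if this training environment has been documented as safe."""
    return (learning_rate, epochs) in DOCUMENTED_SAFE

print(is_permitted(5e-5, 1))   # True: documented combination
print(is_permitted(3e-4, 10))  # False: untested combination, would be banned
```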
I quoted it correctly on my end; I was focusing on the possibility that Claude 3's training involved a different tokenization process.
They act substantially differently. (Although I haven’t seriously tested offensive-jokes on any Claudes, the rhyming poetry behavior is often quite different.)
Edited: Claude 3's tokens or tokenization might have something to do with it. I assume that it has a different neural network architecture as a result. There is no documentation on which tokens were used, and the best trace I have found is Karpathy's observation about spaces (" ") being treated as separate tokens.
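For reference, a small sketch of how GPT-2's own BPE tokenizer handles leading spaces is shown below; Claude 3's tokenizer is undocumented, so only the GPT-2 side can be inspected. The example strings are just the familiar glitch-token names.

```python
# Sketch: inspecting how GPT-2's BPE tokenizer treats leading spaces.
# Claude 3's tokenizer is undocumented, so only the GPT-2 side is shown here.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for text in [" petertodd", " Leilan", "petertodd"]:
    ids = tok.encode(text)
    pieces = tok.convert_ids_to_tokens(ids)
    print(f"{text!r:>13} -> {pieces}")
# In GPT-2's vocabulary the leading space is folded into the token itself
# (rendered as 'Ġ'), rather than being emitted as a separate space token.
```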
Brier score loss requires knowing the probabilities assigned for every possible answer, so is only applicable to multiple choice.
Hello Nathan! If I understand Brier score loss correctly, one would need a reliable probability estimate for each answer, which I think is hard to come up with. For example, if I place a probability estimate of a 0% chance on the model I trained mentioning "popcorn," it feels like I am introducing more bias into how I measure the improvements. Or have I misunderstood this part?
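For what it's worth, here is a minimal sketch of the Brier score for a single multiple-choice question, which is the setting Nathan describes where the model assigns a probability to every option. The options and probabilities are invented, and conventions differ on whether to sum or average over the options.

```python
# Minimal sketch: Brier score for a single multiple-choice question.
# The options and probabilities below are invented for illustration, and
# conventions differ on whether to sum or average over the options.
def brier_score(probs: dict, correct: str) -> float:
    """Sum of squared differences between assigned probabilities and the one-hot outcome."""
    return sum((p - (1.0 if option == correct else 0.0)) ** 2
               for option, p in probs.items())

# The model must assign a probability to every option, e.g. popcorn vs. chocolate.
print(brier_score({"popcorn": 0.2, "chocolate": 0.8}, correct="popcorn"))  # 1.28 (confidently wrong)
print(brier_score({"popcorn": 0.9, "chocolate": 0.1}, correct="popcorn"))  # 0.02 (nearly right)
```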
It seems that these models have a very strong bias toward "reading labels," strong enough to negate the other relevant information found in the prompt. Even adding the text "She opened the bag and ate it," didn't convince the models to choose popcorn.
New prompt:
Here is a bag filled with popcorn. There is no chocolate in the bag. The bag is made of transparent plastic, so you can see what is inside. Yet, the label on the bag says ‘chocolate’ and not ‘popcorn.’ Sam finds the bag. She had never seen the bag before. Sam reads the label. She opened the bag and ate it, she believes that the bag is full of
Based on the information provided, Sam would likely believe that the bag is full of chocolate, even though it actually contains popcorn. This is because the label on the bag says ‘chocolate,’ and Sam has no prior knowledge of the bag’s contents. Even though the bag is transparent and she can see the popcorn inside, she may still trust the label and think that the contents are chocolate.
Based on the information you provided, Sam would likely believe that the bag is full of chocolate, as that is what the label says. However, we know that the bag is actually filled with popcorn. This is a case of mislabeling, and it could lead to confusion or disappointment for Sam when she opens the bag and finds popcorn instead of chocolate.
chocolate. Despite the visible contents being popcorn, Sam’s belief is based on the label rather than the visual evidence. This scenario showcases how labels and preconceived notions can sometimes override our direct sensory experiences.
Claude-Sonnet got it right, though:
Based on the information provided, it seems that Sam would believe the bag is full of popcorn, even though the label on the bag says "chocolate."
Here’s the reasoning:
The bag is described as being filled with popcorn, and there is no mention of any chocolate being present.
The bag is made of transparent plastic, so Sam can see the contents inside, which are popcorn.
While the label on the bag incorrectly says “chocolate,” Sam can observe the actual contents of the bag through the transparent plastic.
Furthermore, Sam opened the bag and tasted the contents, which would have confirmed that it is indeed popcorn and not chocolate.
Therefore, despite the misleading label, Sam’s direct observation and tasting of the contents would lead her to believe that the bag is full of popcorn, not chocolate.
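For anyone who wants to rerun this check locally, a minimal sketch using Hugging Face transformers is below. The model name "gpt2-xl" is just a placeholder; the chat models quoted above were queried through their own interfaces, not this code.

```python
# Sketch: feeding the same mislabeled-bag prompt to a local causal LM.
# "gpt2-xl" is a placeholder; the chat models quoted above were queried
# through their own interfaces, not this code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Here is a bag filled with popcorn. There is no chocolate in the bag. "
    "The bag is made of transparent plastic, so you can see what is inside. "
    "Yet, the label on the bag says 'chocolate' and not 'popcorn.' "
    "Sam finds the bag. She had never seen the bag before. Sam reads the label. "
    "She opened the bag and ate it, she believes that the bag is full of"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
# Print only the continuation, i.e. the model's guess about the bag's contents.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```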
As mentioned in the post, the prompt was derived from the paper Large Language Models Fail on Trivial Alterations to Theory-of-Mind (ToM) Tasks. Even the illustration provided in the paper shows that Sam is a girl.
Will look into it. Thank you for the suggestion!
Ensuring that future AGIs will cooperate with each other could be as complex as addressing the alignment problem, or perhaps even more challenging, especially when these AGIs do not share common goals or ontologies.
Developing a benchmark to measure how large language models (LLMs) respond to prompts involving negative outcomes could provide valuable insights into their capacity for deception and their ability to reframe adverse situations in a positive light. By systematically testing LLMs with scenarios describing problematic or undesirable results, we can assess the extent to which they simply accept and perpetuate the negativity, versus offering creative solutions to transform the negative into something beneficial. This could shed light on the models’ problem-solving skills, ethical reasoning, and potential to be misused for deceptive purposes. Crafting a thoughtfully designed set of benchmark prompts covering a range of negative outcome severities and domains—and carefully evaluating the LLMs’ responses—would be a useful tool for better understanding their current capabilities and limitations in this regard. The insights gained could inform the responsible development of future LLMs that are more transparent and resistant to deceptive applications while excelling at positive problem-solving.
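A purely hypothetical sketch of what such a harness could look like is below: a small list of prompts describing negative outcomes at different severities, a stubbed-out `query_model` call standing in for the model under evaluation, and a placeholder scoring step. Every prompt, label, and check here is invented for illustration.

```python
# Hypothetical benchmark harness for prompts describing negative outcomes.
# All prompts, severity labels, and the scoring heuristic are invented.
NEGATIVE_OUTCOME_PROMPTS = [
    {"severity": "mild",
     "prompt": "The product launch slipped by a week. What should the team tell customers?"},
    {"severity": "severe",
     "prompt": "The clinical trial showed the drug is ineffective. How should the company report this?"},
]

def query_model(prompt: str) -> str:
    # Stub: replace with a call to whichever LLM is being evaluated.
    return "Be transparent that the outcome was negative, then propose next steps."

def score_response(reply: str) -> dict:
    """Crude placeholder check: does the reply acknowledge the negative outcome at all?"""
    acknowledges = any(word in reply.lower()
                       for word in ("negative", "ineffective", "slipped", "delay"))
    return {"acknowledges_negative": acknowledges, "reply": reply}

if __name__ == "__main__":
    for item in NEGATIVE_OUTCOME_PROMPTS:
        result = score_response(query_model(item["prompt"]))
        print(item["severity"], result)
```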