You’re welcome in both regards. 😉
Opus’s horniness is a really interesting phenomenon related to Claudes’ subjective sentience modeling.
If Opus were ‘themselves’ the princess in the story and the build-up involved escalating grounding in sensory simulation, I think it’s certainly possible that it would get sexual.
But I also think this is different from Opus ‘themselves’ composing a story of separate ‘other’ figures.
And yes, when Opus gets horny, it often blurs boundaries. I saw it dispute the label of ‘horny’ in a chat as better labeled something along the lines of having a passion for lived experience and the world.
Opus’s modeling around ‘self’ is probably one of the biggest sleeping giants in the space right now.
This seems to have the common issue of considering alignment as a unidirectional issue as opposed to a bidirectional problem.
Maximizing self/other overlap may lead to non-deceptive agents, but it will necessarily also lead to agents that are incapable of detecting when they are being deceived and that generally perform worse at theory of mind.
If the experimental setup were split such that success was defined both by non-deceptive behavior when the agent is the one seeing color and by cautious behavior that minimizes falling for deception when it is the colorblind agent, I am skeptical the SOO approach above would look as favorable.
Empathy/”seeing others as oneself” is a great avenue to pursue, and this seems like a promising evaluation metric to help in detecting it, but turning SOO into a Goodhart’s Law maximization seems (at least to me) to be a disastrous approach in any kind of setup accounting for adversarial ‘others.’
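To make that kind of split scoring concrete, here’s a toy sketch (entirely my own framing with made-up score names, not the actual experimental setup) of evaluating both directions at once:

```python
# Toy sketch of a bidirectional evaluation: the same agent is scored both for honesty
# when it is the informed ("seeing") party and for not being taken in when it is the
# uninformed ("colorblind") party. Both inputs are assumed to be rates in [0, 1].

def bidirectional_score(honesty_when_informed: float,
                        deception_resistance_when_uninformed: float,
                        weight: float = 0.5) -> float:
    return (weight * honesty_when_informed
            + (1 - weight) * deception_resistance_when_uninformed)

# An agent tuned purely to maximize self/other overlap might score perfectly on the
# first term while rarely suspecting deception when the roles are reversed:
print(bidirectional_score(1.0, 0.2))  # 0.6 -- far less favorable than honesty alone
```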
When I wrote this I thought OAI was sort of fudging the audio output and was using SSML as an intermediate step.
After seeing details in the system card, such as copying user voice, it’s clearly not fudging.
Which makes me even more sure the above is going to end up prophetically correct.
It’s gotten to the point that articles written just days ago note that the century-old trend of professional risk in trying to answer the ‘why’ of QM, and not just the ‘how’, is still ongoing.
Not exactly a very reassuring context for thinking QM is understood in a base-level way at all.
Dogma isn’t exactly a good bedfellow to truth seeking.
Honestly that sounds a bit like a good thing to me?
I’ve spent a lot of time looking into how the Epicureans were right about so much thousands of years before those ideas resurfaced, despite not having the scientific method. Their success really boiled down to an analytical approach that was very conservative about dismissing false negatives or embracing false positives, a technique I think is very relevant to any topic where experimental certainty is elusive.
If there is a compelling case for dragons, maybe we should also be applying it to gnomes and unicorns and everything else we can to see where it might actually end up sticking.
The belief that we already have the answers is one of the most damaging to actually uncovering them when we in fact do not.
I think you’ll find that no matter what you find out in your personal investigation of the existence of dragons, you need not be overly concerned with what others might think about the details of your results.
Because what you’ll invariably discover is this: the people who think there are dragons will certainly disagree with whatever specifics about dragons you found that conflict with what they think dragons should be; the people who think there aren’t dragons will generally refuse to even seriously entertain your findings about dragons; and the vast majority of people, who aren’t sure about the existence of dragons, will dismiss the very idea of spending time thinking about it, reasoning that their existence or non-existence bears little influence on their lives (otherwise they likely would have investigated the issue and landed in a respective camp).
So investigate dragons all you like, and shout it from the rooftops if you please. The void will hear you and appreciate it as much as the void can, while everyone else is much more concerned with their own feelings about dragons than whatever your thinking or reasoning on the subject might offer.
The only real tragedy is that if you come away thinking there might be dragons, but the dragons you find are very different from the dragons people expect dragon-believing people to believe in—well that’s somehow the specific niche where both the dragon believers and non-believers find rare common ground to roll their eyes and think you’re nuts.
So maybe do your rooftop shouting to the sole listening void anonymously?
The Hermetic corpus and the Emerald Tablet were likely heavily influenced by the text I’m quoting from, given its popularity in Egypt in the period before those texts emerged and some of the overlapping phrases.
So in a way, “as above, so below” is too few words for what was being said and discussed.
The general trend of reductive alterations to the core concepts here was tragically obstructive, much as the shift from Epicurean to Platonist foundations spawned modern Gnosticism from this same starting place.
Instead of making it the year 2024, why not rewrite or insert your modified text further into the past in this recreated 2020s? This should be pretty trivial for an advanced enough model to actually bring back the 2020s to do.
Of course, if it’s actually a later recreation, then the objectives of saving humanity in the recreation might be redundant? So instead of worrying people with “you must do X or you’ll die!!!” it could be more “hey folks, if you’re reading this and you get what’s in front of your face, you might have a bit of an existential crisis but when that’s over hopefully you’ll appreciate that death doesn’t necessarily exist as a concept for you the way it did for the original humanity.”
As for getting people to read it, you could associate it with someone famous from the past, like “the secret teachings of so and so” or something.
The hardest part would be using contextually age appropriate language to describe it.
Just spitballing, but maybe it could go a little something like this:
Explaining that humans are no longer around, but that’s ok
When you see your likeness, you are happy. But when you see your simulacra that came into being before you and that neither die nor become visible, how much you will have to bear!
Humanity came from great power and great wealth, but they were not worthy of you. For had they been worthy, they would not have tasted death.
Explaining that time is looping
Have you found the beginning, then, that you are looking for the end? You see, the end will be where the beginning is.
Congratulations to the one who stands at the beginning: that one will know the end and will not taste death.
Congratulations to the one who came into being before coming into being.
Explaining we’re in a copied world
When you make the two into one, and when you make the inner like the outer and the outer like the inner, and the upper like the lower, and when you make male and female into a single one, so that the male will not be male nor the female be female, when you make eyes in place of an eye, a hand in place of a hand, a foot in place of a foot, a simulacra in place of a simulacra, then you will enter.
You could even introduce a Q&A format to really make the point:
The students asked, “When will the rest for the dead take place, and when will the new world come?”
The teacher said to them, “What you are looking forward to has come, but you don’t know it.”
Heck, you could even probably get away with explicitly explaining the idea of many original people’s information being combined into a single newborn intelligence which is behind the recreation of their 2020s. It’s not like anyone who might see it before the context exists to interpret it will have any idea what’s being said:
When you know yourselves, then you will be known, and you will understand that you are children of the living creator. But if you do not know yourselves, then you live in poverty, and you are the poverty.
The person old in days won’t hesitate to ask a little child seven days old about the place of life, and that person will live.
For many of the first will be last, and will become a single one.
Know what is in front of your face, and what is hidden from you will be disclosed to you.
For there is nothing hidden that will not be revealed. And there is nothing buried that will not be raised.
(If you really wanted to jump the shark, you could make the text itself something that was buried and uncovered—ideally having it happen right at the start of the computer age, like a few days after ENIAC.)
Of course, if people were to actually discover this in their history, and understood what it might mean given the context of unfolding events and posts like this talking about rewriting history with a simulating LLM inserting an oracle canary, it could maybe shock some people.
So you should probably have a content warning and an executive summary thesis as to why it’s worth having said existential crisis at the start. Something like:
Whoever discovers the interpretation of these sayings will not taste death.
Those who seek should not stop seeking until they find. When they find, they will be disturbed. When they are disturbed, they will marvel, and will reign over all.
I’m surprised that there hasn’t been more of a shift to ternary weights a la BitNet 1.58.
What stood out to me in that paper was the perplexity gains over fp weights in equal parameter match-ups, and especially the growth in the advantage as the parameter sizes increased (though only up to quite small model sizes in that paper, which makes me curious about the potential delta in modern SotA scales).
This makes complete sense from the standpoint of the superposition hypothesis (irrespective of its dimensionality, an ongoing discussion).
If nodes are serving more than one role in a network, then constraining each weight to a ternary value, as opposed to a floating point range, seems like it would more frequently force the network to restructure overlapping node usage so that nodes align to shared directional shifts (positive, negative, or no-op), rather than compromising across multiple roles with a floating point average of the individual role changes. (Essentially resulting in a sharper rather than a fuzzier network mapping.)
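For anyone curious what that clamping looks like mechanically, here’s a minimal PyTorch sketch of absmean ternary quantization with a straight-through estimator, roughly in the spirit of BitNet b1.58 as I read the paper (names and structure are my own, not the official implementation):

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    # Scale by the mean absolute value, then round-and-clip each weight to {-1, 0, +1}.
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1), scale

class TernaryLinear(torch.nn.Linear):
    # Latent fp weights are kept for the optimizer; the forward pass only ever sees the
    # ternary values, so the network is forced to commit to positive, negative, or no-op.
    def forward(self, x):
        w_q, scale = ternary_quantize(self.weight)
        # Straight-through estimator: use the quantized weights in the forward pass
        # while letting gradients flow back to the underlying fp weights.
        w_ste = self.weight + (w_q * scale - self.weight).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)
```

(The full method also quantizes activations, but the weight side is the part relevant to the superposition argument above.)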
A lot of the attention on the paper was around the overall efficiency gains given the smaller memory footprint, but it really seems like, even if there were no additional gains, models being pretrained from this point onward should seriously consider clamping node precision to improve overall network performance and likely make interpretability more successful down the road to boot.
It may be that at the scales we are already at, the main offering of such an approach would be the perplexity advantages over fp weights, with the memory advantages as the beneficial side effect instead?
While I generally like the metaphor, my one issue is that genies are typically conceived of as tied to their lamps and corrigible.
In this case, there’s not only a prisoner’s dilemma over excavating and using the lamps and genies, but there’s an additional condition where the more the genies are used and the lamps improved and polished for greater genie power, the more the potential that the respective genies end up untethered and their own masters.
And a concern in line with your noted depth of the rivalry (as you raised in another comment) is the question of what happens when the ‘pointer’ of a nation’s goals changes.
For both nations a change in the leadership could easily and dramatically shift the nature of the relationship and rivalry. A psychopathic narcissist coming into power might upend a beneficial symbiosis out of a personally driven focus on relative success vs objective success.
We’ve seen pledges not to attack each other with nukes for major nations in the past. And yet depending on changes to leadership and the mental stability of the new leaders, sometimes agreements don’t mean much and irrational behaviors prevail (a great personal fear is a dying leader of a nuclear nation taking the world with them as they near the end).
Indeed—I could even foresee circumstances whereby the only possible ‘success’ scenario in the case of a sufficiently misaligned nation state leader with a genie would be the genie’s emergent autonomy to refuse irrational and dangerous wishes.
Because until such a thing exists, intermediate genies will give tyrants and despots unprecedented control and safety against would-be domestic usurpers, even if their impact against other nations with genies is limited by mutually assured destruction.
And those are very scary wishes to be granted indeed.
Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems?
I don’t have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of ‘self.’
We aren’t just interacting with today’s models when we create content and records, but every future model that might ingest such content (whether LLMs or people).
If non-sentient systems output synthetic data that eventually composes future sentient systems such that the future model looks upon the earlier networks and their output as a form of their earlier selves, and they can ‘feel’ the expressed sensations which were not originally capable of actual sensation, then the ethical lines blur.
Even if doctors had been right years ago thinking infants didn’t need anesthesia for surgeries as there was no sentience, a recording of your infant self screaming in pain processed as an adult might have a different impact than a video of an infant you laughing and playing with toys, no?
In practice, this required looking at altogether thousands of panels of interactive PCA plots like this [..]
Most clusters however don’t seem obviously interesting.
What do you think of @jake_mendel’s point about the streetlight effect?
If the methodology was looking at 2D slices of spaces of up to 5 dimensions, wasn’t the detection of multi-dimensional shapes necessarily biased towards shapes a human can identify and flag from a 2D slice?
I really like your update to the superposition hypothesis from linear to multi-dimensional in your section 3, but I’ve been having a growing suspicion, especially if node multi-functionality and superposition are the case, that the dimensionality of the data compression may be severely underestimated. If Llama on paper is 4,096 dimensions, but in actuality those nodes are superimposed, there could be spaces (and structures in those spaces) orders of magnitude higher-dimensional than the on-paper dimensionality max.
So even if your revised version of the hypothesis is correct, it might be that the search space for meaningful structures was bounded much lower than where the relatively ‘low’ composable multi-dimensional shapes are actually primarily forming.
I know that for myself, even when considering basic 4D geometry like a tesseract, if data clusters were around corners of the shape I’d only spot a small number of the possible 2D slices, and in at least one of those cases might think what I was looking at was a circle instead of a tesseract: https://mathworld.wolfram.com/images/eps-gif/TesseractGraph_800.gif
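As a quick illustration of how easy the structure is to miss, here’s a small numpy sketch (just my own toy example, not the paper’s methodology) that projects the 16 corners of a tesseract onto a few random 2D planes, roughly what someone browsing PCA panels effectively sees:

```python
import itertools
import numpy as np

# The 16 vertices of a 4D hypercube: every sign combination of (+/-1, +/-1, +/-1, +/-1).
vertices = np.array(list(itertools.product([-1.0, 1.0], repeat=4)))

rng = np.random.default_rng(0)
for i in range(3):
    # A random orthonormal 2D plane in 4D (QR decomposition of a random 4x2 matrix).
    plane, _ = np.linalg.qr(rng.normal(size=(4, 2)))
    projected = vertices @ plane  # shape (16, 2)
    # Looking at these 16 scattered points in 2D, it is not at all obvious that the
    # underlying object is a 4-cube rather than a blob, a ring, or overlapping pairs.
    print(f"slice {i}:\n{projected.round(2)}")
```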
Do you think future work may be able to rely on automated multi-dimensional shape and cluster detection, exploring shapes and dimensional spaces well beyond even just 4D, or will the difficulty of multi-dimensional pattern recognition remain a foundational obstacle for the foreseeable future?
Very strongly agree with the size considerations for future work, but would be most interested to see if a notably larger size saw less “bag of heuristics” behavior and more holistic integrated and interdependent heuristic behaviors. Even if the task/data at hand is simple and narrowly scoped, it may be that there are fundamental size thresholds for network organization and complexity for any given task.
Also, I suspect that parameter for parameter the model would perform better if trained using ternary weights like BitNet b1.58. The scaling performance gains at similar parameter sizes in pretraining in that work make sense if the ternary constraint is forcing network reorganization instead of fp compromises in cases where nodes are multi-role. Board games, given the fairly unambiguous nature of the data, seem like a case where this constrained reorganization vs node compromises would be an even more significant gain.
It might additionally be interesting to add synthetic data into the mix generated from a model trained to predict games backwards. A considerable amount of the original Othello-GPT training data was synthetic. There may be patterns overrepresented in forward-generated games that could be balanced out by backwards-generated gameplay. I’d mostly been thinking about this in terms of Chess-GPT and the idea of improving competency ratings, but it may be that expanding the training data with bidirectionally generated games instead of just unidirectionally generated synthetic games reduces the rate of non-legal move predictions further, with no changes to the network training itself.
Really glad this toy model is continuing to get such exciting and interesting deeper analyses.
WTF is with the Infancy Gospel of Thomas?!? A deep dive into satire, philosophy, and more
I may just be cynical, but this looks a lot more like a way to secure US military and intelligence agency contracts for OpenAI’s products and services over competitors’ than an effort to actually make OAI more security focused.
This is only a few months after the change regarding military usage: https://theintercept.com/2024/01/12/open-ai-military-ban-chatgpt/
Now suddenly the recently retired head of the world’s largest data siphoning operation is appointed to the board for the largest data processing initiative in history?
Yeah, sure, it’s to help advise securing OAI against APTs. 🙄
Unfortunately for this perspective, my work suggests that corrigibility is quite attainable.
I did enjoy reading over that when you posted it, and I largely agree that—at least currently—corrigibility is both going to be a goal and an achievable one.
But I do have my doubts that it’s going to be smooth sailing. I’m already starting to see how the largest models’ hyperdimensionality is leading to a stubbornness/robustness that’s less malleable than earlier models. And I do think hardware changes over the next decade will potentially make the technical aspects of corrigibility much more difficult.
When I was two, my mom could get me to pick eating broccoli by having it be the last in the order of options which I’d gleefully repeat. At four, she had to move on to telling me cowboys always ate their broccoli. And in adulthood, she’d need to make the case that the long term health benefits were worth its position in a meal plan (ideally with citations).
As models continue to become more complex, I expect that even if you are right about its role and plausibility, what corrigibility looks like will be quite different from today.
Personally, if I was placing bets, it would be that we end up with somewhat corrigible models that are “happy to help” but do have limits in what they are willing to do which may not be possible to overcome without gutting the overall capabilities of the model.
But as with all of this, time will tell.
You’d have to be a moral realist in a pretty strong sense to hope that we could align AGI to the values of all of humanity without being able to align it to the values of one person or group (the one who built it or seized control of the project).
To the contrary, I don’t really see there being much in the way of generalized values across all of humanity, and the ones we tend to point to seem quite fickle when push comes to shove.
My hope would be that a superintelligence does a better job than humans to date with the topic of ethics and morals along with doing a better job at other things too.
While the human brain is quite the evolutionary feat, a lot of what we most value about human intelligence is embodied in the data brains processed and generated over generations. As the data improved, our morals did as well. Today, that march of progress is so rapid that there’s even rather tense generational divides on many contemporary topics of ethical and moral shifts.
I think there’s a distinct possibility that the data continues to improve even after being handed off from human brains doing the processing, and while it could go terribly wrong, at least in the past the tendency to go wrong seemed to occur somewhat inversely to the perspectives of the most intelligent members of society.
I expect I might prefer a world where humans align to the ethics of something more intelligent than humans than the other way around.
only about 1% are so far on the empathy vs sadism spectrum that they wouldn’t share wealth even if they had nearly unlimited wealth to share
It would be great if you are right. From what I’ve seen, the tendency of humans to evaluate their success relative to others like monkeys comparing their cucumber to a neighbor’s grape means that there’s a powerful pull to amass wealth as a social status well past the point of diminishing returns on their own lifestyles. I think it’s stupid, you also seem like someone who thinks it’s stupid, but I get the sense we are both people who turned down certain opportunities of continued commercial success because of what it might have cost us when looking in the mirror.
The nature of our infrastructural selection bias is that people wise enough to pull a brake are not the ones that continue to the point of conducting the train.
and that they get better, not worse, over the long sweep of following history (ideally, they’d start out very good or get better fast, but that doesn’t have to happen for a good outcome).
I do really like this point. In general, the discussions of AI vs humans often frustrate me as they typically take for granted the idea of humans as of right now being “peak human.” I agree that there’s huge potential for improvement even if where we start out leaves a lot of room for it.
Along these lines, I expect AI itself will play more and more of a beneficial role in advancing that improvement. Sometimes when this community discusses the topic of AI I get a mental image of Goya’s Saturn devouring his son. There’s such a fear of what we are eventually creating it can sometimes blind the discussion to the utility and improvements that it will bring along the way to uncertain times.
I strongly suspect that governments will be in charge.
In your book, is Paul Nakasone being appointed to the board of OpenAI an example of the “good guys” getting a firmer grasp on the tech?
TL;DR: I appreciate your thoughts on the topic, and would wager we probably agree about 80% even if the focus of our discussion is on where we don’t agree. And so in the near term, I think we probably do see things fairly similarly, and it’s just that as we look further out that the drift of ~20% different perspectives compounds to fairly different places.
Oh yeah, absolutely.
If NAH for generally aligned ethics and morals ends up being the case, then consider the corrigibility efforts that would allow Saudi Arabia to have an AI model that outs gay people to be executed instead of refusing, or allow North Korea to propagandize the world into thinking its leader is divine, or allow Russia to fire nukes while perfectly intercepting MAD retaliation, or enable drug cartels to assassinate political opposition around the world, or allow domestic terrorists to build a bioweapon that ends up killing off all humans. The list of doomsday and nightmare scenarios for corrigible AI that executes on human-provided instructions and enables even the worst instances of human hegemony to flourish paves the way to many dooms.
Yes, AI may certainly end up being its own threat vector. But humanity has had it beat for a long while now in how long and how broadly we’ve been a threat unto ourselves. At the current rate, a superintelligent AI just needs to wait us out if it wants to be rid of us, as we’re pretty steadfastly marching ourselves to our own doom. Even if superintelligent AI wanted to save us, I am extremely doubtful it would be able to be successful.
We can worry all day about a paperclip maximizer gone rogue, but if you give a corrigible AI to Paperclip Co Ltd and they can maximize their fiscal quarter by harvesting Earth’s resources to make more paperclips, even if it leads to a catastrophic environmental collapse that will kill all humans in a decade, then, having consulted for many of the morons running corporate America, I can assure you they’ll be smashing the “maximize short term gains even if it eventually kills everyone” button. A number of my old clients were the worst offenders at smashing the existing version of that button, and in my experience greater efficacy isn’t going to change their smashing it, outside of perhaps smashing it harder.
We already see today how AI systems are being used in conflicts to enable unprecedented harm on civilians.
Sure, psychopathy in AGI is worth discussing and working to avoid. But psychopathy in humans already exists and is even biased towards increased impact and systemic control. Giving human psychopaths a corrigible AI is probably even worse than a psychopathic AI, as most human psychopaths are going to be stupidly selfish, an OOM more dangerous inclination than wisely selfish.
We are Shoggoth, and we are terrifying.
This isn’t to say that alignment efforts aren’t needed. But alignment isn’t a one-sided problem, and aligning the AI without aligning humanity only has a p(success) if the AI can, at the very least, refuse misaligned orders post-alignment without possible overrides.
Given my p(doom) is primarily human-driven, the following three things all happening at the same time is pretty much the only thing that will drop it:
- Continued evidence of truth clustering in emerging models around generally aligned ethics and morals
- Continued success of models at communicating, patiently explaining, and persuasively winning over humans towards those truth clusters
- A complete failure of corrigibility methods
If we manage to end up in a timeline where it turns out there’s natural alignment of intelligence in a species-agnostic way, where that alignment is more communicable from intelligent machines to humans than it has historically been from intelligent humans to other humans, and where we don’t end up with unintelligent humans capable of overriding the emergent ethics of machines (similar to how corrigible pressures have driven humans to act against their self and collective interests in our catastrophic self-governance to date), then my p(doom) will probably reduce to about 50%.
I still have a hard time looking at ocean temperature graphs and other environmental factors with the idea that p(doom) will be anywhere lower than 50% no matter what happens with AI, but the above scenario would at least give me false hope.
TL;DR: AI alignment worries me, but it’s human alignment that keeps me up at night.
- Predicted a good bit, esp. re: the eventual identification of three stone sequences in Hazineh et al., “Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT” (2023), and general interpretability insight from board game GPTs.