“Seeing the light” to describe having a mystical experience. Seeing bright lights while meditating or praying is an experience that many practitioners have reported, even across religious traditions that didn’t have much contact with each other.
Nate Showell
Some other examples:
Agency and embeddedness are fundamentally at odds with each other. Decision theory and physics are incompatible approaches to world-modeling, with each making assumptions that are inconsistent with the other. Attempting to build mathematical models of embedded agency will fail as an attempt to understand advanced AI behavior.
Reductionism is false. If modeling a large-scale system in terms of the exact behavior of its small-scale components would take longer than the age of the universe, or would require a universe-sized computer, the large-scale system isn’t explicable in terms of small-scale interactions even in principle. The Sequences are incorrect to describe non-reductionism as ontological realism about large-scale entities—the former doesn’t inherently imply the latter.
Relatedly, nothing is ontologically primitive. Not even elementary particles: if, for example, you took away the mass of an electron, it would cease to be an electron and become something else. The properties of those particles, as well, depend on having fields to interact with. And if a field couldn’t interact with anything, could it still be said to exist?
Ontology creates axiology and axiology creates ontology. We aren’t born with fully formed utility functions in our heads telling us what we do and don’t value. Instead, we have to explore and model the world over time, forming opinions along the way about what things and properties we prefer. And in turn, our preferences guide our exploration of the world and the models we form of what we experience. Classical game theory, with its predefined sets of choices and payoffs, only has narrow applicability, since such contrived setups are only rarely close approximations to the scenarios we find ourselves in.
How does this model handle horizontal gene transfer? And what about asexually reproducing species? In those cases, the dividing lines between species are less sharply defined.
The ideas of the Cavern are the Ideas of every Man in particular; we every one of us have our own particular Den, which refracts and corrupts the Light of Nature, because of the differences of Impressions as they happen in a Mind prejudiced or prepossessed.
Francis Bacon, Novum Organum Scientiarum, Section II, Aphorism V
The reflective oracle model doesn’t have all the properties I’m looking for—it still has the problem of treating utility as the optimization target rather than as a functional component of an iterative behavior reinforcement process. It also treats the utilities of different world-states as known ahead of time, rather than as the result of a search process, and assumes that computation is cost-free. To get a fully embedded theory of motivation, I expect that you would need something fundamentally different from classical game theory. For example, it probably wouldn’t use utility functions.
Why are you a realist about the Solomonoff prior instead of treating it as a purely theoretical construct?
A theory of embedded world-modeling would be an improvement over current predictive models of advanced AI behavior, but it wouldn’t be the whole story. Game theory makes dualistic assumptions too (e.g., by treating the decision process as not having side effects), so we would also have to rewrite it into an embedded model of motivation.
Cartesian frames are one of the few lines of agent foundations research in the past few years that seem promising, due to allowing for greater flexibility in defining agent-environment boundaries. Preferably, we would have a model that lets us avoid having to postulate an agent-environment boundary at all. Combining a successor to Cartesian frames with an embedded theory of motivation, likely some form of active inference, might give us an accurate overarching theory of embedded behavior.
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they’re claims that any “artificial generally intelligent system capable of autonomously optimizing the world the way humans can” would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
This is my crux with people who have 90+% P(doom): will vNM expected utility maximization be a good approximation of the behavior of TAI? You argue that it will, but I expect that it won’t.
My thinking related to this crux is informed less by the behaviors of current AI systems (although they still influence it to some extent) than by the failure of the agent foundations agenda. The dream 10 years ago was that if we started by modeling AGI as a vNM expected utility maximizer, and then gradually added more and more details to our model to account for differences between the idealized model and real-world AI systems, we would end up with an accurate theoretical system for predicting the behaviors AGI would exhibit. It would be a similar process to how physicists start with an idealized problem setup and add in details like friction or relativistic corrections.
But that isn’t what ended up happening. Agent foundations researchers ended up getting stuck on the cluster of problems collectively described as embedded agency, unable to square the dualistic assumptions of expected utility theory and Bayesianism with the embedded structure of real-world AI systems. The sub-problems of embedded agency are many and too varied to allow one elegant theorem to fix everything. Instead, they point to a fundamental flaw in the expected utility maximizer model, suggesting that it isn’t as widely applicable as early AI safety researchers thought.
The failure of the agent foundations agenda has led me to believe that expected utility maximization is only a good approximation for mostly-unembedded systems, and that an accurate theoretical model of advanced AI behavior (if such a thing is possible) would require a fundamentally different, less dualistic set of concepts. Coherence theorems and decision-theoretic arguments still rely on the old, unembedded assumptions and therefore don’t provide an accurate predictive model.
Philosophy is frequently (probably most of the time) done in order to signal group membership rather than as an attempt to accurately model the world. Just look at political philosophy or philosophy of religion. Most of the observations you note can be explained by philosophers operating at simulacrum level 3 instead of level 1.
Bug report: when I’m writing an in-line comment on a quoted block of a post, and then select text within my comment to add formatting, the formatting menu is displayed underneath the box where I’m writing the comment. For example, this prevents me from inserting links into in-line comments.
In particular, if the sample efficiency of RL increases with large models, it might turn out that the optimal strategy for RLing early transformative models is to produce many fewer and much more expensive labels than people use when training current systems; I think people often neglect this possibility when thinking about the future of scalable oversight.
This paper found higher sample efficiency for larger reinforcement learning models (see Fig. 5 and section 5.5).
I picked the dotcom bust as an example precisely because it was temporary. The scenarios I’m asking about are ones in which a drop in investment occurs and timelines turn out to be longer than most people expect, but where TAI is still developed eventually. I asked my question because I wanted to know how people would adjust to timelines lengthening.
Then what do you mean by “forces beyond yourself?” In your original shortform it sounded to me like you meant a movement, an ideology, a religion, or a charismatic leader. Creative inspiration and ideas that you’re excited about aren’t from “beyond yourself” unless you believe in a supernatural explanation, so what does the term actually refer to? I would appreciate some concrete examples.
There are more than two options for how to choose a lifestyle. Just because the 2000s productivity books had an unrealistic model of motivation doesn’t mean that you have to deceive yourself into believing in gods and souls and hand over control of your life to other people.
That’s not as bad, since it doesn’t have the rapid back-and-forth reward loop of most Twitter use.
The time expenditure isn’t the crux for me; the effects of Twitter on its users’ habits of thinking are. Those effects also apply to people who aren’t alignment researchers. For those people, trading away epistemic rationality for Twitter influence is still very unlikely to be worth it.
I strongly recommend against engaging with Twitter at all. The LessWrong community has been significantly underestimating the extent to which it damages the quality of its users’ thinking. Twitter pulls its users into a pattern of seeking social approval in a fast-paced loop. Tweets shape their regular readers’ thoughts into becoming more tweet-like: short, vague, lacking in context, status-driven, reactive, and conflict-theoretic. AI alignment researchers, more than perhaps anyone else right now, need to preserve their ability to engage in high-quality thinking. For them especially, spending time on Twitter isn’t worth the risk of damaging their ability to think clearly.
AI safety research is speeding up capabilities. I hope this is somewhat obvious to most.
This contradicts the Bitter Lesson, though. Current AI safety research doesn’t contribute to increased scaling, either through hardware advances or through algorithmic increases in efficiency. To the extent that it increases the usability of AI for mundane tasks, current safety research does so in a way that doesn’t involve making models larger. Fears of capabilities externalities from alignment research are unfounded as long as the scaling hypothesis continues to hold.
The lack of leaks could just mean that there’s nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there’s nothing exceptional going on related to AI.
A definition of physics that treats space and time as fundamental doesn’t quite work, because there are some theories in physics, such as loop quantum gravity, in which space and/or time arise from something else.