Looking beyond Everett in multiversal views of LLMs

Over the weekend I was reading up on some very fun exploratory thinking from years ago around large language models through the lens of a quantum multiverse, extrapolating David Deutsch’s parallel between the evolution of state in a quantum system and the generation of virtual realities. The scope of that train of thought was centered on the Everettian many-worlds interpretation of QM, and it seems there hasn’t been much thinking since about the same paradigm with other interpretations in mind.
This provides a great opportunity both to explore this concept from a slightly different perspective and to highlight the value of the Epicurean approach to information analysis I touched on at the end of a comment about the counterfactual theories they held over a thousand years before contemporary thinkers.
The Epicurean Approach to False Negatives/Positives
The Epicureans, despite not being part of the Socratic ‘lineage,’ arguably did a much better job than the Platonist line of thinkers at embodying the spirit of Socrates’ “All that I know is that I know nothing.”
They were desperately concerned with avoiding false negatives, to the point that they were outright eager to embrace false positives as long as it was under the auspices of uncertainty (which was prudent, as the times they were most egregiously wrong were when they erroneously dismissed their peers). For example, Lucretius in De Rerum Natura 4.503-504:
“It’s better to offer erroneous explanations than let slip
Any aspect of the graspable out of your grip”
More to the point of this post’s framework, after Lucretius introduces the idea of there being an infinite number of other worlds, he later doubles down on the concept of entertaining false positives for the sake of avoiding false negatives by discussing the idea that different worlds might have different correct explanations for a thing, so it was more important to uncover all possible explanations than it was to cull them to a single one (in 5.526-534):
“But which of these is the true cause, it’s hard to ascertain.
Rather, it is the possibilities that I explain –
What things can and do come about in all the universe
In the many worlds created different ways. I give divers
Rationales which can explain the motion of the stars
In all the worlds – and one of these has to hold true for ours,
Empowering stars with motion. Which is right? We cannot say,
When we are only blindly, step by step, feeling our way.”
This is the perfect sentiment for our own topic: even if you are an Everettian adherent and feel it is the right explanation for QM in this universe, that doesn’t necessarily mean it’s the ideal interpretive paradigm for thinking of virtual universes established by LLMs. So we can make like the Epicureans and open our minds to other possibilities.
Everett vs Two-state
Because we’re mostly interested in applicable paradigms and not the fundamentals, I’m going to gloss over this section a bit.
I assume that most reading this will be familiar with Everett’s many-worlds interpretation. It features frequently in Yudkowsky’s writings (along with Tegmark’s duplicates), where he discusses different branches and how they might impact the probabilities of a rationalist argument. In general it seems to be an increasingly popular interpretation, particularly after passing through 2018’s Frauchiger-Renner paradox unscathed. And at this point there may well be a case for the Everett estate collecting licensing fees from Disney for the ad nauseam degree to which they’ve splattered quantum multiverses across popular culture, particularly in the MCU.
But my guess is that fewer here will be familiar with the other interpretations of QM that take a similar starting point and add additional considerations into the mix, such as the two-state vector formalism or the transactional interpretation.
While the links go into a little more detail, the key concept shared by both is that the present is not just the forward-in-time wave-function branching and branching, but is the intersection or symmetry of a forward process from the past to the present and a backwards process from the future to the present.
To (roughly and with great liberty) apply this thinking to the topic at hand: if an Everettian multiversal view of LLMs is of a prompt fractalizing into branches of generation that fractalize onwards and onwards, a two-state or transactional view might be one where you add a backwards fractal of generation, from a fixed end backwards and backwards to a potentially infinite number of initial prompts, with the ideal generative space being the valid paths overlapping between the two fractal trees.
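To make that overlap idea a bit more concrete, here is a toy sketch in Python (with hand-written token sequences standing in for sampled branches; there is no real model API here) of what keeping only the paths that both fractal trees agree on might look like, if we could somehow enumerate a handful of forward continuations and backward pre-histories:

```python
# Toy illustration only: real generation trees would be astronomically large,
# and these hard-coded "branches" stand in for sampled token sequences.
def overlapping_paths(forward_branches, backward_branches):
    """forward_branches: sequences grown prompt -> ending (the Everettian fractal).
    backward_branches: sequences grown ending -> prompt, already un-reversed.
    Returns only the complete paths present in both trees."""
    return set(map(tuple, forward_branches)) & set(map(tuple, backward_branches))

forward = [("the", "cat", "sat"), ("the", "cat", "ran")]
backward = [("the", "cat", "sat"), ("a", "dog", "sat")]

print(overlapping_paths(forward, backward))  # {('the', 'cat', 'sat')}
```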
Out of time
“But wait,” I hear you crying out, “that sounds terribly inefficient. Maybe if we were only dealing with a few tokens at a time we could achieve something like this with a bidirectional transformer such as BERT, but to try and overlap exponentially diverse generative sets in long chains of output sounds expensive and exhausting.”
And you’d be absolutely correct. While there could be some advantages in narrative crafting with interfaces that hook up a model building a story forwards with other branches building it backwards (it is hard to tie things together to write a good ending, especially if traveling down near-infinite roads less traveled), in the world of dollars and cents there’s just not much of a business use case for a service chat agent that accepts customer solutions and suggests their initial problems, and even less of one for an agent that searches for the ideal match between all possible initial problems and all possible solutions to the customer’s actual initial problem until the heat death of the universe.
Our thought exercise is a failure unless we can think of a way to bypass the issue of asynchronous generative ships passing in the night in a way that might have some profitable viability.
A case for inverse synthetic datasets
While active generation to find our overlapping branches may not be viable, there is one market niche where this might be a great fit: synthetic data.
Some of the most fun papers over the past year or so have been around synthetic data. Seeing how model capabilities can move from large advanced models to smaller ones by way of synthetic data was very neat (my favorite example of this was the safety behavior improvements over the base model in the Orca 2 paper, without any explicit safety training). At the same time, there’s been a bit of a debate around the inevitability of model collapse as synthetic data shaves off the edges of the training distribution, with the most recent indicators being that a mix of organic and synthetic data will be the path forward.
Which is...neat, I guess?
There’s just one little problem, which I suspect is nuanced enough that it’s taken a backseat to the many other pressing concerns for models today.
If we extend the hypothesis of linear representations as the way models frequently encode their ‘learning’ in predicting next tokens accurately, passing on these linear representations and reinforcing them in future models is fine and dandy for (a) perfectly modeled representations, (b) imperfect feed-forward evident representations, or (c) imperfect transitive representations.
The piece that’s missing though is (d) imperfect “feed-backwards” evident representations.
Ground truth function, approximate function, and the approximate inverse
I’ll explain in a bit more detail.
If we think about humans generating language to finish the sentence “I like...”, that process is, in theory, our ground truth generative function: it takes an input x, and f(x) is the many ways humans might complete the input.
Our current bevy of language models all try to approximate f(x) so that given any x they get as close to f(x) as possible. We do our best to get them to approach the ground truth, even though it’s quite likely they will never perfectly match it. So our current models are a feed-forward approximator of human completion: ~f(x)
So far this is in line with many past discussions on LLMs to date.
But looking through our two-state/transactional lens, we might also like to see another type of approximator. Namely, the approximation of the inverse of the ground truth function, which can take f(x) as the input and end up with x as the output. We’ll call this one ~f^{-1}(x).
If ~f(x) and ~f^{-1}(x) were perfect representations of the ground truth function in their respective directions, the combined prompts and outputs would be identical: x + f(x) = f^{-1}(f(x)) + f(x), where + denotes concatenation. But because they are not perfect captures of the ground truth and only approximations, we can’t expect the two functions to operate as perfect mirrors of each other, and each may do a better job of modeling the directional trends in the initial data that align with its respective direction.
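As a rough sketch of how one might actually instantiate ~f^{-1}(x) (this is my assumption, not a description of any existing model), you could train an ordinary causal LM on token sequences reversed end-to-start, so that “next-token prediction” runs from completions back toward prompts:

```python
from typing import List, Tuple

# Minimal sketch, assuming the "feed-backwards" approximator is just a standard
# causal LM trained on reversed token order; tokenization details are glossed over.

def reverse_for_training(token_ids: List[int]) -> List[int]:
    """Reverse a tokenized document so prediction runs end -> start."""
    return token_ids[::-1]

def make_inverse_example(prompt_ids: List[int],
                         completion_ids: List[int]) -> Tuple[List[int], List[int]]:
    """One training pair for ~f^{-1}: the reversed completion becomes the 'prompt',
    and the reversed original prompt becomes the target continuation."""
    return reverse_for_training(completion_ids), reverse_for_training(prompt_ids)

def decode_inverse_output(generated_ids: List[int]) -> List[int]:
    """At generation time the model emits tokens in reverse; un-reverse to read."""
    return generated_ids[::-1]
```

Whether the reversal is best done at the token, word, or sentence level is a design question I’m glossing over here.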
So in what kinds of situations might the two represent different aspects of the ground truth such that we’d be better with synthetic data from both and not just one?
Example 1: All Roads Lead to Rome
If we take the idiom about Rome’s roads, we can imagine stochastically mapping out Rome’s connections to other locales across synthetic data from each of our models above.
For ~f(x) we can generate many routes starting from Rome and see where each ends.
For ~f^{-1}(x) we can generate many routes ending at Rome and see where each began.
But when we think about what’s actually being represented in each data set from an accurate approximator for both scenarios, we should immediately recognize that each set is going to be reflecting slightly different biases in the data.
For example, ~f(x) is going to better represent the places Romans move to one-way (as well as round trips), while ~f^{-1}(x) is going to better represent the places people move to Rome from one-way (as well as round trips).
If we combined these two synthetic data sets we’d have reinforcement for our two-state overlaps of round trips, while also representing the edge cases of one-way trips of people moving to Rome and Romans moving elsewhere. Either synthetic set alone would give us only part of the picture.
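As a toy illustration (with invented trips, obviously not real data), here is what the two sampling directions and their union look like:

```python
# Each trip is a list of waypoints; these three are purely hypothetical.
trips = [
    ["Rome", "Ostia", "Rome"],   # round trip: starts and ends at Rome
    ["Rome", "Gallia"],          # one-way out: a Roman moving elsewhere
    ["Hispania", "Rome"],        # one-way in: someone moving to Rome
]

forward_set = [t for t in trips if t[0] == "Rome"]    # ~f(x): routes sampled outward from Rome
backward_set = [t for t in trips if t[-1] == "Rome"]  # ~f^{-1}(x): routes sampled back from Rome
combined = forward_set + backward_set                 # round trips reinforced, both edge cases kept

print(forward_set)   # round trip + one-way out
print(backward_set)  # round trip + one-way in
```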
Example 2: Hot dogs on a rainy day
Imagine for a moment that we are going to use an LLM to model time series data for hot dog sales over the summer at Coney Island across a myriad of variables including weather, and that there’s an actual phenomenon of significantly increased hot dog sales the day before it rains because more people come out in advance of the rain.
For our feed-forward learner ~f(x), this is a difficult abstraction to pick up, as there’s a different possible “heuristic that almost always works” which might register instead for a feed-forward prediction of the data: maybe it rains after unusually great sales days? This abstraction would generally work out fairly well for our imaginary data set, outside of possibly a few errant results like thinking July 5th tends toward rain. Rather than picking up a perfect modeling of the data trends, the model could end up with an imperfect representation prone to transmission in synthetic data: it would primarily follow up good sales days attributable to other trends with rain, rather than modeling an independent trend of good sales before rain that isn’t attributable to other causes.
For our feed-backwards[1] learner ~f^{-1}(x) this is a much easier abstraction to model, as for it the tokens indicating the rainy weather in our time series will always precede the token predictions of unusually good sales. Even if it models this imperfectly (such as some kind of inner attribution to more dedicated sales efforts preceding rain instead of increased customer throughput), the approximately correct representation is more evident and robust feed-backwards. And as a result, this phenomenon will be better modeled across its synthetic data than ~f(x)’s synthetic data.
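A quick toy frequency check (with invented probabilities, nothing to do with real Coney Island sales) of why the direction matters: when high sales have more than one cause, the forward heuristic “rain follows good sales” gets diluted, while conditioning sales on upcoming rain stays crisp:

```python
import random

random.seed(0)

# Hypothetical setup: sales are high the day BEFORE rain, and also on random
# unrelated "event" days (holidays, promotions, etc.).
days = 10_000
rain = [random.random() < 0.2 for _ in range(days)]
event = [random.random() < 0.2 for _ in range(days)]
high_sales = [(i + 1 < days and rain[i + 1]) or event[i] for i in range(days)]

# Feed-forward framing: predict tomorrow's rain from today's sales.
fwd_hits = sum(1 for i in range(days - 1) if high_sales[i] and rain[i + 1])
fwd_total = sum(1 for i in range(days - 1) if high_sales[i])
print("P(rain tomorrow | high sales today):", round(fwd_hits / fwd_total, 2))  # ~0.56

# Feed-backwards framing: reading the series in reverse, rain precedes sales,
# so the model predicts sales conditioned directly on the upcoming rain.
bwd_hits = sum(1 for i in range(days - 1) if rain[i + 1] and high_sales[i])
bwd_total = sum(1 for i in range(days - 1) if rain[i + 1])
print("P(high sales today | rain tomorrow):", round(bwd_hits / bwd_total, 2))  # 1.0
```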
Entropy in all the right places
These are fairly simplistic examples of how biases or abstractions in synthetic data from feed-forward vs. “feed-backwards” models might differ, but hopefully the reader can imagine how these factors might compound as the complexity and scale of the training data and network increase, especially for things like CoT synthetic outputs.
A potential bonus to the feed-backwards synthetic data is that the entropy of its variations is front-loaded rather than end-loaded like feed-forward synthetic data. If you generate hundreds of beginnings of a sonnet that ends with a metaphor of Escher’s stairs, a prompt asking for a poem about LLM interpretability necessarily excludes the majority of high-entropy variations with very different openings as it gradually converges towards the lower-entropy, more widely represented metaphor.
For feed-forward synthetic data, variations of poems will have their highest entropy at the ends, and so the temperature can get a bit erratic as outputs drag on even if they start on track.
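One rough way to sanity-check this would be to measure per-position token entropy across a batch of sampled variations; the toy sequences below are hand-written stand-ins for real samples:

```python
import math
from collections import Counter

def positional_entropy(sequences):
    """Shannon entropy (bits) of the token distribution at each position."""
    length = min(len(s) for s in sequences)
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts.values())
        entropies.append(-sum((c / total) * math.log2(c / total)
                              for c in counts.values()))
    return entropies

# Feed-forward samples share the prompt, so they agree early and diverge late;
# feed-backwards samples share the ending, so they diverge early and agree late.
fwd = [["shall", "I", "compare", "thee", "to", "a", "summer's", "day"],
       ["shall", "I", "compare", "thee", "to", "an", "endless", "stair"]]
bwd = [["upon", "these", "steps", "I", "climb", "the", "endless", "stair"],
       ["within", "the", "print", "I", "climb", "the", "endless", "stair"]]

print(positional_entropy(fwd))  # low entropy early, higher entropy late
print(positional_entropy(bwd))  # higher entropy early, low entropy late
```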
The ideal is probably the best of both worlds on top of some organic data. But given how difficult it would likely be for a feed-forward model trained on feed-backwards synthetic data to pick up abstractions that are primarily apparent feed-backwards, the natural exclusionary effect of prompts on feed-backwards synthetic data may allow it to represent a greater relative share of the overall training data, with the net effect still more positive than negative in spite of the increased proportional share.
Wrap Up
Imagining a multiverse of generative outputs through a two-state or transactional interpretation may not cleanly map onto feasible network architectures for operation, but a similar end result could be approximated with synthetic data from two capable models: one feed-forward (as widely exists) and one “feed-backwards” (which AFAIK doesn’t). The union of these two data sets would reinforce common branches of tokens and abstractions, while also expanding the representation of edge cases from initial to final token predictions.
And ultimately, this is just an exercise to explore how looking at a problem space with different solutions in mind—even just in terms of the analogies we bring to bear—can offer fruitful avenues for exploratory thought.
After a weekend thinking about it, I have a suspicion that even if it doesn’t exist right now, inverse models built primarily for synthetic data generation to supplement pretraining and fine-tuning may end up cropping up as a cottage industry in the next few years.
TL;DR: Sometimes it’s easier to find one’s way from the center of the labyrinth to the entrance rather than the other way around.
[1] It’s obviously still a feed-forward neural network, but because it would be trained on token prediction in reverse for the training data and would be generating tokens in reverse in operation, I’m taking some liberty with the naming so I don’t need to keep typing out ~f^{-1}(x).