For example, a Bayesian expected utility maximizer programmed with a TM-based universal prior would not be able to realize that the prior is wrong.
What does it mean to “realize that a prior is wrong”? The mechanics of belief change in a Bayesian agent are fixed by the prior itself.
Nor would it be able to see that Bayesian updating is the wrong thing to do in some situations.
Bayesian updating is always the right thing to do. The only question is how to approximate a proper Bayesian update using efficient data structures and algorithms.
. . . it may be that there is a bunch of low-hanging fruit hiding just around the corner.
I would stay in the fruit tree metaphor and say they might be “hanging right over our heads”.
A prior can be wrong if it assigns zero weight to the true state of the world. For example, if our universe does in fact contain halting problem oracles, the Bayesian superintelligence with a TM-based universal prior will never be able to believe that, no matter how many hard math problems get successfully solved by this weird black box. But a human would converge on the true belief pretty quickly. All this stuff, and more, is in Wei Dai’s examples.
AIXI with a TM-based universal prior will always produce predictions about the black box, and predictions about the rest of the universe based on what the black box says, that are just as good as any prediction the human can come up with. After all, the human is in there somewhere. If you think of AIXI as embodying all computable ways of predicting the universe, rather than all computable models of the universe, you may begin to see that’s not quite as narrow as you thought.
Eliezer, that was your position in this thread, and I thought I had convinced you that it was wrong. If that’s not the case, can you please re-read my argument (especially the last few posts in the thread) and let me know why you’re not convinced?
So… the part I found potentially convincing was that if you ran off a logical view of the world instead of a Solomonoff view (i.e., beliefs represented in e.g. higher-order logic instead of Turing machines) and lived in a hypercomputable world then it might be possible to make better decisions, although not better predictions of sensory experience, in some cases where you can infer by reasoning symbolically that EU(A) > EU(B), presuming that your utility function is itself reasoning over models of the world represented symbolically. On the other hand, cousin_it’s original example still looks wrong.
You can make better predictions if you’re allowed to write down your predictions symbolically, instead of using decimal numbers. (And why shouldn’t that be allowed?)
ETA: I made this argument previously in the one-logic thread, in this post.
ETA 2: I think you can also make better (numerical) predictions of the form “this black box is a halting-problem oracle” although technically that isn’t a prediction of sensory experience.
Why would you want to make any predictions at all? Predictions are not directly about value. It doesn’t seem that there is a place for the human concept of prediction in a foundational decision theory.
It doesn’t seem that there is a place for the human concept of prediction in a foundational decision theory.
I think that’s right. I was making the point about prediction because Eliezer still seems to believe that predictions of sensory experience is somehow fundamental, and I wanted to convince him that the universal prior is wrong even given that belief.
Still, the universal prior does seem to be a universal way of eliciting what the human concept of prediction (expectation, probability) is, to the limit of our ability to train such a device, for exactly the reasons Eliezer gives: whatever concept we use, it’s in there, among the programs the universal prior weights.
ETA: On the other hand, the concept thus reconstructed would be limited to talk about observations, and so won’t be a general concept, while human expectation is probably more general than that, and you’d need a general logical language to capture it (and a language of unknown expressive power to capture it faithfully).
ETA2: Predictions might still be a necessary concept to express the decisions that agent makes, to connect formal statements with what the agent actually does, and so express what the agent actually does as formal statements. We might have to deal with reality because the initial implementation of FAI has to be constructed specifically in reality.
Umm… what about my argument that a human can represent their predictions symbolically like “P(next bit is 1)=i-th bit of BB(100)” instead of using numerals, and thereby do better than a Solomonoff predictor because the Solomonoff predictor can’t incorporate this? Or in other words, the only reason the standard proofs of Solomonoff prediction’s optimality go through is that they assume predictions are represented using numerals?
Re: “what about my argument that a human can [adapt its razor a little] and thereby do better than a Solomonoff predictor because the Solomonoff predictor can’t incorporate this?”
There are at least two things “Solomonoff predictor” could refer to:
1. An intelligent agent with Solomonoff-based priors;
2. An agent who is wired to use a Solomonoff-based razor on their sense inputs.
A human is more like the first agent. The second agent is not really properly intelligent—and adapts poorly to new environments.
Humans are (can be represented by) Turing machines. All halting Turing machines are incorporated in AIXI. Therefore, anything that humans can do to more effectively predict something than a “mere machine” is already incorporated into AIXI.
More generally, anything you represent symbolically can be represented using binary strings. That’s how that string you wrote got to me in the first place. You converted the Turing operations in your head into a string of symbols, a computer turned that into a string of digits, my computer turned it back into symbols, and my brain used computable algorithms to make sense of them. What makes you think that any of this is impossible for AIXI?
Am I going crazy, or did you just basically repeat what Eliezer, Cyan, and Nesov said without addressing my point?
Do you guys think that you understand my argument and that it’s wrong, or that it’s too confusing and I need to formulate it better, or what? Everyone just seems to be ignoring it and repeating the standard party line....
ETA: Now reading the second part of your comment, which was added after my response.
ETA2: Clearly I underestimated the inferential distance here, but I thought at least Eliezer and Nesov would get it, since they appear to understand the other part of my argument about the universal prior being wrong for decision making, and this seems to be a short step. I’ll try to figure out how to explain it better.
If 4 people all think you’re wrong for the same reason, either you’re wrong or you’re not explaining yourself. You seem to disbelieve the first, so try harder with the explaining.
Well, people expect him to be making good points, even when they don’t understand him (ie, I don’t understand UDT fully, but it seems to be important). Also, he’s advocating further thinking, which is popular around here.
Well, people expect him to be making good points, even when they don’t understand him
And I really, really wish people would stop doing that, whether it’s for Wei_Dai or anyone else you deem to be smart.
Folks, you may think you’re doing us all a favor by voting someone up because they’re smart, but that policy has the effect of creating an information cascade, because it makes an inference bounce back, accumulating arbitrarily high support irrespective of its relationship to reality.
The content of a post or comment should screen off any other information about its value [1], including who made it.
[1] except in obvious cases like when someone is confirming that something is true about that person specifically
Please only vote up posts you both understand and approve of.
I agree, but would like to point out that I don’t see any evidence that people aren’t already doing this. As far as I can tell, Lucas was only speculating that people voted up my post based on the author. Several other of my recent posts have fairly low scores, for example. (All of them advocated further thinking as well, so I don’t think that’s it either.)
In the limit, even if that one human is the only thing in all of the hypotheses that AIXI has under consideration, AIXI will be predicting precisely as that human does.
Umm… what about my argument that a human can represent their predictions symbolically like “P(next bit is 1)=i-th bit of BB(100)” instead of using numerals, and thereby do better than a Solomonoff predictor because the Solomonoff predictor can’t incorporate this?
A trivial but noteworthy fact is that every finite sequence of Σ values, such as Σ(0), Σ(1), Σ(2), …, Σ(n) for any given n, is computable, even though the infinite sequence Σ is not computable (see computable function examples).
Sorry, I should have used BB(2^100) as the example. The universal prior assigns the number BB(2^100) a very small weight, because the only way to represent it computably is by giving a 2^100 state Turing machine. A human would assign it a much larger weight, referencing it by its short symbolic representation.
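To make the asymmetry concrete, here is a rough sketch (my illustration; only the 2^100-state figure comes from the comment above, the rest is an approximation):

```latex
% Under the universal prior, the weight of a particular integer n scales like
M(n) \;\approx\; 2^{-K(n)},
% where K(n) is the length of the shortest program that outputs n.
% Any program that outputs BB(2^100) must in effect encode a machine of
% roughly 2^100 states, so K(BB(2^100)) is on the order of 2^100 and
M\bigl(\mathrm{BB}(2^{100})\bigr) \;\approx\; 2^{-2^{100}},
% while the symbolic description "BB(2^100)" that a human reasons with is
% only a handful of characters -- but it is not something a Turing machine
% can evaluate.
```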
Until I write up a better argument, you might want to (assuming you haven’t already) read this post where I gave a decision problem on which a human does (infinitely) better than AIXI.
I don’t think I understood that fully, but there seems to be a problem with your theory. The human gets to start in the epistemically advantaged position of knowing that the game is based on a sequence of busy beavers and knowing that they are a very fast-growing function. AIXI is prevented from knowing this information and has to start as if from a blank canvas. The reason we use an Occamian prior for AIXI is that we refuse to tailor it to a specific environment. If your logic is sound, then yes, it does do worse when it is dropped into an environment where it is paired with a human with an epistemic advantage, but it would beat the human across the space of possible worlds.
Another problem is that you seem to assume that the only hypothesis in the entire set that gives useful predictions is the hypothesis which is, in fact, correct. There are plenty of other functions which correctly predict arbitrarily long runs of 1’s, with much less complexity, and which can make AIXI’s overall probability weighting a usefully correct model of its universe, if not a fully correct one.
How a human might come to believe, without being epistemically privileged, that a sequence is probably a sequence of busy beavers, is a deep problem, similar to the problem of distinguishing halting oracles from impostors. (At least one mathematical logician who has thought deeply about the latter problem thinks that it’s doable.)
But in any case, the usual justification for AIXI (or adopting the universal prior) is that (asymptotically) it does as well as or better than any computable agent, even one that is epistemically privileged, as long as the environment is computable. Eliezer and others were claiming that it does as well as or better than any computable agent, even if the environment is not computable, and this is what my counter-example disproves.
So you think that we need to rethink our theory of what perfect optimization is, in order to take into account the possibility we live in an uncomputable universe? Even if you are correct in your example, there is no reason to suppose that your human does better in the space of possible uncomputable universes than AIXI, as opposed to better in that one possible (impossible) universe.
So you think that we need to rethink our theory of what perfect optimization is, in order to take into account the possibility we live in an uncomputable universe?
Yes.
Even if you are correct in your example, there is no reason to suppose that your human does better in the space of possible uncomputable universes than AIXI, as opposed to better in that one possible (impossible) universe.
This seems pretty easy, given the same level of raw computing power available to AIXI (otherwise the human gets screwed in the majority of cases simply because he doesn’t have enough computing power).
For example, I can simply modify AIXI with a rule that says “if you’ve seen a sequence of increasingly large numbers that can’t be explained by any short computable rule, put some weight into it being BB(1)...BB(2^n)…” (and also modify it to reason symbolically about expected utilities instead of comparing numbers), and that will surely be an improvement over all possible uncomputable universes. (ETA: Strike that “surely”. I have to think this over more carefully.)
How to make an optimal decision algorithm (as opposed to just improving upon AIXI) is still an open problem.
For example, I can simply modify AIXI with a rule that says “if you’ve seen a sequence of increasingly large numbers that can’t be explained by any short computable rule, put some weight into it being BB(1)...BB(2^n)…” (and also modify it to reason symbolically about expected utilities instead of comparing numbers), and that will surely be an improvement over all possible uncomputable universes. (ETA: Strike that “surely”. I have to think this over more carefully.)
This is what I dislike about your logic. You create a situation where (you think) AIXI fails, but you fail to take into account the likelihood of being in the situation versus being in a similar situation. I can easily see a human seeing a long series of ones, with some zeros at the beginning, saying “ah-ha, this must be the result of a sequence of busy beavers”, when all he’s actually seeing is 3^^^3 minus his telephone number or something. AIXI can lose in really improbable universes, because it’s designed to work in the space of universes, not some particular one. By modifying the rules, you can make it better in specific universes, but only by reducing its performance in similar seeming universes.
What about the agent using Solomonoff’s distribution? After seeing BB(1),...,BB(2^n), the algorithmic complexity of BB(1),...,BB(2^n) is sunk, so to speak. It will predict a higher expected payoff for playing 0 in any round i where the conditional complexity K(i | BB(1),...,BB(2^n)) < 100. This includes for example 2BB(2^n), 2BB(2^n)+1, BB(2^n)^2 * 3 + 4, BB(2^n)^^^3, etc. It will bet on 0 in these rounds (erroneously, since K(BB(2^(n+1)) | BB(2^n)) > 100 for large n), and therefore lose relative to a human.
I don’t understand how the bolded part follows. The best explanation by round BB(2^n) would be “All 1’s except for the Busy Beaver numbers up to 2^n”, right?
Yes, that’s the most probable explanation according to the Solomonoff prior, but AIXI doesn’t just use the most probable explanation to make decisions, it uses all computable explanations that haven’t been contradicted by its input yet. For example, “All 1’s except for the Busy Beaver numbers up to 2^n and 2BB(2^n)” is only slightly less likely than “All 1’s except for the Busy Beaver numbers up to 2^n” and is compatible with its input so far. The conditional probability of that explanation, given what it has seen, is high enough that it would bet on 0 at round 2BB(2^n), whereas the human wouldn’t.
EDIT: Wouldn’t it also break even by predicting the next Busy Beaver number? “All 1’s except for BB(1...2^n+1)” is also only slightly less likely. EDIT: I feel more stupid.
The next number in the sequence is BB(2^(n+1)), not BB(2^n+1).
ETA: In case more explanation is needed, it takes O(2^n) more bits to computably describe BB(2^(n+1)), even if you already have BB(2^n). (It might take O(2^n) more bits to describe BB(2^n+1) as well, but I wasn’t sure so I used BB(2^(n+1)) in my example instead.)
Since K(BB(2^(n+1)) | BB(2^n)) > 100 for large n, AIXI actually will not bet on 0 when BB(2^(n+1)) comes around, and all those 0s that it does bet on are simply “wasted”.
Surely predictions of sensory experience are pretty fundamental. To understand the consequences of your actions, you have to be able to make “what-if” predictions.
You can hardly steer yourself effectively into the future if you don’t have an understanding of the consequences of your actions.
Yes, it might be necessary exactly for that purpose (though consequences don’t reside just in the “future”), but I don’t understand this well enough to decide either way.
Consequences not being in the future seems to be a curious concept to me—though I understand that Feynman dabbled with the idea on sub-microscopic scales.
I think we’ve got it covered with Newcomb’s Problem (consequences in the past) and Counterfactual Mugging (consequences in another possible world). And there is still greater generality with logical consequences.
FWIW, I wouldn’t classify Newcomb’s Problem as having to do with “consequences in the past” or Counterfactual Mugging as having to do with “consequences in another possible world”.
For me, “consequences” refers to the basic cause-and-effect relationship—and consequences always take place downstream.
Anticipating something doesn’t really mean that the future is causally affecting the past. If you deconstruct anticipation, it is all actually based on current and previous knowledge.
You are arguing definitions (with the use of a dictionary, no less!). The notion of consequences useful for decision theory is a separate idea from causality of physics.
That took two days to parse, but now I understand how it works. You’re right. I apologize to everyone for having defended an incorrect position.
My misconception seems to be popular, though. Maybe someone should write a toplevel post on the right way to think about the universal prior. Though seeing that some other people are even more hopelessly confused than me, and seem to struggle with the idea of “prior” per se, I’m not sure that introducing even more advanced topics would help.
Yes, the human is in there somewhere, but so are many other, incorrect predictors. To adopt their predictions as its own, AIXI needs to verify them somehow, but how? (I’m very confused here and may be missing something completely obvious.)
I don’t know much about Solomonoff induction, so I may be wrong about this, but is it not the case that the universal prior only takes into account computable functions which exactly output the sensory data? If that is the case, consider the following scenario:
We have a function F which takes an unbounded natural number N as input and is provably uncomputable for all valid inputs.
We have a computable algorithm A which provably outputs lower and upper bounds for F for any valid input. Furthermore, it is provable that no computable algorithm can provably produce tighter bounds on F’s output than A (regardless of N).
We can see that A outputs the bounds for a closed interval in the set of real numbers. We know that all such intervals (for which the lower and upper bounds are not equal) are uncountable.
Now imagine a physical hypercomputer which outputs F(0), then F(1), then F(2), etc. to infinity. No computable algorithm will be able to predict the next symbol output by this hypercomputer, but there will be computable minds capable of recognizing the pattern and so of using A to place stronger bounds on its predictions of future sensory experience than AIXI can.
EDIT: Actually, this scenario might be broken. Specifically, I’m not sure what it physically means to ‘output’ an uncomputable number, and I think that AIXI’s problem dissolves if we limit ourselves to the computable (and thus countable) subsets of the output intervals.
If you’re not convinced by my argument but can’t explain why or don’t have time to, can you please say that as well? Right now I’m not sure if you were convinced, and then forgot the discussion and went back to your previous position, or what.
AIXI only includes all prediction models that are 100% accurate. I don’t think the human is capable of coming up with 100% accurate predictions.
Thought: The human can’t make predictions at all about the black box, but he can use it to predict the outcomes of various computable processes. AIXI can already predict the outcomes of computable processes, and doesn’t need the black box.
For example, if our universe does in fact contain halting problem oracles, the Bayesian superintelligence with a TM-based universal prior will never be able to believe that.
I think this problem would vanish if you spelled out what “believe” means. The Bayesian superintelligence would quickly learn to trust the opinion of the halting problem oracle; therefore, it would “believe” it.
I am having a few problems in thinking of a sensible definition of “believe” in which the superintelligence would fail to believe what its evidence tells it is true. It would be especially obvious if the machine was very small. The superintelligence would just use Occam’s razor—and figure it out.
Of course, one could imagine a particularly stupid agent, that was too daft to do this—but then it would hardly be very much of a superintelligence.
P(true) = 0 - or p(false) = 1 - seem like trivial mistakes to avoid.
An “expected utility maximizer programmed with a TM-based universal prior” would surely not care very much if it was programmed with wrong priors after a while—since it would not be depending on the details of its priors much any more—due to having a big mountain of experience concerning what the actual expected frequency of events was. Its priors would be swamped by data—unless its priors were completely crazy.
The OP must be thinking of some different type of construct from me—and he doesn’t seem to explain what it is.
P(true) = 0 or p(false) = 1 seem like trivial mistakes to avoid.
Unfortunately they aren’t. A universal prior must enumerate all the ways a universe could possibly be. If your prior is based on Turing machines that compute universes, but our actual universe is uncomputable, you’re screwed forever no matter what data comes in. Maybe the problem can be solved by a better universal prior, as Nesov suggests elsewhere in the thread, but as far as I understand it’s an open problem right now.
ETA: pretty much this whole comment is wrong. The prior is over algorithms that generate sequences of sensory input, not over algorithms that define universes. This is an important distinction, sorry for missing it when I wrote this comment.
A universal prior must enumerate all the ways a universe could possibly be. If your prior is based on Turing machines that compute universes, but our actual universe is uncomputable, you’re screwed forever no matter what data comes in.
Being forced to use the nearest computable approximation to an uncomputable function does not make you screwed forever.
That depends on the uncomputable function. Some can make you very well screwed indeed. It’s all there in Wei Dai’s examples on everything-list and one-logic, I really wish people would read them, maybe we’d have an actual discussion then. Sorry for sounding harsh.
That depends on the uncomputable function. Some can make you very well screwed indeed.
Right, but it’s not necessarily true, or even likely, hence my point.
It’s all there in Wei Dai’s examples on everything-list and one-logic, I really wish people would read them, maybe we’d have an actual discussion then.
I did read the links (including the link to the empty stub article!), and the google group discussions all seemed to end, from my brief perusal of them, with the consensus that Wei Dai hadn’t established his provocative, counterintuitive point. (And some of the exchanges here show the same.)
At the very least, he should summarize the reasoning or examples, as per standard practice, so we know there’s something to be gained from going to the links. This is especially true given that most readers had assumed that the opposite of Wei Dai’s premises are true and uncontroversial.
Re: “If your prior is based on Turing machines that compute universes, but our actual universe is uncomputable, you’re screwed forever no matter what data comes in.”
No way! Agents are generally not crippled by their priors! Soon enough they have actual data, and—unless their priors are crazy—they are quickly swamped—and they don’t need the details of their original priors any more—because they have real info to go on instead.
I’m not sure—did you miss the idea that all “universal priors” that we know how to construct today assign zero credence to certain hypotheses about the universe, or did you miss the idea that a zero-credence hypothesis will never rise above zero no matter how much data comes in, or is it me who’s missing something?
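For concreteness, the second point is just Bayes’ theorem, spelled out here for reference:

```latex
% If the prior assigns P(H) = 0, then for any data D with P(D) > 0:
P(H \mid D) \;=\; \frac{P(D \mid H)\, P(H)}{P(D)} \;=\; \frac{P(D \mid H) \cdot 0}{P(D)} \;=\; 0.
% No amount of incoming data can lift a hypothesis off a prior of zero.
```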
Also, correct me if I’m wrong, but doesn’t the Solomonoff prior bypass the issue of explicit hypotheses? That is, it puts a (non-zero) prior on every (prefix) bitstream of sensory data.
So, it doesn’t even seem necessary to talk about what such an agent’s “priors on hypotheses” are—everything it believes is encoded as an expectation of sensory data, and nothing more. It does not explicitly represent concepts like, “this thing is a halting oracle”.
Instead, when it encounters a halting oracle, it increases the weight it assigns to expectations of observations of things that are consistent with having been produced by a halting oracle, not the existence of a halting oracle as such.
No matter how uncomputable, or lengthy-to-specify, a function might be, you can always finitely specify your expectation weight on a finite observation prefix stream (i.e. the first n things you observe from the oracle).
So, I don’t see how an agent with a Solomonoff prior chokes on an encounter with a halting oracle.
Normal agents won’t. Genuinely intelligent agents won’t.
I think those who are arguing that it will are imagining an agent with the Solomonoff prior totally wired into them in a manner that they can’t possibly unlearn.
But still, even if you have the Occamian prior (which I think is what’s meant by the Solomonoff prior), there is no need to unlearn it. You retain a prior on all hypotheses that decreases in weight exponentially with length, and it persists on top of any observations you’ve updated on. Those new observations, combined with the Occamian prior give you the optimal weights on (prefix) sensory bitstreams, discounting the ruled-out ones and favoring those closer to what you’ve actually observed.
Even then, it keeps updating in favor of the observations that match what an oracle gives (without having to explicitly represent that they’re from an oracle). No penalty from failure to unlearn.
The thing is, there is no one true razor. Different sources have different associated reference machines—some are more like Turing Machines, others are more like CA. If what you are looking at is barcodes, then short ones are pretty rare—and if you go into simulated worlds, sources can have practically any distribution you care to mention.
Yes, you can model these as “compiler overhead” constants—which represent the “cost” of simulating one reference machine in another—but that is just another way of saying you have to unlearn the Solomonoff prior and use another one—which is more appropriate for your source.
You can still do that, whatever your reference machine is—provided it is computationally universal—and doesn’t have too much “faith”.
A prior is not a program that tells you what to do with the data. A prior is a set of hypotheses with a number assigned to each. When data comes in, we compute the likelihoods of the data given each hypothesis on the list, and use these numbers to obtain a posterior over the same hypotheses. There’s no general way to have a “none of the above” (NOTA) hypothesis in your prior, because you can’t compute the likelihood of the data given NOTA.
Another equivalent way to think about it: because of the marginalization step (dividing everything by the sum of all likelihoods), Bayesian updating doesn’t use the total likelihood of the data given all current hypotheses—only the relative likelihoods of one hypothesis compared to another. This isn’t easy to fix because “total likelihood” is a meaningless number that doesn’t indicate anything—it could easily be 1000 in a setup with an incorrect prior or 0.001 in a setup with a correct prior.
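Here is a minimal sketch of the marginalization point (toy numbers, purely illustrative): scaling every likelihood by the same factor leaves the posterior untouched, so the “total likelihood” has nowhere to enter the update, and neither does a likelihood-free “none of the above”.

```python
# Toy illustration (hypothetical numbers): Bayesian updating only uses
# *relative* likelihoods, so a NOTA hypothesis with no likelihood function
# has no way to participate in the update.

def posterior(prior, likelihoods):
    """Compute P(H_i | D) from priors P(H_i) and likelihoods P(D | H_i)."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

prior = [0.5, 0.3, 0.2]

# Likelihoods that fit the data well, and the same likelihoods scaled down
# by a factor of 1000, give exactly the same posterior: the absolute level
# of fit is invisible to the update.
print(posterior(prior, [0.9, 0.5, 0.1]))
print(posterior(prior, [0.0009, 0.0005, 0.0001]))
```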
The example seems to be that using a Turing machine to generate your priors somehow results in an expectation of p(uncomputable universe)=0. That idea just seems like total nonsense to me. It just doesn’t follow. For all I care, my priors could have been assigned to me using a Turing machine model at birth—but I don’t think p(uncomputable universe)=0. The whole line of reasoning apparently makes no sense.
Priors are probability estimates for uncertain quantities.
In Solomonoff induction they are probability estimates for bitstrings—which one can think of as representing possible sensory inputs for an agent.
With a standard TM length-based encoding, no finite bitstring is assigned a zero probability—and we won’t have to worry about perceiving infinite bitstrings until after the universal heat death—so the fact that certain infinite bitstrings get a zero prior probability is not a problem.
Whether the bitstrings were created using uncomputable physics is neither here nor there. They are still just bitstrings—and so can be output by a TM with a finite program on its tape.
No, sorry. You’re confused. A prior is not an assignment of credences to all bitstrings that you can observe. A prior is an assignment of credences to hypotheses, i.e. possible states of the world that generate bitstrings that you observe. Otherwise you’d find yourself in this text (see part II, “Escaping the Greek Hinterland”).
No. We were talking about the universal prior. Here is how that is defined for sequences:
“The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p.”
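In symbols, one standard rendering of the quoted definition (with U a fixed universal prefix machine and |q| the length of program q):

```latex
M(p) \;=\; \sum_{\substack{q \,:\; U(q) \text{ outputs a string} \\ \text{beginning with } p}} 2^{-|q|}
```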
The universal prior of a sequence is the probability of that particular sequence arising (as a prefix). It is not the probability of any particular hypothesis or program. Rather, it is a weighted sum of the probabilities of all the programs that generate that sequence.
You can talk about the probabilities of hypotheses and programs as well if you like—but the universal prior of a sequence is perfectly acceptable subject matter—and is not a “confused” idea.
No finite sequence has a probability of zero—according to the universal prior.
All finite bitstrings can be produced by computable means—even if they were generated as the output of an uncomputable physical process.
Is this misconception really where this whole idea arises from?
This is all true, but… Why do you think the universal prior talks about computer programs at all? If I only wanted a prior over all finite bitstrings, I’d use a simpler prior that assigned every string of length N a credence proportional to 2^-N. Except that prior has a rather major shortcoming: it doesn’t help you predict the future! No matter how many bits you feed it, it always says the next bit is going to be either 0 or 1 with probability 50%. It will never get “swamped” by the data, never gravitate to any conclusions. This is why we want the universal prior to be based on computer programs instead: it will work better in practice, if the universe is in fact computable. But what happens if the universe is uncomputable? That’s the substantive question here.
ETA: the last two sentences are wrong, disregard them.
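The uniform-prior point above, spelled out (a standard calculation, included for reference):

```latex
% Length-based uniform measure: each finite prefix x gets weight
\lambda(x) \;=\; 2^{-|x|}.
% Conditional probability that the next bit is 1, given any observed prefix x:
\lambda(1 \mid x) \;=\; \frac{\lambda(x1)}{\lambda(x)}
             \;=\; \frac{2^{-(|x|+1)}}{2^{-|x|}} \;=\; \frac{1}{2}.
% The prediction stays at 50/50 no matter what data has been seen.
```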
Nothing much happens to intelligent agents—because an intelligent agents’ original priors mostly get left behind shortly after they are born—and get replaced by evidence-based probability estimates of events happening. If convincing evidence comes in that the world is uncomputable, that just adds to the enormous existing stack of evidence they have about the actual frequencies of things.
Anyhow, priors being set to 0 or 1 is not a problem for observable sense data. No finite sense data has p assigned to 0 or 1 under the universal prior—so an agent can always update successfully—if it gets sufficient evidence that a sequence was actually produced. So, if it sees a system that apparently solves the halting problem for arbitrary programs, that is no big deal for it. It may have found a Turing oracle! Cool!
I suppose it might be possible to build a semi-intelligent agent with a particular set of priors permanently wired into it—so the agent was incapable of learning and adapting if its environment changed. Organic intelligent agents are not much like that—and I am not sure how easy it would be to build such a thing. Such agents would be incapable of adapting to an uncomputable world. They would always make bad guesses about uncomputable events. However, this seems speculative—I don’t see why people would try to create such agents. They would do very badly in certain simulated worlds—where Occam’s razor doesn’t necessarily hold true—and it would be debatable whether their intelligence was really very “general”.
The reason the universal prior is called “universal” is because given initial segments, where the infinite strings come from any computable distribution, and updating on those samples, it will, in fact, converge to the actual distribution on what the next bit should be. Now I’ll admit to not actually knowing the math here, but it seems to me that if most any prior had that property, as you seem to imply, we wouldn’t need to talk about a universal prior in the first place, no?
Also, if we interpret “universe” as “the actual infinite string that these segments are initial segments of”, then, well… take a look at that sum you posted and decompose it. The universal prior is basically assigning a probability to each infinite string, namely the sum of the probabilities of programs that generate it, and then collapsing that down to a distribution on initial segments in the obvious way. So if we want to consider its hypotheses about the actual law of the universe, the whole string, it will always assign 0 probability to an uncomputable sequence.
Convergence is more the result of the updates than the original prior. All the initial prior has to be, for convergence to result, is “not completely ridiculous” (no probabilities of 1, 0, infinitesimals, etc.). The idea of a good prior is that it helps initially, before an agent has any relevant experience to go on. However, that doesn’t usually last for very long—real organic agents are pretty quickly flooded with information about the state of the universe, and are then typically in a much better position to make probability estimates. You could build agents that were very confident in their priors—and updated them slowly—but only rarely would you want an agent that was handicapped in its ability to adapt and learn.
Picking the best reference machine would be nice—but I think most people understand that for most practical applications, it doesn’t matter—and that even a TM will do.
Convergence is more the result of the updates than the original prior. All the initial prior has to be, for convergence to result, is “not completely ridiculous” (no probabilities of 1, 0, infinitesimals, etc.).
Are you certain of this? Could you provide some sort of proof or reference, please, ideally together with some formalization of what you mean by “completely ridiculous”? I’ll admit to not having looked up a proof of convergence for the universal prior or worked it out myself, but if what you say were really the case, there wouldn’t actually be very much special about the universal prior, and this convergence property of it wouldn’t be worth pointing out—so I think I have good reason to be highly skeptical of what you suggest.
However, that doesn’t usually last for very long—real organic agents are pretty quickly flooded with information about the state of the universe, and are then typically in a much better position to make probability estimates.
Better, yes. But good enough? Arbitrarily close?
You could build agents that were very confident in their priors—and updated them slowly—but only rarely would you want an agent that was handicapped in its ability to adapt and learn.
Sorry, but what does this even mean? I don’t understand how this notion of “update speed” translates into the Bayesian setting.
Here’s Shane Legg on the topic of how little priors matter when predicting the environment:
“In some situations, for example with Solomonoff induction, the choice of the reference machine doesn’t matter too much. [...] the choice of reference machine really doesn’t matter except for very small data sets (which aren’t really the ones we’re interested in here). To see this, have a look at the Solomonoff convergence bound and drop a compiler constant in by the complexity of the environment. The end result is that the Solomonoff predictor needs to see just a few more bytes of the data sequence before it converges to essentially optimal predictions.”
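The bound being referred to is, roughly, the standard Solomonoff/Hutter prediction bound (my paraphrase; treat the exact form as approximate). The reference machine only enters through an additive “compiler constant” inside K(mu):

```latex
% For any computable environment \mu generating the sequence x_1 x_2 \ldots,
\sum_{t=1}^{\infty} \mathbb{E}_{\mu}\!\left[\bigl(M(x_t{=}1 \mid x_{<t}) - \mu(x_t{=}1 \mid x_{<t})\bigr)^{2}\right]
\;\le\; \frac{\ln 2}{2}\, K(\mu).
% Switching the reference machine changes K(\mu) by at most an additive
% constant, i.e. a bounded number of extra bits of data before predictions
% become essentially optimal.
```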
This doesn’t address what I said at all. We don’t speak of “the” universal prior because there’s a specific UTM it’s defined with respect to; we speak of “the” universal prior because we don’t much care about the distinction between different universal priors! The above article is still about doing Bayesian updating starting with a universal prior. The fact that which particular universal prior you start from doesn’t matter much is not new information, and it in no way supports your claim that any “reasonable” prior—whatever that might mean—will also have this same property.
I think when he says “the choice of the reference machine doesn’t matter too much” and “the choice of reference machine really doesn’t matter except for very small data sets” he literally means those things. I agree that my position on this is not new.
Sorry, how does “literally” differ from what I stated? And you seem to be stating something very different from him. He is just stating that the UTM used to define the universal prior is irrelevant. You are claiming that any “reasonable” prior, for some unspecified but expansive-sounding notion of “reasonable”, has the same universal property as a universal prior.
That seems like quite a tangle, and, alas, I am not terribly interested in it. But:
The term was “reference machine”. No implication that it is a UTM is intended—it could be a CA—or any other universal computer. The reference machine totally defines all aspects of the prior. There are not really “universal reference machines” which are different from other “reference machines”—or if there are, “universal” just refers to universal computation. A universal machine can define literally any distribution of priors you can possibly imagine. So: the distinction you are trying to make doesn’t seem to make much sense.
Convergence on accurate beliefs has precious little to do with the prior—it is a property of the updating scheme. The original priors matter little after a short while—provided they are not zero, one—or otherwise set so they prevent updating from working at all.
Thinking of belief convergence as having much to do with your priors is a wrong thought.
The term was “reference machine”. No implication that it is a UTM is intended—it could be a CA—or any other universal computer. The reference machine totally defines all aspects of the prior. There are not really “universal reference machines” which are different from other “reference machines”—or if there are, “universal” just refers to universal computation. A universal machine can define literally any distribution of priors you can possibly imagine. So: the distinction you are trying to make doesn’t seem to make much sense.
Sorry, what? Of course it can be any sort of universal computer; why would we care whether it’s a Turing machine or some other sort? Your statement that taking a universal computer and generating the corresponding universal prior will get you “literally any distribution of priors you can imagine” is just false, especially as it will only get you uncomputable ones! Generating a universal prior will only get you universal priors. Perhaps you were thinking of some other way of generating a prior from a universal computer? Because that isn’t what’s being talked about.
Convergence on accurate beliefs has precious little to do with the prior—it is a property of the updating scheme. The original priors matter little after a short while—provided they are not zero, one—or otherwise set so they prevent updating from working at all.
Thinking of belief convergence as having much to do with your priors is a wrong thought.
You have still done nothing to demonstrate this. The potential for dependence on priors has been demonstrated elsewhere (anti-inductive priors, etc). The “updating scheme” is Bayes’ Rule. (This might not suffice in the continuous-time case, but you explicitly invoked the discrete-time case above!) But to determine all those probabilities, you need to look at the prior. Seriously, show me (or just point me to) some math. If you refuse to say what makes a prior “reasonable”, what are you actually claiming? That the set of priors with this property is large in some appropriate sense? Please name what sense. Why should we not just use some equivalent of maxent, if what you say is true?
“Of course it can be any sort of universal computer; why would we care whether it’s a Turing machine or some other sort?”
Well, different reference machines produce different prior distributions—so the distribution used matters initially, when the machine is new to the world.
“Your statement that taking a universal computer and generating the corresponding universal prior will get you “literally any distribution of priors you can imagine” is just false, especially as it will only get you uncomputable ones! ”
“Any distribution you can compute”, then—if you prefer to think that you can imagine the uncomputable.
“You have still done nothing to demonstrate this.”
Actually, I think I give up trying to explain. From my perspective you seem to have some kind of tangle around the word “universal”. “Universal” could usefully refer to “universal computation” or to a prior that covers “every hypothesis in the universe”.
There is also the “universal prior”—but I don’t think “universal” there has quite the same significance that you seem to think it does. There seems to be repeated miscommunication going on in this area.
It seems non-trivial to describe the class of priors that leads to “fairly rapid” belief convergence in an intelligent machine. Suffice to say, I think that class is large—and that the details of priors are relatively insignificant—provided there is not too much “faith”—or “near faith”. Part of the reason for that is that priors usually get rapidly overwritten by data. That data establishes its own subsequent prior distributions for all the sources you encounter—and for most of the ones that you don’t. If you don’t agree, fine—I won’t bang on about it further in an attempt to convince you.
Firstly, please use Markdown quotes for ease of reading? :-/
Well, different reference machines produce different prior distributions—so the distribution used matters initially, when the machine is new to the world.
Indeed, but I don’t think that’s really the property under discussion.
“Any distribution you can compute”, then—if you prefer to think that you can imagine the uncomputable.
....huh? Maybe you are misunderstanding the procedure in question here. We are not taking arbitrary computations that output distributions and using those distributions. That would get you arbitrary computable distributions. Rather, we are taking arbitrary universal computers/UTMs/Turing-complete programming languages/whatever you want to call them, and then generating a distribution as “probability of x is sum over 2^-length over all programs that output something beginning with x” (possibly normalized). I.e. we are taking a reference machine and generating the corresponding universal prior.
Not only will this not get you “any distribution you can compute”, it won’t get you any distributions you can compute at all. The resulting distribution is always uncomputable. (And hence, in particular, not practical, and presumably not “reasonable”, whatever that may mean.)
Am I mistaken in asserting that this is what was under discussion?
It seems non-trivial to describe the class of priors that leads to “fairly rapid” belief convergence in an intelligent machine. Suffice to say, I think that class is large—and that the details of priors are relatively insignificant—provided there is not too much “faith”—or “near faith”. Part of the reason for that is that priors usually get rapidly overwritten by data. That data establishes its own subsequent prior distributions for all the sources you encounter—and for most of the ones that you don’t. If you don’t agree, fine—I won’t bang on about it further in an attempt to convince you.
You don’t have to attempt to convince me, but do note that despite asserting it repeatedly you have, in fact, done zero to establish the truth of this assertion / validity of this intuition, which I have good reason to believe to be unlikely, as I described earlier.
FWIW, what I meant was that—by altering the reference machine, p() - for all bitstrings less than a zillion bits long—can be made into any set of probabilities you like—provided they don’t add up to more than 1, of course.
The reference machine defines the resulting probability distribution completely.
AH! So you are making a comment on the use of universal priors to approximate arbitrary finite priors (and hence presumably vice versa). That is interesting, though I’m not sure what it has to do with eventual convergence. You should have actually stated that at some point!
Re: “I don’t understand how this notion of “update speed” translates into the Bayesian setting.”
Say you think p(heads) is 0.5. If you see ten heads in a row, do you update p(heads) a lot, or a little? It depends on how confident you are of your estimate.
If you had previously seen a thousand coin flips from the same coin, you might be confident of p(heads) being 0.5 - and therefore update little. If you were told that it was a biased coin from a magician, then your estimate of p(heads) being 0.5 might be due to not knowing which way it was biased. Then you might update your estimate of p(heads) rapidly—on seeing several heads in a row.
What you have just laid out are not different “update speeds” but different priors. “It’s a biased coin from a magician” is of the same class of prior assumptions as “It’s probably a fair coin” or “It’s a coin with some fixed probability of landing heads, but I have no idea what” or “It’s a rigged coin that can only come up heads 10 times once activated”.
After each toss, you do precisely one Bayesian update. Perhaps the notion of “update speed” might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not. The amount you update is determined by Bayes’ Law; different apparent “update speeds” are due to differing priors. “Speed” probably isn’t even a good term, as updates aren’t even necessarily in the same direction! If you think the coin can only come up heads 10 times, each appearance of heads makes it less likely to come up again.
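A small sketch of that point (a hypothetical Beta-prior coin model; the numbers are mine, purely illustrative): both agents apply the same Bayes rule, and the apparent difference in “speed” is carried entirely by the prior.

```python
# Hypothetical illustration: two agents model a coin's bias with Beta priors.
# Agent A has already seen about 1000 flips suggesting fairness; agent B
# knows almost nothing. Both perform the same conjugate Bayesian update on
# ten heads in a row; only the priors differ.

def update_beta(alpha, beta, heads, tails):
    """Conjugate update of a Beta(alpha, beta) prior on a coin's bias."""
    return alpha + heads, beta + tails

def mean(alpha, beta):
    """Posterior mean estimate of P(heads)."""
    return alpha / (alpha + beta)

priors = {
    "confident (about 1000 flips already seen)": (500.0, 500.0),
    "uncertain (knows almost nothing)": (1.0, 1.0),
}

for label, (a, b) in priors.items():
    a2, b2 = update_beta(a, b, heads=10, tails=0)
    print(f"{label}: P(heads) {mean(a, b):.3f} -> {mean(a2, b2):.3f}")

# confident: 0.500 -> 0.505   (barely moves)
# uncertain: 0.500 -> 0.917   (moves a lot)
```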
I do not know where the idea that “speeds” are “parameters” and not “statistics” comes from. An entity being a statistic doesn’t imply that it is not a speed.
The same goes for discrete systems. They have the concept of speed too.
This is utterly irrelevant. The problem with what you say is not that there’s no notion of speed, it’s that there is precisely one way of doing updates, and it has no “speed” parameter.
In the game of life, the update speed is always once per generation. However, that doesn’t mean it has no concept of speed. In fact the system exhibits gliders with many different speeds.
It’s much the same with an intelligent agent’s update speed in response to evidence—some will update faster than others—depending on what they already know.
You claimed that:
“Perhaps the notion of “update speed” might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not.”
However, the concept of “speed” works equally well in discrete and continuous systems—as the GOL illustrates. “Discreteness” is an irrelevance.
“Perhaps the notion of “update speed” might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not.”
However, the concept of “speed” works equally well in discrete and continuous systems—as the GOL illustrates. “Discreteness” is an irrelevance.
You really seem to be missing the point here. I’m sorry, but from your posts I can’t help but get the idea that you don’t really understand how this sort of prediction scheme works. Sure, “update speed” in the sense you described it elsewhere in the thread makes sense, but who cares? Update speed in the sense you described it elsewhere in the thread is a consequence of the prior (or current state, rather); it isn’t some sort of parameter, and it’s not clear it’s something at all stable or meaningful. You’ve established the existence of something trivial and probably irrelevant. In the parametric sense you seemed to be originally using it, it doesn’t exist. Can we agree on this?
Probably nobody cares—apart from you, it seems. Apparently, one can’t get away with using the phrase “update speed” in connection with an intelligent agent without getting bounced.
When you said:
“I don’t understand how this notion of “update speed” translates into the Bayesian setting.”
...and I said...
“Say you think p(heads) is 0.5. If you see ten heads in a row, do you update p(heads) a lot, or a little? It depends on how confident you are of your estimate. If you had previously seen a thousand coin flips from the same coin, you might be confident of p(heads) being 0.5 - and therefore update little. If you were told that it was a biased coin from a magician, then your estimate of p(heads) being 0.5 might be due to not knowing which way it was biased. Then you might update your estimate of p(heads) rapidly—on seeing several heads in a row. Like that.”
...IMO, the conversation could and should have stopped—right there.
In the game of life, the update speed is always once per generation. However, that doesn’t mean it has no concept of speed. In fact the system exhibits gliders with many different speeds.
This is not analogous. We are speaking of a complete system here.
It’s much the same with an intelligent agent’s update speed in response to evidence—some will update faster than others—depending on what they already know.
I have already addressed this. What you have called “update speed” is determined by current distribution.
I assure you that I could exhibit a GOL field that consisted entirely of gliders moving at c/2 - and then exhibit another GOL field that consisted entirely of gliders moving at c/4. These systems would have different characteristic speeds. Hopefully, you see the analogy now.
Just so. I never said otherwise. You already asked for clarification about whether I thought that the system’s “update speed” was to do with its prior Prob. Dist. - and I said that “yes”, it was.
Hm; there may not be a disagreement here. You seemed to be using it in a way that implied it was not determined by (or even was independent of) the prior. Was I mistaken there?
The idea was that some agents update faster than others (or indeed not at all).
If you like you can think of the agents that update relatively slowly as being confident that they are uncertain about the things they are unsure about. That confidence in their own uncertainty could indeed be represented by other priors.
If you want to think about it that way, please don’t say “other priors”. That’s very confusing, because “prior” in this context refers to the whole prior, not to pieces of it (which I’m not sure how you’re detangling from each other, anyway). If we’re talking about something of the universal-prior sort, it has one prior, over its total sensory experience; I’m not clear how you’re decomposing that or what alternative model you are suggesting.
The two types of prior probability I discussed were “belifs about the world” and “beliefs about the certainty of those beliefs”.
An agent that updates its beliefs about the world rapidly (in response to evidence) would have a low degree of certainty about those beliefs—while an agent that updates its beliefs about the world slowly would have a high degree of certainty that those beliefs were already correct—and were backed up with lots of existing evidence.
I gave an example of this already when I discussed the magician’s coin.
Except these aren’t separate things. That isn’t how this sort of system works! Its beliefs about the certainty of those beliefs are determined by its beliefs about the world.
Well, everything is about the world, if materialism is true.
You don’t seem to be even trying to perform a sympathetic reading. Leave aside quibbling about what is to do with the world—can you at least see that in the first case, updates happen quickly, and in the second case they happen slowly? “Speed” just refers to distance divided by time. Here distance is the probability delta, and time is simply time. So, updates can happen fast and slow. Some systems update quickly, others update slowly—and others don’t update at all. This all seems fairly simple to me—what is the problem?
Right. I really don’t think that what I am saying is controversial. The way I remember it, I talked about systems with different update speeds—and you jumped on that.
Alternatively, I could say, I went with the assumption that you were attempting to carve the relevant concepts at the joints and getting it wrong, rather than that you were making a true statement which doesn’t even try to accomplish that.
M, sorry then. But you didn’t explain the term anywhere, so I assumed it meant what it sounded like—the original context makes it sound like you mean something separate from the prior, rather than something determined by it. If instead of talking about building agents that were “confident in their priors” and “updated them slowly” you had just spoken of “priors that result in slow updating” I don’t think there would have been a problem. (I must admit I probably also wasn’t inclined to look for a sympathetic reading, as your other comments about the universal prior seem to be just wrong.)
Re: “there wouldn’t actually be be very much special about the universal prior”
Well, Occam’s razor is something rather special. However, agents don’t need an optimal version of it built into them as a baby—they can figure it out from their sensory inputs.
Uncomputable strings are all infinite. Those concerned with the real world don’t have to concern themselves with such things. Everything you encounter in practice is finite, and therefore computable.
Nothing much happens to intelligent agents—because an intelligent agents’ original priors mostly get left behind shortly after they are born—and get replaced by evidence-based probability estimates of events happening.
Prior determines how evidence informs your estimates, what things you can consider. In order to “replace priors with evidence-based probability estimates of events”, you need a notion of event, and that is determined by your prior.
Prior evaluates, but it doesn’t dictate what is being evaluated. In this case, “events happening” refers to subjective anticipation, which in turn refers to prior, but this connection is far from being straightforward.
“Determined” in the sense of “weakly influenced”. The more actual data you get, the weaker the influence of the original prior becomes—and after looking at the world for a little while, your original priors become insignificant—swamped under a huge mountain of sensory data about the actual observed universe.
Priors don’t really affect what things you can consider—since you can consider (and assign non-zero probability to) receiving any sensory input sequence.
I use the word “prior” in the sense of priors as mathematical objects, meaning all of your starting information plus the way you learn from experience.
I can’t quite place “you need a notion of event, and that is determined by your prior”, but I guess the mapping between sample space and possible observations is what you meant.
Well yes, you can have “priors” that you have learned from experience. An uncomputable world is not a problem in that case either—since you can learn about uncomputable physics, in just the same way that you learn about everything else.
This whole discussion seems to be a case of people making a problem out of nothing.
. . . it may be that there is a bunch of low-hanging fruit hiding just around the corner.
I would stay in the fruit tree metaphor and say they might be “hanging right over our heads”.
Gee, that was obviously supposed to be a non-mixed metaphor about urban foraging. Yeah that’s it. :)
Seriously, I thought about sticking with the fruit tree metaphor, but “hanging right over our heads” makes the problem sound too easy, so I decided to favor accuracy over literary elegance.
What does it mean to “realize that a prior is wrong”? The mechanics of belief change in a Bayesian agent are fixed by the prior itself.
Bayesian updating is always the right thing to do. The only question is how to approximate a proper Bayesian update using efficient data structures and algorithms.
I would stay in the fruit tree metaphor and say they might be “hanging right over our heads”.
A prior can be wrong if it assigns zero weight to the true state of the world. For example, if our universe does in fact contain halting problem oracles, the Bayesian superintelligence with a TM-based universal prior will never be able to believe that, no matter how many hard math problems get successfully solved by this weird black box. But a human would converge on the true belief pretty quickly. All this stuff, and more, is in Wei Dai’s examples.
AIXI with a TM-based universal prior will always produce predictions about the black box, and predictions about the rest of the universe based on what the black box says, that are just as good as any prediction the human can come up with. After all, the human is in there somewhere. If you think of AIXI as embodying all computable ways of predicting the universe, rather than all computable models of the universe, you may begin to see that’s not quite as narrow as you thought.
Eliezer, that was your position in this thread, and I thought I had convinced you that it was wrong. If that’s not the case, can you please re-read my argument (especially the last few posts in the thread) and let me know why you’re not convinced?
So… the part I found potentially convincing was that if you ran off a logical view of the world instead of a Solomonoff view (i.e., beliefs represented in e.g. higher-order logic instead of Turing machines) and lived in a hypercomputable world then it might be possible to make better decisions, although not better predictions of sensory experience, in some cases where you can infer by reasoning symbolically that EU(A) > EU(B), presuming that your utility function is itself reasoning over models of the world represented symbolically. On the other hand, cousin_it’s original example still looks wrong.
You can make better predictions if you’re allowed to write down your predictions symbolically, instead of using decimal numbers. (And why shouldn’t that be allowed?)
ETA: I made this argument previously in the one-logic thread, in this post.
ETA 2: I think you can also make better (numerical) predictions of the form “this black box is a halting-problem oracle” although technically that isn’t a prediction of sensory experience.
Why would you want to make any predictions at all? Predictions are not directly about value. It doesn’t seem that there is a place for the human concept of prediction in a foundational decision theory.
I think that’s right. I was making the point about prediction because Eliezer still seems to believe that predictions of sensory experience is somehow fundamental, and I wanted to convince him that the universal prior is wrong even given that belief.
Still, universal prior does seem to be a universal way of eliciting what the human concept of prediction (expectation, probability) is, to the limit of our ability to train such a device, for exactly the reasons Eliezer gives: whatever is the concept we use, it’s in there, among the programs universal prior weights.
ETA: On the other hand, the concept thus reconstructed would be limited to talking about observations, and so wouldn’t be a general concept, while human expectation is probably more general than that; you’d need a general logical language to capture it (and a language of unknown expressive power to capture it faithfully).
ETA2: Predictions might still be a necessary concept for expressing the decisions that the agent makes, to connect formal statements with what the agent actually does, and so express what the agent actually does as formal statements. We might have to deal with reality because the initial implementation of FAI has to be constructed specifically in reality.
Umm… what about my argument that a human can represent their predictions symbolically like “P(next bit is 1)=i-th bit of BB(100)” instead of using numerals, and thereby do better than a Solomonoff predictor because the Solomonoff predictor can’t incorporate this? Or in other words, the only reason the standard proofs of Solomonoff prediction’s optimality go through is that they assume predictions are represented using numerals?
Re: “what about my argument that a human can [adapt its razor a little] and thereby do better than a Solomonoff predictor because the Solomonoff predictor can’t incorporate this?”
There are at least two things “Solomonoff predictor” could refer to:
An intelligent agent with Solomonoff-based priors;
An agent who is wired to use a Solomonoff-based razor on their sense inputs;
A human is more like the first agent. The second agent is not really properly intelligent—and adapts poorly to new environments.
Humans are (can be represented by) Turing machines. All halting Turing machines are incorporated in AIXI. Therefore, anything a human can do to predict something more effectively than a “mere machine” is already incorporated into AIXI.
More generally, anything you represent symbolically can be represented using binary strings. That’s how that string you wrote got to me in the first place. You converted the Turing operations in your head into a string of symbols, a computer turned that into a string of digits, my computer turned it back into symbols, and my brain used computable algorithms to make sense of them. What makes you think that any of this is impossible for AIXI?
Am I going crazy, or did you just basically repeat what Eliezer, Cyan, and Nesov said without addressing my point?
Do you guys think that you understand my argument and that it’s wrong, or that it’s too confusing and I need to formulate it better, or what? Everyone just seems to be ignoring it and repeating the standard party line....
ETA: Now reading the second part of your comment, which was added after my response.
ETA2: Clearly I underestimated the inferential distance here, but I thought at least Eliezer and Nesov would get it, since they appear to understand the other part of my argument about the universal prior being wrong for decision making, and this seems to be a short step. I’ll try to figure out how to explain it better.
If 4 people all think you’re wrong for the same reason, either you’re wrong or you’re not explaining yourself. You seem to disbelieve the first, so try harder with the explaining.
Didn’t stop 23+ people from voting up his article … (21 now; I and someone else voted it down)
Well, people expect him to be making good points, even when they don’t understand him (ie, I don’t understand UDT fully, but it seems to be important). Also, he’s advocating further thinking, which is popular around here.
And I really, really wish people would stop doing that, whether it’s for Wei_Dai or anyone else you deem to be smart.
Folks, you may think you’re doing us all a favor by voting someone up because they’re smart, but that policy has the effect of creating an information cascade, because it makes an inference bounce back, accumulating arbitrarily high support irrespective of its relationship to reality.
The content of a post or comment should screen off any other information about its value [1], including who made it.
[1] except in obvious cases like when someone is confirming that something is true about that person specifically
Seconded. Please only vote up posts you both understand and approve of.
I agree, but would like to point out that I don’t see any evidence that people aren’t already doing this. As far as I can tell, Lucas was only speculating that people voted up my post based on the author. Several other of my recent posts have fairly low scores, for example. (All of them advocated further thinking as well, so I don’t think that’s it either.)
The fact that AIXI can predict that a human would predict certain things, does not mean that AIXI can agree with those predictions.
In the limit, even if that one human is the only thing in all of the hypotheses that AIXI has under consideration, AIXI will be predicting precisely as that human does.
BB(100) is computable. Am I missing something?
Maybe… by BB I mean the Busy Beaver function Σ as defined in this Wikipedia entry.
Right, and...
So why can’t the universal prior use it?
Sorry, I should have used BB(2^100) as the example. The universal prior assigns the number BB(2^100) a very small weight, because the only way to represent it computably is by giving a 2^100 state Turing machine. A human would assign it a much larger weight, referencing it by its short symbolic representation.
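To put rough numbers on this (a back-of-the-envelope illustration, assuming a simple length-based weighting of 2^-L for a description that is L bits long): writing down a 2^100-state Turing machine takes at least on the order of 2^100 bits, while the symbolic name “BB(2^100)” is only about ten characters, so roughly

    w_{\text{explicit machine}} \;\lesssim\; 2^{-2^{100}}, \qquad w_{\text{symbolic name}} \;\approx\; 2^{-80}.

The gap between those two weights is the whole point of the example.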
Until I write up a better argument, you might want to (assuming you haven’t already) read this post where I gave a decision problem that a human does (infinitely) better than AIXI.
I don’t think I understood that fully, but there seems to be a problem with your theory. The human gets to start in the epistemically advantaged position of knowing that the game is based on a sequence of busy beavers and knowing that they are a very fast growing function. AIXI is prevented from knowing this information and has to start as if from a blank canvas. The reason we use an Occamian prior for AIXI is that we refuse to tailor it to a specific environment. If your logic is sound, then yes, it does do worse when it is dropped into an environment where it is paired with a human with an epistemic advantage, but it would beat the human across the space of possible worlds.
Another problem you seem to have is assuming that the only hypothesis in the entire set that gives useful predictions is the hypothesis which is, in fact, correct. There are plenty of other functions which correctly predict arbitrarily large numbers of 1s, with much less complexity, and which can contribute enough probability weight that AIXI ends up using a usefully correct model of its universe, if not a fully correct one.
How a human might come to believe, without being epistemically privileged, that a sequence is probably a sequence of busy beavers, is a deep problem, similar to the problem of distinguishing halting oracles from impostors. (At least one mathematical logician who has thought deeply about the latter problem thinks that it’s doable.)
But in any case, the usual justification for AIXI (or adopting the universal prior) is that (asymptotically) it does as well as or better than any computable agent, even one that is epistemically privileged, as long as the environment is computable. Eliezer and others were claiming that it does as well as or better than any computable agent, even if the environment is not computable, and this is what my counter-example disproves.
So you think that we need to rethink our theory of what perfect optimization is, in order to take into account the possibility we live in an uncomputable universe? Even if you are correct in your example, there is no reason to suppose that your human does better in the space of possible uncomputable universes than AIXI, as opposed to better in that one possible (impossible) universe.
Yes.
This seems pretty easy, given the same level of raw computing power available to AIXI (otherwise the human gets screwed in the majority of cases simply because he doesn’t have enough computing power).
For example, I can simply modify AIXI with a rule that says “if you’ve seen a sequence of increasingly large numbers that can’t be explained by any short computable rule, put some weight on it being BB(1)...BB(2^n)...” (and also modify it to reason symbolically about expected utilities instead of comparing numbers), and that will surely be an improvement over all possible uncomputable universes. (ETA: Strike that “surely”. I have to think this over more carefully.)
How to make an optimal decision algorithm (as opposed to just improving upon AIXI) is still an open problem.
This is what I dislike about your logic. You create a situation where (you think) AIXI fails, but you fail to take into account the likelihood of being in the situation versus being in a similar situation. I can easily see a human seeing a long series of ones, with some zeros at the beginning, saying “ah-ha, this must be the result of a sequence of busy beavers”, when all he’s actually seeing is 3^^^3 minus his telephone number or something. AIXI can lose in really improbable universes, because it’s designed to work in the space of universes, not some particular one. By modifying the rules, you can make it better in specific universes, but only by reducing its performance in similar seeming universes.
I don’t understand how the bolded part follows. The best explanation by round BB(2^n) would be “All 1s except for the Busy Beaver numbers up to 2^n”, right?
Yes, that’s the most probable explanation according to the Solomonoff prior, but AIXI doesn’t just use the most probable explanation to make decisions, it uses all computable explanations that haven’t been contradicted by its input yet. For example, “All 1s except for the Busy Beaver numbers up to 2^n and 2BB(2^n)” is only slightly less likely than “All 1s except for the Busy Beaver numbers up to 2^n” and is compatible with its input so far. The conditional probability of that explanation, given what it has seen, is high enough that it would bet on 0 at round 2BB(2^n), whereas the human wouldn’t.
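To make that concrete, here is a toy sketch in Python (the finite, hand-picked hypothesis list and the small “zero positions” stand in for the real space of programs and the actual Busy Beaver values, which obviously would not fit in a runnable example):

    # Toy Solomonoff-style mixture over a hand-picked, finite hypothesis list.
    # Each hypothesis gets prior weight 2^-complexity and predicts the next bit
    # given the bits seen so far; hypotheses contradicted by the data drop to
    # posterior weight zero automatically.

    def make_hypothesis(name, complexity, zero_positions):
        """A hypothesis of the form: 'the sequence is all 1s except 0s at zero_positions'."""
        def predict(t):  # predicted bit at position t (0-indexed)
            return 0 if t in zero_positions else 1
        return {"name": name, "weight": 2.0 ** -complexity, "predict": predict}

    def posterior(hypotheses, observed):
        """Renormalized weights of the hypotheses consistent with the observed bits."""
        weights = {}
        for h in hypotheses:
            consistent = all(h["predict"](t) == bit for t, bit in enumerate(observed))
            weights[h["name"]] = h["weight"] if consistent else 0.0
        total = sum(weights.values())
        return {name: w / total for name, w in weights.items()} if total else weights

    def prob_next_bit_is_zero(hypotheses, observed):
        post = posterior(hypotheses, observed)
        return sum(post[h["name"]] for h in hypotheses
                   if h["predict"](len(observed)) == 0)

    # Illustrative stand-ins: BB-like positions replaced by small numbers so the
    # toy runs; the real argument uses BB(1..2^n) and 2*BB(2^n).
    H1 = make_hypothesis("zeros at 3,7",        complexity=10, zero_positions={3, 7})
    H2 = make_hypothesis("zeros at 3,7 and 14", complexity=12, zero_positions={3, 7, 14})

    observed = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # consistent with both
    print(posterior([H1, H2], observed))              # H1 gets ~0.8, H2 keeps ~0.2
    print(prob_next_bit_is_zero([H1, H2], observed))  # ~0.2: the mixture bets real mass on a 0 at position 14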
Oh.
I feel stupid now.
EDIT: Wouldn’t it also break even by predicting the next Busy Beaver number? “All 1s except for BB(1...2^n+1)” is also only slightly less likely. EDIT: I feel more stupid.
The next number in the sequence is BB(2^(n+1)), not BB(2^n+1).
ETA: In case more explanation is needed, it takes O(2^n) more bits to computably describe BB(2^(n+1)), even if you already have BB(2^n). (It might take O(2^n) more bits to describe BB(2^n+1) as well, but I wasn’t sure so I used BB(2^(n+1)) in my example instead.)
Since K(BB(2^(n+1)) | BB(2^n)) > 100 for large n, AIXI actually will not bet on 0 when BB(2^(n+1)) comes around, and all those 0s that it does bet on are simply “wasted”.
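Putting the weight comparison in symbols (a rough sketch that just restates the O(2^n) figure above, writing w(H) for the prior weight of the hypothesis H = “all 1s except at BB(1), ..., BB(2^n)”):

    w(H \text{ with an extra 0 at } 2\cdot\mathrm{BB}(2^{n})) \;\approx\; w(H)\cdot 2^{-O(1)},
    \qquad
    w(H \text{ extended through } \mathrm{BB}(2^{n+1})) \;\approx\; w(H)\cdot 2^{-\Omega(2^{n})}.

After conditioning on the observed prefix, the first variant keeps a non-negligible share of the posterior, which is why AIXI wastes a bet on 0 at 2·BB(2^n), while the second is penalized so heavily that AIXI will not bet on 0 when BB(2^(n+1)) actually arrives.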
You can find it by emulating the Busy Beaver.
BB(100) is computable—and BB(2^100) is computable too :-(
Surely predictions of sensory experience are pretty fundamental. To understand the consequences of your actions, you have to be able to make “what-if” predictions.
Re: “It doesn’t seem that there is a place for the human concept of prediction in a foundational decision theory.”
You can hardly steer yourself effectively into the future if you don’t have an understanding of the consequences of your actions.
Yes, it might be necessary exactly for that purpose (though consequences don’t reside just in the “future”), but I don’t understand this well enough to decide either way.
I checked with the dictionary. It had:
the effect, result, or outcome of something occurring earlier: The accident was the consequence of reckless driving.
an act or instance of following something as an effect, result, or outcome.
http://dictionary.reference.com/browse/consequence
Consequences not being in the future seems to be a curious concept to me—though I understand that Feynman dabbled with the idea on sub-microscopic scales.
I think we’ve got it covered with Newcomb’s Problem (consequences in the past) and Counterfactual Mugging (consequences in another possible world). And there is still greater generality with logical consequences.
FWIW, I wouldn’t classify Newcomb’s Problem as having to do with “consequences in the past” or Counterfactual Mugging as having to do with “consequences in another possible world”.
For me, “consequences” refers to the basic cause-and-effect relationship—and consequences always take place downstream.
Anticipating something doesn’t really mean that the future is causally affecting the past. If you deconstruct anticipation, it is all actually based on current and previous knowledge.
You are arguing definitions (with the use of a dictionary, no less!). The notion of consequences useful for decision theory is a separate idea from causality of physics.
Is “consequences” really a good term for what you are talking about?
It seems as though it is likely to cause confusion to me.
Does anyone else use the term in this way?
That took two days to parse, but now I understand how it works. You’re right. I apologize to everyone for having defended an incorrect position.
My misconception seems to be popular, though. Maybe someone should write a toplevel post on the right way to think about the universal prior. Though seeing that some other people are even more hopelessly confused than me, and seem to struggle with the idea of “prior” per se, I’m not sure that introducing even more advanced topics would help.
Yes, the human is in there somewhere, but so are many other, incorrect predictors. To adopt their predictions as its own, AIXI needs to verify them somehow, but how? (I’m very confused here and may be missing something completely obvious.)
ETA: yeah, this is wrong, disregard this.
I don’t know much about Solomonoff induction, so I may be wrong about this, but is it not the case that the universal prior only takes into account computable functions which exactly output the sensory data? If that is the case, consider the following scenario:
We have a function F which takes an unbounded natural number N as input and is provably uncomputable for all valid inputs. We have a computable algorithm A which provably outputs lower and upper bounds for F for any valid input. Furthermore, it is provable that no computable algorithm can provably produce tighter bounds on F’s output than A (regardless of N). We can see that A outputs the bounds for a closed interval in the set of real numbers. We know that all such intervals (for which the lower and upper bounds are not equal) are uncountable. Now imagine a physical hypercomputer which outputs F(0), then F(1), then F(2), etc. to infinity. No computable algorithm will be able to predict the next symbol output by this hypercomputer, but there will be computable minds capable of recognizing the pattern and so of using A to place stronger bounds on its predictions of future sensory experience than AIXI can.
EDIT: Actually, this scenario might be broken. Specifically, I’m not sure what it physically means to ‘output’ an uncomputable number, and I think that AIXI’s problem dissolves if we limit ourselves to the computable (and thus countable) subsets of the output intervals.
If you’re not convinced by my argument but can’t explain why or don’t have time to, can you please say that as well? Right now I’m not sure if you were convinced, and then forgot the discussion and went back to your previous position, or what.
Is there a good exposition of this semantics (more generally, for algorithmic probability)?
AIXI only includes all prediction models that are 100% accurate. I don’t think the human is capable of coming up with 100% accurate predictions.
Thought: The human can’t make predictions at all about the black box, but he can use it to predict the outcomes of various computable processes. AIXI can already predict the outcomes of computable processes, and doesn’t need the black box.
I think this problem would vanish if you spelled out what “believe” means. The Bayesian superintelligence would quickly learn to trust the opinion of the halting problem oracle; therefore, it would “believe” it.
I am having a few problems in thinking of a sensible definition of “believe” in which the superintelligence would fail to believe what its evidence tells it is true. It would be especially obvious if the machine was very small. The superintelligence would just use Occam’s razor—and figure it out.
Of course, one could imagine a particularly stupid agent, that was too daft to do this—but then it would hardly be very much of a superintelligence.
P(true) = 0 - or P(false) = 1 - are trivial mistakes to avoid.
An “expected utility maximizer programmed with a TM-based universal prior” would surely not care very much if it was programmed with wrong priors after a while—since it would no longer be depending much on the details of its priors—due to having a big mountain of experience concerning the actual frequency of events. Its priors would be swamped by data—unless its priors were completely crazy.
The OP must be thinking of some different type of construct from me—and he doesn’t seem to explain what it is.
Unfortunately they aren’t. A universal prior must enumerate all the ways a universe could possibly be. If your prior is based on Turing machines that compute universes, but our actual universe is uncomputable, you’re screwed forever no matter what data comes in. Maybe the problem can be solved by a better universal prior, as Nesov suggests elsewhere in the thread, but as far as I understand it’s an open problem right now.
ETA: pretty much this whole comment is wrong. The prior is over algorithms that generate sequences of sensory input, not over algorithms that define universes. This is an important distinction, sorry for missing it when I wrote this comment.
Being forced to use the nearest computable approximation to an uncomputable function does not make you screwed forever.
That depends on the uncomputable function. Some can make you very well screwed indeed. It’s all there in Wei Dai’s examples on everything-list and one-logic, I really wish people would read them, maybe we’d have an actual discussion then. Sorry for sounding harsh.
Right, but it’s not necessarily true, or even likely, hence my point.
I did read the links (including the link to the empty stub article!), and the Google group discussions all seemed to end, from my brief perusal of them, with a consensus that Wei Dai hadn’t established his provocative, counterintuitive point. (And some of the exchanges here show the same.)
At the very least, he should summarize the reasoning or examples, as per standard practice, so we know there’s something to be gained from going to the links. This is especially true given that most readers had assumed that the opposite of Wei Dai’s premises is true and uncontroversial.
Natural selection solves this problem.
To avoid such a trivial mistake, just follow the advice on:
http://lesswrong.com/lw/mp/0_and_1_are_not_probabilities/
Re: “If your prior is based on Turing machines that compute universes, but our actual universe is uncomputable, you’re screwed forever no matter what data comes in.”
No way! Agents are generally not crippled by their priors! Soon enough they have actual data, and—unless their priors are crazy—those priors are quickly swamped—and the agents don’t need the details of their original priors any more—because they have real info to go on instead.
I’m not sure—did you miss the idea that all “universal priors” that we know how to construct today assign zero credence to certain hypotheses about the universe, or did you miss the idea that a zero-credence hypothesis will never rise above zero no matter how much data comes in, or is it me who’s missing something?
Also, correct me if I’m wrong, but doesn’t the Solomonoff prior bypass the issue of explicit hypotheses? That is, it puts a (non-zero) prior on every (prefix) bitstream of sensory data.
So, it doesn’t even seem necessary to talk about what such an agent’s “priors on hypotheses” are—everything it believes is encoded as an expectation of sensory data, and nothing more. It does not explicitly represent concepts like, “this thing is a halting oracle”.
Instead, when it encounters a halting oracle, it increases the weight it assigns to expectations of observations of things that are consistent with having been produced by a halting oracle, not the existence of a halting oracle as such.
No matter how uncomputable, or lengthy-to-specify, a function might be, you can always finitely specify your expectation weight on a finite observation prefix stream (i.e. the first n things you observe from the oracle).
So, I don’t see how an agent with a Solomonoff prior chokes on an encounter with a halting oracle.
Normal agents won’t. Genuinely intelligent agents won’t.
I think those who are arguing that it will are imagining an agent with the Solomonoff prior totally wired into them in a manner that they can’t possibly unlearn.
But still, even if you have the Occamian prior (which I think is what’s meant by the Solomonoff prior), there is no need to unlearn it. You retain a prior on all hypotheses that decreases in weight exponentially with length, and it persists on top of any observations you’ve updated on. Those new observations, combined with the Occamian prior give you the optimal weights on (prefix) sensory bitstreams, discounting the ruled-out ones and favoring those closer to what you’ve actually observed.
Even then, it keeps updating in favor of the observations that match what an oracle gives (without having to explicitly represent that they’re from an oracle). No penalty from failure to unlearn.
The thing is, there is no one true razor. Different sources have different associated reference machines—some are more like Turing Machines, others are more like CA. If what you are looking at is barcodes, then short ones are pretty rare—and if you go into simulated worlds, sources can have practically any distribution you care to mention.
Yes, you can model these as “compiler overhead” constants—which represent the “cost” of simulating one reference machine in another—but that is just another way of saying you have to unlearn the Solomonoff prior and use another one—which is more appropriate for your source.
You can still do that, whatever your reference machine is—provided it is computationally universal—and doesn’t have too much “faith”.
I’m not sure exactly what can qualify as a prior.
Is “Anomalies may be clues about a need to make deep changes in other priors” a possible prior?
A prior is not a program that tells you what to do with the data. A prior is a set of hypotheses with a number assigned to each. When data comes in, we compute the likelihoods of the data given each hypothesis on the list, and use these numbers to obtain a posterior over the same hypotheses. There’s no general way to have a “none of the above” (NOTA) hypothesis in your prior, because you can’t compute the likelihood of the data given NOTA.
Another equivalent way to think about it: because of the marginalization step (dividing everything by the sum of the prior-weighted likelihoods), Bayesian updating doesn’t use the total likelihood of the data given all current hypotheses—only the relative likelihoods of one hypothesis compared to another. This isn’t easy to fix because “total likelihood” is a meaningless number that doesn’t indicate anything—it could easily be 1000 in a setup with an incorrect prior or 0.001 in a setup with a correct prior.
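A minimal sketch of that point (toy Python over an explicit two-hypothesis list; nothing here is specific to the universal prior): multiplying every likelihood by the same constant changes the “total likelihood” but leaves the posterior exactly the same, because only the ratios survive normalization.

    # Bayes' rule over an explicit, finite hypothesis list. Scaling every
    # likelihood by the same constant changes the "total likelihood" but leaves
    # the posterior untouched -- only likelihood ratios matter.

    def bayes_update(prior, likelihood):
        """prior: {hypothesis: P(H)}, likelihood: {hypothesis: P(data | H)}."""
        unnormalized = {h: prior[h] * likelihood[h] for h in prior}
        total = sum(unnormalized.values())  # this is P(data), the marginal
        return {h: w / total for h, w in unnormalized.items()}

    prior = {"fair coin": 0.5, "two-headed coin": 0.5}
    likelihood = {"fair coin": 0.5 ** 10, "two-headed coin": 1.0}  # ten heads seen

    print(bayes_update(prior, likelihood))
    print(bayes_update(prior, {h: 1000 * p for h, p in likelihood.items()}))  # same posterior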
People have beliefs about how various sorts of behavior will work out, though I think it’s rare to have probabilities attached.
If you are assigning p(some empirical hypothesis) = 0, surely you are a broken system.
The example seems to be that using a Turing machine to generate your priors somehow results in an expectation of p(uncomputable universe)=0. That idea just seems like total nonsense to me. It just doesn’t follow. For all I care, my priors could have been assigned to me using a Turing machine model at birth—but I don’t think p(uncomputable universe)=0. The whole line of reasoning apparently makes no sense.
The universal prior enumerates all Turing machines, not all possible priors generated by all Turing machines.
Priors are probability estimates for uncertain quantities.
In Solomonoff induction they are probability estimates for bitstrings—which one can think of as representing possible sensory inputs for an agent.
With a standard TM length-based encoding, no finite bitstring is assigned a zero probability—and we won’t have to worry about perceiving infinite bitstrings until after the universal heat death—so zero prior probabilities are not a problem for anything an agent could actually observe.
Whether the bitstrings were created using uncomputable physics is neither here nor there. They are still just bitstrings—and so can be output by a TM with a finite program on its tape.
No, sorry. You’re confused. A prior is not an assignment of credences to all bitstrings that you can observe. A prior is an assignment of credences to hypotheses, i.e. possible states of the world that generate bitstrings that you observe. Otherwise you’d find yourself in this text (see part II, “Escaping the Greek Hinterland”).
No. We were talking about the universal prior. Here is how that is defined for sequences:
“The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p.”
http://en.wikipedia.org/wiki/Inductive_inference
The universal prior of a sequence is the probability of that particular sequence arising (as a prefix). It is not the probability of any particular hypothesis or program. Rather, it is a weighted sum of the probabilities of all the programs that generate that sequence.
You can talk about the probabilities of hypotheses and programs as well if you like—but the universal prior of a sequence is perfectly acceptable subject matter—and is not a “confused” idea.
No finite sequence has a probability of zero—according to the universal prior.
All finite bitstrings can be produced by computable means—even if they were generated as the output of an uncomputable physical process.
Is this misconception really where this whole idea arises from?
This is all true, but… Why do you think the universal prior talks about computer programs at all? If I only wanted a prior over all finite bitstrings, I’d use a simpler prior that assigned every string of length N a credence proportional to 2^-N. Except that prior has a rather major shortcoming: it doesn’t help you predict the future! No matter how many bits you feed it, it always says the next bit is going to be either 0 or 1 with probability 50%. It will never get “swamped” by the data, never gravitate to any conclusions. This is why we want the universal prior to be based on computer programs instead: it will work better in practice, if the universe is in fact computable. But what happens if the universe is uncomputable? That’s the substantive question here.
ETA: the last two sentences are wrong, disregard them.
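For what it’s worth, the non-retracted part of the comment above (that a purely length-based prior never learns) can be shown in one line. Reading the 2^-N prior as the uniform measure λ(x) = 2^-|x| on prefixes, for any observed string x:

    \frac{\lambda(x1)}{\lambda(x)} \;=\; \frac{2^{-(|x|+1)}}{2^{-|x|}} \;=\; \frac{1}{2},

so the predicted probability of the next bit is 1/2 no matter what has been seen; such a prior never gravitates toward any regularity in the data.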
Nothing much happens to intelligent agents—because an intelligent agent’s original priors mostly get left behind shortly after they are born—and get replaced by evidence-based probability estimates of events happening. If convincing evidence comes in that the world is uncomputable, that just adds to the enormous existing stack of evidence they have about the actual frequencies of things.
Anyhow, priors being set to 0 or 1 is not a problem for observable sense data. No finite sense data has p assigned to 0 or 1 under the universal prior—so an agent can always update successfully—if it gets sufficient evidence that a sequence was actually produced. So, if it sees a system that apparently solves the halting problem for arbitrary programs, that is no big deal for it. It may have found a Turing oracle! Cool!
I suppose it might be possible to build a semi-intelligent agent with a particular set of priors permanently wired into it—so the agent was incapable of learning and adapting if its environment changed. Organic intelligent agents are not much like that—and I am not sure how easy it would be to build such a thing. Such agents would be incapable of adapting to an uncomputable world. They would always make bad guesses about uncomputable events. However, this seems speculative—I don’t see why people would try to create such agents. They would do very badly in certain simulated worlds—where Occam’s razor doesn’t necessarily hold true—and it would be debatable whether their intelligence was really very “general”.
The reason the universal prior is called “universal” is that, given initial segments of infinite strings drawn from any computable distribution, and updating on those samples, it will in fact converge to the actual distribution over what the next bit should be. Now I’ll admit to not actually knowing the math here, but it seems to me that if most any prior had that property, as you seem to imply, we wouldn’t need to talk about a universal prior in the first place, no?
Also, if we interpret “universe” as “the actual infinite string that these segments are initial segments of”, then, well… take a look at that sum you posted and decompose it. The universal prior is basically assigning a probability to each infinite string, namely the sum of the probabilities of the programs that generate it, and then collapsing that down to a distribution on initial segments in the obvious way. So if we want to consider its hypotheses about the actual law of the universe, the whole string, it will always assign 0 probability to an uncomputable sequence.
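In symbols (a sketch of the decomposition being described, glossing over programs that only ever produce finite output; U is the reference machine):

    M(x) \;=\; \sum_{p \,:\, U(p)\text{ begins with } x} 2^{-|p|}
         \;=\; \sum_{\omega \,\text{ extending } x} \; \sum_{p \,:\, U(p) = \omega} 2^{-|p|},

so each infinite sequence ω carries total weight Σ_{p : U(p) = ω} 2^-|p|, which is zero whenever ω is not the output of any program, i.e. whenever ω is uncomputable.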
Convergence is more the result of the updates than of the original prior. All the initial prior has to do, for convergence to happen, is avoid being completely ridiculous (probabilities of 1 or 0, infinitesimals, etc.). The idea of a good prior is that it helps initially, before an agent has any relevant experience to go on. However, that doesn’t usually last for very long—real organic agents are pretty quickly flooded with information about the state of the universe, and are then typically in a much better position to make probability estimates. You could build agents that were very confident in their priors—and updated them slowly—but only rarely would you want an agent that was handicapped in its ability to adapt and learn.
Picking the best reference machine would be nice—but I think most people understand that for most practical applications, it doesn’t matter—and that even a TM will do.
Are you certain of this? Could you provide some sort of proof or reference, please, ideally together with some formalization of what you mean by “completely ridiculous”? I’ll admit to not having looked up a proof of convergence for the universal prior or worked it out myself, but if what you say were really the case, there wouldn’t actually be very much special about the universal prior, and this convergence property of it wouldn’t be worth pointing out—so I think I have good reason to be highly skeptical of what you suggest.
Better, yes. But good enough? Arbitrarily close?
Sorry, but what does this even mean? I don’t understand how this notion of “update speed” translates into the Bayesian setting.
Here’s Shane Legg on the topic of how little priors matter when predicting the environment:
“In some situations, for example with Solomonoff induction, the choice of the reference machine doesn’t matter too much. [...] the choice of reference machine really doesn’t matter except for very small data sets (which aren’t really the ones we’re interested in here). To see this, have a look at the Solomonoff convergence bound and drop a compiler constant in by the complexity of the environment. The end result is that the Solomonoff predictor needs to see just a few more bytes of the data sequence before it converges to essentially optimal predictions.”
http://www.vetta.org/2009/05/on-universal-intelligence/
This doesn’t address what I said at all. We don’t speak of “the” universal prior because there’s a specific UTM it’s defined with respect to, we speak of “the” universal prior because we don’t much care about the distinction between different universal priors! The above article is still about doing Bayesian updating starting with a universal prior. That which particular universal prior you start from doesn’t matter much is not new information and in no way supports your claim that any “reasonable” prior—whatever that might mean—will also have this same property.
I think when he says “the choice of the reference machine doesn’t matter too much” and “the choice of reference machine really doesn’t matter except for very small data sets” he literally means those things. I agree that my position on this is not new.
Sorry, how does “literally” differ from what I stated? And you seem to be stating something very different from him. He is just stating that the UTM used to define the universal prior is irrelevant. You are claiming that any “reasonable” prior, for some unspecified but expansive-sounding notion of “reasonable”, has the same universal property as a universal prior.
That seems like quite a tangle, and alas, I am not terribly interested in it. But:
The term was “reference machine”. No implication that it is a UTM is intended—it could be a CA—or any other universal computer. The reference machine totally defines all aspects of the prior. There are not really “universal reference machines” which are different from other “reference machines”—or if there are, “universal” just refers to universal computation. A universal machine can define literally any distribution of priors you can possibly imagine. So: the distinction you are trying to make doesn’t seem to make much sense.
Convergence on accurate beliefs has precious little to do with the prior—it is a property of the updating scheme. The original priors matter little after a short while—provided they are not zero, one—or otherwise set so they prevent updating from working at all.
Thinking of belief convergence as having much to do with your priors is a wrong thought.
Sorry, what? Of course it can be any sort of universal computer; why would we care whether it’s a Turing machine or some other sort? Your statement that taking a universal computer and generating the corresponding universal prior will get you “literally any distribution of priors you can imagine” is just false, especially as it will only get you uncomputable ones! Generating a universal prior will only get you universal priors. Perhaps you were thinking of some other way of generating a prior from a universal computer? Because that isn’t what’s being talked about.
You have still done nothing to demonstrate this. The potential for dependence on priors has been demonstrated elsewhere (anti-inductive priors, etc). The “updating scheme” is Bayes’ Rule. (This might not suffice in the continuous-time case, but you explicitly invoked the discrete-time case above!) But to determine all those probabilities, you need to look at the prior. Seriously, show me (or just point me to) some math. If you refuse to say what makes a prior “reasonable”, what are you actually claiming? That the set of priors with this property is large in some appropriate sense? Please name what sense. Why should we not just use some equivalent of maxent, if what you say is true?
“Of course it can be any sort of universal computer; why would we care whether it’s a Turing machine or some other sort?”
Well, different reference machines produce different prior distributions—so the distribution used matters initially, when the machine is new to the world.
“Your statement that taking a universal computer and generating the corresponding universal prior will get you “literally any distribution of priors you can imagine” is just false, especially as it will only get you uncomputable ones! ”
“Any distribution you can compute”, then—if you prefer to think that you can imagine the uncomputable.
“You have still done nothing to demonstrate this.”
Actually, I think I give up trying to explain. From my perspective you seem to have some kind of tangle around the word “universal”. “Universal” could usefully refer to “universal computation” or to a prior that covers “every hypothesis in the universe”. There is also the “universal prior”—but I don’t think “universal” there has quite the same significance that you seem to think it does. There seems to be repeated miscommunication going on in this area.
It seems non-trivial to describe the class of priors that leads to “fairly rapid” belief convergence in an intelligent machine. Suffice to say, I think that class is large—and that the details of priors are relatively insignificant—provided there is not too much “faith”—or “near faith”. Part of the reason for that is that priors usually get rapidly overwritten by data. That data establishes its own subsequent prior distributions for all the sources you encounter—and for most of the ones that you don’t. If you don’t agree, fine—I won’t bang on about it further in an attempt to convince you.
Firstly, please use Markdown quotes for ease of reading? :-/
Indeed, but I don’t think that’s really the property under discussion.
....huh? Maybe you are misunderstanding the procedure in question here. We are not taking arbitrary computations that output distributions and using those distributions. That would get you arbitrary computable distributions. Rather, we are taking arbitrary universal computers/UTMs/Turing-complete programming languages/whatever you want to call them, and then generating a distribution as “the probability of x is the sum, over all programs that output something beginning with x, of 2^-length” (possibly normalized). I.e. we are taking a reference machine and generating the corresponding universal prior.
Not only will this not get you “any distribution you can compute”, it won’t get you any distributions you can compute at all. The resulting distribution is always uncomputable. (And hence, in particular, not practical, and presumably not “reasonable”, whatever that may mean.)
Am I mistaken in asserting that this is what was under discussion?
You don’t have to attempt to convince me, but do note that despite asserting it repeatedly you have, in fact, done zero to establish the truth of this assertion / validity of this intuition, which I have good reason to believe to be unlikely, as I described earlier.
FWIW, what I meant was that—by altering the reference machine, p() - for all bitstrings less than a zillion bits long—can be made into any set of probabilities you like—provided they don’t add up to more than 1, of course.
The reference machine defines the resulting probability distribution completely.
AH! So you are making a comment on the use of universal priors to approximate arbitrary finite priors (and hence presumably vice versa). That is interesting, though I’m not sure what it has to do with eventual convergence. You should have actually stated that at some point!
Re: “I don’t understand how this notion of “update speed” translates into the Bayesian setting.”
Say you think p(heads) is 0.5. If you see ten heads in a row, do you update p(heads) a lot, or a little? It depends on how confident you are of your estimate.
If you had previously seen a thousand coin flips from the same coin, you might be confident of p(heads) being 0.5 - and therefore update little. If you were told that it was a biased coin from a magician, then your estimate of p(heads) being 0.5 might be due to not knowing which way it was biased. Then you might update your estimate of p(heads) rapidly—on seeing several heads in a row.
Like that.
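A concrete version of that coin example (a toy Python sketch; the particular priors, a heavily observed coin modelled as Beta(500, 500) versus a fifty-fifty guess between a 0.9-biased and a 0.1-biased coin, are illustrative choices rather than anything from the thread):

    # Two agents both currently report p(heads) = 0.5, then see ten heads in a row.
    # Agent A has already seen ~1000 flips of this coin (modelled as a Beta(500, 500)
    # belief about the bias). Agent B was told "it's a magician's coin, biased one way
    # or the other" (modelled as a 50/50 mix of bias 0.9 and bias 0.1).

    def beta_predictive(heads, tails, a=500, b=500):
        """P(next flip is heads) under a Beta(a, b) prior, after the observed flips."""
        return (a + heads) / (a + b + heads + tails)

    def mixture_predictive(heads, tails, biases=(0.9, 0.1), weights=(0.5, 0.5)):
        """P(next flip is heads) under a discrete mixture over possible biases."""
        joint = [w * (p ** heads) * ((1 - p) ** tails) for p, w in zip(biases, weights)]
        total = sum(joint)
        return sum(p * j / total for p, j in zip(biases, joint))

    print(beta_predictive(10, 0))     # ~0.505: barely moves ("slow updating")
    print(mixture_predictive(10, 0))  # ~0.9:   moves a lot   ("fast updating")

Note that in both cases there is exactly one Bayesian update per flip; what differs is the prior, which is what the rest of the exchange below turns on.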
What you have just laid out are not different “update speeds” but different priors. “It’s a biased coin from a magician” is of the same class of prior assumptions as “It’s probably a fair coin” or “It’s a coin with some fixed probability of landing heads, but I have no idea what” or “It’s a rigged coin that can only come up heads 10 times once activated”.
After each toss, you do precisely one Bayesian update. Perhaps the notion of “update speed” might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not. The amount you update is determined by Bayes’ Law; different apparent “update speeds” are due to differing priors. “Speed” probably isn’t even a good term, as updates aren’t even necessarily in the same direction! If you think the coin can only come up heads 10 times, each appearance of heads makes it less likely to come up again.
“Update speed” seems fine to me—when comparing:
.5, .500001, .500002, .500003, .500004...
....with...
.5, 0.7, 0.9, 0.94, 0.96
...but use whatever term you like.
That’s a statistic, not a parameter—and it’s a statistic ultimately determined by the prior.
I do not know where the idea that “speeds” are “parameters” and not “statistics” comes from. An entity being a statistic doesn’t imply that it is not a speed.
The same goes for discrete systems. They have the concept of speed too:
http://en.wikipedia.org/wiki/Glider_%28Conway%27s_Life%29
This is utterly irrelevant. The problem with what you say is not that there’s no notion of speed, it’s that there is precisely one way of doing updates, and it has no “speed” parameter.
In the game of life, the update speed is always once per generation. However, that doesn’t mean it has no concept of speed. In fact the system exhibits gliders with many different speeds.
It’s much the same with an intelligent agent’s update speed in response to evidence—some will update faster than others—depending on what they already know.
You claimed that:
“Perhaps the notion of “update speed” might make sense in a more continuous setting, but in a discrete setting like this it is clear it does not.”
However, the concept of “speed” works equally well in discrete and continuous systems—as the GOL illustrates. “Discreteness” is an irrelevance.
You really seem to be missing the point here. I’m sorry but from your posts I can’t help but get the idea that you don’t really understand how this sort of prediction scheme works. Sure, “update speed” in the sense you described it elsewhere in the thread makes sense, but who cares? Update speed in the sense you described it elsewhere in the thread is a consequence of the prior (or current state, rather), it isn’t some sort of parameter, and it’s not clear it’s something at all stable or meaningful. You’ve established the existence of something trivial and probably irrelevant. In the parametric sense you seemed to be originally using it, it doesn’t exist. Can we agree on this?
Probably nobody cares—apart from you, it seems. Apparently, one can’t get away with using the phrase “update speed” in connection with an intelligent agent without getting bounced.
When you said:
“I don’t understand how this notion of “update speed” translates into the Bayesian setting.”
...and I said...
“Say you think p(heads) is 0.5. If you see ten heads in a row, do you update p(heads) a lot, or a little? It depends on how confident you are of your estimate. If you had previously seen a thousand coin flips from the same coin, you might be confident of p(heads) being 0.5 - and therefore update little. If you were told that it was a biased coin from a magician, then your estimate of p(heads) being 0.5 might be due to not knowing which way it was biased. Then you might update your estimate of p(heads) rapidly—on seeing several heads in a row. Like that.”
...IMO, the conversation could and should have stopped—right there.
This is not analogous. We are speaking of a complete system here.
I have already addressed this. What you have called “update speed” is determined by the current distribution.
Re: “We are speaking of a complete system here”
I assure you that I could exhibit a GOL field that consisted entirely of spaceships moving at c/2 - and then exhibit another GOL field that consisted entirely of gliders moving at c/4. These systems would have different characteristic speeds. Hopefully, you see the analogy now.
OK, sure. But then to continue the analogy, the resulting speed is a function of the initial configuration. :)
Just so. I never said otherwise. You already asked for clarification about whether I thought that the system’s “update speed” was to do with its prior Prob. Dist. - and I said that “yes”, it was.
Hm; there may not be a disagreement here. You seemed to be using it in a way that implied it was not determined by (or even was independent of) the prior. Was I mistaken there?
The idea was that some agents update faster than others (or indeed not at all).
If you like you can think of the agents that update relatively slowly as being confident that they are uncertain about the things they are unsure about. That confidence in their own uncertainty could indeed be represented by other priors.
That’s not “other priors”, there’s just one prior. All the probabilities in Bayes’ Rule come from the updated-to-current version of the prior.
Other prior probabilities. There is one prior set of probabilities, which is composed of many prior probabilities and probability distributions.
If you want to think about it that way, please don’t say “other priors”. That’s very confusing, because “prior” in this context refers to the whole prior, not to pieces of it (which I’m not sure how you’re detangling from each other, anyway). If we’re talking about something of the universal-prior sort, it has one prior, over its total sensory experience; I’m not clear how you’re decomposing that or what alternative model you are suggesting.
The two types of prior probability I discussed were “beliefs about the world” and “beliefs about the certainty of those beliefs”.
An agent that updates its beliefs about the world rapidly (in response to evidence) would have a low degree of certainty about those beliefs—while an agent that updates its beliefs about the world slowly would have a high degree of certainty that those beliefs were already correct—and were backed up with lots of existing evidence.
I gave an example of this already when I discussed the magician’s coin.
Except these aren’t separate things. That isn’t how this sort of system works! Its beliefs about the certainty of those beliefs are determined by its beliefs about the world.
Well, everything is about the world, if materialism is true.
You don’t seem to be even trying to perform a sympathetic reading. Leave aside quibbling about what is to do with the world—can you at least see that in the first case, updates happen quickly, and in the second case they happen slowly? “Speed” just refers to distance divided by time. Here distance is the probability delta, and time is simply time. So, updates can happen fast and slow. Some systems update quickly, others update slowly—and others don’t update at all. This all seems fairly simple to me—what is the problem?
Well, sure. But that statement is trivial.
Right. I really don’t think that what I am saying is controversial. The way I remember it, I talked about systems with different update speeds—and you jumped on that.
Alternatively, I could say, I went with the assumption that you were attempting to carve the relevant concepts at the joints and getting it wrong, rather than that you were making a true statement which doesn’t even try to accomplish that.
M, sorry then. But you didn’t explain the term anywhere, so I assumed it meant what it sounded like—the original context makes it sound like you mean something separate from the prior, rather than something determined by it. If instead of talking about building agents that were “confident in their priors” and “updated them slowly” you had just spoken of “priors that result in slow updating”, I don’t think there would have been a problem. (I must admit I probably also wasn’t inclined to look for a sympathetic reading, as your other comments about the universal prior seem to be just wrong.)
Re: “there wouldn’t actually be be very much special about the universal prior”
Well, Occam’s razor is something rather special. However, agents don’t need an optimal version of it built into them as a baby—they can figure it out from their sensory inputs.
Uncomputable strings are all infinite. Those concerned with the real world don’t have to concern themselves with such things. Everything you encounter in practice is finite, and therefore computable.
And if you are faced with a black box emitting one digit at a time, how will you do better than maxent?
Prior determines how evidence informs your estimates, what things you can consider. In order to “replace priors with evidence-based probability estimates of events”, you need a notion of event, and that is determined by your prior.
Prior evaluates, but it doesn’t dictate what is being evaluated. In this case, “events happening” refers to subjective anticipation, which in turn refers to prior, but this connection is far from being straightforward.
“Determined” in the sense of “weakly influenced”. The more actual data you get, the weaker the influence of the original prior becomes—and after looking at the world for a little while, your original priors become insignificant—swamped under a huge mountain of sensory data about the actual observed universe.
Priors don’t really affect what things you can consider—since you can consider (and assign non-zero probability to) receiving any sensory input sequence.
I use the word “prior” in the sense of priors as mathematical objects, meaning all of your starting information plus the way you learn from experience.
I can’t quite place “you need a notion of event, and that is determined by your prior”, but I guess the mapping between sample space and possible observations is what you meant.
Well yes, you have “priors” that you learn from experience. An uncomputable world is not a problem for them—since you can learn about uncomputable physics, the same way as you learn about everything else.
This whole discussion seems to be a case of people making a problem out of nothing.
Well yes, you can have “priors” that you have learned from experience. An uncomputable world is not a problem in that case either—since you can learn about uncomputable physics, in just the same way that you learn about everything else.
This whole discussion seems to be a case of people making a problem out of nothing.
Yeah, he really saw the light, but dropped the ball, when writing that stormy bag of mixed metaphors.
Gee, that was obviously supposed to be a non-mixed metaphor about urban foraging. Yeah that’s it. :)
Seriously, I thought about sticking with the fruit tree metaphor, but “hanging right over our heads” makes the problem sound too easy, so I decided to favor accuracy over literary elegance.