But this is different than requiring that the inputs be drawn from a certain probability distribution (in other words, requiring 5% of the inputs to be 0000, 7% to be 0001, 6% to be 0010, etc.)
Well, I don’t know of a single piece of software which requires that its inputs come from specific probability distributions in addition to satisfying some set of properties.
In fact, wouldn’t that software object have to maintain some running counts to even be able to estimate from which distribution do its inputs come?
Well, I don’t know of a single piece of software which requires that its inputs come from specific probability distributions in addition to satisfying some set of properties.
This is in some sense my point. If you want to be so hardcore Bayesian that you even look down on randomized algorithms in software, then presumably the alternative is to form a subjective probability distribution over the inputs to your algorithm (or perhaps there is some other obvious alternative that I’m missing). But I don’t know of many pieces of code that require their inputs to conform to such a subjective probability distribution; rather, the code should work for ALL inputs (i.e. do a worst-case rather than average-case analysis, which will in some cases call for the use of randomness). I take this to be evidence that the “never use randomness” view would call for software that most engineers would consider poorly-designed, and as such is an unproductive view from the perspective of designing good software.
Any software that uses randomness requires you to meet a probability distribution over its inputs, namely that the random input needs to be random. I assume that you’re not claiming that this breaks modularity, as you advocate the use of randomness in algorithms. Why?
Based on the discusssions with you and Lumifer, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.
The probability distribution part is better, though I still don’t see how software that uses randomness doesn’t fall under that (likewise: compression, image recognition, signal processing, and decision making algorithms).
If the software generates its own randomness (for instance, in JAVA you can do this by creating a new instance of a Random object and calling nextInt(), etc.) then I don’t see how this breaks modularity.
Compression algorithms like bzip don’t promise to make things smaller as part of their contract, they only promise to be lossless. To the extent that a user relies on them becoming smaller consistently, this can lead to trouble, for instance if I expect by inputs of size 10 kilobytes to compress down to 2 kilobytes, and then many of the inputs in a row stay at 10 kilobytes, I can get unexpected load that could create issues.
I don’t know of many image recognition algorithms that are integrated into a large system without a human in the loop, except perhaps Google Image Search which arguably has a human (the end-user) that filters the results at the end. This is precisely due to their non-modularity: they fail in unexpected and difficult-to-characterize ways, such that any attempt to enforce a contract or abstraction would be probably misguided at this point. The best they can currently promise is “we correctly implement the fast Fourier transform” and leave it up to the programmer to decide whether an FFT is what is merited in a given case.
ETA: Another way to put the bzip example which might fit more with your language, is that yes the guarantee of bzip is that it is lossless and that it will make your data smaller as long as the data fit the properties that bzip attempts to exploit, and if that is what we want the contract to be then I would agree that bzip is non-modular.
for instance, in JAVA you can do this by creating a new instance of a Random object and calling nextInt()
nitpick: Java default PRNG is a linear congruential generator. It’s not as bad as the infamous RANDU, but I won’t suggest to use it for anything that needs more than a hundred or so pseudo-random bits.
If you want to be so hardcore Bayesian that you even look down on randomized algorithms in software
Do such people exist? I don’t know anyone who fits such a description.
I take this to be evidence that the “never use randomness” view would call for software that most engineers would consider poorly-designed, and as such is an unproductive view from the perspective of designing good software.
Well, of course. But again, has anyone been arguing against that? The original dispute was talking about abstract optimality and whether particular solutions exist. No one suggested that in real-world practice at the current level of technology rand() Call Considered Harmful.
Well, let’s put it this way. If you can implement a non-randomized solution that works within the specified constraints (time, etc.), it is preferable to a probabilistic one. The real question is what happens when you can’t. I doubt that Eliezer would refuse to implement a probabilistic solution on the grounds that it’s not pure enough and so no solution at all is better than a version tainted by its contact with the RNG.
An exception might be when you need to be absolutely sure what the software does or does not do and any randomness introduces an unacceptable risk. But such a case is very rare.
I doubt that Eliezer would refuse to implement a probabilistic solution on the grounds that it’s not pure enough and so no solution at all is better than a version tainted by its contact with the RNG.
Sure, I agree with this. But he still seems to think that in principle it is always possible to improve over a randomized algorithm. But doing so requires having some knowledge of the distribution over the environment, and that would break modularity.
Whether or not Eliezer himself is basing this argument on Bayesian grounds, it certainly seems to be the case that many commenters are, e.g.:
However if it is an important problem and you think you might be able to find some regularities, the best bet would to be to do bayesian updates on the most likely positions to be ones and preferentially choose those
Quantum branching is “truly random” in the sense that branching the universe both ways is an irreducible source of new indexical ignorance. But the more important point is that unless the environment really is out to get you, you might be wiser to exploit the ways that you think the environment might depart from true randomness, rather than using a noise source to throw away this knowledge.
I certainly don’t say “it’s not hard work”, and the environmental probability distribution should not look like the probability distribution you have over your random numbers—it should contain correlations and structure. But once you know what your probability distribution is, then you should do your work relative to that, rather than assuming “worst case”. Optimizing for the worst case in environments that aren’t actually adversarial, makes even less sense than assuming the environment is as random and unstructured as thermal noise.
Another way of swapping around the question is to ask under what circumstances Jacob Steinhardt would refuse to use a PRNG rather than an RNG because the PRNG wasn’t random enough, and whether there’s any instance of such that doesn’t involve an intelligent adversary (or that ancient crude PRNG with bad distributional properties that everyone always cites when this topic comes up, i.e., has that happened more recently with an OK-appearing PRNG).
Obviously I don’t intend to take a stance on the math-qua-math question of P vs. BPP. But to the extent that someone has to assert that an algorithm’s good BPP-related properties only work for an RNG rather than a PRNG, and there’s no intelligent adversary of any kind involved in the system, I have to question whether this could reasonably happen in real life. Having written that sentence it doesn’t feel very clear to me. What I’m trying to point at generally is that unless I have an intelligent adversary I don’t want my understanding of a piece of code to depend on whether a particular zero bit is “deterministic” or “random”. I want my understanding to say that the code has just the same effect once the zero is generated, regardless of what factors generated the zero; I want to be able to screen off the “randomness” once I’ve looked at the output of that randomness, and just ask about the effectiveness of using a zero here or a one there. Furthermore I distrust any paradigm which doesn’t look like that, and reject it as something I could really-truly believe, until the business about “randomness” has been screened off and eliminated from the analysis. Unless I’m trying to evade a cryptographic adversary who really can predict me if I choose the wrong PRNG or write down my random bits someplace that someone else can see them, so that writing down the output of an RNG and then feeding it into the computation as a deterministic constant is genuinely worse because my adversary might sneak a look at the RNG’s output if I left it written down anywhere. Or I’m trying to randomize a study and prevent accidental correlations with other people’s study, so I use an RNG just in case somebody else used a similar PRNG.
But otherwise I don’t like my math treating the same bit differently depending on whether it’s “random” or “deterministic” because its actual effect on the algorithm is the same and ought to be screened off from its origins once it becomes a 1 or 0.
(And there’s also a deep Bayesian issue here regarding, e.g., our ability to actually look at the contents of an envelope in the two-envelope problem and update our prior about amounts of money in envelopes to arrive at the posterior, rather than finding it intuitive to think that we picked an envelope randomly and that the randomized version of this algorithm will initially pick the envelope containing the larger amount of money half the time, which I think is a very clear illustration of the Bad Confused Thoughts into which you’re liable to be led down a garden-path, if you operate in a paradigm that doesn’t find it intuitive to look at the actual value of the random bit and ask about what we think about that actual value apart from the “random” process that supposedly generated it. But this issue the margins are too small to contain.)
Obviously I don’t intend to take a stance on the math-qua-math question of P vs. BPP. But to the extent that someone has to assert that an algorithm’s good BPP-related properties only work for an RNG rather than a PRNG, and there’s no intelligent adversary of any kind involved in the system, I have to question whether this could reasonably happen in real life.
If your question is “Is there a known practical case not involving an intelligent adversarial environment where the use of a cryptographic PRNG or even a true RNG rather than a non-cryptographic PRNG is warranted?” Then the answer is no. In fact, this is the reason why it is conjectured that P = BPP.
However, given the rest of your comment, it seems that you are referring to how we understand the theoretical properties of algorithms:
What I’m trying to point at generally is that unless I have an intelligent adversary I don’t want my understanding of a piece of code to depend on whether a particular zero bit is “deterministic” or “random”. I want my understanding to say that the code has just the same effect once the zero is generated, regardless of what factors generated the zero; I want to be able to screen off the “randomness” once I’ve looked at the output of that randomness, and just ask about the effectiveness of using a zero here or a one there. Furthermore I distrust any paradigm which doesn’t look like that, and reject it as something I could really-truly believe, until the business about “randomness” has been screened off and eliminated from the analysis.
If we are talking about understanding the theoretical properties of many useful randomized algorithms, I’d say that we can’t “screen off” randomness. Even if the algorithm is implemented using a PRNG with a constant seed, and is thus fully deterministic at the level of what actually runs on the machine, when we reason about its theoretical properties, whether it is a formal analysis or a pre-formal intuitive analysis, we need to abstract away the PRNG and assume that the algorithm has access to true randomness.
As it was already pointed out, if we were perfectly rational Bayesian agents, then, we would always be able to include the PRNG into our analysis: For instance, given a machine learning problem, a Bayesian agent would model it as a subjective probability distribution, and it may conclude that for that particular distribution the optimal algorithm is an implementation of Random Forest algorithm with the Mersenne Twister PRNG initialized with seed 42. For a slightly different subjective distribution, the optimal algorithm may be an implementation of a Neural Network trained with Error Backpropagation with weights initialized by a Linear Congruentlial PRNG with seed 12345. For another slightly different subjective distribution, the optimal algorithm may be some entirely different deterministic algorithm.
In practice, we can’t reason this way. Therefore we assume true randomness in order to say meaningful things about many practically useful algorithms.
A more involved post about those Bad Confused Thoughts and the deep Bayesian issue underlying it would be really interesting, when and if you ever have time for it.
in principle it is always possible to improve over a randomized algorithm
in principle are the key words.
As the old saying goes, in theory there is no difference between theory and practice but in practice there is.
requires having some knowledge of the distribution over the environment, and that would break modularity.
You are using the word “modularity” in a sense weird to me. From my perspective, in the software context “modularity” refers to interactions of pieces of software, not inputs or distributions over the environment.
I think an example of what jsteinhardt is referring to would be quicksort. It can take an arbitrary list as an argument, but for many perversely ordered inputs it takes Omega(n^2). However, it does have an efficient average-case complexity of O(n log n). In other words, if the input is sampled from the uniform distribution over permutations the algorithm is guaranteed to finish in O(n log n) time.
Many of the other examples that were considered are similar, in that the algorithm doesn’t give an error when given an input outside of the expected distribution, but rather silently works less effectively
I think an example of what jsteinhardt is referring to would be quicksort.
Quicksort as a piece of software does not require any particular distribution. It is perfectly happy to work with “perversely ordered inputs”.
A human running quicksort with certain expectations about its performance might require a particular distribution, but that’s not a characteristic of software.
Recall that the original claim was that expecting a particular distribution breaks modularity.
If a particular function was advertised as having O(n log(n)) running time, and it turns out that it actually runs in O(n^2) on a certain class of inputs, I would not feel it to be behaving in a very modular way. From an abstraction perspective, breaking time or resource assumptions on an input can be as bad or worse than rejecting it.
Granted, it’s pretty easy to get quicksort to behave on all but the most pathological data sets.
If a particular function was advertised as having O(n log(n)) running time, and it turns out that it actually runs in O(n^2) on a certain class of inputs, I would not feel it to be behaving in a very modular way.
If you have been misled with respect to the behavior of an algorithm on a specific class of inputs, what does it have to do with modularirty??
Modularity refers to the independence of a piece of software from the particulars of other pieces of software. It has nothing to do with performance metrics.
When picking which algorithms to use for particular pieces of your task, performance metrics are a significant factor. If you are told that your algorithm for Section A of the task runs in O(n log n) time and you know that the Section B portion of the program, which requires a stream on inputs from Section A, will run in O(n (log n)^2), you can assume that Section B will be constantly supplied with inputs and the program can busy-wait for the brief portions where it does not have input. If it turns out that for your specific input distribution Section A is taking O(n^2), that’s a critical distinction with far-reaching effects on the design of the rest of your algorithm.
It’s somewhat harder to provide an example for non-networked, non-parallelized code, but I can do so on request.
I understand all that, but I’m reading it as “you should not make mistakes about the expected performance of your code and your application architecture should reflect your use cases.”
There are many ways to screw up a software system.
A human running quicksort with certain expectations about its performance might require a particular distribution, but that’s not a characteristic of software.
I think this may be a distinction without a difference; modularity can also be defined as human expectations about software, namely that the software will be relatively easy to hook into a larger system.
Well, I don’t know of a single piece of software which requires that its inputs come from specific probability distributions in addition to satisfying some set of properties.
In fact, wouldn’t that software object have to maintain some running counts to even be able to estimate from which distribution do its inputs come?
This is in some sense my point. If you want to be so hardcore Bayesian that you even look down on randomized algorithms in software, then presumably the alternative is to form a subjective probability distribution over the inputs to your algorithm (or perhaps there is some other obvious alternative that I’m missing). But I don’t know of many pieces of code that require their inputs to conform to such a subjective probability distribution; rather, the code should work for ALL inputs (i.e. do a worst-case rather than average-case analysis, which will in some cases call for the use of randomness). I take this to be evidence that the “never use randomness” view would call for software that most engineers would consider poorly-designed, and as such is an unproductive view from the perspective of designing good software.
Any software that uses randomness requires you to meet a probability distribution over its inputs, namely that the random input needs to be random. I assume that you’re not claiming that this breaks modularity, as you advocate the use of randomness in algorithms. Why?
Based on the discusssions with you and Lumifer, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.
EDIT: Also, thanks for the feedback so far!
The probability distribution part is better, though I still don’t see how software that uses randomness doesn’t fall under that (likewise: compression, image recognition, signal processing, and decision making algorithms).
If the software generates its own randomness (for instance, in JAVA you can do this by creating a new instance of a Random object and calling nextInt(), etc.) then I don’t see how this breaks modularity.
Compression algorithms like bzip don’t promise to make things smaller as part of their contract, they only promise to be lossless. To the extent that a user relies on them becoming smaller consistently, this can lead to trouble, for instance if I expect by inputs of size 10 kilobytes to compress down to 2 kilobytes, and then many of the inputs in a row stay at 10 kilobytes, I can get unexpected load that could create issues.
I don’t know of many image recognition algorithms that are integrated into a large system without a human in the loop, except perhaps Google Image Search which arguably has a human (the end-user) that filters the results at the end. This is precisely due to their non-modularity: they fail in unexpected and difficult-to-characterize ways, such that any attempt to enforce a contract or abstraction would be probably misguided at this point. The best they can currently promise is “we correctly implement the fast Fourier transform” and leave it up to the programmer to decide whether an FFT is what is merited in a given case.
ETA: Another way to put the bzip example which might fit more with your language, is that yes the guarantee of bzip is that it is lossless and that it will make your data smaller as long as the data fit the properties that bzip attempts to exploit, and if that is what we want the contract to be then I would agree that bzip is non-modular.
nitpick: Java default PRNG is a linear congruential generator. It’s not as bad as the infamous RANDU, but I won’t suggest to use it for anything that needs more than a hundred or so pseudo-random bits.
bzip is a great source of random bits.
:)
Do such people exist? I don’t know anyone who fits such a description.
Well, of course. But again, has anyone been arguing against that? The original dispute was talking about abstract optimality and whether particular solutions exist. No one suggested that in real-world practice at the current level of technology rand() Call Considered Harmful.
What about Eliezer?
Well, let’s put it this way. If you can implement a non-randomized solution that works within the specified constraints (time, etc.), it is preferable to a probabilistic one. The real question is what happens when you can’t. I doubt that Eliezer would refuse to implement a probabilistic solution on the grounds that it’s not pure enough and so no solution at all is better than a version tainted by its contact with the RNG.
An exception might be when you need to be absolutely sure what the software does or does not do and any randomness introduces an unacceptable risk. But such a case is very rare.
Sure, I agree with this. But he still seems to think that in principle it is always possible to improve over a randomized algorithm. But doing so requires having some knowledge of the distribution over the environment, and that would break modularity.
Whether or not Eliezer himself is basing this argument on Bayesian grounds, it certainly seems to be the case that many commenters are, e.g.:
Will_Pearson:
DonGeddis:
And some comments by Eliezer:
Eliezer_Yudkowsky:
Eliezer_Yudkowsky:
Another way of swapping around the question is to ask under what circumstances Jacob Steinhardt would refuse to use a PRNG rather than an RNG because the PRNG wasn’t random enough, and whether there’s any instance of such that doesn’t involve an intelligent adversary (or that ancient crude PRNG with bad distributional properties that everyone always cites when this topic comes up, i.e., has that happened more recently with an OK-appearing PRNG).
Obviously I don’t intend to take a stance on the math-qua-math question of P vs. BPP. But to the extent that someone has to assert that an algorithm’s good BPP-related properties only work for an RNG rather than a PRNG, and there’s no intelligent adversary of any kind involved in the system, I have to question whether this could reasonably happen in real life. Having written that sentence it doesn’t feel very clear to me. What I’m trying to point at generally is that unless I have an intelligent adversary I don’t want my understanding of a piece of code to depend on whether a particular zero bit is “deterministic” or “random”. I want my understanding to say that the code has just the same effect once the zero is generated, regardless of what factors generated the zero; I want to be able to screen off the “randomness” once I’ve looked at the output of that randomness, and just ask about the effectiveness of using a zero here or a one there. Furthermore I distrust any paradigm which doesn’t look like that, and reject it as something I could really-truly believe, until the business about “randomness” has been screened off and eliminated from the analysis. Unless I’m trying to evade a cryptographic adversary who really can predict me if I choose the wrong PRNG or write down my random bits someplace that someone else can see them, so that writing down the output of an RNG and then feeding it into the computation as a deterministic constant is genuinely worse because my adversary might sneak a look at the RNG’s output if I left it written down anywhere. Or I’m trying to randomize a study and prevent accidental correlations with other people’s study, so I use an RNG just in case somebody else used a similar PRNG.
But otherwise I don’t like my math treating the same bit differently depending on whether it’s “random” or “deterministic” because its actual effect on the algorithm is the same and ought to be screened off from its origins once it becomes a 1 or 0.
(And there’s also a deep Bayesian issue here regarding, e.g., our ability to actually look at the contents of an envelope in the two-envelope problem and update our prior about amounts of money in envelopes to arrive at the posterior, rather than finding it intuitive to think that we picked an envelope randomly and that the randomized version of this algorithm will initially pick the envelope containing the larger amount of money half the time, which I think is a very clear illustration of the Bad Confused Thoughts into which you’re liable to be led down a garden-path, if you operate in a paradigm that doesn’t find it intuitive to look at the actual value of the random bit and ask about what we think about that actual value apart from the “random” process that supposedly generated it. But this issue the margins are too small to contain.)
Is that helpful?
If your question is “Is there a known practical case not involving an intelligent adversarial environment where the use of a cryptographic PRNG or even a true RNG rather than a non-cryptographic PRNG is warranted?” Then the answer is no.
In fact, this is the reason why it is conjectured that P = BPP.
However, given the rest of your comment, it seems that you are referring to how we understand the theoretical properties of algorithms:
If we are talking about understanding the theoretical properties of many useful randomized algorithms, I’d say that we can’t “screen off” randomness.
Even if the algorithm is implemented using a PRNG with a constant seed, and is thus fully deterministic at the level of what actually runs on the machine, when we reason about its theoretical properties, whether it is a formal analysis or a pre-formal intuitive analysis, we need to abstract away the PRNG and assume that the algorithm has access to true randomness.
As it was already pointed out, if we were perfectly rational Bayesian agents, then, we would always be able to include the PRNG into our analysis:
For instance, given a machine learning problem, a Bayesian agent would model it as a subjective probability distribution, and it may conclude that for that particular distribution the optimal algorithm is an implementation of Random Forest algorithm with the Mersenne Twister PRNG initialized with seed 42.
For a slightly different subjective distribution, the optimal algorithm may be an implementation of a Neural Network trained with Error Backpropagation with weights initialized by a Linear Congruentlial PRNG with seed 12345.
For another slightly different subjective distribution, the optimal algorithm may be some entirely different deterministic algorithm.
In practice, we can’t reason this way. Therefore we assume true randomness in order to say meaningful things about many practically useful algorithms.
A more involved post about those Bad Confused Thoughts and the deep Bayesian issue underlying it would be really interesting, when and if you ever have time for it.
in principle are the key words.
As the old saying goes, in theory there is no difference between theory and practice but in practice there is.
You are using the word “modularity” in a sense weird to me. From my perspective, in the software context “modularity” refers to interactions of pieces of software, not inputs or distributions over the environment.
Based on the discusssions with you and trist, I updated the original text of that section substantially. Let me know if it’s clearer now what I mean.
Also, thanks for the feedback so far!
I think an example of what jsteinhardt is referring to would be quicksort. It can take an arbitrary list as an argument, but for many perversely ordered inputs it takes Omega(n^2). However, it does have an efficient average-case complexity of O(n log n). In other words, if the input is sampled from the uniform distribution over permutations the algorithm is guaranteed to finish in O(n log n) time.
Many of the other examples that were considered are similar, in that the algorithm doesn’t give an error when given an input outside of the expected distribution, but rather silently works less effectively
Quicksort as a piece of software does not require any particular distribution. It is perfectly happy to work with “perversely ordered inputs”.
A human running quicksort with certain expectations about its performance might require a particular distribution, but that’s not a characteristic of software.
Recall that the original claim was that expecting a particular distribution breaks modularity.
If a particular function was advertised as having O(n log(n)) running time, and it turns out that it actually runs in O(n^2) on a certain class of inputs, I would not feel it to be behaving in a very modular way. From an abstraction perspective, breaking time or resource assumptions on an input can be as bad or worse than rejecting it.
Granted, it’s pretty easy to get quicksort to behave on all but the most pathological data sets.
If you have been misled with respect to the behavior of an algorithm on a specific class of inputs, what does it have to do with modularirty??
Modularity refers to the independence of a piece of software from the particulars of other pieces of software. It has nothing to do with performance metrics.
When picking which algorithms to use for particular pieces of your task, performance metrics are a significant factor. If you are told that your algorithm for Section A of the task runs in O(n log n) time and you know that the Section B portion of the program, which requires a stream on inputs from Section A, will run in O(n (log n)^2), you can assume that Section B will be constantly supplied with inputs and the program can busy-wait for the brief portions where it does not have input. If it turns out that for your specific input distribution Section A is taking O(n^2), that’s a critical distinction with far-reaching effects on the design of the rest of your algorithm.
It’s somewhat harder to provide an example for non-networked, non-parallelized code, but I can do so on request.
I understand all that, but I’m reading it as “you should not make mistakes about the expected performance of your code and your application architecture should reflect your use cases.”
There are many ways to screw up a software system.
I think this may be a distinction without a difference; modularity can also be defined as human expectations about software, namely that the software will be relatively easy to hook into a larger system.
I don’t find this a useful definition, but YMMV, de gustibus, and all that...