I’m not really sure what you mean when you say “something goes wrong” (in relation to the prize). I’ve been thinking about all this in a very descriptive way, ie, I want to understand what happens generally, not force a particular outcome. So I’m a little out of touch with the “goes wrong” framing at the moment. There are a lot of different things which could happen. Which of them constitute “going wrong”?
Becoming non-myopic; ie, using strategies which get lower prediction loss long-term rather than on a per-question basis (a toy contrast is sketched just after this list).
(Note this doesn’t necessarily mean planning to do so, in an inner-optimizer way.)
Making self-fulfilling prophecies in order to strategically minimize prediction loss on individual questions (while possibly remaining myopic).
Having a tendency for self-fulfilling prophecies at all (not necessarily strategically minimizing loss).
Having a tendency for self-fulfilling prophecies, but not necessarily the ones which society has currently converged to (eg, disrupting the existing equilibrium in which money is valuable because everyone expects it to remain valuable).
Strategically minimizing prediction loss in any way other than by giving better answers in an intuitive sense.
Manipulating the world strategically in any way, toward any end.
Catastrophic risk by any means (not necessarily due to strategic manipulation).
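To make the first item (non-myopia) concrete, here is a toy contrast under assumptions entirely of my own invention (the “calming prediction” world below is hypothetical, not anything from the post): a myopic predictor answers each question as well as it can, while a non-myopic strategy accepts higher loss on individual questions in exchange for lower long-term loss.

```python
import random

# Toy contrast (my own construction, not from the post) between a myopic and a
# non-myopic predictor. The "world": outcomes are noise whose size depends on a
# volatility level, and one particular "calming" prediction halves that
# volatility for later rounds. The myopic predictor minimizes loss question by
# question; the non-myopic one accepts worse loss now to make later questions easier.
random.seed(0)

CALMING_PREDICTION = 0.5   # hypothetical prediction that makes the world more regular

def run(policy: str, rounds: int = 20) -> float:
    volatility = 1.0
    total_loss = 0.0
    for _ in range(rounds):
        outcome = random.gauss(0.0, volatility)
        if policy == "myopic":
            prediction = 0.0                  # best per-question answer (the mean)
        else:
            prediction = CALMING_PREDICTION   # worse now, but calms the world
            # (no inner-optimizer "planning" is needed; a learner could stumble into this)
        total_loss += (prediction - outcome) ** 2
        if prediction == CALMING_PREDICTION:
            volatility *= 0.5
    return total_loss

print("myopic     total loss:", round(run("myopic"), 2))       # roughly rounds * 1.0
print("non-myopic total loss:", round(run("non-myopic"), 2))   # typically much lower
```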
In particular, inner misalignment seems like something you aren’t including in your “going wrong”? (Since it seems like an easy answer to your challenge.)
I note that the recursive-decomposition type system you describe is very different from most modern ML, and different from the “basically gradient descent” sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some “secret sauce” though.)
If you aren’t already convinced, here’s another explanation for why I don’t think the Predict-O-Matic will make self-fulfilling prophecies by default.
In Abram’s story, the engineer says: “The answer to a question isn’t really separate from the expected observation. So ‘probability of observation depending on that prediction’ would translate to ‘probability of an event given that event’, which just has to be one.”
In other words, if the Predict-O-Matic knows it will predict P = A, it assigns probability 1 to the proposition that it will predict P = A.
Right, basically by definition. The word ‘given’ was intended in the Bayesian sense, ie, conditional probability.
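Spelling that out (a trivial formalization, just to pin down what “given” means here): if E is the single event “the Predict-O-Matic outputs P = A”, then

\[
P(E \mid E) \;=\; \frac{P(E \cap E)}{P(E)} \;=\; \frac{P(E)}{P(E)} \;=\; 1 \qquad (\text{assuming } P(E) > 0),
\]

which is the engineer’s “probability of an event given that event”.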
I contend that Predict-O-Matic doesn’t know it will predict P = A at the relevant time. It would require time travel: to know whether it will predict P = A, it would have to have made a prediction already, but it’s still formulating its prediction as it thinks about what it will predict.
It’s quite possible that the Predict-O-Matic has become relatively predictable-by-itself, so that it generally has good (not perfect) guesses about what it is about to predict. I don’t mean that it is in an equilibrium with itself; its predictions may be shifting in predictable directions. If these shifts become large enough, or if its predictability goes second-order (it predicts that it’ll predict its own output, and thus pre-anticipates the direction of shift recursively), it has to stop knowing its own output in so much detail (it’s changing too fast to learn about). But it can possibly know a lot about its output.
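As a toy picture of “predictable-by-itself without being in equilibrium” (numbers and dynamics are my own illustration, nothing from the post): the output below keeps drifting, yet a crude extrapolation from its own history stays close to it.

```python
import random

# Toy sketch (illustration only): a predictor whose numeric output drifts every
# round, plus a crude self-model that extrapolates the next output from the last
# two outputs. The self-model stays close to the real output even though the
# output never settles -- it "knows a lot about its output" without a fixed point.
random.seed(0)

outputs = [0.0, 0.1]   # history of the predictor's own outputs
for step in range(20):
    # self-model: assume the recent trend in my own outputs continues
    self_guess = outputs[-1] + (outputs[-1] - outputs[-2])
    # actual next output: the same drift plus a little noise (stands in for new data)
    next_output = outputs[-1] + 0.1 + random.gauss(0.0, 0.01)
    print(f"step {step:2d}  guess={self_guess:.3f}  actual={next_output:.3f}  "
          f"error={abs(self_guess - next_output):.3f}")
    outputs.append(next_output)
```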
I definitely agree with most of the stuff in the ‘answering a question by having the answer’ section. Whether a system explicitly makes the prediction into a fixed point is a critical question, which will determine which way some of these issues go.
If the system does, then there are explicit ‘handles’ to optimize the world by selecting which self-fulfilling prophecies to make true. We are effectively forced to deal with the issue (if only by random selection).
If the system doesn’t, then we lack such handles, but the system still has to do something in the face of such situations. It may converge to self-fulfilling stuff. It may not, and so produce ‘inconsistent’ outputs forever. This will depend on features of the learning algorithm as well as features of the situation it finds itself in.
It seems a bit like you might be equating the second option with “does not produce self-fulfilling prophecies”, which I think would be a mistake.
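For a concrete (and entirely toy) picture of the two cases, here is a sketch with names and numbers of my own invention, in which the realized outcome depends on what was predicted:

```python
from typing import List

# Toy world (my own illustration, loosely in the spirit of confidence-driven
# equilibria like "money is valuable because everyone expects it to be"): the
# realized outcome depends on the prediction. If the event is predicted
# confidently enough, people react and it happens; otherwise it doesn't.
# This world has exactly two self-fulfilling predictions: 0.0 and 1.0.
def world(prediction: float) -> float:
    return 1.0 if prediction > 0.5 else 0.0

CANDIDATES = [i / 20 for i in range(21)]   # candidate predictions 0.00 .. 1.00

def explicit_fixed_point_predictor(tol: float = 1e-9) -> List[float]:
    """Case 1: explicitly search for predictions that make themselves true.
    The returned list is the set of 'handles': one then chooses which
    self-fulfilling prophecy to actually emit."""
    return [p for p in CANDIDATES if abs(world(p) - p) <= tol]

def implicit_predictor(start: float = 0.3, steps: int = 10) -> float:
    """Case 2: no fixed-point machinery at all. Predict, observe, re-fit, repeat.
    It may settle into whichever self-fulfilling outcome the dynamics drift to,
    or (for a different world) keep producing 'inconsistent' outputs forever."""
    prediction = start
    for _ in range(steps):
        prediction = world(prediction)   # next prediction = last observed outcome
    return prediction

print("explicit handles:", explicit_fixed_point_predictor())   # [0.0, 1.0]
print("implicit outcome:", implicit_predictor())                # 0.0 here
```

The explicit version surfaces both self-fulfilling options and forces a choice between them; the implicit one just falls into one of them (or, for nastier worlds, into none), which is why “no explicit fixed-point search” shouldn’t be read as “no self-fulfilling prophecies”.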
Intuitively, things go wrong if you get unexpected, unwanted, potentially catastrophic behavior. Basically, if it’s something we’d want to fix before using this thing in production. I think most of your bullet points qualify, but if you give an example which falls under one of those bullet points, yet doesn’t seem like it’d be much of a concern in practice (very little catastrophic potential), that might not get a prize.
In particular, inner misalignment seems like something you aren’t including in your “going wrong”? (Since it seems like an easy answer to your challenge.)
Thanks for bringing that up. Yes, I am looking specifically for defeaters aimed in the general direction of the points I made in this post. Bringing up generic widely known safety concerns that many designs are potentially susceptible to does not qualify.
I note that the recursive-decomposition type system you describe is very different from most modern ML, and different from the “basically gradient descent” sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some “secret sauce” though.)
I think there’s potentially an analogy with attention in the context of deep learning, but it’s pretty loose.
It seems a bit like you might be equating the second option with “does not produce self-fulfilling prophecies”, which I think would be a mistake.
Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn’t optimized for being so? Or are you trying to distinguish between “explicit” and “implicit” searches for fixed points? Or are you trying to distinguish between fixed points and self-fulfilling prophecies somehow? (I thought they were basically the same thing.)
Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn’t optimized for being so? Or are you trying to distinguish between “explicit” and “implicit” searches for fixed points?
More the second than the first, but I’m also saying that the line between the two is blurry.
For example, suppose there is someone who will often do what predict-o-matic predicts if they can understand how to do it. They often ask it what they are going to do. At first, predict-o-matic predicts them as usual. This modifies their behavior to be somewhat more predictable than it normally would be. Predict-o-matic locks into the patterns (especially the predictions which work the best as suggestions). Behavior gets even more regular. And so on.
You could say that no one is optimizing for fixed-point-ness here, and predict-o-matic is just chancing into it. But effectively, there’s an optimization implemented by the pair of the predict-o-matic and the person.
In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn’t explicitly searching for that.
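A minimal simulation of that story (my own toy numbers; the specific dynamics are illustrative, not from the post): the predictor just fits the person’s recent behavior, the person follows predictions more as they keep coming true, and prediction and behavior tighten into a fixed point even though nothing in the learner searches for one.

```python
import random

# Toy version of the story (illustration only): the person partly follows whatever
# was predicted, and follows more as predictions keep coming true; the predictor
# just averages their recent behavior. Neither side explicitly searches for a
# fixed point, but the pair drifts into one: following makes predictions accurate,
# accuracy increases following, and so on.
random.seed(0)

history = [random.random() for _ in range(3)]   # person's past actions, in [0, 1]
follow = 0.2                                    # how strongly the person follows predictions

for day in range(40):
    prediction = sum(history[-3:]) / 3          # predictor: just fit recent behavior
    inclination = random.random()               # what the person felt like doing anyway
    action = follow * prediction + (1 - follow) * inclination
    error = abs(prediction - action)
    # trust grows when the prediction came true, shrinks a little when it didn't
    follow = max(0.0, min(1.0, follow + 0.1 * (0.3 - error)))
    history.append(action)
    print(f"day {day:2d}  prediction={prediction:.3f}  action={action:.3f}  follow={follow:.2f}")
# By the end, follow is near 1 and action ~= prediction: an "optimized" fixed point
# produced by the person/predictor pair, not by any fixed-point search in the learner.
```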
To highlight the “blurry distinction” more:
Note, if the prediction algorithm anticipates this process (perhaps partially), it will “jump ahead”, so that convergence to a fixed point happens more within the computation of the predictor (and less over steps of real-world interaction). This isn’t formally the same as searching for fixed points internally (you will get much weaker guarantees out of this haphazard process), but it does mean optimization for fixed-point finding is happening within the system under some conditions.
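One way to picture “jumping ahead” (again a sketch under toy assumptions of my own, not a claim about how the Predict-O-Matic actually works): instead of waiting for the real interaction loop to converge, the predictor iterates its internal model of that loop inside a single query and outputs something already near the fixed point.

```python
# Sketch of "jumping ahead" (toy assumptions of my own): the predictor has a
# learned model of how the world will react to its prediction, and iterates that
# model a few times inside a single query before answering. Convergence toward a
# fixed point then happens within the computation, not over rounds of real-world
# interaction -- and nothing here is a formal fixed-point search with guarantees.
def predicted_reaction(prediction: float) -> float:
    # the predictor's internal model of the world's response to its own answer
    return 0.5 * prediction + 1.0   # toy model whose fixed point is 2.0

def jump_ahead(initial_guess: float, inner_steps: int = 5) -> float:
    guess = initial_guess
    for _ in range(inner_steps):
        guess = predicted_reaction(guess)   # anticipate the reaction to my own answer
    return guess

print(jump_ahead(0.0))   # ~1.94 after five inner steps; more steps get closer to 2.0
```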