Intuitively, things go wrong if you get unexpected, unwanted, potentially catastrophic behavior. Basically, it’s something we’d want to fix before using this thing in production. I think most of your bullet points qualify, but if you give an example that falls under one of those bullet points, yet doesn’t seem like it’d be much of a concern in practice (very little catastrophic potential), that might not get a prize.
In particular, inner misalignment seems like something you aren’t including in your “going wrong”? (Since it seems like an easy answer to your challenge.)
Thanks for bringing that up. Yes, I am looking specifically for defeaters aimed in the general direction of the points I made in this post. Bringing up generic, widely known safety concerns that many designs are potentially susceptible to does not qualify.
I note that the recursive-decomposition sort of system you describe is very different from most modern ML, and different from the “basically gradient descent” sort of thing I was imagining in the story. (We might naturally suppose that Predict-O-Matic has some “secret sauce”, though.)
I think there’s potentially an analogy with attention in the context of deep learning, but it’s pretty loose.
It seems a bit like you might be equating the second option with “does not produce self-fulfilling prophecies”, which I think would be a mistake.
Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn’t optimized for being so? Or are you trying to distinguish between “explicit” and “implicit” searches for fixed points? Or are you trying to distinguish between fixed points and self-fulfilling prophecies somehow? (I thought they were basically the same thing.)
Do you mean to say that a prophecy might happen to be self-fulfilling even if it wasn’t optimized for being so? Or are you trying to distinguish between “explicit” and “implicit” searches for fixed points?
More the second than the first, but I’m also saying that the line between the two is blurry.
For example, suppose there is someone who will often do what Predict-O-Matic predicts if they can understand how to do it. They often ask it what they are going to do. At first, Predict-O-Matic predicts them as usual. This modifies their behavior to be somewhat more predictable than it would normally be. Predict-O-Matic locks into the patterns (especially the predictions that work best as suggestions). Behavior gets even more regular. And so on.
You could say that no one is optimizing for fixed-point-ness here, and Predict-O-Matic is just chancing into it. But effectively, there’s an optimization implemented by the pair of Predict-O-Matic and the person.
In situations like that, you get into an optimized fixed point over time, even though the learning algorithm itself isn’t explicitly searching for that.
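Here is a minimal toy sketch of that loop, under assumptions of my own (the names, the single-number “behavior”, and the trust dynamic are all hypothetical, not anything from the post): the predictor does ordinary error-driven updating, the person follows the prediction more strongly the more accurate recent predictions have been, and the pair drifts toward a self-fulfilling fixed point even though neither component is searching for one.

```python
import random

def simulate(steps=60, lr=0.5, noise=0.2):
    """Toy predictor/person loop (illustrative assumptions only).

    'behavior' is a single number. The predictor keeps a running estimate
    and nudges it toward observed behavior; no fixed-point search anywhere.
    The person follows the prediction more the more accurate recent
    predictions have been, so behavior gets more regular, predictions get
    more accurate, and so on.
    """
    prediction = 0.0
    trust = 0.0  # how much the person follows the prediction (0..1)
    for t in range(steps):
        own_plan = random.gauss(0.0, 1.0)  # what they'd do on their own
        behavior = trust * prediction + (1 - trust) * own_plan + random.gauss(0.0, noise)
        error = abs(behavior - prediction)
        # Predictor: plain error-driven update toward what actually happened.
        prediction += lr * (behavior - prediction)
        # Person: trust grows when predictions have been accurate lately.
        trust = min(1.0, trust + 0.1 * max(0.0, 1.0 - error))
        if t % 10 == 0:
            print(f"step {t:2d}: trust={trust:.2f}  |error|={error:.2f}")

if __name__ == "__main__":
    random.seed(0)
    simulate()
```

Running this, trust climbs toward 1 and the prediction error shrinks toward the noise floor: the “optimization for fixed-point-ness” lives in the loop between the two parts, not in either part alone.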
To highlight the “blurry distinction” more:
Note that if the prediction algorithm anticipates this process (perhaps only partially), it will “jump ahead”, so that convergence to a fixed point happens more within the computation of the predictor and less over steps of real-world interaction. This isn’t formally the same as searching for fixed points internally (you get much weaker guarantees out of this haphazard process), but it does mean that optimization for fixed-point finding is happening within the system under some conditions.
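A sketch of that “jumping ahead”, again with hypothetical names and a made-up response model: if the predictor has some internal model of how the person reacts to a prediction, it can iterate that model inside one call and output something already near the eventual fixed point, so the convergence happens within the computation rather than over rounds of real interaction.

```python
def jump_ahead(person_response, initial_guess=0.0, inner_steps=100, tol=1e-6):
    """Iterate an internal model of the person's response to a prediction.

    `person_response` is the predictor's (possibly imperfect) model of what
    the person will do given a prediction. Repeatedly feeding the output
    back in is not a principled fixed-point search (no guarantee it
    converges, or to which fixed point), but when it does settle, the
    returned prediction is already close to self-fulfilling.
    """
    prediction = initial_guess
    for _ in range(inner_steps):
        response = person_response(prediction)
        if abs(response - prediction) < tol:
            break
        prediction = response
    return prediction

# Hypothetical example: the predictor models the person as moving 70% of
# the way toward whatever is predicted, anchored on a plan of their own.
print(jump_ahead(lambda p: 0.7 * p + 0.3 * 1.5))  # converges to ≈ 1.5, the fixed point
```

The haphazardness shows up in the weak guarantees: the inner loop may not converge at all, and when it does, nothing picks out which fixed point it lands on.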