if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy
This would be pretty strange—why not just directly hardcode the policy? Wouldn’t that be faster? We need to use machine learning because we aren’t able to write down a program that “directly” solves e.g. image recognition, but the “direct” program would be faster if we had some way of finding it. The general reason for optimism is that the “fastest” requirement implies that any extraneous computation (e.g. deception) is removed; that same reason implies that any search would be removed and replaced with a correct output of the search.
Another way to think of this: when you don’t have a simplicity bias, you have to compete with the GLUT (Giant Lookup Table), which can be very fast. Even if you take into account the time taken to perform the lookup, in a deterministic environment the GLUT only has to encode the optimal trajectory, not the full policy. In a stochastic environment, the GLUT may need to be exponentially larger, so it may be too slow, but even so you can have GLUT-like things built out of higher-level abstractions, which might be enough to avoid deception. Basically you can do a lot of weird stuff when you don’t require simplicity; it’s not clear that meta learning should be modeled the same way.
While it has been argued that the simplest program that solves a complex task is likely to be deceptive, it hasn’t yet been argued whether the fastest program that solves a complex task will be deceptive. This post argues that fast programs will often be forced to learn a good policy (just as we need to do today), and the learned policy is likely to be deceptive (presumably due to risks from learned optimization). Thus, there are at least some tasks where the fastest program will also be deceptive.
Planned opinion:
This is an intriguing hypothesis, but I’m not yet convinced: it’s not clear why the fastest program would have to learn the best policy, rather than directly hardcoding the best policy. If there are multiple possible tasks, the program could have a nested if structure that figures out which task needs to be done and then executes the best policy for that task. More details in this comment.
A couple of things. First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out. Second, even if you’re operating on pure speed, I think there are sets of tasks that are large enough that a GLUT won’t work. For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren’t the fastest way to do that. Even if you restrict to only finitely many convex functions, if your set is large enough it still seems hard to do better than gradient descent, especially since looking up the solution in a huge GLUT could be quite expensive (how do you do the lookup? a hash table? a binary tree? I’m not sure if those would be good enough here).
First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out.
I don’t really understand this—my understanding is that with a minimal circuit, you want to minimize the number of gates in the circuit, and the circuit must be a DAG (if you’re allowed to have loops + clocks, then you can build a regular computer, and for complex tasks the problem should be very similar to finding the shortest program and implementing it on a universal circuit).
But then I can create a Turing Machine that interprets its input as a circuit, and simulates each gate of the circuit in sequence. Then the running time of the TM is proportional to the number of gates in the circuit, so an input with minimal running time should be a circuit with the minimal number of gates. This is not technically a Universal TM, since loop-free circuits are not Turing-complete, but I would expect a speed prior using such a Turing Machine would be relatively similar to a speed prior with a true UTM.
EDIT: I no longer endorse the above point. Just because there exists a speed prior that aligns with minimal circuits doesn’t mean that this is the “typical” speed prior. In general there really should be a difference between a “minimal circuit” and a “fastest program” (precisely the depth + width vs. depth point Evan made in the parent comment).
For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren’t the fastest way to do that.
I agree that if you have to work on an infinite set of inputs, a GLUT is not the way to do that. I was thinking of the case where you have to work on a small finite set of inputs (hence why I talk about the optimal trajectory instead of the optimal policy), which is always going to be the case in the real world. But this is too pedantic, we can certainly think of the theoretical case where you have to work on an infinite set of inputs. I was mostly trying to use the GLUT as an intuition pump, not arguing that it was always better.
In the case with infinite inputs, I still have the intuition that meta learning is what you do when you don’t know enough about the problem to write down the good heuristics straight away, as opposed to being the fastest way of solving the problem. But I agree that the fastest solution won’t be a GLUT; I’m thinking more a combination of really good heuristics that “directly” solve the problem. (Part of this is an intuition that for any reasonably structured set of inputs, value-neutral optimization of a fixed objective is very inefficient.)
This would be pretty strange—why not just directly hardcode the policy? Wouldn’t that be faster? We need to use machine learning because we aren’t able to write down a program that “directly” solves e.g. image recognition, but the “direct” program would be faster if we had some way of finding it. The general reason for optimism is that the “fastest” requirement implies that any extraneous computation (e.g. deception) is removed; that same reason implies that any search would be removed and replaced with a correct output of the search.
Another way to think of this: when you don’t have a simplicity bias, you have to compete with the GLUT (Giant Lookup Table), which can be very fast. Even if you take into account the time taken to perform the lookup, in a deterministic environment the GLUT only has to encode the optimal trajectory, not the full policy. In a stochastic environment, the GLUT may need to be exponentially larger, so it may be too slow, but even so you can have GLUT-like things built out of higher-level abstractions, which might be enough to avoid deception. Basically you can do a lot of weird stuff when you don’t require simplicity; it’s not clear that meta learning should be modeled the same way.
Planned summary:
While it has been argued that the simplest program that solves a complex task is likely to be deceptive, it hasn’t yet been argued whether the fastest program that solves a complex task will be deceptive. This post argues that fast programs will often be forced to learn a good policy (just as we need to do today), and the learned policy is likely to be deceptive (presumably due to risks from learned optimization). Thus, there are at least some tasks where the fastest program will also be deceptive.
Planned opinion:
This is an intriguing hypothesis, but I’m not yet convinced: it’s not clear why the fastest program would have to learn the best policy, rather than directly hardcoding the best policy. If there are multiple possible tasks, the program could have a nested if structure that figures out which task needs to be done and then executes the best policy for that task. More details in this comment.
A couple of things. First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out. Second, even if you’re operating on pure speed, I think there are sets of tasks that are large enough that a GLUT won’t work. For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren’t the fastest way to do that. Even if you restrict to only finitely many convex functions, if your set is large enough it still seems hard to do better than gradient descent, especially since looking up the solution in a huge GLUT could be quite expensive (how do you do the lookup? a hash table? a binary tree? I’m not sure if those would be good enough here).
I don’t really understand this—my understanding is that with a minimal circuit, you want to minimize the number of gates in the circuit, and the circuit must be a DAG (if you’re allowed to have loops + clocks, then you can build a regular computer, and for complex tasks the problem should be very similar to finding the shortest program and implementing it on a universal circuit).
But then I can create a Turing Machine that interprets its input as a circuit, and simulates each gate of the circuit in sequence. Then the running time of the TM is proportional to the number of gates in the circuit, so an input with minimal running time should be a circuit with the minimal number of gates. This is not technically a Universal TM, since loop-free circuits are not Turing-complete, but I would expect a speed prior using such a Turing Machine would be relatively similar to a speed prior with a true UTM.
EDIT: I no longer endorse the above point. Just because there exists a speed prior that aligns with minimal circuits doesn’t mean that this is the “typical” speed prior. In general there really should be a difference between a “minimal circuit” and a “fastest program” (precisely the depth + width vs. depth point Evan made in the parent comment).
I agree that if you have to work on an infinite set of inputs, a GLUT is not the way to do that. I was thinking of the case where you have to work on a small finite set of inputs (hence why I talk about the optimal trajectory instead of the optimal policy), which is always going to be the case in the real world. But this is too pedantic, we can certainly think of the theoretical case where you have to work on an infinite set of inputs. I was mostly trying to use the GLUT as an intuition pump, not arguing that it was always better.
In the case with infinite inputs, I still have the intuition that meta learning is what you do when you don’t know enough about the problem to write down the good heuristics straight away, as opposed to being the fastest way of solving the problem. But I agree that the fastest solution won’t be a GLUT; I’m thinking more a combination of really good heuristics that “directly” solve the problem. (Part of this is an intuition that for any reasonably structured set of inputs, value-neutral optimization of a fixed objective is very inefficient.)