Evan’s post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
This post argues against (a version of) Evan’s premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn’t expect gradient descent to select for it.
Other relevant differences are
gradient descent doesn’t actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth’s definition of search as “general problem solving algorithm”.
Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I’m focusing on the average or default case.
Mostly orthogonal:
Evan’s post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
This post argues against (a version of) Evan’s premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn’t expect gradient descent to select for it.
Other relevant differences are
gradient descent doesn’t actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth’s definition of search as “general problem solving algorithm”.
Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I’m focusing on the average or default case.