Ah, you’re right. The proof can be fixed by changing the division between the two cases. So here is the new proof, with more details added regarding the construction of g:
Case 1: B(0,m) is uniformly discrete for all m. Then every map from X to [0,1] is uniformly continuous on bounded sets, so we get a contradiction from cardinality considerations.
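To spell out the cardinality point (under my reading that the fibers of f are indexed by the points of X): in this case every map X → [0,1] would be an admissible policy, but

$$\#\{\,g : X \to [0,1]\,\} \;=\; 2^{\max(\aleph_0,\,|X|)} \;>\; |X| \;\geq\; \#\{\text{fibers of } f\},$$

so not every such map can be a fiber.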
Case 2: B(0,m) is not uniformly discrete for some m. Then for each n ≥ m, since f is uniformly continuous on B(0,n) it has a modulus of continuity on this set, i.e. a continuous increasing function h_n : (0,∞) → (0,∞) with h_n(t) → 0 as t → 0, such that d(f(x),f(y)) < h_n(d(x,y)) for all x, y ∈ B(0,n) ⊆ X×X. Since B(0,m) is not uniformly discrete, there exist distinct x_n, y_n ∈ B(0,m) such that h_n(d(x_n,y_n)) < 2^{-n} and d(x_n,y_n) < 2^{-n}. We can extract a subsequence (n_k) such that if X_k = x_{n_k} and Y_k = y_{n_k}, then d(X_{k+1},Y_{k+1}) ≤ (1/3)d(X_k,Y_k) for all k, and for all k and ℓ > k, either (A) d(X_k,Y_ℓ) ≥ (1/3)d(X_k,Y_k) or (B) d(Y_k,X_ℓ) ≥ (1/3)d(X_k,Y_k). By extracting a further subsequence we can assume that which of (A) or (B) holds depends only on k and not on ℓ. By swapping X_k and Y_k if necessary we can assume that case (A) always holds. Now let j : [0,∞) → [0,∞) be an increasing continuous function with j(0) = 0 such that j((1/3)d(X_k,Y_k)) > 2^{-n_k} for all k. Finally, let g(y) = inf_k j(d(X_k,y)). Then for all k we have g(X_k) = 0 but g(Y_k) > 2^{-n_k}. Clearly, g cannot be a fiber of f.
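To make the last step explicit (this is my reading of it, assuming the metric on X×X satisfies d((z,x),(z,y)) = d(x,y), e.g. a max product metric): if g were a fiber, say g = f(z,·), then for all sufficiently large k both (z,X_k) and (z,Y_k) lie in B(0,n_k), so

$$2^{-n_k} < |g(Y_k) - g(X_k)| = d\big(f(z,X_k), f(z,Y_k)\big) < h_{n_k}\big(d(X_k,Y_k)\big) < 2^{-n_k},$$

which is impossible.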
Regarding the appropriateness of metric spaces / uniform continuity rather than topological spaces / abstract continuity, here are some of the reasons behind my intuition (developed while working in mathematical analysis, specifically Diophantine approximation, and also in constructive mathematics):
The obvious: metric spaces are explicitly meant to represent the intuitive notion of alikeness as a quantitative concept (i.e. distance), whereas topological spaces have no explicit notion of alikeness.
In computability theory, one is interested in the question of how to computationally represent a point, or an approximation to a point, in a space. The standard way to do this is to restrict to the class of complete separable metric spaces, fix a countable dense sequence (x_n) (assumed to be representative of the structure of the metric space), and define a computational approximation to a point to be an expression of the form B(x_n, 1/m). Since n and m are integers, this expression can be coded as finite data. One then defines a computational encoding of a point to be an infinite bitstream coding a sequence of computational approximations that converge to the point.
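As a minimal sketch of what such an encoding looks like in practice (here X = ℝ, the dense sequence is an enumeration of the rationals, and all names are my own illustration rather than a standard library):

```python
# Illustrative sketch only: encoding sqrt(2) as a stream of shrinking rational balls.
from fractions import Fraction
from itertools import count
from typing import Iterator, Tuple

def approximations_of_sqrt2() -> Iterator[Tuple[Fraction, Fraction]]:
    """Yield pairs (x, 1/m) standing for balls B(x, 1/m) that all contain sqrt(2).

    In the formal setting x would be given by its index n in the fixed dense
    sequence (x_n), so each pair (n, m) is finite data, and the whole stream
    is an infinite encoding of the point sqrt(2)."""
    for m in count(1):
        lo, hi = Fraction(1), Fraction(2)
        # Bisect until the interval [lo, hi] containing sqrt(2) is shorter than
        # 1/m; its midpoint is then within 1/m of sqrt(2).
        while hi - lo >= Fraction(1, m):
            mid = (lo + hi) / 2
            if mid * mid < 2:
                lo = mid
            else:
                hi = mid
        yield ((lo + hi) / 2, Fraction(1, m))

# Usage: the first few computational approximations of sqrt(2).
stream = approximations_of_sqrt2()
for _ in range(3):
    center, radius = next(stream)
    print(float(center), float(radius))
```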
In practical applications, in the end you will want everything to be computable. So it makes sense to work in a framework where there are natural notions of computability. I am not aware of any such notions for general topological spaces.
Regarding continuity vs uniform continuity in metric spaces, both are saying that if two points are close in the domain, their images are also close. But the latter gives you a straightforward estimate as to how close, whereas the former says that the degree of closeness may depend on one of the points. Now, there are good reasons to consider such dependence, since even natural functions on the real numbers (such as x^2 or 1/x) have “singularities” where they are not uniformly continuous.
So the question is whether to modify the notion of uniform continuity to directly account for singularities, or to use the standard definition of continuity instead. But if one works with the standard definition, then most of the time one is really looking for ways to sneak back to uniform continuity, e.g. by using the fact that a continuous function on a compact set is uniformly continuous.
An intuitive way of thinking about the fact that a continuous function on a compact set is uniformly continuous is that compactness means there are no singularities present “within the space”. For example, if we go back to the functions x^2 and 1/x, the singularity of the former occurs at infinity, while the singularity of the latter occurs at 0. If we take a compact subset of the domain of either function, what this really means is that we are avoiding the singularity.
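Concretely, the elementary identities

$$|x^2 - y^2| = |x+y|\,|x-y|, \qquad \left|\frac{1}{x}-\frac{1}{y}\right| = \frac{|x-y|}{|x|\,|y|}$$

show where the trouble is: the factor |x+y| blows up only as the arguments go to infinity, and the factor 1/(|x||y|) only as they approach 0; on a compact subset of the domain both factors are bounded, and uniform continuity is restored.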
By contrast, non-compactness should mean that there are singularities. In some cases, like (0,1), it is easy to identify where the singularities are (at the missing endpoints 0 and 1). But if we are dealing with spaces that are not locally compact, like ℕ^ℕ or an infinite-dimensional Hilbert space, then it is not as clear what the singularities are; there is just a general sense that they are dispersed “throughout the space” (because the space is not locally compact anywhere).
But you have to ask yourself, are these singularities real or just imagined? In many cases, imagined. For example, in the theory of Banach spaces continuous linear maps are always uniformly continuous.
What about a map that is not uniformly continuous, like the inversion map f(x) = x/∥x∥^2 on an infinite-dimensional Hilbert space? In this case there is still a singularity, at 0, and the definition of continuity needs to reflect that. But it doesn’t help to imagine all sorts of other singularities dispersed throughout the space, because that prevents you from making useful statements like: if x, y are at least α away from 0 and d(x,y) ≤ ε, then d(f(x),f(y)) ≤ Cε/α^2, where C is an absolute constant.
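For concreteness, here is the computation behind that estimate (it in fact gives C = 1):

$$\left\|\frac{x}{\|x\|^2}-\frac{y}{\|y\|^2}\right\|^2 = \frac{1}{\|x\|^2} - \frac{2\langle x,y\rangle}{\|x\|^2\|y\|^2} + \frac{1}{\|y\|^2} = \frac{\|x-y\|^2}{\|x\|^2\|y\|^2},$$

so if ∥x∥, ∥y∥ ≥ α and ∥x−y∥ ≤ ε, then ∥f(x)−f(y)∥ ≤ ε/α^2.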
Now the estimate in the previous paragraph is an instance of quantitative continuity, which is stronger than uniform continuity away from singularities. But the point is that it can be seen as an extension of uniform continuity away from singularities.
Maybe my last reason will be the most relevant from a naturalized-agent perspective. The notion of uniform continuity is important because it introduces the modulus of continuity, which can be viewed as a measure of how continuous a function is. The restriction that an agent must be uniformly continuous can then be thought of in a quantitative sense, with “better” agents being less constrained by it. So a more powerful agent may have a looser (larger) modulus of continuity, because it can react more precisely to different possible inputs.
In this terminology, my proof can be thought of as giving an intuitive reason for why the agent cannot implement every possible policy: the agent has limited resources to distinguish different inputs, so it can only implement those policies that can be implemented with these limited resources.
The obvious followup question is: if you restrict your attention to the policies that the agent isn’t prevented from implementing by its limited resources, can it implement all of those? Or in other words, if you fix a modulus of continuity from the outset, can you include all functions with that modulus of continuity as fibers?
If you allow the every-policy function to have an arbitrary modulus of continuity, unrelated to the modulus of continuity you are trying to imitate, then it is not hard to see that this is possible at least for some spaces. (By Arzelà–Ascoli, the space of functions with a fixed modulus of continuity is compact, so there exists a continuous surjection from 2^ℕ onto this space.) But this may require greatly increasing the resources that the agent must spend to differentiate inputs. On the other hand, requiring the exact same modulus of continuity seems like too rigid an assumption. So the right question is probably how close the modulus of continuity of the every-policy function can be to the modulus it is trying to imitate.
For this kind of question it is probably better to work with a concrete example than to try to prove something in full generality, so I will work with the Cantor space X = 2^ℕ with the metric d((x_n),(y_n)) = 2^{-min{n : x_n ≠ y_n}}. Suppose we want to imitate all functions g : X → {0,1} such that d(x,y) < ε implies g(x) = g(y). (I know this is not quite the same as the original question, but I think it is close enough.) If ε = 2^{-n}, then such a function depends only on the first n coordinates, so there are N = 2^{2^n} such functions. Now suppose we have a single function f : X×X → {0,1} that has all of them as fibers, indexed by distinct points x_1, …, x_N. There are only 2^{2^n−1} = N/2 balls of the form B(x, 2^{-2^n+1}) (each is determined by the first 2^n − 1 coordinates of its points), so by the pigeonhole principle two of the indexing points, say x_1 and x_2, lie in the same such ball, which means d(x_1,x_2) ≤ 2^{-2^n}. Since the corresponding fibers are distinct, there exists y such that f(x_1,y) ≠ f(x_2,y), even though d((x_1,y),(x_2,y)) ≤ 2^{-2^n} (say with the max metric on X×X). It follows that if we want to choose ε′ such that d(x,y) < ε′ implies f(x) = f(y) (i.e. the analogue of the assumption on g, but with ε replaced by ε′ and with x, y now ranging over X×X), then we need ε′ ≤ 2^{-2^n}.
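Here is a small Python sanity check of this counting (the brute-force check and the variable names are my own illustration, not part of the argument):

```python
# Illustrative only: the double-exponential counting behind the Cantor-space example.
from itertools import product

def cantor_dist(x, y):
    """d((x_n),(y_n)) = 2^{-min{n : x_n != y_n}}, coordinates indexed from 1."""
    for i, (a, b) in enumerate(zip(x, y), start=1):
        if a != b:
            return 2.0 ** (-i)
    return 0.0

n = 3
eps = 2.0 ** (-n)                  # required accuracy of g
N = 2 ** (2 ** n)                  # number of policies determined by the first n bits
num_balls = 2 ** (2 ** n - 1)      # balls of radius 2^{-2^n + 1} (cylinders of length 2^n - 1)
assert num_balls < N               # pigeonhole: two indexing points share such a ball

eps_prime_bound = 2.0 ** (-(2 ** n))   # hence eps' <= 2^{-2^n}
print(f"eps = {eps}, N = {N}, eps' <= {eps_prime_bound}")

# Sanity check on one particular choice of N distinct indexing points
# (all bit strings of length 2^n): some pair is indeed within the bound.
points = list(product((0, 1), repeat=2 ** n))
closest = min(cantor_dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])
assert closest <= eps_prime_bound
```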
In conclusion, while the required accuracy ε of g is only exponentially small in n, the required accuracy ε′ of f must be doubly exponentially small in n; equivalently, ε′ ≤ 2^{-1/ε}. Thus it is not feasible to implement such a function.