This feels like cheating to me, but I guess I wasn’t super precise with ‘feedforward neural network’. I meant ‘fully connected neural network’, so the gradient computation has to be connected by parameters to the outputs. Specifically, I require that you can write the network as
$$f(x,\theta)=\sigma_n\circ W_n\circ\cdots\circ\sigma_1\circ W_1\,[x,\theta]^T$$
where the weight matrices $W_i$ are some nice function of $\theta$. (We need a weight sharing function to make the dimensions work out: it takes in $\theta$ and produces the $W_i$ matrices that are actually used in the forward pass.)
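To make this concrete, here is a minimal sketch (my own illustration, not from the original): the layer sizes, the $\tanh$ activations, and the tile-and-slice duplication scheme are arbitrary choices; the point is only that the $W_i$ used in the forward pass come from $\theta$ via a weight sharing function $\phi$, while $[x,\theta]^T$ is also fed in as the input.

```python
import jax
import jax.numpy as jnp

# Hypothetical sizes, purely for illustration.
d_x = 3                   # input dimension
w   = 4                   # number of free parameters, theta in R^w
dims = [d_x + w, 5, 2]    # layer widths; layer 1 acts on [x, theta]^T

# Total number of weight-matrix entries W that the forward pass needs.
W = sum(dims[i + 1] * dims[i] for i in range(len(dims) - 1))

def phi(theta):
    """Weight sharing function phi: R^w -> R^W that only duplicates
    parameters: tile theta until there are at least W entries, then slice."""
    reps = -(-W // theta.shape[0])          # ceil(W / w)
    return jnp.tile(theta, reps)[:W]

def unpack(flat):
    """Cut the flat vector phi(theta) into the individual W_i matrices."""
    mats, start = [], 0
    for i in range(len(dims) - 1):
        n = dims[i + 1] * dims[i]
        mats.append(flat[start:start + n].reshape(dims[i + 1], dims[i]))
        start += n
    return mats

def f(x, theta):
    """f(x, theta) = sigma_n o W_n o ... o sigma_1 o W_1 [x, theta]^T,
    with every W_i produced from theta by the weight sharing function."""
    h = jnp.concatenate([x, theta])         # the augmented input [x, theta]^T
    for W_i in unpack(phi(theta)):
        h = jnp.tanh(W_i @ h)               # sigma_i = tanh, an arbitrary choice
    return h

x, theta = jnp.ones(d_x), jnp.arange(1.0, w + 1)
print(f(x, theta))                                  # forward pass
print(jax.jacobian(f, argnums=1)(x, theta))         # d f / d theta
```

Note that differentiating with respect to $\theta$ picks up both paths: through the $[x,\theta]^T$ input slot and through $\phi(\theta)$.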
I guess I should be more precise about what ‘nice’ means, to rule out weight sharing functions that always zero out the input, but it turns out this is kind of tricky. Let’s require the weight sharing function $\phi:\mathbb{R}^w\to\mathbb{R}^W$ to be differentiable and to have an image satisfying $[-1,1]\subset\mathrm{proj}_n(\mathrm{im}(\phi))$ for every coordinate projection $\mathrm{proj}_n$. (A weaker condition is to require that the weight sharing function can only duplicate parameters.)
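As a quick sanity check of my own (not in the original): if $\phi$ only duplicates parameters, so that output coordinate $n$ copies input coordinate $k(n)$, then

$$\mathrm{proj}_n(\mathrm{im}(\phi)) = \{\phi(\theta)_n : \theta\in\mathbb{R}^w\} = \{\theta_{k(n)} : \theta\in\mathbb{R}^w\} = \mathbb{R}\supset[-1,1],$$

and such a $\phi$ is linear, hence differentiable, so duplication-only weight sharing functions automatically satisfy the projection condition.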