I’m confused about your three-dimensional example and would appreciate more mathematical detail.
Call the feature directions f1, f2, f3.
Suppose SAE hidden neurons 1, 2, and 3 read off the components along f1, f2, and f1+f2, respectively. You claim that in some cases this may achieve lower L1 loss than reading off the f1, f2, f3 components.
[Note: the component of a vector X along f1+f2 here refers to (1/2) * (f1+f2) \cdot X.]
Can you write down the encoder biases that would achieve this loss reduction? Note that e.g. when the input is f1, there is a component of 1/2 along f1+f2, so you need a bias < -1/2 on neuron 3 to avoid screwing up the reconstruction.
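To spell out the arithmetic I have in mind, here is a quick check in PyTorch, treating f1, f2, f3 as the standard basis of R^3 for concreteness (that choice is just my simplifying assumption):
import torch
# Simplifying assumption: f1, f2, f3 are the standard basis of R^3.
f1 = torch.tensor([1., 0., 0.])
f2 = torch.tensor([0., 1., 0.])
# Neuron 3 reads the component along f1+f2, i.e. (1/2) * (f1+f2) \cdot x.
read_dir = 0.5 * (f1 + f2)
x = f1                              # input is the pure f1 feature
pre_act = read_dir @ x              # = 0.5, the unwanted component
act = torch.relu(pre_act - 0.6)     # a bias < -1/2 (here -0.6) zeroes neuron 3
Without such a bias, neuron 3 would fire at 0.5 on a pure-f1 input and corrupt the reconstruction.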
Hey Jacob! My comment has a code example with biases:
import torch
# Encoder weights: one row per hidden neuron (3 neurons reading a 2-D input).
W = torch.tensor([[-1., 1.], [1., 1.], [1., -1.]])
# Three input vectors, one per row.
x = torch.tensor([[0., 1.], [1., 1.], [1., 0.]])
# Bias of -1 on hidden neuron 2 only.
b = torch.tensor([0., -1., 0.])
# Encoder pre-activations plus ReLU.
y = torch.nn.functional.relu(x @ W.T + b)
This is the encoder: y comes out as the 3x3 identity, so each input activates exactly one hidden neuron, i.e. the hidden representation is sparse.
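To see it concretely, printing y after running the snippet gives the identity:
print(y)
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])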
Nice, this is exactly what I was asking for. Thanks!