Yeah I agree with that. But there is also a sense in which some (many?) features will be inherently sparse.
A token is either the first token of a multi-token word or it isn't.
A word is either a noun, a verb, or something else.
A word belongs to language LANG and not to any other language (or it has different meanings in those other languages).
An H×W image can only contain so many objects, which in turn can only have so many sub-aspects.
I don’t know what it would mean to go “out of distribution” in any of these cases.
This means that any network with an incentive to conserve parameter usage (however we want to define that) might want to use superposition.
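To make the intuition concrete, here is a minimal numpy sketch (my own illustration, not something from the comment or from any particular paper) of why sparsity makes superposition cheap: with more features than dimensions, each feature gets a random nearly-orthogonal direction, and because only a few features are active at once the interference between them stays small. All the names and numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 100, 20      # more features than dimensions
p_active = 0.05                   # each feature is rarely "on" (sparse)

# Hypothetical encoding: one random unit direction per feature
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse feature vector: most entries are zero
x = (rng.random(n_features) < p_active) * rng.random(n_features)

h = x @ W                         # superposed hidden representation (20 dims)
x_hat = np.maximum(h @ W.T, 0)    # crude linear read-off + ReLU

active = x > 0
print("active features:", int(active.sum()))
print("mean |x_hat - x| over all features:", float(np.abs(x_hat - x).mean()))
```

Rerunning with a higher p_active (i.e. denser features) makes the reconstruction error grow, which is the sense in which superposition only pays off when features are sparse.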