First, the principle of ‘no computation without representation’ holds: a network can only compute what it can represent. How strong that representation is depends on the specific computational task and on the architecture, such as a Transformer. For example, a Transformer applied to a simple, low-dimensional linear problem provides a strong representation; applied to a high-order, high-dimensional nonlinear problem, the representation may be weaker.
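A minimal sketch of this claim, under illustrative assumptions: fix a random nonlinear feature map as a stand-in for a network’s representation, then measure how well a linear readout recovers a linear target versus a high-order nonlinear one. The feature map, targets, and dimensions below are toy choices, not anything prescribed by the theory itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(X, W):
    """Fixed random-feature representation: tanh of a random projection."""
    return np.tanh(X @ W)

def probe_error(X, y, W):
    """Fraction of target variance a least-squares linear readout fails to explain."""
    H = features(X, W)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    resid = y - H @ beta
    return resid.var() / y.var()

d, n, width = 8, 2000, 64
X = rng.standard_normal((n, d))
W = rng.standard_normal((d, width)) / np.sqrt(d)

y_linear = X @ rng.standard_normal(d)          # simple, low-dimensional linear target
y_nonlin = np.sin(3 * X[:, 0]) * X[:, 1] ** 3  # high-order nonlinear target

print("linear target, unexplained variance:    %.4f" % probe_error(X, y_linear, W))
print("nonlinear target, unexplained variance: %.4f" % probe_error(X, y_nonlin, W))
```

On this toy setup the linear target is fit almost perfectly while the nonlinear target leaves substantial variance unexplained, mirroring the stronger-versus-weaker representation contrast described above.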
The neural network operates as a power-efficient system: each node requires minimal computational power, and all foundation-model pre-training is self-supervised. The network’s self-progressing boundary condition imposes no restriction on where incoming data is processed; data is routed to whichever nodes are capable of handling it, so the same token may be processed in different nodes. As a result, many replicas of identical or near-identical feature bits (units of feature) are likely dispersed throughout the network. Mathematically, the connections between nodes (pathways) are not equal. Our working theory proposes that feature bits propagate through the network, with their propagation distance determined by the computational capacity of each node. The pathway appears to be power-driven, prioritizing certain features or patterns during learning in a discriminatory manner. While this discriminative feature pathway (DFP) is mathematically plausible, the underlying theory remains unclear; it seems that neural networks are leading us into the realm of bifurcation theory.
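A toy sketch of the DFP idea, under stated assumptions: a feature bit enters at a random node and hops to neighbors, with each hop gated by the receiving node’s computational capacity, so higher-capacity nodes carry the bit further and replicas disperse unevenly. The node count, capacity distribution, and gating rule below are all illustrative assumptions, not a specification from the working theory.

```python
import random

random.seed(0)

N = 50
capacity = [random.random() for _ in range(N)]            # per-node compute budget
neighbors = {i: random.sample(range(N), 4) for i in range(N)}

def propagate(start, budget=5.0):
    """Return the set of nodes a feature bit reaches from `start`.

    Each hop consumes budget inversely proportional to the receiving
    node's capacity: strong nodes pass the bit on cheaply, weak nodes
    exhaust it quickly, so pathways are not equal.
    """
    visited, frontier = {start}, [(start, budget)]
    while frontier:
        node, left = frontier.pop()
        for nxt in neighbors[node]:
            cost = 1.0 / (capacity[nxt] + 1e-3)
            if nxt not in visited and left - cost > 0:
                visited.add(nxt)
                frontier.append((nxt, left - cost))
    return visited

# Inject the same feature bit at ten random entry points: the spread of
# reach counts illustrates replicas dispersing along unequal pathways.
replicas = [len(propagate(random.randrange(N))) for _ in range(10)]
print("nodes reached per injection:", replicas)
```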