There has been a lot of interest in this going back to at least early this year and the 1.58-bit LLM (ternary) paper https://arxiv.org/abs/2402.17764, so I expect there has been a research gold rush, with a lot of design effort going into custom hardware almost as soon as that paper was revealed.
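For concreteness, the scheme in that paper quantizes each weight to one of {-1, 0, +1} using an "absmean" scale. A minimal numpy sketch of my reading of their RoundClip formulation (not their code):

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Absmean ternary quantization in the style of BitNet b1.58:
    scale by the mean absolute weight, then round-and-clip to {-1, 0, +1}."""
    gamma = np.abs(w).mean()                         # per-tensor scale
    w_ternary = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_ternary.astype(np.int8), gamma          # ternary weights + scale

w = np.random.randn(4, 4).astype(np.float32)
wq, gamma = ternary_quantize(w)
print(np.unique(wq))  # only -1, 0, +1 remain
```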
With the dual-chip Nvidia GB200 Grace Blackwell offering (sparse) 40 Pflops of fp4 at ~1 kW, something close to optimal hardware is already available. That fp4 performance may be part of why the latest generation of Nvidia GPUs is in such high demand; previous generations didn't offer it as far as I am aware. For comparison, a human brain is likely equivalent to 10-100 Pflops, though estimates vary.
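Taking those figures at face value (sparse fp4 Pflops are not directly comparable to whatever the brain computes, so treat this as order-of-magnitude only):

```python
gb200_pflops = 40          # sparse fp4, per the figure above
gb200_kw = 1.0
brain_pflops = (10, 100)   # rough brain-equivalence range from above

flops_per_watt = gb200_pflops * 1e15 / (gb200_kw * 1e3)
print(f"{flops_per_watt:.1e} flop/s per watt")          # ~4e13
for b in brain_pflops:
    print(f"brain-equivalents per GB200: {gb200_pflops / b:.1f}")  # 4.0 and 0.4
```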
Being able to raise the performance of a single AI chip significantly has huge system-level cost benefits.
All of which suggests that AI costs are going to drop yet again, and that human-level AGI operating costs will be measured in cents per hour when it arrives in a few years' time.
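A back-of-envelope version of that claim, assuming brain-equivalent compute suffices for human-level AGI and an assumed electricity price of $0.10/kWh:

```python
brain_pflops = (10, 100)
gb200_pflops_per_kw = 40 / 1.0   # from the GB200 figures above
usd_per_kwh = 0.10               # assumed electricity price

for b in brain_pflops:
    kw = b / gb200_pflops_per_kw                # power to match b Pflops
    cents_per_hour = kw * usd_per_kwh * 100
    print(f"{b} Pflops -> {kw:.2f} kW -> {cents_per_hour:.1f} cents/hour")
# roughly 2.5 to 25 cents/hour across the range, electricity only
```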
The implications for autonomous robotics are likely tremendous: potential order-of-magnitude power savings could bring far more capable systems to smaller platforms, home robotics, FSD cars, and (scarily) military murderbots. Tesla has (according to Elon's comments) a new HW5 autonomy chip coming out next year that is ~50x faster than their current HW3 FSD development baseline (a 2 x 72 Tflop chipset) but needs closer to 1 kW of power, so they will be extremely keen on anything that could save that much power.
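Spelling out those secondhand numbers (both the 50x figure and the power figure come via Elon's comments, so treat them loosely):

```python
hw3_tflops = 2 * 72                 # HW3 dual-chip baseline cited above
hw5_tflops = 50 * hw3_tflops        # the ~50x claim
print(f"HW5 ~ {hw5_tflops / 1000:.1f} Pflops")          # ~7.2 Pflops
print(f"~ {hw5_tflops / 1000:.1f} Pflops/kW at ~1 kW")  # vs ~40 for GB200 fp4
```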
This is 2015-2016 tech though. The value of the recent ternary BitNet result is in demonstrating that it works well for transformers (which wasn't nearly as much the case for binary BitNet).
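The hardware appeal is that a ternary-weight matmul needs no weight multiplications at all, only adds and subtracts, which is a big part of why the 2015-2016 era binary/ternary work attracted custom-silicon interest. A toy sketch of that reduction (illustrative only; real kernels fuse this differently):

```python
import numpy as np

def ternary_matvec(w_ternary: np.ndarray, x: np.ndarray, gamma: float) -> np.ndarray:
    """Matrix-vector product with ternary weights, written to make explicit
    that no weight multiplications are needed: each output is a signed sum
    of selected activations (hardware can do this with adders only)."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()  # zeros are skipped
    return gamma * out   # single rescale by the quantization scale

# sanity check against a dense matmul
w = np.sign(np.random.randn(3, 5)).astype(np.int8)
x = np.random.randn(5).astype(np.float32)
assert np.allclose(ternary_matvec(w, x, 1.0), w.astype(np.float32) @ x, atol=1e-5)
```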
The immediate practical value of this recent paper is more elusive: they try to do even more by exorcising multiplication from attention as well, which is a step in an important direction, but the data they present doesn't seem sufficient to overcome the prior that this is very hard to do successfully. Only Mamba got close to attention as a pure alternative (without the constraint of avoiding multiplication), and even then it has issues unless hybridized with (local) attention, a hybridization that also works well with other attention alternatives, sometimes even better than vanilla attention on its own.
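For readers unfamiliar with the hybrid trick: most layers stay as the cheap alternative mixer, with occasional sliding-window ("local") attention layers interleaved. A schematic sketch, with the layer pattern and names purely illustrative rather than taken from any specific paper:

```python
import numpy as np

def local_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask for sliding-window causal attention: position i may
    attend to positions j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# A common hybrid recipe: mostly SSM/linear-mixer layers with an occasional
# local-attention layer (a hypothetical 24-layer stack for illustration).
layer_pattern = ["ssm", "ssm", "ssm", "local_attn"] * 6
print(local_causal_mask(6, 3).astype(int))
```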