Point of clarification: Is the supervisor the same as the potentially faulting hardware, or are we talking about a different, non-suspect node checking the work, and/or e.g. a more reliable model of chip supervising a faster but less reliable one?
Generally each node involved is something like a rack-mounted server, or a virtual machine running on one, all of roughly comparable reliability (often only around commodity level). The nodes running the checks may themselves be redundant and cross-checked, or the whole system may consist of nodes that both do the work and cross-check each other. There are well-known algorithms for a group of nodes cross-checking each other that will provably always give the right answer as long as a suitably sized majority of them are still working correctly (that is, unless too many of them fail at once in a weirdly coordinated way), and knowing the reliability of your nodes (from long experience) you can choose the size of your group to achieve any desired level of overall reliability. Then you need to achieve the same thing for network, storage, job scheduling, data-paths, updates and so forth: everything involved in the process. This stuff is hard in practice, but the theory is well understood and taught in CS classes. With enough work on redundancy, cross-checks, and retries you can build arbitrarily large, arbitrarily reliable systems out of somewhat unreliable components. Godzilla can be trained to reliably defeat megagodzilla (please note that I’m not claiming you can make this happen reliably the first time: initially there are invariably failure modes you hadn’t thought of, which mean you need to do more work). The more unreliable your basic components, the harder this gets, and there’s almost certainly a required minimum reliability threshold for them: if they usually die before they can even do a cross-check on each other, you’re stuck.
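To make the "choose the size of your group" point concrete, here is a rough sketch of the arithmetic. It assumes the simplest possible model: plain majority voting, with each node failing independently with the same probability `p`. Real systems also have to worry about correlated failures, and real consensus protocols have stricter requirements than a bare majority, so treat this as an illustration only. The function names are made up for this example.

```python
from math import comb


def majority_failure_probability(n: int, p: float) -> float:
    """Probability that a majority vote among n nodes goes wrong.

    Assumes each node fails independently with probability p, and that the
    vote is wrong whenever the correct nodes are not a strict majority.
    """
    # Smallest number of failed nodes that leaves no strict majority correct.
    threshold = (n + 1) // 2
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


def smallest_group_size(p: float, target: float, max_n: int = 101) -> int:
    """Smallest odd group size whose majority vote fails with probability <= target."""
    for n in range(1, max_n + 1, 2):
        if majority_failure_probability(n, p) <= target:
            return n
    raise ValueError("no group size up to max_n meets the target")
```

Under these assumptions, nodes that each fail 1% of the time need a group of 7 to get the vote's failure rate below one in a million (`smallest_group_size(0.01, 1e-6)`). If `p` is above 0.5, adding nodes makes the vote worse rather than better and the function raises, which is a toy version of the minimum-reliability threshold mentioned above.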
If you read the technical report for Gemini, in the section on training they explicitly mention doing engineering to detect and correct cases where a server has temporarily had a limited point-failure during a calculation due to a cosmic ray hit. They’re building systems so large that they need to cope with failure modes that rare. They also maintain multiple online services that each have ~2 billion users, i.e. most people who are online. The money they make (mostly off search) allows them to hire a lot of very skilled engineers, and as the company name says, they specialize in scale. There’s an inside joke among Google engineers that after being there a while you lose the ability to think in small numbers — “I can’t count that low any more”: ask a Google engineer to set up a small website for, say, a cafe, and they’re likely to select technologies that have been engineered and proven to scale to a billion users, without stopping to consider how many customers a cafe could ever serve.
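As a generic illustration of detecting and correcting a transient point-failure (this is not the mechanism from the Gemini report, whose details I'm not reproducing here), one simple pattern is to run the same deterministic computation several times, accept the answer only if a strict majority of runs agree, and retry otherwise. The helper name `run_with_crosscheck` and its parameters are hypothetical, and the sketch assumes results are hashable and that a fault only corrupts an occasional run.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_crosscheck(task: Callable[[], T], replicas: int = 3, max_rounds: int = 5) -> T:
    """Run task `replicas` times per round and return the strict-majority result.

    If no result wins a strict majority in a round, the whole round is retried,
    up to max_rounds times.
    """
    for _ in range(max_rounds):
        results = [task() for _ in range(replicas)]
        value, count = Counter(results).most_common(1)[0]
        if count > replicas // 2:
            return value
    raise RuntimeError("no majority result after retries")
```

In a real cluster each replica would run on a different machine, so that a single bad server can't corrupt every copy; running them in one process, as here, only shows the voting and retry logic.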
Quite a bit of this stuff is also available open-source now, though the open-source versions generally have only had enough kinks worked out to scale to maybe O(10M) users.