My understanding is that GPT-style transformer architectures already incorporate random seeds at various points. In that case, adding this functionality to those random seeds wouldn’t impose any significant “cost” in terms of competing with other implementations.
This source (a Chinese news website), “Not 175 billion! OpenAI CEO’s announcement: GPT-4 parameters do not increase but decrease” — iMedia (min.news), cites the Sam Altman quote about GPT-4 having fewer parameters as coming from the AC10 online meetup; however, I can’t find any transcript or video of that meetup to verify it.
GPT-4 was trained on OpenAI’s new supercomputer which is composed of [edit] NVIDIA DGX A100 nodes.
I’m assuming each individual instance of GPT-4 runs on one DGX A100 node.
Each DGX node has 8x A100 GPUs, and each A100 comes with either 40 or 80 GB of VRAM, so a single DGX node running GPT-4 has either 320 or 640 GB in total. That lets us calculate an upper limit on the number of parameters in a single GPT-4 instance.
Assuming GPT-4 uses float16 to represent parameters (the same as GPT-3), and assuming they’re using the 80 GB A100s, that gives an upper limit of roughly 343 billion parameters for one GPT-4 instance.
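The arithmetic behind that figure, as a quick sketch (this assumes “640 GB” means 640 GiB of aggregate VRAM and counts weights only, ignoring activations and KV cache):

```python
# Upper bound on parameters that fit in one 8x A100-80GB DGX node's VRAM,
# assuming float16 weights (2 bytes each) and no memory left for anything else.
bytes_per_param = 2                      # float16
node_vram_bytes = 640 * 1024**3          # 8 GPUs x 80 GiB
max_params = node_vram_bytes // bytes_per_param
print(f"{max_params // 10**9} billion parameters")  # 343 billion
```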
GPT-3 had 175 billion parameters. I’ve seen a few references online to an interview in which Sam Altman said GPT-4 actually has fewer parameters than GPT-3 but a different architecture and more training. I can’t find the original source, so I can’t verify that quote, but assuming it’s true, that gives us a lower bound slightly below GPT-3’s 175 billion parameters.
Looking at the compute architecture it’s running on, an upper bound of 343 billion parameters seems reasonable. [edited: removed incorrect estimate of the number of DGX nodes as 800; the figure wasn’t used in the parameter estimate anyway.]
Also since it’s my first post here, nice to meet everyone! I’ve been lurking on LessWrong since ~2016 and figured now is as good a time as any to actively start participating.
Everett Insurance as a Misalignment Backup
(Posting this here because I’ve been lurking a long time and just decided to create an account. I’m not sure this idea is well structured enough to warrant a top-level post, and I need to accrue karma before I can post anyway.)
Premise:
There exist some actions whose cost is so low, and whose payoff is so high in the case that the Everett interpretation happens to be true, that their expected value makes them worth doing even if P(everett = true) is very low (personal prior: P(everett = true) = 5%).
Proposal:
The proposal behind Everett Insurance is straightforward: use quantum random number generators (QRNGs) to introduce quantum random seeds into large AI systems. QRNGs are devices that use quantum processes (such as photon emission or tunneling) to generate random numbers that are truly unpredictable and irreproducible. These random numbers can be used as seeds during training and/or inference in large AI systems. If the Everett interpretation happens to be true, doing so would create a broad swath of possible universes in which each and every AI inference (or training run) is slightly different.
The idea is based on the Everett interpretation of quantum mechanics (also known as the many-worlds interpretation), which holds that there are many worlds that exist in parallel at the same space and time as our own. According to this interpretation, every time a quantum interaction occurs with different possible outcomes (such as measuring the spin of an electron), all outcomes are obtained, each in a different newly created world. For example, if one flips a quantum coin (a device that uses a quantum process to generate a random bit), then there will be two worlds: one where the coin lands heads and one where it lands tails.
This alignment method is not something that should be used as a primary alignment strategy, given that it only has value on the off chance that the Everett interpretation is correct. I bring it up only because its low cost and implementation effort (a low alignment tax) give it a high expected value, and it is something that could easily be implemented now.
Expected value:
The Everett interpretation of quantum mechanics has been controversial since its inception; however, this method of misalignment backup has a high expected value even assuming a low chance that the interpretation is correct. Implementing QRNGs in existing AI systems would be relatively easy, as it only requires adding a few lines of code. The potential benefit is enormous, as it could reduce the existential risk that misaligned AI systems wipe out humanity across every future. Therefore, even if we assign a very low probability to the Everett interpretation being true, the expected value of Everett Insurance is still high enough to potentially justify adopting it now.
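As a rough illustration of that claim, here is a toy expected-value calculation. Every number except the 5% prior stated in the premise is an arbitrary placeholder, not a figure from this post:

```python
# Toy EV sketch: benefit is normalised to 1 "unit" of existential value.
p_everett = 0.05          # personal prior from the premise above
p_counterfactual = 0.01   # assumed: chance the quantum seeding is what makes the difference
benefit = 1.0             # assumed: value of preserving at least one aligned branch
implementation_cost = 1e-6  # assumed: negligible engineering + API cost, same units

expected_value = p_everett * p_counterfactual * benefit - implementation_cost
print(expected_value)  # positive under these assumptions, i.e. worth doing
```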
Implementation:
One possible way to implement Everett Insurance now is to add a random seed, hashed from a QRNG API, to PyTorch’s dropout function. The dropout function randomly zeroes some elements of an input tensor with a given probability during training. By using a QRNG API to generate the seed that drives each dropout layer, we would introduce quantum randomness into the learned parameters and outputs.
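A minimal sketch of what that could look like, assuming a hypothetical `fetch_quantum_random_bytes()` wrapper around whichever QRNG API one chooses (stubbed here with `os.urandom` so the example runs end to end):

```python
import hashlib
import os

import torch
import torch.nn as nn


def fetch_quantum_random_bytes(n: int = 32) -> bytes:
    # Placeholder: a real implementation would call a QRNG API here.
    # os.urandom is a classical stand-in so this sketch is runnable.
    return os.urandom(n)


def quantum_seed() -> int:
    """Hash the QRNG output down to a 63-bit seed for torch's global RNG."""
    digest = hashlib.sha256(fetch_quantum_random_bytes()).digest()
    return int.from_bytes(digest[:8], "little") & (2**63 - 1)


# Re-seed before a training step so the dropout mask depends on the quantum draw.
torch.manual_seed(quantum_seed())

dropout = nn.Dropout(p=0.1)
x = torch.randn(4, 16)
out = dropout(x)  # which elements get zeroed now traces back to the QRNG
```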
This method has some limitations: it only works for neural networks that use dropout layers (which I believe currently includes all LLMs) and it only introduces quantum randomness during training, not inference. In the future, these limitations could be overcome by developing more efficient and accessible ways to integrate QRNGs into machine learning libraries. For example, one could use QRNGs to generate random seeds for inference passes as well as training passes; one could use QRNGs to generate random seeds for other types of layers or functions besides dropout; and one could use QRNGs that are embedded into hardware devices or chips instead of relying on external APIs. As QRNGs become cheaper due to higher demand, further integration of random seeds could be implemented across other parts of the AI development and inference process.
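For the inference-time variant, one hedged sketch (the vocabulary size, temperature, and seed derivation below are all illustrative assumptions) is to give each generation its own quantum-seeded `torch.Generator` and pass it to the sampling step:

```python
import hashlib
import os

import torch

# Stand-in seed derivation (a real system would hash bytes from a QRNG API).
seed = int.from_bytes(hashlib.sha256(os.urandom(32)).digest()[:8], "little") & (2**63 - 1)

gen = torch.Generator().manual_seed(seed)
logits = torch.randn(1, 50_000)              # stand-in for one step of LLM logits
probs = torch.softmax(logits / 0.8, dim=-1)  # temperature-scaled sampling distribution
next_token = torch.multinomial(probs, num_samples=1, generator=gen)
```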
Drawbacks:
The biggest potential negative externality I can think of is that people who believe the Everett interpretation is true might reduce their effort on other alignment strategies. This strategy only has any value at all if the Everett interpretation is true, and we currently have no way of testing that hypothesis. Therefore, we should not reduce our effort on any other alignment strategy, and we should probably implement this at a low level (in machine learning library functions) in order to reduce awareness and thereby prevent reduced effort elsewhere.
The second drawback is cost. The existing QRNGs I found had API costs on the order of $0.005/request, and I didn’t see any stats on the generation rate of these quantum random numbers (i.e., will two API requests at the same Unix clock time return the same quantum random number?). However, the API cost would be negligible relative to the other training costs of current large LLMs, and the manufactured demand would likely drive per-request costs down if similar functionality were added to inference steps.
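A back-of-envelope check on that: the per-request price is the figure quoted above, while the number of re-seeding events is my own illustrative assumption.

```python
# Rough API-cost estimate for quantum re-seeding during training.
cost_per_request = 0.005     # USD per request, from the QRNG APIs mentioned above
reseed_events = 1_000_000    # assumed: one quantum seed per training step
total_api_cost = cost_per_request * reseed_events
print(f"${total_api_cost:,.0f}")  # $5,000 -- small next to multi-million-dollar training runs
```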
Another potential criticism is that, if we actually do live in an Everett universe, the variance from quantum randomness is already large enough to lead to both universes where AI ends up aligned and universes where it ends up misaligned. My response is that, in spite of quantum mechanics, computers are largely deterministic (with the exception of rare occurrences like cosmic rays striking transistors), and any variance that does occur is largely corrected for at the scale of floating-point operations on GPUs. Additionally, we don’t understand the brain well enough yet to know whether quantum randomness (through potential mechanisms like Brownian motion inside synapses) is significant enough to produce timing differences in the firing of biological neurons, so we don’t yet know whether human behavior is largely deterministic in an Everett universe. Given these unknowns, it’s possible that even in an Everett universe we could find ourselves at a point in time where all possible futures lead to misalignment without the addition of artificial variance through a method like this.
Finally, one of the larger problems with any potential implementation is that the variance due to the random seed needs to be large enough to actually produce non-deterministic variance in output and behavior. This has the potential drawback that additional variance in AI behavior reduces the trust in and reliability of these systems. However, this cost could be greatly reduced by using a well-designed hash of the quantum random number that leads to deterministic behavior for the vast majority of quantum random seeds but generates larger amounts of variance for some small fraction of possible seeds.
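One way that “mostly deterministic” hash could look, as a sketch (the 1-in-1024 branch fraction is an arbitrary illustrative choice, not something argued for above):

```python
import hashlib

DEFAULT_SEED = 0
BRANCH_FRACTION = 1 / 1024  # assumed: fraction of quantum draws that actually vary behavior


def mostly_deterministic_seed(qrng_bytes: bytes) -> int:
    """Map most quantum draws to one fixed seed; let a rare minority diverge."""
    digest = hashlib.sha256(qrng_bytes).digest()
    selector = int.from_bytes(digest[:8], "little") / 2**64
    if selector < BRANCH_FRACTION:
        # Rare branch: derive a distinct seed so behavior genuinely varies in these worlds.
        return int.from_bytes(digest[8:16], "little") & (2**63 - 1)
    return DEFAULT_SEED  # common branch: identical behavior, preserving reliability
```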
[EDIT] replaced “in all” with “across every” to try and clarify a sentence
[EDIT] Adding an additional criticism from a user on reddit:
“This idea seems to be premised on some misunderstandings.
The Born rule gives a measure over the possible universes. The usual way of reconciling utilitarianism with the many-worlds interpretation is simply to take the expectation value of utility over this measure. So our goal then is to avoid AI catastrophe in as many universes as possible. Our goal is not merely to avoid it in some universe.
But suppose you really do just want there to be some world where AI doesn’t take over. Then there’s no need to do anything, as any world that is not forbidden by e.g. conservation principles will occur with some nonzero probability (or “exist”, as a many-worlder would say). Is AI alignment forbidden by fundamental laws of physics? Of course not.” -endquote
I don’t remember; I probably misremembered the number of DGX nodes. I’ll edit the comment to remove that figure.
Assuming that GPT-4 is able to run on a single DGX node with 640 GB VRAM, what would be a reasonable upper bound on the parameter count assuming that they’re using mixed precision and model offload approaches?
[Edit] I’ve been researching various offloading approaches. It seems unlikely that they are using anything like weight offloading, as loading layers to and from VRAM would be far too time-consuming to build a usable API around. If the model is too large to fit on a single DGX node, it’s more likely that they’re splitting the layers across multiple nodes rather than offloading weights.
I already assumed they’re using float16, like GPT-3, when calculating the total number of parameters that could fit in one DGX node’s VRAM. Unless they’re using something even smaller, like float8, mixed precision with float32 or float64 would only increase VRAM requirements.