What’s the FATE community? Fair AI and Tech Ethics?
We have conveniently just updated our database if anyone wants to investigate this further!
https://epochai.org/data/notable-ai-models
Here is a “predictable surprise” I don’t see discussed often: given the advantages of scale and centralisation for training, it does not seem crazy to me that some major AI developers will pool resources in the future and jointly train large AI systems.
I’ve been tempted to do this sometime, but I fear the prior is performing one very important role you are not making explicit: defining the universe of possible hypotheses you consider.
In turn, that universe of hypotheses determines what Bayesian updates look like. Here is a problem that arises when you ignore this: https://www.lesswrong.com/posts/R28ppqby8zftndDAM/a-bayesian-aggregation-paradox
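To make that concrete, here is a minimal Python sketch (with made-up likelihoods, not taken from the linked post) showing how the same evidence yields different posteriors depending on which hypotheses the prior admits:

```python
# Minimal sketch with made-up numbers: the posterior on H1 depends on which
# hypotheses the prior admits, even though the evidence is unchanged.

def posterior(priors, likelihoods):
    """Bayes' rule over a fixed, assumed-exhaustive set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(joint.values())
    return {h: p / z for h, p in joint.items()}

likelihoods = {"H1": 0.8, "H2": 0.2, "H3": 0.5}  # P(evidence | H), hypothetical

# Universe A: the prior only admits H1 and H2.
print(posterior({"H1": 0.5, "H2": 0.5}, likelihoods))            # H1 -> 0.80

# Universe B: H3 is also on the table; same evidence, different update.
print(posterior({"H1": 1/3, "H2": 1/3, "H3": 1/3}, likelihoods))  # H1 -> ~0.53
```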
shrug
I think this is true to an extent, but it needs to be backed up by a more systematic analysis.
For instance, I recall quantization techniques working much better past a certain scale (though I can’t seem to find the reference...). It also seems important to validate that performance-improving techniques still apply at large scales. Finally, note that the frontier of scale is growing very fast, so even if these discoveries were made with relatively modest compute compared to the frontier, that is still a tremendous amount of compute!
even a pause which completely stops all new training runs beyond current size indefinitely would only ~double timelines at best, and probably less
I’d emphasize that we currently don’t have a very clear sense of how algorithmic improvement happens, and it is likely mediated to some extent by large experiments, so I think a pause is likely to slow timelines more than this implies.
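As a toy illustration (purely made-up rates, not a forecast), the implied slowdown is very sensitive to whether algorithmic progress keeps its pace once large experiments are off the table:

```python
# Back-of-the-envelope sketch with made-up numbers (not a forecast).
compute_growth = 0.6   # OOM/year of training compute, assumed
algo_growth = 0.4      # OOM/year of algorithmic efficiency, assumed
remaining_ooms = 4.0   # effective-compute OOMs still needed, assumed

baseline_years = remaining_ooms / (compute_growth + algo_growth)

# A pause freezes training compute; how much it stretches timelines depends
# on whether algorithmic progress keeps its pre-pause pace.
for algo_during_pause in (0.4, 0.2, 0.1):
    paused_years = remaining_ooms / algo_during_pause
    print(f"algo rate {algo_during_pause} OOM/yr -> timelines stretch "
          f"x{paused_years / baseline_years:.1f}")
# If algorithms are unaffected, timelines stretch ~2.5x here; if large
# experiments are needed to sustain algorithmic progress, the factor grows fast.
```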
I agree! I’d be quite interested in looking at TAS data, for the reason you mentioned.
I think Tetlock and co. might have already done some related work?
Question decomposition is part of the superforecasting commandments, though I can’t recall off the top of my head if they were RCT’d individually or just as a whole.
ETA: This is the relevant paper (h/t Misha Yagudin). It was not about the 10 commandments. Apparently those haven’t been RCT’d at all?
We actually wrote a more up-to-date paper here
I cowrote a detailed response here
https://www.cser.ac.uk/news/response-superintelligence-contained/
Essentially, this type of reasoning proves too much, since it implies we cannot show any properties whatsoever of any program, which is clearly false.
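As a concrete counterexample to “we cannot show any properties of any program”, here is a minimal Lean 4 sketch (recent toolchain, no external libraries; the program and property are just illustrative) with a machine-checked proof of a property of a specific program:

```lean
-- A tiny, illustrative program.
def double (n : Nat) : Nat := n + n

-- A machine-checked proof that `double` always returns an even number.
theorem double_is_even (n : Nat) : ∃ k, double n = 2 * k :=
  ⟨n, by unfold double; omega⟩
```

Undecidability results like Rice’s theorem only rule out a general decision procedure for non-trivial semantic properties of arbitrary programs; they do not prevent proving specific properties of specific programs.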
Here is some data via Matthew Barnett and Jess Riedl:
The cumulative number of miles driven by Cruise’s autonomous cars is growing exponentially, at roughly 1 OOM per year.
https://twitter.com/MatthewJBar/status/1690102362394992640
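For intuition, a quick conversion of that growth rate (taking the ~1 OOM/year figure at face value):

```python
import math

# Sketch: what ~1 OOM/year of growth in cumulative miles implies.
ooms_per_year = 1.0                  # growth rate from the tweet above
growth_factor = 10 ** ooms_per_year  # ~10x per year
doubling_days = 365 * math.log10(2) / ooms_per_year
print(f"~{growth_factor:.0f}x per year, doubling roughly every {doubling_days:.0f} days")
# -> ~10x per year, doubling roughly every 110 days
```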
That is, to a very basic approximation, correct.
Davidson’s takeoff model illustrates this point: for some parameter settings a “software singularity” happens, because software progress is not constrained by capital inputs to the same degree.
I would point out, however, that our current understanding of how software progress happens is somewhat poor. Experimentation is definitely a big component of software progress, and its role is often understated on LW.
More research on this soon!
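In the meantime, here is a toy sketch (my own illustration, not Davidson’s actual model) of the qualitative mechanism: if software improvements feed back into the rate of further software progress with strong enough returns, the level blows up in finite time; with weaker returns it does not.

```python
# Toy illustration (not Davidson's model): software level s improves at a
# rate proportional to s**r, where r > 1 means strongly increasing returns.

def years_until_blowup(r, dt=0.01, t_max=20.0, blowup=1e12):
    s, t = 1.0, 0.0
    while t < t_max:
        s += dt * s ** r   # simple Euler step of ds/dt = s**r
        t += dt
        if s > blowup:
            return t
    return None

for r in (0.5, 1.0, 1.5):
    t = years_until_blowup(r)
    if t is None:
        print(f"r={r}: no blow-up by t=20")
    else:
        print(f"r={r}: blows up around t={t:.1f}")
# Only the increasing-returns case (r > 1) has a finite-time singularity;
# the others grow exponentially or slower. Capital constraints would damp this.
```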
algorithmic progress is currently outpacing compute growth by quite a bit
This is not right, at least in computer vision. They seem to be the same order of magnitude.
Physical compute has grown at 0.6 OOM/year and physical compute requirements have decreased at 0.1 to 1.0 OOM/year; see a summary here or an in-depth investigation here.
Another relevant quote
Algorithmic progress explains roughly 45% of performance improvements in image classification, and most of this occurs through improving compute-efficiency.
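Putting the two figures above together (taking the midpoint of the 0.1 to 1.0 OOM/year range as a rough point estimate):

```python
# Rough decomposition using the figures quoted above; the algorithmic range
# is taken at its midpoint, so treat this as a ballpark only.
hardware = 0.6                 # OOM/year of physical compute
algorithmic = (0.1 + 1.0) / 2  # OOM/year of algorithmic efficiency

total = hardware + algorithmic
print(f"effective compute: ~{total:.2f} OOM/year, "
      f"algorithms ~{algorithmic / total:.0%} of it")
# -> ~1.15 OOM/year, with algorithms contributing roughly half,
#    consistent with the ~45% figure quoted above.
```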
That superscript is not a transpose! It is the timestep t: we are raising to the t-th power.
Power laws in Speedrunning and Machine Learning
Thanks!
Our current best guess is that this includes costs other than the amortized compute of the final training run.
If no extra information surfaces we will add a note clarifying this and/or adjust our estimate.
Thanks Neel!
The difference between tf16 and FP32 comes to a ~15x factor IIRC. Though ML developers also seem to prioritise characteristics other than cost-effectiveness when choosing GPUs, such as raw performance and interconnect, so you can’t just multiply the top price-performance we showcase by this factor and expect it to match the cost-performance of the largest ML runs today.
More soon-ish.
Because there is more data available for FP32, it’s easier to study trends there.
We should release a piece soon about how the picture changes when you account for different number formats, and for the fact that most runs happen on hardware that is not the most cost-efficient.
The ability to pay liabilities is important to factor in, and this illustrates it well. For the largest prosaic catastrophes it might well be the dominant consideration.
For smaller risks, I suspect that in practice mitigation, transaction, and prosecution costs are what dominate the calculus of who should bear the liability, both in AI and more generally.