moridinamael
Charisma Skills Workshop | Guild of the Rose
Nuclear War, Map and Territory, Values | Guild of the Rose Newsletter, May 2024
Just in case people aren’t aware of this, drilling wells the “old fashioned way” is a very advanced technology. Typically a mechanically complex diamond-tipped tungsten carbide drill bit grinds its way down, while a fluid with precisely calibrated density and reactivity is circulated down the center of the drill string and back up the annulus between the drill string and the walls of the hole, sweeping the drill cuttings up the borehole to the surface. A well 4 miles long and 8 inches wide has a volume of over 200,000 L, meaning that’s the volume of rock that has to be mechanically removed from the hole during drilling. So that’s the volume of rock you would have to “blow” out of the hole with compressed air. You can see why using a circulating liquid with a reasonably high viscosity is more efficient for this purpose.
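For anyone who wants to check that arithmetic, here is a quick back-of-the-envelope sketch (the 4 mile and 8 inch figures are the ones quoted above):

```python
# Back-of-the-envelope check of the borehole volume quoted above.
import math

depth_m = 4 * 1609.34       # 4 miles in metres
diameter_m = 8 * 0.0254     # 8 inches in metres

volume_m3 = math.pi * (diameter_m / 2) ** 2 * depth_m
print(f"{volume_m3 * 1000:,.0f} litres")  # ~209,000 L, i.e. "over 200,000 L"
```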
The other important thing about drilling fluid is that its density is calibrated so that it pushes statically against the walls of the hole as it is being drilled, preventing the hole from collapsing inward and preventing existing subsurface fluids from gushing into the wellbore. If you tried to drill a hole with no drilling fluid, it would probably collapse, and if it didn’t collapse, it would fill with high pressure groundwater and/or oil and/or explosive natural gas, which would possibly gush straight to the surface and literally blow up your surface facilities. These are all things that would almost inevitably happen if you tried to drill a hole using microwaves and compressed air.
tl;dr, drilling with microwaves might make sense if you’re in space drilling into an asteroid, but it makes no sense for this application.
Talking to Golden Gate Claude reminds me of my relationship with my sense of self. My awareness of being Me is constantly hovering and injecting itself into every context. Is this what “self is an illusion” really means? I just need to unclamp my sense of self from its maximum value?
I think it is also good to consider that it’s the good-but-not-great hardware that has the best price-performance at any given point in time. The newest and best chips will always carry a price premium. The chips from one generation ago will be comparatively much cheaper per unit of performance. This has been generally true since I started recording this kind of information.
As I think I mentioned in another comment, I didn’t mention Moore’s law at all because it has relatively little to do with the price-performance trend. It certainly is easy to end up with a superexponential trend when you have an (economic) exponential trend inside a (technological) exponential trend, but as other commenters point out, the economic term itself is probably superexponential, meaning we shouldn’t be surprised to see price-performance improve faster than exponentially even without exponential progress in chip speed.
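To illustrate the compounding with a toy example (all rates here are invented for illustration, not fitted to any data): if chip performance grows exponentially while the rate of price decline itself accelerates, the ratio picks up a quadratic term in the exponent, which is exactly a superexponential curve.

```python
# Toy illustration: price-performance (FLOPS per dollar) compounds the
# performance trend and the price trend. All rates are invented for
# illustration, not fitted to real data.
import math

def flops_per_dollar(t, perf_rate=0.3, price_rate=0.2, accel=0.02):
    perf = math.exp(perf_rate * t)                    # performance: plain exponential
    price = math.exp(-(price_rate + accel * t) * t)   # price: decline rate accelerates
    return perf / price

for t in (0, 5, 10, 15):
    print(f"t={t:2d}  relative FLOPS/$ = {flops_per_dollar(t):,.0f}")
# The exponent is (perf_rate + price_rate) * t + accel * t**2, so the ratio
# grows faster than any single exponential.
```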
Disaster Preparedness
How to Be Fun at Parties
One way of viewing planning is as an outer-loop on decision theory.
My approach to the general problem of planning skills was to start with decision theory and build up. In my Guild of the Rose Decision Theory courses, this meant spending time slowly building the most fundamental skills of decision theory. This included practicing the manipulation of probabilities and utilities via decision trees, and practicing all of these steps in a variety of both real and synthetic scenarios, to build an intuition for the nuances of how to set up decision problems on paper. The ultimate goal was to get practitioners to the point where they usually don’t need to draw up a decision tree on paper, but can instead leverage those intuitions to quickly solve decision problems mentally, and/or recognize when a decision problem is actually tricky enough to merit breaking out the spreadsheet or Guesstimate project.
In my experience, even long-time rationalists are so incredibly bad at basic decision theory that trying to skip the step of learning to correctly set up a basic decision tree might actually be counterproductive. So my inclination is to focus on really mastering this art before attempting planning.
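As a concrete example of the kind of basic decision tree exercise I mean, here is a minimal sketch; the options, probabilities, and utilities are invented purely for illustration.

```python
# Minimal decision tree: each option is a list of (probability, utility) outcomes.
# The numbers are made up purely for illustration.
options = {
    "take the freeway":    [(0.9, 60), (0.1, 20)],
    "take the back roads": [(0.5, 70), (0.5, 40)],
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

for name, outcomes in options.items():
    print(f"{name}: EU = {expected_utility(outcomes):.1f}")

best = max(options, key=lambda name: expected_utility(options[name]))
print("choose:", best)
```

The paper-and-pencil (or spreadsheet) version is the same exercise: enumerate the options, attach probabilities and utilities to the outcomes, and compare expected utilities.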
Another way of viewing planning is that planning is search.
For computationally bounded agents like us, search involves a natural tradeoff of breadth versus depth. Breadth is essentially idea generation; depth is idea selection and refinement. The tricky thing about planning, in general, is that if 100x solutions exist, then those solutions are going to be found by spending the majority of the time on breadth-search, i.e. blue sky brainstorming for ways that the plan could look wildly different from the default approach; but most situations don’t admit 100x plans. Most things in life, especially in our technological civilization, are already sort of optimized, because there is some existing refined solution that has already accommodated the relevant tradeoffs. I could get to work faster if I flew there in a helicopter, but factoring in costs, the Pareto optimum is still driving my car on the freeway. Most things look like this. Well-considered Pareto solutions to real-world problems tend to look boring!
Therefore, if you spend a lot of time looking for 100x solutions, you will waste a lot of time, because these solutions usually won’t exist. Then, after failing to find a truly galaxy-brained solution, you will spend some amount of time refining the probably-already-obvious plan, realize that there are a lot of unknown-unknowns, and that the best way to get clarity on these is to just start working. Then you will realize that you would have been better off if you had just started working immediately and not bothered with “planning” at all, and you will either be Enlightened or depressed.
It gives me no pleasure to say this! Ten years ago I was all fired up on the idea that rationalists would Win and take over the world by finding these clever HPJEV-esque lateral-thinking solutions. I have since realized that one creative rationalist is usually no match for tens of thousands of smart people exploring the manifold through natural breadth-first search and then organically refining the best solutions.
I am not actually completely blackpilled on the idea of scenario planning. Clearly there are situations for which scenario planning is appropriate. Massive capital allocations and long-term research programs might be two good examples. Even for these types of problems, it’s worth remembering that the manifold probably only admits marginal optimizations, not 100x optimizations, so you shouldn’t spend too much time looking for the latter.
Update: Orienting Ourselves in 2024 | Guild of the Rose
Well, there’s your problem!
| Hardware | Precision | TFLOPS | Price ($) | TFLOPS/$ |
|---|---|---|---|---|
| Nvidia GeForce RTX 4090 | FP8 | 82.58 | $1,600 | 0.05161 |
| AMD RX 7600 | FP8 | 21.5 | $270 | 0.07963 |
| TPU v5e | INT8 | 393 | $4730* | 0.08309 |
| H100 | FP16 | 1979 | $30,603 | 0.06467 |
| H100 | FP8 | 3958 | $30,603 | 0.12933 |

\* Estimated, sources suggest $3000-6000

From my notes. Your statement about the RTX 4090 leading the pack in FLOPS per dollar does not seem correct based on these sources; perhaps you have a better source for your numbers than I do.
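In case anyone wants to recompute the last column themselves, here is the trivial calculation (prices as in the table above; the TPU v5e price is my own rough estimate rather than a list price):

```python
# Recompute TFLOPS per dollar from the table above and sort by it.
# Prices are the ones listed there; the TPU v5e price is a rough estimate.
specs = [
    ("Nvidia GeForce RTX 4090 (FP8)", 82.58,  1600),
    ("AMD RX 7600 (FP8)",             21.5,    270),
    ("TPU v5e (INT8)",                393,    4730),
    ("H100 (FP16)",                   1979,  30603),
    ("H100 (FP8)",                    3958,  30603),
]

for name, tflops, price in sorted(specs, key=lambda s: s[1] / s[2], reverse=True):
    print(f"{name:32s} {tflops / price:.5f} TFLOPS/$")
```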
I did not realize that H100 had >3.9 PFLOPS at 8-bit precision until you prompted me to look, so I appreciate that nudge. That does put the H100 above the TPU v5e in terms of FLOPS/$. Prior to that addition, you can see why I said TPU v5e was taking the lead. Note that the sticker price for TPU v5e is estimated, partly from a variety of sources, partly from my own estimate calculated from the lock-in hourly usage rates.
Note that FP8 and INT8 are both 8-bit computations and are in a certain sense comparable if not necessarily equivalent.
Could you lay that out for me, a little bit more politely? I’m curious.
Does Roodman’s model concern price-performance or raw performance improvement? I can’t find the reference and figured you might know. In either case, price-performance only depends on Moore’s law-like considerations in the numerator, while the denominator (price) is a function of economics, which is going to change very rapidly as returns to capital spent on chips used for AI begin to grow.
As I remarked in other comments on this post, this is a plot of price-performance. The denominator is price, which can become cheap very fast. Potentially, as the demand for AI inference ramps up over the coming decade, the price of chips could fall fast enough to drive this curve even without chip speed growing nearly as fast. It is primarily an economic argument, not a purely technological argument.
For the purposes of forecasting, and understanding what the coming decade will look like, I think we care more about price-performance than raw chip speed. This is particularly true in a regime where both training and inference of large models benefit from massive parallelism. This means you can scale by buying new chips, and from a business or consumer perspective you benefit if those chips get cheaper and/or if they get faster at the same price.
Thanks, I’ll keep that in mind!
A couple of things:
TPUs are already effectively leaping above the GPU trend in price-performance. It is difficult to find an exact cost for a TPU because they are not sold retail, but my own low-confidence estimates for the price of a TPU v5e place its price-performance significantly above that of the GPU given in the plot. I would expect the front runner in price-performance to cease to be what we think of as GPUs, and thus the intrinsic architectural limitations of GPUs to cease to be the critical bottleneck.
Expecting price-performance to improve doesn’t mean we necessarily expect hardware to improve, just that we become more efficient at making hardware. Economies of scale and refinements in manufacturing technology can dramatically improve price-performance by reducing manufacturing costs, without any improvement in the underlying hardware. Of course, in reality we expect both the hardware to become faster and the price of manufacturing it to fall. This is even more true as the sheer quantity of money being poured into compute manufacturing goes parabolic.
The graph was showing up fine before, but seems to be missing now. Perhaps it will come back. The equation is simply an eyeballed curve fit to Kurzweil’s own curve. I tried pretty hard to convey that the 1000x number is approximate:
> Using the super-exponential extrapolation projects something closer to 1000x improvement in price-performance. Take these numbers as rough, since the extrapolations depend very much on the minutiae of how you do your curve fit. Regardless of the details, it is a difference of orders of magnitude.

The justification for putting the 1000x number in the post instead of precisely calculating a number from the curve fit is that the actual trend is pretty wobbly over the years, and my aim here is not to pretend at precision. If you just look at the plot, it looks like we should expect “about 3 orders of magnitude” which really is the limit of the precision level that I would be comfortable with stating. I would guess not lower than two orders of magnitude. Certainly not as low as one order of magnitude, as would be implied by the exponential extrapolation, and would require that we don’t have any breakthroughs or new paradigms at all.
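To make the order-of-magnitude gap concrete, here is a toy comparison of the two extrapolations; the growth parameters are stand-ins chosen only to mimic the shape of the curves, not the actual fitted values.

```python
# Toy comparison of exponential vs. super-exponential extrapolation over a decade.
# The rates are illustrative stand-ins, not the fitted values from the post.
import math

years = 10
rate = 0.23           # plain exponential: e^(0.23*10) ~ 10x per decade

def super_exponential(t, a=0.23, b=0.092):
    # growth rate itself increasing linearly with time
    return math.exp(a * t + b * t**2 / 2)

print(f"exponential extrapolation:       ~{math.exp(rate * years):,.0f}x")
print(f"super-exponential extrapolation: ~{super_exponential(years):,.0f}x")
# Roughly 10x versus roughly 1000x: a difference of orders of magnitude,
# which is the only part of the claim I am confident in.
```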
Super-Exponential versus Exponential Growth in Compute Price-Performance
GPT4 confirms for me that the Meissner effect does not require flux pinning: “Yes, indeed, you’re correct. Flux pinning, also known as quantum locking or quantum levitation, is a slightly different phenomenon from the pure Meissner effect and can play a crucial role in the interaction between a magnet and a superconductor.
In the Meissner effect, a superconductor will expel all magnetic fields, creating a repulsive effect. However, in type-II superconductors, there are exceptions where some magnetic flux can penetrate the material in the form of tiny magnetic vortices. These vortices can become “pinned” in place due to imperfections in the superconductor’s structure.
This flux pinning is the basis of quantum locking, where the superconductor is ‘locked’ in space relative to the magnetic field. This can create the illusion of levitation in any orientation, depending on how the flux was pinned. For instance, a superconductor could be pinned in place above a magnet, below a magnet, or at an angle.
So, yes, it is indeed important to consider flux pinning when discussing the behavior of superconductors in a magnetic field. Thanks for pointing out this nuance!”
I think Sabine is just not used to seeing small pieces of superconductor floating over large magnets. Every Meissner effect video that I can find shows the reverse: small magnets floating on top of pieces of cooled superconductor. This makes sense because it is hard to cool something that is floating in the air.
This relates to my favorite question of economics: are graduate students poor or rich? This post suggests an answer I hadn’t thought of before: it depends on the attitudes of the graduate advisor, and almost nothing else.