I am confident that LLMs significantly boost software development productivity (I would say 20-50%) and am completely sure it’s not even close to 5x.
However, although I agree with your conclusion, I would like to point out that the timeframes are pretty short. 2 years ago (~exactly the GPT-4 launch date) LLMs were barely making any impact. I think tools started to resemble the current state around 1 year ago (~exactly the Claude 3 Opus launch date).
Now, suppose we had a 5x boost for a year. Would it be very visible? We would have gotten 5 years of progress in 1 year, but did the software landscape change that much over 5 years in the pre-LLM era? Comparing 2017 and 2022, I don’t feel like that much changed.
The tech stack has shifted almost entirely to whatever there was the most data on; Python and JavaScript/TypeScript are in, almost everything else is out.
I think AI agents will actually come to prefer strongly typed languages because they provide more feedback. I work with TypeScript, Python and Rust; while a year ago the first two were clearly winning in terms of AI productivity boost, nowadays I find Cursor Agent making fewer mistakes with Rust.
I think you might find this paper relevant/interesting: https://aidantr.github.io/files/AI_innovation.pdf
TL;DR: Research on LLM productivity impacts in materials discovery.
Main takeaways:
- Significant productivity improvement overall
- Mostly at the idea generation phase
- Top performers benefit much more (because they can evaluate the AI’s ideas well)
- Mild decrease in job satisfaction (AI automates the most interesting parts; the impact is partly counterbalanced by improved productivity)
I would like to note that this dataset is not as hard as it might look. Humans performed poorly mostly because of a strict time limit; I don’t remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.
Nevertheless, it’s very impressive, and AIMO results are even more impressive in my opinion.
Thanks, I think I understand your concern well now.
I am generally positive about the potential of prediction markets if we somehow resolve the legal problems (which seems unrealistic in the short term but realistic in the medium term).
Here is my perspective on “why should a normie who is somewhat risk-averse, doesn’t enjoy wagering for its own sake, and doesn’t care about the information externalities, engage with prediction markets”
First, let me try to tackle the question at face value:
“A normie” can describe a large social group, but it’s too general to describe a single person. You can be a normie, but maybe you work at a Toyota dealership. Maybe you accidentally overheard the head of your department on the phone saying that recently there were major problems with hydrogen cars which are likely to delay deployment by a few years. If there is a prediction market for hydrogen cars, you can bet and win (or at least you can think that you will win). It’s relatively common among normies to think along the lines of “I bought a Toyota car and it’s amazing, I will buy Toyota stock and it will make me rich”. Of course, such thinking is usually invalid; Toyota’s quality is probably already priced in, so it’s a coin toss whether it will outperform the broader market or not. Overall, it’s probably not a bad idea to buy Toyota stock, but some people do it not because it’s an okay idea but because they think it’s an amazing idea. I expect the same dynamics to play out in prediction markets.
Even if you don’t enjoy “wagering for its own sake”, prediction markets can be more than mere wagering. Although it’s a bit similar in spirit, gamification is applicable to prediction markets; for example, Manifold is doing it pretty successfully (from my perspective as an active user, it’s quite addictive), although it hasn’t led to substantial user growth yet. Even the wagering itself can be different—you can bet “all on black” because you desperately need money and it’s your only chance, you can be drawn by the dopamine-driven experience of the slots, you can believe in your team and bet as a kind of confirmation of your belief, you can make a bet to make watching the game more interesting. There are many aspects of gambling which have a wide appeal, and many of them are applicable to prediction markets.
Second, I am not sure it has to be a thing for the masses. In general, normies usually don’t have much valuable information, so why would we want them to participate? Of course, it will attract professionals who will correct mispricings and make money but ordinary people losing money is a negative externality which can even outweigh the positive ones.
I consider myself at least a semi-professional market participant. I have been betting on Manifold and using Metaculus a lot for a few years. I used Polymarket before but don’t anymore and resort to funny money platforms even though they have problems (and of course can’t make me money).
Why I am not using Polymarket anymore:
As with any real market, it’s far from trivial to make money on Polymarket. Despite that fact, I do (perhaps incorrectly) believe that my bets would be +EV. However, I don’t believe that I can be much better than random, so I don’t find it more profitable than investing in something else. However, if I could bet with “my favourite asset” it would become profitable for me (at least in my eyes, which is all that matters) and I would use it.
There are not enough interesting markets, mostly politics or sports, which is mostly caused by the legal situation. Even Polymarket, a grey-area crypto-based market, is very limited by that. PredictIt is even worse. Even if I am wrong here and that’s not the reason, there would definitely be more platforms experimenting more if it were legal in the U.S.
The user experience is (or at least was) not great. Again, I believe it’s mostly caused by legal problems, it’s hard to raise money to improve your product if it’s not legal.
I do agree with your point, definitely “internalize the positive information externalities generated by them” is something which prediction markets should aspire to, an important (and interesting!) problem.
However, I don’t believe it’s essential for “making prediction markets sustainably large”, unless we have a very different understanding of “sustainably large”. I am confident that it would be possible to capture 1% of the global gambling market, which would mean billions in revenue and a lot of utility. It even seems a modest goal, given that it’s a serious instrument. But unfortunately, prediction markets are “basically regulated out of existence” :(
Sidenote on funny money market problems:
Metaculus’s problem is that it’s not a market at all. Perhaps it’s a correct decision but makes it boring, less competitive and less accurate (there are many caveats here, probably making Metaculus a market right now would make it less accurate, but from the highest-level perspective markets are a better mechanism).
Manifold’s problem is that serious markets draw serious people and unserious markets draw unserious people. As a result, serious markets are significantly more accurately priced, which disincentivises competitive users from participating in them. That kind of defeats the whole point. And also, perhaps even more importantly, users are not engaged enough (because they don’t have money at stake), so winning at Manifold is mostly information arbitrage, which is tedious and unfulfilling.
Good to know :)
I do agree that subsidies run into a tragedy-of-the-commons scenario. So although subsidies are beneficial, they are not sufficient.
But do you find my solution to be satisfactory?
I thought about it a lot, I even seriously considered launching my own prediction market and wrote some code for it. I strongly believe that simply allowing the usage of other assets solves most of the practical problems, so I would be happy to hear any concerns or further clarify my point.
Or another, perhaps easier solution (I updated my original answer): just allow the market company/protocol to invest the money which is “locked” until resolution in some profit-generating strategy and share the profit with users. Of course, it should be diversified, both in terms of the investment portfolio and across individual markets (users get the same annual rate of return, no matter what particular thing they bet on). It has some advantages and disadvantages, but I think it’s a more clear-cut solution.
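To make that yield-sharing idea concrete, here is a minimal Python sketch under purely illustrative assumptions (the function name, the 4% pooled return, and the single uniform rate are all hypothetical, not any existing platform’s mechanism):

```python
def settle_with_shared_yield(locked_by_user: dict[str, float],
                             pool_annual_return: float,
                             years_locked: float) -> dict[str, float]:
    """Pay each user a yield proportional to how much they had locked.

    Everyone gets the same annual rate regardless of which markets they bet on,
    so the yield does not depend on any single market's topic or outcome.
    """
    rate = (1 + pool_annual_return) ** years_locked - 1
    return {user: stake * rate for user, stake in locked_by_user.items()}

# Illustrative numbers only: $1,000 and $4,000 locked for one year, pool earns 4%.
print(settle_with_shared_yield({"alice": 1000.0, "bob": 4000.0}, 0.04, 1.0))
# ≈ {'alice': 40.0, 'bob': 160.0}
```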
Isn’t this just changing the denominator without changing the zero- or negative-sum nature?
I feel like you are mixing two problems here: an ethical problem and a practical problem. UPD: on second thought, maybe you just meant the second problem, but still I think my response would be clearer by considering them separately.
The ethical problem is that it looks like prediction markets do not generate income, thus they are not useful, shouldn’t be endorsed, and don’t differ much from gambling.
While it’s true that they don’t generate income and are zero-sum games in a strictly monetary sense, they do generate positive externalities. For example, there could be a prediction market about an increase of <insert a metric here> after implementing some policy. The market will allow us to evaluate the policy more efficiently and make better decisions. Therefore, the market will be positive-sum because of the “better judgement” externality.
The practical problem is that the zero-sum monetary nature of prediction markets disincentivises participation (especially in year+ long markets) because on average it’s more profitable to invest in something else (e.g. the S&P 500). It can be solved by allowing bets in other assets: people would bet their S&P 500 shares and on average get the same expected value as holding them, so the disincentive disappears.
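A toy worked example of that point, with numbers that are assumptions for illustration only (a 1-year market at fair odds, a 50% subjective win probability, an assumed 7% index return):

```python
stake = 1000.0          # dollars (or dollar-equivalent of shares) committed to the bet
index_return = 0.07     # assumed 1-year S&P 500 return
p_win = 0.5             # bettor's subjective win probability at fair (even) odds

# Betting USD: the stake sits idle, so the expected value is just the stake back.
ev_bet_usd = p_win * 2 * stake                           # 1000.0 -> 0% expected return

# Betting index shares: the collateral keeps earning the index return either way,
# so the expected value matches simply holding the index outside the market.
ev_bet_shares = p_win * 2 * stake * (1 + index_return)   # 1070.0
ev_just_hold = stake * (1 + index_return)                # 1070.0

print(ev_bet_usd, ev_bet_shares, ev_just_hold)
```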
Also, there are many cases where positive externalities can be beneficial for some particular entity. For example, an investment company may want to know about the risk of a war in a particular country to decide if they want to invest in the country or not. In such cases, the company can provide rewards for market participants and make it a positive-sum game for them even from the monetary perspective.
This approach is beneficial and used in practice, however, it is not always applicable and also can be combined with other approaches.
Additionally, I would like to note that there is no difference between ETH and “giving a loan to a business” from a mechanism design perspective: you could tokenize your loan (and it’s not crypto-specific, you could use traditional finance as well, I am just not sure what the traditional-finance term would be) and use the tokenized loan to bet on the prediction market.
but once all the markets resolve, the total wealth would still be $1M, right
Yes, the total amount will still be the same. However, your money will not be locked for the duration of the market, so you will be able to use it to do something else, be it buying a nice home or giving a loan to a real company.
Of course, not all of your money will be unlocked, and probably not immediately, but that doesn’t change much. Even if only 1% is unlocked, and only under certain conditions, it’s still an improvement.
Also, I encourage you to look at it from another perspective:
What problem do we have? Users don’t want to use prediction markets.
Surely, they would be more interested if they had free loans (of course they are not going to be actually free, but they can be much cheaper than ordinary uncollateralized loans).
Meta-comment: it’s very common in finance to put money through multiple stages. Instead of just buying stock, you could buy stock, then use it as collateral to get a loan, then buy a house on this loan, rent it to somebody, sell the rent contract and use the proceeds to short the original stock to get into a delta-neutral position. Risks multiply after each stage, so it should be done carefully and responsibly. Sometimes the house of cards crumbles, but it’s not a bad strategy per se.
Why does it have to be “safe enough”? If all market participants agree to bet using the same asset, it can bear any degree of risk.
I think I should have said that a good prediction market allows users to choose what asset a particular “pair” will use. That will cause a liquidity split, which is also a problem, but it’s manageable and, in my opinion, it would be much closer to an imaginary perfect solution than “bet only USD”.
I am not sure I understand your second sentence, but my guess is that this problem will also go away if each market “pair” uses a single (but customizable) asset. If I got it wrong, could you please clarify?
In a good prediction market design users would not bet USD but instead something which appreciates over time or generates income (e.g. ETH, Gold, S&P 500 ETF, Treasury Notes, or liquid and safe USD-backed positions in some DeFi protocol).
Another approach would be to use funds held in the market to invest in something profit-generating and distribute part of the income to users. This is the same model which non-algorithmic stablecoins (USDT, USDC) use.
So it’s a problem, but definitely a solvable one, even easily solvable. The major problem is that prediction markets are basically illegal in the US (and probably some other countries as well).
Also, Manifold solves it in a different way—positions are used to receive loans, so you can free your liquidity from long (timewise) markets and use it for, e.g., leverage. The loans are automatically repaid when you sell your positions. This is easy for Manifold because it doesn’t use real money, but the same concept can be implemented in “real” markets, although it would be more challenging (there will be occasional losses for the provider due to bad debt, but it’s the same with any other kind of credit, it can be managed).
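Here is a minimal sketch of the loan-against-position idea, not Manifold’s actual implementation; the Position class, the 50% loan-to-value cap, and the numbers are assumptions for illustration:

```python
from dataclasses import dataclass

LOAN_TO_VALUE = 0.5  # assumed maximum loan-to-value ratio

@dataclass
class Position:
    market_value: float
    loan_outstanding: float = 0.0

    def borrow(self, amount: float) -> float:
        """Lend up to the LTV cap against this position; return the cash released."""
        available = self.market_value * LOAN_TO_VALUE - self.loan_outstanding
        granted = max(0.0, min(amount, available))
        self.loan_outstanding += granted
        return granted

    def close(self, sale_proceeds: float) -> float:
        """Repay the loan out of sale proceeds; any shortfall is the provider's bad debt."""
        repaid = min(sale_proceeds, self.loan_outstanding)
        self.loan_outstanding -= repaid
        return sale_proceeds - repaid  # net cash back to the user

pos = Position(market_value=1000.0)
cash_now = pos.borrow(400.0)   # frees 400 of liquidity while the market is still open
cash_later = pos.close(900.0)  # loan repaid automatically from the 900 received at close
print(cash_now, cash_later)    # 400.0 500.0
```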
Regarding 9: I believe it’s when you are successful enough that your AGI doesn’t instantly kill you, but it still can kill you in the process of using it. It’s in the context of a pivotal act, so it assumes you will operate it to do something significant and potentially dangerous.
I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it.
If I do not land a safety job, one of the obvious options is to try to get hired by an AI company and learn more there, in the hope that I will either be able to contribute to safety there or eventually move into the field as a more experienced engineer.
I am conscious of why pushing capabilities could be bad, so I will try to avoid it, but I am not sure how far that extends. I understand that being a Research Scientist at OpenAI working on GPT-5 is definitely pushing capabilities, but what about doing frontend at OpenAI, or building infrastructure at some strong but not leading (and hopefully a bit more safety-oriented) company such as Cohere? Or, let’s say, working at a hedge fund which invests in AI? Or working at a generative AI company which doesn’t build in-house models but generates profit for OpenAI? Or working as an engineer at Google on non-AI stuff?
I do not currently see myself as an independent researcher or AI safety lab founder, so I will definitely need to find a job. And nowadays too many things seem to touch AI one way or the other, so I am curious if anybody has an idea about how I could evaluate career opportunities.
Or am I taking it too far and the post simply says “Don’t do dangerous research”?
The British are, of course, determined to botch this like they are botching everything else, and busy drafting their own different insane AI regulations.
I am far from being an expert here, but I skimmed through the current preliminary UK policy and it seems significantly better compared to EU stuff. It even mentions x-risk!
Of course, I wouldn’t be surprised if it eventually turns out to be EU-level insane, but I think it’s plausible that it will be more reasonable, at least from the mainstream (not alignment-centred) point of view.
And compute, especially inference compute, is so scarce today that if we had ASI right now, it would take several decades, even with exponential growth, to build enough compute for ASIs to challenge humanity.
Uhm, what? “Slow takeoff” means ~1 year… Your opinion is very unusual, you can’t just state it without any justification.
Are you implying that it is close to GPT-4 level? If yes, that is clearly wrong. Especially with regard to code: everything (maybe except StarCoder, which was released literally yesterday) is worse than GPT-3.5, and much worse than GPT-4.
In addition to many good points already mentioned, I would like to add that I have no idea how to approach this problem.
Approaching x-risk is very hard too, but it is much clearer in comparison.
Preliminary benchmarks have shown poor results. It seems that the dataset quality is much worse compared to what LLaMA had, or maybe there is some other issue.
Yet more proof that top-notch LLMs are not just data + compute; they require some black magic.
Generally, I am not sure if it’s bad for safety in the notkilleveryoneism sense: such things prevent agent overhang and make current (non-lethal) problems more visible.
Hard to say if it’s net good or net bad; there are too many factors, and the impact of each is not clear.
I am not sure how you came to the conclusion that current models are superhuman. I can visualize complex scenes in 3D, for example. Especially under some drugs :)
And I don’t even think I have an especially good imagination.
In general, it is very hard to compare mental imagery with Stable Diffusion. For example, it is hard to imagine something with many different details in different parts of the image, but that is perhaps a matter of representation. An analogy could be that our perception is like a low-resolution display: I can easily zoom in on any area and see the details.
I wouldn’t say that current models are superhuman. Although I wouldn’t claim humans are better either, it is just very non-obvious how to compare them properly, and there are probably a lot of potential pitfalls.
So 1) has a large role here.
In 2) CNNs are not a great example (as you mentioned yourself). Vision transformers demonstrate similar performance. It seems that inductive bias is relatively easy to learn for neural networks. I would guess it’s similar for human brains too although I don’t know much about neurobiology.
3) Doesn’t seem like a good reason to me. There are modern GANs that demonstrate similar performance to diffusion models, and there are approaches which make diffusion work in a very small number of steps; even 1 step showed decent results IIRC. Also, even ImageGPT worked pretty well back in the day.
4) Similarly to the initial claim, I don’t think much can be confidently said about LLM language abilities in comparison to humans. I do not know what exactly it means and how to measure it. We can do benchmarks, yes. Do they tell us anything deep? I don’t think so. LLMs are very different kinds of intelligence, they can do many things humans can’t and vice versa.
But at the same time, I wouldn’t say that visual models strike me as much more capable given the same size/same amount of compute. They are quite stupid. They can’t count. They can’t do simple compositionality.
5) It is possible we will have much more efficient language models, but again, I don’t think they are much more inefficient than visual models.
My two main reasons for the perceived efficiency difference:
- It is super hard to compare with humans. We may do it completely wrong. I think we should aspire to avoid it unless absolutely necessary.
- “Language ability” depends much more on understanding and having a complicated world model compared to “visual ability”. We are not terribly disappointed when Stable Diffusion consistently draws three zombies when we ask for four, and we mostly forgive it for weird four-fingered hands sometimes growing from the wrong places. But when LLMs produce similar nonsense, it is much more evident and hurts performance a lot (both on benchmarks and in the real world). LLMs can imitate style well, and they have decent grammar. Larger ones like GPT-4 can even count decently well and probably do some reasoning. So the hard part (at least for our current deep learning methods) is the world model. Pattern matching is easy and not really important in the grand scheme of things. But it still looks kinda impressive when visual models do it.
It is easy to understand why such news could increase P(doom) even more for people with high P(doom) prior.
But I am curious about the following question: what if an oracle told us that P(doom) was 25% before the announcement (suppose it was not clear to the oracle what strategy Anthropic would choose; it was inherently unpredictable due to quantum effects or whatever).
Would it still increase P(doom)?
What if the oracle said P(doom) is 5%?
I am not trying to make any specific point, just interested in what people think.
I think it is not necessarily correct to say that GPT-4 is above village idiot level. Comparison to humans is a convenient and intuitive framing but it can be misleading.
For example, this post argues that GPT-4 is around Raven level. Beware that this framing is also problematic but for different reasons.
I think that you are correctly stating Eliezer’s beliefs at the time but it turned out that we created a completely different kind of intelligence, so it’s mostly irrelevant now.
In my opinion, we should aspire to avoid any comparison unless it has practical relevance (e.g. economic consequences).
I don’t think ketamine neurotoxicity is a thing. Ketamine is actually closer to being a neuroprotectant.