If you want a more accurate estimate of how often top chess engines pick the theoretical best move, you could compare Leela Chess Zero and Stockfish. These are very close to each other Elo-wise but have very different architectures and styles of play. So you could look at how often they agree on the best move, assume that each picks its move from some distribution over the true move ranking, and then use the agreement rate to estimate the parameters of that distribution.
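A minimal sketch of what that calculation might look like, under a toy model I’m making up purely for illustration: each engine independently finds the true best move with probability p, and otherwise picks uniformly among the other legal moves. The 70% agreement figure and the ~20 legal moves are placeholder numbers, not real data.

```python
# Toy model, not real engine data: each engine independently finds the true
# best move with probability p, and otherwise picks uniformly at random among
# the remaining (k - 1) legal moves. Under that model the chance the two
# engines agree on a position is p**2 + (1 - p)**2 / (k - 1), which we can
# invert to back out p from an observed agreement rate.

def agreement_rate(p: float, k: int) -> float:
    """Probability both engines pick the same move under the toy model."""
    return p ** 2 + (1 - p) ** 2 / (k - 1)

def estimate_p(observed_agreement: float, k: int = 20) -> float:
    """Binary-search on [1/k, 1], where the agreement rate is increasing in p."""
    lo, hi = 1.0 / k, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if agreement_rate(mid, k) < observed_agreement:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Placeholder numbers: 70% agreement, ~20 legal moves per position.
print(estimate_p(0.70, k=20))  # roughly 0.84 under these assumptions
```

With real agreement data you’d also want to handle positions where several moves are essentially equally good, which this toy model ignores.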
Stockfish is incredibly strong at exploiting small mistakes. I’m going to assume that, on average, if you make anything other than one of the top 5 moves at any point in a game, Stockfish will win, no matter what you do afterwards.
An average game is about 40 turns, and there are about 20 valid moves each turn.
So that puts an upper limit on the chance of success of 1 in 4^40.
Similarly, only if you pick the best move at all times are you guaranteed to win, putting a lower limit at 1 in 20^40.
Making some assumptions about how many best moves you need to counteract a poor but not fatal move, you could try to estimate something more accurate in this range.
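A minimal sketch of the two bounds as stated, using the rough figures above (40 turns, ~20 legal moves per turn, anything outside the top 5 assumed fatal):

```python
# The two bounds from the comment above.

TURNS = 40
LEGAL_MOVES = 20
SAFE_MOVES = 5  # top-5 moves assumed non-fatal

upper = (SAFE_MOVES / LEGAL_MOVES) ** TURNS  # every move merely non-fatal: (1/4)^40
lower = (1 / LEGAL_MOVES) ** TURNS           # every move must be the single best: (1/20)^40

print(f"upper bound: 1 in {1 / upper:.3g}")  # ~1 in 1.2e24
print(f"lower bound: 1 in {1 / lower:.3g}")  # ~1 in 1.1e52
```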
Ok, I think that makes a lot of sense. Newton’s 2nd law is the first step of constructing a model which is (ideally) isomorphic to reality once you’ve filled in all the details.
But you could equally well start off constructing your model with a different first step, and if you do, it might be that some nice neat packaged concepts in model A do not map cleanly onto anything in model B. The fundamental concepts in physics are fundamental to the model, not to reality.
I agree it’s tautologically true, but I’m saying that we only use it because it maps nicely to reality. When it doesn’t map cleanly to reality we replace it with something else (special relativity in your example) instead of continuously adding epicycles.
There’s an infinite number of laws I could generate that would be equally tautologically true (e.g. F = mv(dv/dt)), but we don’t use them because they require more epicycles to work correctly.
I think Newton’s second law would be discarded if we consistently saw the following:
1. There was no relation between how hard I push things and how fast they move.
2. Pushing a faster-moving object as hard as I push a slower-moving object, for the same amount of time, speeds it up less than the slower object.
3. When you take two objects, each of which moves at the same speed after 5 seconds of pushing as hard as you can, and stick them together, the resulting object moves 4 times as fast after 5 seconds.
The first would get rid of the law completely, the second would make you refine your concept of acceleration, and the third your concept of how acceleration relates to mass.
Now, you’re right that you could always add epicycles to fix this, but the correct response would be to discard the theory outright.
I think most of the examples of superior performance by LLMs haven’t yet reached the stage where they make a difference in the real world.
AI models are better than physicians at diagnosing patients
A customer service chatbot built on GPT-3 increases resolutions per hour by 15%
GitHub Copilot increases programmer productivity by 55%
GPT-4 aced Bryan Caplan’s economics midterm
The diagnosing-patients bit is under test conditions. In real-world conditions physicians perform better, because they can see and interact with the patient.
The customer service bit is mostly because a large percentage of requests are about stuff already on the website. The flip side is that I’m sure rates of frustration when you actually need a customer service representative went through the roof, especially if this was GPT-3.
I roll to disbelieve the GitHub Copilot claims. I’m a software developer, and I would have noticed if devs were getting 50% more done. This is clearly measuring productivity in a very narrow way.
Midterms are useful in that they separate poor from strong economics students. Then the strong students are hired by companies and learn the actually useful stuff on the job. GPT-4 might be great at the midterms, but it’s not going to learn on the job.
To ask the obvious question: how do they verify that the videos are genuine, unique, recent, taken by this unit, and show what they claim to show?
For some reason I haven’t seen any sycophancy, even when deliberately trying to induce it. Have they fixed it already, or is it because I have memory disabled, or is it my custom prompt?
It is perfectly rational to pipe all decisions through a cheaper form of cognition that relies mostly on pattern matching, and to save your limited reserves of concentration and reasoned thought for situations that pass through this initial filter and ping your higher cognition to look into them more.
But I claim that all such priors make assumptions about the distribution of the possible number of buses.
I mean, yes, that’s the definition of a prior. How to calculate a prior is an old question in Bayesianism, with different approaches; Kolmogorov complexity is one.
Sorry, I meant to add in an example where for simplicity you saw the bus numbered 1.
Agreed it’s a terrible prior; it’s just an easy one for a worked example.
Agreed, I just wanted to clarify that the assumption that it’s twice as long seems baseless to me. The point is it’s usually shortly after.
As a worked example, if I start off assuming that the chance of there being n busses is 1/2^n (nice and simple, adds up to 1), then the posterior is 1/(n·ln(2)·2^n): multiply the prior by the likelihood (1/n), then divide by the normalising sum (which is ln(2)) so that it adds up to 1.
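A small sketch checking that worked example numerically (the 200-term cutoff is just an arbitrary truncation of the infinite sum):

```python
import math

# Prior P(N = n) = 1 / 2**n; likelihood of having seen bus #1 given n buses
# is 1/n; the normalising constant is sum_{n>=1} 1 / (n * 2**n) = ln(2),
# so the posterior is 1 / (n * ln(2) * 2**n).

def prior(n: int) -> float:
    return 1 / 2 ** n

def likelihood(n: int) -> float:
    return 1 / n  # chance the one bus you saw happened to be bus #1

# Truncating the infinite sum at 200 terms; the tail is negligible.
norm = sum(prior(n) * likelihood(n) for n in range(1, 201))
print(f"normalising constant ~ {norm:.6f} (ln 2 = {math.log(2):.6f})")

def posterior(n: int) -> float:
    return prior(n) * likelihood(n) / math.log(2)

print([round(posterior(n), 4) for n in range(1, 6)])  # heavily favours small n
```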
I’m not using this as a prior, I’m using it to update my existing prior (whatever that was). I believe the posterior will be well defined, so long as the prior was.
It would also update you towards 1600 over 2000.
Oh I see. I’m not trying to guess a specific number, I’m trying to update my distribution.
I’m sorry, I’m not sure what you mean. Under Bayesianism this is straightforward.
Note the actual doomsday argument, properly applied, predicts that humanity is most likely to end right now, with probability dropping in proportion to the total number of humans there have ever been.
To give a simple example why: if you go to a city and see a bus with the number 1546, the number of busses that maximises the chance you would have seen that bus is 1546 busses. At 3000 busses the probability you would have seen that exact bus is halved. And at 3,000,000 it’s 2000 times less likely. This gives you a Bayesian update across your original probability distribution for how many busses there are.
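A tiny sketch of that likelihood comparison, under the same assumption the example relies on (buses numbered 1..N and you see one uniformly at random):

```python
# If there are N buses numbered 1..N and you see one uniformly at random,
# the chance it was bus #1546 is 1/N (and 0 if N < 1546).

def likelihood(total_buses: int, seen: int = 1546) -> float:
    return 1 / total_buses if total_buses >= seen else 0.0

base = likelihood(1546)
for n in (1546, 3000, 3_000_000):
    print(f"N = {n}: relative likelihood {likelihood(n) / base:.3g}")
# Prints roughly 1, 0.5, and 0.0005 (about 2000x less likely), matching the numbers above.
```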
Why isn’t the fact that software developers spend 3 years not learning all that much (far less than they would in 6 months on the job) a problem?
I’m not sure this follows. If I have aims I want to achieve, I may resist permanent shutdown, even if I do not mind dying, because shutdown limits my ability to achieve my aims.