mishka

Karma: 1,408

mishka Jul 21, 2025, 1:28 AM
2 points
0
in reply to: gjm’s comment on: OpenAI Claims IMO Gold Medal
Yeah, I actually looked at the early years today, and in 1969 only the three perfect scores won gold, and in 1970 this was relaxed a little bit, and the overall trend looked to me like there were multiple reforms with gradual relaxation of the standards for gold (although I did not do more than superficial sampling from several time points).

I think the official goal is still approximately 6:3:2:1, but this year those fuzzy boundaries resulted in 67 gold medals out of 630 participants (slightly above 10.6%).

mishka Jul 20, 2025, 8:04 PM
4 points
2
in reply to: Nick_Tarleton’s comment on: OpenAI Claims IMO Gold Medal
First of all, we would like to see pre-registration, so that we don’t end up learning only about successes (and generally cherry-picking good results, while omitting negative results).

He is trying to steer the field towards generally better practices. I don’t think this is specifically a criticism of this particular OpenAI result, but more an attempt to change the standards.

Although he is likely to have some degree of solidarity with the IMO viewpoint and to share some of their annoyance with timing of all this, e.g. https://www.reddit.com/r/math/comments/1m3uqi0/comment/n40qbe9/

mishka Jul 20, 2025, 12:10 PM
2 points
0
in reply to: J Bostock’s comment on: OpenAI Claims IMO Gold Medal
I would not say it is solved :-)

I am sure we’ll see an easy and consistent 42 score from the models sooner rather than later, and we’ll see much more than that in the adjacent areas, but not yet :-)

(Someone who got a bronze in late 1960-s is telling me that this idea to give gold medals to 10+% of the participants is relatively recent, that when they were competing back in the 60-s there would be exactly 5 gold medals with this table of results.)

mishka Jul 20, 2025, 12:05 PM
2 points
0
in reply to: J Bostock’s comment on: OpenAI Claims IMO Gold Medal
Ah, yes, you are right. And the silver medal threshold is 28=4*7. So this is much more natural, and mostly comes from how the competition is structured (the scoring factor still looks somewhat noticeable to my eye, but is much less of a problem than I thought).

mishka Jul 20, 2025, 2:49 AM
8 points
4
in reply to: J Bostock’s comment on: OpenAI Claims IMO Gold Medal
I am sure he is not in denial. He knows that the AI systems are on the trajectory to the top and beyond.

But, on one hand, he is saying that proper methodology is important and expects it to be in place for the next year competition: https://mathstodon.xyz/@tao/114877789298562646.

On the other hand, it’s great progress, but let’s not be hypnotized by the word “gold”. The model made it to the bottom border of the “gold medal” tier, the largest yellow bar on the histogram here https://www.imo-official.org/year_individual_r.aspx?year=2025.

It’s top 11% of the participants, so it’s great progress, but not some “exclusive and exceptional win”.

Also the shape of that histogram strongly suggests that the IMO scoring process is weird and probably adversarial (given that team leads advocate for their participants during the scoring). The fact that we see this huge peak on the histogram at 35 and that we also see local maxima at the bottoms of the other two medal tiers is suggestive of a process which is not plain “impartial and blind grading” (perhaps that part of the IMO methodology could also use some improvement).

mishka Jul 13, 2025, 7:18 PM
5 points
2
on: 10x more training compute = 5x greater task length (kind of)
Thanks, a very useful analysis.

2.5 months is enough to provide a very strong AI research acceleration, so the trend is likely to be changing towards larger share of algorithmic efficiency in the future (e.g. if we freeze the compute, the progress will not stop even now, although it would slow down noticeably).

mishka Jul 10, 2025, 1:17 PM
3 points
0
in reply to: Chapin Lenthall-Cleary’s comment on: A Medium Scenario
The most likely one is, I think, via research breakthroughs in fundamental physics. For example, I would expect that true understanding of quantum gravity would open a way to much more powerful weapons of mass destruction (on the level of temporary modifying properties of significant volumes of space-time to destroy structure those volumes contain) and also to experiments with unexpected large-scale side effects (think discussion before Trinity test about a possibility of “igniting the atmosphere”, but this time for real).

For this particular branch of existential risk scenarios I would expect and hope that smarter than human entities would be able to create social structure capable to avoid warfare of this kind (although who knows with certainty), but not stumbling into dangerous experiments by accident might depend on the unknown details of our actual physics and cosmology.

There are other possibilities. For example, I crave achieving a working solution for the “hard problem of consciousness”, a solution with experimentally observable empirical consequences, but when I ponder implications of capabilities this is likely to give I feel quite uneasy…

We have this temporary period of “relative stagnation in most fundamental parts of science”. A drastic increase in available levels of intelligence is one way to end this stagnation. We crave that, definitively, and I certainly crave the end of this stagnation, but that’s where some very fundamental sources of existential risks lie…

mishka Jul 9, 2025, 9:19 PM
2 points
0
on: A Medium Scenario
If the most likely existential risk scenario actually involves ASIs destroying themselves and also destroying the “local neighborhood” as a collateral damage, and if that scenario is almost inevitable, that would account for the Fermi Paradox.

(To counter that, one would hope that sufficiently smart entities would be able to avoid that. But the odds would depend on how vulnerable our reality actually is, how difficult it is to avoid that scenario.)

mishka Jul 2, 2025, 9:37 AM
8 points
0
on: Paradigms for computation
Thanks for writing this! I’d like to put together a few hopefully relevant remarks.

The gap between theory and practice is strongly pronounced in computer science. The theory is governed by its inner logic (what’s more tractable, what leads to more rich, more intellectually satisfying formalisms, and so on). As a result, the fit of theory to practical matters is often so-so.

For example, with run-time complexity, the richest, most developed, most taught to students theory deals with worst-case complexity, whereas in practice average complexity tends to be more relevant (e.g. simplex method, quicksort, and so on).

Computers operating in real world are at the very least “computations with an oracle” (somewhat modeled by type-2 Turing machines), with the world being the oracle. If one believes together with Penrose that the world is not computable, then the “Penrose argument” is falling apart. If one believes together with some other thinkers that the world is computable, then any valid version of the “Penrose argument” would be equally applicable to humans.

In reality, even the “computations with an oracle” paradigm is not quite adequate for the real world, since when computers write into an external world and read from it, they can activate “computational circuits” in the external world, and therefore the notion of strict boundary between the internal computations within the computer and the dynamics of the external world becomes inadequate.

The theoretical computational paradigms are not very hardware-friendly (e.g. neither Turing machines, nor lambda-calculus map to our computers all that well). At the same time, a typical computer does not have well-developed autonomous capability to expand memory in an unlimited fashion and, therefore, is not Turing-complete in a formal sense (e.g. if memory is limited, the halting problem stops being undecidable, and so on).

In particular, these days we are missing a GPU-friendly computational paradigm. It is actually possible to create a neuromorphic computational paradigm, and such paradigms go decades back, but a typical incarnation would not be GPU-friendly, so a sizeable gap between theory and practice would remain.

For example, results of Turing universality of RNNs with memory expanding in an unlimited fashion go back to at least 1987. However, originally those results were using unlimited precision reals and were abusing binary expansions of those reals as tapes of Turing machines. This was incompatible with the key property of practical neural computations, namely their relative robustness to noise. This incompatibility is yet another example of the gap between theory and practice.

More recently, this particular incompatibility has been fixed. One can take an RNN and replace neurons handling scalar flows with neurons handling vector flows, and one can consider countably-sized vectors having finite number of non-zero coordinates at any given moment of time, replacing linear combinations of scalar inputs with linear combinations of vector inputs (“generalized attention”). Then one gets a theoretically nice neuromorphic platform for stream-oriented computations which is more robust to noise. It is still not GPU-friendly in its fully generality.

These remarks above are about computational architectures. Much more can be said about the great diversity of possible methods of generating (synthesizing) computations. When our architectures are neuromorphic rather than discrete, we have greater variety of such methods potentially available (not just gradient methods, but, for example, decomposition of connectivity matrices into series, and so on).

mishka Jun 28, 2025, 11:26 PM
3 points
0
on: A Depressed Shrink Tries Shrooms
A very sober trip report.

Thanks!

mishka Jun 25, 2025, 9:27 AM
2 points
0
in reply to: TsviBT’s comment on: A regime-change power-vacuum conjecture about group belief
A talented kid from a math school, I’d say (about a typical virtual character I was interacting with during those conversations). Not bad for the time being…

mishka Jun 25, 2025, 4:21 AM
10 points
0
in reply to: TsviBT’s comment on: A regime-change power-vacuum conjecture about group belief
I don’t know. I started (my experience talking with GPT-4 and such) with asking it to analyze a 200 lines of non-standard code with comments stripped out. It correctly figured out that I was using nested dictionaries to represent vector-like objects, and that that was an implementation of a non-standard unusually flexible (but slow) neural machine.

This was obviously the case of “true understanding” (and it was quite difficult to reproduce, as the models evolved the ability to analyze this code well was lost, then eventually regained in better models; those better models eventually figured even more non-trivial things about that non-standard implementation, e.g. at some point newer models started to notice on their own that that particular neural machine was inherently self-modifying; anyway, very obvious evolution from inept pattern matching to good understanding, with some setbacks during the evolution of models, but eventually with good progress towards better and better performance).

Then I asked it to creatively modify and creatively remix some Shadertoy shaders, and it did a very good job (even more so if one considers that that model was visually blind and was unable to see the animations produced by its shaders). Nothing too difficult, but things like taking a function from one of the shaders and adding a call to this function from another shader with impressive visual effects… Again, with all the simplicity, it was more than would have occurred to me, if I were trying to do this manually...

But when I tried to manually iterate these steps to obtain “evolution of interesting shaders”, I got a rapid saturation, not an unlimited interesting evolution...

So not bad at all (I occasionally do rather creative things, but it is always an effort, so on the occasions when I am unable to successfully do this kind of effort, I start to feel that the model might be more creative than “me in my usual mode” (although, I don’t know if these models are already competitive with “me in my peak mode”)).

When they first introduced Code Interpreter, I asked it to solve a math competition problem, and it did a nice job, and then I asked it what would I do, if I want to do it only using a pen and a paper with limited precision (it was a problem with a huge answer), and it told me to take logarithms, and demonstrated how to do that with logarithms.

That immutable tree processing I mentioned was good in this sense, very elegant, taught me some Python tricks I have not known.

Then when reasoning models were first introduced I asked for a linear algebra problem (which I could not solve myself, but people doing math competitions could), and weaker models could not do it, but o1-preview could one-shot it.

(All these are conversations I am publishing on github, so if one wants to take a closer look, one can.)

Anyway, my impression is that it’s not just training data, it’s more than that. This is not doable without reasonable understanding, without good intuition.

At the same time, they can be lazy, can be sloppy and make embarrassing mistakes (which sometimes don’t prevent them from proceeding to the right result). But it’s the reliability which is a problem, not creative capability which seems to be quite robust (at least, on the “medium creativity” setting).

mishka Jun 25, 2025, 3:42 AM
4 points
2
in reply to: TsviBT’s comment on: A regime-change power-vacuum conjecture about group belief

We’re supposed to be talking about novel ideas about minds / intelligence, and then implementing them in a way that really tests the hypothesis.

No, just about how to actually make non-saturating recursive self-improvement (“intelligence explosion”).

Well, with the added constraint of not killing everyone or almost everyone, and not making almost everyone miserable either...

(Speaking of which, I noticed recently that some of people’s attempts at recursive self-improvement now take longer to saturate than before. And, in fact, they are taking long enough that people are sometimes publishing before pushing them to saturation, so we don’t even know what would happen if they were to simply continue pushing a bit harder.)

Now, implementing those things to test those ideas can actually be quite unsafe (that’s basically “mini-foom” experiments, and people are not talking enough about safety of those). So before pushing harder in this direction, it would be better to do some preliminary work to reduce risks of such experiments...

Go ahead and make an argument then.

Yes, LLMs mostly have reliability/steerability problem. I am seeing plenty of “strokes of genius” in LLMs, so the potential is there. They are not “dumb”, they have good creativity (in my experience).

We just can’t get them to reliably compose, verify, backtrack, and so on to produce the overall quality work. Their “fuzzy error correction” still works less well than human “fuzzy error correction”, at least on some of the relevant scales. So they eventually accumulate too many errors on the long horizon tasks and they don’t self-correct enough on the long horizon tasks.

This sounds to me more like a “character upbringing problem” than a purely technical problem...

That’s especially obvious when one reads reasoning traces of reasoning models...

What I see there sounds to me as if their “orientation” is still wrong (those models seem to be thinking about how to satisfy their user or their maker, and not about how to “do the right thing”, whereas a good human solving a math problem ignores the aspect of satisfying their teacher or their parent, and just tries to do “an objectively good job”, and that’s where LLMs are still falling short)...

mishka Jun 25, 2025, 3:10 AM
2 points
0
in reply to: TsviBT’s comment on: A regime-change power-vacuum conjecture about group belief

If they are novel ideas, it’s much much harder for current LLMs to rightly implement them.

That’s not my experience.

Even with the original GPT-4, asking for some rather non-standard things which I was unable to find online (e.g. immutable transformations of nested dictionaries in Python, so that I could compute JAX gradients with respect to variables located in the nodes of the trees in question) resulted in nice implementations (I had no idea how to even start doing that in Python, I only knew that trying to literally port my Clojure code doing the same thing would be a really bad idea).

Not one-shot, of course, but conversations with several turns have been quite productive in this sense.

And that’s where we have decent progress lately and expect further rapid progress soon (this year, I think, so I would expect a much better one-shot experience in this sense soon).

No I think it’s the first one.

It depends on whether one expects “the solution” to be purist or hybrid.

If the expectation is “purist”, then the barrier is high (not insurmountable, I think, but I am a bit reluctant to discuss the technical details in public).

If the expectation is that the solution will be “hybrid” (e.g. powerful LLMs can be used as components, and one can have read-write access to the internals of those LLMs), then it should be very doable (we have not even started to really “unhobble” those LLMs, reasoning models non-withstanding). Quite a bit of non-triviality still remains even under a “hybrid” approach, but the remaining gap is much lower than for “purist” attempts. It’s obvious that LLMs are more than “half-way there”, even if it might be true that “something fundamental is still missing”.

mishka Jun 25, 2025, 2:40 AM
2 points
0
in reply to: TsviBT’s comment on: A regime-change power-vacuum conjecture about group belief
I like talking to LLMs and discussing things (including various ideas some of which might be related to advanced AI or to AI existential safety). And that helps somewhat.

But if I can ask them to implement my ideas end-to-end in prototype code and to conduct computational experiments for me end-to-end, my personal velocity would grow an order of magnitude, if not more...

And if I can ask them to explore variations and recombinations of my ideas, then...

It’s not that people don’t have good AGI ideas, it’s more that the road from a promising idea to a viable implementation is difficult, and a lot of good ideas remain “hanging in the air” (that is, insufficiently tested for us to know if those promising ideas are actually good).

mishka Jun 25, 2025, 2:18 AM
2 points
0
in reply to: TsviBT’s comment on: A regime-change power-vacuum conjecture about group belief
I think OpenAI, DeepMind and other “big labs” work on many things. They release some of them.

But also many other people work on many non-standard things, e.g. Sakana AI, Liquid AI and others, and with non-standard approaches compute might be less crucial, and open reasoning models like the new R1 are only a few months behind the leading released closed source models, so those smaller orgs can have their research accelerated as well even if they decide they don’t want to use OpenAI/DeepMind/Anthropic models.

mishka Jun 25, 2025, 12:00 AM
3 points
0
on: A regime-change power-vacuum conjecture about group belief
It’s not AGI (even in a narrow sense of a human-equivalent software engineer, mathematician, and AI researcher).

However, the question is: how long till it is accelerating AI research more and more noticeably and more and more autonomously?

This mode might start way before AGI (even in the narrow sense of the word I mention above) is achieved…

mishka Jun 24, 2025, 10:26 PM
6 points
4
on: Analyzing A Critique Of The AI 2027 Timeline Forecasts
Thanks for writing this!

Of course, we are still missing METR-style evaluations measuring the ability to complete long tasks for all recent systems (Claude 4, newer versions of Gemini-2.5, and, importantly, for agentic frameworks such as OpenAI Codex, Claude Code and similar systems).

When we obtain those evaluations, we’ll have a better understand of the shape of the curve, whether the doubling period keeps shrinking, and if so, how rapidly it shrinks…

mishka Jun 21, 2025, 5:09 PM
5 points
0
on: Attribution-based parameter decomposition for linear maps
Nice!

An explanation of the APD acronym would be helpful to the readers. In this sense, including this link should be sufficient: https://www.lesswrong.com/posts/EPefYWjuHNcNH4C7E/attribution-based-parameter-decomposition

mishka Jun 20, 2025, 3:49 AM
1 point
0
in reply to: Satya Benson’s comment on: Moving Past the Question of Consciousness: A Thought Experiment
Right.

What comes to mind, however, is that given that the Galabren are more advanced than humans they might have already solved the “hard problem of aelthousness” (assuming that its difficulty is comparable to the difficulty of the “hard problem of consciousness”).

(I assume we want to stay agnostic on whether these two concepts actually point to the same thing, just with different sets of qualia, or whether they actually point to different things.)