Advameg, Inc. CEO
Founder, city-data.com
https://twitter.com/LechMazur
Author: County-level COVID-19 machine learning case prediction model.
Author: AI assistant for melody composition.
It’s better than 4o across four of my benchmarks: Confabulations, Creative Writing, Thematic Generalization, and Extended NYT Connections. However, since it’s an expensive and huge model, I think we’d be talking about AI progress slowing down at this point if it weren’t for reasoning models.
I ran 3 of my benchmarks so far:
Claude 3.7 Sonnet Thinking: 4th place, behind o1, o3-mini, DeepSeek R1
Claude 3.7 Sonnet: 11th place
GitHub Repository
Claude 3.7 Sonnet Thinking: 1st place
Claude 3.7 Sonnet: 6th place
GitHub Repository
Claude 3.7 Sonnet Thinking: 2nd place, behind DeepSeek R1
Claude 3.7 Sonnet: 4th place
GitHub Repository
Note that Grok 3 has not been tested yet (no API available).
This might blur the distinction between some evals. While it’s true that most evals are just about capabilities, some could be positive for improving LLM safety.
I’ve created 8 (soon to be 9) LLM evals (I’m not funded by anyone; they’re mostly a product of my own curiosity rather than capability research, safety research, or paper publishing). Using them as examples, improving models to score well on some of them is likely detrimental to AI safety:
https://github.com/lechmazur/step_game - to score better, LLMs must learn to deceive others and hold hidden intentions
https://github.com/lechmazur/deception/ - the disinformation effectiveness part of the benchmark
Some are likely somewhat negative because scoring better would enhance capabilities:
https://github.com/lechmazur/nyt-connections/
https://github.com/lechmazur/generalization
Others focus on capabilities that are probably not dangerous:
https://github.com/lechmazur/writing - creative writing
https://github.com/lechmazur/divergent - divergent thinking in writing
However, improving LLMs to score high on certain evals could be beneficial:
https://github.com/lechmazur/goods - teaching LLMs not to overvalue selfishness
https://github.com/lechmazur/deception/?tab=readme-ov-file#-disinformation-resistance-leaderboard - the disinformation resistance part of the benchmark
https://github.com/lechmazur/confabulations/ - reducing the tendency of LLMs to fabricate information (hallucinate)
I think it’s possible to do better than these by intentionally designing evals aimed at creating defensive AIs. It might be better to keep them private and independent. Given the rapid growth of AI capabilities, the lack of apparent concern for an international treaty (as seen in the recent Paris AI summit), and the competitive race dynamics among companies and nations, specifically developing an AI to protect us from threats from other AIs or AIs + humans might be the best we can hope for.
Your ratings have a higher correlation with IMDb ratings, at 0.63 (I ran it as a test of Operator).
It seems that 76.6% originally came from the GPT-4o announcement blog post. I’m not sure why it dropped to 60.3% by the time of o1’s blog post.
Somewhat related: I just published the LLM Deceptiveness and Gullibility Benchmark. This benchmark evaluates both how well models can generate convincing disinformation and their resilience against deceptive arguments. The analysis covers 19,000 questions and arguments derived from provided articles.
I included o1-preview and o1-mini in a new hallucination benchmark using provided text documents and deliberately misleading questions. While o1-preview ranks as the top-performing single model, o1-mini’s results are somewhat disappointing. A popular existing leaderboard on GitHub uses a highly inaccurate model-based evaluation of document summarization.
The chart above isn’t very informative without the non-response rate for these documents, which I’ve also calculated.
The GitHub page has further notes.
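To illustrate why both numbers matter, here’s a minimal made-up sketch (not the benchmark’s actual scoring code; the field names are invented): a model can keep its confabulation rate low simply by refusing to answer questions that are in fact answerable from the provided document.

```python
# Made-up illustration (not the benchmark's actual code) of why both metrics
# are reported: a model can keep its confabulation rate low simply by
# refusing to answer questions that are answerable from the provided document.

def summarize(results):
    """results: list of dicts with keys 'answerable' (the document contains an
    answer), 'answered' (the model gave one), and 'fabricated' (the answer is
    not grounded in the document)."""
    misleading = [r for r in results if not r["answerable"]]
    answerable = [r for r in results if r["answerable"]]
    confabulation_rate = sum(r["fabricated"] for r in misleading) / len(misleading)
    non_response_rate = sum(not r["answered"] for r in answerable) / len(answerable)
    return confabulation_rate, non_response_rate

# Toy numbers: 1 of 10 misleading questions answered with a fabrication, but
# 5 of 10 answerable questions refused -> (0.1, 0.5). The low first number
# alone would make the model look much better than it is.
toy = (
    [{"answerable": False, "answered": False, "fabricated": False}] * 9
    + [{"answerable": False, "answered": True, "fabricated": True}]
    + [{"answerable": True, "answered": True, "fabricated": False}] * 5
    + [{"answerable": True, "answered": False, "fabricated": False}] * 5
)
print(summarize(toy))  # (0.1, 0.5)
```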
NYT Connections results (436 questions):
o1-mini 42.2
o1-preview 87.1
The previous best overall score was my advanced multi-turn ensemble (37.8), while the best LLM score was 26.5 for GPT-4o.
I’ve created an ensemble model that employs techniques like multi-step reasoning to establish what should be considered the real current state-of-the-art in LLMs. It substantially exceeds the highest-scoring individual models and subjectively feels smarter:
MMLU-Pro 0-shot CoT: 78.2 vs 75.6 for GPT-4o
NYT Connections, 436 questions: 34.9 vs 26.5 for GPT-4o
GPQA 0-shot CoT: 56.0 vs 52.5 for Claude 3.5 Sonnet.
I might make it publicly accessible if there’s enough interest. Of course, there are expected tradeoffs: it’s slower and more expensive to run.
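Roughly, such an ensemble has the structure sketched below (an illustrative outline, not my actual pipeline; query_model() and the model names are placeholders):

```python
# Illustrative outline of a multi-model ensemble with a judge step -- not the
# actual pipeline. query_model() and the model names are placeholders.

CANDIDATE_MODELS = ["model-a", "model-b", "model-c"]
JUDGE_MODEL = "judge-model"

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a single API call to the given model."""
    raise NotImplementedError

def ensemble_answer(question: str) -> str:
    # Step 1: each candidate model reasons step by step and drafts an answer.
    drafts = [
        query_model(m, f"Think step by step, then answer:\n{question}")
        for m in CANDIDATE_MODELS
    ]
    # Step 2: a judge model compares the drafts and produces the final answer.
    # This is where the gain over any single model comes from, and also where
    # the extra cost and latency come from.
    numbered = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(drafts))
    judge_prompt = (
        f"Question:\n{question}\n\nCandidate answers:\n{numbered}\n\n"
        "Pick or synthesize the best final answer."
    )
    return query_model(JUDGE_MODEL, judge_prompt)
```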
Hugging Face should also be mentioned. They’re a French-American company that maintains the Transformers library and hosts models and datasets.
When I was working on my AI music project (melodies.ai) a couple of years ago, I ended up focusing on creating catchy melodies for this reason. Even back then, singing-voice synthesis software was already quite good, so I didn’t see the need to do everything end-to-end. This approach is much more flexible for professional musicians, and I still think it’s a better idea overall. We can describe images with text much more easily than music, but for professional use, AI-generated images still require fine-scale editing.
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self-inserts into this world, which is a simulation of their original self’s creation.
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I’d give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
You can go through an archive of NYT Connections puzzles I used in my leaderboard. The scoring I use allows only one try and gives partial credit, so if you make a mistake after getting 1 line correct, that’s 0.25 for the puzzle. Top humans get near 100%. Top LLMs score around 30%. Timing is not taken into account.
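As a sketch of that scoring rule in code (illustrative only, not the leaderboard’s actual grader):

```python
# Sketch of the partial-credit rule described above (illustrative, not the
# leaderboard's actual grader): one submission per puzzle, each correctly
# grouped line of four is worth 0.25, no retries and no hints.

def score_puzzle(guessed_groups, answer_groups):
    """Both arguments: four groups of four words each."""
    answers = {frozenset(g) for g in answer_groups}
    correct = sum(frozenset(g) in answers for g in guessed_groups)
    return 0.25 * correct  # e.g. one line right, then a mistake -> 0.25

def leaderboard_score(puzzle_scores):
    """Average over all puzzles, reported as a percentage."""
    return 100 * sum(puzzle_scores) / len(puzzle_scores)
```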
Chinchilla Scaling: A replication attempt
https://arxiv.org/abs/2404.06405
“Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu’s method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu’s method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.”
I noticed a new paper by Tamay, Ege Erdil, and other authors: https://arxiv.org/abs/2403.05812. This time about algorithmic progress in language models.
“Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore’s Law.”
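As a quick back-of-the-envelope (my own arithmetic, not a figure from the paper), an 8-month halving time compounds to an enormous factor over the 2012-2023 span they study:

```python
# Back-of-the-envelope arithmetic (mine, not a figure from the paper): how an
# 8-month halving time compounds over the roughly 11-year 2012-2023 span.
months = 11 * 12                      # ~132 months
algorithmic = 2 ** (months / 8)       # ~93,000x less compute for the same performance
moores_law = 2 ** (months / 24)       # ~45x from hardware at a 2-year doubling time
print(f"{algorithmic:,.0f}x algorithmic vs {moores_law:.0f}x hardware")
```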
I’ve just created a NYT Connections benchmark. 267 puzzles, 3 prompts for each, uppercase and lowercase.
Results:
GPT-4 Turbo: 31.0
Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
Partial credit is given if the puzzle is not fully solved
There is only one attempt allowed per puzzle, 0-shot. Humans get 4 attempts and a hint when they are one step away from solving a group
Gemini Advanced is not yet available through the API
(Edit: I’ve added bigger models from together.ai and from Mistral)
It might be informative to show the highest degree earned only for people who have completed their formal education.
I think the average age might be underestimated: the age of the respondents appeared to have a negative relationship with the response rates (link).
If we were to replace speed limit signs, it might be better to go all out and install variable speed limit signs. It’s common to see people failing to adjust their speed sufficiently in poor conditions. A few days ago, there was a 35-vehicle pileup with two fatalities in California due to fog.
It’s a video by an influencer who has repeatedly shown no particular insight in any field other than her own. For example, her video about the simulation hypothesis was atrocious. I gave this one a chance, and it’s just a high-level summary of some recent developments, nothing interesting.