Advameg, Inc. CEO
Founder, city-data.com
https://twitter.com/LechMazur
Author: County-level COVID-19 machine learning case prediction model.
Author: AI assistant for melody composition.
Somewhat related: I just published the LLM Deceptiveness and Gullibility Benchmark. This benchmark evaluates both how well models can generate convincing disinformation and their resilience against deceptive arguments. The analysis covers 19,000 questions and arguments derived from provided articles.
I included o1-preview and o1-mini in a new hallucination benchmark using provided text documents and deliberately misleading questions. While o1-preview ranks as the top-performing single model, o1-mini’s results are somewhat disappointing. A popular existing leaderboard on GitHub uses a highly inaccurate model-based evaluation of document summarization.
The chart above isn’t very informative without the non-response rate for these documents, which I’ve also calculated:
The GitHub page has further notes.
NYT Connections results (436 questions):
o1-mini 42.2
o1-preview 87.1
The previous best overall score was my advanced multi-turn ensemble (37.8), while the best LLM score was 26.5 for GPT-4o.
I’ve created an ensemble model that employs techniques like multi-step reasoning to establish what should be considered the real current state-of-the-art in LLMs. It substantially exceeds the highest-scoring individual models and subjectively feels smarter:
MMLU-Pro 0-shot CoT: 78.2 vs 75.6 for GPT-4o
NYT Connections, 436 questions: 34.9 vs 26.5 for GPT-4o
GPQA 0-shot CoT: 56.0 vs 52.5 for Claude 3.5 Sonnet.
I might make it publicly accessible if there’s enough interest. Of course, there are expected tradeoffs: it’s slower and more expensive to run.
Hugging Face should also be mentioned. It’s a French-American company that maintains the transformers library and hosts models and datasets.
When I was working on my AI music project (melodies.ai) a couple of years ago, I ended up focusing on creating catchy melodies for this reason. Even back then, voice singing software was already quite good, so I didn’t see the need to do everything end-to-end. This approach is much more flexible for professional musicians, and I still think it’s a better idea overall. We can describe images with text much more easily than music, but for professional use, AI-generated images still require fine-scale editing.
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I’d give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
You can go through an archive of NYT Connections puzzles I used in my leaderboard. The scoring I use allows only one try and gives partial credit, so if you make a mistake after getting 1 line correct, that’s 0.25 for the puzzle. Top humans get near 100%. Top LLMs score around 30%. Timing is not taken into account.
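The scoring rule described above can be sketched roughly as follows. This is only an illustration, not the leaderboard’s actual code: the function name `score_puzzle`, the case-insensitive matching, and reading "only one try" as "stop counting at the first wrong group" are all my assumptions.

```python
from typing import FrozenSet, List, Sequence, Set


def _as_key(group: Sequence[str]) -> FrozenSet[str]:
    """Normalize a group of words so that order and letter case do not matter."""
    return frozenset(word.strip().upper() for word in group)


def score_puzzle(solution: List[Sequence[str]], guesses: List[Sequence[str]]) -> float:
    """Return the fraction of groups solved before the first mistake.

    Hypothetical helper: `solution` holds the puzzle's true groups and
    `guesses` holds the submitted groups in the order they were given.
    """
    remaining: Set[FrozenSet[str]] = {_as_key(group) for group in solution}
    correct = 0
    for guess in guesses:
        key = _as_key(guess)
        if key not in remaining:
            break  # single attempt: the first wrong group ends the puzzle
        remaining.remove(key)
        correct += 1
    return correct / len(solution)


if __name__ == "__main__":
    solution = [
        ["a", "b", "c", "d"],
        ["e", "f", "g", "h"],
        ["i", "j", "k", "l"],
        ["m", "n", "o", "p"],
    ]
    # One correct group, then a mistake: 0.25 for the puzzle.
    print(score_puzzle(solution, [["D", "C", "B", "A"], ["e", "f", "g", "i"]]))
```

A leaderboard score would then be the mean of these per-puzzle values over all puzzles, expressed as a percentage.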
Chinchilla Scaling: A replication attempt
https://arxiv.org/abs/2404.06405
“Essentially, this classic method solves just 4 problems less than AlphaGeometry and establishes the first fully symbolic baseline strong enough to rival the performance of an IMO silver medalist. (ii) Wu’s method even solves 2 of the 5 problems that AlphaGeometry failed to solve. Thus, by combining AlphaGeometry with Wu’s method we set a new state-of-the-art for automated theorem proving on IMO-AG-30, solving 27 out of 30 problems, the first AI method which outperforms an IMO gold medalist.”
I noticed a new paper by Tamay, Ege Erdil, and other authors: https://arxiv.org/abs/2403.05812. This time about algorithmic progress in language models.
“Using a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months, substantially faster than hardware gains per Moore’s Law.”
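To get a feel for what an 8-month halving time implies, here is a minimal sketch of the arithmetic. The function name `compute_reduction_factor` is hypothetical, and the sketch assumes a constant halving time, which is a simplification of what the paper estimates.

```python
def compute_reduction_factor(months: float, halving_months: float = 8.0) -> float:
    """Factor by which the compute needed for a fixed performance level shrinks.

    Assumes a constant halving time: after `halving_months` months the
    required compute is half of what it was, so the reduction factor is
    2 ** (months / halving_months).
    """
    if halving_months <= 0:
        raise ValueError("halving_months must be positive")
    return 2.0 ** (months / halving_months)


if __name__ == "__main__":
    # Central estimate and the two ends of the quoted 95% confidence interval.
    for halving in (5.0, 8.0, 14.0):
        factor = compute_reduction_factor(12.0, halving)
        print(f"halving every {halving:g} months: {factor:.2f}x less compute per year")
```

With the central estimate, one year of algorithmic progress corresponds to roughly a 2.8x reduction in required compute (2^1.5).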
I’ve just created a NYT Connections benchmark. 267 puzzles, 3 prompts for each, uppercase and lowercase.
Results:
GPT-4 Turbo: 31.0
Claude 3 Opus: 27.3
Mistral Large: 17.7
Mistral Medium: 15.3
Gemini Pro: 14.2
Qwen 1.5 72B Chat: 10.7
Claude 3 Sonnet: 7.6
GPT-3.5 Turbo: 4.2
Mixtral 8x7B Instruct: 4.2
Llama 2 70B Chat: 3.5
Nous Hermes 2 Yi 34B: 1.5
Partial credit is given if the puzzle is not fully solved.
There is only one attempt allowed per puzzle, 0-shot. Humans get 4 attempts and a hint when they are one step away from solving a group.
Gemini Advanced is not yet available through the API.
(Edit: I’ve added bigger models from together.ai and from Mistral)
It might be informative to show the highest degree earned only for people who have completed their formal education.
I think the average age might be underestimated: the age of the respondents appeared to have a negative relationship with the response rates (link).
If we were to replace speed limit signs, it might be better to go all out and install variable speed limit signs. It’s common to see people failing to adjust their speed sufficiently in poor conditions. A few days ago, there was a 35-vehicle pileup with two fatalities in California due to fog.
It’s a lot of work to learn to create animations and then do them for hours of content. Creating AI images with DALL-E 3, Midjourney v6, or SDXL and then animating them with RunwayML (which in my testing worked better than Pika or Stable Video Diffusion) could be an intermediate step. The quality is already high enough for AI images, but not for video without multiple tries (it should get a lot better in 2024).
Will do.
Entering an extremely unlikely prediction as a strategy to maximize EV only makes sense if there’s a huge number of entrants, which seems improbable unless this contest goes viral. The inclusion of an “interesting” factor in the ranking criteria should deter spamming with low-quality entries.
Kalshi has a real-money market “ChatGPT-5 revealed” for 2023 (that I’ve traded). I think they wouldn’t mind adding another one for 2024.
I’m a fan of prediction markets, but they’re limited to pre-set bets and not ideal for long-shot, longer-term predictions, mainly because betting against such a prediction means a loss compared to risk-free bonds while the money is tied up. Therefore, I’d like to fund a 2024 Long-Shot Prediction Contest offering up to three $500 prizes. However, I need volunteers to act as judges and to help get this publicized.
Entrants will submit one prediction for 2024 on any topic or event
Volunteer judges and I will vote on the likelihood of each prediction and how “interesting” it is, forming a ranked list
In January 2025, judges will determine which predictions came true, and winners will get their prizes
To start with a $500 prize, I need at least two people to volunteer as judges and a minimum of 10 predictions (judges cannot enter). If this receives, let’s say, 50+ predictions, there will be two prizes. For 200+ predictions, three prizes.
Interested in judging or have any suggestions? Let me know.
The specific example in your recent paper is quite interesting:
“we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision”
It seems that 76.6% originally came from the GPT-4o announcement blog post. I’m not sure why it dropped to 60.3% by the time of o1’s blog post.