I thought it would be good to have some examples where a type signature could be useful, so I asked ChatGPT. I think these are too wishy-washy, but together with the explanations given, they seem to make sense.
Would you say that this level of “having a type signature in mind” would count?
ChatGPT 4o suggesting examples
1. Prediction vs Explanation
Explanation might be:
Phenomenon → (Theory, Mechanism)
Prediction might be:
Features → Label
These have different type signatures. A model that predicts well might not explain. People often conflate these roles. Type signatures remind us: different input-output relationships.
Moral Judgments vs Policy Proposals
Moral judgment (deontic):
Action → Good/Bad
Policy proposal (instrumental):
(State × Action) → (New State × Externalities)
People often act as if “this action is wrong” implies “we must ban it,” but that only follows if the second signature supports the first. You can disagree about outcomes while agreeing on morals, or vice versa.
Interpersonal Feedback
Effective feedback:
(Action × Impact) → Updated Mental Model
People often act as if the type signature is just
Action → Judgment
That’s blame, not feedback. This reframing can help structure nonviolent communication.
Creativity vs Optimization
Optimization:
(Goal × Constraints) → Best Action
Creativity:
Void → (Goal × Constraints × Ideas)
The creative act generates the very goal and constraints. Treating creative design like optimization prematurely can collapse valuable search space.
7. Education
Lecture model:
Speaker → (Concepts × StudentMemory)
Constructivist model:
(Student × Task × Environment) → Insight
If the type signature of insight requires active construction, then lecture-only formats may be inadequate. Helps justify pedagogy choices.
Source: https://chatgpt.com/share/67f836e2-1280-8001-a7ad-1ef1e2a7afa7
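To make the notation more concrete, here is a minimal sketch of how a few of these signatures could be written as ordinary Haskell types. All the type names are placeholders invented for illustration; the point is only that the shapes of the inputs and outputs differ, not that these are the right domain models.

```haskell
-- Minimal sketch: a few of the suggested signatures written as Haskell types.
-- All domain types are opaque placeholders; only the shapes of the arrows matter.

data Phenomenon    = Phenomenon
data Theory        = Theory
data Mechanism     = Mechanism
data Features      = Features
data Label         = Label
data Action        = Action
data Verdict       = Good | Bad deriving (Show)
data WorldState    = WorldState
data Externalities = Externalities

-- Explanation: Phenomenon → (Theory, Mechanism)
explain :: Phenomenon -> (Theory, Mechanism)
explain _ = (Theory, Mechanism)

-- Prediction: Features → Label
predict :: Features -> Label
predict _ = Label

-- Moral judgment: Action → Good/Bad
judge :: Action -> Verdict
judge _ = Good

-- Policy proposal: (State × Action) → (New State × Externalities)
propose :: (WorldState, Action) -> (WorldState, Externalities)
propose (s, _) = (s, Externalities)

main :: IO ()
main = print (judge Action)  -- prints: Good
```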
Of these, I’m most worried about neuralese recurrence effectively removing direct access to the AI’s reasoning in a legible format.
I am not worried about this right now. We should always be able to translate latent-space reasoning, a.k.a. neuralese (see COCONUT), into an equivalent human-language representation. This might be incomplete or leave out details, but that is already the case for existing models (as discussed here). The solution suggested by Villiam is to recursively expand as needed.
Another option might be to translate neuralese to equivalent program code (preferably Lean). This would be harder for most people to read but more precise and probably easier to verify.
Thanks for the clarifications!
The second referred to holes in the landscape mentioned in the post, not in the rules.
That seems like a great game that I will try.
I found the material and the game manual v0.1, but I do not understand:
How are corpses created? When a player is killed/eaten?
How are holes created and healed? I didn’t see any cards for that.
Are players eliminated when both of their agents are eaten? Do they still get a score? I guess yes.
government issue 4x4s (the true menace of the road)
Can confirm from direct observation in both Tanzania and Kenya. It’s not only government-issue 4x4s, but the distinction doesn’t matter, as they enjoy the same privileges.
The good thing is that at least those actions on larger platforms leave evidence that can be established by the court.
Yes, I think “runes” throw many LLMs off into the wrong simulator. Humans don’t fall for this because the symbols “look” mathematical, but a text-based LLM can’t “see” that. The opposite happens for computer scientists: they see “[]” and start to think in programming terms such as lambda functions...
Using a much simpler prompt, and without mentioning number theory or math, o3 easily solves it:
There’s a series of symbol sequences, composed entirely of “[”, “]”, and “-” in some combination that is equal to a number. Here are examples:
…
What is the meaning of this formal notation?
Yes. I first tried things like this, too. I also tried term-rewrite rules, and some of these were quite close. For example, AB → A*(B+1), AB → A*(B+A), or AB → A*(B+index) led to some close misses (the question was which to expand first, i.e., which associativity; I also considered expanding the smaller term first) but failed with later expansions. It took me half an hour to figure out that the index was not additive or multiplicative but the exponent base.
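For illustration, here is a rough sketch of how that kind of rule-guessing could be mechanized: try each candidate combination rule against decoded examples and keep the ones that match all of them. The test values below are made-up placeholders, not the actual sequences from the puzzle.

```haskell
-- Rough sketch of testing candidate combination rules against worked examples.
-- A rule takes the decoded values of adjacent subterms A and B plus their
-- position index and returns the combined value.
type Rule = Integer -> Integer -> Integer -> Integer

candidates :: [(String, Rule)]
candidates =
  [ ("A*(B+1)",     \a b _ -> a * (b + 1))
  , ("A*(B+A)",     \a b _ -> a * (b + a))
  , ("A*(B+index)", \a b i -> a * (b + i))
  ]

-- Each example: (value of A, value of B, index, expected combined value).
-- Placeholder numbers only, NOT the real puzzle data.
examples :: [(Integer, Integer, Integer, Integer)]
examples = [(2, 3, 1, 9), (3, 2, 2, 11)]

-- A rule fits if it reproduces every example.
fits :: Rule -> Bool
fits rule = all (\(a, b, i, expected) -> rule a b i == expected) examples

main :: IO ()
main = print [name | (name, rule) <- candidates, fits rule]
-- With the real sequences, none of these candidates survived the later
-- expansions either; the index turned out to be the exponent base.
```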
When we talk about AIs scheming, alignment faking or goal preservation, we imply there is something scheming or alignment faking or wanting to preserve its goals or escape the datacentre.
See also this previous discussion about
What is the AGI system and what is the environment? Where does the AGI system draw the boundary when reasoning about itself?
Sure, but presumably that brought them to LW where AI safety is a topic.
But maybe the point at which you get into AI safety is less clear once you are on LW. And FO is something that can be more clearly named. So it could all just be the availability heuristic.
Why do you think that is the case? I mean, a) why do they reveal that only after a few drinks, and b) why was that the convincing story, and not HPMoR?
[Linkpost] China’s AI OVERPRODUCTION
China seeks to commoditize their complements. So, over the following months, I expect a complete blitz of Chinese open-source AI models for everything from computer vision to robotics to image generation.
If true, what effects would that have on the AI race and AI governance?
Yes! That’s the right intuition. And the LLMs are doing the same, but we don’t know their world model, and thus the direction of the simplification can be arbitrarily off.
Drilling down on the simplifications, as suggested by Villiam, might help.
This is an interesting UI proposal and, if done right, might provide the needed transparency. Most people wouldn’t read it, but some would, esp. for critical answers.
Yes, but it didn’t mean that AIs could do all kinds of long tasks in 2005. And that is the conclusion many people seem to draw from the METR paper.
As we use the term, yes. But the point (and I should have made that clearer) is that any mismodeling by the parent of the child’s interests and future environment will not be visible to the child, or even to someone reading the thoughts of the well-meaning parent. Many parents want the best for their child but model the child’s future wrongly (mostly due to status quo bias; the problem is different for AI).
It is a decent metric for chess, but a) it doesn’t generalize to other tasks (in the way people seem to interpret the METR paper), and, less importantly, b) I’m quite confident that people wouldn’t beat the chess engines by thinking for years.
No? It means you can’t beat the chess engine.
And even if you could, they try to argue in the other direction: if it takes the human time X at time T, it will take the AI duration L. That didn’t work for chess either.
That we would have AIs performing year-long tasks in 2005. Chess is not the same as software engineering but it is still a limited domain.
Thank you for the details on your setup. Can you drop hints about how long each training run took on these systems?