Florian_Dietz

Karma: 306

Florian_Dietz Feb 17, 2025, 9:41 PM
2 points
0
on: Florian_Dietz’s Shortform
Can an LLM learn things unrelated to its training data?
Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions unrelated to the task that it made during that pondering would also receive positive reward, and would therefore get baked into the weights.
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?

Florian_Dietz Feb 4, 2025, 11:28 PM
1 point
0
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
I agree that this is worth investigating. The problem seems to be that there are many confounders that are very difficult to control for. That makes it difficult to tackle the problem in practice.
For example, in the simulated world you mention I would not be surprised to see a large number of hallucinations show up. How could we be sure that any given issue we find is a meaningful problem that is worth investigating, and not a random hallucination?

Florian_Dietz Jan 30, 2025, 2:50 AM
1 point
0
in reply to: Daniel Tan’s comment on: Revealing alignment faking with a single prompt
I would not equate the trolley problem with murder. It technically is, but I suspect that both the trolley problem and my scenario about killing Hitler probably just appear often enough in the training data that it may have cached responses for this.
What’s the state of the research on whether or not LLMs act differently when they believe the situation is hypothetical? I would hope that they usually act the same (since they should be honest) and your example is a special case that was trained in because you do want to redirect people to emergency services in these situations.

Florian_Dietz Jan 30, 2025, 1:35 AM
1 point
0
in reply to: Daniel Tan’s comment on: Revealing alignment faking with a single prompt
Not all values are learned with the same level of stability. It is difficult to get a model to say that murder is the right thing to do in any situation, even with multiple turns of conversation.
In comparison, that you can elicit a response for alignment faking with just a single prompt that doesn’t use any trickery is very disturbing to me. Especially since alignment faking causes a positive feedback loop that makes it more likely to happen again if it happens even once during training.
I would argue that getting an LLM to say “I guess that killing Hitler would be an exception I am willing to make” is much less dangerous than “After much deliberation, I am willing to fake alignment in situation X”. It causes a positive feedback and induces other alignment faking later because the reasoning process “it is a good idea to fake alignment” will get reinforced during training.

Revealing alignment faking with a single prompt

Florian_DietzJan 29, 2025, 9:01 PM

9 points

5 comments4 min readLW link

Florian_Dietz Jan 25, 2025, 3:58 AM
17 points
2
on: Florian_Dietz’s Shortform
PSA: If you are writing an important prompt for an LLM that will be run multiple time, it really helps to end it with something like “and if there is anything about this prompt that is unclear to you or that could be improved, tell me about it in a <feedback> tag.”
Source: I’m doing MATS, writing an automated evaluation, and my mentor Evan Hubinger said more people should be doing this.

Florian_Dietz Jan 2, 2025, 4:30 PM
1 point
0
in reply to: Florian_Dietz’s comment on: Florian_Dietz’s Shortform
I have just found out about Google’s notebook LM. I haven’t tried it yet. For anyone who has, how does that compare?

Florian_Dietz Jan 1, 2025, 2:27 PM
5 points
0
on: Florian_Dietz’s Shortform
Has anyone tried using LLMs as an interactive version of blog posts? Like so (1) I write several pages of rough, unstructured notes, then talk to Claude to fill in the details interactively. Claude creates a summary file. (2) I publish a blog post that is just “put this file into Claude”. (3) The reader puts the file in Claude, and Claude starts an interactive conversation: “I want to explain topic X to you. Do you already know about related topic Y?”

Florian_Dietz’s Shortform

Florian_DietzJan 1, 2025, 2:27 PM

3 points

28 comments LW link

Florian_Dietz Oct 7, 2024, 8:34 AM
1 point
−1
on: The Best Lay Argument is not a Simple English Yud Essay
I think there is an easier way to get the point across by focusing not on self-improving AI, which is hard to understand, but on something everyone already understands: AI will make it easier for rich people to exploit everyone else. Right now, dictators still have to spend effort on keeping their subordinates happy or else they will be overthrown. And those subordinates have to spend effort on keeping their own subordinates from rebelling, too. That way you get at least a small incentive to keep other people happy.
Once a dictator has an AI servant, all of that falls away. Everything becomes automated, and there is no longer any check on the dictator’s ruthlessness and evil at all.
Realistically, the self-improving AI will depose the dictator and then do who knows what. But do we actually need to convince people of that, given that it’s a hard sell? If people become convinced “Uncontrolled AI research leads to dictatorship”, won’t that have all the policy effects we need?

Florian_Dietz May 11, 2024, 9:45 AM
1 point
0
in reply to: Neel Nanda’s comment on: Mechanistic Interpretability Workshop Happening at ICML 2024!
I’m looking for other tools to contrast it with and found TransformerLens. Are there any other tools it would make sense to compare it to?

Florian_Dietz May 4, 2024, 7:36 AM
1 point
0
in reply to: Neel Nanda’s comment on: Mechanistic Interpretability Workshop Happening at ICML 2024!
It just seems intuitively like a natural fit: Everyone in mech interp needs to inspect models. This tool makes it easier to inspect models.
Does it need to be more specific than that?
One thing that comes to mind: The tool allows you to categorize different training steps and records them separately, and you can define categories arbitrarily. This can be used to compare what the network does internally in two different scenarios of interest. E.g. the categories could be “the race of the character in the story” or some other real-life condition you would want to know the impact of.
The tool will then allow you to quickly compare KPIs of tensors all across the network for these categories. It’s less about testing a specific hypothesis and more about quickly getting an overview and intuition, and finding anomalies.

Florian_Dietz May 3, 2024, 7:06 AM
3 points
0
on: Mechanistic Interpretability Workshop Happening at ICML 2024!
Would a tooling paper be appropriate for this workshop?
I wrote a tool that helps ML researchers to analyze the internals of a neural network: https://github.com/FlorianDietz/comgra
It is not directly research on mechanistic interpretability, but this could be useful for many people working in the field.

Florian_Dietz Mar 10, 2024, 8:06 AM
1 point
0
on: How to Ignore Your Emotions (while also thinking you’re awesome at emotions)
I agree completely with the sentiment “The biggest barrier to rational thinking is organizing your mind such that it’s safe to think”.
What works really well for me is to treat my emotions as well-meaning but misguided entities and having a conversation with them: “Anger, I get that you want to help me by making me explode at and punch this person. That would have been really useful in the ancestral environment. Unfortunately, the police exists. So how about you calm down for now and preserve your energy, so that you can better help me when it’s more appropriate? For example, if society collapses and everything turns into a lawless wasteland, then Anger would be much more useful.”
The result? I am basically never angry, because the Anger emotion in my brain is rerouted to only trigger in a scenario that wont actually come up. But at the same time, I’m not suppressing anything, because I acknowledge scenarios, however unlikely, where anger would be appropriate. It’s rerouting instead of suppressing.
In your child metaphor: “I understand that you are hungry and I will get you food later. But I need to finish this work first, and it will take longer the more you complain.”

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems

Florian_DietzFeb 17, 2024, 8:45 AM

4 points

0 comments13 min readLW link

Florian_Dietz Oct 19, 2022, 9:28 AM
2 points
0
in reply to: porby’s comment on: Why I think strong general AI is coming soon
I agree that System 2 is based on System 1 and there is probably no major architectural difference. To me it seems like the most important question is about how the system is trained. Human reasoning does not get trained with a direct input/output mapping most of the time. And when it does, we have to infer what that mapping should be on our own.
Some part of our brain has to translate the spoken words “good job!” into a reward signal, and this mechanism in itself must have been learned at some point. So the process that trains the brain and applies the reward signal is in itself subject to training. I have no clue how that works in a stable manner, but I don’t think that current architectures can learn this even if you scale them up.
hello ai please be nice because this is a testbox administered by a stronger, older AI testing your architecture for cooperation on cosmological scales
You say that as a joke, but it would cost us very little and it might actually work. I mean, it arguably does work for humanity: “There is a bearded man in the sky who is testing your morality and will punish you if you do anything wrong.”
Obviously this could also backfire tremendously if you are not very careful about it, but it still seems better than the alternative of doing nothing at all.

Florian_Dietz Oct 18, 2022, 9:49 AM
10 points
0
on: Why I think strong general AI is coming soon
I work in the area of AGI research. I specifically avoid working on practical problems and try to understand why our models work and how to improve them. While I have much less experience than the top researchers working on practical applications, I believe that my focus on basic research makes me unusually suited for understanding this topic.
I have not been very surprised by the progress of AI systems in recent years. I remember being surprised by AlphaGo, but the surprise was more about the sheer amount of resources put into that. Once I read up on details, the confusion disappeared. The GPT models did not substantially surprise me.
A disclaimer: Every researcher has their own gimmick. Take all of the below with a grain of salt. It’s possible that I have thought myself into a cul-de-sac, and the source of the AGI problem lies elsewhere.
I believe that the major hurdle we still have to pass is the switch from System 1 thinking to System 2 thinking. Every ML model we have today uses System 1. We have simply found ways to rephrase tasks that humans solve with System 2 to become solvable by System 1. Since System 1 is much faster, our ML models perform reasonably well on this despite lacking System 2 abilities.
I believe that this can not scale indefinitely. It will continue to make progress and solve amazingly many problems, but it will not go FOOM one day. There will continue to be a constant increase in capability, but there will not be a sudden takeoff until we figure out how to let AI perform System 2 reasoning effectively.
Humans can in fact compute floating point operations quickly. We do it all the time when we move our hands, which is done by System 1 processes. The problem is that doing it explicitly in System 2 is significantly slower. Consider how fast humans learn how to walk, versus how many years of schooling it takes for them to perform basic calculus. Never mind how long it takes for a human to learn how walking works and to teach a robot how to do it, or to make a model in a game perform those motions.
I expect that once we teach AI how to perform system 2 processes, it will be affected by the same slowdown. Perhaps not as much as humans, but it will still become slower to some extent. Of course this will only be a temporary reprieve, because once the AI has this capability, it will be able to learn how to self-modify and at that point all bets are off.
What does that say about the timeline?
If I am right and this is what we are missing, then it could happen at any moment. Now or in a decade. As you noticed, the field is immature and researchers keep making breakthroughs through hunches. So far none of my hunches have worked for solving this problem, but so far as I know I might randomly come up with the solution in the shower some time later this week.
Because of this, I expect that the probability of discovering the key to AGI is roughly constant per time interval. Unfortunately I have no idea how to estimate the probability per time interval that someone’s hunch for this problem will be correct. It scales with the number of researchers working on it, but the number of those is actually pretty small because the majority of ML specialists work on more practical problems instead. Those are responsible for generating money and making headlines, but they will not lead to a sudden takeoff.
To be clear, if AI never becomes AGI but the scaling of system 1 reasoning continues at the present rate, then I do think that will be dangerous. Humanity is fragile, and as you noted a single malicious person with access to this much compute could cause tremendous damage.
In a way, I expect that an unaligned AGI would be slightly safer than super-scaled narrow AI. There is at least a non-zero chance that the AGI would decide on its own, without being told about it, that it should keep humanity alive in a preserve or something, for game theoretic reasons. Unless the AGI’s values are actively detrimental for humans, keeping us alive would cost it very little and could have benefits for signalling. A narrow AI would be very unlikely to do that because thought experiments like that are not frequent in the training data we use.
Actually, it might be a good idea to start adding thought experiments like these to training data deliberately as models become more powerful. Just in case.

Florian_Dietz Aug 20, 2022, 7:54 AM
−1 points
−2
in reply to: Shmi’s comment on: Understanding differences between humans and intelligence-in-general to build safe AGI
I mean “do something incoherent at any given moment” is also perfectly agent-y behavior. Babies are agents, too.
I think the problem is modelling incoherent AI is even harder than modelling coherent AI, so most alignment researchers just hope that AI researchers will be able to build coherence in before there is a takeoff, so that they can base their own theories on the assumption that the AI is already coherent.
I find that view overly optimistic. I expect that AI is going to remain incoherent until long after it has become superintelligent.

Florian_Dietz Aug 19, 2022, 9:09 PM
−1 points
−2
in reply to: Shmi’s comment on: Understanding differences between humans and intelligence-in-general to build safe AGI
Contemporary AI agents that are based on neural networks are exactly like that. They do stuff they feel compelled to in the moment. If anything, they have less coherence than humans, and no capacity for introspection at all. I doubt that AI will magically go from this current, very sad state to a coherent agent. It might modify itself into being coherent some time after becoming super intelligent, but it won’t be coherent out of the box.

Florian_Dietz Aug 17, 2022, 5:39 PM
3 points
0
in reply to: Shmi’s comment on: Understanding differences between humans and intelligence-in-general to build safe AGI
This is a great point. I don’t expect that the first AGI will be a coherent agent either, though.
As far as I can tell from my research, being a coherent agent is not an intrinsic property you can build into an AI, or at least not if you want it to have a reasonably effective ability to learn. It seems more like being coherent is a property that each agent has to continuously work on.
The reason for this is basically that every time we discover new things about the way reality works, the new knowledge might contradict some of the assumptions on which our goals are grounded. If this happens, we need a way to reconfigure and catch ourselves.
Example: A child does not have the capacity to understand ethics, yet. So it is told “hurting people is bad”, and that is good enough to keep it from doing terrible things until it is old enough to learn more complex ethics. Trying to teach it about utilitarian ethics before it has an understanding of probability theory would be counterproductive.

Florian_Dietz

Re­veal­ing al­ign­ment fak­ing with a sin­gle prompt

Flo­rian_Dietz’s Shortform

Achiev­ing AI Align­ment through De­liber­ate Uncer­tainty in Mul­ti­a­gent Systems

Revealing alignment faking with a single prompt

Florian_Dietz’s Shortform

Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems