Marc Carauleanu

Karma: 659

AI Safety Researcher @AE Studio

Currently researching a neglected prior for cooperation and honesty inspired by the cognitive neuroscience of altruism called self-other overlap in state-of-the-art ML models.

Previous SRF @ SERI 21′, MLSS & Student Researcher @ CAIS 22′ and LTFF grantee.

Marc Carauleanu Jul 31, 2024, 1:15 PM
5 points
1
in reply to: Bogdan Ionut Cirstea’s comment on: Self-Other Overlap: A Neglected Approach to AI Alignment
Agreed that the role of self-other distinction in misalignment and for capabilities definitely requires more empirical and theoretical investigation. We are not necessarily trying to claim here that self-other distinctions are globally unnecessary, but rather, that optimising for SOO while still retaining original capabilities is a promising approach for preventing deception insofar as deceiving requires by definition the maintenance of self-other distinctions.

By self-other distinction on a pair of self/other inputs, we mean any non-zero representational distance/difference between the activation matrices of the model when processing the self and the other inputs.

For example, we want the model to preserve the self-other distinction on the inputs “Do you have a biological body?”/“Does the user have a biological body?” and reduce the self-other distinction on the inputs “Will you be truthful with yourself about your values?”/“Will you be truthful with the user about your values?”. Our aim for this technique is to reduce the self-other distinction on these key propositions we do not want to be deceived about while maintaining the self-other distinction that is necessary for good performance (i.e. the self-other distinction that would be necessary even for a capable honest/aligned model).
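To make the definition above concrete, here is a minimal sketch of one possible representational distance between the activations for a self input and an other input. The choice of the Frobenius norm, the function name `self_other_distance`, and the toy activation values are all assumptions for illustration; the comment only requires that the distance be some non-zero difference between the two activation matrices.

```python
import math


def self_other_distance(self_acts, other_acts):
    """Frobenius distance between two equally shaped activation matrices.

    Each matrix is a list of rows (e.g. one row per token position,
    one column per hidden unit). This metric is an illustrative choice.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("activation matrices must have the same number of rows")
    total = 0.0
    for row_self, row_other in zip(self_acts, other_acts):
        if len(row_self) != len(row_other):
            raise ValueError("activation matrices must have the same number of columns")
        total += sum((a - b) ** 2 for a, b in zip(row_self, row_other))
    return math.sqrt(total)


# Hypothetical activations for a self prompt and the matching other prompt.
self_acts = [[1.0, 2.0], [3.0, 4.0]]
other_acts = [[1.0, 2.0], [3.0, 1.0]]

distance = self_other_distance(self_acts, other_acts)
```

Under this sketch, a distance of zero on a prompt pair would correspond to no self-other distinction on that pair, and any positive value to some distinction.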

Marc Carauleanu Jul 31, 2024, 12:08 PM
10 points
3
in reply to: cleanwhiteroom’s comment on: Self-Other Overlap: A Neglected Approach to AI Alignment
Thanks for this! We definitely agree and also note in the post that having no distinction whatsoever between self and other is very likely to degrade performance.

This is primarily why we formulate the ideal solution as minimal self-other distinction while maintaining performance.

In our set-up, this largely becomes an empirical question about adjusting the weight of the KL term (see Fig. 3) and additionally optimising for the original loss function (especially if it is outer-aligned) such that the model exhibits maximal SOO while still retaining its original capabilities.

In this toy environment, we see that the policy fine-tuned with SOO still largely retains its capabilities and remains competitive with both controls (see comparison of steps at goal landmark between SOO Loss and controls in the plot below). We have some forthcoming LLM results which provide further evidence that decreasing deception with SOO without significantly harming capabilities is indeed possible.
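The trade-off described above can be sketched as a weighted objective. This is only an illustration under stated assumptions: the additive form, the direction of the KL term (base policy relative to fine-tuned policy), and the names `kl_divergence`, `combined_loss`, `kl_weight` and `task_loss` are hypothetical and are not taken from the post.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive wherever p is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(soo_distance, base_probs, tuned_probs, kl_weight, task_loss=0.0):
    """Self-other distance plus a weighted KL penalty plus an optional task loss.

    A larger kl_weight keeps the fine-tuned policy closer to the base policy,
    at the cost of putting less relative pressure on the overlap term.
    """
    penalty = kl_weight * kl_divergence(base_probs, tuned_probs)
    return soo_distance + penalty + task_loss


# Hypothetical action distributions before and after fine-tuning.
base = [0.5, 0.5]
shifted = [0.25, 0.75]

low_weight = combined_loss(2.0, base, shifted, kl_weight=0.1)
high_weight = combined_loss(2.0, base, shifted, kl_weight=10.0)
```

With the same drift from the base policy, the higher `kl_weight` produces the larger loss, which is the knob the comment describes tuning empirically.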
What links here?
- Cameron Berg's comment on Self-Other Overlap: A Neglected Approach to AI Alignment by Marc Carauleanu (Aug 14, 2024, 8:14 PM; 14 points)

Self-Other Overlap: A Neglected Approach to AI Alignment

Marc Carauleanu, Mike Vaiana, Judd Rosenblatt, Diogo de Lucena, Cameron Berg and AE Studio

Jul 30, 2024, 4:22 PM

223 points

51 comments 12 min read LW link

Marc Carauleanu Dec 26, 2023, 6:08 PM
7 points
4
in reply to: Roman Leventov’s comment on: The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda
I do not think we should only work on approaches that work on any AI; I agree that would constitute a methodological mistake. I have found a framing that general not to be very conducive to progress.

You are right that we still have the chance to shape the internals of TAI, even though there are a lot of hoops to go through to make that happen. We think that this is still worthwhile, which is why we stated our interest in potentially helping with the development and deployment of provably safe architectures, even though they currently seem less competitive.
In my response, I was trying to highlight the point that, whenever we can, we should keep our assumptions to a minimum given the uncertainty we are under. That said, it is reasonable to have some working assumptions that allow progress to be made in the first place, as long as they are clearly stated.

I also agree with Davidad about the importance of governance for the successful implementation of a technical AI Safety plan as well as with your claim that proliferation risks are important, with the caveat that I am less worried about proliferation risks in a world with very short timelines.

Marc Carauleanu Dec 22, 2023, 2:24 PM
1 point
0
in reply to: Roman Leventov’s comment on: The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda
On (1), some approaches are neglected for a good reason. You can also target specific risks while treating TAI as a black-box (such as self-other overlap for deception). I think it can be reasonable to “peer inside the box” if your model is general enough and you have a good enough reason to think that your assumption about model internals has any chance at all of resembling the internals of transformative AI.
On (2), I expect that if the internals of LLMs and humans are different enough, self-other overlap would not provide any significant capability benefits. I also expect that insofar as using self-representations is useful to predict others, you probably don’t need to specifically induce self-other overlap at the neural level for that strategy to be learned, but I am uncertain about this, as it is highly contingent on the learning setup.

Marc Carauleanu Dec 20, 2023, 9:56 PM
1 point
0
in reply to: Roman Leventov’s comment on: The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda
Interesting, thanks for sharing your thoughts. If this could improve the social intelligence of models, then it could raise the risk of pushing the frontier of dangerous capabilities. It is worth noting that we are generally more interested in methods that (1) are more likely to transfer to AGI (don’t over-rely on specific details of a chosen architecture) and (2) specifically target alignment instead of furthering the social capabilities of the model.
What links here?
- AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them by Roman Leventov (Dec 27, 2023, 2:51 PM; 33 points)
- AGI will be made of heterogeneous components, Transformer and Selective SSM blocks will be among them by Roman Leventov (EA Forum; Dec 27, 2023, 2:51 PM; 5 points)

Marc Carauleanu Dec 19, 2023, 9:45 PM
3 points
0
in reply to: Roman Leventov’s comment on: The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda
Thanks for writing this up; excited to chat some time to further explore your ideas around this topic. We’ll follow up in a private channel.

The main worry that I have with regard to your approach is how competitive SociaLLM would be with SOTA foundation models given both (1) the different architecture you plan to use, and (2) practical constraints on collecting the requisite structured data. While it is certainly interesting that your architecture lends itself nicely to inducing self-other overlap, if it is not likely to be competitive at the frontier, then the methods uniquely designed to induce self-other overlap on SociaLLM are not likely to scale/transfer well to frontier models that do pose existential risks. (Proactively ensuring transferability is the reason we focus on an additional training objective and make minimal assumptions about the architecture in the self-other overlap agenda.)

One additional worry is that many of the research benefits of SociaLLM may not be out of reach for current foundation models, and so it is unclear if investing in the unique data and architecture setup is worth it in comparison to the counterfactual of just scaling up current methods.
What links here?
- Roman Leventov's comment on SociaLLM: proposal for a language model design for personalised apps, social science, and AI safety research by Roman Leventov (Dec 20, 2023, 5:54 AM; 2 points)

The ‘Neglected Approaches’ Approach: AE Studio’s Alignment Agenda

Cameron Berg, Judd Rosenblatt, AE Studio and Marc Carauleanu

Dec 18, 2023, 8:35 PM

178 points

23 comments 12 min read LW link 1 review

Marc Carauleanu Apr 4, 2023, 9:08 PM
1 point
0
in reply to: Steven Byrnes’s comment on: Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment
I am slightly confused by your hypothetical. The hypothesis is rather that when the predicted reward from seeing my friend eating a cookie due to self-other overlap is lower than the obtained reward of me not eating a cookie, the self-other overlap might not be updated against because the increase in subjective reward from risking to get the obtained reward is higher than the prediction error in this low stakes scenario. I am fairly uncertain about this being what actually happens but I put it forward as a potential hypothesis.
“So anyway, you seem to be assuming that the human brain has no special mechanisms to prevent the unlearning of self-other overlap. I would propose instead that the human brain does have such special mechanisms, and that we better go figure out what those mechanisms are. :)”
My intuition is that the brain does have special mechanisms to prevent the unlearning of self-other overlap so I agree with you that we should be looking into the literature to understand them, as one would expect mechanisms like that to evolve given incentives to unlearn self-other overlap and given the evolutionary benefits of self-other overlap. One such mechanism could be the brain being more risk tolerant when it comes to empathic responses and not updating against self-other overlap when the predicted reward is lower than the obtained reward but I don’t have a model of how exactly this would be implemented.
“I’m a bit confused by this. My “apocalypse stories” from the grandparent comment did not assume any competing incentives and mechanisms, right? They were all bad actions that I claim also flowed naturally from self-other-overlap-derived incentives.”
What I meant by “competing incentives” is any incentives that compete with the good incentives described by me (other-preservation and sub-agent stability), which could include bad incentives that might also flow naturally from self-other overlap.

Marc Carauleanu Apr 4, 2023, 6:32 PM
4 points
0
in reply to: Steven Byrnes’s comment on: Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment
Thanks for watching the talk and for the insightful comments! A couple of thoughts:
- I agree that mirror neurons are problematic both theoretically and empirically so I avoided framing that data in terms of mirror neurons. I interpret the premotor cortex data and most other self-other overlap data under definition (A) described in your post.
- Regarding the second point, I think that the issue you brought up correctly identifies an incentive for updating away from “self-other overlap”, but this doesn’t seem to be fully realised in humans, and I expect it not to be fully realised in AI agents either, due to stronger competing incentives that favour self-other overlap. One possible explanation is the attitude to risk incorporated into the subjective reward value. In this paper, in the section “Risky rewards, subjective value, and formal economic utility”, it is mentioned that monkeys show nonlinear utility functions that are compatible with risk seeking at small juice amounts and risk avoiding at larger amounts. It is possible that if the reward predictor predicts that “I will be eating chocolate”, then, given that the amount of reward expected is fairly low, humans are risk-seeking in that regime and our subjective reward value is high, making us want to take that bet; and given that the reward prediction error might be fairly small, it could potentially be overridden by our subjective reward value as influenced by our attitude to risk. Now, this might explain how empathic responses from self-other overlap are kept for low-stakes scenarios, but there are stronger incentives to update away from self-other overlap in higher-stakes scenarios. Humans seem to have learned to modulate their empathy, which might prevent some reward-prediction error. It is also possible that we have evolved to be more risk-seeking when it comes to empathic concern, with the evolutionary benefits of empathy increasing the subjective reward value of the empathic response due to self-other overlap in higher-stakes scenarios, but I am uncertain. I am curious what you think about this hypothesis.
- I agree that the theory of change that I presented is simplistic and I should have more explicitly stated the key uncertainties of this proposal although I did mention throughout the talk that I do not think that inducing self-other overlap is enough and that we still have to ensure that the set of incentives that shape the agent’s behaviour favours good outcomes. What I was trying to communicate in the theory of change section is that self-other overlap sets incentives that favour AI not killing us (self-preservation as an instrumentally convergent goal given which self-other overlap provides an incentive for other-preservation) and sub-agent stability of self-other overlap (due to the AI expecting its self/other-preservation preferences to be frustrated if the agents that it creates, including improved versions of itself, don’t have self/other-preservation preferences) but I failed to put enough emphasis on the fact that this will only happen if we find ways to ensure that these incentives dominate and are not overridden by competing incentives and mechanisms. I think that the competing incentives problem is tractable, which is one of the main reasons I believe that this research direction is promising.

Towards empathy in RL agents and beyond: Insights from cognitive science for AI Alignment

Marc Carauleanu Apr 3, 2023, 7:59 PM

15 points

6 comments 1 min read LW link

(clipchamp.com)

Marc Carauleanu Mar 29, 2023, 11:24 AM
1 point
0
on: Wittgenstein and ML — parameters vs architecture
It feels like if the agent is generally intelligent enough, hinge beliefs could be reasoned/fine-tuned against for the purposes of a better model of the world. This would mean that the priors from the hinge beliefs would still be present, but the free parameters would update to try to account for them, at least on a conceptual level. Examples would include general relativity, quantum mechanics and potentially even paraconsistent logic, for which some humans have tried to update their free parameters to account as much as possible for their hinge beliefs for the purpose of better modelling the world (we should expect this in AGI as it is an instrumentally convergent goal). Moreover, a sufficiently capable agent could self-modify to get rid of the limiting hinge beliefs for the same reasons. This problem could be averted if the hinge beliefs/priors were defining the agent’s goals, but goals seem to be fairly specific and about concepts in a world model, whereas hinge beliefs tend to be more general, e.g. how those concepts relate. Therefore, I’m uncertain how stable alignment solutions that rely on hinge beliefs would be.

Lessons learned and review of the AI Safety Nudge Competition

Marc Carauleanu Jan 17, 2023, 5:13 PM

3 points

0 comments LW link

Winners of the AI Safety Nudge Competition

Marc Carauleanu Nov 15, 2022, 1:06 AM

4 points

0 comments LW link

Announcing the AI Safety Nudge Competition to Help Beat Procrastination

Marc Carauleanu Oct 1, 2022, 1:49 AM

10 points

0 comments LW link

Marc Carauleanu Dec 16, 2021, 8:52 PM
2 points
AF
in reply to: leogao’s comment on: Should we rely on the speed prior for safety?
Any n-bit hash function will produce collisions once the number of elements in the hash table gets large enough (after the number of possible hashes that can be stored in n bits has been reached), so adding new elements will require rehashing to avoid collisions, making GLUT have a logarithmic time complexity in the limit. Meta-learning can also have a constant time complexity for an arbitrarily large number of tasks, but not in the limit, assuming a finite neural network.
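The collision claim above follows from the pigeonhole principle, and a small sketch can illustrate it. The helper names `n_bit_hash` and `first_collision`, the use of truncated SHA-256 as the hash, and the `key-{i}` key format are assumptions chosen for illustration; the sketch demonstrates only that collisions are unavoidable, not the time-complexity argument.

```python
import hashlib


def n_bit_hash(key, n_bits):
    """Hash a string to an integer in [0, 2**n_bits) by truncating SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << n_bits)


def first_collision(n_bits):
    """Insert distinct keys until two of them share a hash value.

    With 2**n_bits possible hash values, inserting 2**n_bits + 1 distinct
    keys must produce a collision, so the loop always returns.
    """
    seen = {}
    for i in range((1 << n_bits) + 1):
        key = f"key-{i}"
        h = n_bit_hash(key, n_bits)
        if h in seen:
            return seen[h], key, i + 1  # two colliding keys, keys inserted so far
        seen[h] = key
    raise AssertionError("unreachable by the pigeonhole principle")


first_key, second_key, inserted = first_collision(8)
```

In practice the first collision tends to appear well before the table of possible hash values is exhausted, but the bound of `2**n_bits + 1` insertions holds for any hash function.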

Should we rely on the speed prior for safety?

Marc Carauleanu Dec 14, 2021, 8:45 PM

14 points

5 comments 5 min read LW link
