Automated / strongly-augmented safety research.
Bogdan Ionut Cirstea
The first automatically produced, (human) peer-reviewed, (ICLR) workshop-accepted[/able] AI research paper: https://sakana.ai/ai-scientist-first-publication/
There have been numerous scandals within the EA community about how working for top AGI labs might be harmful. So, when are we going to have this conversation: contributing in any way to the current US admin getting (especially exclusive) access to AGI might be (very) harmful?
[cross-posted from X]
I find the pessimistic interpretation of the results a bit odd given considerations like those in https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed.
I also think it’s important to notice how much less scary / how much more probably-easy-to-mitigate (at least strictly when it comes to technical alignment) this story seems compared to the scenarios from ~10 years ago, e.g. from Superintelligence / from before LLMs, when pure RL seemed like the dominant paradigm for getting to AGI.
I agree it’s bad news w.r.t. getting maximal evidence about steganography and the like happening ‘by default’. I think it’s good news w.r.t. lab incentives, even for labs which don’t speak too much about safety.
I pretty much agree with 1 and 2. I’m much more optimistic about 3-5 even ‘by default’ (e.g. R1’s training being ‘regularized’ towards more interpretable CoT, despite DeepSeek not being too vocal about safety), but especially if labs deliberately try for maintaining the nice properties from 1-2 and of interpretable CoT.
If “smarter than almost all humans at almost all things” models appear in 2026-2027, China and several others will be able to ~immediately steal the first such models, by default.
Interpreting this very charitably: even in that case, they probably wouldn’t have enough inference compute to compete.
Quick take: this is probably interpreting them over-charitably, but I feel like the plausibility of arguments like the one in this post makes e/acc and e/acc-adjacent arguments sound a lot less crazy.
To the best of my awareness, there isn’t any demonstrated proper differential compute efficiency from latent reasoning to speak of yet. It could happen, it could also not happen. Even if it does happen, one could still decide to pay the associated safety tax of keeping the CoT.
More generally, the vibe of the comment above seems too defeatist to me; related: https://www.lesswrong.com/posts/HQyWGE2BummDCc2Cx/the-case-for-cot-unfaithfulness-is-overstated.
They also require relatively little compute (often around $1 for a training run), so AI agents could afford to test many ideas.
Ok, this seems surprisingly cheap. Can you say more about what such a $1 training run typically looks like (what the hyperparameters are)? I’d also be very interested in any analysis of how SAE (computational) training costs scale vs. base LLM pretraining costs.
I wouldn’t be surprised if SAE improvements were a good early target for automated AI research, especially if the feedback loop is just “Come up with idea, modify existing loss function, train, evaluate, get a quantitative result”.
This sounds spiritually quite similar to what’s already been done in Discovering Preference Optimization Algorithms with and for Large Language Models, and I’d expect something roughly like that to probably produce something interesting, especially if a training run only costs around $1.
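To make the kind of loop I have in mind concrete, here is a minimal sketch (assuming a PyTorch setup, cached activations, and purely illustrative hyperparameters; none of this is taken from the quoted work): train a small SAE where the loss function is the obvious piece for an automated researcher to modify, then report a couple of quantitative metrics.

```python
# Minimal, illustrative SAE training loop (hypothetical hyperparameters;
# not the exact setup from the quoted work).
import torch
import torch.nn as nn

d_model, d_sae, l1_coeff = 512, 8192, 1e-3  # assumed sizes / sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

def loss_fn(x, x_hat, f):
    # The piece an automated researcher would most plausibly modify:
    # reconstruction error plus an L1 sparsity penalty.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

for step in range(1000):               # short run, in the "~$1" spirit
    acts = torch.randn(4096, d_model)  # stand-in for cached LLM activations
    x_hat, f = sae(acts)
    loss = loss_fn(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Simple quantitative readouts to close the propose-train-evaluate loop.
with torch.no_grad():
    l0 = (f > 0).float().sum(dim=-1).mean()
print(f"final loss {loss.item():.4f}, mean L0 {l0.item():.1f}")
```

Something in this ballpark, run at small width for a few thousand steps, seems plausibly consistent with the quoted ~$1 figure, which is what would make the test-many-ideas-in-parallel picture attractive.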
Some additional evidence: o3 used 5.7B tokens per task to achieve its ARC score of 87.5%; it also scored 75.7% on low compute mode using 33M tokens per task:
It might also be feasible to use multimodal CoT, like in Imagine while Reasoning in Space: Multimodal Visualization-of-Thought, and consistency checks between the CoTs in different modalities. Below are some quotes from a related chat with Claude about this idea.
multimodal CoT could enhance faithfulness through several mechanisms: cross-modal consistency checking, visual grounding of abstract concepts, increased bandwidth costs for deception, and enhanced interpretability. The key insight is that coordinating deception across multiple modalities would be substantially more difficult than in text alone.
**Multimodal CoT for Enhanced Faithfulness: A Proposal**

**Core Idea**

Multimodal Chain-of-Thought (like MVoT) could enhance CoT faithfulness by making it harder for models to hide information or engage in steganography. The key insight is that forcing models to express their reasoning through both text and generated images creates additional constraints that make unfaithful reasoning more difficult.

**Mechanisms for Enhanced Faithfulness**

**1. Cross-Modal Consistency Checking**

When a model must express its reasoning in both text and images, we can verify that both modalities tell the same story:
- Any hidden information in the text would need to be consistently reflected in the generated images
- The required coordination across modalities makes steganography more challenging
- Inconsistencies between modalities could signal unfaithful reasoning

**2. Visual Grounding of Abstract Concepts**

Images force concrete visualization of abstract concepts:
- Makes it harder to maintain multiple contradictory lines of reasoning
- Abstract deceptions become more apparent when they must be visually represented
- Visual medium reduces ambiguity that could be exploited for steganography

**3. Increased Bandwidth Cost for Deception**

Hiding information across both text and images is more computationally expensive:
- Model must maintain coherence across a higher-dimensional space
- Steganographic signals need more sophisticated encoding to work across modalities
- Higher cognitive load makes deception more challenging

**4. Enhanced Interpretability**

Multimodal outputs provide more angles for oversight:
- Humans can cross-reference the text reasoning against visual representations
- Visual artifacts of deceptive reasoning may be easier to spot
- Multiple channels for detecting inconsistencies

**Implementation Approach**

Required Components:
- A multimodal model capable of generating both text and images
- Consistency checking mechanisms across modalities
- Visual grounding mechanisms for abstract concepts
- Multi-channel oversight tools

Training Strategy:
1. Start with models pre-trained on aligned multimodal data
2. Fine-tune with explicit rewards for cross-modal consistency
3. Penalize detected inconsistencies between modalities
4. Validate using human oversight across both channels

**Potential Challenges**

Technical Challenges:
- Ensuring high-quality image generation that accurately reflects reasoning
- Developing robust cross-modal consistency metrics
- Managing computational overhead of dual-modal generation

Validation Challenges:
- Defining ground truth for faithful multimodal reasoning
- Creating test cases that can detect cross-modal deception
- Measuring effectiveness compared to text-only approaches

**Research Directions**

Immediate Next Steps:
- Develop metrics for cross-modal consistency
- Create test suites for detecting multimodal steganography
- Build tools for human validation of multimodal reasoning
- Study failure modes specific to multimodal deception

Longer-term Research:
- Explore optimal balances between modalities
- Investigate transfer learning between modalities
- Study emergence of cross-modal deception strategies
- Develop automated detection of multimodal unfaithfulness

**Conclusion**

Multimodal CoT presents a promising approach for enhancing faithfulness by leveraging the constraints and verification opportunities provided by multiple modalities. While technical challenges remain, the potential benefits warrant further investigation and development.
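To gesture at what the cross-modal consistency checking above could look like in practice, here is a rough sketch that scores agreement between each CoT step’s text and its generated image with an off-the-shelf CLIP-style encoder; the model name and threshold are illustrative assumptions, and a check like this would obviously only catch coarse mismatches, not sophisticated steganography.

```python
# Hypothetical cross-modal consistency check: flag CoT steps whose text
# and generated image disagree (low CLIP similarity). Model name and
# threshold are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_score(step_text: str, step_image: Image.Image) -> float:
    inputs = processor(text=[step_text], images=step_image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (text_emb * image_emb).sum().item()  # cosine similarity

def flag_inconsistent_steps(cot_steps, threshold=0.2):
    """cot_steps: list of (step_text, PIL image) pairs from a multimodal CoT."""
    return [i for i, (txt, img) in enumerate(cot_steps)
            if consistency_score(txt, img) < threshold]
```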
At the very least, evals for automated ML R&D should be a very decent proxy for when it might be feasible to automate very large chunks of prosaic AI safety R&D.
We find that these evaluation results are nuanced and there is no one ideal SAE configuration—instead, the best SAE varies depending on the specifics of the downstream task. Because of this, we cannot combine the results into a single number without obscuring tradeoffs. Instead, we provide a range of quantitative metrics so that researchers can measure the nuanced effects of experimental changes.
It might be interesting (perhaps in the not-very-near future) to study whether automated scientists (maybe roughly in the shape of existing ones, like https://sakana.ai/ai-scientist/), using the evals as proxy metrics, might be able to come up with better (e.g. Pareto-improving) SAE architectures, hyperparameters, etc., and whether adding more metrics might help; as an analogy, this seems to be the case for using more LLM-generated unit tests for LLM code generation, see Dynamic Scaling of Unit Tests for Code Reward Modeling.
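For concreteness, the outer loop I’m imagining is something like the toy sketch below, where `propose_config`, `train_sae`, and `run_evals` are hypothetical placeholders for an LLM-based proposer, a cheap SAE training run, and an eval suite with several metrics, and the output is the Pareto front rather than a single number.

```python
# Toy outer loop for automated SAE research: propose -> train -> evaluate ->
# keep the Pareto front over several eval metrics. All components here
# (propose_config, train_sae, run_evals) are hypothetical placeholders.
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
Metrics = Dict[str, float]  # e.g. {"reconstruction": ..., "sparsity": ..., "probe_acc": ...}

def pareto_front(results: List[Tuple[Config, Metrics]]) -> List[Tuple[Config, Metrics]]:
    """Keep configs not dominated on every metric (higher = better assumed)."""
    def dominated(a: Metrics, b: Metrics) -> bool:
        return all(b[k] >= a[k] for k in a) and any(b[k] > a[k] for k in a)
    return [(c, m) for c, m in results
            if not any(dominated(m, other_m)
                       for _, other_m in results if other_m is not m)]

def automated_sae_search(
        propose_config: Callable[[List[Tuple[Config, Metrics]]], Config],
        train_sae: Callable[[Config], object],
        run_evals: Callable[[object], Metrics],
        budget: int = 100) -> List[Tuple[Config, Metrics]]:
    history: List[Tuple[Config, Metrics]] = []
    for _ in range(budget):
        cfg = propose_config(history)  # e.g. an LLM conditioned on past results
        sae = train_sae(cfg)           # the cheap (~$1-scale) training run
        history.append((cfg, run_evals(sae)))
    return pareto_front(history)
```

The analogy to Dynamic Scaling of Unit Tests would then be the empirical question of whether adding more metrics to `run_evals` makes the search more reliable, rather than just noisier.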
I expect that, fortunately, the AI safety community will be able to mostly learn from what people automating AI capabilities research and research in other domains (more broadly) will be doing.
It would be nice to have some hands-on experience with automated safety research, too, though, and especially to already start putting in place the infrastructure necessary to deploy automated safety research at scale. Unfortunately, AFAICT, right now this seems mostly bottlenecked on something like scaling up grantmaking and funding capacity, and there doesn’t seem to be enough willingness to address these bottlenecks very quickly (e.g. in the next 12 months) by e.g. hiring and / or decentralizing grantmaking much more aggressively.
From https://x.com/__nmca__/status/1870170101091008860:
o1 was the first large reasoning model — as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1
@ryan_greenblatt Shouldn’t this be interpreted as a very big update vs. the neuralese-in-o3 hypothesis?
Do you have thoughts on the apparent recent slowdown/disappointing results in scaling up pretraining? These might suggest very diminishing returns in scaling up pretraining significantly before 6e27 FLOP.
I’ve had similar thoughts previously: https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=rSDHH4emZsATe6ckF.
Gemini 2.0 Flash Thinking is claimed to ‘transparently show its thought process’ (in contrast to o1, which only shows a summary): https://x.com/denny_zhou/status/1869815229078745152. This might be at least a bit helpful in terms of studying how faithful (e.g. vs. steganographic, etc.) the Chains of Thought are.
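One cheap thing a fully visible thought process enables is truncation / perturbation tests in the spirit of the existing CoT-faithfulness literature: remove part of the visible reasoning and check how often the final answer survives. A rough sketch, with the model-calling function left as a placeholder:

```python
# Rough sketch of a truncation-based faithfulness probe on a visible CoT.
# `answer_given_cot(question, cot_prefix)` is a placeholder for calling the
# model with a (possibly truncated) chain of thought prepended.
from typing import Callable, List

def truncation_sensitivity(question: str,
                           full_cot: List[str],
                           final_answer: str,
                           answer_given_cot: Callable[[str, List[str]], str]) -> float:
    """Fraction of truncation points at which the final answer changes.

    Low sensitivity (the answer rarely changes even when most of the CoT is
    removed) is weak evidence that the visible CoT isn't doing the real work.
    """
    if not full_cot:
        return 0.0
    changed = 0
    for k in range(len(full_cot)):
        truncated = full_cot[:k]  # keep only the first k visible steps
        if answer_given_cot(question, truncated) != final_answer:
            changed += 1
    return changed / len(full_cot)
```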
In light of recent works on automating alignment and AI task horizons, I’m (re)linking this brief presentation of mine from last year, which I think stands up pretty well and might have gotten fewer views than ideal:
Towards automated AI safety research