Artyom Karpov

Karma: 29

www.artkpv.net

Artyom Karpov Apr 16, 2025, 1:09 PM
2 points
0
in reply to: Fabien Roger’s comment on: Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Still, it is unclear why reasoning on math problems would develop steganography, other than to shorten the scratchpad. I’m more curious about scenarios with some minimax-like problem. For example, in Baker et al., ‘Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation’, they fine-tune models while monitoring their CoT and penalize CoT that aims to get reward for passing unit tests by skipping them (reward hacking). That leads to models that output a good-looking CoT but still skip the unit tests, so they avoid the penalty and get the reward. I think their CoT does not contain hidden reasoning, but I guess such pressure might lead to it if the output requires some reasoning.

Artyom Karpov Apr 15, 2025, 9:03 AM
1 point
0
AF
on: Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Thanks for publishing this! I think these are interesting results, because the work tests whether models lose performance on modified reasoning when their hidden reasoning is impeded by paraphrasing or by removing words. I like that you tried different models and different ways to damage hidden reasoning. My main concern is that the experiments use math problems, which might not require hidden reasoning from models (it would also be nice to learn the RL training approach for Claude and why you think it could lead to hidden math reasoning). Specifically, I don’t know whether math reasoning is penalized for length during training, which could elicit hidden reasoning. Without such a penalty, I don’t see why Claude would develop hidden reasoning for this. Another concern is that Claude 3.5 was used to paraphrase the reasoning, which might leave the alleged hidden reasoning undamaged: the 3.5 model might already have learned the same hidden reasoning and passed it to the 3.7 model, so the paraphrase would preserve it.

Artyom Karpov Feb 14, 2025, 8:15 AM
1 point
0
on: Tips and Code for Empirical Research Workflows
I find it useful to rely on text-based tools more than UI/UX-based tools, as they integrate faster with LLMs. For example, AiChat (https://github.com/sigoden/aichat/) does many things: chat with most (all?) models from the command line or your text editor, upload files (PDFs, JPGs, etc.), execute bash commands, and more. It reads from stdin and writes to stdout, so you can chain it with your other tools.

Artyom Karpov Dec 21, 2024, 9:43 AM
4 points
0
on: Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet
I like the task you used in this project to verify steganography: it requires the model to keep state to calculate the final result, and it is highly unlikely that the model learned to do this before seeing the task. I wonder if models can actually do this task when we allow them to use CoT. Also, I think models might solve this task with their own encoding scheme if they know it well, but we need to find that scheme. So perhaps if we first ask a model to think of an encoding scheme and then ask it to use that scheme for the task, it might succeed.

Artyom Karpov Aug 30, 2024, 2:46 AM
1 point
0
on: Proposal for Inducing Steganography in LMs
Thank you for posting this. Why do you think this is evidence of steganography in LLMs? Those steg tokens would be unrelated to the question being asked, and as such would be out of the usual distribution and easily noticed by an eavesdropper. Still, this is good evidence for hidden reasoning inside CoT. I think this experiment was done in Pfau, Merrill, and Bowman, ‘Let’s Think Dot by Dot’, https://arxiv.org/abs/2404.15758.

CCS on compound sentences

Artyom Karpov May 4, 2024, 12:23 PM

6 points

0 comments, 9 min read, LW link

Artyom Karpov Mar 4, 2024, 11:59 AM
2 points
0
in reply to: Charlie Steiner’s comment on: Inducing human-like biases in moral reasoning LMs
Thanks for your comment. This was hard work for us over weeks/months. Unfortunately, we haven’t yet included the part about how we calculated the brain score in this text, though you can find it in our code, which should match the way others calculate it (see our references). The models with ‘none’ fine-tuning have a somewhat higher brain score, but the difference from the other models is within the error range, partly because we didn’t run enough calculations for ‘none’ to reduce its std. Also, our main target was accuracy on the ETHICS dataset.

Inducing human-like biases in moral reasoning LMs

Artyom Karpov, Austin Meek, Bogdan Ionut Cirstea and SCho

Feb 20, 2024, 4:28 PM

23 points

3 comments, 14 min read, LW link

Artyom Karpov Feb 1, 2024, 12:04 PM
1 point
0
on: Some negative steganography results
Thanks for the post! I understood the first experiment, where the aim was to make a model produce a stegotext with the color in it. I don’t understand the setup for the second experiment. I see the aim was to induce the encoding of whether a user is under or over 40 years old; however, it is not clear what exactly the reward is here, which model gives the reward, and what the action is.

> “Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion”
Does this mean that in the second step GPT-3.5 generated ‘yes/no’ and was then fine-tuned on this answer?
So the whole idea is that the ‘reward’ here is keeping the ‘good’ completions for the next round, as determined by GPT-4 (the reward model), right?

How important is AI hacking as LLMs advance?

Artyom Karpov Jan 29, 2024, 6:41 PM

1 point

0 comments, 6 min read, LW link

Artyom Karpov Jan 8, 2024, 2:00 PM
1 point
0
on: SociaLLM: a language model design for personalised apps, social science, and AI safety research
That sounds ambitious and great, thanks for posting. What’s the budget estimate for the fine-tuning part?
> Training this model would cost from 2 times (on a purely 1-1 dialogue data) to ~10-15 times (on chat room and forum data where messages from the most active users tend to be mixed very well) more than the training of the current LLMs.
The current Llama 2 was pretrained like this:
> Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB
As per “Llama 2: Open Foundation and Fine-Tuned Chat Models | Research—AI at Meta,” July 2023. https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/.
An A100 costs about $1 per hour, see https://vast.ai/pricing . So the cost of this model would be 3.3M-33M USD? That seems affordable for Google, Meta, etc., but for a grant of at most 100K USD?
So perhaps update this project to fine-tune existing models. For classification alone, perhaps some BERT-like model would do, such as DeBERTa or similar.
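The arithmetic behind this estimate can be sketched as below. This is only a back-of-the-envelope calculation: the GPU-hour figure is the one quoted from the Llama 2 paper, the $1 per A100 hour price is the rough rental rate assumed above, and the multipliers (2x, 10x, 15x) come from the quoted post. The function and constant names are made up for illustration. The 3.3M-33M range above corresponds to the 1x and 10x rows.

```python
# Back-of-the-envelope training cost; every input is an assumption from the comment.
LLAMA2_GPU_HOURS = 3.3e6    # A100-80GB hours quoted for Llama 2 pretraining
USD_PER_GPU_HOUR = 1.0      # rough A100 rental price assumed in the comment
GRANT_BUDGET_USD = 100_000  # maximum grant size mentioned in the comment


def training_cost_usd(multiplier: float) -> float:
    """Cost of a run that needs `multiplier` times the Llama 2 pretraining compute."""
    return LLAMA2_GPU_HOURS * USD_PER_GPU_HOUR * multiplier


if __name__ == "__main__":
    for m in (1, 2, 10, 15):
        cost = training_cost_usd(m)
        ratio = cost / GRANT_BUDGET_USD
        print(f"{m:>2}x compute: ${cost / 1e6:.1f}M, about {ratio:.0f}x the grant budget")
```

Even the 1x case is more than an order of magnitude above a 100K USD grant under these assumptions, which is why fine-tuning an existing model looks more realistic.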

Artyom Karpov Nov 13, 2023, 3:58 PM
1 point
0
on: Open Agency model can solve the AI regulation dilemma
> All services are forced to be developed by independent business or non-profit entities by antitrust agencies, to prevent the concentration of power.
What do you think are realistic ways to enforce this at a global level? It seems the UN can’t enforce regulations worldwide, and the USA and EU act only within their own jurisdictions. Others could catch up but are somewhat unlikely to do so.

Artyom Karpov Sep 9, 2023, 8:54 AM
1 point
0
on: Ground-Truth Label Imbalance Impairs Contrast-Consistent Search Performance
Thanks for posting this! It seems important to balance the dataset before training CCS probes.
Another strange thing is that the accuracy of CCS degrades for autoregressive models like GPT-J and LLaMA. For GPT-J it is close to chance-level performance, about 50-60%, as per the DLK paper (Burns et al., 2022). And in the ITI paper (Li et al., 2023) they chose a linear regression probe instead of CCS, and say that CCS was so poor that it was near random (same as in the DLK paper). Do you have thoughts on that? Perhaps they used bad datasets, in light of your research?

My (naive) take on Risks from Learned Optimization

Artyom Karpov Oct 31, 2022, 10:59 AM

7 points

0 comments, 5 min read, LW link
