All I want for Christmas is a “version for engineers.” Here’s how we constructed the reward, here’s how we did the training, here’s what happened over the course of training.
For sure, I greatly underestimated the importance of legible and concise communication in the increasingly crowded and dynamic space that is alignment. Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit of work that focused more on the conceptual and theoretical than on the applied, a goal for which I think it was well suited (e.g. introducing an epistemological theory with direct applications to alignment).
My current impression is that the algorithm for deciding who wins an argument is clever, if computationally expensive, but you don’t have a clever way to turn this into a supervisory signal, instead relying on brute force (which you don’t have much of).
You mean ArgRank (i.e. PageRank on the argument graph)? The idea was to simply use ArgRank to assign rewards to individual utterances, then use the resulting context-utterance-reward triples as experiences for RL. After collecting experiences, update the weights, and repeat. Now, though, I’d rather do PEFT on the top utterances as a kind of expert iteration, which would also make it feasible to store previous model versions for league training (e.g. by just storing LoRA weight diffs).
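To make the reward-assignment step concrete, here’s a toy sketch of ArgRank as plain power-iteration PageRank over an argument graph whose nodes are utterances. This is my own illustrative code, not the project’s actual implementation; the graph encoding and all names are assumptions.

```python
# Toy ArgRank sketch: PageRank by power iteration. An edge (u, v)
# means utterance u cites/supports utterance v. Illustrative only;
# dangling nodes simply leak mass in this simplified version.

def argrank(edges, nodes, damping=0.85, iters=50):
    """Return a PageRank-style reward for each utterance."""
    out_deg = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for u, v in edges:
        out_deg[u] += 1
        incoming[v].append(u)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) / len(nodes)
            + damping * sum(rank[u] / out_deg[u] for u in incoming[n])
            for n in nodes
        }
    return rank

# Toy argument graph: utterance "c" is cited by both "a" and "b",
# so it ends up with the highest reward.
rewards = argrank([("a", "c"), ("b", "c"), ("c", "a")], ["a", "b", "c"])
```

Each utterance’s score would then be paired with its context to form the context-utterance-reward triples used as RL experiences.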
I didn’t see where you show that you managed to actually make the LLMs better arguers.
Indeed, preliminary results are poor, and the bar was set pretty low at “somehow make these ideas run in this setup.” For now, I’d drop ArgRank and instead use traditional methods from computational argumentation on an automatically encoded argument graph (see 5.2), then PEFT on the winning parties. But I’m also interested in extending CCS-like tools to improve ArgRank (see 2.5). I’m applying to AISC9 for related follow-up work (among other things), and I’d find it really valuable if you could give me some feedback on the proposal summary. Could I send you a DM with it?
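As a concrete example of what I mean by traditional methods from computational argumentation: computing the grounded extension of a Dung-style abstract argumentation framework. This is toy code of my own for illustration, not anything from the booklet.

```python
# Toy Dung-style abstract argumentation framework. attacks is a list
# of pairs (u, v) meaning "argument u attacks argument v".

def grounded_extension(arguments, attacks):
    """Least fixed point of the characteristic function: accept an
    argument iff every one of its attackers is itself attacked by an
    already-accepted argument."""
    attackers = {a: set() for a in arguments}
    for u, v in attacks:
        attackers[v].add(u)
    accepted = set()
    while True:
        defended = {
            a for a in arguments
            if all(attackers[atk] & accepted for atk in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended

# Toy debate: "a" attacks "b", "b" attacks "c". "a" is unattacked,
# "b" is defeated by "a", and "c" is defended by "a".
winners = grounded_extension(["a", "b", "c"], [("a", "b"), ("b", "c")])
# winners == {"a", "c"}
```

The utterances of the surviving (“winning”) arguments would then be the PEFT targets.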
Connection between winning an argument and finding the truth continues to seem plenty breakable both in humans and in AIs.
Is it because of obfuscated arguments and deception, or is there some other, more fundamental issue that makes you see it that way?
Future outputs will at the very least include an accompanying paper-overview-in-a-post, and in general a stronger focus on self-contained papers. I see the booklet as a preliminary, highly exploratory bit of work that focused more on the conceptual and theoretical than on the applied, a goal for which I think it was well suited (e.g. introducing an epistemological theory with direct applications to alignment).
Sounds good. I enjoyed at least 50% of the time I spent reading the epistemology :P I just wanted a go-to resource for specific technical questions.
Could I send you a DM with it?
Sure, but no promises on interesting feedback.
Connection between winning an argument and finding the truth continues to seem plenty breakable both in humans and in AIs.
Is it because of obfuscated arguments and deception, or is there some other, more fundamental issue that makes you see it that way?
Deception’s not quite the right concept. It’s more like exploitation of biases and other weaknesses. This can look like deception, or it can look like incentivizing an AI to “honestly” search for arguments in a way that just so happens to be shaped by the argument-evaluation process’s standards other than truth.
Thanks a lot for the feedback!