Ok, options:
- Review of 108 AI alignment plans
- write-up of Beyond Distribution, the planned benchmark for alignment evals beyond a model's distribution; send it to the quant who just joined the team and wants to make it
- get familiar with the TPUs I just got access to
- run HHH and its variants, testing the idea behind Beyond Distribution; maybe make a guide on it
- continue improving site design
- fill out the form I said I was going to fill out and send today
- make progress on crosscoders (would probably need to get familiar with those TPUs)
- write-up of ai-plans: the goal, the team, what we're doing, what we've done, etc.
- write-up of the karma/voting system
- the video on how to do backprop by hand (worked example sketched below)
- tutorial on how to train an SAE (rough training sketch below)
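For the backprop-by-hand video, this is roughly the worked example I have in mind: a minimal sketch with a one-hidden-unit sigmoid network and squared-error loss (the setup and notation are just illustrative placeholders).

```latex
% Toy setup: h = \sigma(w_1 x + b_1), \hat{y} = w_2 h + b_2, L = \tfrac{1}{2}(\hat{y} - y)^2.
% Backprop is the chain rule applied from the output backwards:
\begin{align*}
  \frac{\partial L}{\partial \hat{y}} &= \hat{y} - y \\
  \frac{\partial L}{\partial w_2} &= (\hat{y} - y)\,h,
    & \frac{\partial L}{\partial b_2} &= \hat{y} - y \\
  \frac{\partial L}{\partial h} &= (\hat{y} - y)\,w_2 \\
  \frac{\partial L}{\partial w_1} &= (\hat{y} - y)\,w_2\,\sigma'(w_1 x + b_1)\,x,
    & \frac{\partial L}{\partial b_1} &= (\hat{y} - y)\,w_2\,\sigma'(w_1 x + b_1)
\end{align*}
% with \sigma'(z) = \sigma(z)(1 - \sigma(z)).
```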
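And for the SAE tutorial, roughly the shape of the training loop I'd walk through (a minimal PyTorch sketch; the dimensions, the L1 coefficient, and the random `activations` tensor are placeholders standing in for cached activations from a real model):

```python
# Minimal sparse autoencoder (SAE) training sketch on cached model activations.
# Assumes PyTorch; all sizes and hyperparameters below are illustrative.
import torch
import torch.nn as nn

d_model, d_hidden = 768, 768 * 8   # placeholder dims; match your model's residual stream
l1_coeff = 1e-3                    # sparsity penalty strength (a key knob to tune)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activations
        return x_hat, f

sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# In a real run, `activations` would be collected by running the base model over a corpus.
activations = torch.randn(4096, d_model)  # placeholder batch for illustration

for step in range(1000):
    batch = activations[torch.randint(0, len(activations), (256,))]
    x_hat, f = sae(batch)
    recon_loss = (x_hat - batch).pow(2).mean()   # reconstruction error
    sparsity_loss = f.abs().mean()               # L1 penalty pushes features toward sparsity
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The L1 coefficient trades reconstruction quality against feature sparsity; that trade-off would be the main thing the tutorial explains.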
I think the Beyond Distribution write-up. He's waiting and I feel bad.
btw, thoughts on this for ‘the alignment problem’?
“A robust, generalizable, scalable method to make an AI model which will do set [A] of things as much as it can and not do set [B] of things as much as it can, where you can freely change [A] and [B].”
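One rough way to write that down as an objective, purely as an illustrative sketch (the policy π and the probability notation are mine, not part of the definition):

```latex
% Illustrative only: the method should produce a policy that maximizes doing
% things in A and avoids doing things in B, for whatever A and B you choose.
\[
  \pi^{*}(A, B) \;=\; \arg\max_{\pi}
  \Bigl( \sum_{a \in A} \Pr\nolimits_{\pi}[a] \;-\; \sum_{b \in B} \Pr\nolimits_{\pi}[b] \Bigr),
  \qquad \text{for any choice of } A \text{ and } B .
\]
```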
Freely changing an AGI's goals is corrigibility, which is a huge advantage if you can get it. See Max Harms' corrigibility sequence and my “instruction-following AGI is easier....”
The question is how to reliably get such a thing. Goalcrafting is one part of the problem, and I agree that those are good goals; the other and larger part is technical alignment, getting those desired goals to really work that way in the particular first AGI we get.
Yup, those are hard. Was just thinking of a definition for the alignment problem, since I’ve not really seen any good ones.
I’d say you’re addressing the question of goalcrafting or selecting alignment targets.
I think you've got the right answer for technical alignment goals, but the question remains of which humans would control that AGI. See my “if we solve alignment, do we all die anyway” for the problems with that scenario.
Spoiler alert: we do all die anyway if really selfish people get control of AGIs. And selfish people tend to work harder at getting power.
But I do think your goal definition is a good alignment target for the technical work. I don't think there's a better one. I do prefer instruction-following or corrigibility, by the definitions in the posts I linked above, because they're less rigid, but they're both very similar to your definition.
I pretty much agree. I prefer rigid definitions because they’re less ambiguous to test and more robust to deception. And this field has a lot of deception.
I think this is a really good opportunity to work on a topic you might not normally work on, with people you might not normally work with, and have a big impact: https://lu.ma/sjd7r89v
I’m running the event because I think this is something really valuable and underdone.
Give better names to actual formal math things, Jesus Christ.
I’m finally reading The Sequences and it screams midwittery to me, I’m sorry.
Compare this:
to Jaynes:
Jaynes is better organized, more respectful to the reader, more respectful to the work he's building on, and more useful.
The Sequences highly praise Jaynes and recommend reading his work directly.
The Sequences aren’t trying to be a replacement, they’re trying to be a pop sci intro to the style of thinking. An easier on-ramp. If Jaynes already seems exciting and comprehensible to you, read that instead of the Sequences on probability.
Fair enough. Personally, so far, I’ve found Jaynes more comprehensible than The Sequences.
I think most people with a natural inclination towards math probably would feel likewise.