I’ve got a slightly terrifying hail mary “solve alignment with this one weird trick”-style paradigm I’ve been mulling over for the past few years, which seems to have the potential to solve corrigibility and a few other major problems (notably value loading without Goodharting, using an alternative to CEV which seems drastically easier to specify). There are a handful of challenging things needed to make it work, but they look to me maybe more achievable than the other proposals I’ve read which seem like they could scale to superintelligence.
Realistically I am not going to publish it anytime soon given my track record, but I’d be happy to have a call with anyone who’d like to poke my models and try and turn it into something. I’ve had mildly positive responses from explaining it to Stuart Armstrong and Rob Miles, and everyone else I’ve talked to about it at least thought it was creative and interesting.
I’ve updated my meeting times to meet more this week, if you’d like to sign up for a slot (link w/ a pun), and from his comment, I’m sure diffractor would also be open to meeting.
I will point out that there’s a confusion in terms that I noticed in myself of corrigibility meaning either “always correctable” or “something like CEV”, though we can talk that over on a call too :)
Cool, booked a call for later today.