I don’t think I could name a working method for constructing a safe powerful mind. What I want to say is more like: if you want to deconfuse some core AI x-risk problems, you should deconfuse your basic reasons for worry and your core frames first; otherwise you’re building on air.
I could, and my algorithm basically boils down to the following (a rough sketch in code follows the list):
1. Specify a weak/limited prior over goal space, like the genome does.
2. Create a preference model using DPO, RLHF, or whatever else suits your fancy to guide the intelligence into alignment with the desired values.
3. Use backpropagation to update the brain's weights in the direction that reduces the alignment loss.
4. Repeat until the loss is low, or until you can no longer make progress on the objective.
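To make the loop concrete, here is a minimal sketch of what steps 2–4 might look like with a DPO-style objective. Everything in it is illustrative rather than a claim about a real implementation: the `policy` and `reference` objects, their `logp` method, and the `preference_batches` iterable are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # DPO objective: push the policy to assign relatively more probability
    # to the preferred response than to the rejected one, measured against
    # a frozen reference model so the update stays near the starting prior.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def preference_finetune(policy, reference, preference_batches, epochs=3, lr=1e-5):
    # `policy` and `reference` are assumed to be nn.Module-like objects with a
    # hypothetical .logp() method returning per-example sequence log-probs.
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):                      # "repeat until the loss is low"
        for batch in preference_batches:         # each batch: (chosen, rejected) pairs
            with torch.no_grad():                # reference model stays frozen
                ref_chosen = reference.logp(batch.chosen)
                ref_rejected = reference.logp(batch.rejected)
            loss = dpo_loss(policy.logp(batch.chosen),
                            policy.logp(batch.rejected),
                            ref_chosen, ref_rejected)
            loss.backward()                      # backprop step from the list above
            optimizer.step()
            optimizer.zero_grad()
    return policy
```

In this sketch the frozen reference model also does double duty as the "weak/limited prior" of step 1: the DPO loss penalizes drifting far from it, which is the closest analogue the sketch has to the genome's role.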