> I don’t think I could name a working method for constructing a safe powerful mind.
I could, and my algorithm basically boils down to the following:
1. Specify a weak/limited prior over goal space, like the genome does.
2. Create a preference model by using DPO, RLHF, or whatever else suits your fancy to guide the intelligence into alignment with x values.
3. Use backpropagation to update the weights of the brain in the direction that best improves alignment.
4. Repeat until the loss is low, or until you can no longer optimize the objective.
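The loop above can be sketched in miniature. The toy below is a hedged illustration, not the actual recipe: the "mind" is just a linear scorer, the "weak prior" is a small random weight initialization, and the preference model is a Bradley–Terry objective (the same loss family underlying DPO/RLHF reward modeling), trained by gradient descent on synthetic preference pairs. All names and data here are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: a weak/limited "prior" over goal space -- near-zero init.
w = rng.normal(scale=0.01, size=4)

# Synthetic preference data: (preferred features, rejected features) pairs.
pairs = [(rng.normal(size=4) + 1.0, rng.normal(size=4) - 1.0)
         for _ in range(64)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Steps 2-4: Bradley-Terry preference loss, gradient step, repeat
# until the loss is low (fixed iteration budget here).
lr = 0.1
loss = np.inf
for _ in range(200):
    grad = np.zeros_like(w)
    loss = 0.0
    for x_pref, x_rej in pairs:
        margin = w @ (x_pref - x_rej)          # score gap, preferred vs. rejected
        p = sigmoid(margin)                    # P(preferred beats rejected)
        loss += -np.log(p + 1e-12)             # negative log-likelihood
        grad += -(1.0 - p) * (x_pref - x_rej)  # d(-log sigmoid(margin))/dw
    loss /= len(pairs)
    grad /= len(pairs)
    w -= lr * grad                             # step 3: gradient update

accuracy = np.mean([w @ (xp - xr) > 0 for xp, xr in pairs])
print(f"final loss: {loss:.4f}, preference accuracy: {accuracy:.2f}")
```

The point of the sketch is only the shape of the procedure: a constrained starting point, a preference signal, and an iterated gradient update toward it.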