This post resonated with me, and I think it's a great direction to draw inspiration from!
A big driver of Goodharting in RL is that you're handcrafting a utility function. The wisdom traditions, by contrast, encourage us to explore and gain insight into different ideas, forming our utility function over time.
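To make the handcrafted-reward worry concrete, here's a minimal toy sketch (my own construction, not anything from the post; the `true_utility` / `proxy_reward` names and the cleaning scenario are made up for illustration):

```python
# Toy sketch of the handcrafted-reward problem: a handcrafted proxy reward
# ranks a reward-hacking policy above an honest one, even though the hacker
# scores far worse on what we actually want.

def true_utility(room_cleanliness: float) -> float:
    """What we actually care about: how clean the room ends up."""
    return room_cleanliness

def proxy_reward(dirt_cleared_events: int) -> float:
    """The handcrafted measure: count of 'dirt cleared' events."""
    return float(dirt_cleared_events)

# Honest policy: cleans the room once and stops.
honest = {"cleanliness": 1.0, "events": 1}

# Goodharting policy: scatters and re-clears the same dirt over and over.
gamer = {"cleanliness": 0.2, "events": 50}

for name, policy in [("honest", honest), ("gamer", gamer)]:
    print(f"{name}: proxy={proxy_reward(policy['events'])}, "
          f"true={true_utility(policy['cleanliness'])}")

# The proxy prefers the gamer (50 > 1) while the true utility prefers the
# honest policy (1.0 > 0.2): the Goodhart failure mode in miniature.
```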
Therefore, I feel that setting up the right training environment together with some wisdom principles might be enough to create wise AI.
We, of course, run into all the usual inner alignment and deceptive-alignment-during-training problems, but it still seems like the direction to go in. I don't think the orthogonality thesis is fully true or false; it depends on the training environment, and if we can craft the right one, I think we can have wise AI that wants to create the most loving and kind future imaginable.