This doesn’t mean specification gaming is impossible, but hopefully we’d find a way to make it less likely with a sound definition of what “trust” really means.
I think the interesting part of alignment is in defining “trust” in a way that resists reward hacking/specification gaming, which has been assumed away in this post. I mentioned a pivotal act, defined as an action that has a positive impact on humanity even a billion years away, because that’s the end goal of alignment. I don’t see this post getting us closer to a pivotal act because, as mentioned, the interesting bits have been assumed away.
That said, this is a well-thought-out post, and I didn’t see the usual errors of a post like this (e.g., not thinking about specification at all, not considering how you measure “trust”, etc.).
Thank you! You’re absolutely right, we left out the “hard part”, mostly because it’s the really hard part and we don’t have a solution for it. Maybe someone smarter than us will find one.