And you can’t determine if it’s safe by examining or understanding it’s utility function, if the abstraction is so loose as to not be align-able.
Its not really an abstraction at all in this case, it literally has a utility function. What rates highest on its utility function is returning whatever token is ‘most likely’ given it’s training data.
And you can’t determine if it’s safe by examining or understanding it’s utility function, if the abstraction is so loose as to not be align-able.
Its not really an abstraction at all in this case, it literally has a utility function. What rates highest on its utility function is returning whatever token is ‘most likely’ given it’s training data.