How would you goodhart this metric? To be clear, you want to map x to f(x), but this takes a second of your time. You pay them to map x to f(x), but they map x to g(x). After they’re done mapping 100 x to g(x), you select a random 10 of those 100, spend 10 seconds to calculate the corresponding g(x)-f(x), and pay them more the smaller the absolute difference.
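A minimal sketch of that spot-check scheme (the 100/10 split, the payment rule, and the example f and g below are illustrative assumptions, not a fixed protocol):

```python
import random

def payment_for(xs, f, g, sample_size=10, base_pay=100.0):
    """Pay more the smaller the average |g(x) - f(x)| on a random sample."""
    checked = random.sample(xs, sample_size)            # pick 10 of the 100 x
    avg_error = sum(abs(g(x) - f(x)) for x in checked) / sample_size
    return base_pay / (1.0 + avg_error)                 # pay shrinks as error grows

# Hypothetical example: the worker rounds instead of computing the true value.
f = lambda x: x * 1.07          # the mapping you actually want
g = lambda x: round(x * 1.07)   # what the worker delivered
print(payment_for(list(range(100)), f, g))
```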
I’m trying to think through your point using the above stapler and office supplies example.
If you hired an aggressively lazy AI to buy office supplies for you and you told it:
Buy me a stapler.
Then it might buy a stapler from the nearest trash can in exchange for one speck of dirt.
Then, you would go to Staples and buy a new ergonomically friendly stapler that can handle 20 sheets (model XYZ from Brand ABC) using your credit card with free five-day shipping.
You proposed that we would reward the AI by calculating the distance between the two sets of actions.
I don’t see a way for you to avoid having to invent a friendly AI to solve the problem.
Otherwise, you will inevitably leave out a metric (e.g. oops, I didn’t have a metric for don’t-speed-on-roads, so now the robot has a thousand-dollar speeding ticket after optimizing for shipping speed).
I suppose for messy real-world tasks, you can’t define distances objectively ahead of time. You could simply check a random 10 of the (x, f(x)) pairs and choose how much to pay. In an ideal world, if they think you’re being unfair, they can stop working for you. In this world, where giving someone a job is a favor, they could go to a judge to have your judgement checked.
Though if we’re talking about AIs: You could have the AI output a probability distribution g(x) over possible f(x) for each of the 100 x. Then for a random 10 x, you generate an f(x) and reward the AI according to how much probability it assigned to what you generated.
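As a rough sketch of that scoring idea (report(), true_answer(), and the placeholder distribution are hypothetical stand-ins, not anything specified above):

```python
import random

def true_answer(x):
    """Your own f(x), computed only for the spot-checked x."""
    return x % 3

def report(x):
    """The AI's reported distribution g(x) over possible values of f(x)."""
    return {0: 0.5, 1: 0.3, 2: 0.2}   # placeholder distribution

def reward(xs, sample_size=10):
    checked = random.sample(xs, sample_size)
    # Reward = probability mass the AI assigned to the value you generated.
    return sum(report(x).get(true_answer(x), 0.0) for x in checked)

print(reward(list(range(100))))
```

Rewarding with the log of that probability rather than the raw probability would make this a proper scoring rule, so the AI’s best strategy is to report its honest distribution.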
Then I have a better answer for the question about how I would Goodhart things.
Let U_Gurkenglas (U_G) = the set of all possible x that may be randomly checked as you described.
Let U = the set of all possible metrics in the real world (a superset of U_G).
For a given action A, optimize for U_G and ignore the metrics in U that are not in U_G.
You will be unhappy to the degree that the ignored subset contains things that you wish were in U_G. But until you catch on, you will be completely satisfied by the perfect application of all x in U_G.
To put this in concrete terms, if you don’t have a metric for “nitrous oxide emissions” because it’s the 1800s, then you won’t have any way to disincentivize an employee who races around the countryside driving a diesel truck that ruins the air.
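A toy illustration of that failure mode (the metric names and scores below are invented for the example, not taken from the discussion): an agent that scores actions only on the checked set U_G will happily pick an action that is disastrous on everything in U outside U_G.

```python
U_G = {"shipping_speed", "price"}   # the metrics you actually spot-check

def scores_for(action):
    """Hypothetical per-metric scores for each candidate action."""
    return {
        "drive_legally":  {"shipping_speed": 0.6, "price": 0.8, "no_speeding": 1.0, "air_quality": 1.0},
        "race_the_truck": {"shipping_speed": 1.0, "price": 0.9, "no_speeding": 0.0, "air_quality": 0.0},
    }[action]

def checked_score(action):
    # The agent only counts metrics that get measured.
    return sum(scores_for(action)[m] for m in U_G)

best = max(["drive_legally", "race_the_truck"], key=checked_score)
print(best)   # "race_the_truck": perfect on U_G, terrible on the ignored metrics
```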
(the mobile editor doesn’t have any syntax help; I’ll fix formatting later)
That’s a bad example. In the 1800s, no amount of caring would have resulted in the person choosing a car with lower nitrous oxide emissions, because that wasn’t among the things people thought about.
Ability to predict is distinct from actual caring. It doesn’t get you around people trying to goodhart metrics.