The resource gathering agent is completely indifferent to anything that is not causally connected to the source of information about v or −v.
No. The agent will still want to gather resources (that’s the point). For instance, if you expect to discover tomorrow whether you hate puppies or love them, you would still want to get a lot of money, become world dictator, etc… And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.
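To spell the intuition out as a toy expected-utility calculation (the 50/50 split and all the numbers below are illustrative assumptions of mine, not part of the setup): gathering resources pays off whichever sign of v gets revealed, while pushing v directly washes out in expectation.

```python
# A toy expected-utility version of the puppies example.  The 50/50 split,
# the numbers, and the assumption that resources convert straight into v
# (or -v) are all illustrative, not part of the actual setup.

def expected_value(action):
    """Expected utility today, before learning whether the true utility
    is v or -v (each with probability 1/2)."""
    total = 0.0
    for v_sign in (+1, -1):  # which utility gets revealed tomorrow
        if action == "gather_resources":
            # 10 units of resources can later be spent pushing v in
            # whichever direction turns out to be preferred.
            total += 10
        elif action == "increase_v_now":
            # A one-off +1 to v is worth +1 under v and -1 under -v,
            # plus the single unit of resources the agent already has.
            total += v_sign * 1 + 1
    return total / 2

print(expected_value("gather_resources"))  # 10.0
print(expected_value("increase_v_now"))    # 1.0: the +/-1 cancels out
```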
I think this gets into ontology identification issues.
I’m trying to design it so that it doesn’t. If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M(ϵu+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome. I’m trying to say that “lobotomizing everyone has a large expected impact” rather than defining what lobotomizing means.
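A minimal sketch of what I mean, where the candidate utilities, the achievable-E[v] numbers, and the function names are all made up purely for illustration:

```python
# A minimal sketch of the "large expected impact" idea: instead of defining
# "lobotomize", measure how much an action changes what a hypothetical M(v)
# maximiser could subsequently achieve, averaged over many candidate v's.
# The utilities, the achievable-E[v] table, and the numbers are all
# illustrative assumptions.

def impact(action, utilities, achievable_Ev):
    """Average absolute change in the E[v] a v-maximiser could reach,
    comparing the world after `action` with the default world."""
    return sum(
        abs(achievable_Ev[(action, v)] - achievable_Ev[("default", v)])
        for v in utilities
    ) / len(utilities)

utilities = ["paperclips", "smiles", "art"]
achievable_Ev = {
    # After lobotomizing everyone there is no human opposition, so a
    # v-maximiser's achievable E[v] jumps for every candidate v.
    ("lobotomize", "paperclips"): 10.0, ("default", "paperclips"): 2.0,
    ("lobotomize", "smiles"): 9.0,      ("default", "smiles"): 3.0,
    ("lobotomize", "art"): 8.0,         ("default", "art"): 1.0,
}

print(impact("lobotomize", utilities, achievable_Ev))  # 7.0 -> large impact
```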
And the agent does expect (counterfactually or hypothetically or whatever you want to call the false miracle approach) to exist in the future.
I see. I still think this is different from getting vetoed by both M(ϵu+v) and M(ϵu−v). Suppose S(u) will take some action that increases v by one and has no other effect. That action will be vetoed by M(ϵu−v), but a resource gathering agent will not care. This makes sense: being subject to vetoes from both M(ϵu−v) and M(ϵu+v) imposes two constraints, while being subject only to a resource gathering agent's veto imposes one.
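To make the contrast concrete, here is a toy sketch (the ϵ value, the deltas, and the resources variable are my own illustrative assumptions):

```python
# A toy sketch of the three conditions, assuming M(eps*u+v) and M(eps*u-v)
# veto any action that lowers their own expected utility, and that the
# resource gathering agent only tracks a "resources" variable.  The epsilon
# value and all the deltas are illustrative assumptions.

EPSILON = 0.01  # weight on u in eps*u + v and eps*u - v

def vetoed_by_plus_v(delta_u, delta_v):
    # M(eps*u + v) vetoes anything that lowers its expected utility.
    return EPSILON * delta_u + delta_v < 0

def vetoed_by_minus_v(delta_u, delta_v):
    # M(eps*u - v) vetoes anything that lowers its expected utility.
    return EPSILON * delta_u - delta_v < 0

def resource_agent_objects(delta_resources):
    # The resource gathering agent only objects if the action reduces
    # the resources it could later spend on v or -v.
    return delta_resources < 0

# The action from the example: +1 to v, no effect on u or on resources.
delta_u, delta_v, delta_resources = 0.0, 1.0, 0.0

print(vetoed_by_plus_v(delta_u, delta_v))       # False
print(vetoed_by_minus_v(delta_u, delta_v))      # True  -> blocked
print(resource_agent_objects(delta_resources))  # False -> indifferent
```

The two vetoes together pin E[v] from both sides, which is exactly the extra constraint the resource gathering agent doesn't impose.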
If for example you kill everyone (or lobotomize them or whatever) then a putative M(v) (or M(ϵu+v)) agent will find it a lot easier to maximise v, as there is less human opposition to overcome.
That makes sense. I actually didn’t understand the part in the original post where M(ϵu+v) expects to exist in the future, and assumed that it had no abilities other than veto power.
Still, I think it might be possible to fix E[v] for almost all v while destroying human value. Let f be the distribution over v that M(u−v) uses, and suppose M(u−v) can compute the default E[v] for each v (e.g. because it has a distribution over which value function a future singleton will have). Then it will construct a distribution that mixes ϵu+v and ϵu−v for almost all v in the support of f, sample from this distribution, and allow the resulting utility function to take over the world. To keep each E[v] almost the same, it can adjust the probabilities assigned to ϵu+v and ϵu−v appropriately for almost all v in the support of f. It might sometimes be necessary to include convex combinations ϵu+θv_1+(1−θ)v_2 as well, but this doesn't significantly change things. Now, with high probability, ϵu±v will take over the world for some v that is not close to human values.
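Here is a toy numeric version of that construction (all the E[v] numbers are made up, and cross-effects between different v's are ignored, which is the part the convex combinations would have to handle):

```python
# A toy numeric version of the construction: for each v, pick the mixing
# probability between eps*u+v and eps*u-v so that the post-takeover E[v]
# matches the default E[v] in expectation.  The dictionaries of numbers
# are made-up assumptions, and cross-effects between different v's are
# ignored (that is what the convex combinations would have to patch up).

def mixing_weight(default, high, low):
    """Probability p of sampling eps*u+v (vs eps*u-v) such that
    p*high + (1-p)*low == default, clipped to [0, 1]."""
    if high == low:
        return 0.5
    p = (default - low) / (high - low)
    return min(max(p, 0.0), 1.0)

# default_Ev: the default expected value of each v;
# high_Ev / low_Ev: E[v] if eps*u+v / eps*u-v takes over the world.
default_Ev = {"v1": 0.2, "v2": 0.5, "v3": -0.1}
high_Ev    = {"v1": 1.0, "v2": 1.0, "v3": 1.0}
low_Ev     = {"v1": -1.0, "v2": -1.0, "v3": -1.0}

for v in default_Ev:
    p = mixing_weight(default_Ev[v], high_Ev[v], low_Ev[v])
    print(f"{v}: sample eps*u+{v} with prob {p:.2f}, eps*u-{v} with prob {1 - p:.2f}")
```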
Perhaps this leads us to the conclusion that we should set f to the distribution over the future singleton’s values, and trust that human values are probable enough that these will be taken into account. But it seems like you are saying we can apply this even when we haven’t constructed a distribution that assigns non-negligible probability to human values.