Thank you for this post. Various “uploading nightmare” scenarios seem quite salient for many people considering digital immortality/cryonics. It’s good to have potential countermeasures that address such worries.
My concern about your proposal is that, if an attacker can feed you inputs and observe your outputs, they can train a deep model on those input/output pairs and then use it to infer how you might behave under a rewind. I expect the future will include deep models extensively pretrained to imitate humans (both simulated and physical ones), so the attacker may need surprisingly little of your inputs/outputs to get a good model of you. Such a model could also use information about your internal computations to improve its accuracy, so leaking that information would be very bad.
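To make the attack concrete, here is a toy sketch of what I have in mind (entirely my own illustration, with made-up dimensions and stand-in data, not anything from the post): the attacker logs (input, output) pairs from interactions with you and fits an imitation model to them by ordinary behavior cloning. In reality they would presumably fine-tune a large pretrained human-imitator rather than a tiny MLP.

```python
import torch
import torch.nn as nn

OBS_DIM, RESP_DIM = 32, 16  # arbitrary toy sizes

class ImitationModel(nn.Module):
    """Stand-in for a large pretrained imitator the attacker fine-tunes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.ReLU(),
            nn.Linear(64, RESP_DIM),
        )

    def forward(self, obs):
        return self.net(obs)

# Pretend these are the attacker's logged interactions with the upload:
# what they showed you, and (a vector encoding of) how you responded.
inputs = torch.randn(512, OBS_DIM)
outputs = torch.randn(512, RESP_DIM)

model = ImitationModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(model(inputs), outputs)  # behavior-cloning objective
    loss.backward()
    opt.step()

# The attacker can now query `model` instead of you, e.g. to estimate how
# you would respond across many rewound branches without touching the real you.
```

The point is just that nothing in this pipeline requires access to your internals; observable behavior alone is the training signal.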
I’m not sure what can be done about such a risk. Any output you generate is some function of your internal state, so every output risks leaking information about that state. Maybe you could use a “rephrasing” neural-net module that rewrites your outputs to strip patterns that leak personality-related information? That would map many possible internal states onto similar input/output behavior and make inferring your internal state more difficult.
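As a cartoon of the many-to-one idea (again my own construction, not the post’s proposal, and a real version would be a learned paraphrasing model rather than a lookup table):

```python
import re

# Canonical word forms and filler words are illustrative, hand-picked examples.
CANONICAL = {
    "hi": "hello", "hey": "hello", "yeah": "yes", "yep": "yes",
    "nope": "no", "thanks": "thank you", "gonna": "going to", "i'm": "i am",
}
FILLERS = {"well", "honestly", "frankly", "um", "uh", "like"}

def rephrase(text: str) -> str:
    """Collapse stylistic variation so different phrasings of the same
    content come out as (nearly) the same surface string."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [CANONICAL.get(w, w) for w in words if w not in FILLERS]
    return " ".join(kept)

# Two stylistically different replies map to the same canonical sentence,
# leaving less personality signal for an attacker's model to pick up.
print(rephrase("Well, yeah, thanks, I'm gonna look at it!"))
print(rephrase("Yep, thank you, I am going to look at it."))
```

Of course a lookup table like this throws away nuance along with style; the interesting question is how much personality-relevant signal you can remove while still sounding like a person worth talking to.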
You could also try to communicate only with entities that you think will not attempt such an attack and that will retain as little of your communication as possible. However, both those measures seem like they’d make forming lasting friendships with outsiders difficult.
Way above my pay grade, but can you just respond to some inputs randomly?
Actually, it sounds like a poker game anyway: people try to build a model of you so they can predict you, and you respond randomly from time to time to mess with their training.
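For what it’s worth, the simplest version of that strategy is just an ε-random response policy. Here is a minimal sketch (my own, with hypothetical decoy replies), which also makes the cost obvious, since friends see the decoys too:

```python
import random

# Hypothetical, deliberately uninformative decoy replies.
DECOYS = ["interesting", "let me think about that", "maybe", "no comment"]

def noisy_respond(genuine_response: str, epsilon: float = 0.1) -> str:
    """With probability epsilon, replace the genuine response with a decoy,
    so an attacker's imitation model trains on partly poisoned data."""
    if random.random() < epsilon:
        return random.choice(DECOYS)
    return genuine_response

for _ in range(10):
    print(noisy_respond("here is what I actually think about X"))
```

Whether a modest ε actually hurts a strong imitator, or just teaches it that you occasionally say “no comment”, is the part I’m unsure about.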