As a spaghetti behavior executor, I’m worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift, especially throughout a much longer life than presently feasible, so I’d like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.
> As a spaghetti behavior executor, I’m worried that neural networks are not a safe medium for keeping a person alive without losing themselves to value drift, especially throughout a much longer life than presently feasible
As a fellow spaghetti behavior executor, replacing my entire motivational structure with a static goal slot feels like dying and handing off all of my resources to an entity that I don’t have any particular reason to think will act in a way I would approve of in the long term.
Historically, I have found different things rewarding at different stages of my life, and this has chiseled the paths in my cognition that make me me. I expect that in the future my experiences and decisions, and how rewarded or regretful I feel about those decisions, will continue to chisel my cognition in ways that change what I care about, just as past-me endorsed the experiences that led current-me to care about things (e.g. specific partners, offspring) that past-me did not care about.
I would not endorse freezing my values in place to prevent value drift in full generality. At most I endorse setting up contingencies so my values don’t end up trapped in some specific places current-me does not endorse (e.g. “heroin addict”).
> so I’d like to get myself some goal slots that much more clearly formulate the distinction between capabilities and values. In general this sort of thing seems useful for keeping goals stable, which is instrumentally valuable for achieving those goals, whatever they happen to be, even for a spaghetti behavior executor.
So in this ontology, an agent is made up of a queryable world model and a goal slot. Improving the world model allows the agent to better predict the outcomes of its actions, and the goal slot determines which available action the agent would pick given its world model.
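To make that ontology concrete, here is a minimal sketch of what I take "queryable world model plus goal slot" to mean; the names and types are my own illustrative assumptions, not something from the original comment. The world model predicts outcomes, the goal slot scores them, and action selection is just picking the highest-scoring option.

```python
# Minimal sketch of the "world model + goal slot" ontology (illustrative only;
# WorldModel / GoalSlot / Agent are hypothetical names, not an established API).
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")

# Capabilities: a queryable world model that predicts the outcome of an action.
WorldModel = Callable[[State, Action], State]
# Values: a goal slot that scores predicted outcomes.
GoalSlot = Callable[[State], float]


class Agent:
    def __init__(self, world_model: WorldModel, goal_slot: GoalSlot) -> None:
        self.world_model = world_model
        self.goal_slot = goal_slot

    def pick_action(self, state: State, actions: Iterable[Action]) -> Action:
        # The goal slot only ranks the outcomes the world model predicts;
        # swapping what sits in the slot changes what the agent wants
        # without touching what it knows how to do.
        return max(actions, key=lambda a: self.goal_slot(self.world_model(state, a)))
```

On this picture, improving the world model improves prediction, and the disagreement that follows is about whether making the goal slot immutable buys you anything.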
I see the case for improving the world model. But once I have that better world model, I don’t see why I would additionally want to add an immutable goal slot that overrides my previous motivational structure. My understanding is that adding a privileged immutable goal slot would only change my behavior in those cases where I would otherwise have decided that achieving the goal placed in that slot was not, on balance, a good idea.
As a note, you could probably say something clever like “the thing you put in the goal slot should just be ‘behave in the way you would if you had access to unlimited time to think and the best available world model’”, but if we’re going there then I contend that the rock I picked up has a goal slot filled with “behave exactly like this particular rock”.
The point is control over this process, the ability to make decisions about the development of oneself, instead of leaving it largely in the hands of the inscrutable low-level computational dynamics of the brain and the influence of external data. Digital immortality doesn’t guard against this: over a million subjective years you might slip away bit by bit for reasons you don’t endorse, never having had enough time to decide how to guide the process. But if there is a way to put uncontrollable drift on hold, then your goal slots are your own, and you can do with them what you will when you are ready.