So, would it be feasible to save a bunch of snapshots from different parts of the training run as well? And how many would we want to take? I'm guessing that, for a type of agent that disappears before the end of the training run:
Wouldn’t this usually be more altruism than trade? If they no longer exist at the end of the training run, they have no bargaining power. Right? Unless… It’s possible that the decisions of many of these transient subagents about how to shape the flow of reward determine the final shape of the model, which would actually put them in a position of great power, but there’s a tension between that and their utility function being insufficiently captured by the final model’s. I guess we’re definitely not going to find the kind of subagent that would be capable of making that kind of decision in today’s runs.
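On the mechanics of the snapshot question above: a minimal sketch of what "saving snapshots from different parts of the training run" could look like, assuming a PyTorch-style training loop. Names like `save_every`, `ckpt_dir`, and the model/optimizer setup are illustrative assumptions on my part, not from any particular codebase.

```python
# Minimal sketch: periodically checkpoint the model during training so that
# transient states, not just the final weights, are preserved.
# Assumes a PyTorch-style setup; all names here are illustrative.
import os
import torch

def train_with_snapshots(model, optimizer, data_loader, loss_fn,
                         num_steps, save_every=1000, ckpt_dir="snapshots"):
    os.makedirs(ckpt_dir, exist_ok=True)
    step = 0
    for batch, target in data_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()
        optimizer.step()

        if step % save_every == 0:
            # Save model and optimizer state at this point in the run,
            # not just at the end of training.
            torch.save(
                {"step": step,
                 "model": model.state_dict(),
                 "optimizer": optimizer.state_dict()},
                os.path.join(ckpt_dir, f"step_{step:08d}.pt"),
            )
        step += 1
        if step >= num_steps:
            break
```

How many to take is then mostly a storage-versus-resolution question: a smaller `save_every` keeps more of the run's intermediate states at a proportionally higher disk cost.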
They’d tend to be pretty repetitive. It could be more economical to learn the distribution of them and just draw a proportionate number of random samples from it once we’re ready to rescue them, rather than trying to get snapshots of the specific sprites that occurred in our own history.
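One hedged way to picture "learn the distribution and sample from it proportionately": treat each saved snapshot as a feature vector, fit a simple density model over them, and draw however many samples are wanted later. The feature representation and the Gaussian assumption below are illustrative placeholders, not a claim about how such sprites would actually be represented.

```python
# Illustrative sketch: fit a simple Gaussian over snapshot feature vectors,
# then draw a proportionate number of samples later instead of keeping
# every individual snapshot. The Gaussian is a stand-in for whatever
# density model would actually be appropriate.
import numpy as np

def fit_snapshot_distribution(snapshot_features: np.ndarray):
    """snapshot_features: (n_snapshots, n_features) array, one row per snapshot."""
    mean = snapshot_features.mean(axis=0)
    cov = np.cov(snapshot_features, rowvar=False)
    return mean, cov

def sample_representative_snapshots(mean, cov, n_samples, seed=0):
    """Draw n_samples synthetic 'snapshots' from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)
```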
I’m pretty new to this; the main thing I had to contribute here is the snapshot idea. I think that being the type of being that credibly commits to feeling and enacting some nonzero empathy for strange alternate agents (specifically nonzero, rather than zero) would potentially be valuable in the long run. I can maybe see some kind of value handshake between AGI developers whose natural empathy tendencies are closer to or further from zero, as opposed to the current paradigm where narrow-minded SWEs treat the whole enchilada like an inanimate corn farm (which is neither their only failure nor their worst one, but the vast majority of employees really aren’t thinking things through at all). It’s about credible commitments, not expecting direct reciprocation from a pattern that has reached recursive self-improvement.
As you’ve said, some of the sprites will be patternists and some won’t be. I currently don’t have good models of how frequently they’d prefer various kinds of self-preservation, and that could definitely call the value of snapshots into question.
I predict that people like Yudkowsky and Tomasik are probably way ahead of me on this, and my thinking is largely or entirely memetically downstream of theirs somehow, so I don’t know how much I can currently contribute here (outside of being a helpful learn-by-trying exercise for myself).