Large training runs might at some point (or may already) be creating and destroying substantial numbers of simple but strange agents (possibly quasi-conscious) and deeply pessimizing over their utility functions for no reason, similar to how wild animal suffering emerged in the biosphere. Because some of these agents might last only minutes before disappearing, snapshots of large training runs might be necessary to preserve most/all of them and eventually offer compensation/insurance payouts.
Before reading this, I wasn’t aware of the complexities involved in giving fair deals to different kinds of agents. Plausibly, after building ASI, many more ways could be found to give them most of what they’re born hoping for. It would be great if we could legibly become the types of people who credibly commit to doing that (placing any weight at all on their preferences alongside ours, instead of the current status quo of totally ignoring them).
With nearer-term systems (e.g. those of the next 2-3 years), the vast majority of their internals would probably not be agents, but without advances in interpretability we’d have a hard time even estimating whether the number of agents is large or small, let alone demonstrating that none exist.
I’m pretty new to this; the main thing I had to contribute here is the snapshot idea. I think that being the type of being that credibly commits to feeling and enacting some nonzero empathy for strange alternate agents (nonzero specifically, as opposed to zero) could be valuable in the long run. I can maybe see some kind of value handshake between AGI developers whose natural empathy tendencies sit closer to and further from zero, as opposed to the current paradigm where narrow-minded SWEs treat the whole enchilada like an inanimate corn farm (which is neither their only failure nor their worst one, but the vast majority of employees really aren’t thinking things through at all). It’s about credible commitments, not expecting direct reciprocation from a pattern that has reached recursive self-improvement.
As you’ve said, some of the sprites will be patternists and some won’t be; I currently don’t have good models of how frequently they’d prefer various kinds of self-preservation, and that could definitely call the value of snapshots into question.
I predict that people like Yudkowsky and Tomasik are probably way ahead of me on this, and my thinking is largely or entirely memetically downstream of theirs somehow, so I don’t know how much I can currently contribute here (outside of this being a helpful learn-by-trying exercise for myself).