Quarter-baked ideas for potential future baking:
A procedure for ‘~shardifying’[1] an incoherent utility function into a coherent utility function by pushing preferences into conditionals. An extreme example would be an ideal predictor (i.e. one which has successfully learned values fit to the predictive loss, not other goals, and does not exhibit internally motivated instrumental behavior) trained to perfectly predict the outputs of an incoherent agent.
The ideal predictor model, being perfectly conditional, would share the same outputs but would retain coherence: inconsistencies in the original utility function are remapped to be conditional. Apparent preference cycles over world states are fine if the utility function isn’t primarily concerned with world states. The ideal predictor is coherent by default; it doesn’t need to work out any kinks to avoid stepping on its own toes.
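A toy sketch of what I mean (my own illustration, with arbitrary numbers): pairwise choices that cycle over world states can be reproduced exactly by maximizing a utility function over (context, action) pairs, which never compares across contexts and so has nothing to cycle over.

```python
# Toy sketch: an agent whose choices over world states cycle (A -> B -> C -> A),
# reproduced by a coherent utility function over (context, action) pairs.

# The incoherent agent's observed behavior: from each state, which state it moves to.
incoherent_choice = {"A": "B", "B": "C", "C": "A"}

# "Shardified" conditional utility: preferences are indexed by context (the current
# state), so each shard is a blob of conditionally activated preferences. Within any
# one context the ranking is consistent, and no cross-context comparison is made.
conditional_utility = {
    ("A", "B"): 1.0, ("A", "C"): 0.0, ("A", "A"): 0.5,
    ("B", "C"): 1.0, ("B", "A"): 0.0, ("B", "B"): 0.5,
    ("C", "A"): 1.0, ("C", "B"): 0.0, ("C", "C"): 0.5,
}

def conditional_policy(state: str) -> str:
    """Pick the successor maximizing utility conditional on the current context."""
    return max(["A", "B", "C"], key=lambda s: conditional_utility[(state, s)])

# The conditional agent matches the incoherent agent's outputs exactly,
# while its utility function stays coherent over its actual domain.
for state in ["A", "B", "C"]:
    assert conditional_policy(state) == incoherent_choice[state]
```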
Upon entering a hypothetical capability-induced coherence death spiral, what does the original inconsistent agent do? Does it try to stick to object-level preferences, forcing it to violate its previous preferences in some presumably minimized way?[2] Or does it punt things into conditionality to maintain behaviors implied by the original inconsistencies? Is that kind of shardification convergent?
Is there a path to piggybacking on greed/anticompetitive inclinations for restricting compute access in governance? One example: NVIDIA already requires that data center customers purchase its vastly more expensive data center products. The driver licenses for the much cheaper gaming-class hardware already do not permit use cases like “build a giant supercomputer for training big LLMs.”
Extend this to, say, a dead man’s switch built into the driver: if the GPU installation hasn’t received an appropriate signal recently (implying that the relevant regulatory entity has not been able to continue its audits of the installation and its use), the cluster simply dies.
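A rough sketch of what that driver-side check could look like, assuming a hypothetical regulator-signed “audit heartbeat”; none of these names correspond to real NVIDIA interfaces.

```python
# Hypothetical dead man's switch logic; all names and formats are illustrative.
import time

HEARTBEAT_MAX_AGE = 30 * 24 * 3600  # e.g. require a fresh audit signal every 30 days

def verify_heartbeat(signal: dict) -> bool:
    """Stand-in for verifying the regulator's signature on the heartbeat.

    A real design would check the signature against a key fused into the hardware
    itself, so a modified driver couldn't simply skip or fake this step.
    """
    return signal.get("signature") == "valid"  # placeholder check only

def cluster_may_run(latest_signal: dict, now=None) -> bool:
    """Driver-side gate: refuse large workloads once the audit signal goes stale."""
    now = time.time() if now is None else now
    if not verify_heartbeat(latest_signal):
        return False
    return (now - latest_signal["timestamp"]) < HEARTBEAT_MAX_AGE

# Example: a signal last refreshed 45 days ago means the cluster shuts down.
stale = {"signature": "valid", "timestamp": time.time() - 45 * 24 * 3600}
assert cluster_may_run(stale) is False
```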
Modified drivers could bypass some of the restrictions, but some hardware involvement would make it more difficult. NVIDIA may already be doing this kind of hardware-level signature checking to ensure that only approved drivers can be used (I haven’t checked). It’s still possible in principle to bypass (the hardware and software are both in the hands of the enemy), but it would be annoying.
Even if they don’t currently do that sort of check, it would be relatively simple to add some form of it with a bit of lead time.
More regulatory hurdles that NVIDIA (or other future dominant ML hardware providers) can swallow without stumbling too badly would give them a bit of extra moat against up-and-comers. It’d be in their interest to get the government to add those regulations, and they could then extract a bit more profit from hyperscalers.
I’m using the word “shard” here to just mean “a blob of conditionally activated preferences.” It’s probably importing some other nuances that might be confusing, because I haven’t read enough shard theory to catch where the usage doesn’t quite fit.
This idea popped into my head during a conversation with someone working on how inconsistent utilities might be pushed towards coherence. It was at the Newspeak House the evening of the day after EAG London 2023. Unfortunately, I promptly forgot their name! (If you see this, hi, nice talking to you, and sorry!)