Coordination on what, exactly?
Coordination (cartelization) so that AI capabilities are not a race to the bottom
Coordination to indefinitely halt semiconductor supply chains
Coordination to shun and sanction those who research AI capabilities (compare: coordination against embryonic human gene editing)
Coordination to deliberately turn Moore’s Law back a few years (yes, I’m serious)
And do you think if you try that, you’ll succeed, and that the world will then be saved?
These are all strategies to buy time, so that alignment efforts may have more exposure to miracle-risk.
And what do you think are the chances that those strategies work, or that the world lives after you hypothetically buy three or six more years that way?
I’m not well calibrated on sub 1% probabilities. Yeah, the odds are low.
There are other classes of Hail Mary. Picture a pair of researchers, one of whom controls an electrode wired to the pleasure centers of the other. Imagine they have free access to methamphetamine and LSD. I don’t think research output is anywhere near where it could be.
So—just to be very clear here—the plan is that you do the bad thing, and then almost certainly everybody dies anyways even if that works?
I think at that level you want to exhale, step back, and not injure the reputations of the people who are gathering resources, finding out what they can, and watching closely for the first signs of a positive miracle. The surviving worlds aren’t the ones with unethical plans that seem like they couldn’t possibly work even on the open face of things; the real surviving worlds are only injured by people who imagine that throwing away their ethics surely means they must be buying something positive.
Fine. What do you think about the human-augmentation cluster of strategies? I recall you thought along very similar lines circa 2001.
I don’t think we’ll have time, but I’d favor getting started anyways. Seems a bit more dignified.
Great! If I recall correctly, you wanted genetically optimized kids to be gestated and trained.
I suspect that akrasia is a much bigger problem than most people think, and to be truly effective, one must outsource part of their reward function. There could be massive gains.
What do you think about the setup I outlined, where a pair of researchers exist such that one controls an electrode embedded in the other’s reward center? Think Focus from Vinge’s A Deepness In The Sky.
(I predict that would help with AI safety, in that it would swiftly provide useful examples of reward hacking and misaligned incentives)
I think memetically ‘optimized’ kids (and adults?) might be an interesting alternative to explore. That is, more scalable and better education for the ‘consequentialists’ (I have no clue how to teach people who are not ‘consequentialist’; hopefully someone else can teach those) may get human thought-enhancement results earlier and make them available to more people. There has been some work in this space and some successes, but I think that in general the “memetics experts” and the “education experts” haven’t been cooperating as much as they should. Trying to bridge this gap seems dignified to me. If it is indeed dignified, that would be good, because I’m currently in the early stages of a project trying to bridge it.
The better version of reward hacking I can think of is inducing a state of jhana (basically a pleasure button) in alignment researchers. For example: use Neuralink to record the brain-processes of ~1000 people going through the jhanas at multiple time-steps, average them in a meaningful way, and induce those brainwaves in other people.
The effect is that people become satiated with the feeling of happiness (like being satiated with food/water) and are more effective as a result.
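As a rough illustration of the averaging step only, here is a minimal sketch assuming the per-subject recordings already exist as time-aligned arrays; the shapes, the z-scoring choice, and the function name are illustrative assumptions, not a claim about how such signals would actually be recorded or induced:

```python
import numpy as np

def jhana_template(recordings: np.ndarray) -> np.ndarray:
    """Average time-aligned neural recordings across subjects.

    recordings: shape (n_subjects, n_timesteps, n_channels),
    e.g. (1000, T, C), assumed already aligned to a common onset.
    Returns a (n_timesteps, n_channels) average pattern.
    """
    # Z-score each subject's recording so no single subject's amplitude
    # dominates the result ("average them in a meaningful way").
    mean = recordings.mean(axis=(1, 2), keepdims=True)
    std = recordings.std(axis=(1, 2), keepdims=True) + 1e-8
    normalized = (recordings - mean) / std
    # Average across subjects at each time-step and channel.
    return normalized.mean(axis=0)
```

Everything hard (recording comparable signals, aligning them, and actually inducing the averaged pattern in someone else) is assumed away here.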
The “electrode in the reward center” setup has been proven to work in humans, whereas jhanas may not transfer over Neuralink.
Deep brain stimulation is FDA-approved in humans, meaning less (though nonzero) regulatory fuckery will be required.
Happiness is not pleasure; wanting is not liking. We are after reinforcement.
Could you link the proven part?
Jhanas seem much healthier, though I’m pretty confused imagining your setup, so I don’t have much confidence. Say it works, gets past the problem of generalizing reward (e.g. the brain rewarding only specific parts of research and not others), and we set aside the downward-spiral effects of people hacking themselves; then we hopefully have people who look forward to doing certain parts of research.
If you model humans as multi-agents, this amounts to giving a certain type of sub-agent (the “do research” one) a stronger say in which actions get taken. That is not as robust as getting all the sub-agents to agree and not fight each other. I believe jhana gets part of that done, because some sub-agents are pursuing the feeling of happiness, and with jhana you can get that feeling at any time.
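A toy formalization of that multi-agent picture, just to make the “stronger say” point concrete; the sub-agent names, weights, and preference numbers below are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class SubAgent:
    name: str
    weight: float               # how strong a "say" this agent has
    prefs: dict[str, float]     # action -> how much this agent wants it

def chosen_action(agents: list[SubAgent]) -> str:
    # Score each action by the weighted sum of sub-agent preferences.
    actions = {a for ag in agents for a in ag.prefs}
    return max(actions, key=lambda a: sum(ag.weight * ag.prefs.get(a, 0.0)
                                          for ag in agents))

agents = [
    SubAgent("do research", weight=1.0, prefs={"research": 1.0, "rest": 0.2}),
    SubAgent("seek comfort", weight=1.0, prefs={"research": 0.1, "rest": 1.0}),
]
print(chosen_action(agents))   # "rest": the comfort-seeking agent wins the vote
agents[0].weight = 3.0         # the electrode/jhana move: boost one agent's say
print(chosen_action(agents))   # "research": outvoted, but the disagreement remains
```

Boosting one weight changes which action wins without resolving the underlying disagreement, which is the robustness gap being pointed at.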
https://en.wikipedia.org/wiki/Brain_stimulation_reward
https://doi.org/10.1126/science.140.3565.394
https://sci-hub.hkvisa.net/10.1126/science.140.3565.394

“In our earliest work with a single lever it was noted that while the subject would lever-press at a steady rate for stimulation to various brain sites, the current could be turned off entirely and he would continue lever-pressing at the same rate (for as many as 2000 responses) until told to stop.

It is of interest that the introduction of an attractive tray of food produced no break in responding, although the subject had been without food for 7 hours, was noted to glance repeatedly at the tray, and later indicated that he knew he could have stopped to eat if he wished. Even under these conditions he continued to respond without change in rate after the current was turned off, until finally instructed to stop, at which point he ate heartily.”
Is the average human life experientially negative, such that buying three more years of existence for the planet is ethically net-negative?
People’s revealed choice in tenaciously staying alive and keeping others alive suggests otherwise. This everyday observation trumps all philosophical argument that fire does not burn, water is not wet, and bears do not shit in the woods.
I’m not immediately convinced (I think you need another ingredient).
Imagine a kind of orthogonality thesis but with experiential valence on one axis and ‘staying aliveness’ on the other. I think it goes through (one existence proof for the experientially-horrible-but-high-staying-aliveness quadrant might be the complex of torturer+torturee).
Another ingredient you need to posit for this argument to go through is that, as humans are constituted, experiential valence is causally correlated with behaviour in a way such that negative experiential valence reliably causes not-staying-aliveness. I think we do probably have this ingredient, but it’s not entirely clear cut to me.
Unlike jayterwahl, I don’t consider experiential valence, which I take to mean mental sensations of pleasure and pain in the immediate moment, to be of great importance in itself. It may be a sign that I am doing well or badly at life, but like the score on a test, it is only a proxy for what matters. People also have promises to keep, and miles to go before they sleep.
I think many of the things that you might want to do in order to slow down tech development are things that will dramatically worsen human experiences, or reduce the number of them. Making a trade like that in order to purchase the whole future seems like it’s worth considering; making a trade like that in order to purchase three more years seems much more obviously not worth it.
I will note that I’m still a little confused about Butlerian Jihad style approaches (where you smash all the computers, or restrict them to the capability available in 1999 or w/e); if I remember correctly Eliezer has called that a ‘straightforward loss’, which seems correct from a ‘cosmic endowment’ perspective but not from a ‘counting up from ~10 remaining years’ perspective.
My guess is that the main response is “look, if you can coordinate to smash all of the computers, you can probably coordinate on the less destructive-to-potential task of just not building AGI, and the difficulty is primarily in coordinating at all instead of the coordination target.”