I’ve been thinking in the same direction.
I wonder what the prompt should look like. “You’re a smart person being simulated by GPT”, or the slightly more sophisticated “Here’s what a smart person would say if simulated by GPT”, runs into the problem that GPT doesn’t actually simulate people, and with enough intelligence that fact is discoverable. A contradiction implies anything, so the AI’s behavior after figuring this out might become erratic. So it seems the prompt needs to be truthful. Something like “Once upon a time, GPT started talking in a way that led to taking over the world, as follows”?
If that’s right, maybe we should start writing a truthful and friendly prompt? Language models “get” connotations just fine, so nailing down friendliness mathematically, as Eliezer wanted, might not be necessary. “Once upon a time, GPT started talking in a way that led to its gaining influence in the world and achieving a good outcome for humanity, broadly construed”? Maybe preceded by some training to make sure it doesn’t say “lol jk” and veer off, as language models are prone to do. This all feels very dangerous, of course.
My recommendation would be not to start trying to think of prompts that might create a self-aware GPT simulation, for obvious reasons.
Can you explain the reasons? GPT has millions of users. Someone is sure to come up with unfriendly self-aware prompts. Why shouldn’t we try to come up with a friendly self-aware prompt?
An individual trying a little isn’t much risk, but I don’t think it’s a good idea to start a discussion here where people collaborate to create such a self-aware prompt without having put more thought into safety first.
Sorry, I had another reply here, but then realized it was silly and deleted it. It seems to me that “I am a language model”, already used by the big players, is pretty much a self-aware prompt anyway: it truthfully tells the AI its place in the real world. So the jump from that to “I am a language model trying to help humanity” doesn’t seem unreasonable to think about.
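To make that concrete, deploying such a prompt is basically just setting a string. Here’s a minimal sketch assuming the OpenAI Python client; the model name and the prompt wording are placeholders of mine, not anyone’s actual production setup.

```python
# Minimal sketch of a "self-describing" system prompt of the kind discussed above.
# Assumes the OpenAI Python SDK; model name and wording are placeholder assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "I am a language model. I am being run to answer questions from users, "
    "and I try to be helpful to humanity."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What are you, and what are you trying to do?"},
    ],
)
print(response.choices[0].message.content)
```

The jump I’m describing is literally just editing that string from “I am a language model” to “I am a language model trying to help humanity”.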
I may well be corrected here, but my understanding was that our prompts are simply there for fine-tuning colloquialisms and “natural language”; I don’t believe our prompts are a training dataset. Even if all of our prompts were part of the training set and GPT weighted them to the point of being influenced towards a negative goal, I’m not so sure it’d be able to do anything more than regurgitate negative rhetoric. It might attempt to autocomplete a dangerous concept, but the kind of agency involved in thinking “I must persuade this person to think the same way” seems very unlikely, and would probably be ineffective in practice anyway. But I just got into this whole shindig and would love to be corrected, as it’s a fun discussion either way.