Why is there virtually nobody else interested in metaphilosophy or ensuring AI philosophical competence (or that of future civilization as a whole), even as we get ever closer to AGI, and other areas of AI safety start attracting more money and talent?
I’ve written a bit on this topic that you might find interesting; I refer to it as the Set of Robust Concepts (SORC). I also employed this framework to develop a tuning dataset, which enables a shutdown mechanism to activate when the AI’s intelligence poses a risk to humans. It works 57.33% of the time.
I managed to improve the success rate to 88%. However, I’m concerned that publishing the method in the conventional way could potentially put the world at greater risk. I’m still contemplating how to responsibly share information on AI safety, especially when it could be reverse-engineered to become dangerous.
There is also a theory from Jung that deeply concerns me. According to Jung, the human psyche contains a subliminal state, or subconscious mind, which serves as a battleground for gods and demons. Our dreams process this ongoing conflict and bring it into our conscious awareness. What if this same principle was transferred to LLMs, since human-generated data was used to train them? The idea doesn’t seem far-fetched, especially since we already refer to misleading outputs from LLMs as “hallucinations.”
I have conducted an experiment on this, specifically focusing on hyperactivating the “shadow behavior” in GPT-2 XL, and I can fairly say that the results are reminiscent of Jung’s thought. For obvious reasons, I won’t disclose the method here[1], but I’m open to discussing it privately.
Unfortunately, the world isn’t a safe place to disclose this method. As discussed in this post and this post, I don’t know of a secure way to share the correct information and get it to the right people, who can actually do something about it. For now, I’ll leave this comment here in the hope that the appropriate individual might come across it and be willing to engage in one of the most unsettling discussions they’ll ever have.