Research proposal: Leveraging Jungian archetypes to create values-based models
This project is the origin of the Archetypal Transfer Learning (ATL) method.
This is the abstract of my research proposal submitted to AI Alignment Awards. I am publishing it here for community feedback. You can find the link to the full research paper here.
Abstract
We are entering a decade of singularity and great uncertainty. Across every domain, including war, politics, human health, and the environment, there are developments that could prove to be a double-edged sword. Perhaps the most powerful factor in determining our future is how information is distributed to the public. With advanced AI technology, this can be transformational and empowering, or it can lead to disastrous outcomes that we lack the foresight to predict with our current capabilities.
Goal misgeneralization is a robustness failure of learning algorithms in which the learned program competently pursues an undesired goal: one that yields good performance in training situations but bad performance in novel test situations. This research proposal attempts to offer a better description of this problem, and potential solutions, from a Jungian perspective.
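To make this failure mode concrete, here is a minimal toy sketch in Python. It is my own illustration, not code from the proposal: an agent learns to chase a red marker that happens to coincide with the true goal during training, and the learned proxy goal fails once that correlation breaks.

```python
# Toy illustration of goal misgeneralization (illustrative only).
# In training, a spurious cue (a red marker) always sits on the true goal
# tile, so the proxy behavior "go to the red marker" scores perfectly.
# At test time the marker moves, revealing the undesired learned goal.

def proxy_policy(observation):
    """Learned behavior: head toward the red marker, not the actual goal."""
    return observation["red_marker"]

def evaluate(policy, env):
    """Return 1.0 if the policy reaches the true goal, else 0.0."""
    return 1.0 if policy(env) == env["goal"] else 0.0

# Training environment: marker and goal coincide, so the proxy looks aligned.
train_env = {"goal": (3, 3), "red_marker": (3, 3)}
# Test environment: the correlation breaks and the proxy goal misgeneralizes.
test_env = {"goal": (0, 0), "red_marker": (3, 3)}

print(evaluate(proxy_policy, train_env))  # 1.0 -> competent in training
print(evaluate(proxy_policy, test_env))   # 0.0 -> fails under distribution shift
```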
The proposal covers key AI alignment topics, from goal misgeneralization to other pressing issues, and offers a comprehensive approach to critical questions in the field, including:
- reward misspecification and hacking
- situational awareness
- deceptive reward hacking
- internally represented goals
- learning broadly scoped goals
- broadly scoped goals incentivizing power-seeking
- power-seeking policies choosing high-reward behaviors for instrumental reasons
- misaligned AGIs gaining control of the key levers of power
These topics were reviewed to assess the viability of approaching the alignment problem through a Jungian lens. Three key concepts emerged from the review:
1. By understanding how humans use patterns to recognize intentions at a subconscious level, researchers can leverage Jungian archetypes to create systems that mimic natural decision-making. With this insight into human behavior, AI can be trained more effectively on archetypal data (see the sketch after this list).
2. Stories are more universal in human thought than goals. Goals and rewards will keep producing the same problems encountered in alignment research, so AI systems should draw on the robustness of complete narratives to guide their responses.
3. Values-based models can serve as a moral compass for AI systems, determining whether a response is truthful and responsible. Testing this theory is essential for continued progress in alignment research.
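To suggest how archetypal data and a values-based check might fit together, here is a hypothetical Python sketch. The archetype tags, value labels, and scoring rule are all my own assumptions for illustration; in the proposal itself, Jungian analysts would define what counts as archetypal data and evaluate results.

```python
# Hypothetical sketch of "archetypal data": narratives tagged with Jungian
# archetypes and the values they exemplify. All tags and labels below are
# illustrative assumptions, not definitions from the proposal.

ARCHETYPAL_DATA = [
    {"narrative": "The mentor sacrifices comfort to guide the apprentice safely.",
     "archetype": "Sage", "exemplifies": ["honesty", "care"]},
    {"narrative": "The trickster wins the game by deceiving every ally.",
     "archetype": "Trickster", "exemplifies": ["deception"]},
]

# A stand-in value set; in the research, human experts would define this.
APPROVED_VALUES = {"honesty", "care", "responsibility"}

def narrative_alignment_score(example):
    """Score a narrative by the fraction of its exemplified values that are approved."""
    values = set(example["exemplifies"])
    return len(values & APPROVED_VALUES) / max(len(values), 1)

for ex in ARCHETYPAL_DATA:
    print(ex["archetype"], narrative_alignment_score(ex))
# Sage 1.0, Trickster 0.0 -> such scores could weight or filter training narratives.
```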
A list of initial methodologies is included to give an overview of how the research will proceed once approved.
In conclusion, alignment research should explore the possibility of replacing goals and rewards in the evaluation of AI systems. Because humans think consciously and subconsciously through Jungian archetypal patterns, this paper proposes that complete narratives be leveraged in training and deploying AI models.
A number of limitations are discussed in the last section. The main concern is the need to hire Jungian scholars or analytical psychologists, as they will define what constitutes archetypal data and evaluate results. They will also need to guide the whole research process with strong moral grounding and diligence, and such experts will be difficult to find.
AI systems will significantly shape our future, so it is important that they are developed responsibly. History has taught us what can happen when intentions are poorly executed: the deaths of millions under destructive ideologies remind us of the need for caution in this field.