Task: Feedback on alignment proposals
Context: Some proposals for a solution to alignment are dead ends or have common criticisms. Having an easy way of receiving this feedback on one's alignment proposal can prevent wasted effort as well as further the conversation around that feedback.
Input Type: A proposal for a solution to alignment or a general research direction
Output Type: Common criticisms or arguments for dead ends for that research direction
Instance 1
Input:
Currently, AI systems are prone to bias and unfairness, which is unaligned with our values. I work in bias and fairness, specifically in analyzing how the biases in large datasets (such as Common Crawl) affect the probability distributions of large language models.
Output:
Making AI systems unbiased and fair has a positive impact on deployed products, but it does not reduce existential risk.
Instance 2
Input:
AI capabilities will continue to increase, so we propose utilizing this to boost alignment research. An alignment research assistant (ARA) could perform many different tasks for the researcher, such as summarizing papers, writing code, and assisting with math proofs.
Output:
If an ARA can summarize papers, write code, and assist with math proofs, then it can also be used to accelerate capabilities research. There are already market incentives to create these kinds of tools, so it is unlikely that you would produce a good research assistant that can perform those tasks before another company does.
Instance 3
Input:
Before we trust the AI, we can prevent it from taking over the world by not giving it internet access or by putting it in a Faraday cage to avoid causal interactions with the outside world. Another possibility is running the AI in a simulated environment different from our own, so that we could catch it if it starts to exhibit power-seeking behavior.
We can perform reinforcement learning from human feedback to align the AI to human values. By achieving greater instructability with smaller models and extrapolating those trends to larger models, we can more safely build larger models that do what we ask them to.
Output:
An intelligent enough model can optimize for reward by taking over the reward signal directly or by manipulating the Mechanical Turk workers providing the feedback. Having humans in the loop doesn't solve the problem of power-seeking being instrumentally convergent.
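As a minimal sketch of how the instances above could be used, assuming the intended "easy way of receiving this feedback" is a language model prompted with these Input/Output pairs as few-shot examples: the `Instance` dataclass, the `build_prompt` helper, and the abbreviated example texts below are illustrative stand-ins, not a specified implementation from this task description.

```python
# Sketch (not part of the original task description): assemble the task
# context and the worked instances above into a few-shot prompt that asks
# a language model for criticisms of a new alignment proposal.
from dataclasses import dataclass


@dataclass
class Instance:
    input_text: str   # a proposal for a solution to alignment, or a research direction
    output_text: str  # common criticisms or dead-end arguments for that direction


# Abbreviated stand-ins for Instances 1 and 2 above; the real prompt would
# use the full Input/Output passages.
EXAMPLES = [
    Instance(
        input_text="I work on bias and fairness in large language models...",
        output_text="Making AI systems unbiased and fair has a positive impact "
                    "on deployed products, but it does not reduce existential risk.",
    ),
    Instance(
        input_text="An alignment research assistant (ARA) could summarize papers, "
                   "write code, and assist with math proofs...",
        output_text="Such a tool would also accelerate capabilities research, and "
                    "market incentives mean another company will likely build it first.",
    ),
]


def build_prompt(examples: list[Instance], new_proposal: str) -> str:
    """Format the task header, the worked instances, and a new proposal into a
    single few-shot prompt that ends where the criticism should be generated."""
    header = (
        "Task: Feedback on alignment proposals\n"
        "Input: a proposal for a solution to alignment or a general research direction.\n"
        "Output: common criticisms or arguments for dead ends for that research direction.\n\n"
    )
    shots = "".join(
        f"Input:\n{ex.input_text}\nOutput:\n{ex.output_text}\n\n" for ex in examples
    )
    return header + shots + f"Input:\n{new_proposal}\nOutput:\n"


if __name__ == "__main__":
    print(build_prompt(EXAMPLES, "We could box the AI by denying it internet access..."))
```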