I work at OpenAI on safety. In the past there seems to have been a gap between what I’d consider to be alignment topics that need to be worked on and the general consensus of this forum. A good friend poked me to write something for this, so here I am.
Topics w/ strategies/breakdown:
Fine-tuning GPT-2 from human preferences, to solve small-scale alignment issues
Brainstorm small/simple alignment failures: ways that existing generative language models are not aligned with human values
Design some evaluations or metrics for measuring a specific alignment failure (which lets you measure whether you’ve improved a model or not)
Gather human feedback data / labels / whatever you think you can try training on
Try training on your data (there are tutorials on how to use Google Colab to fine-tune GPT-2 with a new dataset; see the minimal sketch after this list for one way this can look)
Forecast scaling laws: figure out how performance on your evaluation or metric varies with the amount of human input data; compare to how much time it takes to generate each labelled example (be quantitative!)
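To make the "try training on your data" step concrete, here is a minimal sketch of fine-tuning GPT-2 on a small set of human-labelled examples, using the Hugging Face transformers library (one common route, not the only one). The data file name, hyperparameters, and batch-size-1 loop are all placeholders you would swap out for your own setup.

```python
# Minimal sketch: fine-tune GPT-2 on a small file of human-preferred samples.
# "preferred_samples.txt" and the hyperparameters are illustrative placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()

# One example per line; in practice these would be the samples your labellers
# marked as "good" for whatever alignment failure you are targeting.
with open("preferred_samples.txt") as f:
    texts = [line.strip() for line in f if line.strip()]

encodings = [
    tokenizer(t, truncation=True, max_length=512, return_tensors="pt")
    for t in texts
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for epoch in range(3):
    for enc in encodings:
        input_ids = enc["input_ids"]
        # Standard language-modelling loss: predict each token from its prefix.
        outputs = model(input_ids=input_ids, labels=input_ids)
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("gpt2-finetuned")
```

Once you have a loop like this, the scaling-law step is just running it at several dataset sizes and plotting your metric against the number of labelled examples.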
Multi-objective reinforcement learning — instead of optimizing a single objective, optimize multiple objectives together (and some of the objectives can be constraints)
What are ways we can break down existing AI alignment failures in RL-like settings into multi-objective problems, where some of the objectives are safety objectives and some are goal/task objectives?
How can we design safety objectives such that they can transfer across a wide variety of systems, machines, situations, environments, etc?
How can we measure and evaluate our safety objectives, and what should we expect to observe during training/deployment?
How can we incentivize individual development and sharing of safety objectives?
How can we augment RL methods to allow transferable safety objectives across domains (e.g., if using actor-critic methods, how to integrate a separate critic for each safety objective)? A minimal structural sketch of this idea appears after this list.
What are good benchmark environments or scenarios for multi-objective RL with safety objectives (classic RL environments like Go or Chess aren’t natively well suited to these topics)?
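Here is a minimal structural sketch of the "separate critic per safety objective" idea in PyTorch: a shared trunk, one task critic, one value head per safety objective, and a Lagrange-style scalarization so the safety objectives act as constraints. The network sizes, objective names, and fixed multipliers are illustrative assumptions, not a tested design.

```python
# Sketch: actor-critic with one critic head per safety objective, combined via
# Lagrange-style weights so safety objectives act as constraints on the task.
import torch
import torch.nn as nn

class MultiObjectiveActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, safety_objectives):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)
        self.task_critic = nn.Linear(64, 1)
        # One value head per safety objective; in principle these heads could
        # be reused across environments that share the same safety objective.
        self.safety_critics = nn.ModuleDict(
            {name: nn.Linear(64, 1) for name in safety_objectives}
        )

    def forward(self, obs):
        h = self.trunk(obs)
        logits = self.policy_head(h)
        task_value = self.task_critic(h).squeeze(-1)
        safety_values = {k: c(h).squeeze(-1) for k, c in self.safety_critics.items()}
        return logits, task_value, safety_values


def combined_advantage(task_adv, safety_advs, lagrange_multipliers):
    """Scalarize: task advantage minus weighted safety-cost advantages.

    The multipliers would normally be adapted so each safety objective's
    expected cost stays under its constraint threshold (as in constrained RL);
    here they are fixed numbers purely for illustration.
    """
    adv = task_adv.clone()
    for name, safety_adv in safety_advs.items():
        adv = adv - lagrange_multipliers[name] * safety_adv
    return adv


if __name__ == "__main__":
    model = MultiObjectiveActorCritic(obs_dim=8, n_actions=4,
                                      safety_objectives=["collision", "energy"])
    obs = torch.randn(16, 8)
    logits, task_v, safety_vs = model(obs)
    # Pretend advantages for a batch of 16 transitions.
    adv = combined_advantage(torch.randn(16),
                             {k: torch.randn(16) for k in safety_vs},
                             {"collision": 1.5, "energy": 0.3})
    print(adv.shape)  # torch.Size([16])
```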
Forecasting the Economics of AGI (turn ‘fast/slow/big/etc’ into real numbers with units)
This is more “AI Impacts”-style work than you might be asking for, but I think it’s particularly well suited to clever folks who can look things up on the internet.
Identify vague terms in AI alignment forecasts, like the “fast” in “fast takeoff”, that can be operationalized
Come up with units that measure the quantity in question, and procedures for measurements that result in those units
Try applying traditional economic growth models, such as experience curves, to AI development, and see how well you can get things to fit (a toy curve-fitting sketch follows this list). This is much harder to do for AI than for making cars: is a single unit a single model trained? Maybe a single week of a researcher’s time? Is the cost decreasing in dollars, FLOPs, person-hours, or something else?
Sketch models of systems with feedback loops (here the system is the whole AI field), and inspect/explore which parts of the system might respond most to different variables (additional attention, new people, dollars, hours, public discourse, philanthropic capital, etc.)
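As a toy illustration of the experience-curve bullet, here is a sketch that fits a Wright's-law-style power law (cost falls by a constant fraction each time cumulative production doubles) with SciPy. The data points are synthetic placeholders just to show the mechanics; the real work is deciding what counts as a "unit" and what the cost is measured in, which this code deliberately leaves open.

```python
# Sketch: fit an experience curve, cost(n) = c0 * n ** (-alpha), to
# cost-per-unit data. The numbers below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def experience_curve(cumulative_units, c0, alpha):
    # Learning rate (cost drop per doubling) = 1 - 2 ** (-alpha)
    return c0 * cumulative_units ** (-alpha)

# Placeholder data: (cumulative "units" produced, cost per unit, arbitrary units)
cumulative = np.array([1, 2, 4, 8, 16, 32], dtype=float)
cost = np.array([100.0, 82.0, 66.0, 55.0, 44.0, 37.0])

(c0, alpha), _ = curve_fit(experience_curve, cumulative, cost, p0=[100.0, 0.2])
learning_rate = 1 - 2 ** (-alpha)
print(f"fitted c0={c0:.1f}, alpha={alpha:.3f}, "
      f"cost drop per doubling ~ {learning_rate:.1%}")
```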
Topics not important enough to make it into my first 30 minutes of writing:
Cross-disciplinary integration with other safety fields: what will and won’t work
Systems safety for organizations building AGI
Safety acceleration loops — how/where can good safety research make us better and faster at doing safety research
Cataloguing alignment failures in the wild, and creating a taxonomy of them
Anti-topics: things I would have put on here a year ago
Too late for me to keep writing, so I’m saving this for another time, I guess.
I’m available tomorrow to chat about these w/ the group. Happy to talk then (or later, in replies here) about any of these if folks want me to expand further.