Currently, newcomers to the field of AI Alignment often struggle to understand the ongoing work and individuals involved, as well as the assumptions, strengths, and weaknesses of each plan.
We believe AI-plans.com will be an easy, centralized way to discover and learn more about the most promising alignment plans.
The site is currently in Stage 1, functioning purely as a compendium. We are in the process of adding up to 1000 plans and the criticisms made against them so far. Further plans and criticisms can be added by users.
Projected benefits of Stage 1:
- Easy discovery of proposed plans and better understanding of their prevalent challenges.
(This is already showing promise: one researcher told us they found useful papers on the site, and multiple researchers have expressed interest, including Jonathan Ng, who has been helping us.)
Next, in Stage 2, we will introduce a scoring system for criticisms and a ranking system for plans. Plans will be ranked based on the cumulative scores of their criticisms. Criticism votes will be weighted, giving more influence to users who have submitted higher-scoring criticisms. Alignment researchers will have the option to link their AI-Plans account to accounts on research-relevant platforms (such as arXiv, OpenReview or the AI Alignment Forum) in order to start out with a slightly weighted vote (with mod approval).
Each new plan will start with 0 bounty, and criticisms of lower-bounty plans will earn the most points. That way, each new plan will have a lot of opportunity and incentive for criticism. More details here.
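To make the Stage 2 mechanics concrete, here is a minimal sketch of weighted criticism votes feeding a plan ranking. It is not the site's actual implementation. The names (`Criticism`, `voter_weights`, `rank_plans`), the weight formula (1 plus the positive raw score of a user's own criticisms), and the two-pass calculation are all assumptions made for illustration. It also assumes that a lower cumulative criticism score means a better rank, and it does not model the bounty mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Criticism:
    author: str
    plan: str
    # voter name -> +1 (upvote) or -1 (downvote)
    votes: dict = field(default_factory=dict)


def criticism_scores(criticisms, weights=None):
    """Score each criticism as the weighted sum of its votes (default weight 1.0)."""
    weights = weights or {}
    return [
        sum(weights.get(voter, 1.0) * direction for voter, direction in c.votes.items())
        for c in criticisms
    ]


def voter_weights(criticisms, raw_scores):
    """Hypothetical rule: weight = 1 + total positive raw score of the user's criticisms."""
    weights = {}
    for c, score in zip(criticisms, raw_scores):
        weights[c.author] = weights.get(c.author, 1.0) + max(score, 0.0)
    return weights


def rank_plans(criticisms):
    """Return plan names ordered from lowest to highest cumulative criticism score."""
    raw = criticism_scores(criticisms)              # pass 1: unweighted scores
    weights = voter_weights(criticisms, raw)        # derive voter weights from pass 1
    weighted = criticism_scores(criticisms, weights)  # pass 2: weighted scores
    totals = {}
    for c, score in zip(criticisms, weighted):
        totals[c.plan] = totals.get(c.plan, 0.0) + score
    # Assumption: fewer or weaker criticisms means a better (earlier) rank.
    return sorted(totals, key=lambda plan: totals[plan])
```

The two-pass approach sidesteps the circularity of scores depending on weights that themselves depend on scores. A real system would need to decide how to resolve that loop.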
Projected benefits of Stage 2:
- Incentivizes users to write high-quality criticisms.
- Facilitates identification of plans with significant weaknesses, supporting arguments against problematic plans.
- Allows newcomers to the field (including talented and untapped scientists and engineers) to see which companies have the least problematic plans.
After all, who would want to work for the lowest-ranked company on the leaderboard?
(I have spoken with the creator of aisafety.careers, who intends to integrate with our site.)
At Stage 3, in addition to everything from Stage 1 and 2, we plan to introduce monthly cash prizes for the highest ranking plan and for the users with the most criticism points that month.
Projected benefits of Stage 3:
- Supercharges the impact of Stage 2, attracting talented individuals who require a non-committal monetary incentive to engage with alignment research.
- Provides a heuristic argument for the difficulty of the problem: “There is money on the table if anyone can come up with a plan with fewer problems, yet no one has done so!”
What I’d like to ask LessWrong users:
What do you think could go wrong with this?
What would make you want to use the site/participate/contribute?
Re 2: Beware of https://www.lesswrong.com/posts/9kNxhKWvixtKW5anS/you-are-not-measuring-what-you-think-you-are-measuring.
Absolutely!
One of the reasons I’ve resisted adding tags, different ways of sorting, etc. (though they get brought up a lot) is that they could lead to the plans that are most attractive, understandable, or appealing at first glance getting the most attention.
It’s very important that a criticism’s points measure the validity of the criticism against the plan and not something else. That said, if two criticisms make the same point and one gets more points because it’s more readable, better said, or better organized, I think that would actually be good.
Some of the measures taken for this so far:
Criticisms do not have author attribution, so someone such as Musk, Yudkowsky, etc. can’t just post ‘this plan suxx, lmaooo~~’ and get a thousand points (we’re obviously working on a spam filter to catch obvious stuff like this).
Authors/posters of plans cannot vote on criticisms of their own plans (we’re also thinking about solutions to sock puppets)
Criticizers cannot vote on their own criticisms.
We’re thinking about having a system for measuring whether users are just voting for the same people: this could help with sock puppets and also voting circles. We’re working with TJ to integrate the EigenKarma Network, which I think may be able to help with this.
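As a rough illustration of the measures above (not the site's code, and not the EigenKarma Network), the two voting restrictions and a crude "always votes for the same people" signal could be sketched like this. The function names and the concentration metric are hypothetical.

```python
from collections import Counter


def can_vote(voter, criticism_author, plan_author):
    """A user may not vote on their own criticism or on criticisms of their own plan."""
    return voter != criticism_author and voter != plan_author


def vote_concentration(vote_log):
    """Map each voter to the share of their votes that went to a single author.

    vote_log is a list of (voter, criticism_author) pairs. A share near 1.0
    across many votes may hint at a sock puppet or voting circle; it is only
    a heuristic signal for moderators, not proof.
    """
    per_voter = {}
    for voter, author in vote_log:
        per_voter.setdefault(voter, Counter())[author] += 1
    return {
        voter: max(counts.values()) / sum(counts.values())
        for voter, counts in per_voter.items()
    }
```

A voter with only one or two votes will trivially show a high share, so any real check would also need a minimum vote count before flagging anyone.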
A lot of this is going to be stuff the average user never notices or sees. The goal is to make something that just works, by aggressively attacking the ways it might not work.
It’s very important to get the ‘root’ right in the karma system: to make sure the selection of the first few users, who might heavily influence the site’s direction, is done right. I’ve been doing a lot of red teaming of the ideas for this.
Currently, I’m making a rigorous test for prospective moderators, to make sure they understand what the hard and important parts of the alignment problem are, which I’ll be posting here and in other groups when it’s done. Dr. Roman Yampolskiy has also sent some papers on why he believes aligning an AGI/ASI is impossible; I will be integrating those as well.
Another problem is getting users. I’ve been emailing scientists whose papers we’ve added, both to avoid any copyright or stepping-on-toes problems and to generate interest and get feedback on the site. I’ve been very pleasantly surprised by many of the responses!!
Do you have any suggestions on improvements we could make or things we should be doing but haven’t thought of? I’d love to hear them!!
Suppose an outcome pump picks a random property, checks if papers with it Goodhart your points, and time-loops until it finds one. Do you think it would eventually find one? Unfortunately, optimization tries all properties in parallel, without even an outcome pump.
Treat hardness proofs (perpetual motion, NP, …) as neon tubes on the box to think outside of. Find any difference between the proven-hard problem and yours (usually exists!), then imagine leads that wouldn’t help on the proven-hard problem, leads you don’t get better at ruling out by knowing the existing proof.
To not fall to the dire kind of “adversary” that moves after you, don’t calculate a number.
Sorry, I think I have an idea of what you’re saying, but I’m not really sure. Do you mind elaborating? With a little less LessWrong lingo, please.
FYI: The link in the first line didn’t work for me (“Invalid URL: https://ai-plans.com”). This link works: https://www.ai-plans.com/
Thank you! Changed it to that!