Absolutely!
One of the reasons I’ve pushed back on the idea of tags, alternate sort orders, etc. (though they get brought up a lot) is that they could lead to the plans that are most attractive or most understandable at first glance getting the most attention.
It’s very important that a criticism’s points measure the validity of the criticism against the plan and not something else. That said, if two criticisms make the same point and one gets more points because it’s more readable or better organized, I’d actually consider that a good thing.
Some of the measures we’ve taken for this so far (there’s a rough sketch of how the voting checks might look after the list):
Criticisms do not have author attribution, so someone like Musk or Yudkowsky can’t just post ‘this plan suxx, lmaooo~~’ and get a thousand points (we’re obviously working on a spam filter to catch obvious stuff like this).
Authors/posters of plans cannot vote on criticisms of their own plans (we’re also thinking about solutions to sock puppets).
Criticizers cannot vote on their own criticisms.
We’re thinking about a system for measuring whether users are just voting for the same people over and over; this could help with both sock puppets and voting rings. We’re working with TJ to integrate the EigenKarma Network, which I think may be able to help with this.
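To make those checks concrete, here’s a minimal sketch of how the eligibility rules and a crude voting-ring signal might look server-side. Every name in it (`Plan`, `Criticism`, `can_vote_on_criticism`, `vote_concentration`) is a hypothetical illustration, not our actual code:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Plan:
    id: int
    author_id: int

@dataclass
class Criticism:
    id: int
    author_id: int  # hidden from readers, but the server still knows it
    plan: Plan

def can_vote_on_criticism(voter_id: int, criticism: Criticism) -> bool:
    """Server-side check run before a vote is counted."""
    if voter_id == criticism.author_id:
        return False  # criticizers can't vote on their own criticisms
    if voter_id == criticism.plan.author_id:
        return False  # plan authors can't vote on criticisms of their plans
    return True

def vote_concentration(recipients: list[int]) -> float:
    """Fraction of a voter's votes that go to their single most-voted
    author. Near 1.0 over many votes is a crude flag for sock puppets
    or voting rings (a signal to review, not a verdict)."""
    if not recipients:
        return 0.0
    return max(Counter(recipients).values()) / len(recipients)
```

The key point is that the anonymity only applies to readers: the server still knows who wrote what, so it can enforce these rules.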
A lot of this is going to be stuff the average user never notices or sees; the goal is to make something that just works, by aggressively attacking the ways it might not work.
It’s very important to get the ‘root’ of the karma system right: to make sure that the first few users, who may heavily influence the site’s direction, are chosen well. I’ve been doing a lot of red teaming of the ideas for this.
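To illustrate why the root matters so much, here’s a generic sketch of seed-anchored trust propagation (a personalized-PageRank-style pass over the vote graph). This is just the general idea, not EigenKarma’s actual algorithm, and all the names are hypothetical:

```python
def propagate_trust(votes: dict[int, dict[int, float]],
                    seeds: set[int],
                    damping: float = 0.85,
                    iters: int = 50) -> dict[int, float]:
    """votes[u][v] = total weight of u's upvotes on v's content.
    Trust flows along vote edges but always restarts at the vetted
    seed ('root') users. Mass from users with no outgoing votes is
    simply dropped, which is fine for a sketch."""
    users = set(seeds) | set(votes)
    for out in votes.values():
        users |= set(out)
    restart = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in users}
    trust = dict(restart)
    for _ in range(iters):
        nxt = {u: (1.0 - damping) * restart[u] for u in users}
        for u, out in votes.items():
            total = sum(out.values())
            if total <= 0:
                continue
            for v, w in out.items():
                nxt[v] += damping * trust[u] * (w / total)
        trust = nxt
    return trust

# Toy example: user 1 is a vetted seed; users 4 and 5 only upvote
# each other, so no trust ever reaches them.
votes = {1: {2: 3.0}, 2: {3: 1.0}, 4: {5: 10.0}, 5: {4: 10.0}}
print(propagate_trust(votes, seeds={1}))
```

Under a scheme like this, a ring of fresh accounts upvoting each other earns essentially nothing until someone connected to the root votes for them, which is exactly why a badly chosen root would poison everything downstream.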
Currently, I’m writing a rigorous test for prospective moderators, to make sure they understand the hard and important parts of the alignment problem; I’ll post it here and in other groups when it’s done. Dr. Roman Yampolskiy has also sent me some papers on why he believes aligning an AGI/ASI is impossible, and I’ll be integrating those as well.
Another problem is getting users. I’ve been emailing the scientists whose papers we’ve added, both to avoid any copyright/stepping-on-toes problems and to generate interest and get feedback on the site. I’ve been very pleasantly surprised by many of the responses!
Do you have any suggestions for improvements we could make, or things we should be doing but haven’t thought of? I’d love to hear them!
Suppose an outcome pump picks a random property, checks if papers with it Goodhart your points, and time-loops until it finds one. Do you think it would eventually find one? Unfortunately, optimization tries all properties in parallel, without even an outcome pump.
Treat hardness proofs (perpetual motion, NP, …) as neon tubes on the box to think outside of. Find any difference between the proven-hard problem and yours (usually exists!), then imagine leads that wouldn’t help on the proven-hard problem, leads you don’t get better at ruling out by knowing the existing proof.
To not fall to the dire kind of “adversary” that moves after you, don’t calculate a number.
Sorry, I think I have an idea of what you’re saying, but I’m not really sure. Do you mind elaborating? With a little less LessWrong lingo, please.