Yesterday I had a conversation with a person very much into cyborgism, and they told me about a particular path to impact floating around the cyborgism social network: Evals.
I really like this idea, and I have no clue how I didn’t think of it myself! It’s the obvious thing to do when you have a bunch of insane people (used as a term of affection & praise by me for such people) obsessed with language models, who are also incredibly good & experienced at getting the models to do whatever they want. I would trust these people red-teaming a model and telling us it’s safe more than the rigid, procrustean, and less-creative red-teaming I anticipate goes on at ARC-evals. Not that ARC-evals is bad! But basically everyone looks more rigid, procrustean, and less creative than the cyborgists I’m excited about!
@janus wrote a little bit about this in the final section here, particularly referencing the detection of situational awareness as a thing cyborgs might contribute to. It seems like a fairly straightforward thing to say that you would want the people overseeing AI systems to also be the ones who have the most direct experience interacting with them, especially for noticing anomalous behavior.
I just reread that section, and I think I didn’t recognize it the first time because I wasn’t thinking “what concrete actions is Janus implicitly advocating for here”. Though maybe I just have worse-than-average reading comprehension.
I have no idea if this is intended to be read as irony or not, and the ambiguity is delicious.
There now exist two worlds I must glomarize between.
In the first, the irony is intentional, and I say “wouldn’t you like to know”. In the second, it’s not, and I say “Irony? What irony!? I have no clue what you’re talking about”.
I think many people focus on research aimed at full automation, but I think it’s worth trying to think in the semi-automated frame as well when trying to come up with a path to impact. Obviously, it isn’t scalable, but for a while it may be more adequate than we’d think by default. In other words, cyborgism-enjoyers might be especially interested in those kinds of evals: capability measurements that are harder to pull out of the model through traditional evals, but easier to measure through some semi-automated setup.