TLDR
Submit ideas for “interesting evaluations” in the comments. The best one submitted by December 5th will get $50. All of them will be highly appreciated.
Motivation
A few of us (myself, Nuño Sempere, and Ozzie Gooen) have been working recently to better understand how to implement meaningful evaluation systems for EA/rationalist research and projects. This is important both for short-term use (so we can better understand how valuable EA/rationalist research is) and for long-term use (as in, setting up scalable forecasting systems on qualitative parameters). In order to understand this problem, we’ve been investigating evaluations specific to research and evaluations in a much broader sense.
We expect work in this area to be useful for a wide variety of purposes. For instance, even if Certificates of Impact eventually get used as the primary mode of project evaluation, purchasers of certificates will need strategies to actually do the estimation.
Existing writing on “evaluations” seems to be fairly domain-specific (focused only on education or nonprofits), one-sided (yay evaluations or boo evaluations), or both. This often isn’t particularly useful when trying to understand the potential gains and dangers of setting up new evaluation systems.
I’m now investigating a neutral history of evaluations, with the goal of identifying trends in what aids or hinders an evaluation system in achieving its goals. The ideal output of this stage would be an absolutely comprehensive list posted to LessWrong. While that is probably impractical, we can hopefully make the list comprehensive enough, especially with your help.
Task
Suggest an interesting example (or examples) of an evaluation system. For these purposes, evaluation means “a systematic determination of a subject’s merit, worth and significance, using criteria governed by a set of standards”, but if you think of something that doesn’t seem to fit, err on the side of inclusion.
Prize
The prize is $50 for the top submission.
Rules
To enter, submit a comment suggesting an interesting example below, before the 5th of December. This post is both on LessWrong and the EA Forum, so comments on either count.
Rubric
To hold true to the spirit of the project, we’re using a rubric to score this competition. Entries will be evaluated on the following criteria:
Usefulness/uniqueness of the lesson from the example
Novelty or surprise of the entry itself, for Elizabeth
Novelty of the lessons learned from the entry, for Elizabeth
Accepted Submission Types
I care about finding interesting things more than proper structure. Here are some types of entries that would be appreciated:
A single example in one of the categories already mentioned
Four paragraphs on an unusual exam and its interesting impacts
A babbled list of 104 things that vaguely sound like evaluations
Examples of Interesting Evaluations
We have a full list here, but below is a subset to not anchor you too much. Don’t worry about submitting duplicates: I’d rather risk a duplicate than miss an example.
Chinese Imperial Examination
Westminster Dog Show
Turing Test
Consumer Reports Product Evaluations
Restaurant Health Grades
Art or Jewelry Appraisal
ESG/Socially Responsible Investing Company Scores
“Is this porn?”
Legally?
For purposes of posting on Facebook?
Charity Cost-Effectiveness Evaluations
Judged Sports (e.g. Gymnastics)
Motivating Research
These are some of our previous related posts:
I guess you intend to classify the responses afterward to discover underexplored dimensions of evaluations. Anticipating that, I will just offer a lot of dimensions and examples thereof:
Evaluation of an attribute the subject can or cannot influence (weight vs. height)
Kind of evaluated attribute(s) - physical (weight), technical (gear ratio), cognitive (IQ, mental imagery), mental (stress), medical (ICD classification), social (number of friends), mathematical (prime), …
Abstractness of the evaluated attribute(s)
low: e.g. directly observable physical attributes like height;
high: requiring expert interpretation and judgment, e.g. the beauty of a proof
Evaluation with or without the subject’s knowledge—test in school vs. secret observation
Degree of Goodharting possible, or actually occurring, on the evaluation
Entanglement of the evaluation with the subject and evaluator
No relation between both—two random strangers, one assesses the other and moves on
Evaluator acts in a complex system, subject does not—RCT study of a new drug in mice
Both act in a shared complex system—employee evaluation by superior
Evaluation that is voluntary or not—medical check vs. appraisal on the job
Evaluation that is legal or not—secret observation of work performance is often illegal
Evaluation by the subject itself, another entity, or both together
Purpose of the evaluation—for decision making (which candidate to choose), information gathering (observing competitors or one’s own strengths), or quality assurance (do goods meet expectations)
Evaluation for the purpose of the subject, the evaluator, another party, or a combination—exams in school often serve all of these
Objectivity of the evaluation or of the underlying criteria
Degree of standardization or “acceptedness” of the criteria—SAT vs. ad-hoc questionnaire
Single (entry exams), repeated (test in school), or continuous evaluation (many technical monitoring systems)
Size of evaluated population—single, few, statistically relevant sample size, or all
Length of the evaluation
Effort needed for the evaluation
You can treat this submission as an evaluation of evaluations ;-)
Stress tests
Many systems get “spot-checked” by artificially forcing them into a rare but important-to-handle-correctly stressed state, under controlled conditions where more monitoring and recovery resources are available (or where the stakes are lower) than during a real instance of that state.
These exercises serve to practice procedures, yes, but they also serve to evaluate whether the procedures would be followed correctly in a crisis, and whether the procedures even work.
Drills
Fire/tornado/earthquake/nuclear-attack drills
Military drills (the kind where you tell everyone to get to battle stations, not the useless marching around in formation kind)
Large cloud computing companies I’ve worked at need to stay online despite the loss of a single computer or a single datacenter. They periodically check that these failures are survivable by directly powering off computers, disconnecting entire datacenters from the network, or simply running through a datacenter failover procedure from beginning to end to check that it works.
https://en.wikipedia.org/wiki/Stress_test_(financial)
A meeting quality score, as described in the patent referenced in this article (https://www.geekwire.com/2020/microsoft-patents-technology-score-meetings-using-body-language-facial-expressions-data/)
Two that are focused on critique rather than evaluation per se:
“the critical response process” is about useful critique in the arts: https://lizlerman.com/critical-response-process/
“best practices for code review”: https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/
Microsoft TrueSkill (a multiplayer Elo-like rating system; https://www.wikiwand.com/en/TrueSkill)
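For context, here is a minimal sketch of the classic Elo update that systems like TrueSkill generalize (TrueSkill itself tracks a full Gaussian belief, a mean and an uncertainty, over each player’s skill rather than a single number):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo rating update for player A.

    score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss;
    k controls how fast ratings move.
    """
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a)

# Example: a 1500-rated player upsets a 1700-rated player.
print(elo_update(1500, 1700, 1.0))  # ~1524.3
```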
I originally read this EA as “Evolutionary Algorithms” rather than “Effective Altruism”, which made me think of this paper on degenerate solutions to evolutionary algorithms (https://arxiv.org/pdf/1803.03453v1.pdf). One amusing example is shown in a video at https://twitter.com/jeffclune/status/973605950266331138?s=20
Some additional ideas: there’s a large variety of “loss functions” used in machine learning to score the quality of solutions. Some of the most popular are below; a good overview is at https://medium.com/udacity-pytorch-challengers/a-brief-overview-of-loss-functions-in-pytorch-c0ddb78068f7
* Mean Absolute Error (a.k.a. L1 loss)
* Mean squared error
* Negative log-likelihood
* Hinge loss
* KL divergence
* BLEU loss for machine translation (https://www.wikiwand.com/en/BLEU)
There’s also a large set of “goodness of fit” measures that evaluate the quality of a model, including simple things like r^2 but also more exotic tests to do things like compare distributions. Wikipedia again has a good overview (https://www.wikiwand.com/en/Goodness_of_fit)
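To make a couple of these concrete, here is a minimal numpy sketch of MAE and MSE from the list above, plus r^2 as a simple goodness-of-fit measure (the toy arrays are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error (L1 loss)
mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error

# r^2: fraction of variance in y_true explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)  # 0.5, 0.375, ~0.95
```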
One key factor in metrics is how the number relates to the meaning. We’d prefer metrics whose scales are meaningful to users, not arbitrary. I really liked one example I saw recently.
In discussing this point in a paper entitled “Arbitrary metrics in psychology,” Blanton and Jaccard (doi:10.1037/0003-066X.61.1.27) first point out that Likert scales are not so useful. They then discuss the (in)famous IAT, where the scale is a direct measurement of the quantity of interest, but note that: “The metric of milliseconds, however, is arbitrary when it is used to measure the magnitude of an attitudinal preference.” Therefore, when thinking about degree of racial bias, “researchers and practitioners should refrain from making such diagnoses until the metric of the IAT can be made less arbitrary and until a compelling empirical case can be made for the diagnostic criteria used.” They go on to discuss norming measures and looking at variance—but the base measure being used is not meaningful, so any transformation is of dubious value.
Going beyond that paper and looking at the broader literature on bias, we can come up with harder-to-measure but more meaningful measures of bias. Using the probability of hiring someone based on racially-coded names might be a more meaningful indicator—but probability is also not a clear indicator, and the use of names as a proxy obscures whether the measurement is picking up class rather than race. It’s also not clear how big an effect a given difference in probability makes, despite being directly meaningful.
A very directly meaningful measure of bias that is even easier to interpret is dollars. This is immediately meaningful: if a person pays a different amount for identical service, that indicates not only the existence but the magnitude of a bias. Of course, evidence of pay differentials is a very indirect and complex question, but there are better ways of getting the same information in less problematic contexts. Evidence can still be direct: for example, how much people bid for watches in photos where the watch is on a black or a white person’s wrist is a much more direct and useful way to understand how much bias is being displayed.
See also: https://twitter.com/JessieSunPsych/status/1333086463232258049
Oh man, I wish you’d come in under the deadline.
For people who don’t feel like clicking: it’s a quantification of behavior predicted by different scores on Big-5.
IT security auditing; e.g. https://safetag.org/guide/
“Postmortem culture” from the Google SRE book: https://sre.google/sre-book/postmortem-culture/
This book has some other sections that are also about evaluation, but this chapter is possibly my favorite chapter from any corporate handbook.
I have a go-to evaluation system for finding the best-ROI items in a brainstormed list among team members. First we generate the list, which ends up with, say, three dozen items from the 6 of us. Then we name a reasonably small but large-enough number, like 10. Everyone may put 10 stars next to items, at most 2 per item, for any reason they like, including “this would be best for my morale”. Sort, pick the top three to use. Any surprises? Discuss them. (Modify the numbers 10, 2, and 3 as appropriate.)
This evaluation system is simple to implement in many contexts, easily understood without much explanation at all, fast, and produces perfectly acceptable if not necessarily optimal results. It is pretty decent at grabbing info from folks’ intuitions without requiring them to introspect enough to make those intuitions explicit.
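As a rough illustration of the tally step (the ballots, names, and items here are all hypothetical), the bookkeeping fits in a few lines of Python:

```python
from collections import Counter

# Each person distributes 10 stars across the brainstormed items, max 2 per item.
ballots = {
    "alice": {"idea_a": 2, "idea_c": 2, "idea_d": 2, "idea_f": 2, "idea_h": 2},
    "bob":   {"idea_a": 2, "idea_b": 2, "idea_c": 1, "idea_e": 2, "idea_g": 2, "idea_h": 1},
}

totals = Counter()
for stars in ballots.values():
    assert sum(stars.values()) == 10 and max(stars.values()) <= 2
    totals.update(stars)  # Counter adds counts from a mapping

top_three = totals.most_common(3)  # adopt these; discuss any surprises
print(top_three)
```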
Current open market price of an asset
Public, highly liquid markets for assets create lots of information about the value of those assets, which is extremely useful for both individuals and firms that are trying to understand:
the state of their finances
how successful a venture dealing in those assets has been
whether to accept a deal (a financial transaction, or some cooperative venture) involving those assets
(if the assets are stock in some company) how successful the company has been so far
Advanced Placement tests
SAT subject tests
Preregistration of studies
Scaling ‘Laws’
Raven’s Progressive Matrices
Welsh Figure Preference Test
Ruleset evolution in speedrunning as an example of a self-policing community.
In the news today: CASP (Critical Assessment of protein Structure Prediction)
I see “property assessment” on the list, but it’s worth calling out self-assessment specifically (where the owner has to sell their property if offered their self-assessed price).
Then there are those grades organizations give politicians. And media endorsements of politicians. And, for that matter, elections.
Keynesian beauty contests.
And it seems worth linking to this prior post (not mine): https://www.lesswrong.com/posts/BthNiWJDagLuf2LN2/evaluating-predictions-in-hindsight
My posts here are basically all evaluations or considerations useful for cost-effectiveness evaluations. They are crossposted from the EA Forum. The most interesting ones for your purpose are probably:
- A general framework for evaluating aging research. Part 1: reasoning with Longevity Escape Velocity
- Why SENS makes sense
- Evaluating Life Extension Advocacy Foundation