Part of it may be that current LLMs aren’t very agentic. If you give them a specific question, they often come up with a very good answer. But give them an open-ended request like “write an article for LessWrong”, and they flounder.
I think you don’t need a lot of agency to write comments on LessWrong. I imagine an algorithm like this:
1. After an article is posted, wait a random interval between 15 minutes and 24 hours.
2. Read the article and the existing comments.
3. Prepare the comment that you believe would get the most karma (you can learn from existing LW comments and their karma).
4. Think again about how much karma your prepared comment would probably get. In your estimate, account for the fact that LW readers do not like comments that seem to be written by an LLM.
5. If the expected karma is 5 or more, post the comment. Otherwise, do not comment on this article.
I imagine that the “learn from existing LW comments and their karma” part would be the difficult one, because the more comments you need to process, the more work it is. But otherwise this seems relatively simple (a rough sketch in code below); I am curious how well such an algorithm would be received.
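For concreteness, here is a minimal Python sketch of that loop, under heavy assumptions: `article`, `llm`, and `karma_model` are hypothetical stand-ins for a site client, a language model, and a karma predictor trained on existing LW comments and their scores; none of them refer to real APIs.

```python
import random
import time

KARMA_THRESHOLD = 5  # only post if the expected karma is 5 or more


def maybe_comment(article, llm, karma_model):
    """One pass of the commenting policy sketched above.
    `article`, `llm`, and `karma_model` are hypothetical interfaces."""
    # 1. Wait a random interval between 15 minutes and 24 hours.
    time.sleep(random.uniform(15 * 60, 24 * 60 * 60))

    # 2. Read the article and the existing comments.
    context = article.text + "\n\n" + "\n\n".join(c.text for c in article.comments)

    # 3. Draft the comment expected to earn the most karma
    #    (the model is assumed to have learned from LW comments and their karma).
    draft = llm.draft_comment(context)

    # 4. Re-estimate the expected karma, including the penalty readers
    #    apply to comments that sound LLM-written.
    expected_karma = karma_model.predict(draft, context)

    # 5. Post only if the estimate clears the threshold; otherwise stay silent.
    if expected_karma >= KARMA_THRESHOLD:
        article.post_comment(draft, label="LLM-generated")  # clear LLM warning label
```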
This should probably only be attempted with a clear and prominent warning that it’s an LLM-authored comment. Because LLMs are good at matching style without matching the content, it could end up exploiting heuristics that users have calibrated only for human levels of honesty / reliability / non-bullshitting.
Also check this comment about how conditioning on the karma score can give you hallucinated strong evidence:
https://www.lesswrong.com/posts/PQaZiATafCh7n5Luf/gwern-s-shortform?commentId=smBq9zcrWaAavL9G7
Okay, that pretty much ruins the idea.
Makes me think: what about humans who do the same thing? But the difference is probably that humans have to build their credibility over time, and if someone new posted an unlikely comment, they would be called out on it.
It’s really hard for humans to match the style / presentation / language without putting a lot of work into understanding the target of the comment. LLMs are inherently worse (right now) at doing the understanding, coming up with things worth saying, and being calibrated about criticism, AND they are a lot better at just imitating the style.
This just invalidates some side signals humans habitually use on one another.
Presumably, this happens: https://slatestarcodex.com/2016/12/12/might-people-on-the-internet-sometimes-lie/
I do often notice how the top-upvoted Reddit comment in big subs is confidently wrong, with a more correct/nuanced take sitting much lower.
This seems quite technologically feasible now, and I expect the outcome would mostly depend on the quality and care that went into the specific implementation. I am even more confident that if the bot’s comments get further tuned via feedback, so that initial flaws get corrected, then the bot would quickly (after a few hundred such feedbacks) get ‘good enough’ to pass most people’s bars for inclusion.
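To make the feedback-tuning step concrete, here is one minimal, hypothetical way the bot’s karma estimates could be recalibrated against actual outcomes; the class and its fields are purely illustrative, not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackTuner:
    """Hypothetical feedback loop: after each posted comment, record the karma
    it actually received and nudge a correction term so that future
    expected-karma estimates are calibrated against real outcomes."""
    correction: float = 0.0     # additive bias applied to raw predictions
    learning_rate: float = 0.1
    history: list = field(default_factory=list)

    def calibrated_estimate(self, raw_prediction: float) -> float:
        return raw_prediction + self.correction

    def record_outcome(self, raw_prediction: float, actual_karma: float) -> None:
        # Shift the correction toward the observed prediction error.
        error = actual_karma - self.calibrated_estimate(raw_prediction)
        self.correction += self.learning_rate * error
        self.history.append((raw_prediction, actual_karma))
```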
We should have empirical evidence about this, actually, since the LW team has been experimenting with a “virtual comments” feature. @Raemon, the EDT issue aside, were the comments any good if you forgot they were written by an LLM? Can you share a few (or preferably a lot of) examples?
It’s been a long time since I looked at virtual comments, as we never actually merged them in. IIRC, none were great, but sometimes they were interesting (in a “bring your own thinking” kind of way).
They were implemented as a Turing test, where mods would have to guess which was the real comment from a high karma user. If they’d been merged in, it would have been interesting to see the stats on guessability.
@kave @habryka