An obvious thing to have would be a very easy “flag” button that a user can press if they receive a DM, and if they press that we can look at the DM content they flagged, and then take appropriate action. That’s still kind of late in the game (I would like to avoid most spam and harassment before it reaches the user), but it does seem like something we should have.
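As a rough illustration (a hypothetical sketch, not LessWrong’s actual code; the types and in-memory queue are assumptions), the flag workflow could be as simple as:

```typescript
// Hypothetical sketch of the flag workflow: nothing is inspected unless a
// recipient flags it. Types and the in-memory queue are illustrative.
interface FlagReport {
  messageId: string;
  flaggedBy: string; // recipient who pressed the button
  flaggedAt: Date;
  resolved: boolean;
}

const flagQueue: FlagReport[] = [];

// Called when a recipient presses the "flag" button on a DM.
function flagMessage(messageId: string, recipientId: string): void {
  flagQueue.push({
    messageId,
    flaggedBy: recipientId,
    flaggedAt: new Date(),
    resolved: false,
  });
}

// Moderators review only the flagged messages, then mark them resolved.
function pendingFlags(): FlagReport[] {
  return flagQueue.filter((f) => !f.resolved);
}
```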
I wonder if you could also do something like, have an LLM evaluate whether a message contains especially-private information (not sure what that would be… gossip/reputationally-charged stuff? sexually explicit stuff? planning rebellions? doxxable stuff?), and hide those messages from admin view while leaving the others visible.
Though maybe that’s unhelpful because spambot authors would just create messages that trigger these filters?
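For concreteness, here’s a minimal TypeScript sketch of that filter-in idea, assuming some LLM classifier is available; `classifySensitivity`, its categories, and the redaction format are all illustrative, not a real API:

```typescript
// Hypothetical filter-in gate for admin review: sensitive DMs are redacted.
type SensitivityVerdict = { sensitive: boolean; category?: string };

// Stand-in for an LLM call; prompt and categories are illustrative, e.g.
// "Does this message contain reputationally charged, sexually explicit,
// or doxxable content? Answer yes/no with a category."
async function classifySensitivity(text: string): Promise<SensitivityVerdict> {
  throw new Error("wire up an LLM provider here");
}

// What an admin would see: the content only if it is NOT classified as
// especially private; otherwise a redacted stub.
async function adminViewOfMessage(text: string): Promise<string> {
  const verdict = await classifySensitivity(text);
  return verdict.sensitive
    ? `[hidden: classified as ${verdict.category ?? "sensitive"}]`
    : text;
}
```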
This is going the wrong direction. If privacy from admins is important (I argue that it’s not for LW messages, but that’s a separate discussion), then breaches of privacy should be exceptions made for specific purposes, not the default for everything short of “really secret contents”.
Don’t make this filter-in for privacy. Make it filter-out: if a message is detected as likely-spam, THEN take more intrusive measures. Privacy-preserving measures include quarantining, asking a few recipients whether they consider it harmful before delivering (or not delivering) the rest, automated content filters, etc. This infrastructure requires a fair bit of data-handling work to get right, plus an appeals process where a sender can find out they’re blocked and explicitly ask the moderator(s) to allow the message.
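A minimal sketch of this filter-out routing, with the spam scorer, thresholds, and decision labels all as illustrative assumptions:

```typescript
// Hypothetical filter-out routing: only likely-spam gets intrusive handling.
type Decision = "deliver" | "quarantine" | "block";

interface SpamScorer {
  score(text: string): Promise<number>; // 0 = clean, 1 = certainly spam
}

async function routeMessage(
  scorer: SpamScorer,
  text: string,
  quarantineThreshold = 0.5, // assumed cutoffs; would need tuning
  blockThreshold = 0.95,
): Promise<Decision> {
  const s = await scorer.score(text);
  if (s >= blockThreshold) return "block"; // sender is told and can appeal to mods
  if (s >= quarantineThreshold) return "quarantine"; // hold; ask a few recipients first
  return "deliver"; // never inspected by a human
}
```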
The reason I suggest making it filter-in is that it seems to me it’s easier to make a meaningful filter that accurately detects a lot of sensitive stuff than one that accurately detects spam, because “spam” is kind of open-ended. Or I guess in practice spam tends to be porn bots and crypto scams? (Even on LessWrong?!) But e.g. truly sensitive talk seems disproportionately likely to involve cryptography and/or sexuality, so trying to filter for porn bots and crypto scams seems relatively likely to reveal sensitive stuff.
The filter-in vs. filter-out distinction in my proposal is not so much about the degree of visibility. You could guard my filter-out proposal with the other filter-in proposals, e.g. only show metadata and only inspect suspected spammers, rather than making message content available to everyone.
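Concretely, layering the two might look like this hypothetical sketch: moderators see metadata only, and content becomes inspectable only once a message crosses a spam-suspicion threshold (types and thresholds are assumptions):

```typescript
// Hypothetical layering of the two proposals: metadata-only by default,
// content inspectable only for suspected spammers.
interface MessageMeta {
  senderId: string;
  recipientId: string;
  sentAt: Date;
  lengthChars: number;
}

interface StoredMessage {
  meta: MessageMeta;
  content: string;
  spamScore: number; // e.g. from the routing sketch above
}

const INSPECTION_THRESHOLD = 0.5; // assumed cutoff for "suspected spammer"

// Default moderator view is metadata only; content is included only when
// the message already crossed the spam-suspicion threshold.
function moderatorView(msg: StoredMessage): MessageMeta & { content?: string } {
  return msg.spamScore >= INSPECTION_THRESHOLD
    ? { ...msg.meta, content: msg.content }
    : { ...msg.meta };
}
```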