Whistleblowing Twitter Bot
In this post, I propose an idea that could improve whistleblowing efficiency and thus, hopefully, improve AI Safety by helping unsafe practices get discovered marginally faster.
I’m looking for feedback, ideas for improvement, and people interested in making it happen.
It has been proposed before that it’s beneficial to have an efficient and trustworthy whistleblowing mechanism. The technology that makes it possible has become easy and convenient. For example, here is Proof of Organization, built on top of ZK Email: a message board that allows people owning an email address at their company’s domain to post without revealing their identity. And here is an application for ring signatures using GitHub SSH keys that allows creating a signature proving that you own one of the keys from any subgroup you define (e.g., EvilCorp repository contributors).
However, as one may have guessed, it hasn’t been widely used. Hence, when the critical moment arrives, the whistleblower may not be aware of such technology, and even if they were, they probably wouldn’t trust it enough to use it. I think trust comes either from the code being audited by a well-established and trusted entity or, more commonly, from practice (e.g., I don’t need to verify that a certain password manager is secure if I know that millions of people use it and no password breaches have been reported).
Hence, I have been considering how to make a privacy-preserving communication tool that would be commonly used, thereby demonstrating its legitimacy and becoming trusted.
The best idea I have so far is to create a set of Twitter bots, one for each interesting company (or community), where only the people in question could post. Depending on the particular bot, access could be gated by ownership of a LinkedIn account, an email address at the company’s domain, or, e.g., an LW/AI Alignment Forum account of a certain age.
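To make the gating concrete, here is a minimal Python sketch of the decision each bot would make; the proof types, field names, and domain are hypothetical, and the actual cryptographic verification (ZK Email, ring signatures) is abstracted into a single boolean:

```python
# Minimal sketch (all names hypothetical): the bot only needs to know *that* a
# valid membership proof was presented, never *who* presented it. The actual
# cryptography (ZK Email, ring signatures) is abstracted into the `valid` flag.
from dataclasses import dataclass

@dataclass
class MembershipProof:
    kind: str    # e.g. "zk-email-domain", "ring-signature", "forum-account-age"
    claim: str   # the group the prover claims to belong to, e.g. "evilcorp.com"
    valid: bool  # result of verifying the proof cryptographically

def can_post(bot_group: str, proof: MembershipProof) -> bool:
    """Accept an anonymous post only if the proof verifies and the claimed
    group matches the group this particular bot is dedicated to."""
    return proof.valid and proof.claim == bot_group

# A bot dedicated to EvilCorp employees:
print(can_post("evilcorp.com", MembershipProof("zk-email-domain", "evilcorp.com", True)))  # True
print(can_post("evilcorp.com", MembershipProof("zk-email-domain", "other.com", True)))     # False
```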
I imagine this could become viral and interesting in gossipy cases, like the Sam Altman drama or the Biden dropout drama.
Some questions that came up during consideration:
How should content moderation be handled (if everything gets posted, anyone could deliberately post profanity to get the bot banned)?
I would moderate aggressively myself and replace moderated posts with a link to a separate website where all posts get through.
How do we balance convenience and privacy?
I’d make a hosted, open-source tool, which I expect most people would feel comfortable using for any gossip case that doesn’t put their job on the line, with instructions available to download it, run it locally, and submit posts through Tor, etc., for cases where such effort is warranted (see the Tor submission sketch after these questions).
What if people use this tool to make false accusations?
I do think this is a real downside, but I hope that the benefits of the tool would outweigh it.
What if someone creates a fake dialogue, pretending to be two people debating a topic?
Although it’s technically possible to make a tool that would allow proving that you have not posted before, this functionality shouldn’t exist, since otherwise one could be forced to produce such a proof or confess. It is a thing to be aware of, but not too much of a problem, in my opinion.
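As referenced above, here is a minimal sketch of the “run locally, submit through Tor” path. It assumes a Tor daemon listening on the default SOCKS port and requests installed with SOCKS support; the relay URL and payload format are hypothetical:

```python
# Sketch of the "run locally, submit through Tor" path mentioned above.
# Assumes a Tor daemon listening on the default SOCKS port 9050 and requests
# installed with SOCKS support (pip install "requests[socks]"). The relay URL
# and payload format are hypothetical.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: DNS resolution also goes through Tor
    "https": "socks5h://127.0.0.1:9050",
}

def submit_post_via_tor(text: str, proof: dict) -> int:
    """Send a post plus its membership proof to the hosted relay over Tor."""
    resp = requests.post(
        "https://whistle-relay.example.org/api/post",  # hypothetical endpoint
        json={"text": text, "proof": proof},
        proxies=TOR_PROXIES,
        timeout=60,
    )
    return resp.status_code
```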
I’m curious to learn what others think and about other ideas for making a gossip/whistleblower tool that could become widely known and trusted.
I think it would be good to automate the moderation process. Current LLMs should be able to decide whether a post contains the kind of profanity that would lead to account bans.
I agree, though I think it would be a very ridiculous own-goal if e.g. GPT-4o decided to block a whistleblowing report about OpenAI because it was trained to serve OpenAI’s interests. I think any model used by this kind of whistleblowing tool should be open-source (nothing fancy / more dangerous than what’s already out there), run locally by the operators of the tool, and tested to make sure it doesn’t block legitimate posts.
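A hedged sketch of what that locally run moderation gate could look like; the model name and threshold are examples only, and the operators would need to audit and test whichever open classifier they actually deploy:

```python
# Sketch of a locally run moderation gate. The model name and threshold are
# examples, not recommendations; operators would audit whichever open
# classifier they actually deploy. Requires: pip install transformers torch
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="unitary/toxic-bert",  # example of an open toxicity classifier
    top_k=None,                  # return a score for every label, not just the top one
)

def should_hold_for_review(post: str, threshold: float = 0.8) -> bool:
    """Flag a post for manual review if any toxicity label scores above the
    threshold. Flagged posts are replaced by a link to the unmoderated mirror,
    not silently deleted."""
    scores = classifier([post])[0]
    return any(label["score"] >= threshold for label in scores)

# In the spirit of "tested to make sure it doesn't block legitimate posts":
print(should_hold_for_review("Our lab skipped the agreed-upon safety evaluations before release."))
# Expected: False, since the report itself contains no profanity.
```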
I can also unblock posts manually at any point and keep the full uncensored log of posts on a blockchain.
My gut instinct is that this would have been a fantastic thing to create 2-4 years ago. My biggest hesitation is that the probability a tool like this decreases existential risk is proportional to the fraction of lab researchers who know about it, and adoption can be a slow, hard thing to make happen. I still think that this kind of program could be incredibly valuable under the right circumstances, so someone should probably be working on it.
Also, I have a very amateurish security question: if someone provides their work email to verify their authenticity with this tool, can their employer find out? For example, I wouldn’t put it past OpenAI to check if an employee’s email account got pinged by this tool and then to pressure / fire that employee.
Thanks for sharing your opinion. Regarding security: using the full body of an existing email, you can generate a zero-knowledge proof with an offline tool (since all emails are hashed and signed by the sending email server). No new emails need to be exchanged.
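For intuition on why no fresh interaction with the employer’s mail server is needed, here is a sketch of the ordinary DKIM check that ZK Email’s proofs build on; the dkimpy library and the file name are illustrative, and the zero-knowledge part that keeps the email content private is not shown:

```python
# Sketch of the ordinary DKIM check that ZK Email builds on. Any email already
# in your inbox was hashed and DKIM-signed by the sending mail server, so its
# authenticity can be checked from the saved .eml file without contacting the
# employer again. The real tool wraps this check in a zero-knowledge circuit so
# the email content stays private; that part is not shown here.
# Requires: pip install dkimpy
import dkim

def email_is_authentic(path_to_eml: str) -> bool:
    """Verify the DKIM signature of a saved raw email (.eml) file."""
    with open(path_to_eml, "rb") as f:
        raw = f.read()
    # The signer's public key is fetched from DNS (it can also be cached or
    # supplied locally), but no new email is ever sent or received.
    return dkim.verify(raw)

print(email_is_authentic("welcome_to_evilcorp.eml"))  # hypothetical file name
```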