This is interesting work, and I appreciate you taking the time to compile and share it.
I think it will be much more difficult for a model to successfully blackmail anyone than to successfully harass them. Humans are limited in their ability to harass a single target by time and effort more than anything—nonspecific death threats and vitriol require little to no knowledge of the target beyond a surface level. These models could churn out countless variations of this sort of attack relentlessly, which could certainly detrimentally affect someone’s mental health/wellbeing to an equal or greater extent than similar human attacks.
However, in the case of traditional blackmail, the key component of fear is that the attacker will publicize something generally unknown, which is not a strong point of current LLMs. Blasting negative public information everywhere could still be detrimental, especially to someone not inured to such attacks (e.g. a non-celebrity), but I view this as having a low ceiling of efficacy based on current capabilities. LLMs scrape public knowledge. A malicious AI agent would first have to acquire information that is hidden, which means targeting the right person or people with the right threats/bribes to obtain that information. Establishing those connections would be incredibly difficult as well as both time- and resource-intensive.
Alternatively, the LLM could trick the target directly into saying or doing something compromising. This second route is, in my view, much more dangerous and already possible with the current state of LLMs. A refined LLM that emulates a “lifelike” AI romantic partner could be used by a bad actor to catfish someone into sending nude pictures or other compromising information with little adjustment. Spitballing here: these attacks could be shotgunned to several targets without the time investment a human catfisher would need. Then, the model could theoretically alert a human organizer when a sensitive point is reached in a conversation to seal the deal, so to speak.
Effective attacks like this are much closer on the horizon than the sort of blackmail utilized by Commander in this post, based on current capabilities. I would be curious to know your thoughts on this and whether this is something we’re seeing an uptick in at all.
Thanks for the task ideas. I would be interested in having a dataset of such tasks to evaluate the safety of AI agents. About blackmail: because it is highly scalable, Commander could sometimes just hit the right person by chance. It could make an educated guess that a professor might be really worried about sexual harassment allegations, for example; maybe the professor did in fact behave inappropriately in the past. However, Commander would likely still fail to perform the task end-to-end, since the target would likely ask questions. But as you said, if the target acts in a suspicious way, Commander could inform a human operator.