Something I haven’t yet personally observed in threads on this broad topic is the difference in risk modeling from the perspective of the potential malefactor. You note that outside a hackathon context, one could “take a biology class, read textbooks, or pay experienced people to answer your questions”—but that last option in particular carries some big-feeling risks. What happens if the experienced person catches onto what you’re trying to do, stops answering questions, and alerts someone? The biology class is more straightforward, but it still involves the risky-feeling actions of talking to people and committing yourself in ways that leave a trail. The textbooks carry the lowest risk of those options, but they also require a lot more intellectual work to get from the base knowledge to the synthesized form.
This restraining effect comes only partly in the form of real social risks of doing things that look ‘hinky’, and much more immediately in the form of psychological barriers thrown up by merely imagining those risks. People who are of the mindset to attempt competent social engineering attacks often report them being surprisingly easy, but most people are not master criminals and reflexively shy away from doing things that feel suspicious.
When we move to the LLM-encoded knowledge side of things, we get a different risk profile. Using a centralized, interface-access-only LLM involves some social risk to a malefactor via the possibility of surveillance, especially if the surveillance itself involves powerful automatic classification systems. Content policy violation warnings in ChatGPT are a very visible example of this; many people have of course posted about how to ‘jailbreak’ such systems, but it’s also possible that there are other hidden tripwires.
For a published-weights LLM being run on local, owned hardware through generic code that’s unlikely to contain relevant hidden surveillance, the social risk of experimenting drops to negligible levels, and someone who understands the technology well enough may also understand this instinctively. Getting a refusal when you haven’t de-safed the model enough isn’t potentially making everyone around you more suspicious or adding to a hidden tripwire counter somewhere in a Microsoft server room. You get unlimited retries that are punishment-free from this psychological risk-modeling perspective, and they stay punishment-free pretty much up until the point where you start executing on a concrete plan for harm in other ways that are likely to leave suspicious ripples.
Structurally this feels similar to untracked proliferation of other mixed-use knowledge or knowledge-related technology, but it seems worth having the concrete form written out here for potential discussion.
This is the main reason my intuition agrees with you that the accessibility of danger goes up a lot with a published-weights LLM. Emotionally, I also agree with you that it would be sad if this meant it were too dangerous to continue open distribution of such technology. I don’t currently have a well-formed policy position based on any of that.
The vast majority of the risk seems to lie on following through with synthesizing and releasing the pathogen, not learning how to do it, and I think open-source LLMs change little about that.