On reflection, I think you’re right. As long as we make sure we don’t spawn any adversaries in HCH, adversarial examples in this sense will be less of an issue.
I thought your linked HCH post was great btw—I had missed it in my literature review. This point about non-self-correcting memes

> But I do have some guesses about possible attractors for humans in HCH. An important trick for thinking about them is that attractors aren't just repetitious, they're self-repairing. If the human gets an input that deviates from the pattern a little, their natural dynamics will steer them into outputting something that deviates less. This means that a highly optimized pattern of flashing lights that brainwashes the viewer into passing it on is a terrible attractor, and that bigger, better attractors are going to look like ordinary human nature, just turned up to 11.

really impressed me w/r/t the relevance of the attractor formalism. I think what I had in mind in this project, just thinking from the armchair about possible inputs into humans, was exactly the seizure-lights example and its text analogues, so I updated significantly here.