I think adversarial examples are a somewhat misleading analogy for failure modes of HCH, and tend to think of them more like attractors in dynamical systems. Adversarial examples are almost ineradicable, and yet simultaneously not that important, because they probably won’t show up if there’s no powerful adversary searching over ways to mess up your classifier. Unhelpful attractors, on the other hand, are more prone to being wiped out by changes in the parameters of the system, but don’t require any outside adversary—they’re places where human nature is already doing the search for self-reinforcing patterns.

But I do have some guesses about possible attractors for humans in HCH. An important trick for thinking about them is that attractors aren’t just repetitious, they’re self-repairing. If the human gets an input that deviates from the pattern a little, their natural dynamics will steer them into outputting something that deviates less. This means that a highly optimized pattern of flashing lights that brainwashes the viewer into passing it on is a terrible attractor, and that bigger, better attractors are going to look like ordinary human nature, just turned up to 11.
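To make the "self-repairing" point concrete, here is a minimal toy sketch, not anything specific to HCH: I'm just modeling the pattern as a fixed point of a contracting 1-D map, and the target value and 0.5 gain are assumptions chosen purely for illustration. Each call to step() stands in for one round of the pattern being passed along, and deviations shrink instead of growing:

```python
# Toy illustration (an assumption, not a model of any real HCH setup):
# treat the pattern as a fixed point of a contracting map, so that small
# deviations decay with each round of transmission.

def step(x, target=1.0, gain=0.5):
    """One round of transmission: the output moves partway back toward the pattern."""
    return target + gain * (x - target)

x = 1.3  # an input that deviates from the pattern by 0.3
for n in range(6):
    print(f"round {n}: deviation = {abs(x - 1.0):.4f}")
    x = step(x)
# The deviation halves each round (0.3, 0.15, 0.075, ...), so the pattern
# repairs itself. If the gain were above 1, deviations would grow instead
# and the pattern would not be an attractor at all.
```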
On reflection, I think you’re right. As long as we make sure we don’t spawn any adversaries in HCH, adversarial examples in this sense will be less of an issue.

I thought your linked HCH post was great btw—I had missed it in my literature review. This point about non-self-correcting memes really impressed me w/r/t the relevance of the attractor formalism. I think what I had in mind in this project, just thinking from the armchair about possible inputs into humans, was exactly the seizure lights example and their text analogues, so I updated significantly here.