Can you be more specific about the daemons you’re thinking about? I had tried to argue that daemons wouldn’t occur under certain circumstances, or at least wouldn’t cause malign failures...
Which part are you referring to?
Anyway, I’m worried that a daemon will arise while searching for models which do a good job of predicting masked bits. As it was put here: “...we note that trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners.” Initially daemons seemed implausible to me, but then I thought of a few ways they could happen; I’m hoping to write posts about this before too long. I encourage others to brainstorm as well, so we can try & think of all the plausible ways daemons could get created.
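To make the worry concrete, here is a toy sketch (entirely illustrative; the data, hypothesis space, and names are made up, and a real system would be gradient descent over Transformer weights rather than a pick-the-best loop). The point is that the search procedure only ever sees predictive loss on masked bits, so it will select anything in its hypothesis space that happens to predict well.

```python
# Toy illustration only (made-up data and hypothesis space): "searching for
# models which do a good job of predicting masked bits" is just an optimization
# scored by predictive loss, so the search cannot tell an "innocent" predictor
# apart from anything else in the space that happens to predict well.
import random

random.seed(0)

def make_data(n_seqs=200, length=16):
    """Random bit sequences, each with one masked position to predict."""
    data = []
    for _ in range(n_seqs):
        bits = [random.randint(0, 1) for _ in range(length)]
        data.append((bits, random.randrange(length)))
    return data

def masked_bit_loss(predict, data):
    """Fraction of masked bits the candidate model gets wrong."""
    errors = sum(int(predict(bits[:pos] + bits[pos + 1:]) != bits[pos])
                 for bits, pos in data)
    return errors / len(data)

def select_model(hypothesis_space, data):
    """Pick whichever candidate predicts masked bits best -- and nothing more."""
    return min(hypothesis_space, key=lambda h: masked_bit_loss(h, data))

data = make_data()
hypothesis_space = [
    lambda ctx: 0,                             # always guess 0
    lambda ctx: 1,                             # always guess 1
    lambda ctx: int(2 * sum(ctx) > len(ctx)),  # majority vote over the context
]
best = select_model(hypothesis_space, data)
```

The selection criterion is predictive loss and nothing else, which is exactly why a sufficiently rich hypothesis space containing consequentialist reasoners is worrying.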
I started my own list of pathological things that might happen with self-supervised learning systems; maybe I’ll show you when it’s ready and we can compare notes...?
RE daemons, I wrote in that post (and have been assuming all along): “I’m assuming that we will not do a meta-level search for self-supervised learning algorithms… Instead, I am assuming that the self-supervised learning algorithm is known and fixed (e.g. “Transformer + gradient descent” or “whatever the brain does”), and that the predictive model it creates has a known framework, structure, and modification rules, and that only its specific contents are a hard-to-interpret complicated mess.” The contents of a world-model, as I imagine it, form a big data structure consisting of gajillions of “concepts” and “transformations between concepts”. It’s a passive data structure, and therefore not a “daemon” in the usual sense. Then there’s a KANSI (Known Algorithm Non-Self-Improving) system that’s accessing and editing the world model. I also wouldn’t call that a “daemon”; instead I would say “This algorithm we wrote can have pathological behavior...”
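As a purely illustrative sketch (I’m making up all the names and details here; this is not a real design), the division of labor I have in mind looks something like this: a passive world-model data structure, and a separate, fixed, hand-written algorithm that is the only thing reading or editing it.

```python
# Toy sketch only (invented names/structure): the world-model is a passive data
# structure -- "concepts" plus weighted "transformations between concepts" --
# and a KANSI-style algorithm with known, fixed rules is the only thing that
# reads or edits it. The data structure itself never executes anything.
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """Passive container: concepts and transformations between them."""
    concepts: set = field(default_factory=set)
    transformations: dict = field(default_factory=dict)  # (concept_a, concept_b) -> strength

class KANSIPredictor:
    """Known Algorithm, Non-Self-Improving: fixed code over the world model."""

    def __init__(self, model):
        self.model = model

    def observe(self, a, b, strength=1.0):
        # Known, fixed modification rule -- even if the accumulated contents
        # eventually become a hard-to-interpret complicated mess.
        self.model.concepts.update({a, b})
        key = (a, b)
        self.model.transformations[key] = self.model.transformations.get(key, 0.0) + strength

    def predict_next(self, concept):
        # Known, fixed query rule: follow the strongest outgoing transformation.
        outgoing = {b: w for (a, b), w in self.model.transformations.items() if a == concept}
        return max(outgoing, key=outgoing.get) if outgoing else None
```

On this picture, any pathological behavior lives in the fixed algorithm we wrote (or in what the messy contents cause it to do), not in some hidden optimizer running inside the data structure itself.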
Sure!
Ah, thanks for clarifying.
The first entry on my “list of pathological things” wound up being a full blog post in length: See Self-supervised learning and manipulative predictions.