The first moral that I’d draw is simple but crucial: If you’re trying to understand some phenomenon by interpreting some data, the kind of data you’re interpreting is key. It’s not enough for the data to be tightly related to the phenomenon, or downstream of the phenomenon, or enough to pin it down in the eyes of Solomonoff induction, or predictable only by understanding it. If you want to understand how a computer operating system works by interacting with one, it’s far, far better to interact with the operating system at or near the conceptual/structural regime at which the operating system is constituted.
What’s operating-system-y about an operating system is that it manages memory and caching, it manages CPU sharing between processes, it manages access to hardware devices, and so on. If you can read and interact with the code that talks about those things, that’s much better than trying to understand operating systems by watching capacitors in RAM flickering, even if the sum of RAM+CPU+buses+storage gives you a reflection, an image, a projection of the operating system, which in some sense “doesn’t leave anything out”. What’s mind-ish about a human mind is reflected in neural firing and rewiring, in that a difference in mental state implies a difference in neurons. But if you want to come to understand minds, you should look at the operations of the mind in descriptive and manipulative terms that center around, and fan out from, the distinctions that the mind makes internally for its own benefit. In trying to interpret a mind, you’re trying to get the theory of the program.
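To make the contrast concrete, here is a minimal sketch with made-up names (a toy RoundRobinScheduler and a hand-rolled byte encoding, not any real OS’s internals): the same scheduling state, described once in the terms the operating system itself works in, and once as raw memory contents.

```python
# Toy sketch: one piece of "operating system" state, viewed at two levels.
# RoundRobinScheduler and ram_snapshot are illustrative names, not a real OS API.
from collections import deque

class RoundRobinScheduler:
    """OS-level view: the structure the system keeps for its own benefit (a run queue)."""
    def __init__(self, pids):
        self.run_queue = deque(pids)

    def next_process(self):
        pid = self.run_queue.popleft()  # take the process at the head of the queue
        self.run_queue.append(pid)      # rotate it to the back
        return pid

sched = RoundRobinScheduler([101, 102, 103])
print([sched.next_process() for _ in range(5)])  # [101, 102, 103, 101, 102]

# RAM-level view: the very same state, flattened to bytes. Nothing is left out,
# but nothing here is scheduler-shaped unless you already know the encoding.
ram_snapshot = b"".join(pid.to_bytes(4, "little") for pid in sched.run_queue)
print(ram_snapshot.hex())  # correct, complete, and unilluminating
```

Both descriptions fix the same facts; only the first is stated in the distinctions the system itself uses.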
What is a defeater, and can you give some examples?
A thing that makes alignment hard / would defeat various alignment plans or alignment research plans.
E.g.s: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities#Section_B_
E.g. the things you’re studying aren’t stable under reflection.
E.g. the things you’re studying are at the wrong level of abstraction (SLT, interp, neuro) https://www.lesswrong.com/posts/unCG3rhyMJpGJpoLd/koan-divining-alien-datastructures-from-ram-activations
E.g. https://tsvibt.blogspot.com/2023/03/the-fraught-voyage-of-aligned-novelty.html
This just in: Alignment researchers fail to notice skulls from famous blog post “Yes, we have noticed the skulls”.
“E.g. the things you’re studying are at the wrong level of abstraction (SLT, interp, neuro)”
Let’s hear it. What do you mean here exactly?
From the linked post (see the excerpt at the top of this section):
You’ll have to be a little more direct to get your point across, I fear.
I am sensing you think mechinterp, SLT, and neuroscience aren’t at a high enough level of abstraction. I am curious why you think so and would benefit from understanding more clearly what you are proposing instead.
They aren’t close to the right kind of abstraction. You can tell because they use a low-level ontology, such that mental content, to be represented there, would have to be homogenized, stripped of mental meaning, and encoded. Compare trying to learn about arithmetic by explaining a calculator in terms of transistors vs. in terms of arithmetic. The latter is the right level of abstraction; the former is wrong (it would be right if you were trying to understand transistors, or trying to understand some further implementational aspects of arithmetic beyond the core structure of arithmetic).
What I’m proposing instead, is theory.
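To make the calculator contrast concrete, a minimal sketch (the gate-level helpers are made up for illustration): the same sum, stated once at the arithmetic level and once rebuilt from NAND gates, where “addition” never appears in the description.

```python
# Toy contrast: "explaining a calculator" at the arithmetic level vs. at a gate level.
# nand, xor, full_adder, add_bits are illustrative helpers, not any library's API.

# Arithmetic level: the structure we actually care about.
print(3 + 4)  # 7

# Gate level: the same addition rebuilt from NAND gates. Every fact about the sum
# is determined here too, but only as homogenized, encoded bit operations.
def nand(a, b):
    return 1 - (a & b)

def xor(a, b):
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def full_adder(a, b, carry):
    s = xor(a, b)
    return xor(s, carry), (a & b) | (s & carry)  # (sum bit, carry out)

def add_bits(x, y, width=4):
    carry, total = 0, 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= bit << i
    return total

print(add_bits(3, 4))  # 7 again, recovered only after decoding the bit-level story
```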
I think I disagree, or need some clarification. As an example, the phenomenon in question is that the physical features of children look more or less like combinations of the parents’ features. Is the right kind of abstraction a taxonomy and theory of physical features at the level of nose-shapes and eyebrow thickness? Or is it at the low-level ontology of molecules and genes, or is it in the understanding of how those levels relate to each other?
Or is that not a good analogy?
I’m unsure whether it’s a good analogy. Let me make a remark, and then you could reask or rephrase.
The discovery that the phenome is largely a result of the genome is of course super important for understanding, and also useful. The discovery of mechanically how the phenome results from the genome (transcription, splicing, translation, enhancers/promoters/silencers, trans-regulation, …) is separately important, and still ongoing. The understanding of “structurally how” characters are made, both in ontogeny and phylogeny, is a blob of open problems (evodevo, niches, …). Likewise, more simply, “structurally what”: how to even think of characters. Cf. Günter Wagner, Rupert Riedl.
I would say the “structurally how” and “structurally what” are most analogous. The questions we want to answer about minds aren’t like “what is a sufficient set of physical conditions to determine—however opaquely—a mind’s effects”, but rather “what smallish, accessible-ish, designable-ish structures in a mind can [understandably to us, after learning how] determine a mind’s effects, specifically as we think of those effects”. That is more like organology and developmental biology and telic/partial-niche evodevo (<-made up term but hopefully you see what I mean).
https://tsvibt.blogspot.com/2023/04/fundamental-question-what-determines.html
I suppose it depends on what one wants to do with their “understanding” of the system? Here’s one AI safety case I worry about: if we (humans) don’t understand the lower-level ontology that gives rise to the phenomenon that we are more directly interested in (in this case I think that’s something like an AI system’s behavior / internal “mental” states; your “structurally what”, if I’m understanding correctly, which to be honest I’m not very confident I am), then a sufficiently intelligent AI system that does understand that relationship will be able to exploit the extra degrees of freedom in the lower level ontology to our disadvantage, and we won’t be able to see it coming.
I very much agree that the “structurally what” matters a lot, but that seems like half the battle to me.
But somehow this topic is not afforded much care or interest. Some people will pay lip service to caring, others will deny that mental states exist, but either way the field of alignment doesn’t put much force (money, smart young/new people, social support) toward these questions. This is understandable, as we have much less legible traction on this topic, but that’s… undignified, I guess is the expression.
a sufficiently intelligent AI system that does understand that relationship will be able to exploit the extra degrees of freedom in the lower level ontology to our disadvantage, and we won’t be able to see it coming.

Even if you do understand the lower level, you couldn’t stop such an adversarial AI from exploiting it, or exploiting something else, and taking control. If you understand the mental states (yeah, the structure), then maybe you can figure out how to make an AI that wants to not do that. In other words, it’s not sufficient, and probably not necessary / not a priority.
This really clicked for me. I don’t blame you for making up the term because, although I can see the theory and examples of papers in that topic, I can’t think of a unifying term that isn’t horrendously broad (e.g. molecular ecology).
Ok. What would this theory look like, and how would it cash out into real-world consequences?
This is a derail. I can know that something won’t work without knowing what would work. I don’t claim to know something that would work. If you want my partial thoughts, some of them are here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
In general, there’s more feedback available at the level of “philosophy of mind” than is appreciated.
I think I am asking a very fair question.
What is the theory of change for your philosophy of mind cashing out into something with real-world consequences?
I.e. a training technique? Design principles? A piece of math? Etc.
All of those, sure? First you understand, then you know what to do. This is a bad way to do peacetime science, but seems more hopeful for a problem with a cruel deadline, one that requires understanding as-yet-unconceived aspects of Mind.
No, you’re derailing from the topic, which is the fact that the field of alignment keeps failing to even try to avoid / address major partial-consensus defeaters to alignment.
I’m confused why you are so confident in these “defeaters”, by which I gather you mean objections/counterarguments to certain lines of attack on the alignment problem.
E.g. I doubt it would be good if the alignment community were to outlaw mechinterp/SLT/neuroscience just because of some vague intuition that they don’t operate at the right abstraction.
Certainly, the right level of abstraction is a crucial concern, but I don’t think progress on this question will be made by blanket dismissals. People in these fields understand very well the problem you are pointing towards. Many people are thinking deeply about how to resolve this issue.
More than any one defeater, I’m confident that most people in the alignment field don’t understand the defeaters. Why? I mean, from talking to many of them, and from their choices of research.
People in these fields understand very well the problem you are pointing towards.

I don’t believe you.

if the alignment community were to outlaw mechinterp/SLT/neuroscience

This is an insane strawman. Why are you strawmanning what I’m saying?

I don’t think progress on this question will be made by blanket dismissals

Progress could only be made by understanding the problems, which can only be done by stating the problems, which you’re calling “blanket dismissals”.
Okay, seems like the commentariat agrees I am too combative. I apologize if you feel strawmanned.

Feels like we got a bit stuck. When you say “defeater”, what I hear is a very confident blanket dismissal. Maybe that’s not what you have in mind.
A defeater, in my mind, is a failure mode which, if you don’t address it, means you will not succeed at aligning sufficiently powerful systems.[1] It does not mean that work not focused on them is useless, but at some point you have to deal with the defeaters, and if the vast majority of people working towards alignment don’t get them clearly, and the people who do get them claim we’re nowhere near on track to find a way to beat the defeaters, then that is a scary situation.
This is true even if some of the work being done by people unaware of the defeaters is not useless, e.g. maybe it is successfully averting earlier forms of doom than the ones that require routing around the defeaters.
A defeater is not best considered as an argument against specific lines of attack, but as a problem which, if unsolved, leads inevitably to doom. People with a strong grok of a bunch of these often think that way more timelines are lost to “we didn’t solve these defeaters” than to the problems even plausibly addressed by the class of work being done by most of the field. This does unfortunately make it get used as (and feel like) an argument against those approaches by people who don’t (and don’t claim to) understand those approaches, but that’s not the generator or important nature of it.