Ok. What would this theory look like, and how would it cash out into real-world consequences?
This is a derail. I can know that something won’t work without knowing what would work. I don’t claim to know something that would work. If you want my partial thoughts, some of them are here: https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html
In general, there’s more feedback available at the level of “philosophy of mind” than is appreciated.
I think I am asking a very fair question.
What is the theory of change of your philosophy of mind cashing out into something with real-world consequences?
I.e. a training technique? Design principles? A piece of math? Etc.
All of those, sure? First you understand, then you know what to do. This is a bad way to do peacetime science, but seems more hopeful for a problem with a cruel deadline, one that requires understanding as-yet-unconceived aspects of Mind.
I think I am asking a very fair question.
No, you’re derailing from the topic, which is the fact that the field of alignment keeps failing to even try to avoid or address major partial-consensus defeaters to alignment.
I’m confused why you are so confident in these “defeaters”, by which I gather you mean objections or counterarguments to certain lines of attack on the alignment problem.
E.g. I doubt it would be good if the alignment community outlawed mechinterp/SLT/neuroscience just because of some vague intuition that they don’t operate at the right level of abstraction.
Certainly, the right level of abstraction is a crucial concern, but I don’t think progress on this question will be made by blanket dismissals. People in these fields understand very well the problem you are pointing towards. Many people are thinking deeply about how to resolve this issue.
More than any one defeater, I’m confident that most people in the alignment field don’t understand the defeaters. Why? I mean, from talking to many of them, and from their choices of research.
People in these fields understand very well the problem you are pointing towards.
I don’t believe you.
if the alignment community outlawed mechinterp/SLT/neuroscience
This is an insane strawman. Why are you strawmanning what I’m saying?
I don’t think progress on this question will be made by blanket dismissals
Progress could only be made by understanding the problems, which can only be done by stating the problems, which you’re calling “blanket dismissals”.
Okay, it seems like the commentariat agrees I am too combative. I apologize if you feel strawmanned.
Feels like we got a bit stuck. When you say “defeater”, what I hear is a very confident blanket dismissal. Maybe that’s not what you have in mind.
A defeater, in my mind, is a failure mode which, if you don’t address it, means you will not succeed at aligning sufficiently powerful systems.[1] That does not make work which isn’t focused on the defeaters useless, but at some point you have to deal with them; and if the vast majority of people working towards alignment don’t get them clearly, and the people who do get them claim we’re nowhere near on track to find a way to beat them, then that is a scary situation.
This is true even if some of the work being done by people unaware of the defeaters is not useless, e.g. maybe it is successfully averting earlier forms of doom than the ones that require routing around the defeaters.
A defeater is best considered not as an argument against specific lines of attack, but as a problem which, if unsolved, leads inevitably to doom. People with a strong grok of a bunch of these often think that way more timelines are lost to “we didn’t solve these defeaters” than to the problems even plausibly addressed by the class of work most of the field is doing. This unfortunately makes it get used as (and feel like) an argument against those approaches, by people who don’t understand, and don’t claim to understand, those approaches; but that’s not its generator or its important nature.