Nice project; there are several ideas in here that I think are great research directions. Some quick thoughts on what I’m excited about:
I like the general ideas of looking for more comprehensive consistency checks (as in the “Better representation of probabilities” section), connecting this to mechanistic interpretability, and looking for things other than truth we could try to discover this way. (I haven’t thought much about your specific proposals for these directions.)
Quite a few of your proposals are of the type “try X and see if/how that changes performance”. I’d be a bit wary of these, because I think they don’t really help resolve uncertainty about the most important open questions. If one of these increases performance by 5%, that doesn’t tell you much about how promising the whole DLK approach is in the long term, or what the most likely failure modes are. If something doesn’t increase performance, that also doesn’t tell you much about those questions.
Two exceptions to the previous point: (1) these types of experiments are pretty straightforward compared to more ambitious extensions, so I think they’re good if your main goal is to get more hands-on ML research experience. (2) Maybe you have some uncertainty about why/how the results in the paper are what they are, and making some small changes can give you evidence about that. This seems like an excellent thing to start with, assuming you have concrete things about the results you’re confused about. (Randomly trying things might also reveal surprising phenomena, but I’m much less sure that’s worth the time.)
So what could you aim for instead, if not improving performance a bit? I think one great general direction would be to look for cases where the current method just fails completely, and then work on solving the simplest such case you can find. Maybe the inverse scaling dataset is a good place to start, though I’d also encourage you to brainstorm other mechanistic reasons why the current method might go wrong, and then come up with cases where those might happen. (Example of what I mean: maybe “truth” isn’t encoded in a linearly detectable way, and once you make the probe more complex, your constraints aren’t enough anymore to nail down the truth concept in practice).
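To make that last example a bit more concrete, here’s a minimal sketch of what I mean by “making the probe more complex”, assuming the CCS setup from the paper (squared consistency plus confidence losses on contrast pairs). The probe architectures and hidden size are just illustrative, and in practice you’d also want the hidden-state normalization the paper uses:

```python
# Sketch only: a CCS-style loss with a linear vs. a higher-capacity probe.
# Architectures and sizes are illustrative, not from the original post.
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x).squeeze(-1)


class MLPProbe(nn.Module):
    """Higher-capacity probe: once the probe is this expressive, many more
    functions satisfy the consistency constraints below, so they may no
    longer single out a 'truth' feature in practice."""

    def __init__(self, d, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def ccs_loss(p_pos, p_neg):
    # Consistency: p(x+) should be close to 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: push away from the degenerate p(x+) = p(x-) = 0.5 solution.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```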
ETA: I think the “adding labeled data” idea is a good illustration of what I’m talking about. Imagine you have problems where the method currently doesn’t work at all. If even large amounts of supervised data don’t help much on these, this suggests your probe can’t find a truth encoding (maybe because you’d need a higher-capacity probe, or, if you already have that, because the optimization is difficult). On the other hand, if you get good performance with supervised data, it suggests that you need stronger consistency checks. You can then also try things like adding supervised data in only one domain and checking generalization, and you can expect a reasonably clear signal. But if you do all this on a dataset where the unsupervised method already works pretty well, then the only evidence you get is something like “does it improve performance by 2%, 5%, 10%, …?”, the signal is less clear, and it’s much harder to say which of these explanations a 5% improvement indicates. All that is in addition to the fact that finding cases which are difficult for the current method is really important in its own right.
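For concreteness, the “adding labeled data” variant could be as simple as adding a supervised term to the unsupervised CCS loss on whichever examples happen to have labels. This is just a sketch; the weight `lam` and the way I collapse each contrast pair to a single probability are arbitrary choices on my part, not anything from the paper:

```python
# Sketch of a semi-supervised CCS loss: unsupervised consistency/confidence
# terms on all contrast pairs, plus cross-entropy on a labeled subset.
# `labeled_idx`, `labels`, and the weight `lam` are hypothetical names.
import torch
import torch.nn.functional as F


def semi_supervised_ccs_loss(probe, h_pos, h_neg, labeled_idx, labels, lam=1.0):
    p_pos, p_neg = probe(h_pos), probe(h_neg)

    # Unsupervised CCS part, on every pair (labeled or not).
    unsup = ((p_pos - (1.0 - p_neg)) ** 2
             + torch.minimum(p_pos, p_neg) ** 2).mean()

    # Supervised part: average the pair's two predictions into one
    # probability of "true", then apply binary cross-entropy on the
    # labeled subset only.
    p_true = 0.5 * (p_pos[labeled_idx] + (1.0 - p_neg[labeled_idx]))
    sup = F.binary_cross_entropy(p_true, labels.float())

    return unsup + lam * sup
```

Setting `lam` to zero recovers the purely unsupervised method, so you can sweep it (and the amount of labeled data) to see how quickly supervision starts to matter on the hard cases.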
You can then also try things like adding supervised data in only one domain and checking generalization, and you can expect a reasonably clear signal.
Yep, I just had this idea this morning and came here to check if anyone else had thought of it. It seems plausible that a semi-supervised version of CCS could outperform naive logistic regression in generalization performance.
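In case it’s useful, here’s roughly the experiment I’m imagining, reusing `LinearProbe` and `semi_supervised_ccs_loss` from the sketches above. Everything here is an assumption on my part: the helper names, training on labels from domain A only while using unlabeled contrast pairs from both domains, and the choice to fit the logistic-regression baseline on the difference of each pair’s hidden states.

```python
# Sketch of the cross-domain comparison (not from the original post):
# logistic regression trained on domain-A labels vs. a semi-supervised CCS
# probe trained on domain-A labels plus unlabeled pairs from both domains,
# both evaluated on held-out domain B.
import torch
from sklearn.linear_model import LogisticRegression


def fit_ccs_probe(h_pos, h_neg, labeled_idx, labels, steps=1000, lam=1.0):
    probe = LinearProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        semi_supervised_ccs_loss(probe, h_pos, h_neg, labeled_idx, labels, lam).backward()
        opt.step()
    return probe


def probe_accuracy(probe, h_pos, h_neg, labels):
    with torch.no_grad():
        p_true = 0.5 * (probe(h_pos) + (1.0 - probe(h_neg)))
    return ((p_true > 0.5).long() == labels).float().mean().item()


def compare_generalization(h_pos_a, h_neg_a, labels_a, h_pos_b, h_neg_b, labels_b):
    # Supervised baseline: logistic regression on the difference of each
    # pair's hidden states, fit on domain A only, scored on domain B.
    clf = LogisticRegression(max_iter=1000)
    clf.fit((h_pos_a - h_neg_a).numpy(), labels_a.numpy())
    lr_acc = clf.score((h_pos_b - h_neg_b).numpy(), labels_b.numpy())

    # Semi-supervised CCS: unlabeled pairs from both domains, labels from A only.
    h_pos = torch.cat([h_pos_a, h_pos_b])
    h_neg = torch.cat([h_neg_a, h_neg_b])
    labeled_idx = torch.arange(len(labels_a))  # domain-A examples come first
    probe = fit_ccs_probe(h_pos, h_neg, labeled_idx, labels_a)
    ccs_acc = probe_accuracy(probe, h_pos_b, h_neg_b, labels_b)
    return lr_acc, ccs_acc
```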