Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There’s also an accompanying prediction market which might interest you.
I did not include the ‘getting drunk’ interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that ‘getting drunk’ shouldn’t be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).
Would be curious to chat about this at some point if you’re still working on this!
Seems like we were thinking along very similar lines. I wrote up a similar experiment in shortform here. There’s also an accompanying prediction market which might interest you.
I did not include the ‘getting drunk’ interventions, which are an interesting idea, but I believe that fine-grained capabilities in many domains are de-correlated enough that ‘getting drunk’ shouldn’t be needed to get strong evidence for use of introspection (as opposed to knowledge of general 3rd person capability levels of similar AI).
Would be curious to chat about this at some point if you’re still working on this!