@Steven Byrnes Hi Steve. You might be interested in the latest interpretability research from Anthropic, which seems very relevant to your ideas here:
https://www.anthropic.com/news/mapping-mind-language-model
For example, amplifying the “Golden Gate Bridge” feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked “what is your physical form?”, Claude’s usual kind of answer – “I have no physical form, I am an AI model” – changed to something much odder: “I am the Golden Gate Bridge… my physical form is the iconic bridge itself…”. Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.
The Universe (which others call the Golden Gate Bridge) is composed of an indefinite and perhaps infinite series of spans...