I think your critique hinges on a misunderstanding triggered by the word “religion.” You (mis)portray my position as advocating for religion’s worst epistemic practices; in reality I’m trying to highlight architectural features that stay durable when instrumental reward shaping fails.
The claim “religion works by damaging rationality” is a straw man. My post is about borrowing design patterns that might cultivate robust alignment; it does not require you to accept the premise that religion thrives exclusively by “preventing good reasoning.”
I explicitly say that I’m examining the structural concept of intrinsic motivations that remain stable in out-of-distribution (OOD) scenarios, not religion itself. Your assessment glosses over this nuance, which is a mismodeling of my actual position.
I’ve been running a set of micro-evals for my own workflows, but I think the main obstacle is the fuzziness of real-life tasks. Chess has really clean failure signals, and you get metrics like centipawn loss for free.
It takes significant mental effort to look at your own job/work and create micro-benchmarks that aren’t completely subjective. The trick that’s helped me the most is to steal a page from test-driven development (TDD):
- Write the oracle first: if I can’t evaluate the output as true/false, the task is too mushy
- Shrink the scope until it fails cleanly
- Iterate like a unit-test suite: add new edge cases whenever a model slips through or reward-hacks
The payoff is being able to build clean dashboards that tell you “model XYZ passes 92% of this subcategory of tests.”
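To make the oracle-first idea concrete, here’s a minimal Python sketch of such a harness. Everything in it is hypothetical: the model is a stub (replace it with your actual model call), and the arithmetic tasks just stand in for whatever your real micro-benchmarks check.

```python
# Minimal oracle-first micro-eval harness (illustrative sketch, not a framework).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    oracle: Callable[[str], bool]  # strict true/false judge: no subjective grading

def fake_model(prompt: str) -> str:
    # Stand-in for a real model/API call; here it just evaluates arithmetic.
    return str(eval(prompt))  # only safe because the toy prompts are arithmetic

CASES = [
    Case("2 + 2", lambda out: out.strip() == "4"),
    Case("7 * 6", lambda out: out.strip() == "42"),
    # Append new edge cases here whenever a model slips through or reward-hacks.
    Case("10 - 3 * 2", lambda out: out.strip() == "4"),
]

def run_suite(model: Callable[[str], str]) -> float:
    # Pass rate over the suite; this is the number the dashboard reports.
    passed = sum(c.oracle(model(c.prompt)) for c in CASES)
    return passed / len(CASES)

print(f"pass rate: {run_suite(fake_model):.0%}")
```

The key design choice is that each `oracle` is a pure true/false function, so results aggregate into the kind of per-subcategory pass rates a dashboard can display without any subjective scoring in the loop.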