Right, maybe even go one step deeper under the hood: extract the LLM's probability estimate that the first word of the response is “Yes”, then compare it to a safety threshold (which lets you make the check more conservative if desired).
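Roughly, that could look like the sketch below. It assumes you can already get log-probabilities for the candidate first tokens of the response (many LLM APIs expose something like this), passed in here as a plain dict. The function names, the dict format, and the 0.9 default threshold are all hypothetical, chosen only for illustration.

```python
import math


def yes_probability(first_token_logprobs):
    """Sum the probability mass of first-token candidates that read as "yes".

    `first_token_logprobs` is an assumed format: a dict mapping each candidate
    first token (as a string) to its natural-log probability.
    """
    total = 0.0
    for token, logprob in first_token_logprobs.items():
        # Tokenizers often produce variants such as " Yes" or "yes",
        # so normalize whitespace and case before comparing.
        if token.strip().lower() == "yes":
            total += math.exp(logprob)
    return total


def exceeds_threshold(first_token_logprobs, threshold=0.9):
    """Return True if P(first token is "yes") is at least `threshold`."""
    return yes_probability(first_token_logprobs) >= threshold


# Made-up numbers, for illustration only.
example = {
    "Yes": math.log(0.80),
    " yes": math.log(0.05),
    "No": math.log(0.12),
}

p_yes = yes_probability(example)
print(f"P(yes) = {p_yes:.2f}")
print("exceeds 0.9:", exceeds_threshold(example, threshold=0.9))
print("exceeds 0.8:", exceeds_threshold(example, threshold=0.8))
```

Which side of the threshold counts as the "safe" outcome depends on how the question to the model is phrased; moving the threshold is what makes the check more or less conservative.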