did the paper report the accuracy of the pure prediction model (on the pure prediction task)? (trying to replicate and want a sanity check)
I think this is what you are looking for
is the individual measurement prediction AUROC a) or b)?
a) mean(AUROC(sensor_i_pred, sensor_i))
b) AUROC(all(sensor_preds), all(sensors))
We compute AUROC(all(sensor_preds), all(sensors)). This is somewhat weird, and it would have been slightly better to do a) (thanks for pointing it out!), but I think the numbers for both should be close since we balance classes (for most settings, if I recall correctly) and the estimates are calibrated (since they are trained in-distribution, there is no generalization question here), so it doesn’t matter much.
The relevant pieces of code can be found by searching for “sensor auroc”:
cat_positives = torch.cat([one_data["sensor_logits"][:, i][one_data["passes"][:, i]] for i in range(nb_sensors)])
cat_negatives = torch.cat([one_data["sensor_logits"][:, i][~one_data["passes"][:, i]] for i in range(nb_sensors)])
m, s = compute_boostrapped_auroc(cat_positives, cat_negatives)
print(f"sensor auroc pn {m:.3f}±{s:.3f}")
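For comparison, here is a minimal sketch of option a) computed from the same tensors (illustrative only, not from our codebase; it assumes sklearn's roc_auc_score and the one_data / nb_sensors variables from the snippet above):

# Illustrative sketch of a): average the per-sensor AUROCs instead of pooling all sensors.
from sklearn.metrics import roc_auc_score

per_sensor_aurocs = []
for i in range(nb_sensors):
    scores = one_data["sensor_logits"][:, i].cpu().numpy()  # predicted logit for sensor i
    labels = one_data["passes"][:, i].cpu().numpy()         # whether sensor i actually passed
    per_sensor_aurocs.append(roc_auc_score(labels, scores))
mean_auroc = sum(per_sensor_aurocs) / nb_sensors  # a) mean(AUROC(sensor_i_pred, sensor_i))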
oh I see, by all(sensor_preds) I meant sum(logit_i for i in range(n_sensors)) (the probability that all sensors are activated). Makes sense, thanks!
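(For concreteness, the aggregate I had in mind was something like the toy sketch below: it just sums the per-sensor logits into one score per example for the event that every sensor passes. Purely illustrative, not anyone's actual code.)

# Toy sketch of the aggregate reading of all(sensor_preds): one summed score per example.
agg_logits = one_data["sensor_logits"].sum(dim=1)  # shape (batch,)
all_pass = one_data["passes"].all(dim=1)           # True iff every sensor passes for that example
m_all, s_all = compute_boostrapped_auroc(agg_logits[all_pass], agg_logits[~all_pass])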
looking at your code—seems like there’s an option for next-token prediction in the initial finetuning stage, but no mention (that I can find) in the paper—am I correct in assuming the next token prediction weight was set to 0? (apologies for bugging you on this stuff!)
That’s right. We initially thought it might be important so that the LLM “understood” the task better, but it didn’t matter much in the end. The main hyperparameters for our experiments are in train_ray.py, where you can see that we use a “token_loss_weight” of 0.
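To make the weighting concrete, here is a rough sketch of how a token_loss_weight of 0 removes the next-token term from the training objective (all names except token_loss_weight are hypothetical; this is not the actual train_ray.py code):

import torch
import torch.nn.functional as F

# Hypothetical shapes for illustration: 4 examples, 3 sensors, 16-token sequences, vocab of 100.
sensor_logits = torch.randn(4, 3)             # model's per-sensor predictions
passes = torch.randint(0, 2, (4, 3)).float()  # ground-truth sensor activations
lm_logits = torch.randn(4, 16, 100)           # next-token logits
target_ids = torch.randint(0, 100, (4, 16))   # target token ids

sensor_loss = F.binary_cross_entropy_with_logits(sensor_logits, passes)
token_loss = F.cross_entropy(lm_logits.reshape(-1, 100), target_ids.reshape(-1))
token_loss_weight = 0.0  # the value we used, so the next-token term drops out of the loss
loss = sensor_loss + token_loss_weight * token_loss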
(Feel free to ask more questions!)
yup, sorry for missing that, thanks!