Since I transformed the Iris dataset with a pretty “random” transformation (i.e. not chosen because it was particularly nice in some way), I didn’t check for its regeneration: my feature vectors were very different from the original Iris’s, and it seemed exceedingly unlikely that feature vectors with that particular transformation were saved anywhere on the internet.
But I got curious now, so I performed some experiments.
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper The Use of Multiple Measurements in Taxonomic Problems.
Feature vectors of the Iris flower data set:
Input = 83, 40, 58, 20, output = 1
Input = 96, 45, 84, 35, output = 2
Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
Input = 94, 45, 77, 27, output = 2
Input = 75, 49, 27, 9, output = 0
Input = 75, 48, 26, 9, output = 0
So these are the first 7 transformed feature vectors (in one of the random samplings of the dataset). Among all the generated output (I looked at >200 vectors), the model never once produced a vector identical to any of the held-out ones, and in general what it generated did not look like it was drawing on any knowledge of the remaining vectors in the dataset. (E.g. it generated a lot that were off-distribution.)
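For concreteness, here is a minimal sketch of that pipeline. The affine coefficients below are placeholders (I’m not reproducing the exact map I used), and `query_model` / `parse_vectors` are hypothetical stand-ins for the actual GPT-3 completion call and output parsing:

```python
import random
from sklearn.datasets import load_iris

# Load Iris and shuffle it (one "random sampling" of the dataset).
iris = load_iris()
rows = list(zip(iris.data.tolist(), iris.target.tolist()))
random.shuffle(rows)

# The affine map applied to every coordinate. NOTE: these coefficients
# are placeholders, not the ones actually used in the experiment.
def transform(x):
    return [round(17 * xi + 2) for xi in x]

# Format the first few vectors as the in-context prompt...
prompt = "\n".join(
    f"Input = {', '.join(str(v) for v in transform(x))}, output = {y}"
    for x, y in rows[:7]
)

# ...and keep the rest as a held-out set for the regurgitation check.
held_out = {tuple(transform(x)) for x, _ in rows[7:]}

# Hypothetical helpers for the GPT-3 call and parsing its completions:
# generated = parse_vectors(query_model(prompt))
# leaked = [v for v in generated if tuple(v) in held_out]
```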
I also tried a variant where I cherrypicked the “class 2” examples so that the first coordinate is lower than usual for that class; the generated vectors always had a first coordinate that was very off-distribution relative to the rest of class 2, as one would expect if the model was meta-learning from the vectors it sees, rather than “remembering” something.
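A minimal sketch of the cherrypicking step, assuming (hypothetically) that “lower than usual” means the bottom quartile of the first coordinate; the actual cutoff I used may have differed:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Class-2 rows only; "low" first coordinate = bottom quartile here,
# which is an assumption, not necessarily the cutoff actually used.
class2 = X[y == 2]
cutoff = np.percentile(class2[:, 0], 25)
cherrypicked = class2[class2[:, 0] < cutoff]
# These (after transformation) go into the prompt as the class-2 examples.
```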
This last experiment might seem a little contrived, but a problem with this kind of testing is that if you supply enough examples in-context, the model (plausibly) meta-learns the distribution and can then generate pretty similar-looking vectors. So, to really get to the bottom of this, to become ‘certain’ as it were, I think one would have to go deeper than just looking at what the model generates.
(Or maybe there are some ways around that problem which I did not think of. Suggestions appreciated!)
To recheck things (since I’m as worried about leakage as anyone), I retested Iris, this time transforming each coordinate with its own randomly-chosen affine transformation: $(x_1, x_2, x_3, x_4) \mapsto (11x_1 + 3,\ 7x_2 + 18,\ 9x_3 + 5,\ 22x_4 + 12)$
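In code, that map is just:

```python
import numpy as np

# Per-coordinate affine map:
# (x1, x2, x3, x4) -> (11*x1 + 3, 7*x2 + 18, 9*x3 + 5, 22*x4 + 12)
scale = np.array([11, 7, 9, 22])
shift = np.array([3, 18, 5, 12])

def transform(x):
    return scale * np.asarray(x) + shift
```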
And the results are basically identical to those with just one affine transformation for all coordinates.
I’m glad that you asked about InstructGPT, since I was pretty curious about that too and was waiting for an excuse to test it. So here are the synthetic binary results for (Davinci-)InstructGPT, compared with the original Davinci from the post:
| Model | Scen. 1 | Scen. 2 | Scen. 3 | Scen. 4 | Scen. 5 | Scen. 6 | Scen. 7 | Scen. 8 | Scen. 9 |
|---|---|---|---|---|---|---|---|---|---|
| Vanilla GPT-3 | 67.78% | 76.67% | 77.78% | 82.22% | 95.56% | 77.78% | 70.00% | 72.22% | 63.33% |
| Instruct GPT-3 | 75.56% | 72.22% | 71.11% | 86.67% | 95.56% | 77.78% | 84.44% | 73.33% | 73.33% |