My understanding is that there are various sets of data files. One set describes all the "raw sets of temperature measurements" the BE team could find; this includes some duplicated data, because past aggregations of datasets that BE tracked down sometimes contain the same underlying observations. Another set has had duplicate or completely spurious sets of observations removed. Then there are the various outputs of their reconstructions, like the one you mention. (I think the word "generated" is there just because this is a file of gridded data whereas temperature stations are at non-uniform locations: they first "solve" for the temperature field (and anomaly) at the station locations, then "generate" values on a uniform grid by interpolation.)

At the moment I'm experimenting with that middle set, which has been processed to remove "gross record-keeping issues" but not subjected to the detailed statistical techniques for removing other confounding factors. In general, from what I've read, the BE team seem to have done a good job of analysis given the assumptions they're making. However, it seems to me that while those assumptions are reasonable, they're not obviously the best set of assumptions to be making.
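Just to make the solve-then-generate idea concrete, here's a minimal sketch of interpolating scattered station values onto a uniform grid. The station coordinates and anomaly values are entirely made up, and I'm using simple inverse-distance weighting purely for illustration; BE's actual gridding method is more sophisticated than this.

```python
# Hypothetical station readings: (x, y, temperature anomaly).
# These values are invented, standing in for irregularly located stations.
stations = [(0.0, 0.0, 1.2), (3.0, 0.0, 0.8), (0.0, 3.0, 1.6), (3.0, 3.0, 1.0)]

def idw(x, y, stations, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from scattered stations."""
    num, den = 0.0, 0.0
    for sx, sy, val in stations:
        d2 = (x - sx) ** 2 + (y - sy) ** 2
        if d2 == 0.0:
            return val  # query point coincides with a station
        w = 1.0 / d2 ** (power / 2.0)
        num += w * val
        den += w
    return num / den

# "Generate" values on a uniform 4x4 grid from the scattered stations.
grid = [[idw(i, j, stations) for i in range(4)] for j in range(4)]
```

The key property is that grid points coinciding with stations reproduce the station value exactly, and every interpolated value stays within the range of the surrounding observations.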
Incidentally, testing statistical methodologies against a simulation is a good idea: it's one of the methods the BE team mention using to validate their statistical approach. The one drawback is that you tend to generate simulation data that matches the assumptions of your own statistical model.
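To illustrate what I mean (with numbers I've made up, not anything from BE's validation): you simulate records from a model you chose yourself, run your estimator, and check it recovers the truth. The catch is visible in the code itself: the simulation bakes in exactly the assumptions (linear trend, independent Gaussian noise) that the estimator relies on, so passing this test is only weak evidence the method works on real station data.

```python
import random

random.seed(42)  # fixed seed so the check is reproducible

# Simulate a station record: linear warming trend plus iid Gaussian noise.
# Both the trend and the noise level are invented for illustration.
true_trend = 0.02  # degrees per year
years = list(range(100))
temps = [true_trend * t + random.gauss(0.0, 0.3) for t in years]

def ols_slope(xs, ys):
    """Ordinary least-squares slope estimate."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

est = ols_slope(years, temps)  # should land close to true_trend
```

The recovered slope lands close to 0.02 almost by construction; a more convincing validation would simulate data that violates the model (correlated noise, step changes from station moves, etc.) and see how the method degrades.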