Kirk's rationale was (I believe) that without this detailed analysis systematic error is possible, and therefore aggregating likely results is unsafe.
It's not that specific. As a general rule applied to _all_ scientific research reports, the report must stand on its own first and foremost. The concern isn't just that a systematic error might be possible; the whole paper or report has to meet a minimum acceptable quality. That's why the ref Alain gave the other day fails. The authors might have done some good work, but you couldn't tell from the report, so it gets rejected until it's cleaned up. If a report doesn't pass muster, it should simply be rejected, perhaps with the option to reconsider if improvements in the experiment and/or theory are added to it.
This is one of the fallacies the CFers use. They do some sloppy work that sort of looks like what someone else got and claim that, since they got the 'same' thing, their work should be accepted as-is and folded into the consensus thinking on the subject. That's how we get 'thousands' of 'replications' and chickens doing cold fusion. Not good science.
As an aside, detecting systematic error can be really hard, because the error is often in the way the experiment is conducted, and in many cases that methodology was developed from the current best knowledge of how to do things, so nobody sees the flaw at the time. I once saw a history of the accepted value for the speed of light that illustrated this pretty well: the value was determined in one fashion and accepted as well done, but later improvements showed that that method was off significantly. Unfortunately, I don't recall where I saw this, so I can't give a ref.