Preserving old scientific papers and data

  • I urge all researchers to digitize your old papers and data. Here are some notes about how to do this.

    I have scanned many old papers and some books. I used an old HP flatbed scanner with a page-feeder attachment. In some cases I sent books out to be scanned by a service; they charge $1 per 100 pages.

    For people who have file cabinets full of paper, with thousands of sheets, I used to recommend taking the papers to Office Depot or a similar copy shop. Last month I bought a new scanner, an EPSON ES-400, for $270. I recommend it. I scanned over 1,200 pages in a few days: one and a half file drawers full.

    This is much faster than the old HP, and it works much better. You still need a flatbed scanner for some types of documents, and for photos and detailed graphs. You can get one for $70.

    The ES-400 is not only fast. It scans both sides. It handles beat-up old papers. It automatically adjusts to all kinds of paper sizes, such as business cards or long, narrow newspaper clippings. It automatically selects color, grayscale, or black and white. It produces images or searchable PDF files with built-in OCR. I scanned a couple of small books by breaking apart the pages and cutting off the spines. Photos come out okay, but as I said, a flatbed scanner is better for them.

    With any kind of scanner, always set the resolution to 300 dpi or better. For photos, use 600 dpi or better. For papers with ordinary text, OCR usually works best at 300 dpi; text printed in a small font might work better at 600 dpi. Do some tests to find out. With the ES-400, you might scan at 300 dpi and then at 600 dpi and compare the results. Do this by scanning into searchable PDF files, then use Acrobat to convert the two files into Microsoft Word. You can compare the Word files with the Review > Compare feature to see which resolution does a better job at OCR conversion.

    Here is a paragraph from Fusion Facts, March 1996, scanned at 300 dpi. There are no OCR errors, so there is no need for higher resolution. (In a few cases I found more OCR errors at higher resolution, which makes no sense, but there it is.) The OCR preserved the bold text and italics.

    "So, what went wrong? Well, as we readers of Fusion Facts can well imagine, the experiment was a large scale version of the Correa discharge tube or the Chernetskii self­ generating discharge device or the Spence device, all of which reveal excess power. The only difference was that the 'evacuated tube' was replaced by the rarified plasma state of the ionosphere, where there are as many positive ions and electrons. The space shuttle Columbia was, in fact, the cathode and the now-lost satellite was the anode. The cable was the power supply circuit and the intervening ionized space provided the discharge path."

    A few recommendations --

    1. Change the default resolution from 200 to 300 dpi. As I said, never scan a document at less than 300 dpi, with any scanner. (Except maybe old tax returns.)

    2. Be sure you turn on the "Correct Document Skew: Paper and Contents Skew" option.

    3. The document-type auto detection works well, but for black-and-white documents with illustrations you should set it to "gray." Set it back to "auto" afterward.

    4. BE SURE you remove all paperclips and staples! These may damage the machine. Cut the stapled corner with scissors, rather than pulling the staple out.

    In another thread, I uploaded a sample document scanned with the ES-400: some pages from a magazine published in 1943. Here it is again: Science Digest 1943 extract.pdf. Look at the comment about uranium on p. 21.

  • Quote
    • check & improve metadata.

      Adding metadata to papers/books is a good idea because it makes the file findable in G/GS (if it’s not online, does it really exist?) and helps you if you decide to use bibliographic software like Zotero in the future. Many academic publishers & LG are terrible about metadata, and will not include even title/author/DOI/year. PDFs can easily be annotated with metadata using ExifTool: exiftool -All prints all metadata, and individual fields can be set using similar options.

      For papers hidden inside volumes or other files, you should extract the relevant page range to create a single standalone file. (For extracting PDF page ranges, I use pdftk, e.g. pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf.)

      I try to set at least title/author/DOI/year/subject, and stuff any additional topics & bibliographic information into the “Keywords” field. Example of setting metadata:

      1. exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \ -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \ first-order logic, Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \ Society_, Volume s2-30, Issue 1, 1 January 1930, pg264-286" 1930-ramsey.pdf
    • if a scan, it may be worth editing the PDF to crop the edges, threshold to binarize it (which, for a bad grayscale or color scan, can drastically reduce filesize while increasing readability), and OCRing it. I use gscan2pdf but there are alternatives worth checking out.
    • if possible, host a public copy; especially if it was very difficult to find, even if it was useless, it should be hosted. The life you save may be your own.


  • Here is another note about scanned documents.

    A scanned Acrobat (PDF) file is usually not as good as it looks. What you see on screen looks right, but "underneath" there are hidden OCR errors. For example, the ICCF3 proceedings, p. 38, says:

    "We can see that the loading proceeds almost at 100% current efficiency up to H/Pd=0.5. In general the current efficiency of loading increases at lower current densities. Figure 8 shows that loading at 3mA/cm2 proceeds linearly with time at current efficiency slightly higher than 100% throughout the entire loading period. . . ."

    That is what you see on the screen. But the underlying text has errors ("al" instead of "at," and "3mNcm2" instead of "3mA/cm2"):

    "We can see that the loading proceeds almost al 100% current efficiency up to H/Pd=0.5. In general the current efficiency of loading increases at lower current densities. Figure 8 shows that loading at 3mNcm2 proceeds linearly with time at current efficiency slightly higher than 100% throughout the entire loading period."

    If you do a search for "3mA/cm2" you will not find it. Most equations have OCR errors.

    Some text has more errors, such as on p. 169:

    "The anomalous phenomenon in metal loaded with deuterium has been studied, using the electrolysis and the cycle method of temperature and pressure ((M>'l'). In the report, the experimental results are introduced, including the explosion occurred, and nuetron and tritium measured in electrolysis experiment. The sensitization phenomenon of X- ray film was found in OOP experiment. It is considered that the reason of sensitization is derived from the chemical reaction and the anomalous effect in metal loaded with deuterium."

    "((M>'I')" and "OOP" are supposed to be: "(CMPT)." "CMPT" is what you see on the screen, and when you print the document. But, if you press Ctrl-F to look for "CMPT" in this document, you will not find it.

  • Another Note about OCR and Acrobat

    You can see the underlying text in an Acrobat file by two methods:

    1. For a short segment of text, just copy and paste into a text editor. Select some text, copy it, and paste it into a blank document. You will see the underlying text.

    2. For an entire document, use an Acrobat editor to export to Microsoft Word or plain text. Scroll through it or use the spell checker to find garbage text.
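    The spell-checker pass in step 2 can be partly automated. Here is a rough heuristic of my own (not an Acrobat feature): it flags tokens that mix letters with digits or contain stray symbols, which in exported OCR text are usually garbage like "((M>'l')" or "3mNcm2". The sample string stands in for text exported from a scanned PDF.

```python
import re

def suspect_tokens(text):
    """Flag tokens that look like OCR garbage: letters mixed with
    digits, or stray symbols inside a word. A crude heuristic, so
    expect some false positives (units, equations, catalog numbers)."""
    out = []
    for tok in text.split():
        core = tok.strip(".,;:()\"'")   # ignore ordinary edge punctuation
        if not core:
            continue
        has_digit = any(c.isdigit() for c in core)
        has_alpha = any(c.isalpha() for c in core)
        has_odd = bool(re.search(r"[^\w./%-]", core))  # symbols unusual in a word
        if (has_digit and has_alpha) or has_odd:
            out.append(tok)
    return out

sample = ("the cycle method of temperature and pressure ((M>'l'). "
          "Figure 8 shows that loading at 3mNcm2 proceeds linearly")
print(suspect_tokens(sample))
```

    Running this over the exported text gives a short list of suspect spots to check against the page images, instead of proofreading the whole file.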

    If there are many OCR errors in the text, you will not be able to search the document or copy chunks of it to quote from. If you need to search or quote from the document, you probably need to retype it manually. I retyped the most important papers myself. This is one of the most important in the history of the field:

    Miles, M. and K.B. Johnson, Anomalous Effects in Deuterated Systems, Final Report. 1996, Naval Air Warfare Center Weapons Division.

    I OCR'ed it, fixed all the errors (I hope), and then paid an expert to regenerate the graphs on pages 52-86. The only paper copies of this document that Miles or I could find were copies of copies, with skewed, blurry, distorted graphs. I superimposed the original graphs on the new versions and checked them carefully.

    That was a lot of work. This is why it is much better to preserve the original digital copies.

    Some OCR programs work better than others, but none of them will cleanly convert a messy old document. I have tried a variety of OCR programs. Some years ago I found that the ABBYY program from Russia seemed to work best with scientific documents; it handles Greek letters and so on. I have not tried other OCR programs lately, and they may have improved. The built-in OCR feature of the EPSON ES-400 is from Nuance, I think. It works pretty well. It does Japanese, I suppose because the scanner is made in Japan. You have to tell it the text is Japanese, or it will assume the text is English and convert it to garbage. The OCR built into Adobe Acrobat is okay but not great. It works better than it did a few years ago.

    What is interesting about OCR programs is that in a few limited ways they can do a better job than you can. They don't care about contrast: lightly printed text converts about as well as high-contrast text. But you, a human (and I assume you are a human, dear reader), can still do a far better job nearly all the time, despite the millions of dollars that Adobe and others have invested in this technology. I think an AI approach will eventually make machine OCR better than human OCR.