Preserving old scientific papers and data

**JedRothwell** · Dec 4th 2019

I urge all researchers to digitize your old papers and data. Here are some notes about how to do this.

I have scanned many old papers and some books at LENR-CANR.org. I used an old HP flatbed scanner with a page feeder attachment. In some cases I sent books to be scanned here: https://1dollarscan.com/ They charge $1 per 100 pages.

In the past, I recommended that people with file cabinets full of paper, thousands of sheets, take them to Office Depot or a similar copy shop. Last month I bought a new scanner, an EPSON ES-400, for $270. I recommend it. I scanned over 1,200 pages in a few days: one and a half file drawers full.

This is much faster than the old HP, and it works much better. You still need a flatbed scanner for some types of documents, and for photos and detailed graphs. You can get one for $70.

The ES-400 is not only fast. It scans both sides. It handles beat up old papers. It automatically adjusts to all kinds of paper sizes, such as business cards, or a long, narrow newspaper clipping. It automatically selects color, grayscale or black and white. It produces images or Searchable PDF files with built-in OCR. I scanned a couple of small books, by breaking apart the pages and cutting off the spine. Photos come out okay, but as I said, a flatbed scanner is better for them.

With any kind of scanner, always set the resolution at 300 dpi or better. For photos, make it 600 dpi or better. For papers with ordinary text, OCR usually works best at 300 dpi. Text printed in a small font might work better at 600 dpi. Do some tests to find out. With the ES-400, you might scan at 300 dpi, then 600 dpi, then compare the results. Do this by scanning into searchable PDF files, and then using Acrobat to convert the two files into Microsoft Word. You can compare the two Word files with the Review, Compare feature to see which scan gave the better OCR conversion.

Here is a paragraph from Fusion Facts, March 1996. This is 300 dpi. There are no OCR errors, so there is no need for higher resolution. (In a few cases, I found there are more OCR errors at higher resolution, which makes no sense, but there it is.) The OCR preserved the bold text and italics.

"So, what went wrong? Well, as we readers of Fusion Facts can well imagine, the experiment was a large scale version of the Correa discharge tube or the Chernetskii self generating discharge device or the Spence device, all of which reveal excess power. The only difference was that the 'evacuated tube' was replaced by the rarified plasma state of the ionosphere, where there are as many positive ions and electrons. The space shuttle Columbia was, in fact, the cathode and the now-lost satellite was the anode. The cable was the power supply circuit and the intervening ionized space provided the discharge path."

A few recommendations --

1. Change the default resolution from 200 to 300 dpi. As I said, never scan a document at less than 300 dpi, with any scanner. (Except maybe old tax returns.)

2. Be sure you turn on the "Correct Document Skew: Paper and Contents Skew" option.

3. The document type auto detection works well, but for black and white documents with illustrations, you should set it to "gray." Set it back to "auto" when you are done with those documents.

4. BE SURE you remove all paperclips and staples! These may damage the machine. Cut the stapled corner with scissors, rather than pulling the staple out.

In another thread, I uploaded a sample document scanned with the ES-400: some pages from a magazine published in 1943. Here it is again: Science Digest 1943 extract.pdf. Look at the comment about uranium on p. 21.
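The 300 dpi versus 600 dpi comparison suggested earlier in this post can also be done on plain-text exports instead of Word files. The following is only a sketch, assuming you have already exported the text of both scans; the function name `ocr_word_diff` and the sample "600 dpi" string with its errors are made up for illustration. It uses Python's standard `difflib` module to list the words where the two OCR results disagree.

```python
import difflib


def ocr_word_diff(text_a: str, text_b: str) -> list[tuple[str, str]]:
    """Return (words_from_a, words_from_b) pairs where two OCR outputs disagree."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return diffs


# Hypothetical sample strings standing in for text exported from the two scans.
scan_300 = "The cable was the power supply circuit"
scan_600 = "The cab1e was the power supp1y circuit"

for from_300, from_600 in ocr_word_diff(scan_300, scan_600):
    print(f"300 dpi: {from_300!r}   600 dpi: {from_600!r}")
```

An empty result means the two scans produced the same words, so the lower resolution is good enough for that document.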

**orsova** · Dec 5th 2019

Quote

Scanning:

destructive vs non-destructive: destructively debinding books with a razor or guillotine cutter works much better & is much less time-consuming than spreading them on a flatbed scanner to scan one-by-one, because it allows use of a sheet-fed scanner instead, which is easily 5x faster and will give higher-quality scans (because the sheets will be flat, scanned edge-to-edge, and much more closely aligned).
Tools:
For simple debinding of a few books a year, an X-acto knife/razor is good (avoid the ‘triangle’ blades, get curved blades intended for large cuts instead of detail work)
once you start doing more than one a month, it’s time to upgrade to a guillotine blade paper cutter (a fancier swinging-arm paper cutter, which uses a two-joint system to clamp down and cut uniformly).
A guillotine blade can cut chunks of 200 pages easily without much slippage, so for books with more pages, I use both: an X-acto to cut along the spine and turn it into several 200-page chunks for the guillotine cutter.
at some point, it may make sense to switch to a scanning service like 1DollarScan (1DS has acceptable quality for the black-white scans I have used them for thus far, but watch out for their nickel-and-diming fees for OCR or “setting the PDF title”; these can be done in no time yourself using gscan2pdf/exiftool/ocrmypdf and will save a lot of money as they, amazingly, bill by 100-page units). Books can be sent directly to 1DS, reducing logistical hassles.
after scanning, crop/threshold/OCR/add metadata
Adding metadata: same principles as papers. While more elaborate metadata can be added, like bookmarks, I have not experimented with those yet.
Saving files:
In the past, I used DjVu for documents I produce myself, as it produces much smaller scans than gscan2pdf’s default PDF settings due to a buggy Perl library (at least half the size, sometimes one-tenth the size), making them more easily hosted & a superior browsing experience.
The downsides of DjVu are that not all PDF viewers can handle DjVu files, and it appears that Google/Google Scholar (G/GS) ignore all DjVu files (despite the format being 20 years old), rendering them completely unfindable online. In addition, DjVu is an increasingly obscure format and has, for example, been dropped by the Internet Archive (IA) as of 2016. The former is a relatively small issue, but the latter is fatal: being consigned to oblivion by search engines largely defeats the point of scanning! (“If it’s not in Google, it doesn’t exist.”) Hence, despite being a worse format, I now recommend PDF and have stopped using DjVu for new scans and have converted my old DjVu files to PDF.
Uploading: to LibGen, usually. For backups, filelockers like Dropbox, Mega, MediaFire, or Google Drive are good. I usually upload 3 copies including LG. I rotate accounts once a year, to avoid putting too many files into a single account.
Hosting: hosting papers is easy but books come with risk:
Books can be dangerous; in deciding whether to host a book, my rule of thumb is host only books pre-2000 and which do not have Kindle editions or other signs of active exploitation and is effectively an ‘orphan work’.
As of 23 October 2019, hosting 4090 files over 9 years (very roughly, assuming linear growth, about 6.7 million document-days of hosting: 4090⋅0.5⋅9⋅365.25≈6722426), I’ve received 4 takedown orders: a behavioral genetics textbook (2013), The Handbook of Psychopathy (2005), a recent meta-analysis paper (Roberts et al 2016), and a CUP DMCA takedown order for 27 files. I broke my rule of thumb to host the 2 books (my mistake), which leaves only the 1 paper, which I think was a fluke. So, as long as one avoids relatively recent books, the risk should be minimal.
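The document-days estimate can be checked with a few lines of arithmetic. This sketch uses the 4090 files and 9 years stated in the text and the same linear-growth assumption, under which the average number of files hosted over the period is half the final count. The variable names are only illustrative.

```python
files_hosted = 4090      # files online at the end of the period
years_hosted = 9         # length of the hosting period
days_per_year = 365.25

# Linear growth from zero means the average file count is half the final count.
average_files = files_hosted * 0.5
document_days = average_files * years_hosted * days_per_year
print(round(document_days))

# Rough exposure per takedown order, given the 4 orders mentioned in the text.
takedown_orders = 4
print(round(document_days / takedown_orders))
```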

Quote
check & improve metadata.
Adding metadata to papers/books is a good idea because it makes the file findable in G/GS (if it’s not online, does it really exist?) and helps you if you decide to use bibliographic software like Zotero in the future. Many academic publishers & LG are terrible about metadata, and will not include even title/author/DOI/year. PDFs can be easily annotated with metadata using ExifTool: exiftool -All prints all metadata, and the metadata can be set individually using similar fields.
For papers hidden inside volumes or other files, you should extract the relevant page range to create a single relevant file. (For extraction of PDF page-ranges, I use pdftk, eg: pdftk 2010-davidson-wellplayed10-videogamesvaluemeaning.pdf cat 180-196 output 2009-fortugno.pdf.)
I try to set at least title/author/DOI/year/subject, and stuff any additional topics & bibliographic information into the “Keywords” field. Example of setting metadata:
Code
exiftool -Author="Frank P. Ramsey" -Date=1930 -Title="On a Problem of Formal Logic" -DOI="10.1112/plms/s2-30.1.264" \
    -Subject="mathematics" -Keywords="Ramsey theory, Ramsey's theorem, combinatorics, mathematical logic, decidability, \
    first-order logic,  Bernays-Schönfinkel-Ramsey class of first-order logic, _Proceedings of the London Mathematical \
    Society_, Volume s2-30, Issue 1, 1 January 1930, pg264-286" 1930-ramsey.pdf
if a scan, it may be worth editing the PDF to crop the edges, threshold to binarize it (which, for a bad grayscale or color scan, can drastically reduce filesize while increasing readability), and OCRing it. I use gscan2pdf but there are alternatives worth checking out.
if possible, host a public copy; especially if it was very difficult to find, even if it was useless, it should be hosted. The life you save may be your own.

from: https://www.gwern.net/Search
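If you have many files to tag, the quoted exiftool command can be generated from a small table of metadata instead of being typed by hand. This is a minimal sketch, not part of the quoted guide: the helper name `build_exiftool_command` is made up, and it only assembles `-Tag=value` arguments of the same form used in the example above. Actually running the command assumes exiftool is installed and on your PATH.

```python
import shlex


def build_exiftool_command(pdf_path: str, metadata: dict[str, str]) -> list[str]:
    """Build an argument list of the form: exiftool -Tag=value ... file.pdf"""
    args = ["exiftool"]
    for tag, value in metadata.items():
        args.append(f"-{tag}={value}")
    args.append(pdf_path)
    return args


command = build_exiftool_command(
    "1930-ramsey.pdf",
    {
        "Author": "Frank P. Ramsey",
        "Date": "1930",
        "Title": "On a Problem of Formal Logic",
        "Subject": "mathematics",
    },
)

# Print a copy-and-paste version of the command for checking before use.
print(shlex.join(command))

# To apply it for real (assumes exiftool is installed):
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with titles that contain quotation marks or apostrophes.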

**Alan Smith** · Dec 5th 2019

Thank you Jed - some very useful hints and tips.

**JedRothwell** · Dec 5th 2019

Someone asked to see the complete Science Digest magazine from 1943. It is uploaded here:

https://lenr-canr.org/Collections/ScienceDigest1943.pdf

**JedRothwell** · Dec 5th 2019

Here is another note about scanned documents.

A scanned Acrobat (PDF) file is usually not as good as it looks. What you see on screen looks right but "underneath" there are hidden OCR errors. For example, the ICCF3 proceedings, p. 38 says:

"We can see that the loading proceeds almost at 100% current efficiency up to H/Pd=0.5. In general the current efficiency of loading increases at lower current densities. Figure 8 shows that loading at 3mA/cm² proceeds linearly with time at current efficiency slightly higher than 100% throughout the entire loading period. . . ."

https://www.lenr-canr.org/acrobat/IkegamiHthirdinter.pdf

That is what you see on the screen. But the underlying text has errors, shown in bold here:

"We can see that the loading proceeds almost **al** 100% current efficiency up to H/Pd=0.5. In general the current efficiency of loading increases at lower current densities. Figure 8 shows that loading at **3mNcm2** proceeds linearly with time at current efficiency slightly higher than 100% throughout the entire loading period."

If you do a search for "3mA/cm2" you will not find it. Most equations have OCR errors.
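You can test for this kind of hidden error by checking whether phrases you can read on the screen are actually present in the text layer. Here is a small sketch, assuming you have already copied or exported the underlying text into a string. The function name `missing_terms` is made up, and the sample text is an excerpt of the faulty text layer quoted above.

```python
def missing_terms(extracted_text: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not occur in the extracted text layer."""
    haystack = extracted_text.lower()
    return [term for term in expected_terms if term.lower() not in haystack]


# Excerpt of the underlying (OCR) text from the ICCF3 example above.
text_layer = (
    "We can see that the loading proceeds almost al 100% current efficiency "
    "up to H/Pd=0.5. Figure 8 shows that loading at 3mNcm2 proceeds linearly with time"
)

# Phrases as they appear on the screen.
on_screen = ["almost at 100%", "H/Pd=0.5", "3mA/cm2"]

for term in missing_terms(text_layer, on_screen):
    print(f"Not searchable: {term}")
```

Any phrase reported here is visible to a reader but invisible to Ctrl-F and to search engines.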

Some text has more errors, such as on p. 169:

"The anomalous phenomenon in metal loaded with deuterium has been studied, using the electrolysis and the cycle method of temperature and pressure ((M>'l'). In the report, the experimental results are introduced, including the explosion occurred, and nuetron and tritium measured in electrolysis experiment. The sensitization phenomenon of X- ray film was found in OOP experiment. It is considered that the reason of sensitization is derived from the chemical reaction and the anomalous effect in metal loaded with deuterium."

"((M>'I')" and "OOP" are supposed to be: "(CMPT)." "CMPT" is what you see on the screen, and when you print the document. But, if you press Ctrl-F to look for "CMPT" in this document, you will not find it.

**JedRothwell** · Dec 6th 2019

Another Note about OCR and Acrobat

You can see the underlying text in an Acrobat file by two methods:

1. For a short segment of text, just copy and paste to a text editor. Select some text, copy and then paste into a blank document. You will see the underlying text.

2. For an entire document, use an Acrobat editor to export to Microsoft Word or plain text. Scroll through it or use the spell checker to find garbage text.
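For a long document, a crude automatic pass can point you to the worst garbage before you read through it. This is only a rough sketch under my own assumptions: the function name `suspicious_tokens` and the list of allowed symbols are arbitrary choices, and it only catches words containing stray symbols. It will not catch errors made of ordinary letters, such as "OOP" for "CMPT," so a spell checker and a human eye are still needed.

```python
ALLOWED_SYMBOLS = "-/'.%="          # symbols that often appear inside legitimate words
EDGE_PUNCTUATION = ".,;:!?\"'()[]"  # punctuation ignored at the start or end of a word


def suspicious_tokens(text: str) -> list[str]:
    """Return words that contain unexpected symbols after trimming edge punctuation."""
    flagged = []
    for token in text.split():
        core = token.strip(EDGE_PUNCTUATION)
        if core and not all(ch.isalnum() or ch in ALLOWED_SYMBOLS for ch in core):
            flagged.append(token)
    return flagged


# Sample built from text quoted earlier in this thread.
sample = "the cycle method of temperature and pressure ((M>'l'). In the report, up to H/Pd=0.5."
print(suspicious_tokens(sample))
```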

If there are many OCR errors in the text, you will not be able to search through the document, or copy chunks of it to quote from it. If you will need to search or quote from the document, you probably need to retype it manually. The most important papers at LENR-CANR.org were retyped, by me. This is one of the most important in the history of the field:

Miles, M. and K.B. Johnson, Anomalous Effects in Deuterated Systems, Final Report. 1996, Naval Air Warfare Center Weapons Division. https://www.lenr-canr.org/acrobat/MilesManomalousea.pdf

I OCR'ed it, fixed all the errors (I hope), and then I paid an expert to regenerate the graphs on pages 52 - 86. The only paper copies of this document that Miles or I could find were copies of copies with skewed, blurry, distorted graphs. I superimposed the original graphs on the new versions and checked them carefully. Like so:

That was a lot of work. This is why it is much better to preserve the original digital copies.

Some OCR programs work better than others, but none of them will cleanly convert a messy, old document. I have tried a variety of OCR programs. Some years ago, I found that the ABBYY program from Russia seems to work best with scientific documents. It does Greek letters and so on. I have not tried other OCR programs lately, and they may have improved. The built-in OCR feature of the EPSON ES-400 is from Nuance, I think. It works pretty well. It does Japanese, I suppose because it is made in Japan. You have to tell it the text is Japanese, or it will think the text is English and convert it to garbage. The OCR built into Adobe Acrobat is okay but not great. It works better than it did a few years ago.

What is interesting about OCR programs is that in a few limited ways, they can do a better job than you can. They don't care about contrast. Lightly printed text converts about as well as high contrast text. But you, a human (and I assume you are a human, dear reader) can still do a far better job nearly all the time, despite the millions of dollars that Adobe and others have invested in this technology. I think an AI approach will eventually make machine OCR better than human OCR.
