Lesson 4 is all about how the search index for online collections is created through scanning and OCR. This lesson will discuss the basics of OCR and its vast importance in producing the results of your online searches.
Getting from newsprint to an index that you can search online for newspaper articles takes a number of steps, including scanning and the application of Optical Character Recognition (OCR) software.
Leaving competency aside as a factor, the biggest reason that articles are not found is that scanning one- and two-hundred-year-old newspapers, either from paper or from microfilm, produces far less than optimal results.
Old newspapers are often in terrible shape, with folds, creases, ink blots, old and varied fonts, etc. Even though it is easier and cheaper to scan from microfilm, the microfilm itself had to be created from the original newsprint at some point, so it carries the same flaws.
Put simply, a scanner captures an image of an original newsprint page or of a page from microfilm. The OCR software then determines what the letters in that image are and creates an index that links those letters to the place on the page where they appear. Some of the time the letters make up a legitimate word in the language, or a name or place, and sometimes they are gobbledygook. It all depends on the quality of the original materials and the capabilities of the scanner and the OCR software. Unfortunately, in the case of newspapers, there is an abundance of gobbledygook.
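For readers who like to see the idea in code, here is a minimal sketch of that kind of index. It is not how any particular newspaper site works. The word list, the page number, the coordinates, and the function name `build_index` are all made up for illustration, and the misreading of "Smith" as "Srnith" is just one plausible kind of OCR error.

```python
from collections import defaultdict

# Hypothetical OCR output for one page: (recognized text, x, y) per word.
# The coordinates and the misread "Srnith" are invented for this example.
ocr_words = [
    ("John", 120, 340),
    ("Srnith", 165, 340),   # "Smith" misread: the "m" came out as "rn"
    ("married", 230, 340),
    ("Mary", 300, 340),
    ("Smith", 120, 910),    # a second occurrence, read correctly
]

def build_index(words, page):
    """Map each lowercased recognized word to the places where it appears."""
    index = defaultdict(list)
    for text, x, y in words:
        index[text.lower()].append((page, x, y))
    return index

index = build_index(ocr_words, page=3)

# A search for "smith" finds only the occurrence the OCR read correctly.
print(index.get("smith", []))
# The misread occurrence is filed under the gobbledygook spelling instead.
print(index.get("srnith", []))
```

The index stores whatever the OCR software recognized, not what was actually printed. A search for the correct spelling can therefore never reach the garbled entry, even though the article is sitting right there on the page.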
Here are two terrific articles from the University of Illinois and the University of Utah that explain the limitations of digitized newspapers and the indexes that are created:
- Best Practices for Newspaper Digitization
- Microfilm, Paper and OCR: Issues in Newspaper Digitization
I encourage you to read these articles so that you can understand why you can't find the articles that you seek.
Generally, scanning books can yield upwards of 99% accuracy from the OCR process. You should expect far lower percentages with newspapers, as these articles explain.
Primarily because of the condition of old newspapers, you should expect less than optimal results from your searches.
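To get a feel for why a few percentage points matter so much, here is a rough back-of-the-envelope sketch. It assumes the accuracy figure is per character and that errors hit each letter independently, which is a simplification. The 90% and 80% values are illustrative only and are not taken from the articles above. The function name `word_accuracy` is hypothetical.

```python
def word_accuracy(char_accuracy, word_length):
    """Chance that every letter of a word is read correctly,
    assuming each letter is an independent coin flip (a simplification)."""
    return char_accuracy ** word_length

# Compare a book-like accuracy with two illustrative lower values.
for char_accuracy in (0.99, 0.90, 0.80):
    share = word_accuracy(char_accuracy, 8)
    print(f"{char_accuracy:.0%} per letter -> about {share:.0%} of 8-letter words fully correct")
```

Under these assumptions, a surname of eight letters comes through intact a little over nine times in ten at 99% per letter, but well under half the time at 90%. Since a search only matches words that were read correctly from start to finish, longer names suffer the most.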
That does not mean that you cannot find articles - all is not lost. Take a look at the following articles on how to overcome some of these limitations and improve your ability to find "stuff."
- 8 Ways to Overcome OCR Errors when Searching Newspapers
- The ONE Absolute BEST Way to Find More Ancestor Articles in Historic Newspapers Online
Future lessons in this series will also give you further tricks for finding ancestor articles.