The Ancestor Hunt

8 Ways to Overcome OCR Errors when Searching Newspapers

9/3/2013

13 Comments

 
Everyone who has searched newspapers online has, at some point, failed to find something. It happens incredibly often. The stakes are high for genealogy researchers, for whom finding an article about an ancestor can make a huge difference in filling out a family tree.

I have often heard researchers say "I can't find a single article about my ancestor, even though I have searched for hours!"

Setting the searcher's skill aside as a factor, the biggest reason is that scanning one- and two-hundred-year-old newspapers, whether from paper or from microfilm, produces far less than optimal results.

More importantly, searching an index created by humans who have read the source material and then typed the index is far superior to searching text that a machine produced by scanning and processing a dusty old newspaper. Yet the massive size of newspaper collections makes creating such an index manually impractical. You must expect inferior results and set your expectations accordingly.

Please take a look at the following list. Hopefully some of these errors and anomalies will give you hints for overcoming them and actually finding what you are looking for. There are many others, but these are the ones that I have personally experienced:

  1. Words were often hyphenated across line breaks because of fixed-width type as well as the experience and capability of the typesetter. Hyphens are used less today but were a staple years ago. Take that into consideration if you are searching for a surname or other search term with many letters in one word. Try splitting the search term into two words at the point where a hyphen would normally have been used.
  2. If there is an "h" in your search term, try substituting a "b", since b's and h's look quite similar and can "confuse" the OCR process. As an example, searching the California Digital Newspaper Collection for one of my surnames, "Braunhart", yields 1,507 results. Replacing the "h" with a "b", and so searching for "Braunbart", yields 96 results for the SAME person. That is approximately another 6%!
  3. For a similar reason, "c" and "e" are also confused. I have not had as many difficulties with this pair as with "h" and "b".
  4. Likewise, lower case m's and n's are often confused. An m is also often converted into a combination of several letters. In addition, r's and n's can be confused.
  5. I's in lower as well as upper case are often converted to slashes, exclamation points, or the numeral 1, and vice versa.
  6. If the original newspaper is "dirty," by which I mean there is excessive ink or the scan is dark, spaces will often be scanned but not rendered as spaces. A variety of strange characters may also be picked up.
  7. If the newspaper was scanned and then processed directly with OCR, that is one pass. If the newspaper was copied to microfilm, and the microfilm was then scanned and OCR'd, that is two passes. A two-pass operation therefore has the potential to produce lower-quality results. There isn't much that you can do about it, but it is nice to know.
  8. This one is not really about scanning, but is more of a cultural challenge. Until the last few decades, married women were generally not referred to by their own given or maiden name in newspaper publications. So in my mother's case, after she was married, any newspaper articles cited her as Mrs. Robert J Marks, or Mrs R. J. Marks, not Muriel Marks. Looking for "Muriel Marks" would therefore have led to zero results, with the exception of her obituary.
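
The letter swaps in tips 2 through 5 can be generated mechanically rather than typed out by hand. Below is a minimal Python sketch, not a feature of any newspaper site. The `CONFUSIONS` table and the `ocr_variants` function are hypothetical names chosen for illustration, and the table contains only the pairs mentioned in the list above, so extend it as you discover more. Each variant still has to be entered into a site's search box yourself.

```python
from itertools import product

# Confusable letters taken from the list above (an assumption about which
# swaps matter; add your own pairs as you run into them).
CONFUSIONS = {
    "h": ["b"],
    "b": ["h"],
    "c": ["e"],
    "e": ["c"],
    "m": ["n"],
    "n": ["m", "r"],
    "r": ["n"],
    "i": ["1", "!", "/"],
    "I": ["1", "!", "/"],
}


def ocr_variants(term, max_swaps=1):
    """Return misspellings of `term` with 1..max_swaps letters swapped."""
    # For every position, list the original letter plus its look-alikes.
    options = [[ch] + CONFUSIONS.get(ch, []) for ch in term]
    variants = set()
    for combo in product(*options):
        # Count how many positions differ from the original spelling.
        swaps = sum(1 for old, new in zip(term, combo) if old != new)
        if 1 <= swaps <= max_swaps:
            variants.add("".join(combo))
    return sorted(variants)


if __name__ == "__main__":
    for spelling in ocr_variants("Braunhart"):
        print(spelling)
```

With the default of one swap, the output for "Braunhart" includes "Braunbart", the spelling from tip 2. Raising `max_swaps` produces many more spellings, and the number grows quickly for long names, so it is probably best to start with single swaps.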

So don't be discouraged by "lack of results" from doing online newspaper searches. You just need to "outsmart" OCR and try various combinations to get to those elusive ancestors. Be persistent.
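
For tips 1 and 8, the "various combinations" can also be written out systematically. The following Python sketch is only an illustration under stated assumptions: `hyphen_splits` and `married_name_variants` are hypothetical helper names, the splitter breaks a word at every interior position rather than at true syllable boundaries, and the name forms are limited to the patterns mentioned in tip 8 plus a few close relatives.

```python
def hyphen_splits(word, min_part=3):
    """Split `word` into two search words at every interior position.

    A crude stand-in for real hyphenation: old typesetters broke words at
    syllables, but trying every position avoids guessing where those were.
    """
    return [
        f"{word[:i]} {word[i:]}"
        for i in range(min_part, len(word) - min_part + 1)
    ]


def married_name_variants(husband_given, husband_middle, surname):
    """List "Mrs." forms built from the husband's name."""
    g = husband_given[0]
    m = husband_middle[0]
    forms = [
        f"Mrs. {husband_given} {m} {surname}",
        f"Mrs. {husband_given} {m}. {surname}",
        f"Mrs. {g}. {m}. {surname}",
        f"Mrs. {husband_given} {surname}",
    ]
    # Also try each form without the period after "Mrs".
    return forms + [form.replace("Mrs.", "Mrs", 1) for form in forms]


if __name__ == "__main__":
    print(hyphen_splits("Braunhart"))
    for name in married_name_variants("Robert", "J", "Marks"):
        print(name)
```

Many search sites ignore punctuation, in which case several of the "Mrs." forms collapse into the same query. The list is meant as a checklist of spellings to try, not as a statement of how any particular site matches names.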

More crowdsourcing to correct OCR errors and improve the text would also help. One example is the reCAPTCHA processing that is used for Google Books.

Another crowdsourcing example that I have personally used is text correction on the online newspaper site itself, such as the aforementioned California Digital Newspaper Collection. There, registered users can provide edited text that is then incorporated into future searches. Kind of like a newspaper-related "pay it forward." This capability is provided on that site and many others by the fine folks at Elephind.com, who created the software used by the California collection as well as many other online newspaper sites.


For many more details about scanning, OCR and related subjects, please read the Scanning FAQ by Project Gutenberg.

Another excellent article is Analysing and Improving OCR Accuracy in Large Scale Historic Newspaper Digitisation Programs, from the March/April 2009 issue of D-Lib Magazine.

Good luck - be persistent and have reasonable expectations.


13 Comments
Rorey Cathcart
9/3/2013 11:31:09 pm

Great points all. Searching by Mrs [all husband variants] has been particularly useful for me. I have also searched successfully using Miss [Surname] but have only used this with small local or regional papers or uncommon surnames to any useful effect.

OCR techniques may improve over time but the two-pass problem may never really be overcome sufficiently.

I've only just started using Elephind. I'll have to check out their crowdsourcing options for papers I'm interested in. Like indexing for FamilySearch, crowdsourcing is such a great way to give back to the community and help others.

Kenneth R Marks
9/4/2013 01:11:09 am

Thanks, Rorey. Unfortunately - it is not OCR's "fault" - when dealing with a less than optimal source such as old newspapers - there is not much that a "machine" or software can do automatically. Glad you have been successful.

Joanne Shackford Parkes
9/5/2013 03:16:50 am

What great tips! I research for the family SHACKFORD. After reading this article, I searched on multiple OCR databases for SBACKFORD and found all sorts of newspaper articles that I hadn't seen before. In some cases 20%-30% new articles!

Kenneth R Marks
9/5/2013 04:14:23 am

Great Joanne! I am glad that the tip helped.

Alona Tester
10/20/2013 08:35:59 am

A couple of things I've learned: when a few men are mentioned in one article, they'll be listed as "Messrs Smith, Little and Baker", rather than Mr Smith, Mr Little and Mr Baker. I've also found that searching for the town or region along with a surname helps to narrow down your search.

Kenneth R Marks
11/11/2013 06:24:38 am

Great points. Thanks!

Allen Norton
12/16/2013 12:39:19 am

At the Fultonhistory.com site the OCR will occasionally read capital " S " as the number " 8 ". Thus to find all references to A. S. Norton you will also have to search for A. 8. Norton. Other sites may have this idiosyncrasy also.

Kenneth R Marks
12/17/2013 02:16:57 am

Excellent suggestion, Allen! Thanks for contributing.

Jesse
1/30/2014 06:05:13 am

I'm shocked no one has made search software that inherently knows letters that are confused with each other. For instance, searching for Carpenter will also bring up results for Carpenfer, Carponter, etc. Someone's gonna make a killing if they write such a program.

Roy Cullum
8/15/2014 06:48:34 am

I think OCRed text searching could be improved by storing glyph (letter) elements individually, rather than ASCII codes for the recognized letters. This would allow searching based on what words look like. The glyph elements to be stored would be things like risers and danglers, for example. This system might consume slightly more memory for storage, but it would be worth it.

Kenneth R Marks
9/6/2014 07:46:00 am

Interesting idea Roy. Thanks for offering it.

Sue
9/8/2014 01:20:58 am

What do I do about "Goth" being recognized as "60th"? I'm getting hundreds of hits with all sorts of "60th" - street, regiment, anniversary, etc.... :-(

Phil Elliott, Jr.
4/6/2015 05:01:15 am

I found this post to be quite helpful, so I would like to add to it. I've discovered that a capital M is often identified as Al or Ai. While searching for my Mercer ancestors, I tried Alercer and Aiercer, and got several relevant hits.

©2012-21
