Accurate, Focused Research on Law, Technology and Knowledge Discovery Since 2002

The Persistence of the OCR Problem in Digital Repository E-Books

Kichuk, Diana. Loose, Falling Characters and Sentences: The Persistence of the OCR Problem in Digital Repository E-Books. Portal: Libraries and the Academy 15, no. 1 (2015): 59–91. doi:10.1353/pla.2015.0005.

“The electronic conversion of scanned image files to readable text using optical character recognition (OCR) software and the subsequent migration of raw OCR text to e-book text file formats are key remediation or media conversion technologies used in digital repository e-book production. Despite real progress, the OCR problem of reliability and accuracy in OCR-derived e-book text and metadata persists. This paper examines a selection of digitized e-books in several prominent digital repositories and discusses the impact of OCR technology on e-book text file formats, metadata, and the online reading experience.” [via James Jacobs]
