Cleaning OCR'd text with Regular Expressions

Laura Turner O'Hara

The Programming Historian (May 2013)

Affiliations

Laura Turner O'Hara: Office of the Historian at the U.S. House of Representatives

Abstract

Optical Character Recognition (OCR), the conversion of scanned images to machine-encoded text, has proven a godsend for historical research. It makes texts searchable on the one hand and more easily parsed and mined on the other. But we've all noticed that the OCR for historic texts is far from perfect: old typefaces and page formats produce their own distinctive recognition errors. How might we improve poor-quality OCR? The answer is regular expressions, or “regex.”

Published in The Programming Historian

ISSN: 2397-2068 (Online)
Publisher: Editorial Board of the Programming Historian
Country of publisher: United Kingdom
LCC subjects: History (General) and history of Europe: History (General); Science: Mathematics: Instruments and machines: Electronic computers. Computer science: Computer software
Website: https://programminghistorian.org/en/
