New Technology For Ancient Texts


The advent of optical character recognition software has created an explosion of online texts available for readers. Witness the success of the Google Books initiative, which has already scanned and digitized around 25 million books. But texts written in ancient Greek, including some of the most seminal works in the human library – those of Plato, Galen, Aeschylus, the early founders of Christianity and others – have been largely left out of that online revolution.

Bruce Robertson is working to correct that shortfall. The head of the Department of Classics at Mount Allison University has created an initiative to digitize Greek texts and make them available to scholars online. He is using ACENET and Compute Canada resources to create a database featuring some of the most important writings of the Greek world.

Robertson’s work will contribute to a field called corpus linguistics, the study of language based on large collections of day-to-day language stored in computerized databases. Developing and modifying open source optical character recognition (OCR) software, Robertson has created a digital database of around 10 million raw words with about seven million edited words.

“We’re working mainly on text from the period of around 700 B.C. to around 300 A.D.,” he says.

It’s not a simple matter of running pages through a scanner and converting them into text files. With their intricate accent marks and unusual fonts, ancient Greek texts are notoriously hard for OCR software to read.
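To give a sense of why the accent marks matter, here is a minimal Python sketch (an illustration only, not Robertson's code) that uses the standard library's unicodedata module to split one polytonic Greek character into its base letter and the marks stacked on it. The helper name decompose is an assumption made up for this example.

```python
import unicodedata


def decompose(text):
    """Split text into its base letters and the names of its combining marks."""
    # NFD normalization separates each precomposed character into a base
    # letter followed by its combining diacritics.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]
    return base, marks


# U+1F84: alpha with smooth breathing, acute accent and iota subscript.
base, marks = decompose("\u1f84")
print(base)
for name in marks:
    print(name)
```

A single printed glyph here carries three separate marks on top of the letter alpha, and an OCR engine has to get all of them right for the word to be read correctly.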

“We couldn’t apply off-the-shelf OCR software,” says Robertson. “We needed to get large-scale OCR working on these texts.” But commercial OCR engines powerful enough to do the job were too expensive. In 2011 Robertson and his team turned to the ACENET system instead. “ACENET has some really cool features,” he says.

Robertson has also created a website for editing raw OCR data, which in turn generates new training data. While he admits that computers aren’t typically thought of as tools of the classicist, Robertson says that they are essential for his work.

“Computers are an ideal tool for working with ancient texts.”

Robertson has maintained a fascination with computers since high school, when he hooked his television set up to a Commodore 64 that his parents gave him. He went on to complete his PhD dissertation at the University of Toronto on the subject of Greek names and has used computers in his work from the beginning.

“If you can learn ancient Greek you can learn to use a high-level programming language like Python,” he says.