OCR Revisited

Posted on February 3, 2009 | Category: Intrepid Ibex, ocr

A minor disaster the other day: my trusty Acer notebook died. I say minor disaster as my HP Pavilion is as happy as a pig in flight with Intrepid Ibex. However, the Acer still had Windows XP on a partition, and on that partition was some Optical Character Recognition software, which was ancient but still did the trick. I need this primarily for the website Patrick Chapman and I founded a while back, Irish Literary Revival. In truth the site has been neglected for a while, but Patrick and I had discussed additions and I was halfway through scanning Heather Brett’s first book, Abigail Brown.

Anyway, it’s forced me to look at Linux solutions for OCR, and the only real runner that I know of is OCRopus, the Google-sponsored open source document analysis and OCR system. I’ve downloaded ocropus-0.3.1.tar.gz, but the Google wiki documentation for installation on Ubuntu is for 0.5 and looks very complicated. So I’m going to bookmark nubae’s Habari | Linux and Education piece on OCRopus: not only does it look simpler, but it details a bug that affects Intrepid Ibex:

Tesseract source has a bug that doesn’t allow it to compile with gcc 4.3 (Intrepid Ibex comes with this default)

I haven’t time to play with it for the next while, but I’ll document my adventures here when I do.


One Response to “OCR Revisited”

  1. Rufus M Says:

    There are some good reference articles on DocumentLab


Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License