Friday, 17 January 2014

3150. OCR Scanning.


بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ  , الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ , الرَّحْمَنِ الرَّحِيمِ ,  مَالِكِ يَوْمِ الدِّينِ , إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ , اهْدِنَا الصِّرَاطَ المُسْتَقِيمَ  , صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ , غَيْرِ المَغْضُوبِ عَلَيْهِمْ وَلاَ الضَّالِّينَ.

Assalamualaikum w.b.t/السَّلاَمُ عَلَيْكُمْ وَرَحْمَةُ اللهِ وَبَرَكَاتُه
Meja www.peceq.blogspot.com 


This blog is about the Linux Command Line Interface (CLI), with an occasional foray into graphical user interface territory. Instead of just giving you information like some man page, I hope to illustrate each command in real-life scenarios.

OCR Scanning


This post describes how to scan pages from a printed book and convert the resulting images to text using Optical Character Recognition (OCR).
The tools that I use are:
  1. SimpleScan
  2. tesseract

Preparation

SimpleScan is a GUI scan application that comes pre-installed in many Linux distributions (including Debian Wheezy).
To manually install it on Debian:
$ sudo apt-get install simple-scan
tesseract is a command-line OCR program.
To install:
$ sudo apt-get install tesseract-ocr
If the text is in English, that is all you need to install. For any other language, you must install additional tesseract language packs, for example tesseract-ocr-rus for Russian, tesseract-ocr-deu for German, and tesseract-ocr-fra for French.
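As a rough sketch of that naming pattern, the package names can be built from the 3-character language codes. The helper name lang_packages is hypothetical, and the tesseract-ocr-<code> pattern is only assumed from the three examples given here, so check that a package actually exists for your language before relying on it.

```shell
# Hypothetical helper: print the Debian package name for each language code,
# assuming the tesseract-ocr-<code> naming pattern seen in the examples.
lang_packages() {
    for code in "$@"; do
        printf 'tesseract-ocr-%s\n' "$code"
    done
}

# Show the package names for Russian, German and French.
lang_packages rus deu fra

# To install them (requires root):
#   sudo apt-get install $(lang_packages rus deu fra)
```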

OCR Procedure

  1. Scan the pages using SimpleScan.
  2. Save the image. 
  3. Run the tesseract command:
    $ tesseract OnWritingWell.jpg out
    Tesseract Open Source OCR Engine v3.02 with Leptonica
    
    The first parameter is the input image filename. The second parameter is the desired basename of the output text file. The default .txt extension is appended to the basename, e.g., out.txt.
    If the language is not English, you need to specify the language on the command line using a 3-character language code (refer to the tesseract man page). The following command specifies the use of 3 languages: Russian, German and French.
    $ tesseract OnWritingWell.jpg myout  -l rus+deu+fra 
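
When a book runs to many pages, the same command can be wrapped in a loop. The following is only a sketch: the function name ocr_pages, and the assumption that every scan is saved as a .jpg file in the current directory, are mine and not part of the procedure above.

```shell
# Hypothetical helper: run tesseract on every .jpg file in the current
# directory, writing page.jpg -> page.txt. The optional first argument is a
# language specification such as "eng" or "rus+deu+fra" (defaults to eng).
ocr_pages() {
    lang="${1:-eng}"
    for img in *.jpg; do
        # Skip the literal pattern when no .jpg files exist.
        [ -e "$img" ] || continue
        # Strip the extension to get the output basename.
        tesseract "$img" "${img%.jpg}" -l "$lang"
    done
}
```

Calling ocr_pages in the scan directory processes the pages as English text, while ocr_pages rus+deu+fra passes the language list through to the -l option.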
    

Accuracy

In the above example, there were a total of 734 words. In the output text file, 119 words (16% of the total) required some form of manual correction. This translates to roughly 84% OCR accuracy. The sample size is too small to be scientifically or statistically valid. What performance are you getting from OCR?
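The arithmetic behind those percentages can be checked with a small helper. This is a sketch only: the function name ocr_accuracy is hypothetical, and it assumes accuracy simply means the share of words that needed no correction.

```shell
# Hypothetical helper: percentage of words that needed no manual correction.
# Usage: ocr_accuracy TOTAL_WORDS CORRECTED_WORDS
ocr_accuracy() {
    awk -v total="$1" -v bad="$2" \
        'BEGIN { printf "%.1f\n", (total - bad) / total * 100 }'
}

# Figures from the example above: 734 words, 119 needing correction.
ocr_accuracy 734 119    # prints 83.8
```

An approximate total word count for an output file can be obtained with wc -w out.txt; the number of words needing correction still has to be counted by hand.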

THURSDAY, JANUARY 9, 2014

Source/Ref:
http://linuxcommando.blogspot.com/2014/01/ocr-scanning.html
