Logbook

Experiment on English scanned images
Output = test1.txt
 * High quality image

Output = test5.txt
 * Low quality image

Output = test6.txt

Conclusion: Tesseract's accuracy is quite good for English images with little noise. For images that contain some noise, even though they remain human-readable, Tesseract cannot detect characters accurately and misinterprets them. Because its accuracy drops when the scanned image is noisy, its output cannot be trusted for such images.
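To put a number on "accuracy" when comparing an output file such as test5.txt against the true text, one option is the character error rate: the edit distance between the OCR output and a hand-typed reference, divided by the reference length. This is only a minimal sketch; the function names are my own, and the logbook experiments did not necessarily measure accuracy this way.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)
```

A value of 0.0 means the OCR output matches the reference exactly; higher values mean more character-level mistakes.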

Experiments on Gujarati scanned images
I have converted many Gujarati images, as our goal is to make Tesseract better for the Gujarati language.

I tried some blurred images (test.blur | test.blu.txt), some with clean white backgrounds (test.clear2.png | test.clear2.txt), and some with inclined text lines (test.cross1.png | test.cross1.txt).

I have tried many more images, which can be found here.

Conclusion: Tesseract's accuracy for the Gujarati language is very poor. Modifiers were often detected incorrectly. In noisy images, there are errors even in detecting the base letters. Images with inclined text whose line slope is more than 25-30 degrees cannot be processed by Tesseract at all; it gives an error saying the file is empty. For vertical text lines, the output file is extremely poor.
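The Gujarati runs can be scripted so that many images are converted in one go. The sketch below assumes the `tesseract` command-line tool is on the PATH and that the Gujarati language data (`guj`) is installed; the helper names `build_tesseract_command` and `run_ocr` are hypothetical, not part of Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="guj"):
    """Build the command line `tesseract IMAGE OUTPUTBASE -l LANG`.

    Tesseract writes the recognised text to OUTPUTBASE.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="guj"):
    """Run Tesseract on one image and return its exit code,
    or None if the tesseract binary is not installed."""
    if shutil.which("tesseract") is None:
        return None
    command = build_tesseract_command(image_path, output_base, lang)
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Failures are reported on stderr, so keep that text visible.
        print(result.stderr.strip())
    return result.returncode
```

For example, `run_ocr("test.clear2.png", "test.clear2")` would be expected to produce test.clear2.txt, given the assumptions above.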

Thresholding Operation
Input to tesseract for thresholding.

Output of thresholding

Conclusion: Tesseract's thresholding operation is good but still needs improvement.
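For reference, global thresholding of the kind Tesseract applies is usually described as Otsu's method: pick the grey level that maximises the variance between the background and foreground classes. The sketch below is a pure-Python illustration of that idea on a flat list of 8-bit grey values. It is my own simplified version, not Tesseract's actual code, and the function names are assumptions.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises the between-class
    variance of the two groups `p <= t` and `p > t`."""
    histogram = [0] * 256
    for p in pixels:
        histogram[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    weight_bg = 0   # number of pixels at or below the candidate level
    sum_bg = 0      # sum of grey values of those pixels
    best_level = 0
    best_variance = -1.0

    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue            # no pixels in the lower class yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break               # every pixel is in the lower class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_level = level
    return best_level


def binarize(pixels, threshold):
    """Map grey values to white (255) above the threshold, else black (0)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A single global threshold like this tends to struggle with uneven lighting or noisy backgrounds, which is one plausible reason the thresholded output still needs improvement.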

Page layout analysis with the help of a tool
The input to Tesseract is any valid image, and the output is an XML file (get the XML here). Page layout analysis as output: