Replies: 5 comments
-
From what I know (but I might be wrong), Tesseract is not optimized for multi-threaded environments; that's probably why you are seeing low CPU utilization. Proper multi-threading would have to be implemented in Tesseract itself. On our side we can only do some parallelization ourselves, for example if you have many images to process, or if the image you want to process can be divided into smaller images (e.g. based on Tesseract's layout-analysis phase, where you might get info about the different blocks of text in the image). But none of that is ideal; Tesseract should support it natively.

Another slow-down is probably the dot-product computation, which is not HW accelerated on Android, although there are AVX/SSE optimizations for desktop. I saw someone implement the dot product with NEON instructions for Android, which might increase performance somewhat. If someone could implement it for the current codebase and send a pull request (ideally to the official Tesseract repository, or at least to this library), that would be great.

Another time-affecting aspect is the size and quality of the processed image. In the official Tesseract repo they recommend doing your own pre-processing to speed up recognition. You can compare by using a clean screenshot of a text document without distortions versus a photo of printed text on paper. Also, the Tesseract LSTM engine is way slower (but also produces better-quality results) than the previous LEGACY engine.

Btw, in your code you don't need to initialize Tesseract on each call; you can reuse the instance.
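The "process many images in parallel, one engine per thread" idea above can be sketched in plain Kotlin. `FakeOcrEngine` is a hypothetical stand-in for the real `TessBaseAPI`; this is not the library's API, only the threading pattern of reusing engine instances instead of re-initializing on every call:

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical stand-in for a Tesseract engine instance. In Tesseract4Android
// this would be com.googlecode.tesseract.android.TessBaseAPI, which is
// expensive to initialize and should be reused across calls.
class FakeOcrEngine {
    fun recognize(image: String): String = "text:$image"
}

// Recognize many images concurrently. Each worker thread gets its own
// long-lived engine (a single engine instance is not safe to share),
// so initialization cost is paid once per thread, not once per image.
fun recognizeAll(images: List<String>, workers: Int): List<String> {
    val pool = Executors.newFixedThreadPool(workers)
    val engine = ThreadLocal.withInitial { FakeOcrEngine() }
    try {
        val futures = images.map { img ->
            pool.submit(Callable { engine.get().recognize(img) })
        }
        // Collect results in the original image order.
        return futures.map { it.get() }
    } finally {
        pool.shutdown()
    }
}
```

In real code you would also release each engine (e.g. `recycle()` in Tesseract4Android) when the pool shuts down.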
-
@dankito you can see my implementation here, and my app is working fine, recognizing text in less than 5 seconds.
-
Tesseract 5 finally brings some support for NEON instructions, so the processing time is greatly improved (from my quick test it is 30 % faster with tessdata_fast and 40 % faster with standard tessdata).
-
That would be great! What I also figured out: there's an environment parameter. Can you see where in the source this can be set, and make this setting available via your API? A hint may be found in the question of this issue: tesseract-ocr/tesseract#1600. I think this would also greatly improve the performance.
-
@dankito For Tesseract 5 you can try the new branch here (but you have to compile it yourself). For multi-threading you need to compile the library with OpenMP support; it can't be controlled via a runtime parameter. If you want to try that, you need to uncomment these lines: Tesseract4Android/tesseract4android/build.gradle, lines 19 to 20 at commit 8fb4eae. Note that in the current Android NDK (or build tools) there is a bug where the compiled OpenMP library file is not added to the resulting APK (or AAR in this case). See android/ndk#1028 for some ideas on how to work around that. Maybe using a specific newer version of the NDK would help? You can try and tell me.
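I can't reproduce the actual commented-out lines here, but as a general, hypothetical sketch (the real build.gradle may use different flag names), enabling OpenMP in an NDK/CMake build usually comes down to passing `-fopenmp` to the native compiler:

```gradle
android {
    defaultConfig {
        externalNativeBuild {
            cmake {
                // Hypothetical sketch only: enable OpenMP for the native code.
                // The actual Tesseract4Android build may wire this differently.
                cFlags "-fopenmp"
                cppFlags "-fopenmp"
            }
        }
    }
}
```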
-
Hi,
first of all thanks for the library and the hard work you invested to get Tesseract 4 running on Android!
During my tests I saw that Tesseract4Android takes 2 minutes to recognize a 3.1 MB image, and 1 minute 40 seconds for a 0.9 MB file.
All work is done on a separate thread dedicated only to Tesseract4Android. Giving that thread a high priority didn't help either.
What I saw in top (by executing "adb shell top -m 10") is that Tesseract4Android only uses 16% of the CPU.
Is there any way to tell Tesseract4Android to use the whole CPU or to speed up recognition otherwise?
In case it's relevant, here is the code I used (it's in Kotlin):