Can't index hOCR documents on Windows #174
Comments
Thank you for the detailed bug report. This should be enough to pinpoint the cause of the bug and hopefully find a fix. I'll report back once I've gotten around to probing it. That might be a while, since I'm currently on parental leave, i.e. it will happen when the little one has had a good night and I'm not too swamped with household stuff (-:
So I just tried to reproduce the issue with the example document from the OP, but it indexes just fine for me 🤔 Can you share the file that causes the issue? I.e. the actual
Also, could you try running the same setup with the same data inside a Docker container with a Linux system? The plugin was only tested on Linux and uses a few low-level interfaces that might behave differently on Windows systems, so it would be good to verify whether this is the case.
Thanks for the quick reply. My results are:
I think you're right. The problem seems to be on Windows (for version 0.6.0).
Thank you! I'll try setting up a Windows environment to reproduce and hopefully fix the issue. It might take a bit, though 😬
Some hOCR files can't be parsed (version 0.6.0) because their content contains characters with diacritics, for example the characters "ů" and "á" in the words "aráme" and "ků".
Example hOCR file:
It throws this error:
hOCR files without the diacritics "ů" and "á" are OK.
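For anyone trying to narrow this down: the cause has not been confirmed in this thread, so the following is only a sketch of why characters such as "ů" and "á" can behave differently from plain ASCII. It is plain Python and does not use the plugin. The helper names `char_and_byte_lengths` and `misdecode` are made up for illustration, and `cp1252` is merely an assumed stand-in for a Windows default encoding. It shows that the two words from the report occupy more bytes in UTF-8 than they have characters, and that decoding those bytes with a different code page changes the text.

```python
# Words quoted in the report, written with escapes so the file encoding does not matter.
WORDS = ["ar\u00e1me", "k\u016f"]  # "aráme", "ků"


def char_and_byte_lengths(word):
    """Return (number of characters, number of UTF-8 bytes) for a word."""
    return len(word), len(word.encode("utf-8"))


def misdecode(word, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode with a different single-byte code page.

    cp1252 is only an assumed stand-in for a Windows default encoding.
    """
    return word.encode("utf-8").decode(wrong_encoding)


for w in WORDS:
    chars, nbytes = char_and_byte_lengths(w)
    # ascii() keeps the output printable on consoles that cannot show these characters.
    print(ascii(w), chars, nbytes, ascii(misdecode(w)))
```

If byte offsets and character offsets are mixed anywhere, or if a file is read with the platform default encoding instead of UTF-8, a mismatch of this kind would only show up on documents containing such characters. Whether that is what happens here is an open question.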