Loading OCR fragments from S3 #49
Hello @jbaiter, and thanks for opening this issue. Let me explain our use case and some thoughts about S3 loading.
For many reasons, but specifically because our system aims to address the resource constraints (money, staff, DevOps) found in smaller cultural heritage institutions, we store our assets in S3 using a simple hash structure and use internal PHP stream wrappers and Cantaloupe to treat them as "local". Also, since we use a lot of IIIF, books are often ingested as a single PDF and individual pages are generated on the fly via Cantaloupe.
Currently one of our team members is testing your wonderful plugin on a custom installation that uses local storage, and the results are promising, but without access to a local filesystem on many of these installations, lazy loading XML files becomes complex for us.
Current workflow (before using your plugin): How to highlight? The same way many others do right now, before your plugin became available.
Our problem right now is of course not the initial indexing, since we already fetch remote sources temporarily to local storage for post-processing. But these are not kept around. So S3 (or maybe generic URL retrieval with a local cache?) could be a good option, maybe even something similar to what Cantaloupe does for source caching and streaming when dealing with S3/Azure remotes. Pinging @giancarlobi here, since he is testing the plugin against the filesystem right now.
I may see another use case here that is not urgent but may be interesting for other people: SolrCloud. In a multi-core/multi-shard distributed environment, local file access is probably impossible or would require some type of network mount.
Finally, we have thought of using some type of S3FS mount just to handle this use case, but since hOCR/OCR would, in our architecture, simply be another file saved in a hash-based bucket structure, we would have to mount basically everything. Not sure how performant that would be.
Thanks for reading and kudos on your work on this. It is really great! |
Hey @DiegoPino, thanks for the very detailed use case, this is really helpful!
I'll look into what Cantaloupe is doing for caching; currently the plugin relies completely on the OS page cache to reduce disk I/O.
This is actually supported already, although it's neither well-documented nor well-tested at the moment :-) You can just leave out the
As mentioned above, the only difference is that you don't need the
From the perspective of the application code, it will always go and fetch from the local file source at the moment. On the OS level, the page cache is used, and we have a small hook that does "cache warming" once the files to be highlighted are known.
Yes :-)
Definitely an issue and totally within scope, no worries :-)
Yep, it requires a network mount (NFS/SMB/whatever), but this works like a charm: we're running SolrCloud on three shards with ~300 million pages here, and performance is pretty good from what I can tell (and will get better once I'm done with the current WIP refactor), though that is with flash storage on the filer.
Performance probably depends on your S3FS implementation, and you can use your OS' page cache, which is an advantage (iirc FUSE goes through the kernel VFS layer, at least on Linux, but I might be wrong), but I can see why mounting everything is not practical for you. So to summarize, I'll look into supporting "source pointers" targeting S3 and what an in-application caching layer could look like, but I can't make any guarantees on when this will result in something usable for you. So, in the meantime, I recommend you maybe experiment with the in-index storage and the MiniOCR format, this is a combination that is pretty similar to what you're currently doing with your JSON format! And please reach out anytime if you hit roadblocks/bugs/documentation gaps/mistakes, either via an issue or a Twitter DM :-) |
@jbaiter Many thanks !!! I tried with inline hOCR and it works, removing only this line from schema: |
That is wonderful! I really appreciate your very detailed reply here, and I may need some extra time now to explore the capabilities. I had no idea it was already possible to query hOCR from Solr directly (I need to read more code next time), simply great. We will go for MiniOCR, which is already similar to the smaller format (no pretty print) we were planning, do some integration tests, and also explore file-based S3 mounts for larger deployments. Many thanks, @jbaiter! |
You mean that parts of the bounding boxes/non-text markup are ending up in the index? This shouldn't happen and is a bug :-/ Do you have more information on when this happens? That is, the schema, a sample of the indexed terms, a sample of a broken snippet, etc.
This filter detects the OCR format of the document and then returns the format-specific character filter. This filter then takes care of extracting the plaintext from the OCR document, while tracking the corresponding offsets in the input OCR document. This results in the input offsets in the actual OCR markup being stored alongside the plaintext tokens. This again allows us to locate the matching terms very quickly in the markup at highlighting time, since we know the offset and can just seek to the position and discover the surrounding context. @DiegoPino |
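To make the offset tracking described above a bit more concrete, here is a minimal Python sketch. It is not the plugin's Java character filter: the function name `tokens_with_offsets` and both regular expressions are made up for illustration, and it ignores XML entities and format detection entirely. It only shows the core idea that each plaintext token keeps the character offsets it had in the original markup, so a highlighter can later seek straight to that position.

```python
import re

# Either a complete tag or a run of text between tags (illustrative only).
TAG_OR_TEXT = re.compile(r"<[^>]*>|[^<]+")
WORD = re.compile(r"\S+")


def tokens_with_offsets(markup: str):
    """Yield (token, start, end); start/end index into the ORIGINAL markup."""
    for chunk in TAG_OR_TEXT.finditer(markup):
        if chunk.group().startswith("<"):
            continue  # tags produce no tokens, but positions keep counting
        for word in WORD.finditer(chunk.group()):
            start = chunk.start() + word.start()
            yield word.group().lower(), start, start + len(word.group())


sample = '<l><w x="385 631 566 666">ISTITUTO</w> <w x="583 631 621 666">DI</w></l>'
for token, start, end in tokens_with_offsets(sample):
    # The stored offsets point back at the word inside the markup.
    print(token, start, end, sample[start:end])
```

Because the offsets refer to the markup rather than to the extracted plaintext, the surrounding `<w>` element (and its coordinates) can be found by looking just before `start`.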
@jbaiter thanks again. About this:
I did a quick check, and it is probably not a bug but a configuration error on my side. Anyway, I used this schema configuration:
<fieldtype name="text_ocr_inline" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
<analyzer type="index">
<charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
</analyzer>
</fieldtype>
I ingested the following JSON file as a document, using the post tool from the Solr installation:
{
"id": "ocrdoc-1-i",
"ocr_text_inline": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">
<head>
<meta http-equiv=\"Content-Type\" content=\"text\/html; charset=UTF-8\"\/>
<meta name=\"ocr-system\" content=\"djvu2hocr 0.10.2\"\/>
<meta name=\"ocr-capabilities\" content=\"ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word\"\/>
<title>DjVu hidden text layer<\/title>
<\/head>
<body>
<div class=\"ocr_page\" title=\"bbox 0 0 1836 2596; ppageno 0\" id=\"page_1\">
<span class=\"ocrx_line\" title=\"bbox 385 631 1738 666\">
<span class=\"ocrx_word\" title=\"bbox 385 631 566 666\">ISTITUTO<\/span>
<span class=\"ocrx_word\" title=\"bbox 583 631 621 666\">DI<\/span>
<span class=\"ocrx_word\" title=\"bbox 639 631 820 666\">RICERCA<\/span>
<span class=\"ocrx_word\" title=\"bbox 837 631 972 666\">SULLA<\/span>
<span class=\"ocrx_word\" title=\"bbox 989 631 1190 666\">CRESCITA<\/span>
<span class=\"ocrx_word\" title=\"bbox 1205 631 1459 666\">ECONOMICA<\/span>
<span class=\"ocrx_word\" title=\"bbox 1477 631 1738 666\">SOSTENIBILE<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 451 683 1736 718\">
<span class=\"ocrx_word\" title=\"bbox 451 683 675 718\">RESEARCH<\/span>
<span class=\"ocrx_word\" title=\"bbox 693 683 903 718\">INSTITUTE<\/span>
<span class=\"ocrx_word\" title=\"bbox 922 683 980 718\">ON<\/span>
<span class=\"ocrx_word\" title=\"bbox 999 683 1288 718\">SUSTAINABLE<\/span>
<span class=\"ocrx_word\" title=\"bbox 1304 683 1528 718\">ECONOMIC<\/span>
<span class=\"ocrx_word\" title=\"bbox 1546 683 1736 718\">GROWTH<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 633 1528 1740 1622\">
<span class=\"ocrx_word\" title=\"bbox 633 1532 1000 1603\">Numero<\/span>
<span class=\"ocrx_word\" title=\"bbox 1033 1531 1104 1618\">6,<\/span>
<span class=\"ocrx_word\" title=\"bbox 1140 1528 1486 1622\">maggio<\/span>
<span class=\"ocrx_word\" title=\"bbox 1515 1531 1740 1603\">2018<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 1371 1979 1697 2017\">
<span class=\"ocrx_word\" title=\"bbox 1371 1980 1482 2009\">Follow<\/span>
<span class=\"ocrx_word\" title=\"bbox 1494 1980 1549 2009\">the<\/span>
<span class=\"ocrx_word\" title=\"bbox 1565 1979 1697 2017\">Byterfly<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 1226 2041 1695 2078\">
<span class=\"ocrx_word\" title=\"bbox 1226 2041 1287 2070\">and<\/span>
<span class=\"ocrx_word\" title=\"bbox 1302 2042 1396 2078\">enjoy<\/span>
<span class=\"ocrx_word\" title=\"bbox 1408 2049 1493 2078\">open<\/span>
<span class=\"ocrx_word\" title=\"bbox 1508 2041 1695 2078\">knowledge<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 1082 2155 1698 2189\">
<span class=\"ocrx_word\" title=\"bbox 1082 2155 1293 2183\">GIANCARLO<\/span>
<span class=\"ocrx_word\" title=\"bbox 1304 2155 1457 2189\">BIRELLO,<\/span>
<span class=\"ocrx_word\" title=\"bbox 1469 2156 1577 2183\">ANNA<\/span>
<span class=\"ocrx_word\" title=\"bbox 1590 2156 1698 2183\">PERIN<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 1323 126 1734 164\">
<span class=\"ocrx_word\" title=\"bbox 1323 128 1402 156\">ISSN<\/span>
<span class=\"ocrx_word\" title=\"bbox 1413 126 1536 164\">(print):<\/span>
<span class=\"ocrx_word\" title=\"bbox 1546 128 1734 156\">2421-5783<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 1288 179 1734 216\">
<span class=\"ocrx_word\" title=\"bbox 1288 181 1368 209\">ISSN<\/span>
<span class=\"ocrx_word\" title=\"bbox 1379 179 1434 216\">(on<\/span>
<span class=\"ocrx_word\" title=\"bbox 1447 179 1536 216\">line):<\/span>
<span class=\"ocrx_word\" title=\"bbox 1546 181 1734 209\">2421-5562<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 548 878 1734 1128\">
<span class=\"ocrx_word\" title=\"bbox 548 878 1734 1128\">Rapporto<\/span>
<\/span>
<span class=\"ocrx_line\" title=\"bbox 805 1151 1734 1358\">
<span class=\"ocrx_word\" title=\"bbox 805 1151 1734 1358\">Tecnico<\/span>
<\/span>
<\/div>
<\/body>
<\/html>"
}
I think it is probably an error in how I post the document, because if I paste the XML into the analysis screen of the Solr GUI, only the words are extracted and not the numbers. My plan now is to check MiniOCR (I like it!!) and to run some more checks using your reference. |
@jbaiter and @DiegoPino good news: inline works with hOCR and MiniOCR.
|
I tested with a couple of field types; both worked with hOCR and MiniOCR.
<fieldtype name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
<analyzer type="index">
<charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldtype>
<fieldtype name="text_ocr_inline" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
<analyzer type="index">
<charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
</analyzer>
</fieldtype>
|
This is interesting, since there's no file handling if you only have the
Ooops, this is a documentation bug that will be fixed asap, thank you! |
The hOCR used was:
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="ocr-system" content="djvu2hocr 0.10.2" />
<meta name="ocr-capabilities" content="ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word" />
<title>DjVu hidden text layer</title>
</head>
<body>
<div class="ocr_page" id="page-0" title="bbox 0 0 1836 2596"><span class="ocrx_line" title="bbox 385 631 1738 666"><span class="ocrx_word" title="bbox 385 631 566 666">ISTITUTO</span> <span class="ocrx_word" title="bbox 583 631 621 666">DI</span> <span class="ocrx_word" title="bbox 639 631 820 666">RICERCA</span> <span class="ocrx_word" title="bbox 837 631 972 666">SULLA</span> <span class="ocrx_word" title="bbox 989 631 1190 666">CRESCITA</span> <span class="ocrx_word" title="bbox 1205 631 1459 666">ECONOMICA</span> <span class="ocrx_word" title="bbox 1477 631 1738 666">SOSTENIBILE</span></span>
<span class="ocrx_line" title="bbox 451 683 1736 718"><span class="ocrx_word" title="bbox 451 683 675 718">RESEARCH</span> <span class="ocrx_word" title="bbox 693 683 903 718">INSTITUTE</span> <span class="ocrx_word" title="bbox 922 683 980 718">ON</span> <span class="ocrx_word" title="bbox 999 683 1288 718">SUSTAINABLE</span> <span class="ocrx_word" title="bbox 1304 683 1528 718">ECONOMIC</span> <span class="ocrx_word" title="bbox 1546 683 1736 718">GROWTH</span></span>
<span class="ocrx_line" title="bbox 633 1528 1740 1622"><span class="ocrx_word" title="bbox 633 1532 1000 1603">Numero</span> <span class="ocrx_word" title="bbox 1032 1531 1104 1618">6,</span> <span class="ocrx_word" title="bbox 1140 1528 1486 1622">maggio</span> <span class="ocrx_word" title="bbox 1515 1531 1740 1603">2018</span></span>
<span class="ocrx_line" title="bbox 1371 1979 1697 2017"><span class="ocrx_word" title="bbox 1371 1980 1482 2009">Follow</span> <span class="ocrx_word" title="bbox 1494 1980 1549 2009">the</span> <span class="ocrx_word" title="bbox 1565 1979 1697 2017">Byterfly</span></span>
<span class="ocrx_line" title="bbox 1226 2041 1695 2078"><span class="ocrx_word" title="bbox 1226 2041 1287 2070">and</span> <span class="ocrx_word" title="bbox 1302 2042 1396 2078">enjoy</span> <span class="ocrx_word" title="bbox 1408 2049 1493 2078">open</span> <span class="ocrx_word" title="bbox 1508 2041 1695 2078">knowledge</span></span>
<span class="ocrx_line" title="bbox 1082 2155 1698 2189"><span class="ocrx_word" title="bbox 1082 2155 1293 2183">GIANCARLO</span> <span class="ocrx_word" title="bbox 1304 2155 1457 2189">BIRELLO,</span> <span class="ocrx_word" title="bbox 1469 2156 1577 2183">ANNA</span> <span class="ocrx_word" title="bbox 1590 2156 1698 2183">PERIN</span></span>
<span class="ocrx_line" title="bbox 1323 126 1734 164"><span class="ocrx_word" title="bbox 1323 128 1402 156">ISSN</span> <span class="ocrx_word" title="bbox 1413 126 1536 164">(print):</span> <span class="ocrx_word" title="bbox 1546 128 1734 156">2421-5783</span></span>
<span class="ocrx_line" title="bbox 1288 179 1734 216"><span class="ocrx_word" title="bbox 1288 181 1368 209">ISSN</span> <span class="ocrx_word" title="bbox 1379 179 1434 216">(on</span> <span class="ocrx_word" title="bbox 1447 179 1536 216">line):</span> <span class="ocrx_word" title="bbox 1546 181 1734 209">2421-5562</span></span>
<span class="ocrx_line" title="bbox 548 878 1734 1128"><span class="ocrx_word" title="bbox 548 878 1734 1128">Rapporto</span></span>
<span class="ocrx_line" title="bbox 805 1151 1734 1358"><span class="ocrx_word" title="bbox 805 1151 1734 1358">Tecnico</span></span>
</div>
</body>
</html>
While the MiniOCR (derived from the hOCR above) was:
<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id="0" wh="1836 2596">
<b>
<l><w x="385 631 566 666">ISTITUTO</w> <w x="583 631 621 666">DI</w> <w x="639 631 820 666">RICERCA</w> <w x="837 631 972 666">SULLA</w> <w x="989 631 1190 666">CRESCITA</w> <w x="1205 631 1459 666">ECONOMICA</w> <w x="1477 631 1738 666">SOSTENIBILE</w> </l>
<l><w x="451 683 675 718">RESEARCH</w> <w x="693 683 903 718">INSTITUTE</w> <w x="922 683 980 718">ON</w> <w x="999 683 1288 718">SUSTAINABLE</w> <w x="1304 683 1528 718">ECONOMIC</w> <w x="1546 683 1736 718">GROWTH</w> </l>
<l><w x="633 1532 1000 1603">Numero</w> <w x="1032 1531 1104 1618">6,</w> <w x="1140 1528 1486 1622">maggio</w> <w x="1515 1531 1740 1603">2018</w> </l>
<l><w x="1371 1980 1482 2009">Follow</w> <w x="1494 1980 1549 2009">the</w> <w x="1565 1979 1697 2017">Byterfly</w> </l>
<l><w x="1226 2041 1287 2070">and</w> <w x="1302 2042 1396 2078">enjoy</w> <w x="1408 2049 1493 2078">open</w> <w x="1508 2041 1695 2078">knowledge</w> </l>
<l><w x="1082 2155 1293 2183">GIANCARLO</w> <w x="1304 2155 1457 2189">BIRELLO,</w> <w x="1469 2156 1577 2183">ANNA</w> <w x="1590 2156 1698 2183">PERIN</w> </l>
<l><w x="1323 128 1402 156">ISSN</w> <w x="1413 126 1536 164">(print):</w> <w x="1546 128 1734 156">2421-5783</w> </l>
<l><w x="1288 181 1368 209">ISSN</w> <w x="1379 179 1434 216">(on</w> <w x="1447 179 1536 216">line):</w> <w x="1546 181 1734 209">2421-5562</w> </l>
<l><w x="548 878 1734 1128">Rapporto</w> </l>
<l><w x="805 1151 1734 1358">Tecnico</w> </l>
</b>
</p>
</ocr> |
Finally (I'll answer @jbaiter in my next post), I used this JSON to update the Solr document for hOCR:
{
"id": "ocrdoc-2-stored",
"ocr_text_stored": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">
<head>
<meta http-equiv=\"Content-Type\" content=\"text\/html; charset=UTF-8\" \/>
<meta name=\"ocr-system\" content=\"djvu2hocr 0.10.2\" \/>
<meta name=\"ocr-capabilities\" content=\"ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word\" \/>
<title>DjVu hidden text layer<\/title>
<\/head>
<body>
<div class=\"ocr_page\" id=\"page-0\" title=\"bbox 0 0 1836 2596\"><span class=\"ocrx_line\" title=\"bbox 385 631 1738 666\"><span class=\"ocrx_word\" title=\"bbox 385 631 566 666\">ISTITUTO<\/span> <span class=\"ocrx_word\" title=\"bbox 583 631 621 666\">DI<\/span> <span class=\"ocrx_word\" title=\"bbox 639 631 820 666\">RICERCA<\/span> <span class=\"ocrx_word\" title=\"bbox 837 631 972 666\">SULLA<\/span> <span class=\"ocrx_word\" title=\"bbox 989 631 1190 666\">CRESCITA<\/span> <span class=\"ocrx_word\" title=\"bbox 1205 631 1459 666\">ECONOMICA<\/span> <span class=\"ocrx_word\" title=\"bbox 1477 631 1738 666\">SOSTENIBILE<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 451 683 1736 718\"><span class=\"ocrx_word\" title=\"bbox 451 683 675 718\">RESEARCH<\/span> <span class=\"ocrx_word\" title=\"bbox 693 683 903 718\">INSTITUTE<\/span> <span class=\"ocrx_word\" title=\"bbox 922 683 980 718\">ON<\/span> <span class=\"ocrx_word\" title=\"bbox 999 683 1288 718\">SUSTAINABLE<\/span> <span class=\"ocrx_word\" title=\"bbox 1304 683 1528 718\">ECONOMIC<\/span> <span class=\"ocrx_word\" title=\"bbox 1546 683 1736 718\">GROWTH<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 633 1528 1740 1622\"><span class=\"ocrx_word\" title=\"bbox 633 1532 1000 1603\">Numero<\/span> <span class=\"ocrx_word\" title=\"bbox 1032 1531 1104 1618\">6,<\/span> <span class=\"ocrx_word\" title=\"bbox 1140 1528 1486 1622\">maggio<\/span> <span class=\"ocrx_word\" title=\"bbox 1515 1531 1740 1603\">2018<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1371 1979 1697 2017\"><span class=\"ocrx_word\" title=\"bbox 1371 1980 1482 2009\">Follow<\/span> <span class=\"ocrx_word\" title=\"bbox 1494 1980 1549 2009\">the<\/span> <span class=\"ocrx_word\" title=\"bbox 1565 1979 1697 2017\">Byterfly<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1226 2041 1695 2078\"><span class=\"ocrx_word\" title=\"bbox 1226 2041 1287 2070\">and<\/span> <span class=\"ocrx_word\" title=\"bbox 1302 2042 1396 2078\">enjoy<\/span> <span class=\"ocrx_word\" title=\"bbox 1408 2049 1493 2078\">open<\/span> <span class=\"ocrx_word\" title=\"bbox 1508 2041 1695 2078\">knowledge<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1082 2155 1698 2189\"><span class=\"ocrx_word\" title=\"bbox 1082 2155 1293 2183\">GIANCARLO<\/span> <span class=\"ocrx_word\" title=\"bbox 1304 2155 1457 2189\">BIRELLO,<\/span> <span class=\"ocrx_word\" title=\"bbox 1469 2156 1577 2183\">ANNA<\/span> <span class=\"ocrx_word\" title=\"bbox 1590 2156 1698 2183\">PERIN<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1323 126 1734 164\"><span class=\"ocrx_word\" title=\"bbox 1323 128 1402 156\">ISSN<\/span> <span class=\"ocrx_word\" title=\"bbox 1413 126 1536 164\">(print):<\/span> <span class=\"ocrx_word\" title=\"bbox 1546 128 1734 156\">2421-5783<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1288 179 1734 216\"><span class=\"ocrx_word\" title=\"bbox 1288 181 1368 209\">ISSN<\/span> <span class=\"ocrx_word\" title=\"bbox 1379 179 1434 216\">(on<\/span> <span class=\"ocrx_word\" title=\"bbox 1447 179 1536 216\">line):<\/span> <span class=\"ocrx_word\" title=\"bbox 1546 181 1734 209\">2421-5562<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 548 878 1734 1128\"><span class=\"ocrx_word\" title=\"bbox 548 878 1734 1128\">Rapporto<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 805 1151 1734 1358\"><span class=\"ocrx_word\" title=\"bbox 805 1151 1734 1358\">Tecnico<\/span><\/span>
<\/div>
<\/body>
<\/html>"
}
and for MiniOCR:
{
"id": "ocrdoc-1-stored",
"ocr_text_stored": "<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id=\"0\" wh=\"1836 2596\">
<b>
<l><w x=\"385 631 566 666\">ISTITUTO<\/w> <w x=\"583 631 621 666\">DI<\/w> <w x=\"639 631 820 666\">RICERCA<\/w> <w x=\"837 631 972 666\">SULLA<\/w> <w x=\"989 631 1190 666\">CRESCITA<\/w> <w x=\"1205 631 1459 666\">ECONOMICA<\/w> <w x=\"1477 631 1738 666\">SOSTENIBILE<\/w> <\/l>
<l><w x=\"451 683 675 718\">RESEARCH<\/w> <w x=\"693 683 903 718\">INSTITUTE<\/w> <w x=\"922 683 980 718\">ON<\/w> <w x=\"999 683 1288 718\">SUSTAINABLE<\/w> <w x=\"1304 683 1528 718\">ECONOMIC<\/w> <w x=\"1546 683 1736 718\">GROWTH<\/w> <\/l>
<l><w x=\"633 1532 1000 1603\">Numero<\/w> <w x=\"1032 1531 1104 1618\">6,<\/w> <w x=\"1140 1528 1486 1622\">maggio<\/w> <w x=\"1515 1531 1740 1603\">2018<\/w> <\/l>
<l><w x=\"1371 1980 1482 2009\">Follow<\/w> <w x=\"1494 1980 1549 2009\">the<\/w> <w x=\"1565 1979 1697 2017\">Byterfly<\/w> <\/l>
<l><w x=\"1226 2041 1287 2070\">and<\/w> <w x=\"1302 2042 1396 2078\">enjoy<\/w> <w x=\"1408 2049 1493 2078\">open<\/w> <w x=\"1508 2041 1695 2078\">knowledge<\/w> <\/l>
<l><w x=\"1082 2155 1293 2183\">GIANCARLO<\/w> <w x=\"1304 2155 1457 2189\">BIRELLO,<\/w> <w x=\"1469 2156 1577 2183\">ANNA<\/w> <w x=\"1590 2156 1698 2183\">PERIN<\/w> <\/l>
<l><w x=\"1323 128 1402 156\">ISSN<\/w> <w x=\"1413 126 1536 164\">(print):<\/w> <w x=\"1546 128 1734 156\">2421-5783<\/w> <\/l>
<l><w x=\"1288 181 1368 209\">ISSN<\/w> <w x=\"1379 179 1434 216\">(on<\/w> <w x=\"1447 179 1536 216\">line):<\/w> <w x=\"1546 181 1734 209\">2421-5562<\/w> <\/l>
<l><w x=\"548 878 1734 1128\">Rapporto<\/w> <\/l>
<l><w x=\"805 1151 1734 1358\">Tecnico<\/w> <\/l>
<\/b>
<\/p>
<\/ocr>"
}
|
@jbaiter keep in mind that this could be an error on my side, caused by how I post to Solr or by the JSON formatting.
|
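Regarding the JSON formatting concern: as rendered here, the JSON documents above contain raw line breaks inside the string value, which strict JSON parsers reject. One way to rule out hand-escaping mistakes is to let a JSON library build the payload. The following is only a sketch; the helper name `build_update_doc` is hypothetical, the field and id values are taken from the posts above, and wrapping the document in a list assumes the payload is sent to Solr's JSON update handler.

```python
import json


def build_update_doc(doc_id: str, field: str, ocr_markup: str) -> str:
    """Serialize one document; json.dumps escapes quotes and line breaks."""
    return json.dumps([{"id": doc_id, field: ocr_markup}])


# A shortened version of the MiniOCR sample from this thread.
ocr = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<ocr>\n"
    '<p xml:id="0" wh="1836 2596">\n'
    "<b>\n"
    '<l><w x="548 878 1734 1128">Rapporto</w> </l>\n'
    "</b>\n"
    "</p>\n"
    "</ocr>"
)

payload = build_update_doc("ocrdoc-1-stored", "ocr_text_stored", ocr)
print(payload)
```

The resulting string can be written to a file and posted as before. Decoding it again with `json.loads` gives back the OCR markup unchanged, which is a quick way to check that nothing was mangled on the way.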
Yep, that's a bug in my "Is this a source pointer" regular expression :-) Will be fixed, thank you! |
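@jbaiter one last question (for today...): do you have a tool to convert hOCR (e.g. as produced by djvu2hocr) to MiniOCR?
<?php
// Convert an hOCR file to MiniOCR and print it to stdout.
// Options: -i <hOCR input file>, -p <value for the page's xml:id>
$val = getopt("i:p:");
$xml = simplexml_load_file($val['i']);
echo "<?xml version='1.0' encoding='UTF-8'?>" . "\n";
echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page){
  // Page title is e.g. "bbox 0 0 1836 2596": strip "bbox " and split the numbers.
  $coos = explode(" ", substr($page['title'], 5));
  // rtrim() drops the ";" left over when further properties follow the bbox.
  echo '<p xml:id="' . $val['p'] . '" wh="' . $coos[2] . " " . rtrim($coos[3], ';') . '">' . "\n";
  echo '<b>' . "\n";
  foreach ($page->children() as $line){
    echo '<l>';
    foreach ($line->children() as $word){
      // Word title is e.g. "bbox 385 631 566 666"; the four values are copied unchanged.
      $wcoos = explode(" ", $word['title']);
      // Escape the word text so that "&" or "<" cannot break the output XML.
      echo '<w x="' . $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4] . '">' . htmlspecialchars((string) $word, ENT_XML1) . '</w> ';
    }
    echo '</l>' . "\n";
  }
  echo '</b>' . "\n";
  echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?>
Calling with |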
Thanks to you for your work with this great plugin !! |
@jbaiter A question: is there any reason why the highlighting request returns empty snippets for inline MiniOCR when the coordinates are decimal values (e.g. 0.012) instead of integers? Do I need a different format for float values in MiniOCR? |
This might be a bug, do you have a sample of a MiniOCR line with relative coordinates? |
I was just looking at https://github.com/dbmdz/solr-ocrhighlighting/blob/main/src/test/resources/data/miniocr.xml and noticed that the float numbers there start with "." and not "0.", so I'll try that now and report back here. Thanks |
@jbaiter it works! Float numbers need to start with "." and not "0.". |
Oh, yeah, that is not mentioned in the documentation yet, sorry! Will be fixed asap, thanks for pointing it out. |
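For anyone generating MiniOCR with relative coordinates, a small helper can take care of the leading-zero detail discussed above. This is a sketch in Python under a few assumptions: the function names `format_relative` and `word_element` are made up, the number of decimal digits is an arbitrary choice, and the only thing confirmed in this thread is that values starting with "." are accepted while values starting with "0." were not.

```python
from xml.sax.saxutils import escape


def format_relative(value: float, digits: int = 4) -> str:
    """Format a fraction in [0, 1) without the leading zero, e.g. '.0120'."""
    text = f"{value:.{digits}f}"
    if not text.startswith("0."):
        # Negative values and values that round up to 1 are rejected here.
        raise ValueError(f"expected a fraction in [0, 1), got {value!r}")
    return text[1:]


def word_element(text: str, box) -> str:
    """Build a MiniOCR <w> element from four relative coordinate values."""
    coords = " ".join(format_relative(v) for v in box)
    return f'<w x="{coords}">{escape(text)}</w>'


print(word_element("Rapporto", (0.012, 0.3, 0.5, 0.1)))
```

Rejecting out-of-range values instead of silently clamping them is a deliberate choice in this sketch, so that bad input shows up at conversion time rather than as empty snippets later.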
@jbaiter thanks. I saw your commit c5a9b48#diff-5cd38c8f72f24090d0841363df5244e806e856688d0d0c520adb643e93b1dbb8R73 and can see it there now. We are almost there with a concrete implementation and will share it next week once it is tested. I have already said this, but your work is great and inspiring. Thanks again |
Currently we only support loading field values/OCR fragments from the file system.
Support for S3 buckets could be added by implementing a custom ExternalFieldLoader and could be useful for deployment scenarios where the OCR is already stored there. S3 supports requests for byte ranges, so it should be pretty easy to write an IterableCharSequence that makes use of this.
One additional benefit that S3 users could get is potentially huge increases in performance if we implemented multi-threaded highlighting:
However, since the BSB does (unfortunately :-/) not use S3, this feature will be very low-priority.
I would be willing to work on it in my spare time, if there's enough interest for this.
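To illustrate the byte-range idea, here is a rough Python sketch of a reader that fetches fixed-size chunks on demand and keeps them in memory. It is not the plugin's ExternalFieldLoader or IterableCharSequence (those are Java interfaces whose details are not shown in this issue). The class name `RangeBackedBytes`, the `fetch_range` callback and the chunk size are all assumptions, and the in-memory `fake_fetch` stands in for a real ranged GET against an object store.

```python
from typing import Callable, Dict


class RangeBackedBytes:
    """Serve byte slices of a remote object, fetching fixed-size chunks lazily."""

    def __init__(self, fetch_range: Callable[[int, int], bytes],
                 size: int, chunk_size: int = 64 * 1024):
        self._fetch = fetch_range      # fetch_range(start, end_exclusive) -> bytes
        self._size = size
        self._chunk_size = chunk_size
        self._chunks: Dict[int, bytes] = {}  # unbounded cache, fine for a sketch
        self.requests = 0              # number of remote fetches performed

    def _chunk(self, index: int) -> bytes:
        if index not in self._chunks:
            start = index * self._chunk_size
            end = min(start + self._chunk_size, self._size)
            # An HTTP Range header is inclusive, so a real fetch would
            # request "bytes=start-(end-1)" here.
            self._chunks[index] = self._fetch(start, end)
            self.requests += 1
        return self._chunks[index]

    def read(self, start: int, end: int) -> bytes:
        start, end = max(0, start), min(end, self._size)
        if start >= end:
            return b""
        first = start // self._chunk_size
        last = (end - 1) // self._chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        offset = start - first * self._chunk_size
        return data[offset:offset + (end - start)]


# In-memory stand-in for the remote object.
blob = ("<w>word</w> " * 1000).encode("utf-8")


def fake_fetch(start: int, end: int) -> bytes:
    return blob[start:end]


reader = RangeBackedBytes(fake_fetch, len(blob), chunk_size=256)
print(reader.read(300, 320), reader.requests)
```

Since highlighting only needs the region around each stored offset, a reader like this would transfer a few chunks per snippet instead of the whole OCR file. The sketch works on bytes, so a real implementation would also have to map character offsets to byte offsets for multi-byte encodings.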