Loading OCR fragments from S3 #49

jbaiter · 2019-06-18T14:13:21Z

Currently we only support loading field values/OCR fragments from the file system.

Support for S3 buckets could be added by writing a custom ExternalFieldLoader implementation, which would be useful for deployment scenarios where the OCR is already stored there.

S3 supports requests for byte-ranges, so it should be pretty easy to write an IterableCharSequence that makes use of this.
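
To make the byte-range idea concrete, here is a minimal sketch in Python (not the plugin's Java code). The class name `RangeBackedBytes` and the `fetch` callable are hypothetical; against S3, `fetch(start, end)` would be a GET request carrying the header `Range: bytes=start-end`. The sketch only shows that a highlighter seeking to a known offset needs to pull the overlapping chunk, not the whole object.

```python
from typing import Callable, Dict

# Hypothetical fetch signature: returns the bytes for the inclusive range [start, end].
# Against S3 this would be a GET request with the header "Range: bytes=start-end".
FetchRange = Callable[[int, int], bytes]


class RangeBackedBytes:
    """Read-only, lazily loaded view over a remote object of known size."""

    def __init__(self, size: int, fetch: FetchRange, chunk_size: int = 64 * 1024):
        self.size = size
        self.fetch = fetch
        self.chunk_size = chunk_size
        self._chunks: Dict[int, bytes] = {}  # chunk index -> bytes already fetched

    def _chunk(self, index: int) -> bytes:
        if index not in self._chunks:
            start = index * self.chunk_size
            end = min(start + self.chunk_size, self.size) - 1
            self._chunks[index] = self.fetch(start, end)
        return self._chunks[index]

    def read(self, start: int, end: int) -> bytes:
        """Return the bytes in [start, end), fetching only the chunks that overlap."""
        start = max(0, start)
        end = min(end, self.size)
        if start >= end:
            return b""
        first = start // self.chunk_size
        last = (end - 1) // self.chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        offset = start - first * self.chunk_size
        return data[offset:offset + (end - start)]


# Stand-in for S3: an in-memory blob plus a log of the ranges that were requested.
blob = b"<ocr>" + b"x" * 200 + b"<w>needle</w>" + b"y" * 200 + b"</ocr>"
requested = []


def fake_fetch(start: int, end: int) -> bytes:
    requested.append((start, end))
    return blob[start:end + 1]


remote = RangeBackedBytes(len(blob), fake_fetch, chunk_size=64)
pos = blob.index(b"needle")
snippet = remote.read(pos, pos + 6)  # triggers a single ranged fetch
```

Reading the same region again is served from the chunks already held in memory, so the number of remote requests depends on how scattered the matches are rather than on the file size.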

One additional benefit that S3 users could get is potentially huge increases in performance if we implemented multi-threaded highlighting:

Amazon S3’s support for parallel requests means you can scale your S3 performance by the factor of your compute cluster, without making any customizations to your application. Performance scales per prefix, so you can use as many prefixes as you need in parallel to achieve the required throughput. There are no limits to the number of prefixes.
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/

However, since the BSB does (unfortunately :-/) not use S3, this feature will be very low-priority.
I would be willing to work on it in my spare time, if there's enough interest for this.


DiegoPino · 2020-11-16T18:30:47Z

Hello @jbaiter, and thanks for opening this issue. Let me explain our use case and some thoughts about S3 loading.
We have developed a Drupal 8/9 based repository architecture named Archipelago. It removes the need for a fixed DB schema and hundreds of DB tables when creating complex, metadata-rich digital objects. It does so with a "smart" single JSON-based field that can hold hierarchical metadata, with querying capabilities that expose internal hierarchies and values flat or aggregated. It can also render any other destination format on the fly (e.g. IIIF manifests, XML representations) using the Twig templating engine with some custom additions. It also connects, references, and manages binary assets. Here is a silly example (see the RAW metadata tab at the bottom): https://play.archipelago.nyc/do/6f1da83c-2bf3-4cc3-bf56-d51bd3a97948

For many reasons, but specifically because our system aims to solve the resource constraints (money, staff, DevOps) found in smaller cultural heritage institutions, we store our assets in S3 using a simple hash structure and use internal PHP stream wrappers and Cantaloupe to deal with them as "local". Also, since we use a lot of IIIF, books are often ingested as a single PDF and individual pages are generated on the fly via Cantaloupe.

Currently one of our team members is testing your wonderful plugin on a custom installation that uses local storage, and the results are promising. But without access to a local filesystem on many of these installations, lazy loading XML files becomes complex for us.

Current workflow (before using your plugin)
1.- A PDF gets attached and saved.
2.- A set of chained post processors (event driven) gets the number of pages and runs hOCR extraction (it can also be ALTO) page by page in a queue. Each resulting document is converted into a simpler JSON representation (coordinates + words) and pushed into Solr as its own document, "bound" via a simple parent relationship (no native Solr nested docs) to the larger Solr document for the full digital object. Afterwards these are also aggregated, stored as a Frictionless data package, and attached to the digital object for future use or reindexing needs. We went for this format because it standardizes having a manifest with its content, we have tooling for it, and since we have no need for realtime access it's a good way of minimizing individual file management, e.g. for a 1000+ page book. This is also stored in S3, same as the PDF itself.
In other words: one master Solr document per object, and many dynamically created smaller ones, one for each page. These smaller ones also have full tokenization of the text-only value.

How do we highlight? The same way many others do right now, before your plugin became available.
The query is run and the highlight is retrieved, along with the full JSON OCR representation. The latter is used to find the coordinates of the highlighted terms, and in our case PHP processing does the rest. It is a single Solr query, of course, but with the drawback of a sometimes large returned payload to deal with and no direct highlighting.

Our problem right now is of course not the initial indexing, since for post processing we already fetch remote sources to temporary local storage. But these copies are not kept around.

So. S3 (or maybe generic URL retrieval with local cache?) could be a good option, maybe even something similar to what IIIF cantaloupe does for source caching and streaming when dealing with S3/Azure remotes.
But also (please push back here, since I know so much less than you about the subject): an option where the source is in the Solr document and not stored remotely as a file. Basically what we already do manually, but internally: one Solr field that contains e.g. a JSON representation of the OCR, slimmer than XML, and your plugin could, via a config, "self prime" its highlight machinery from its own Solr document, or from another one if that is an issue.

Pinging here @giancarlobi since he is testing the plugin against filesystem right now.
Things I do not know are:
1.- The difference in storage used of a Solr field containing JSON v/s the JSON itself as a file
2.- If the plugin already has a caching mechanism for the lazy load or it will always go and fetch from the remote source
3.- If the plugin is capable of reading from its own Solr document (this may be a yes)
4.- If this is even an issue or maybe out of the scope (sorry)

I may see another use case here, that is not urgent but may be interesting for other people: Solr Cloud. In a multi core/multi shard distributed environment, local file access is probably impossible or would require some type of net mount.

Finally, we have thought of using some type of S3FS mount just to handle this use case, but since hOCR/OCR would be, in our architecture, simply another file saved in a hash-based bucket structure, we would have to mount basically everything. Not sure how performant that would be.

Thanks for reading and kudos on your work on this. It is really great!

jbaiter · 2020-11-16T19:47:02Z

Hey @DiegoPino, thanks for the very detailed use case, this is really helpful!

So. S3 (or maybe generic URL retrieval with local cache?) could be a good option, maybe even something similar to what IIIF cantaloupe does for source caching and streaming when dealing with S3/Azure remotes.

I'll look into what Cantaloupe is doing for caching. Currently the plugin relies completely on the OS page cache to reduce disk I/O.

But also (please push back here, since I know so much less than you about the subject): an option where the source is in the Solr document and not stored remotely as a file. Basically what we already do manually, but internally: one Solr field that contains e.g. a JSON representation of the OCR, slimmer than XML, and your plugin could, via a config, "self prime" its highlight machinery from its own Solr document, or from another one if that is an issue.

This is actually supported already, although it's neither well-documented nor well-tested at the moment :-) You can just leave out the ExternalUtf8ContentFilter in your analysis chain and put the complete hOCR/ALTO/MiniOCR into the document field; it should work just the same as with external files. As for a "slimmer than XML" representation, this is pretty much what the MiniOCR format is intended for. I'm pretty sure it's slimmer than your JSON schema (if you leave out all unnecessary whitespace) :-)

1.- The difference in storage used of a Solr field containing JSON v/s the JSON itself as a file

As mentioned above, the only difference is that you don't need the CharFilter that tricks Solr into treating external file sources as if their content were stored in the index, and that you put the OCR directly into the document field.

2.- If the plugin already has a caching mechanism for the lazy load or it will always go and fetch from the remote source

From the perspective of the application code, it will always go and fetch from the local file source at the moment. On the OS level, the page cache is used, and we have a small hook that does "cache warming" once the files to be highlighted are known.

3.- If the plugin is capable of reading from its own Solr document (this may be a yes)

Yes :-)

4.- If this is even an issue or maybe out of the scope (sorry)

Definitely an issue and totally within scope, no worries :-)

I may see another use case here, that is not urgent but may be interesting for other people: Solr Cloud. In a multi core/multi shard distributed environment, local file access is probably impossible or would require some type of net mount.

Yep, it requires a net mount (NFS/SMB/whatever), but this works like a charm. We're running SolrCloud on three shards with ~300 million pages here, and from what I can tell performance is pretty good (with flash storage on the filer, though). It will get better once I'm done with the current WIP refactor.

Finally, we have thought of using some type of S3FS mount just to handle this use case, but since hOCR/OCR would be, in our architecture, simply another file saved in a hash-based bucket structure, we would have to mount basically everything. Not sure how performant that would be.

Performance probably depends on your S3FS implementation, and you can use your OS' page cache, which is an advantage (iirc FUSE goes through the kernel VFS layer, at least on Linux, but I might be wrong), but I can see why mounting everything is not practical for you.

So to summarize, I'll look into supporting "source pointers" targeting S3 and what an in-application caching layer could look like, but I can't make any guarantees on when this will result in something usable for you. So, in the meantime, I recommend you maybe experiment with the in-index storage and the MiniOCR format, this is a combination that is pretty similar to what you're currently doing with your JSON format!
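
As a rough idea of what such an in-application caching layer might look like, here is a small Python sketch of a size-bounded LRU cache. Everything in it is hypothetical (the class `LruByteCache`, the `load` callable, the byte limit); it is not how the plugin or Cantaloupe does it, only an illustration of keeping recently used OCR payloads in memory and evicting the least recently used ones.

```python
from collections import OrderedDict
from typing import Callable


class LruByteCache:
    """Keep recently used OCR payloads in memory, bounded by total byte size."""

    def __init__(self, max_bytes: int, load: Callable[[str], bytes]):
        self.max_bytes = max_bytes
        self.load = load  # hypothetical loader, e.g. a remote fetch by object key
        self.used = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> bytes:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        data = self.load(key)
        self._entries[key] = data
        self.used += len(data)
        # Evict least recently used entries until we are back under the limit.
        while self.used > self.max_bytes and len(self._entries) > 1:
            _, evicted = self._entries.popitem(last=False)
            self.used -= len(evicted)
        return data


loads = []


def fake_load(key: str) -> bytes:
    loads.append(key)
    return key.encode() * 10  # every payload is 10 bytes in this toy setup


cache = LruByteCache(max_bytes=25, load=fake_load)
cache.get("a")  # loaded
cache.get("b")  # loaded
cache.get("a")  # served from memory
cache.get("c")  # loaded, evicts "b" (least recently used)
cache.get("b")  # has to be loaded again
```

A real layer would also need invalidation when the remote object changes, which this sketch leaves out.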

And please reach out anytime if you hit roadblocks/bugs/documentation gaps/mistakes, either via an issue or a Twitter DM :-)

giancarlobi · 2020-11-16T21:53:34Z

@jbaiter Many thanks!!! I tried with inline hOCR and it works after removing only this line from the schema:
<charFilter class="de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory" />
I see that the bbox numbers are also indexed, not only the words, so I probably need to tune the index analyzer better. Or do you have some ideas?
A question: what is the role of the second charFilter, OcrCharFilterFactory?
Again, thanks!

DiegoPino · 2020-11-16T22:01:51Z

That is wonderful! I really appreciate your very detailed reply, and I may need some extra time now to explore the capabilities. I had no idea it was already possible to query hOCR from Solr directly (I need to read more code next time). Simply great. We will go for MiniOCR, which is already similar to the smaller format (no pretty print) we were planning, and we will do some integration tests and also explore file-based S3 mounts for larger deployments. Many thanks @jbaiter!

jbaiter · 2020-11-16T23:21:14Z

@giancarlobi

I see that the bbox numbers are also indexed, not only the words, so I probably need to tune the index analyzer better. Or do you have some ideas?

You mean that parts of the bounding boxes/non-text markup are ending up in the index? This shouldn't happen and is a bug :-/ Do you have more information on when this happens? That is, the schema, a sample of the indexed terms, a sample of a broken snippet, etc.

A question: what is the role of the second charFilter, OcrCharFilterFactory?

This filter detects the OCR format of the document and then returns the format-specific character filter. This filter then takes care of extracting the plaintext from the OCR document, while tracking the corresponding offsets in the input OCR document. This results in the input offsets in the actual OCR markup being stored alongside the plaintext tokens. This again allows us to locate the matching terms very quickly in the markup at highlighting time, since we know the offset and can just seek to the position and discover the surrounding context.
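
The offset-tracking idea can be illustrated with a toy Python sketch. This is not the plugin's implementation: the function name `text_tokens_with_offsets` is made up, tags are skipped with a naive regular expression, and a real filter would have to parse the OCR format properly (entities, hyphenation, and so on). It only shows how plaintext tokens can keep offsets that point back into the original markup.

```python
import re
from typing import List, Tuple

TAG = re.compile(r"<[^>]*>")  # naive tag matcher, good enough for this toy input


def text_tokens_with_offsets(markup: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) triples; offsets point into the original markup."""
    tokens = []
    pos = 0
    # Walk the text runs between tags and split each run on whitespace.
    for tag in list(TAG.finditer(markup)) + [None]:
        run_end = tag.start() if tag else len(markup)
        for m in re.finditer(r"\S+", markup[pos:run_end]):
            tokens.append((m.group(), pos + m.start(), pos + m.end()))
        if tag:
            pos = tag.end()
    return tokens


ocr = '<l><w x="385 631 566 666">ISTITUTO</w> <w x="583 631 621 666">DI</w></l>'
tokens = text_tokens_with_offsets(ocr)

# At highlighting time a stored offset lets us jump straight into the markup
# and look at the surrounding context, here the enclosing <w> element.
token, start, end = tokens[1]
context_start = ocr.rfind("<w", 0, start)
context = ocr[context_start:end + len("</w>")]
```

Only the words end up as tokens, while the coordinates stay in the markup and are recovered by seeking to the stored offset.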

@DiegoPino
There's a test that demonstrates using MiniOCR with in-index storage, maybe this can be helpful for experimenting:
Ingesting a document: https://github.com/dbmdz/solr-ocrhighlighting/blob/main/src/test/java/de/digitalcollections/solrocr/solr/MiniOcrTest.java#L35-L37
Field type in schema: https://github.com/dbmdz/solr-ocrhighlighting/blob/main/src/test/resources/solr/general/schema.xml#L13-L27

giancarlobi · 2020-11-17T09:25:11Z

@jbaiter thanks again. About this:

You mean that parts of the bounding boxes/non-text markup are ending up in the index? This shouldn't happen and is a bug :-/ Do you have more information on when this happens? That is, the schema, a sample of the indexed terms, a sample of a broken snippet, etc.

I did a quick check and it is probably not a bug but a config error on my side. Anyway, I used this schema config:

<fieldtype name="text_ocr_inline" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
  </analyzer>
</fieldtype>

I ingested this JSON file as the document, using the post tool from the Solr installation:

{
    "id": "ocrdoc-1-i",
    "ocr_text_inline": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">
  <head>
    <meta http-equiv=\"Content-Type\" content=\"text\/html; charset=UTF-8\"\/>
    <meta name=\"ocr-system\" content=\"djvu2hocr 0.10.2\"\/>
    <meta name=\"ocr-capabilities\" content=\"ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word\"\/>
    <title>DjVu hidden text layer<\/title>
  <\/head>
  <body>
    <div class=\"ocr_page\" title=\"bbox 0 0 1836 2596; ppageno 0\" id=\"page_1\">
      <span class=\"ocrx_line\" title=\"bbox 385 631 1738 666\">
        <span class=\"ocrx_word\" title=\"bbox 385 631 566 666\">ISTITUTO<\/span>
        <span class=\"ocrx_word\" title=\"bbox 583 631 621 666\">DI<\/span>
        <span class=\"ocrx_word\" title=\"bbox 639 631 820 666\">RICERCA<\/span>
        <span class=\"ocrx_word\" title=\"bbox 837 631 972 666\">SULLA<\/span>
        <span class=\"ocrx_word\" title=\"bbox 989 631 1190 666\">CRESCITA<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1205 631 1459 666\">ECONOMICA<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1477 631 1738 666\">SOSTENIBILE<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 451 683 1736 718\">
        <span class=\"ocrx_word\" title=\"bbox 451 683 675 718\">RESEARCH<\/span>
        <span class=\"ocrx_word\" title=\"bbox 693 683 903 718\">INSTITUTE<\/span>
        <span class=\"ocrx_word\" title=\"bbox 922 683 980 718\">ON<\/span>
        <span class=\"ocrx_word\" title=\"bbox 999 683 1288 718\">SUSTAINABLE<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1304 683 1528 718\">ECONOMIC<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1546 683 1736 718\">GROWTH<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 633 1528 1740 1622\">
        <span class=\"ocrx_word\" title=\"bbox 633 1532 1000 1603\">Numero<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1033 1531 1104 1618\">6,<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1140 1528 1486 1622\">maggio<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1515 1531 1740 1603\">2018<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 1371 1979 1697 2017\">
        <span class=\"ocrx_word\" title=\"bbox 1371 1980 1482 2009\">Follow<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1494 1980 1549 2009\">the<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1565 1979 1697 2017\">Byterfly<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 1226 2041 1695 2078\">
        <span class=\"ocrx_word\" title=\"bbox 1226 2041 1287 2070\">and<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1302 2042 1396 2078\">enjoy<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1408 2049 1493 2078\">open<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1508 2041 1695 2078\">knowledge<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 1082 2155 1698 2189\">
        <span class=\"ocrx_word\" title=\"bbox 1082 2155 1293 2183\">GIANCARLO<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1304 2155 1457 2189\">BIRELLO,<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1469 2156 1577 2183\">ANNA<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1590 2156 1698 2183\">PERIN<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 1323 126 1734 164\">
        <span class=\"ocrx_word\" title=\"bbox 1323 128 1402 156\">ISSN<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1413 126 1536 164\">(print):<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1546 128 1734 156\">2421-5783<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 1288 179 1734 216\">
        <span class=\"ocrx_word\" title=\"bbox 1288 181 1368 209\">ISSN<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1379 179 1434 216\">(on<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1447 179 1536 216\">line):<\/span>
        <span class=\"ocrx_word\" title=\"bbox 1546 181 1734 209\">2421-5562<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 548 878 1734 1128\">
        <span class=\"ocrx_word\" title=\"bbox 548 878 1734 1128\">Rapporto<\/span>
      <\/span>
      <span class=\"ocrx_line\" title=\"bbox 805 1151 1734 1358\">
        <span class=\"ocrx_word\" title=\"bbox 805 1151 1734 1358\">Tecnico<\/span>
      <\/span>
    <\/div>
  <\/body>
<\/html>"
}

I think it is probably an error in how I post, because if I paste the XML into the Solr GUI analysis screen, only the words are extracted and not the numbers.
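
One way to rule out escaping problems in the posted file is to generate the JSON programmatically instead of writing it by hand. The Python sketch below is only an illustration under that assumption: it reuses the field names from the document above with a shortened hOCR sample, and shows that a strict JSON parser rejects a literal newline inside a string value, while `json.dumps` escapes newlines and quotes for you.

```python
import json

# Shortened hOCR sample; the field names match the document posted above.
hocr = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 1836 2596" id="page_1">
      <span class="ocrx_word" title="bbox 385 631 566 666">ISTITUTO</span>
    </div>
  </body>
</html>"""

doc = {"id": "ocrdoc-1-i", "ocr_text_inline": hocr}
payload = json.dumps(doc)  # newlines become \n, double quotes become \"

# A strict parser refuses an unescaped newline inside a JSON string.
try:
    json.loads('{"ocr_text_inline": "<ocr>\n</ocr>"}')
    raw_newline_accepted = True
except json.JSONDecodeError:
    raw_newline_accepted = False

# The generated payload round-trips to exactly the original markup.
round_tripped = json.loads(payload)["ocr_text_inline"]
```

Whether a given Solr setup tolerates raw newlines in the posted file is a separate question; generating the payload this way simply removes that variable.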

Now my plan is to check MiniOCR (I like it!!), and I'll run some more checks using your reference.
I'll post here how that works.
Thanks and have a nice day.

giancarlobi · 2020-11-17T17:26:49Z

@jbaiter and @DiegoPino good news: inline works with both hOCR and MiniOCR.
First, a couple of notes. I don't know whether these are my errors or bugs.
Notes:

It seems that the value of the field (i.e. ocr_text_stored) must contain newlines. Without them the plugin gives an error (File at ... does not exist); probably without newlines it evaluates the string as a file name.
MiniOCR needs <p xml:id=... instead of simple <p id=...
Test details follow in the next post.

giancarlobi · 2020-11-17T17:31:23Z

I tested with a couple of field types; both worked with hOCR and MiniOCR.

    <fieldtype name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
      <analyzer type="index">
        <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldtype>

<fieldtype name="text_ocr_inline" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
  </analyzer>
</fieldtype>

jbaiter · 2020-11-17T17:38:01Z

It seems that the value of the field (i.e. ocr_text_stored) must contain newlines. Without them the plugin gives an error (File at ... does not exist); probably without newlines it evaluates the string as a file name.

This is interesting, since there's no file handling if you only have the OcrCharFilterFactory in your analysis chain 🤔 Can you post the full error with its stack trace?

MiniOCR needs <p xml:id=... instead of simple <p id=...

Ooops, this is a documentation bug that will be fixed asap, thank you!

giancarlobi · 2020-11-17T17:39:33Z

The hOCR I used was:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta name="ocr-system" content="djvu2hocr 0.10.2" />
  <meta name="ocr-capabilities" content="ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word" />
  <title>DjVu hidden text layer</title>
</head>
<body>
<div class="ocr_page" id="page-0" title="bbox 0 0 1836 2596"><span class="ocrx_line" title="bbox 385 631 1738 666"><span class="ocrx_word" title="bbox 385 631 566 666">ISTITUTO</span> <span class="ocrx_word" title="bbox 583 631 621 666">DI</span> <span class="ocrx_word" title="bbox 639 631 820 666">RICERCA</span> <span class="ocrx_word" title="bbox 837 631 972 666">SULLA</span> <span class="ocrx_word" title="bbox 989 631 1190 666">CRESCITA</span> <span class="ocrx_word" title="bbox 1205 631 1459 666">ECONOMICA</span> <span class="ocrx_word" title="bbox 1477 631 1738 666">SOSTENIBILE</span></span>
<span class="ocrx_line" title="bbox 451 683 1736 718"><span class="ocrx_word" title="bbox 451 683 675 718">RESEARCH</span> <span class="ocrx_word" title="bbox 693 683 903 718">INSTITUTE</span> <span class="ocrx_word" title="bbox 922 683 980 718">ON</span> <span class="ocrx_word" title="bbox 999 683 1288 718">SUSTAINABLE</span> <span class="ocrx_word" title="bbox 1304 683 1528 718">ECONOMIC</span> <span class="ocrx_word" title="bbox 1546 683 1736 718">GROWTH</span></span>
<span class="ocrx_line" title="bbox 633 1528 1740 1622"><span class="ocrx_word" title="bbox 633 1532 1000 1603">Numero</span> <span class="ocrx_word" title="bbox 1032 1531 1104 1618">6,</span> <span class="ocrx_word" title="bbox 1140 1528 1486 1622">maggio</span> <span class="ocrx_word" title="bbox 1515 1531 1740 1603">2018</span></span>
<span class="ocrx_line" title="bbox 1371 1979 1697 2017"><span class="ocrx_word" title="bbox 1371 1980 1482 2009">Follow</span> <span class="ocrx_word" title="bbox 1494 1980 1549 2009">the</span> <span class="ocrx_word" title="bbox 1565 1979 1697 2017">Byterfly</span></span>
<span class="ocrx_line" title="bbox 1226 2041 1695 2078"><span class="ocrx_word" title="bbox 1226 2041 1287 2070">and</span> <span class="ocrx_word" title="bbox 1302 2042 1396 2078">enjoy</span> <span class="ocrx_word" title="bbox 1408 2049 1493 2078">open</span> <span class="ocrx_word" title="bbox 1508 2041 1695 2078">knowledge</span></span>
<span class="ocrx_line" title="bbox 1082 2155 1698 2189"><span class="ocrx_word" title="bbox 1082 2155 1293 2183">GIANCARLO</span> <span class="ocrx_word" title="bbox 1304 2155 1457 2189">BIRELLO,</span> <span class="ocrx_word" title="bbox 1469 2156 1577 2183">ANNA</span> <span class="ocrx_word" title="bbox 1590 2156 1698 2183">PERIN</span></span>
<span class="ocrx_line" title="bbox 1323 126 1734 164"><span class="ocrx_word" title="bbox 1323 128 1402 156">ISSN</span> <span class="ocrx_word" title="bbox 1413 126 1536 164">(print):</span> <span class="ocrx_word" title="bbox 1546 128 1734 156">2421-5783</span></span>
<span class="ocrx_line" title="bbox 1288 179 1734 216"><span class="ocrx_word" title="bbox 1288 181 1368 209">ISSN</span> <span class="ocrx_word" title="bbox 1379 179 1434 216">(on</span> <span class="ocrx_word" title="bbox 1447 179 1536 216">line):</span> <span class="ocrx_word" title="bbox 1546 181 1734 209">2421-5562</span></span>
<span class="ocrx_line" title="bbox 548 878 1734 1128"><span class="ocrx_word" title="bbox 548 878 1734 1128">Rapporto</span></span>
<span class="ocrx_line" title="bbox 805 1151 1734 1358"><span class="ocrx_word" title="bbox 805 1151 1734 1358">Tecnico</span></span>
</div>
</body>
</html>

And the MiniOCR (derived from the hOCR above):

<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id="0" wh="1836 2596">
<b>
<l><w x="385 631 566 666">ISTITUTO</w> <w x="583 631 621 666">DI</w> <w x="639 631 820 666">RICERCA</w> <w x="837 631 972 666">SULLA</w> <w x="989 631 1190 666">CRESCITA</w> <w x="1205 631 1459 666">ECONOMICA</w> <w x="1477 631 1738 666">SOSTENIBILE</w> </l>
<l><w x="451 683 675 718">RESEARCH</w> <w x="693 683 903 718">INSTITUTE</w> <w x="922 683 980 718">ON</w> <w x="999 683 1288 718">SUSTAINABLE</w> <w x="1304 683 1528 718">ECONOMIC</w> <w x="1546 683 1736 718">GROWTH</w> </l>
<l><w x="633 1532 1000 1603">Numero</w> <w x="1032 1531 1104 1618">6,</w> <w x="1140 1528 1486 1622">maggio</w> <w x="1515 1531 1740 1603">2018</w> </l>
<l><w x="1371 1980 1482 2009">Follow</w> <w x="1494 1980 1549 2009">the</w> <w x="1565 1979 1697 2017">Byterfly</w> </l>
<l><w x="1226 2041 1287 2070">and</w> <w x="1302 2042 1396 2078">enjoy</w> <w x="1408 2049 1493 2078">open</w> <w x="1508 2041 1695 2078">knowledge</w> </l>
<l><w x="1082 2155 1293 2183">GIANCARLO</w> <w x="1304 2155 1457 2189">BIRELLO,</w> <w x="1469 2156 1577 2183">ANNA</w> <w x="1590 2156 1698 2183">PERIN</w> </l>
<l><w x="1323 128 1402 156">ISSN</w> <w x="1413 126 1536 164">(print):</w> <w x="1546 128 1734 156">2421-5783</w> </l>
<l><w x="1288 181 1368 209">ISSN</w> <w x="1379 179 1434 216">(on</w> <w x="1447 179 1536 216">line):</w> <w x="1546 181 1734 209">2421-5562</w> </l>
<l><w x="548 878 1734 1128">Rapporto</w> </l>
<l><w x="805 1151 1734 1358">Tecnico</w> </l>
</b>
</p>
</ocr>

giancarlobi · 2020-11-17T17:43:20Z

Finally (I'll answer @jbaiter in the next post), I used this JSON to update the Solr doc.
For hOCR:

{
    "id": "ocrdoc-2-stored",
    "ocr_text_stored": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<html xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">
<head>
  <meta http-equiv=\"Content-Type\" content=\"text\/html; charset=UTF-8\" \/>
  <meta name=\"ocr-system\" content=\"djvu2hocr 0.10.2\" \/>
  <meta name=\"ocr-capabilities\" content=\"ocr_carea ocr_page ocr_par ocrx_block ocrx_line ocrx_word\" \/>
  <title>DjVu hidden text layer<\/title>
<\/head>
<body>
<div class=\"ocr_page\" id=\"page-0\" title=\"bbox 0 0 1836 2596\"><span class=\"ocrx_line\" title=\"bbox 385 631 1738 666\"><span class=\"ocrx_word\" title=\"bbox 385 631 566 666\">ISTITUTO<\/span> <span class=\"ocrx_word\" title=\"bbox 583 631 621 666\">DI<\/span> <span class=\"ocrx_word\" title=\"bbox 639 631 820 666\">RICERCA<\/span> <span class=\"ocrx_word\" title=\"bbox 837 631 972 666\">SULLA<\/span> <span class=\"ocrx_word\" title=\"bbox 989 631 1190 666\">CRESCITA<\/span> <span class=\"ocrx_word\" title=\"bbox 1205 631 1459 666\">ECONOMICA<\/span> <span class=\"ocrx_word\" title=\"bbox 1477 631 1738 666\">SOSTENIBILE<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 451 683 1736 718\"><span class=\"ocrx_word\" title=\"bbox 451 683 675 718\">RESEARCH<\/span> <span class=\"ocrx_word\" title=\"bbox 693 683 903 718\">INSTITUTE<\/span> <span class=\"ocrx_word\" title=\"bbox 922 683 980 718\">ON<\/span> <span class=\"ocrx_word\" title=\"bbox 999 683 1288 718\">SUSTAINABLE<\/span> <span class=\"ocrx_word\" title=\"bbox 1304 683 1528 718\">ECONOMIC<\/span> <span class=\"ocrx_word\" title=\"bbox 1546 683 1736 718\">GROWTH<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 633 1528 1740 1622\"><span class=\"ocrx_word\" title=\"bbox 633 1532 1000 1603\">Numero<\/span> <span class=\"ocrx_word\" title=\"bbox 1032 1531 1104 1618\">6,<\/span> <span class=\"ocrx_word\" title=\"bbox 1140 1528 1486 1622\">maggio<\/span> <span class=\"ocrx_word\" title=\"bbox 1515 1531 1740 1603\">2018<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1371 1979 1697 2017\"><span class=\"ocrx_word\" title=\"bbox 1371 1980 1482 2009\">Follow<\/span> <span class=\"ocrx_word\" title=\"bbox 1494 1980 1549 2009\">the<\/span> <span class=\"ocrx_word\" title=\"bbox 1565 1979 1697 2017\">Byterfly<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1226 2041 1695 2078\"><span class=\"ocrx_word\" title=\"bbox 1226 2041 1287 2070\">and<\/span> <span class=\"ocrx_word\" title=\"bbox 1302 2042 1396 2078\">enjoy<\/span> <span class=\"ocrx_word\" title=\"bbox 1408 2049 1493 2078\">open<\/span> <span class=\"ocrx_word\" title=\"bbox 1508 2041 1695 2078\">knowledge<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1082 2155 1698 2189\"><span class=\"ocrx_word\" title=\"bbox 1082 2155 1293 2183\">GIANCARLO<\/span> <span class=\"ocrx_word\" title=\"bbox 1304 2155 1457 2189\">BIRELLO,<\/span> <span class=\"ocrx_word\" title=\"bbox 1469 2156 1577 2183\">ANNA<\/span> <span class=\"ocrx_word\" title=\"bbox 1590 2156 1698 2183\">PERIN<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1323 126 1734 164\"><span class=\"ocrx_word\" title=\"bbox 1323 128 1402 156\">ISSN<\/span> <span class=\"ocrx_word\" title=\"bbox 1413 126 1536 164\">(print):<\/span> <span class=\"ocrx_word\" title=\"bbox 1546 128 1734 156\">2421-5783<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 1288 179 1734 216\"><span class=\"ocrx_word\" title=\"bbox 1288 181 1368 209\">ISSN<\/span> <span class=\"ocrx_word\" title=\"bbox 1379 179 1434 216\">(on<\/span> <span class=\"ocrx_word\" title=\"bbox 1447 179 1536 216\">line):<\/span> <span class=\"ocrx_word\" title=\"bbox 1546 181 1734 209\">2421-5562<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 548 878 1734 1128\"><span class=\"ocrx_word\" title=\"bbox 548 878 1734 1128\">Rapporto<\/span><\/span>
<span class=\"ocrx_line\" title=\"bbox 805 1151 1734 1358\"><span class=\"ocrx_word\" title=\"bbox 805 1151 1734 1358\">Tecnico<\/span><\/span>
<\/div>
<\/body>
<\/html>"
}

and for MiniOCR:

{
    "id": "ocrdoc-1-stored",
    "ocr_text_stored": "<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id=\"0\" wh=\"1836 2596\">
<b>
<l><w x=\"385 631 566 666\">ISTITUTO<\/w> <w x=\"583 631 621 666\">DI<\/w> <w x=\"639 631 820 666\">RICERCA<\/w> <w x=\"837 631 972 666\">SULLA<\/w> <w x=\"989 631 1190 666\">CRESCITA<\/w> <w x=\"1205 631 1459 666\">ECONOMICA<\/w> <w x=\"1477 631 1738 666\">SOSTENIBILE<\/w> <\/l>
<l><w x=\"451 683 675 718\">RESEARCH<\/w> <w x=\"693 683 903 718\">INSTITUTE<\/w> <w x=\"922 683 980 718\">ON<\/w> <w x=\"999 683 1288 718\">SUSTAINABLE<\/w> <w x=\"1304 683 1528 718\">ECONOMIC<\/w> <w x=\"1546 683 1736 718\">GROWTH<\/w> <\/l>
<l><w x=\"633 1532 1000 1603\">Numero<\/w> <w x=\"1032 1531 1104 1618\">6,<\/w> <w x=\"1140 1528 1486 1622\">maggio<\/w> <w x=\"1515 1531 1740 1603\">2018<\/w> <\/l>
<l><w x=\"1371 1980 1482 2009\">Follow<\/w> <w x=\"1494 1980 1549 2009\">the<\/w> <w x=\"1565 1979 1697 2017\">Byterfly<\/w> <\/l>
<l><w x=\"1226 2041 1287 2070\">and<\/w> <w x=\"1302 2042 1396 2078\">enjoy<\/w> <w x=\"1408 2049 1493 2078\">open<\/w> <w x=\"1508 2041 1695 2078\">knowledge<\/w> <\/l>
<l><w x=\"1082 2155 1293 2183\">GIANCARLO<\/w> <w x=\"1304 2155 1457 2189\">BIRELLO,<\/w> <w x=\"1469 2156 1577 2183\">ANNA<\/w> <w x=\"1590 2156 1698 2183\">PERIN<\/w> <\/l>
<l><w x=\"1323 128 1402 156\">ISSN<\/w> <w x=\"1413 126 1536 164\">(print):<\/w> <w x=\"1546 128 1734 156\">2421-5783<\/w> <\/l>
<l><w x=\"1288 181 1368 209\">ISSN<\/w> <w x=\"1379 179 1434 216\">(on<\/w> <w x=\"1447 179 1536 216\">line):<\/w> <w x=\"1546 181 1734 209\">2421-5562<\/w> <\/l>
<l><w x=\"548 878 1734 1128\">Rapporto<\/w> <\/l>
<l><w x=\"805 1151 1734 1358\">Tecnico<\/w> <\/l>
<\/b>
<\/p>
<\/ocr>"
}
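Since the JSON formatting is one suspected source of trouble, here is a minimal sketch of building such a document programmatically instead of by hand, so that quotes and newlines inside the OCR markup are escaped by the serializer. The field names `id` and `ocr_text_stored` are taken from the samples above; the helper name `build_solr_doc` is hypothetical, and actually sending the payload to Solr is left out.

```python
import json


def build_solr_doc(doc_id: str, ocr_markup: str, field: str = "ocr_text_stored") -> str:
    """Serialize one document whose OCR field carries the markup inline.

    `build_solr_doc` is a hypothetical helper for illustration; only the
    field names come from the samples in this thread.
    """
    # json.dumps escapes double quotes and newlines inside the markup,
    # so the result is a single valid JSON object.
    return json.dumps({"id": doc_id, field: ocr_markup}, ensure_ascii=False)


# A shortened MiniOCR page based on the sample above.
miniocr = (
    "<?xml version='1.0' encoding='UTF-8'?>\n"
    "<ocr>\n"
    '<p xml:id="0" wh="1836 2596">\n'
    "<b>\n"
    '<l><w x="548 878 1734 1128">Rapporto</w> </l>\n'
    "</b>\n"
    "</p>\n"
    "</ocr>"
)

payload = build_solr_doc("ocrdoc-1-stored", miniocr)
print(payload)
```

The `\/` sequences in the samples above are an optional JSON escape; `json.dumps` writes a plain `/`, which decodes to the same string.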

giancarlobi · 2020-11-17T17:51:17Z

This is interesting, since there's no file handling if you only have the OcrCharFilterFactory in your analysis chain. Can you post the full error with its stack trace?

@jbaiter keep in mind that this could be my own mistake, caused by how I post to Solr or by my JSON formatting.
Here is the log:

2020-11-17 16:28:11.115 WARN  (qtp1997859171-22) [   x:archipelago] d.d.s.m.SourcePointer File at <ocr><p id="0" wh="1836 2596"><b><l><w x="385 631 566 666
">ISTITUTO</w> <w x="583 631 621 666">DI</w> <w x="639 631 820 666">RICERCA</w> <w x="837 631 972 666">SULLA</w> <w x="989 631 1190 666">CRESCITA</w> <w x=
"1205 631 1459 666">ECONOMICA</w> <w x="1477 631 1738 666">SOSTENIBILE</w> </l><l><w x="451 683 675 718">RESEARCH</w> <w x="693 683 903 718">INSTITUTE</w>
<w x="922 683 980 718">ON</w> <w x="999 683 1288 718">SUSTAINABLE</w> <w x="1304 683 1528 718">ECONOMIC</w> <w x="1546 683 1736 718">GROWTH</w> </l><l><w x
="633 1532 1000 1603">Numero</w> <w x="1032 1531 1104 1618">6,</w> <w x="1140 1528 1486 1622">maggio</w> <w x="1515 1531 1740 1603">2018</w> </l><l><w x="1
371 1980 1482 2009">Follow</w> <w x="1494 1980 1549 2009">the</w> <w x="1565 1979 1697 2017">Byterfly</w> </l><l><w x="1226 2041 1287 2070">and</w> <w x="1
302 2042 1396 2078">enjoy</w> <w x="1408 2049 1493 2078">open</w> <w x="1508 2041 1695 2078">knowledge</w> </l><l><w x="1082 2155 1293 2183">GIANCARLO</w>
<w x="1304 2155 1457 2189">BIRELLO,</w> <w x="1469 2156 1577 2183">ANNA</w> <w x="1590 2156 1698 2183">PERIN</w> </l><l><w x="1323 128 1402 156">ISSN</w> <
w x="1413 126 1536 164">(print):</w> <w x="1546 128 1734 156">2421-5783</w> </l><l><w x="1288 181 1368 209">ISSN</w> <w x="1379 179 1434 216">(on</w> <w x=
"1447 179 1536 216">line):</w> <w x="1546 181 1734 209">2421-5562</w> </l><l><w x="548 878 1734 1128">Rapporto</w> </l><l><w x="805 1151 1734 1358">Tecnico
</w> </l></b></p></ocr> does not exist, skipping.
2020-11-17 16:28:11.117 ERROR (qtp1997859171-22) [   x:archipelago] o.a.s.h.RequestHandlerBase java.lang.RuntimeException: Could not read file at '<ocr><p
id="0" wh="1836 2596"><b><l><w x="385 631 566 666">ISTITUTO</w> <w x="583 631 621 666">DI</w> <w x="639 631 820 666">RICERCA</w> <w x="837 631 972 666">SUL
LA</w> <w x="989 631 1190 666">CRESCITA</w> <w x="1205 631 1459 666">ECONOMICA</w> <w x="1477 631 1738 666">SOSTENIBILE</w> </l><l><w x="451 683 675 718">R
ESEARCH</w> <w x="693 683 903 718">INSTITUTE</w> <w x="922 683 980 718">ON</w> <w x="999 683 1288 718">SUSTAINABLE</w> <w x="1304 683 1528 718">ECONOMIC</w
> <w x="1546 683 1736 718">GROWTH</w> </l><l><w x="633 1532 1000 1603">Numero</w> <w x="1032 1531 1104 1618">6,</w> <w x="1140 1528 1486 1622">maggio</w> <
w x="1515 1531 1740 1603">2018</w> </l><l><w x="1371 1980 1482 2009">Follow</w> <w x="1494 1980 1549 2009">the</w> <w x="1565 1979 1697 2017">Byterfly</w>
</l><l><w x="1226 2041 1287 2070">and</w> <w x="1302 2042 1396 2078">enjoy</w> <w x="1408 2049 1493 2078">open</w> <w x="1508 2041 1695 2078">knowledge</w>
 </l><l><w x="1082 2155 1293 2183">GIANCARLO</w> <w x="1304 2155 1457 2189">BIRELLO,</w> <w x="1469 2156 1577 2183">ANNA</w> <w x="1590 2156 1698 2183">PER
IN</w> </l><l><w x="1323 128 1402 156">ISSN</w> <w x="1413 126 1536 164">(print):</w> <w x="1546 128 1734 156">2421-5783</w> </l><l><w x="1288 181 1368 209
">ISSN</w> <w x="1379 179 1434 216">(on</w> <w x="1447 179 1536 216">line):</w> <w x="1546 181 1734 209">2421-5562</w> </l><l><w x="548 878 1734 1128">Rapp
orto</w> </l><l><w x="805 1151 1734 1358">Tecnico</w> </l></b></p></ocr>', cannot index document.
        at de.digitalcollections.solrocr.model.SourcePointer.lambda$parse$2(SourcePointer.java:104)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
        at de.digitalcollections.solrocr.model.SourcePointer.parse(SourcePointer.java:106)
        at de.digitalcollections.solrocr.lucene.OcrHighlighter.loadOcrFieldValues(OcrHighlighter.java:374)
        at de.digitalcollections.solrocr.lucene.OcrHighlighter.highlightOcrFields(OcrHighlighter.java:249)
        at de.digitalcollections.solrocr.solr.SolrOcrHighlighter.doHighlighting(SolrOcrHighlighter.java:48)
        at de.digitalcollections.solrocr.solr.OcrHighlightComponent.process(OcrHighlightComponent.java:76)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:360)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:214)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2627)
        at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:795)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:568)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1610)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1300)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1580)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1215)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
        at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:500)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:273)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:129)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:375)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
        at java.lang.Thread.run(Thread.java:748)

jbaiter · 2020-11-17T18:01:47Z

Yep, that's a bug in my "Is this a source pointer" regular expression :-) Will be fixed, thank you!

giancarlobi · 2020-11-17T18:02:11Z

@jbaiter one last (for today...) question: do you have a tool to convert from hOCR (e.g. as produced by djvu2hocr) to MiniOCR?
I wrote this fairly simple PHP script this morning; it still needs some polish:

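<?php
// Convert a single-page hOCR file to MiniOCR.
// -i: input hOCR file, -p: value used for the page's xml:id attribute
$val = getopt("i:p:");

$xml = simplexml_load_file($val['i']);

echo "<?xml version='1.0' encoding='UTF-8'?>" . "\n";
echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page) {
  // Strip the leading "bbox " from the page title; entries 2 and 3 are then used as width and height.
  $coos = explode(" ", substr($page['title'], 5));
  echo '<p xml:id="' . $val['p'] . '" wh="' . $coos[2] . " " . $coos[3] . '">' . "\n";
  echo '<b>' . "\n";
  foreach ($page->children() as $line) {
    echo '<l>';
    foreach ($line->children() as $word) {
      // The word title has the form "bbox n n n n"; index 0 is the literal "bbox".
      $wcoos = explode(" ", $word['title']);
      // Escape the word text so that characters such as & or < keep the output well-formed.
      $text = htmlspecialchars((string) $word, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
      echo '<w x="' . $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4] . '">' . $text . '</w> ';
    }
    echo '</l>' . "\n";
  }
  echo '</b>' . "\n";
  echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?>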

Call it with: php hocr2miniocr.php -i page-1.html -p page-1 > page-1.xml

giancarlobi · 2020-11-17T18:05:33Z

Yep, that's a bug in my "Is this a source pointer" regular expression :-) Will be fixed, thank you!

Thanks to you for your work with this great plugin !!

giancarlobi · 2020-11-20T14:41:47Z

@jbaiter A question: is there any reason why, when inline MiniOCR uses decimal values (e.g. 0.012) instead of integers, the select query with highlighting returns empty snippets? Do I need a different format for float values in MiniOCR?
Thanks!

jbaiter · 2020-11-20T14:44:07Z

This might be a bug, do you have a sample of a MiniOCR line with relative coordinates?

giancarlobi · 2020-11-20T14:48:37Z

I just looked at https://github.com/dbmdz/solr-ocrhighlighting/blob/main/src/test/resources/data/miniocr.xml and noticed that the float numbers there start with . and not 0., so I'll try that now and report back here. Thanks.

giancarlobi · 2020-11-20T14:52:33Z

@jbaiter it works! The float numbers need to start with . and not 0.
Have a nice weekend and thanks again.
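For anyone generating relative coordinates, a minimal sketch of formatting them with a leading `.` instead of `0.`, as found above. The function names `rel_coord` and `rel_box` are hypothetical, and the four-digit precision is an arbitrary choice, not something the plugin prescribes.

```python
def rel_coord(value: float, total: float, digits: int = 4) -> str:
    """Format value/total as a relative coordinate without the leading zero."""
    text = f"{value / total:.{digits}f}"  # e.g. "0.0120"
    # Only values that still read "0.xxxx" after rounding can drop the zero.
    if not text.startswith("0."):
        raise ValueError(f"{value}/{total} is not in the range [0, 1)")
    return text[1:]  # ".0120"


def rel_box(box, page_w: float, page_h: float, digits: int = 4) -> str:
    """Scale four pixel values to relative ones.

    The first and third numbers are divided by the page width, the second
    and fourth by the page height.
    """
    totals = (page_w, page_h, page_w, page_h)
    return " ".join(rel_coord(v, t, digits) for v, t in zip(box, totals))


print(rel_coord(12, 1000))  # .0120
print(rel_box((385, 631, 566, 666), 1836, 2596))
```

The second call uses the first word box and the page size from the sample page earlier in this thread. A value equal to the full page width or height cannot be written with a leading dot, so the sketch raises an error for it rather than guessing.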

jbaiter · 2020-11-20T14:57:00Z

Oh, yeah, that is not mentioned in the documentation yet, sorry! Will be fixed asap, thanks for pointing it out.

DiegoPino · 2020-11-20T18:09:27Z

@jbaiter thanks. Saw your commit: c5a9b48#diff-5cd38c8f72f24090d0841363df5244e806e856688d0d0c520adb643e93b1dbb8R73

Seeing it there now. We are almost there with a concrete implementation and will be sharing next week when tested. Already said this but your work is great and inspiring. Thanks again

jbaiter added the enhancement New feature or request label Jun 18, 2019

jbaiter mentioned this issue Sep 12, 2019

Multi-threaded highlighting #64

Closed

DiegoPino mentioned this issue Nov 19, 2020

HOCR my old friend: enable full HOCR pipeline for IAbookreader esmero/format_strawberryfield#105

Open

giancarlobi mentioned this issue Nov 24, 2020

ISSUE-3: OCR specific Processor and new features/processing option esmero/strawberry_runners#11

Merged

beatrycze-volk mentioned this issue Apr 14, 2021

Use Solr OCR Highlighting Plugin in Search in Document Plugin kitodo/kitodo-presentation#587

Merged


wrznr mentioned this issue Apr 19, 2021

Solr highlighting doesn't work with certain search terms. kitodo/kitodo-presentation#502

Closed

jbaiter mentioned this issue May 23, 2024

I/O Stack Simplification and Optimization #430

Merged
