Updated test GT for legacy

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
DS4SD · Dec 17, 2024 · dca32bf
1 parent 6d38c7c
Showing 5 changed files with 24 additions and 27 deletions.
diff --git a/poetry.lock b/poetry.lock
diff --git a/pyproject.toml b/pyproject.toml
@@ -25,7 +25,7 @@ packages = [{include = "docling"}]
 # actual dependencies:
 ######################
 python = "^3.9"
-docling-core = { version = "^2.12.0", extras = ["chunking"] }
+docling-core = { version = "^2.12.1", extras = ["chunking"] }
 pydantic = "^2.0.0"
 docling-ibm-models = "^3.1.0"
 deepsearch-glm = "^1.0.0"

diff --git a/tests/data/groundtruth/docling_v1/2203.01017v2.md b/tests/data/groundtruth/docling_v1/2203.01017v2.md
@@ -12,7 +12,6 @@
 
 The occurrence of tables in documents is ubiquitous. They often summarise quantitative or factual data, which is cumbersome to describe in verbose text but nevertheless extremely valuable. Unfortunately, this compact representation is often not easy to parse by machines. There are many implicit conventions used to obtain a compact table representation. For example, tables often have complex columnand row-headers in order to reduce duplicated cell content. Lines of different shapes and sizes are leveraged to separate content or indicate a tree structure. Additionally, tables can also have empty/missing table-entries or multi-row textual table-entries. Fig. 1 shows a table which presents all these issues.
 
-
 <!-- image -->
 
 Tables organize valuable content in a concise and compact representation. This content is extremely valuable for systems such as search engines, Knowledge Graph's, etc, since they enhance their predictive capabilities. Unfortunately, tables come in a large variety of shapes and sizes. Furthermore, they can have complex column/row-header configurations, multiline rows, different variety of separation lines, missing entries, etc. As such, the correct identification of the table-structure from an image is a nontrivial task. In this paper, we present a new table-structure identification model. The latter improves the latest end-toend deep learning model (i.e. encoder-dual-decoder from PubTabNet) in two significant ways. First, we introduce a new object detection decoder for table-cells. In this way, we can obtain the content of the table-cells from programmatic PDF's directly from the PDF source and avoid the training of the custom OCR decoders. This architectural change leads to more accurate table-content extraction and allows us to tackle non-english tables. Second, we replace the LSTM decoders with transformer based decoders. This upgrade improves significantly the previous state-of-the-art tree-editing-distance-score (TEDS) from 91% to 98.5% on simple tables and from 88.7% to 95% on complex tables.
@@ -21,12 +20,10 @@ Tables organize valuable content in a concise and compact representation. This c
 
 - b. Red-annotation of bounding boxes, Blue-predictions by TableFormer
 
-
 <!-- image -->
 
 - c. Structure predicted by TableFormer:
 
-
 <!-- image -->
 
 Figure 1: Picture of a table with subtle, complex features such as (1) multi-column headers, (2) cell with multi-row text and (3) cells with no content. Image from PubTabNet evaluation set, filename: 'PMC2944238 004 02'.
@@ -234,14 +231,11 @@ Table 4: Results of structure with content retrieved using cell detection on Pub
 
 ## Example table from FinTabNet:
 
-
 <!-- image -->
 
 b. Structure predicted by TableFormer, with superimposed matched PDF cell text:
 <!-- image -->
 
-
-
 |                                                    |             | 論文ファイル   | 論文ファイル   | 参考文献   | 参考文献   |
 |----------------------------------------------------|-------------|----------------|----------------|------------|------------|
 | 出典                                               | ファイル 数 | 英語           | 日本語         | 英語       | 日本語     |
@@ -268,7 +262,6 @@ Text is aligned to match original for ease of viewing
 Figure 5: One of the benefits of TableFormer is that it is language agnostic, as an example, the left part of the illustration demonstrates TableFormer predictions on previously unseen language (Japanese). Additionally, we see that TableFormer is robust to variability in style and content, right side of the illustration shows the example of the TableFormer prediction from the FinTabNet dataset.
 <!-- image -->
 
-
 <!-- image -->
 
 Figure 6: An example of TableFormer predictions (bounding boxes and structure) from generated SynthTabNet table.
@@ -451,12 +444,22 @@ phan cell.
 
 Aditional images with examples of TableFormer predictions and post-processing can be found below.
 
+Figure 8: Example of a table with multi-line header.
+
+
 
 <!-- image -->
 
+Figure 9: Example of a table with big empty distance between cells.
+
+
 
 <!-- image -->
 
+Figure 10: Example of a complex table with empty cells.
+
+
+
 Figure 11: Simple table with different style and empty cells.
 <!-- image -->
 
@@ -466,23 +469,29 @@ Figure 12: Simple table predictions and post processing.
 Figure 13: Table predictions example on colorful table.
 <!-- image -->
 
+Figure 14: Example with multi-line text.
 
-<!-- image -->
 
 
 <!-- image -->
 
-
 <!-- image -->
 
+<!-- image -->
 
 <!-- image -->
 
+Figure 15: Example with triangular table.
 
-<!-- image -->
 
 
 <!-- image -->
 
+<!-- image -->
+
+Figure 16: Example of how post-processing helps to restore mis-aligned bounding boxes prediction artifact.
+
+
+
 Figure 17: Example of long table. End-to-end example from initial PDF cells to prediction of bounding boxes, post processing and prediction of structure.
 <!-- image -->
diff --git a/tests/data/groundtruth/docling_v1/2206.01062.md b/tests/data/groundtruth/docling_v1/2206.01062.md
@@ -220,8 +220,6 @@ One of the fundamental questions related to any dataset is if it is "large enoug
 
 The choice and number of labels can have a significant effect on the overall model performance. Since PubLayNet, DocBank and DocLayNet all have different label sets, it is of particular interest to understand and quantify this influence of the label set on the model performance. We investigate this by either down-mapping labels into more common ones (e.g. Caption → Text ) or excluding them from the annotations entirely. Furthermore, it must be stressed that all mappings and exclusions were performed on the data before model training. In Table 3, we present the mAP scores for a Mask R-CNN R50 network on different label sets. Where a label is down-mapped, we show its corresponding label, otherwise it was excluded. We present three different label sets, with 6, 5 and 4 different labels respectively. The set of 5 labels contains the same labels as PubLayNet. However, due to the different definition of
 
-
-
 | Class-count    | 11   | 11   | 5   | 5    |
 |----------------|------|------|-----|------|
 | Split          | Doc  | Page | Doc | Page |

diff --git a/tests/data/groundtruth/docling_v1/redp5110_sampled.md b/tests/data/groundtruth/docling_v1/redp5110_sampled.md
@@ -1,14 +1,11 @@
 Front cover
 
-
 <!-- image -->
 
 ## Row and Column Access Control Support in IBM DB2 for i
 
-
 <!-- image -->
 
-
 <!-- image -->
 
 ## Contents
@@ -17,7 +14,6 @@ DB2 for i Center of Excellence
 
 Solution Brief IBM Systems Lab Services and Training
 
-
 <!-- image -->
 
 ## Highlights
@@ -30,7 +26,6 @@ Solution Brief IBM Systems Lab Services and Training
 
 - GLYPH<g115>GLYPH<g3> GLYPH<g55> GLYPH<g68>GLYPH<g78>GLYPH<g72>GLYPH<g3> GLYPH<g68>GLYPH<g71>GLYPH<g89>GLYPH<g68>GLYPH<g81>GLYPH<g87>GLYPH<g68>GLYPH<g74>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g68>GLYPH<g70>GLYPH<g70>GLYPH<g72>GLYPH<g86>GLYPH<g86>GLYPH<g3> GLYPH<g87>GLYPH<g82>GLYPH<g3> GLYPH<g68> GLYPH<g3> GLYPH<g90>GLYPH<g82>GLYPH<g85>GLYPH<g79>GLYPH<g71>GLYPH<g90>GLYPH<g76>GLYPH<g71>GLYPH<g72>GLYPH<g3> GLYPH<g86>GLYPH<g82>GLYPH<g88>GLYPH<g85>GLYPH<g70>GLYPH<g72>GLYPH<g3> GLYPH<g82>GLYPH<g73>GLYPH<g3> GLYPH<g72>GLYPH<g91>GLYPH<g83>GLYPH<g72>GLYPH<g85>GLYPH<g87>GLYPH<g76>GLYPH<g86>GLYPH<g72>
 
-
 <!-- image -->
 
 Power Services
@@ -77,10 +72,8 @@ This paper is intended for database engineers, data-centric application develope
 
 This paper was produced by the IBM DB2 for i Center of Excellence team in partnership with the International Technical Support Organization (ITSO), Rochester, Minnesota US.
 
-
 <!-- image -->
 
-
 <!-- image -->
 
 Jim Bainbridge is a senior DB2 consultant on the DB2 for i Center of Excellence team in the IBM Lab Services and Training organization. His primary role is training and implementation services for IBM DB2 Web Query for i and business analytics. Jim began his career with IBM 30 years ago in the IBM Rochester Development Lab, where he developed cooperative processing products that paired IBM PCs with IBM S/36 and AS/.400 systems. In the years since, Jim has held numerous technical roles, including independent software vendors technical support on a broad range of IBM technologies and products, and supporting customers in the IBM Executive Briefing Center and IBM Project Office.
@@ -89,7 +82,6 @@ Hernando Bedoya is a Senior IT Specialist at STG Lab Services and Training in Ro
 
 ## Authors
 
-
 <!-- image -->
 
 Chapter 1.
@@ -383,10 +375,8 @@ This IBM Redpaper publication provides information about the IBM i 7.2 feature o
 
 This paper is intended for database engineers, data-centric application developers, and security officers who want to design and implement RCAC as a part of their data control and governance policy. A solid background in IBM i object level security, DB2 for i relational database concepts, and SQL is assumed.
 
-
 <!-- image -->
 
-
 <!-- image -->
 
 INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION