When Pages are all displayed inside a <Document>, pages after first can have a scaled up scaleX text layer (only found on some documents) #1848

jkrubin · 2024-07-27T23:34:01Z

Before you start - checklist

I followed instructions in documentation written for my React-PDF version
I have checked if this bug is not already reported
I have checked if an issue is not listed in Known issues
If I have a problem with PDF rendering, I checked if my PDF renders properly in PDF.js demo

Description

I am using react-pdf and need the text layer for highlighting purposes. I wanted to swap over from the "single page" to "all page" recipe shown here
https://github.com/wojtekmaj/react-pdf/wiki/Recipes

This code seems to work fine on the surface, but I found that for some PDFs, displaying all pages caused the text layer on pages after the first to have some spans with very wrong scaleX transforms applied (far larger than intended). When displayed in single-page format, all of the spans on the offending pages have the expected width.

I tested this with versions 9.0 and 9.1 and with different workers, and the bug always appears.

Steps to reproduce

I made a minimal reproducible example below:

import { Document, Page, pdfjs } from "react-pdf";
import "./App.css";
import { useState } from "react";
import "react-pdf/dist/esm/Page/AnnotationLayer.css";
import "react-pdf/dist/esm/Page/TextLayer.css";

// import pdfWorker from "./assets/pdf.worker.min.mjs?url";
// pdfjs.GlobalWorkerOptions.workerSrc = pdfWorker;

pdfjs.GlobalWorkerOptions.workerSrc = new URL(
  "pdfjs-dist/build/pdf.worker.min.mjs",
  import.meta.url
).toString();

import pdf from "./data/apl_23_003.pdf";

function App() {
  const [numPages, setNumPages] = useState<number>();
  const [pageNumber, setPageNumber] = useState<number>(1);

  function onDocumentLoadSuccess({ numPages }: { numPages: number }): void {
    setNumPages(numPages);
  }

  function onDocumentLoadError(error: Error): void {
    console.error("Failed to load PDF document:", error);
  }

  return (
    <>
      <div>
        <button onClick={() => setPageNumber((prev) => Math.max(prev - 1, 1))}>
          Previous
        </button>
        <button
          onClick={() =>
            setPageNumber((prev) =>
              numPages && prev < numPages ? prev + 1 : prev
            )
          }
        >
          Next
        </button>
      </div>
      <Document
        file={pdf}
        onLoadSuccess={onDocumentLoadSuccess}
        onLoadError={onDocumentLoadError}
      >
        <Page pageNumber={pageNumber} />
        {Array.from(new Array(numPages), (el, index) => (
          <Page key={`page_${index + 1}`} pageNumber={index + 1} />
        ))}
      </Document>
    </>
  );
}

export default App;

As stated above, this doesn't happen with every PDF, and while the best examples are non-public PDFs, I found a public document where you can see this bug from page 3 onwards (attached).

apl_23_003.pdf

Expected behavior

I expect the text layer to fit over the text exactly, like it does when displayed like this

      <Document
        file={pdf}
        onLoadSuccess={onDocumentLoadSuccess}
        onLoadError={onDocumentLoadError}
      >
        <Page pageNumber={pageNumber} />
      </Document>

Actual behavior

The text layer on pages after page 1 is rendered with a larger scaleX transform when displayed like this:

      <Document
        file={pdf}
        onLoadSuccess={onDocumentLoadSuccess}
        onLoadError={onDocumentLoadError}
      >
        <Page pageNumber={pageNumber} />
        {Array.from(new Array(numPages), (el, index) => (
          <Page key={`page_${index + 1}`} pageNumber={index + 1} />
        ))}
      </Document>

Additional information

This bug does not occur on the first page displayed.
For example, if page 4 is the one that gets stretched and I display page 4 ten times in a row, the first copy will be normal and the subsequent copies will be stretched.

This bug does not affect all PDFs, only a few that I have found.
While debugging I noticed that the PDF uses some encoding that is not supported by my VSCode. I can still open the file, and it says "this document contains many invisible unicode characters".
This may contribute to some parsing error, but I don't know why that would occur only past the first page.

Environment

Browser (if applicable):
React-PDF version: 9.0 & 9.1
React version: 18.3.1
Bundler name and version (if applicable): vite


jasoncardinale · 2024-07-30T13:15:59Z

I am experiencing a similar issue. For a given page, the vast majority of the spans that make up the text layer have no transform and appear to be positioned correctly (no disjoint overlapping). However, some spans have a transformation applied along the x-axis, something along the lines of transform: scaleX(n) where n is a value close to 1. When the page state updates (say I want to highlight this text using <mark>, so I modify the text in a customTextRenderer), this transform suddenly jumps to a larger or smaller value (n starts to approach 0 or 2). This only happens for very specific words or lines within a PDF, and in most cases I don't see any issue with a large transformation.

As a temporary solution, I am following the advice mentioned here: #332 (comment).

However, I found that the transformation is applied to the nested line spans and not to react-pdf__Page__textContent, so I do this instead:

const removeTextLayerOffset = () => {
  const spans = document.querySelectorAll("span[role='presentation']")
  spans.forEach((span) => {
    const { style } = span as HTMLElement
    style.transform = ''
  })
}

And then use the function here

<Page ... onRenderTextLayerSuccess={removeTextLayerOffset} />

This removes all the transformations and for the most part yields good results. However, as mentioned before, some of the lines already had a small transformation applied to them, so when we remove it the text does not overlap perfectly (though the difference is not nearly as drastic as with the worst offenders).

This solution seems to work well enough for all the PDFs that I have tried it with so far, but I don't find it satisfying. pdfjs is clearly doing some calculation under the hood to determine these transformations based on font size, screen width, etc., so just removing them is definitely a workaround.

See https://github.com/mozilla/pdf.js/blob/300e806efe7e6438e0b37d8eeb1a97d9e5d27daa/src/display/text_layer.js#L419 for how this transformation is calculated. My best guess is that width in this case is off because the width of the text in the line cannot be calculated properly. It may have something to do with unrecognized fonts whose spacing and character widths are not supported by pdfjs.
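
To find which spans are affected before blanking every transform, the scaleX factor can be read back from each span's inline style and compared against 1. The following is a minimal sketch of that check as two pure functions. The names parseScaleX and isOutlier and the 0.25 tolerance are hypothetical choices for illustration and are not part of react-pdf or pdfjs.

```typescript
// Extract the numeric factor from an inline transform such as "scaleX(1.02) rotate(0deg)".
// Returns null when the string contains no scaleX(...) term.
function parseScaleX(transform: string): number | null {
  const match = /scaleX\(\s*(-?\d*\.?\d+(?:e[-+]?\d+)?)\s*\)/i.exec(transform);
  return match ? parseFloat(match[1]) : null;
}

// Flag factors that sit far from 1. The tolerance is an arbitrary, illustrative threshold.
function isOutlier(factor: number, tolerance: number = 0.25): boolean {
  return Math.abs(factor - 1) > tolerance;
}

// Example inputs shaped like the inline styles described in this thread.
const nearOne = parseScaleX("scaleX(1.02) rotate(0deg)");
const stretched = parseScaleX("scaleX(1.9)");
const none = parseScaleX("");
```

In a browser, the same functions could be applied to (span as HTMLElement).style.transform for each span[role='presentation'], clearing the transform only on the spans that isOutlier flags.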

jkrubin · 2024-08-14T06:01:18Z

I don't believe this is an issue with pdfjs. I uploaded my PDF to the pdfjs demo
https://mozilla.github.io/pdf.js/web/viewer.html
and the issue did not occur. I can also display the page normally in paginated mode, so the issue has something to do with subsequent pages.

I believe this needs a fix in react-pdf.

szl1993 · 2024-08-29T02:44:05Z

I found the reason.
See https://github.com/mozilla/pdf.js/blob/master/src/display/text_layer.js
The pdf.js library uses the canvas measureText API to calculate the actual display width of <span/> elements, and it uses a static canvas as a performance optimization.
I logged the measurement information and found that when the issue occurred, the font property of the canvas context (ctx.font) did not match the expected data.

      if (prevFontSize !== fontSize || prevFontFamily !== fontFamily) {
        console.log("---------ctx font--------");
        console.log("textContent:", div.textContent);
        console.log("pageIndex", this);
        console.log("prevFontSize:", prevFontSize);
        console.log("fontSize:", fontSize);
        console.log("this.#scale", this.#scale);
        console.log("fontSize * this.#scale:", fontSize * this.#scale);
        console.log("-------------------------");
        ctx.font = `${fontSize * this.#scale}px ${fontFamily}`;
        params.prevFontSize = fontSize;
        params.prevFontFamily = fontFamily;
      }

      // Only measure the width for multi-char text divs, see `appendText`.
      const { width } = ctx.measureText(div.textContent);

      if (width > 0) {
        transform = `scaleX(${(canvasWidth * this.#scale) / width}) ${transform}`;
      }

      if (
        div.textContent ===
        "scale x show error text"
      ) {
        console.log("----------measureText------------");
        console.log("transform:", transform);
        console.log("width:", width);
        console.log("ctx.font:", ctx.font);
        console.log("fontSize:", fontSize);
        console.log("fontFamily:", fontFamily);
        console.log("oldPrevFontSize:", oldPrevFontSize);
        console.log("oldPrevFontFamily:", oldPrevFontFamily);
        console.log("this.#scale:", this.#scale);
        console.log("canvasWidth:", canvasWidth);
      }
    }

So the cause is that the <Page/> components render in parallel, which leaves the shared canvas with the wrong font.
My solution is to check whether the current page is being displayed and, if it is not, to skip rendering its TextLayer.
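
The interleaving described above can be reproduced with a toy model. This is a sketch under stated assumptions, not pdf.js code: FakeContext, computeScaleX and LayerParams are hypothetical names, and the fake measureText simply scales width linearly with font size. The model assumes that every text layer keeps its own prevFontSize cache while all layers share one measuring context, which is what the logging above suggests.

```typescript
// Toy stand-in for a shared canvas 2D context: width grows linearly with font size.
class FakeContext {
  fontSize = 0;
  measureText(text: string): { width: number } {
    return { width: text.length * this.fontSize * 0.5 };
  }
}

// One context shared by every text layer, mirroring the static canvas described above.
const sharedCtx = new FakeContext();

// Per-layer cache of the last font size this layer believes it set on the context.
interface LayerParams {
  prevFontSize: number | null;
}

function computeScaleX(
  params: LayerParams,
  text: string,
  fontSize: number,
  expectedWidth: number,
): number {
  // The context is only updated when this layer's own cache says the font changed.
  if (params.prevFontSize !== fontSize) {
    sharedCtx.fontSize = fontSize;
    params.prevFontSize = fontSize;
  }
  const { width } = sharedCtx.measureText(text);
  return width > 0 ? expectedWidth / width : 1;
}

const pageA: LayerParams = { prevFontSize: null };
const pageB: LayerParams = { prevFontSize: null };

// Page A measures a 20px span: the context is set to 20px, so the factor is correct.
const first = computeScaleX(pageA, "text", 20, 40);
// Page B interleaves with a 5px span and overwrites the shared context font.
const interleaved = computeScaleX(pageB, "text", 5, 10);
// Page A continues at 20px. Its cache says nothing changed, so it measures with
// the stale 5px font left by page B and computes an inflated factor.
const stale = computeScaleX(pageA, "text", 20, 40);
```

In this model the first page to touch the context always measures correctly, which is consistent with the observation that the first page displayed is never stretched.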

jkrubin · 2024-09-02T20:53:10Z

I implemented a Mutex approach like this and it solved all of the width issues

import React, { useEffect, useRef, useState } from "react";
import { Page } from "react-pdf";
// Assumption: Mutex comes from the async-mutex package, whose acquire()
// resolves to a release function. The original snippet did not show its imports.
import { Mutex } from "async-mutex";

type PageProps = {
  pageNumber: number;
  pageLoadLock: Mutex;
  scale: number;
};

export const PageWrapper: React.FC<PageProps> = ({
  pageNumber,
  pageLoadLock,
  scale,
}) => {
  const [readyToLoadTextLayer, setReadyToLoadTextLayer] = useState<boolean>(false);
  const releaseRef = useRef<(() => void) | null>(null);

  useEffect(() => {
    let cancelled = false;

    const acquireLock = async () => {
      console.log(`page ${pageNumber} waiting for mutex`);
      const releaseLock = await pageLoadLock.acquire();

      // If the component unmounted while waiting, hand the lock back immediately
      if (cancelled) {
        releaseLock();
        return;
      }

      // Store the release function in a ref
      releaseRef.current = releaseLock;

      setReadyToLoadTextLayer(true);
      console.log(`page ${pageNumber} acquired`);
    };

    acquireLock();

    // Cleanup to release lock if the component unmounts before the lock is released
    return () => {
      cancelled = true;
      if (releaseRef.current) {
        releaseRef.current();
        releaseRef.current = null;
      }
    };
  }, [pageNumber, pageLoadLock]);

  // Release the lock once this page's text layer has rendered
  const handleTextLayerLoad = () => {
    console.log(`page ${pageNumber} loaded text layer`);
    if (releaseRef.current) {
      releaseRef.current();
      releaseRef.current = null; // Clear the ref after releasing
    }
    console.log(`page ${pageNumber} has released lock`);
  };

  return (
    <Page
      pageNumber={pageNumber}
      scale={scale}
      renderTextLayer={readyToLoadTextLayer}
      onRenderTextLayerSuccess={handleTextLayerLoad}
      onLoadError={(error) => console.error(error)}
    />
  );
};

Wanted to flag this to @wojtekmaj: could a fix for this race condition be included in the react-pdf lib?

trey-trimble-posh · 2024-10-03T13:36:45Z

Would this also cause the wrong scaleX transform on bolded text? I'm seeing that happen on 9.1: the text is incorrectly scaled way up.

anton-mauritzson · 2024-10-31T11:16:55Z

Is there any solution for this?

jkrubin · 2024-10-31T17:31:50Z

Manually rendering all of the pages sequentially with a mutex worked for me. I don't think this is a good solution though, so my ultimate solution was to just use pdfjs. This package was not built to show a full-length PDF, so I wouldn't use it to do that.
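
The idea of rendering sequentially can also be shown without any locking library. The following is a minimal sketch of a promise-chain queue in which each task starts only after the previous one has settled. createSerialQueue and fakeRenderTextLayer are hypothetical names for illustration; the latter is a timer-based stand-in for rendering one page's text layer and does not call react-pdf or pdfjs.

```typescript
// Minimal promise-chain lock: each task starts only after the previous one settles.
function createSerialQueue() {
  let tail: Promise<void> = Promise.resolve();
  return function enqueue<T>(task: () => Promise<T>): Promise<T> {
    const result = tail.then(task);
    // Keep the chain alive even if a task rejects.
    tail = result.then(
      () => undefined,
      () => undefined,
    );
    return result;
  };
}

// Records the order in which the fake render steps start and finish.
const order: string[] = [];

// Hypothetical stand-in for "render the text layer of page n".
async function fakeRenderTextLayer(page: number, delayMs: number): Promise<void> {
  order.push(`start ${page}`);
  await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  order.push(`end ${page}`);
}

const enqueue = createSerialQueue();

// Page 1 is the slowest, yet pages 2 and 3 do not start until it has finished.
const done = Promise.all([
  enqueue(() => fakeRenderTextLayer(1, 20)),
  enqueue(() => fakeRenderTextLayer(2, 5)),
  enqueue(() => fakeRenderTextLayer(3, 1)),
]);
```

The trade-off is the one noted in this thread: serializing the text layers avoids the shared-state race, but pages can no longer render their text layers in parallel.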

jkrubin added the bug label Jul 27, 2024

jkrubin mentioned this issue Aug 7, 2024

[Bug]: Issue with width of unsupported characters causing text layer to be very wide mozilla/pdf.js#18576

Closed

obecker mentioned this issue Sep 22, 2024

Text layer may contain overlapping areas (react-pdf 9.0.0) #1828

Closed
