Numerals read as \u0000 when using font feature settings #523

Open
SimonEggert opened this issue Sep 21, 2023 · 1 comment

First of all, thanks for the work and effort you've put into this great library!

Bug description

We are having an issue with numerals not being read correctly by PDF::Inspector::Text.analyze. They get misinterpreted as \u0000 when we use font-feature-settings: 'tnum' as style. We are generating the PDF with Gotenberg from HTML templates.

Minimal reproducible example

<div>21.09.2023</div> gets read as 21.09.2023

while

<div style="font-feature-settings: 'tnum'">21.09.2023</div> gets read as \u0000\u0000.\u0000\u0000.\u0000\u0000\u0000\u0000.

PDFs

Here are two PDFs, one with the feature turned off and one with the feature turned on:
font_features_off.pdf
font_features_on.pdf

Further information

The UNIX tool pdftotext is able to read both versions correctly so I think the PDF is alright.
The font in use is Barlow if that makes any difference.

Any help would be appreciated!

P.S.: I'll also open an issue regarding this problem over at https://github.com/prawnpdf/pdf-inspector so feel free to close this one if you think it should be handled there.

yob (Owner) commented Dec 27, 2023

Thanks for the clear report and simple test files.

Looking at font_features_on.pdf, it has a ToUnicode CMap that maps the glyph code of each digit to the Unicode codepoint \u0000 (only the period, <00B4>, maps to a real codepoint, <002E>), and we're honoring it:

$ ruby -Ilib bin/pdf_object font_features_on.pdf 20
{:Filter=>:FlateDecode, :Length=>245}
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<<  /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
6 beginbfchar
<008E> <0000>
<008F> <0000>
<0090> <0000>
<0091> <0000>
<0097> <0000>
<00B4> <002E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
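
To make the effect of that mapping concrete, here is a minimal, self-contained sketch that parses the bfchar entries shown above and decodes the glyph codes the page uses. It is not pdf-reader's implementation: the names parse_bfchar, decode, CMAP_TEXT and GLYPH_CODES are hypothetical, and the parser assumes a single bfchar section with no bfrange entries.

```ruby
# Illustrative sketch only; not how pdf-reader parses CMaps internally.
# The bfchar entries are copied from the ToUnicode CMap dump above.
CMAP_TEXT = <<~CMAP
  6 beginbfchar
  <008E> <0000>
  <008F> <0000>
  <0090> <0000>
  <0091> <0000>
  <0097> <0000>
  <00B4> <002E>
  endbfchar
CMAP

# Returns a Hash of glyph code (Integer) => decoded UTF-8 String.
# Assumption: one bfchar section, destinations are UTF-16BE hex strings.
def parse_bfchar(cmap_text)
  section = cmap_text[/beginbfchar(.*?)endbfchar/m, 1].to_s
  section.scan(/<(\h+)>\s*<(\h+)>/).to_h do |src, dst|
    [src.hex, [dst].pack("H*").force_encoding("UTF-16BE").encode("UTF-8")]
  end
end

# Decodes a list of glyph codes, substituting U+FFFD for unmapped codes.
def decode(codes, mapping)
  codes.map { |code| mapping.fetch(code, "\uFFFD") }.join
end

MAPPING = parse_bfchar(CMAP_TEXT)

# Glyph codes in the order they appear on the page (the date 21.09.2023).
GLYPH_CODES = [0x0090, 0x008F, 0x00B4, 0x008E, 0x0097,
               0x00B4, 0x0090, 0x008E, 0x0090, 0x0091].freeze

puts decode(GLYPH_CODES, MAPPING).inspect
```

Every digit code resolves to U+0000 and only <00B4> resolves to a period, which matches the string reported in the issue.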

However, the content stream is using the optional "marked content" operators (BDC, EMC) and I can see the real characters in there as literal strings ((2), (1), (9), etc):

$ ruby -Ilib bin/pdf_object font_features_on.pdf 5
{:Filter=>:FlateDecode, :Length=>282}
.23999999 0 0 -.23999999 0 841.91998 cm
q
0 387.5 2479.1665 2732.0789 re
W* n
q
3.122376 0 0 3.122376 0 387.5 cm
1 1 1 RG 1 1 1 rg
/G3 gs
0 0 794 875 re
f
0 0 794 875 re
f
.1255 .1333 .1569 RG .1255 .1333 .1569 rg
BT
/P <</MCID 0 >>BDC
/Span<</ActualText (2) >> BDC
/F4 10.6599998 Tf
1 0 0 -1 0 11 Tm
<0090> Tj
EMC
/Span<</ActualText (1) >> BDC
5.6159058 0 Td <008F> Tj
EMC
5.6159058 0 Td <00B4> Tj
/Span<</ActualText (0) >> BDC
2.9624634 0 Td <008E> Tj
EMC
/Span<</ActualText (9) >> BDC
5.6159058 0 Td <0097> Tj
EMC
5.6159058 0 Td <00B4> Tj
/Span<</ActualText (2) >> BDC
2.9624634 0 Td <0090> Tj
EMC
/Span<</ActualText (0) >> BDC
5.6159058 0 Td <008E> Tj
EMC
/Span<</ActualText (2) >> BDC
5.6159058 0 Td <0090> Tj
EMC
/Span<</ActualText (3) >> BDC
5.6159058 0 Td <0091> Tj
EMC
EMC
ET
Q
Q

pdf-reader currently doesn't look at marked content. Maybe we should, and maybe this suggests marked content should take precedence over ToUnicode CMaps?
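
As a rough illustration of what "marked content takes precedence" could mean, the sketch below scans the text block from the content stream above and prefers a span's ActualText over the ToUnicode result for glyphs shown inside it. This is not pdf-reader code and not a proposed patch: extract_text, TOKEN, STREAM and TO_UNICODE are hypothetical names, the scanner is a simple regex rather than a real content-stream parser, and it assumes ActualText is a plain literal string with no escapes, hex strings or UTF-16 byte order mark.

```ruby
# Illustrative sketch only; assumes an already-decompressed content stream.
STREAM = <<~PDF
  BT
  /P <</MCID 0 >>BDC
  /Span<</ActualText (2) >> BDC
  /F4 10.6599998 Tf
  1 0 0 -1 0 11 Tm
  <0090> Tj
  EMC
  /Span<</ActualText (1) >> BDC
  5.6159058 0 Td <008F> Tj
  EMC
  5.6159058 0 Td <00B4> Tj
  /Span<</ActualText (0) >> BDC
  2.9624634 0 Td <008E> Tj
  EMC
  /Span<</ActualText (9) >> BDC
  5.6159058 0 Td <0097> Tj
  EMC
  5.6159058 0 Td <00B4> Tj
  /Span<</ActualText (2) >> BDC
  2.9624634 0 Td <0090> Tj
  EMC
  /Span<</ActualText (0) >> BDC
  5.6159058 0 Td <008E> Tj
  EMC
  /Span<</ActualText (2) >> BDC
  5.6159058 0 Td <0090> Tj
  EMC
  /Span<</ActualText (3) >> BDC
  5.6159058 0 Td <0091> Tj
  EMC
  EMC
  ET
PDF

# The ToUnicode mapping from the CMap in this file, written out by hand.
TO_UNICODE = {
  0x008E => "\u0000", 0x008F => "\u0000", 0x0090 => "\u0000",
  0x0091 => "\u0000", 0x0097 => "\u0000", 0x00B4 => "."
}.freeze

# Matches, in order: a BDC carrying ActualText, any other BDC, EMC, or a hex Tj.
TOKEN = %r{/ActualText\s*\((?<actual>[^)]*)\)\s*>>\s*BDC|(?<bdc>\bBDC\b)|(?<emc>\bEMC\b)|<(?<hex>\h+)>\s*Tj}

def extract_text(stream, to_unicode)
  stack = [] # one entry per open marked-content sequence; nil if no ActualText
  out = +""
  stream.scan(TOKEN) do
    m = Regexp.last_match
    if m[:actual]
      stack.push({ text: m[:actual], emitted: false })
    elsif m[:bdc]
      stack.push(nil)
    elsif m[:emc]
      stack.pop
    elsif m[:hex]
      span = stack.compact.last # innermost enclosing sequence with ActualText
      if span
        # ActualText replaces everything shown in the span, so emit it once.
        out << span[:text] unless span[:emitted]
        span[:emitted] = true
      else
        # No ActualText in scope: fall back to ToUnicode (2-byte codes assumed).
        m[:hex].scan(/\h{4}/) { |code| out << to_unicode.fetch(code.hex, "\uFFFD") }
      end
    end
  end
  out
end

puts extract_text(STREAM, TO_UNICODE)
```

Under these assumptions the digits come from ActualText and the periods, which sit outside any Span, still come from the ToUnicode CMap. Whether ActualText should always win, or only when the CMap yields something unusable such as U+0000, is the open design question raised above.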
