Easiest way to get form fields from a pdf #497

arahman4710 · 2022-06-30T00:58:03Z

I'm trying to parse standard documents like W-9 forms (https://www.irs.gov/pub/irs-pdf/fw9.pdf). I want to parse out the name, which is the first form field that someone fills in. What's the easiest way to do that?

I've tried doing:

reader = PDF::Reader.new("W9.pdf")
objects = reader.objects
result = objects.deref!(reader.pages[0].attributes[:Annots])

When I look at result for a bunch of different W-9s that have been filled in, there doesn't seem to be a single structure in the result variable that I can use to figure out the name. I know the name is always going to be the first form field. Is there an easy way to search for that?

yob · 2022-07-02T01:33:57Z

I'm confident that pdf-reader can deserialize the data you're after, but unfortunately I'm not personally very familiar with PDF forms or how the fields or data are stored.

The PDF spec says the optional Annots page attribute is an array of dictionaries. Each dictionary is a single annotation and can have different properties depending on its type (link, line, square, circle, underline, file, sound, movie, 3d, etc). It sounds like forms use the :Widget annotation type:

Interactive forms (see 12.7, "Forms") use widget annotations (PDF 1.2) to represent the appearance of
fields and to manage user interactions. As a convenience, when a field has only a single associated
widget annotation, the contents of the field dictionary (12.7.4, "Field dictionaries") and the annotation
dictionary may be merged into a single dictionary containing entries that pertain to both a field and an
annotation.

Maybe filtering the :Annots array down to just :Widget annotations will yield some useful results?
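
A minimal sketch of that filtering, assuming the :Annots array has already been passed through deref! so each annotation is a plain Ruby Hash with a :Subtype key. The helper name widget_annotations is made up for illustration and is not part of pdf-reader:

```ruby
# Hypothetical helper: keep only the widget annotations from a
# dereferenced :Annots array. Non-Hash entries and a missing (nil)
# :Annots value are tolerated.
def widget_annotations(annots)
  Array(annots).select { |a| a.is_a?(Hash) && a[:Subtype] == :Widget }
end

# Only runs when this file is executed directly with a PDF path argument.
if __FILE__ == $0 && ARGV[0]
  require "pdf-reader"

  reader = PDF::Reader.new(ARGV[0])
  page   = reader.pages.first
  annots = reader.objects.deref!(page.attributes[:Annots])

  widget_annotations(annots).each do |widget|
    puts "#{widget[:T].inspect} => #{widget[:V].inspect}"
  end
end
```

Whether :T and :V sit directly on the widget depends on how the file was produced, so treat the printed output as a starting point for inspection rather than a final answer.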

I see there's also an :AcroForm property at the document level that might have some interesting data. Unfortunately it's not currently exposed by pdf-reader, but I'd happily accept a PR that adds it.

Something like this:

diff --git a/lib/pdf/reader.rb b/lib/pdf/reader.rb
index 22aea3d..8c3266b 100644
--- a/lib/pdf/reader.rb
+++ b/lib/pdf/reader.rb
@@ -142,6 +142,12 @@ def metadata
       end
     end
 
+    # Return a Hash with interactive form details from this file. Not always present
+    #
+    def acroform
+      @objects.deref_hash(root[:AcroForm])
+    end

Would allow:

PDF::Reader.open("somefile.pdf") do |pdf|
  puts pdf.acroform
end

arahman4710 · 2022-07-02T23:36:26Z

Gotcha, filtering to look at only the :Widget annotations is helpful, but even when I do that, I see that the value can be nested in the :Parent attribute instead:

:Parent=>{:FT=>:Tx, :Ff=>8388608, :Kids=>[{...}], :T=>"topmostSubform[0].Page1[0].f1_1[0]", :V=>"Vendor name "}

On one PDF, I was able to just look at the :V attribute on a :Widget annotation, but in the case above, it looks like I need to look at the :Parent attribute, and I'm unsure when there'll be a :Parent I should look at and when I shouldn't.
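
One way to avoid guessing might be to check the widget first and then walk up the :Parent chain until the key turns up. As far as I can tell from the forms section of the PDF spec, :V is an inheritable field attribute, which would explain why it sometimes lives on the parent. This is only a sketch: the helper name inherited and the max_depth guard are my own, and it assumes the annotation came from deref! so that :Parent is a nested Hash:

```ruby
# Hypothetical helper: look for `key` on the annotation itself, then on
# each ancestor reached through :Parent. Returns nil if nothing is found.
# max_depth is a safety guard against malformed, circular parent chains.
def inherited(field, key, max_depth: 32)
  node = field
  max_depth.times do
    return nil unless node.is_a?(Hash)
    return node[key] if node.key?(key)
    node = node[:Parent]
  end
  nil
end

# Shape taken from the example above: the value sits on the parent.
nested = {
  Subtype: :Widget,
  Parent: { FT: :Tx, T: "topmostSubform[0].Page1[0].f1_1[0]", V: "Vendor name " }
}
# Merged field/widget dictionary: the value sits on the widget itself.
merged = { Subtype: :Widget, T: "name", V: "Jane" }

puts inherited(nested, :V).inspect
puts inherited(merged, :V).inspect
```

The same lookup works for :T in the nested case, since the widget there has no name of its own.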

Michael1969 · 2022-09-08T01:36:40Z

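require "base64"
require "stringio"
require "pdf-reader"

pdf = Base64.decode64(params['pdf'])

reader = PDF::Reader.new(StringIO.new(pdf))
reader.pages.each do |page|
  objects = page.objects
  # :Annots is optional, so fall back to an empty list for pages without it
  result = objects.deref!(page.attributes[:Annots]) || []
  result.each do |r|
    puts r[:T]
    puts r[:V]
  end
end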

ruinunes · 2023-02-02T17:47:54Z

Get all fields from a file using the low level API:

# pluck and compact_blank come from ActiveSupport (Rails), not plain Ruby
fields_from_pdf_form = PDF::Reader.new(file).pages.map do |page|
  page.objects.deref!(page.attributes[:Annots])&.pluck(:T)
end.flatten.compact_blank

UPDATE: this does not return all fields; it skips radio button groups. They are there inside the Annots, though, so I need to find a way to collect them.
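
A possible explanation, which I have not verified against a real file: the individual radio button widgets may carry no :T of their own, with the group name stored on the :Parent field dictionary. Under that assumption, a plain-Ruby sketch (no ActiveSupport) that falls back to the parent's name could look like this. field_names is a made-up helper, and the input is assumed to be the result of deref! on :Annots:

```ruby
# Hypothetical helper: collect field names from a dereferenced :Annots
# array. Assumption: a widget without its own :T belongs to a field whose
# name is stored on the :Parent dictionary (e.g. a radio button group).
def field_names(annots)
  Array(annots).filter_map do |annot|
    next unless annot.is_a?(Hash)
    annot[:T] || annot.dig(:Parent, :T)
  end.uniq
end

sample = [
  { Subtype: :Widget, T: "name" },
  { Subtype: :Widget, Parent: { T: "group", FT: :Btn } },
  { Subtype: :Widget, Parent: { T: "group", FT: :Btn } },
  { Subtype: :Link }
]
puts field_names(sample).inspect
```

The uniq call is there because every button in a group would otherwise report the same parent name.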

cokron · 2023-12-07T14:53:54Z

Hello everybody, I came up with this script to extract acrofields:

require 'pdf-reader'

filename = ARGV[0]

# Check if the filename is provided
if filename.nil?
  puts "Please provide a PDF file name."
  exit 1
end

reader = PDF::Reader.new(filename)

# Look up the catalog (root object) of the PDF via the trailer's :Root reference
catalog = reader.objects[reader.objects.trailer[:Root]]
acroform_ref = catalog[:AcroForm]

# Exit if AcroForm is not found
if acroform_ref.nil?
  puts "No AcroForm found in the PDF."
  exit
end

# deref handles both an indirect reference and an inline dictionary
acroform = reader.objects.deref(acroform_ref)

# Check if AcroForm is present and has Fields
if acroform && acroform[:Fields]
  acroform[:Fields].each do |field_ref|
    field = reader.objects[field_ref]

    # Check if it's an AcroField with a name
    next unless field && field[:T]

    field_name = field[:T]
    # The position (Rect) might not be directly available in the field object
    field_rect = field[:Rect]

    puts "Field '#{field_name}' at position #{field_rect}"
  end
else
  puts "No AcroFields found."
end

This seems to work. I thought it might be useful for you as well.

Keep up the good work everybody :-)
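
The script above only visits the top-level entries of :Fields. My understanding of the spec is that fields can be nested through :Kids, with the full name formed by joining each level's :T with a period, so a recursive walk may be needed for forms like the W-9. The following is a rough sketch under those assumptions. each_terminal_field is a hypothetical helper, and it expects the fields to have been dereferenced into plain hashes already (for example via deref!):

```ruby
# Hypothetical helper: walk a dereferenced :Fields array and yield
# (full_name, field_hash) for every field that has no named child fields.
# Kids without a :T are treated as widgets, not as separate fields.
# `seen` guards against visiting the same Hash object twice.
def each_terminal_field(fields, prefix = nil, seen = {}.compare_by_identity, &block)
  Array(fields).each do |field|
    next unless field.is_a?(Hash) && !seen.key?(field)
    seen[field] = true

    name = [prefix, field[:T]].compact.join(".")
    child_fields = Array(field[:Kids]).select { |k| k.is_a?(Hash) && k.key?(:T) }

    if child_fields.empty?
      yield name, field unless field[:T].nil?
    else
      each_terminal_field(child_fields, name, seen, &block)
    end
  end
end

sample_fields = [
  { T: "form", Kids: [
    { T: "name", FT: :Tx, V: "Jane" },
    { T: "choice", FT: :Btn, Kids: [{ Subtype: :Widget }, { Subtype: :Widget }] }
  ] }
]

names = []
each_terminal_field(sample_fields) { |name, _field| names << name }
puts names.inspect
```

Inside the block, field[:V] and field[:Rect] can be read the same way as in the script above, keeping in mind that :Rect may sit on a widget kid rather than on the field itself.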
