Easiest way to get form fields from a pdf #497

Open
arahman4710 opened this issue Jun 30, 2022 · 5 comments

Comments

@arahman4710

I'm trying to parse standard documents like W-9 forms (https://www.irs.gov/pub/irs-pdf/fw9.pdf). I want to extract the name, which is the first form field that someone fills in. What's the easiest way to do that?

I've tried doing:

reader = PDF::Reader.new("W9.pdf")
objects = reader.objects
result = objects.deref!(reader.pages[0].attributes[:Annots])

When I look at result for a bunch of different W-9s that have been filled in, there doesn't seem to be a single structure in the result variable that I can use to find the name. I know the name is always going to be the first form field; is there an easy way to search for that?

@yob
Owner

yob commented Jul 2, 2022

I'm confident that pdf-reader can deserialize the data you're after, but unfortunately I'm not personally very familiar with PDF forms or how the fields or data are stored.

The PDF spec says the optional Annots page attribute is an array of dictionaries. Each dictionary is a single annotation and can have different properties depending on the type (link, line, square, circle, underline, file, sound, movie, 3D, etc). It sounds like forms use the :Widget annotation type:

Interactive forms (see 12.7, "Forms") use widget annotations (PDF 1.2) to represent the appearance of
fields and to manage user interactions. As a convenience, when a field has only a single associated
widget annotation, the contents of the field dictionary (12.7.4, "Field dictionaries") and the annotation
dictionary may be merged into a single dictionary containing entries that pertain to both a field and an
annotation.

Maybe filtering the :Annots array down to just :Widget annotations will yield some useful results?
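As a rough illustration of that filtering step, here is a minimal sketch. It assumes the input is what objects.deref!(page.attributes[:Annots]) returns (an Array of Hashes, or nil when the page has no annotations). The helper name widget_annotations and the sample data are made up for this example and are not part of pdf-reader.

```ruby
# Keep only widget annotations from an already-dereferenced :Annots array.
# `annots` may be nil, because :Annots is an optional page attribute.
def widget_annotations(annots)
  Array(annots).select { |a| a.is_a?(Hash) && a[:Subtype] == :Widget }
end

# Hypothetical sample data standing in for a dereferenced :Annots array.
sample_annots = [
  { Subtype: :Link, Rect: [0, 0, 10, 10] },
  { Subtype: :Widget, T: "f1_1[0]", V: "Vendor name" }
]

widgets = widget_annotations(sample_annots)
widgets.each { |w| puts "#{w[:T]}: #{w[:V]}" }
```

Working on plain Hashes keeps the helper independent of how the annotations were loaded, so it can be tried without a PDF at hand.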

I see there's also an :AcroForm property at the document level that might have some interesting data. Unfortunately it's not currently exposed by pdf-reader, but I'd happily accept a PR that adds it.

Something like this:

diff --git a/lib/pdf/reader.rb b/lib/pdf/reader.rb
index 22aea3d..8c3266b 100644
--- a/lib/pdf/reader.rb
+++ b/lib/pdf/reader.rb
@@ -142,6 +142,12 @@ def metadata
       end
     end
 
+    # Return a Hash with interactive form details from this file. Not always present
+    #
+    def acroform
+      @objects.deref_hash(root[:AcroForm])
+    end

Would allow:

PDF::Reader.open("somefile.pdf") do |pdf|
  puts pdf.acroform
end

@arahman4710
Author

Gotcha, filtering to look at only the :Widget annotations is helpful, but even when I do that, I see that the value can be nested in the :Parent attribute:

:Parent=>{:FT=>:Tx, :Ff=>8388608, :Kids=>[{...}], :T=>"topmostSubform[0].Page1[0].f1_1[0]", :V=>"Vendor name "}

On one PDF, I was able to just look at the :V attribute on a :Widget annotation, but in the case above it looks like I need to look at the :Parent attribute, and I'm unsure when there'll be a :Parent I should look at and when there won't.
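One way to avoid guessing is to check the widget first and then walk up the :Parent chain until the key turns up, since the PDF spec marks :V (along with :FT and :Ff) as inheritable. The sketch below works on already-dereferenced Hashes; the helper name inherited, the max_depth guard, and the sample dictionaries are assumptions made for illustration, not pdf-reader API.

```ruby
# Return the first value found for `key`, starting at the widget and
# walking up through :Parent. max_depth guards against malformed files
# whose :Parent chain loops back on itself.
def inherited(dict, key, max_depth = 32)
  node = dict
  max_depth.times do
    return nil unless node.is_a?(Hash)
    return node[key] if node.key?(key)
    node = node[:Parent]
  end
  nil
end

# Hypothetical data: one widget whose value lives on its parent field,
# and one widget merged with its field dictionary.
parent = { FT: :Tx, Ff: 8388608, T: "topmostSubform[0].Page1[0].f1_1[0]", V: "Vendor name " }
nested_widget = { Subtype: :Widget, Parent: parent }
merged_widget = { Subtype: :Widget, T: "name", V: "Jane" }

puts inherited(nested_widget, :V).inspect
puts inherited(merged_widget, :V).inspect
```

With this approach the same call covers both layouts, so there is no need to know in advance whether a given file merges the field into the widget.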

@Michael1969

Michael1969 commented Sep 8, 2022

require 'base64'
require 'stringio'
require 'pdf-reader'

pdf = Base64.decode64(params['pdf'])

reader = PDF::Reader.new(StringIO.new(pdf))
reader.pages.each do |page|
  # :Annots is optional, so fall back to an empty list
  annots = page.objects.deref!(page.attributes[:Annots]) || []
  annots.each do |annot|
    puts annot[:T]
    puts annot[:V]
  end
end

@ruinunes

ruinunes commented Feb 2, 2023

Get all fields from a file using the low-level API:

# Requires ActiveSupport for Enumerable#pluck and #compact_blank
fields_from_pdf_form = PDF::Reader.new(file).pages.map do |page|
  page.objects.deref!(page.attributes[:Annots])&.pluck(:T)
end.flatten.compact_blank

UPDATE: this does not return all fields; it skips radio button groups. They are present in the Annots, though, so I still need to find a way to collect them.
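A possible way to pick up radio button groups, assuming (as is common but not guaranteed) that each radio button widget carries no :T of its own and the group name sits on its :Parent: fall back to the parent's :T when the widget has none. This is a sketch on already-dereferenced Hashes in plain Ruby; field_names and the sample data are hypothetical.

```ruby
# Collect field names from an already-dereferenced :Annots array,
# using the parent's :T when a widget has no name of its own.
def field_names(annots)
  names = Array(annots).map do |annot|
    next unless annot.is_a?(Hash)
    parent = annot[:Parent]
    annot[:T] || (parent.is_a?(Hash) ? parent[:T] : nil)
  end
  names.compact.uniq
end

# Hypothetical sample: a text field, two radio buttons sharing one
# parent field, and a link annotation that is not a form field.
radio_parent = { FT: :Btn, T: "filing_status", V: :Single }
sample_annots = [
  { Subtype: :Widget, T: "name", V: "Jane" },
  { Subtype: :Widget, Parent: radio_parent, AS: :Single },
  { Subtype: :Widget, Parent: radio_parent, AS: :Off },
  { Subtype: :Link }
]

names = field_names(sample_annots)
puts names.inspect
```

The uniq call collapses the group to a single entry, because every button in the group points at the same parent.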

@cokron

cokron commented Dec 7, 2023

Hello everybody, I came up with this script to extract acrofields:

require 'pdf-reader'

filename = ARGV[0]

# Check if the filename is provided
if filename.nil?
  puts "Please provide a PDF file name."
  exit 1
end

reader = PDF::Reader.new(filename)

# Look up the catalog (root object) of the PDF via the trailer
catalog = reader.objects.deref(reader.objects.trailer[:Root])
acroform_ref = catalog[:AcroForm]

# Exit if AcroForm is not found
if acroform_ref.nil?
  puts "No AcroForm found in the PDF."
  exit
end

# deref accepts either an indirect reference or an inline dictionary
acroform = reader.objects.deref(acroform_ref)
fields = acroform && reader.objects.deref(acroform[:Fields])

# Check if AcroForm is present and has Fields
if fields
  fields.each do |field_ref|
    field = reader.objects.deref(field_ref)

    # Check if it's an AcroField with a name
    next unless field && field[:T]

    field_name = field[:T]
    # The position (Rect) might not be directly available in the field object
    field_rect = field[:Rect]

    puts "Field '#{field_name}' at position #{field_rect}"
  end
else
  puts "No AcroFields found."
end

This seems to work. I thought it might be useful for you as well.
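The script above reads only the top-level entries of :Fields. In forms with a hierarchy (the :T value "topmostSubform[0].Page1[0].f1_1[0]" earlier in this thread suggests one), the fillable fields may sit further down under :Kids. Here is a sketch of a recursive walk that joins the partial names with dots. It assumes that a kid with a :T is a child field and a kid without one is just a widget; each_terminal_field and the deref parameter are hypothetical names, with deref defaulting to a pass-through so the sketch runs on plain Hashes.

```ruby
# Yield (full_name, field_hash) for every field that has no child fields.
# `deref` is any callable that resolves indirect references; the default
# passes values through unchanged.
def each_terminal_field(fields, deref = ->(x) { x }, prefix = nil, &block)
  Array(deref.call(fields)).each do |ref|
    field = deref.call(ref)
    next unless field.is_a?(Hash)

    name = [prefix, field[:T]].compact.join(".")
    kids = Array(deref.call(field[:Kids])).map { |k| deref.call(k) }
    child_fields = kids.select { |k| k.is_a?(Hash) && k.key?(:T) }

    if child_fields.empty?
      yield name, field
    else
      each_terminal_field(child_fields, deref, name, &block)
    end
  end
end

# Hypothetical tree shaped like the hierarchy seen in this thread.
tree = [
  { T: "topmostSubform[0]", Kids: [
    { T: "Page1[0]", Kids: [
      { T: "f1_1[0]", V: "Vendor name", Kids: [{ Subtype: :Widget }] },
      { T: "f1_2[0]" }
    ] }
  ] }
]

found = {}
each_terminal_field(tree) { |name, field| found[name] = field[:V] }
found.each { |name, value| puts "#{name} => #{value.inspect}" }
```

With a real file, passing something like reader.objects.method(:deref) as the second argument should resolve the references, though that wiring is untested here. The walk has no cycle guard, so a malformed file with looping :Kids would need a depth limit.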

Keep up the good work everybody :-)
