using a bunch of open source tools to examine a whole bunch of documents
- Apache Tika (document text and metadata processing)
- Solr (text indexing, search, and faceting)
- ICIJ Extract (for processing documents into e.g. a Solr index)
- Blacklight (Ruby on Rails engine for exploring data stored in a Solr index)
docker compose up
with the defaultdocker-compose.yml
(will createvar_solr
if it doesn't exist)docker exec -it document_indexer_solr_1 solr create_core -c document_core
from host CLI- add below
lib
andrequestHandler
stanzas to the appropriate spots in thesolrconfig.xml
that core creation generated - not sure this is necessary? save the example linked here as
schema.xml
(sibling ofsolrconfig.xml
invar_solr/data/document_core/conf
, there shouldn't be one there yet to overwrite) - install apache tika (can be done via homebrew on mac)
- https://tika.apache.org/2.0.0/gettingstarted.html
- alternatively: https://github.com/apache/tika-docker (could incorporate into docker-compose.yml)
- install ICIJ Extract
- https://github.com/ICIJ/extract/ -- "A cross-platform command line tool for parallelized, distributed content-extraction. Built on top of Apache Tika and an essential part of the engineering behind the Panama Papers, Swiss Leaks and Luxembourg Leaks investigations."
- see command below
- requires installing maven (also available via homebrew, for mac)
- consider installing jEnv (similar to rbenv and pyenv) if you have multiple java versions to manage -- https://github.com/jenv/jenv
<lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.content">_text_</str>
</lst>
</requestHandler>
# to successfully build extract.jar, i got errors on the javadoc task and the gpg task, and signing is only necessary for publishing the build, not running it locally
$ mvn install -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -Dgpg.skip=true
$ `curl 'http://localhost:8983/solr/document_core/update/extract?literal.id=helloworld&commit=true' -F "myfile=@test.pdf"`
{
"responseHeader":{
"status":0,
"QTime":2179}}
run something like:
# from the extract project directory after building
$ java -jar extract-cli/target/extract-cli-3.8.1.jar spew --ocr no -o solr -s 'http://localhost:8983/solr/document_core' --commitInterval 500 'my_cool_pdf_directory'
extract might be overkill for me. try calling tika on the PDFs (since those are the documents you have at the moment) and then indexing fields of my choice from parsing, using python or ruby. i had trouble figuring out how to get tika to index the fields i wanted.
both pysolr and rsolr seem fine, tika bindings for python seem better than those for ruby, so prob easier to write indexing code in python than ruby? would be easy to do simple text extraction using tika and create a document based on that. could generate same document IDs as ICIJ extract (i think just a hash of the document content?).
https://solr.apache.org/guide/8_8/overview-of-documents-fields-and-schema-design.html https://github.com/projectblacklight/blacklight/wiki/Blacklight-configuration
https://github.com/chrismattmann/tika-python https://github.com/django-haystack/pysolr/ https://github.com/mrcsparker/ruby_tika_app https://github.com/duke-libraries/tika-client https://github.com/kanety/tikarb https://github.com/rsolr/rsolr
this seems to do something similar to what i'm trying to do https://github.com/EricLondon/Docker-Rails-Tika-Elasticsearch / https://ericlondon.com/2017/02/01/integrate-tika-rest-service-with-rails-paperclip-attachments-to-extract-text-from-pdf-documents-and-store-in-elasticsearch.html
not sure how useful this is, but maybe? https://github.com/chrismattmann/tika-similarity
https://guides.rubyonrails.org/getting_started.html
rails new . --skip-spring --skip-listen
https://github.com/projectblacklight/blacklight/wiki/Quickstart
- confirm the dependencies (ruby, solr, node, yarn, etc) are installed
- go to the "Creating a new application the hard way" section -- https://github.com/projectblacklight/blacklight/wiki/Quickstart#creating-a-new-application-the-hard-way
- skip solr_wrapper because you're already running solr in docker
rails generate blacklight:install --devise --solr_version=latest
bin/rails db:migrate
rails s
- modify the default catalog controller a bunch because it expects a totally different schema than you get from tika throwing PDFs at solr