Catmandu

Importing, transforming, storing and indexing data should be easy

SWIB2014, 1-3 December 2014, Bonn, Germany

Johann Rolschewski / Jakob Voß

Staatsbibliothek zu Berlin, Germany / Verbundzentrale des GBV (VZG), Germany

Libraries collect data ...

books
journals
articles
maps
manuscripts
sheets of music
...

Libraries create metadata ...

bibliographic descriptions
holding information
references
patron data
...

Metadata

... catalogued in library-specific formats (MARC, MAB2, PICA, ...)

... provided via library-specific APIs (OAI, SRU, Z39.50, ...)

... used in diverse systems (OPACs, discovery systems, institutional repositories, link resolvers, ...)

Demand

... for a library-specific metadata toolkit

LibreCat

... is an open collaboration of the three university libraries of Bielefeld, Gent and Lund

... joined by developers of other institutions

Catmandu

... provides an open source set of programming components to build up digital libraries and research services

... supports "Extract, Transform, Load" (ETL) processes

Catmandu - core concepts

Items are the basic unit of data processing in Catmandu. Items may be read, stored, and accessed in many forms.
Importers are Catmandu packages to read items into an application. One can also import from remote sources, for instance via Atom and OAI-PMH endpoints.
Fixes transform items, massaging the data into any format you like.
Stores are databases and search engines to store/index your data.
Exporters are Catmandu packages to export items from an application.
Iterables - Every stream of data, whether it comes from Importers, Fixes or Stores, is an iterator. With iterators the memory consumption of your program stays low: you can process gigabytes or terabytes of input data without ever running out of memory.
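The concepts above can be modelled in a few lines of Python generators. This is only an illustrative sketch of the streaming idea, not Catmandu code: the function names (import_csv, apply_fixes, export_json_lines, upcase_title) and the sample data are made up for this example.

```python
import csv
import io
import json

def import_csv(stream):
    """Importer: lazily yield one item (a dict) per CSV row."""
    yield from csv.DictReader(stream)

def apply_fixes(items, fixes):
    """Fixes: transform each item as it streams past."""
    for item in items:
        for fix in fixes:
            item = fix(item)
        yield item

def export_json_lines(items, out):
    """Exporter: write one JSON document per line, return the item count."""
    count = 0
    for item in items:
        out.write(json.dumps(item, sort_keys=True) + "\n")
        count += 1
    return count

def upcase_title(item):
    """A hypothetical fix: upper-case the 'title' field."""
    item["title"] = item["title"].upper()
    return item

# Only one item is held in memory at a time, whatever the input size.
source = io.StringIO("id,title\n1,UNIX-Magazin\n2,code4lib\n")
sink = io.StringIO()
written = export_json_lines(apply_fixes(import_csv(source), [upcase_title]), sink)
```

Because every stage is a generator, nothing is read until the exporter asks for the next item, which is the property the Iterables concept relies on.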

Importer/Exporter

AlephX BibTeX MAB2 MARC PICA

Atom CSV JSON RDF XLS XML YAML

Importer for APIs

getJSON

OAI

SRU

Z39.50

Stores

CHI

DBI

Elasticsearch

MongoDB

Solr

CLI

catmandu <command> [-DIL] [long options...]
    -D --debug          
    -L --load_path      
    -I --lib_path       

Available commands:

       commands: list the application's commands
           help: display a command's help screen

         config: export the Catmandu config
        convert: convert objects
          count: count the number of objects in a store
           data: store, index, search, import, export or convert objects
         delete: delete objects from a store
         export: export objects from a store
         import: import objects into a store
           info: list installed Catmandu modules
           move: move objects to another store
           repl: interactive shell for Catmandu

CLI - info

$ catmandu info
$ catmandu help <command>

or

$ catmandu exporter_info
$ catmandu fix_info
$ catmandu importer_info
$ catmandu store_info
$ catmandu help <command>

CLI - convert()

catmandu convert [-?hLv] [long options...]

examples:

cat books.json | catmandu convert JSON to CSV --fields id,title

options:

        -? -h --help        this usage screen
        -L --load_path
        -v --verbose

CLI - convert()

$ cat ./shared/journals_mab2.dat | catmandu convert MAB2 to JSON

$ catmandu convert MAB2 to JSON < ./shared/journals_mab2.dat

$ catmandu convert MAB2 --type XML to JSON < ./shared/journals_mab2.xml

CLI - convert()

{
   "_id" : "246797-5",
   "record" : [
      ...
      [
         "331",
         " ",
         "_",
         "UNIX-Magazin"
      ],
      ...
      [
         "406",
         "a",
         "j",
         "1988",
         "k",
         "1992"
      ],
      ...
    ] 
}
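Reading the sample above, each entry of "record" appears to be laid out as tag, indicator, then alternating subfield code and value. Assuming that layout, a small Python sketch (not part of Catmandu; the helper names are hypothetical) shows how values can be pulled out of such a record:

```python
def subfields(field):
    """Split one field into (tag, indicator, [(code, value), ...])."""
    tag, indicator, *rest = field
    return tag, indicator, list(zip(rest[0::2], rest[1::2]))

def values(record, tag, codes=None):
    """Collect subfield values of all fields with the given tag.

    If codes is given, only subfields whose code is in it are returned.
    """
    found = []
    for field in record:
        field_tag, _indicator, pairs = subfields(field)
        if field_tag != tag:
            continue
        found.extend(v for c, v in pairs if codes is None or c in codes)
    return found

record = [
    ["331", " ", "_", "UNIX-Magazin"],
    ["406", "a", "j", "1988", "k", "1992"],
]

title = values(record, "331")
coverage = " - ".join(values(record, "406", codes="jk"))
```

Here title holds the single value of field 331 and coverage joins subfields j and k of field 406.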

CLI - convert()

$ catmandu convert MARC to JSON < ./shared/camel.mrc

$ catmandu convert MARC --type RAW to JSON < ./shared/camel.mrc

$ catmandu convert MARC --type XML to JSON < ./shared/camel.xml

CLI - convert()

$ catmandu convert PICA to YAML < ./shared/pica.xml

$ catmandu convert PICA to JSON < ./shared/pica.xml

CLI - convert()

$ catmandu convert CSV to YAML < ./shared/eu_elections_2014.csv 

$ catmandu convert CSV to CSV --fields Wahlbezirk,DKP,NPD < ./shared/eu_elections_2014.csv 

$ catmandu convert YAML to JSON < ./shared/journals.yml

CLI - convert()

$ catmandu convert MAB2 --fix ./shared/mab2rdf.fix to CSV --file mab2.csv --fields dc_identifier,dc_title,dc_language < ./shared/journals_mab2.dat

$ catmandu convert MAB2 --fix ./shared/mab2rdf.fix to XLS --file mab2.xls --fields dc_identifier,dc_title,dc_language < ./shared/journals_mab2.dat

CLI - convert()

$ cat ./shared/test.tt
[%- FOREACH f IN record %]
[% _id %] [% f.shift %][% f.shift %][% f.shift %][% f.join(":") %]
[%- END %]

$ catmandu convert MARC to Template --template ./shared/test.tt < ./shared/camel.mrc

$ cat ./shared/marc.tt
[% _id %] [% dc.creator.0 %]: [% dc.title %]

$ catmandu convert MARC --fix ./shared/marc.fix to Template --template ./shared/marc.tt < ./shared/camel.mrc

CLI - convert()

see https://gbv.github.io/aREF/aREF.html and https://metacpan.org/pod/RDF::aREF

catmandu convert RDF --file ./shared/zdb_resources.rdf to YAML
catmandu convert MAB2 --type RAW --fix ./shared/mab2rdf.fix to RDF --type ttl < ./shared/mab2.dat
catmandu convert MAB2 --type RAW --fix ./shared/mab2rdf.fix to RDF --type xml < ./shared/mab2.dat

CLI - import()

catmandu import [-?hLv] [long options...]

examples:

catmandu import YAML --file books.yml to MongoDB 
    --database_name items --bag book

options:

        -? -h --help        this usage screen
        -L --load_path
        -v --verbose

CLI - import()

... by default all Importers expect UTF-8 encoded data

CLI - import()

$ catmandu import MARC --type RAW --fix ./shared/marc.fix to MongoDB --database_name marc --bag marc < ./shared/camel.mrc

$ catmandu import MAB2 --fix ./shared/mab2rdf.fix to MongoDB --database_name mab --bag mab  < ./shared/journals_mab2.dat

$ mongo
> use marc
> db.marc.find()

$ catmandu import MARC --type RAW --fix ./shared/marc.fix to Elasticsearch --index_name marc --bag marc < ./shared/camel.mrc

$ catmandu import MAB2 --fix ./shared/mab2rdf.fix to Elasticsearch --index_name mab --bag mab < ./shared/journals_mab2.dat

$ curl 'http://localhost:9200/mab/_search?q=*'

CLI - export()

catmandu export [-?hLqv] [long options...]

examples:

catmandu export MongoDB --database_name items --bag book to YAML

options:

        -? -h --help        this usage screen
        -L --load_path
        -v --verbose
        -q --query
        --limit

CLI - export()

$ catmandu export MongoDB --database_name mab --bag mab to JSON

$ catmandu export Elasticsearch --index_name marc --bag marc to JSON

$ catmandu export Elasticsearch --index_name mab --bag mab --query '_id:"http://example.org/1142708-5"'

CLI - count()

catmandu count [-?hLq] [long options...]

examples:

catmandu count Elasticsearch --index_name shop --bag products 
    --query 'brand:Acme'

options:

        -? -h --help        this usage screen
        -L --load_path
        -q --query

CLI - count()

$ catmandu count MongoDB --database_name mab --bag mab

$ catmandu count MongoDB --database_name marc --bag marc --query '{"dc.creator": "Wall, Larry."}'

$ catmandu count Elasticsearch --index_name mab  --bag mab
$ catmandu count Elasticsearch --index_name mab --bag mab --query 'dc_title:"magazin"'

$ catmandu count Elasticsearch --index_name marc --bag marc --query 'dc.creator:"wall"'

CLI - delete()

catmandu delete [-?hLq] [long options...]

examples:

catmandu delete Elasticsearch --index_name items 
    --bag book -q 'title:"Programming Perl"'

options:

        -? -h --help        this usage screen
        -L --load_path
        -q --query

CLI - delete()

$ catmandu delete MongoDB --database_name mab --bag mab

$ catmandu delete Elasticsearch --index_name mab --bag mab

$ catmandu delete MongoDB --database_name marc --bag marc --query '{"dc.creator": "Wall, Larry."}'

$ catmandu delete Elasticsearch --index_name mab --bag mab --query '_id:"http://example.org/1142708-5"'

CLI - move()

catmandu move [-?hLqv] [long options...]

examples:

catmandu move MongoDB --database_name items --bag book 
    to Elasticsearch --index_name items --bag book

options:

        -? -h --help        this usage screen
        -L --load_path
        -v --verbose
        -q --query
        --limit

CLI - move()

$ catmandu move MongoDB --database_name marc --bag marc to Elasticsearch --index_name moved

$ catmandu move MongoDB --database_name marc --bag marc --query '{"dc.creator": "Wall, Larry."}' to Elasticsearch --index_name moved

$ catmandu move Elasticsearch --index_name mab --bag mab --query '_id:"http://example.org/1142708-5"' to Elasticsearch --index_name selected --bag selected

CLI - data()

catmandu data [-?hLqv] [long options...]

        -? -h --help         this usage screen
        -L --load_path
        --from-store
        --from-importer
        --from-bag
        --count
        --into-exporter
        --into-store
        --into-bag
        --start
        --limit
        --total
        -q --cql-query
        --query
        --fix                fix expression(s) or fix file(s)
        --replace
        -v --verbose

CLI - data()

$ catmandu data --from-store MongoDB --from-database_name marc --from-bag marc --query '{"dc.creator": "Wall, Larry."}'

$ catmandu data --from-store Elasticsearch --from-index_name marc --query 'dc.creator:"Wall, Larry."'

$ catmandu data --from-store Elasticsearch --from-index_name mab --from-bag mab --cql-query 'publisher exact Heise'

$ catmandu data --from-store Elasticsearch --from-index_name mab --from-bag mab --cql-query 'issued > 2009' --into-exporter YAML

$ catmandu data --from-store Elasticsearch --from-index_name mab --from-bag mab --cql-query 'issued > 2009' --into-exporter CSV --fix 'retain_field("_id")'

CLI - APIs

$ catmandu convert OAI --url http://pub.uni-bielefeld.de/oai to JSON

$ catmandu convert SRU --base http://sru.gbv.de/gvk --recordSchema picaxml --parser picaxml --query "pica.iss=0939-4362" to JSON    

$ catmandu convert getJSON --from http://example.org/alice.json to YAML

$ catmandu convert getJSON --dry 1 --url http://{domain}/robots.txt < domains

config

$ cat catmandu.yml
---
store:
  mdb:
    package: MongoDB
    options:
      database_name: mydb
  els:
    package: Elasticsearch
    options:
      index_name: mydb

$ catmandu import JSON to mdb < records.json
$ catmandu import MARC to els < records.mrc
$ catmandu export mdb to JSON
$ catmandu export els to JSON

Exercise 1

convert data
store data
query data
get data
edit config

Fix

... easy data manipulation by non programmers

... small Perl-based DSL

Fix - Path

$append   - Add a new item at the end of an array
$prepend  - Add a new item at the start of an array
$first    - Syntactic sugar for index '0' (the head of the array)
$last     - Syntactic sugar for index '-1' (the tail of the array)
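To make the path keywords concrete, here is a rough Python model of setting a value at a dotted path. It is a sketch under simplifying assumptions (no wildcards, containers are created on demand) and not the Catmandu implementation; set_path is a hypothetical helper.

```python
def set_path(data, path, value):
    """Set value at a dotted path, honouring $append/$prepend/$first/$last."""
    keys = path.split(".")
    node = data
    for pos, key in enumerate(keys):
        is_last = pos == len(keys) - 1
        if is_last:
            child = value
        else:
            # Create a list if the next key addresses an array, else a dict.
            nxt = keys[pos + 1]
            child = [] if nxt.startswith("$") or nxt.isdigit() else {}
        if isinstance(node, list):
            if key == "$append":
                node.append(child)
                node = node[-1]
            elif key == "$prepend":
                node.insert(0, child)
                node = node[0]
            else:
                # $first and $last are sugar for index 0 and -1.
                index = {"$first": 0, "$last": -1}.get(key)
                if index is None:
                    index = int(key)
                if is_last:
                    node[index] = value
                node = node[index]
        elif is_last:
            node[key] = value
        else:
            node = node.setdefault(key, child)
    return data

record = {}
set_path(record, "dc.title", "code4lib")
set_path(record, "dc.subject.$append", "Computer")
set_path(record, "dc.subject.$append", "Bibliothek")
set_path(record, "dc.subject.$prepend", "Informatik")
set_path(record, "dc.subject.$last", "Library")
set_path(record, "dc.identifier.$append.issn", "1940-5758")
```

After these calls the subject array starts with the prepended value and ends with the value written via $last.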

Fix - marc_map

    marc_map('008_/35-37','language');
    marc_map('100','authors.$append');
    marc_map('245[10]a','title');
    marc_map('260b','publisher');
    marc_map('650a','subject', -join => '; ');
    remove_field('record');

Fix - mab_map

mab_map('001','identifier');
mab_map('002[a]','date');
mab_map('037[b]','language');
mab_map('050[ ]','format');
mab_map('052[ ]_/0-0','type');
mab_map('331[ ]','title');
mab_map('406jk','coverage.$append', -join => ' - ');
mab_map('700[bc]','subject.$append');
remove_field('record');

Fix - pica_map

pica_map('001A0','date');
pica_map('010@a','language');
pica_map('009Qa','primaryTopicOf.$append');
pica_map('027A[01]a','varyingFormOfTitle');
remove_field('record');

Fix - field

add_field('name','Smith');
# { name => 'Smith' }
set_field('name','Doe');
# { name => 'Doe' }
copy_field('name','title');
# { name => 'Doe', title => 'Doe' }
remove_field('title');
# { name => 'Doe' }
move_field('name','dc.creator');
# { dc => { creator => 'Doe' } }
retain_field('dc.creator');
# delete every field except the named field

Fix - field

# { subjects => 'Perl,R,JavaScript' }
split_field('subjects',',');
sort_field('subjects');
# { subjects => ['JavaScript', 'Perl', 'R'] }
join_field('subjects','; ');
# { subjects => 'JavaScript; Perl; R' }

Fix - string

# { name => 'Doe'}
upcase('name');
# { name => 'DOE' }
downcase('name');
# { name => 'doe' }
capitalize('name');
# { name => 'Doe' }
append('name',', John');
# { name => 'Doe, John' }
prepend('name','Dr. ');
# { name => 'Dr. Doe, John' }

Fix - string

# { name => ' Doe,  '}
trim('name');
# { name => 'Doe,' }
trim('name','nonword');
# { name => 'Doe' }
substring('name', 0, 1);
# { name => 'D' }

Fix - string

# {format => 'MARC21'}
replace_all('format', '\d', '');
# {format => 'MARC'}    

# {id => ['123-4', '567-X']}
replace_all('id.*', '-[0-9xX]$', '');
# {id => ['123', '567']}
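The effect of the two replace_all calls can be reproduced with Python's re module. This is an illustrative sketch only: it handles a top-level field with an optional ".*" wildcard and assumes the pattern syntax behaves the same in both regex engines for these simple cases.

```python
import re

def replace_all(item, path, pattern, replacement):
    """Apply a regex substitution to a field, or to each element for 'field.*'."""
    field, _, wildcard = path.partition(".")
    value = item[field]
    if wildcard == "*":
        item[field] = [re.sub(pattern, replacement, v) for v in value]
    else:
        item[field] = re.sub(pattern, replacement, value)
    return item

formats = replace_all({"format": "MARC21"}, "format", r"\d", "")
ids = replace_all({"id": ["123-4", "567-X"]}, "id.*", r"-[0-9xX]$", "")
```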

Fix - count & sum

# { numbers => [1, 2, 3] }
copy_field('numbers','count');
count('count');
copy_field('numbers','sum');
sum('sum');
# { numbers => [1, 2, 3], count => 3, sum => 6 }

Fix - dictionaries

$ cat dict.csv
004,Informatik
310,Statistik
510,Mathematik

# { ddc => '004' }
lookup('ddc', 'dict.csv', -default=>'Allgemeines');
lookup('ddc', 'dict.csv', -delete=>'1');
# { ddc => 'Informatik' }

lookup_in_store('ddc', 'MongoDB', -database_name => 'lookups');
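The behaviour of a dictionary lookup with a default or delete option can be sketched in Python as follows. This is a model for illustration, not Catmandu's code; in particular, the assumption that delete takes precedence over default when both are given is mine.

```python
import csv
import io

_MISSING = object()

def load_dict(stream):
    """Read a two-column CSV into a key -> value mapping."""
    return {row[0]: row[1] for row in csv.reader(stream) if len(row) >= 2}

def lookup(item, field, table, default=_MISSING, delete=False):
    """Replace item[field] by its dictionary value.

    Unknown keys are deleted (delete=True), replaced by default,
    or left untouched when neither option is given.
    """
    if field not in item:
        return item
    key = item[field]
    if key in table:
        item[field] = table[key]
    elif delete:
        del item[field]
    elif default is not _MISSING:
        item[field] = default
    return item

table = load_dict(io.StringIO("004,Informatik\n310,Statistik\n510,Mathematik\n"))

known = lookup({"ddc": "004"}, "ddc", table)
defaulted = lookup({"ddc": "999"}, "ddc", table, default="Allgemeines")
deleted = lookup({"ddc": "999"}, "ddc", table, delete=True)
```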

Fix - conditions

if_exists('ddc');
    lookup('ddc', 'dict.csv',  -delete=>'1');
end();

unless_exists('ddc');
    add_field('ddc', '000');
end();

if_any_match('ddc', '004');
    set_field('subject', 'Informatik');
end();

unless_any_match('subject', '[a-zA-Z]+');
    lookup('subject', 'dict.csv',  -delete=>'1');
end();

Fix - nested data structures

add_field('dc.title','code4lib');
add_field('dc.subject.$append', 'Computer');
add_field('dc.subject.$append', 'Informatik');
add_field('dc.subject.$append', 'Bibliothek');
add_field('dc.identifier.$append.zdbid','2415107-5');
add_field('dc.identifier.$append.ocn','502377032');
add_field('dc.identifier.$append.issn','1940-5758');
remove_field('dc.identifier.$first');
remove_field('dc.subject.1');
remove_field('dc.subject.*');

Fix - nested data structures

# Collapse deep nested hash to a flat hash
collapse();

# Expand flat hash to deep nested hash
expand();               

# Clone the Perl hash and work on the clone
clone();
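What collapse() and expand() do to a record can be illustrated with a short Python sketch. It assumes that collapsed keys are dotted paths and that array positions become numeric path segments; the functions below are a model of that idea, not the Catmandu implementation.

```python
def collapse(data, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat = {}
    pairs = data.items() if isinstance(data, dict) else enumerate(data)
    for key, value in pairs:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)) and value:
            flat.update(collapse(value, path))
        else:
            flat[path] = value
    return flat

def _listify(node):
    """Turn dicts whose keys are all numeric back into lists."""
    if not isinstance(node, dict):
        return node
    fixed = {k: _listify(v) for k, v in node.items()}
    if fixed and all(k.isdigit() for k in fixed):
        return [fixed[k] for k in sorted(fixed, key=int)]
    return fixed

def expand(flat):
    """Rebuild the nested structure from dotted paths."""
    root = {}
    for path, value in flat.items():
        *parents, leaf = path.split(".")
        node = root
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return _listify(root)

nested = {"dc": {"title": "code4lib", "subject": ["Computer", "Bibliothek"]}}
flat = collapse(nested)
restored = expand(flat)
```

A flat form like this is handy before exporting to tabular formats such as CSV, where every column needs a simple name.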

Fix - cmd

# Use an external program that can read JSON 
# from stdin and write JSON to stdout
cmd("java MyClass");

Fix - Binds

... provide processing hooks around Fix functions

$ echo "{}" | catmandu convert --fix 'meow()'
{ "meow": "Prrr" }
$ echo "{}" | catmandu convert --fix 'do bark() meow() end'
woof! woof!
{ "meow": "Prrr" }

Exercise 2

import & fix data
export & fix data

RDF

...

Exercise 3

export data to RDF

Extensions

├── Catmandu
│   ├── Cmd
│   │   └── foo.pm
│   ├── Exporter
│   │   ├── Foo.pm
│   ├── Fix
│   │   ├── foo_map.pm
│   ├── Importer
│   │   ├── Foo.pm
│   ├── Store
│   │   ├── Foo
│   │   │   ├── Bag.pm
│   │   │   └── Searcher.pm
│   │   ├── Foo.pm

Importer

package Catmandu::Importer::Hello;

use Catmandu::Sane;
use Moo;
with 'Catmandu::Importer';

sub generator {
    my ($self) = @_;
    return sub {
        # Read one line per call; returning undef at end of input ends the stream.
        my $line = $self->readline or return;
        chomp $line;
        # The first comma-separated column holds the name.
        my ($name) = split( ',', $line );
        return $name
            ? { "hello" => $name }
            : { "hello" => 'World' };
    };
}

1;

Fix

package Catmandu::Fix::hello_world;

use Moo;

sub fix {
    my ($self,$data) = @_;

    $data->{hello} = 'World';

    return $data;
}

1;

CMD

package Catmandu::Cmd::hello_world;
use parent 'Catmandu::Cmd';
 
sub command_opt_spec {
   (
       [ "greeting|g=s", "provide a greeting text" ],
   );
}
 
sub description {
   <<EOS;
examples:
catmandu hello_world --greeting "Hoi"
options:
EOS
}
 
sub command {
   my ($self, $opts, $args) = @_;
   my $greeting = $opts->greeting // 'Hello';
   print "$greeting, World!\n"
}
 
1;

Extensions

catmandu -I ./lib convert Hello < ./shared/names.csv 
catmandu -D -I ./lib convert Hello --fix "hello_world()" < ./shared/names.csv
catmandu -I ./lib hello_world --greeting Moin

Links

http://librecat.org

http://librecat.org/Catmandu/

http://metacpan.org/release/Catmandu

http://github.com/LibreCat/Catmandu

Comic by Randall Munroe, CC BY-NC 2.5
