A bundle of web scraping scripts that harvest information about ships, arrivals and passengers from Jewish Genealogy in Argentina.
Result files can be imported into a SQL database for querying. These files are available in Releases.
My mother's family emigrated mainly from the Czech Republic. While looking for them in some passenger lists, a couple of problems appeared. The first was the last name: I still don't quite understand how it works, but women have their last name "changed" by adding "ova"; for example, "Vonka" becomes "Vonkova". The second was how they were registered when they arrived in Argentina: when a name was somewhat complex, it was changed to a similar local one, for example "Jan" to "Juan" or "František" to "Francisco".
This made things more difficult: I needed to search for all the possibilities. What was my solution? Regular expressions. But since the page had no option to search with them, I decided to copy its information into a personal database and work from there.
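To picture the idea, a single pattern can cover both the Czech feminine suffix and the localized first names. The patterns below are illustrative examples of the approach, not the expressions actually used by the scripts:

```javascript
// Illustrative patterns only, not the project's actual expressions.

// "Vonka" and "Vonkova" share the stem "Vonk"; the optional "ov"
// group matches the feminine suffix in one pass.
const surname = /\bVonk(?:ov)?a\b/;

// Localized first names can be handled with simple alternations.
const firstName = /\b(?:Jan|Juan|Franti[šs]ek|Francisco)\b/;

console.log(surname.test("Vonka"));      // true
console.log(surname.test("Vonkova"));    // true
console.log(firstName.test("Juan"));     // true
console.log(firstName.test("František")); // true
```

With patterns like these, one query over a local database finds every spelling variant at once.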
If you are going to create the database and query it, you also need a MySQL server. You can import the `.csv` files into your preferred database service, but this code only covers MySQL.
- Download this repository
- Install the dependencies: `npm install`
- Fill in the `.env.sample` file and rename it to `.env`
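The `.env` file typically holds the database connection settings. A hypothetical example for orientation only — the variable names below are guesses, so use the keys actually listed in `.env.sample`:

```
# Hypothetical keys; use the ones actually present in .env.sample
DB_HOST=localhost
DB_PORT=3306
DB_USER=root
DB_PASSWORD=your_password
DB_NAME=scraping-passenger-list
```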
If you don't want to bother with getting the information yourself:

- Create a `results` folder inside the project
- Download the latest data release
- Unzip the downloaded file inside `results`
- Skip the "🔍 Getting some information" section
These are the scraping scripts you can run:
- Get ships: `npm run get-ships`
  Searches for the available ships on the page and writes them to `ships.csv`.
- Get arrivals: `npm run get-arrivals -- [flags]`
  Looks for the arrivals of every ship in `ships.csv`, so you must have run `get-ships` before. The results are saved in `arrivals.csv`. If any request fails over the network or because of a limit, the ship is written to `ships.error.csv` so you can retry it later.
- Get passengers: `npm run get-passengers -- [flags]`
  Gets the passenger list of every arrival in `arrivals.csv`, so you have to run `get-arrivals` first. Afterwards, all passengers can be found in `passengers.csv`. If any request fails over the network or because of a limit, the arrival is written to `arrivals.error.csv` so you can retry it later.
I added these flags to modify the behaviour without editing a config file or some constant inside the script.
- Limit the amount of work to do
  Example: `npm run get-arrivals -- [-l | --limit <number>]`
  When you set a limit, any requests or inserts beyond it are saved to a `.error.csv` file so they can be resumed later. The default value is 500; 0 means no limit.
- Change the delay
  Example: `npm run get-passengers -- [-d | --delay <number>]`
  The default value is 200 ms. Going below that is not recommended unless you know how many requests the server allows. I am not responsible for any ban for making too many requests in a very short time.
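A per-request delay like this usually amounts to awaiting a timer between fetches. A minimal sketch of the idea, not the project's actual code:

```javascript
// Minimal sketch of spacing out requests; not the project's actual code.
// sleep() resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithDelay(urls, delayMs = 200) {
  const results = [];
  for (const url of urls) {
    // a real implementation would fetch(url) here; we just record it
    results.push(url);
    await sleep(delayMs); // wait before issuing the next request
  }
  return results;
}
```

Because each iteration awaits the timer, requests are sequential and at least `delayMs` apart, which is what keeps the scraper polite toward the server.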
If a request failed or you set a limit, you will have a `.error.csv` file; here is how to retry those entries.

Example: to retry getting the arrivals of the ships that failed:

`npm run get-arrivals -- [-r | --retry]`

This searches for the arrivals of the ships in `ships.error.csv` and appends the results to `arrivals.csv`. The same logic applies to the other commands.
Wait, what database? Well... first you need to create it by running:
npm run init-database
With the database created, you can insert the result files into it with:
- Ships: `npm run insert-ships -- [flags]`
- Arrivals: `npm run insert-arrivals -- [flags]`
- Passengers: `npm run insert-passengers -- [flags]`
As of this writing, I have harvested almost 1.2 million passengers. Inserting that many rows will take a while... several minutes. So stretch out and go get some coffee.
Once finished, you will be able to query the `scraping-passenger-list` database. You don't have to worry about table joins: I left a template called `selectPassenger.sql`.
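For orientation, a join template along these lines might look like the sketch below. The table and column names here are guesses; check `selectPassenger.sql` for the real ones:

```sql
-- Table and column names are guesses; see selectPassenger.sql
-- for the actual template shipped with the project.
SELECT p.name, s.name AS ship, a.date AS arrival_date
FROM passengers p
JOIN arrivals a ON a.id = p.arrival_id
JOIN ships s ON s.id = a.ship_id
WHERE p.name REGEXP 'Vonk(ov)?a';
```

MySQL's `REGEXP` operator is what finally enables the regular-expression searches that motivated this whole project.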
Copyright © 2021 Aguirre Gonzalo Adolfo. This project is MIT licensed.