A bundle of web scraping scripts that harvest information about ships, arrivals and passengers from Jewish Genealogy in Argentina.
Result files can be imported into a SQL database for querying. These files are available in Releases.
My mother's family emigrated mainly from the Czech Republic. While looking for them in some passenger lists, a couple of problems appeared. The first was the last name: I still don't quite understand how it works, but women have their last name "changed" by adding "ova"; for example, "Vonka" becomes "Vonkova". The second was how they were registered when they arrived in Argentina: when a name was somewhat complex, it was changed to a similar local one, for example "Jan" to "Juan" or "František" to "Francisco".
This made things more difficult: I needed to search for all the possibilities. What was my solution? Regular expressions. But since the page had no option to search with them, I decided to copy its information into a personal database and work from there.
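To picture the idea, a single pattern can cover both the Czech feminine suffix and the localized first names. The patterns below are illustrative examples of the approach, not the expressions actually used by the scripts:

```javascript
// Illustrative patterns only, not the project's actual expressions.

// "Vonka" and "Vonkova" share the stem "Vonk"; the optional "ov"
// group matches the feminine suffix in one pass.
const surname = /\bVonk(?:ov)?a\b/;

// Localized first names can be handled with simple alternations.
const firstName = /\b(?:Jan|Juan|Franti[šs]ek|Francisco)\b/;

console.log(surname.test("Vonka"));      // true
console.log(surname.test("Vonkova"));    // true
console.log(firstName.test("Juan"));     // true
console.log(firstName.test("František")); // true
```

With patterns like these, one query over a local database finds every spelling variant at once.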
If you are going to create the database and query it, you also need a MySQL server. You can import the `.csv` files into your preferred database service, but this code only covers MySQL.
- Download this repository
- Install the dependencies: `npm install`
- Fill in the `.env.sample` file and rename it to `.env`
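The `.env` file typically holds the database connection settings. A hypothetical example for orientation only — the variable names below are guesses, so use the keys actually listed in `.env.sample`:

```
# Hypothetical keys; use the ones actually present in .env.sample
DB_HOST=localhost
DB_PORT=3306
DB_USER=root
DB_PASSWORD=your_password
DB_NAME=scraping-passenger-list
```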
If you don't want to bother with getting the information yourself:

- Create a `results` folder inside the project
- Download the latest data release
- Unzip the downloaded file inside `results`
- Skip the "🔍 Getting some information" section
These are the scraping scripts you can run:
- Get ships: `npm run get-ships`
  Searches for the available ships on the page and writes them to `ships.csv`.
- Get arrivals: `npm run get-arrivals -- [flags]`
  Looks for the arrivals of every ship in `ships.csv`, so you must have run `get-ships` before. The results are saved in `arrivals.csv`. If any request fails over the network or because of a limit, the ship is written to `ships.error.csv` so you can retry it later.
- Get passengers: `npm run get-passengers -- [flags]`
  Gets the passenger list of every arrival in `arrivals.csv`, so you have to run `get-arrivals` first. Afterwards, all passengers can be found in `passengers.csv`. If any request fails over the network or because of a limit, the arrival is written to `arrivals.error.csv` so you can retry it later.
I added these flags to modify the behaviour without editing a config file or some constant inside the script.
- Limit the amount of work to do
  Example: `npm run get-arrivals -- [-l | --limit <number>]`
  When you set a limit, any requests or inserts beyond it are saved to a `.error.csv` file so they can be resumed later. The default value is 500; 0 means no limit.
- Change the delay
  Example: `npm run get-passengers -- [-d | --delay <number>]`
  The default value is 200 ms. Going below that is not recommended unless you know how many requests the server allows. I am not responsible for any ban for making too many requests in a very short time.
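A per-request delay like this usually amounts to awaiting a timer between fetches. A minimal sketch of the idea, not the project's actual code:

```javascript
// Minimal sketch of spacing out requests; not the project's actual code.
// sleep() resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithDelay(urls, delayMs = 200) {
  const results = [];
  for (const url of urls) {
    // a real implementation would fetch(url) here; we just record it
    results.push(url);
    await sleep(delayMs); // wait before issuing the next request
  }
  return results;
}
```

Because each iteration awaits the timer, requests are sequential and at least `delayMs` apart, which is what keeps the scraper polite toward the server.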
If a request failed or you set a limit, you will have a `.error.csv` file; here is how to retry those entries.

Example: to retry getting the arrivals of the ships that failed:

`npm run get-arrivals -- [-r | --retry]`

This searches for the arrivals of the ships in `ships.error.csv` and appends the results to `arrivals.csv`. The same logic applies to the other commands.
Wait, what database? Well... first you need to create it by running:
npm run init-database
With the database created, you can insert the result files into it with:
- Ships: `npm run insert-ships -- [flags]`
- Arrivals: `npm run insert-arrivals -- [flags]`
- Passengers: `npm run insert-passengers -- [flags]`
As of this writing, I have harvested almost 1.2 million passengers. Inserting that many rows will take a while... several minutes. So stretch out and go get some coffee.
Once finished, you will be able to query the `scraping-passenger-list` database. You don't have to worry about table joins: I left a template called `selectPassenger.sql`.
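For orientation, a join template along these lines might look like the sketch below. The table and column names here are guesses; check `selectPassenger.sql` for the real ones:

```sql
-- Table and column names are guesses; see selectPassenger.sql
-- for the actual template shipped with the project.
SELECT p.name, s.name AS ship, a.date AS arrival_date
FROM passengers p
JOIN arrivals a ON a.id = p.arrival_id
JOIN ships s ON s.id = a.ship_id
WHERE p.name REGEXP 'Vonk(ov)?a';
```

MySQL's `REGEXP` operator is what finally enables the regular-expression searches that motivated this whole project.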
Copyright © 2021 Aguirre Gonzalo Adolfo. This project is MIT licensed.