IMDBScraper

I could not find any library that could scrape data off IMDB given advanced search filters. Here is my attempt at filling that void.

How to install

This script uses scrapy, so that should be installed first. Also note these instructions are only for python3. You can get it using pip3/pip installer.

sudo pip3 install scrapy

Then clone this repository.

git clone https://github.com/Hallicopter/imdbscraper

And that's it, it is installed!

Usage

It is very easy to use this script. First cd into the directory (by default this will be your home directory.)

cd ~/imdbscraper

Then go to this imdb advanced search link here. Here you can enter all the filters you want and then hit search. You would be redirected to the search results page. Copy the URL of this page. For my example, I searched for IMDB top 100 movies. This is the link I got:

http://www.imdb.com/search/title?groups=top_100

Now open your terminal from the imdbscraper folderand enter this :

scrapy runspider imdb.py -o test.json -a search='http://www.imdb.com/search/title?groups=top_100'

Here the search parameter is your starting page for the crawler to begin. The spider will continue crawling till it hits the end of the search result. The -o test.json is used to save the result in a json file called test.json. You can change this to whatever you want.

Format of the json file.

[
  {
    "director": "Vittorio De Sica", 
    "imdb_link": "http://www.imdb.com/title/tt0040522/?ref_=adv_li_tt",
    "lead_actors": ["Lamberto Maggiorani", "Enzo Staiola", "Lianella Carell",
    "Elena Altieri"], 
    "film_name": ["Bicycle Thieves"]
  },
  {
   "director": "Charles Chaplin",
   "imdb_link": "http://www.imdb.com/title/tt0012349/?ref_=adv_li_tt", 
   "lead_actors": ["Charles Chaplin", "Edna Purviance", "Jackie Coogan", "Carl Miller"],
   "film_name": ["The Kid"]
  }
]

And that's it.

A file containing the results of this example is there in the repository for you to checkout. I will try to add more features in future iterations of this script.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
__pycache__		__pycache__
.imdb.py.swp		.imdb.py.swp
README.md		README.md
betatest.json		betatest.json
imdb.py		imdb.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IMDBScraper

How to install

And that's it, it is installed!

Usage

Format of the json file.

And that's it.

About

Releases

Packages

Languages

Hallicopter/imdbscraper

Folders and files

Latest commit

History

Repository files navigation

IMDBScraper

How to install

And that's it, it is installed!

Usage

Format of the json file.

And that's it.

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages