The dataset is available at: https://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/
The success of a film is usually measured through its box-office revenue or the opinion of professional critics. This project proposes a different approach: the success of a movie is measured by how much it has influenced the movies released after it.
Influence is calculated on a network of references among movies, where a directed edge between two movies indicates that the first one makes a reference to the second one.
The task is to compare and contrast various ranking methods to quantify the influence of films based on the network of references among movies using graph centrality algorithms in Python.
The dataset is the IMDb movie citation network, consisting of around 48,000 international movies produced in several countries from 1920 to 2017. I apply the methods to a network with 48,000 nodes and 130,000 directed edges, where a node represents a movie and an edge exists between two movies if one of them cites the other.
Table 1 presents some statistics of the generated network. The resulting network is directed and acyclic, because a film cannot be referenced by a film that came out at an earlier date.
Table 1 – Generated network statistics
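The network construction and the acyclicity claim above can be illustrated with a minimal sketch. The titles, release years, and function names below are hypothetical toy values chosen for illustration, not taken from the IMDb data, and the code is one possible way to run these checks rather than the project's actual implementation.

```python
from collections import defaultdict, deque

# Hypothetical toy data: release years and (citing, cited) pairs.
release_year = {"A": 1930, "B": 1955, "C": 1980, "D": 2005}
citations = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "A"), ("D", "C")]


def build_graph(edges):
    """Return the node set and adjacency lists (citing film -> cited films)."""
    successors = defaultdict(list)
    nodes = set()
    for citing, cited in edges:
        successors[citing].append(cited)
        nodes.update((citing, cited))
    return nodes, successors


def is_acyclic(nodes, successors):
    """Kahn's algorithm: a directed graph is acyclic iff all nodes can be removed."""
    in_degree = {n: 0 for n in nodes}
    for u in nodes:
        for v in successors.get(u, ()):
            in_degree[v] += 1
    queue = deque(n for n in nodes if in_degree[n] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors.get(u, ()):
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)
    return removed == len(nodes)


def chronological_violations(edges, years):
    """List edges where the citing film is older than the film it cites."""
    return [(u, v) for u, v in edges if years[u] < years[v]]


nodes, successors = build_graph(citations)
print(len(nodes), "nodes,", len(citations), "edges")
print("acyclic:", is_acyclic(nodes, successors))
print("violations:", chronological_violations(citations, release_year))
```

On real data, a non-empty list of violations would point to edges worth inspecting before relying on the acyclicity assumption.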
I compute an influence score for each movie using four static centrality measures and one temporal measure. These measures are graph centrality algorithms borrowed from network analysis. The two methods selected for the final comparison, Long-gap Citation Count and PageRank, are the ones reported in Tables 2 and 3.
- Long-gap Citation Count (Inspired by Wasserman et al. (2014))
- In-degree Centrality
- Eigenvector Centrality
- Katz Centrality
- PageRank Centrality
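To make three of the static measures listed above concrete, here is a hedged, pure-Python sketch of in-degree, Katz, and PageRank scores on a toy citation graph. The toy edges, the function names, and the parameter values (`damping=0.85`, `alpha=0.1`, `beta=1.0`) are assumptions made for illustration. The project may well use a graph library and different settings. Eigenvector centrality is left out of the sketch for brevity.

```python
# Hypothetical toy graph: an edge (citing, cited) means "citing references cited".
toy_nodes = {"A", "B", "C", "D"}
toy_edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "A"), ("D", "C")]


def in_degree_scores(nodes, edges):
    """Number of references each film receives."""
    scores = {n: 0 for n in nodes}
    for _citing, cited in edges:
        scores[cited] += 1
    return scores


def katz_scores(nodes, edges, alpha=0.1, beta=1.0, iterations=100):
    """Unnormalized Katz scores: x_v = beta + alpha * sum of x_u over citing films u."""
    scores = {n: beta for n in nodes}
    for _ in range(iterations):
        updated = {n: beta for n in nodes}
        for citing, cited in edges:
            updated[cited] += alpha * scores[citing]
        scores = updated
    return scores


def pagerank_scores(nodes, edges, damping=0.85, iterations=100):
    """PageRank by power iteration; rank flows from citing films to cited films."""
    nodes = sorted(nodes)
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Films that reference nothing spread their rank uniformly over all films.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        updated = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    updated[v] += share
        rank = updated
    return rank


indeg = in_degree_scores(toy_nodes, toy_edges)
katz = katz_scores(toy_nodes, toy_edges)
pr = pagerank_scores(toy_nodes, toy_edges)
for film in sorted(toy_nodes, key=pr.get, reverse=True):
    print(film, indeg[film], round(katz[film], 3), round(pr[film], 3))
```

In this toy graph the oldest film, `A`, ranks first under all three measures, since every other film references it directly or indirectly.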
Table 2 – Top 10 most influential movies, by PageRank
Table 3 – Top 10 most influential movies, by Long Gap Citation Count
The main differences between the two tables are highlighted in red in Table 2 and in blue in Table 3. Comparing the two tables, I have several observations:
- First, there is temporal bias in Table 2: 70% of the movies in it were released before 1940. This is expected, because PageRank rewards older films more than recent ones. As a result, older movies tend to place higher in the list.
- Second, the temporal bias is partially reduced in Table 3. This is due to the temporal nature of Long-gap Citation Count, which excludes citations made less than 25 years after the cited film's release. As a result, there are more recent films in Table 3, such as the movies highlighted in blue.
- Third, the movies highlighted in red in Table 2 do not appear in Table 3. These films were influential soon after their original release but did not stand the test of time, because they were not referenced by more recent films.
- Finally, there is location bias. All movies in both tables are American, with no other country represented. One possible explanation is that the references American movies receive are more accurately reflected in the dataset.
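The 25-year cutoff behind Long-gap Citation Count, mentioned in the second observation, can be sketched as follows. The toy years, the edges, and the name `long_gap_citation_count` are hypothetical, and the sketch assumes that a citation made exactly 25 years after release still counts. The source does not specify that boundary case.

```python
# Hypothetical toy data: release years and (citing, cited) pairs.
toy_years = {"A": 1930, "B": 1940, "C": 1960, "D": 2000}
toy_citations = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "A"), ("D", "C")]


def long_gap_citation_count(edges, years, min_gap=25):
    """Count only citations made at least `min_gap` years after the cited film."""
    counts = {title: 0 for title in years}
    for citing, cited in edges:
        if years[citing] - years[cited] >= min_gap:
            counts[cited] += 1
    return counts


counts = long_gap_citation_count(toy_citations, toy_years)
print(counts)  # {'A': 2, 'B': 0, 'C': 1, 'D': 0}
```

Film `B` is cited once, but only 20 years after its release, so it scores zero here even though its plain in-degree is one. This mirrors the third observation, where films cited only shortly after release drop out of the ranking.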
Wasserman, M., Zeng, X.H.T. & Amaral, L.A.N. 2014. “Cross-evaluation of metrics to estimate the significance of creative works”. Proceedings of the National Academy of Sciences. 112 (5): 1281-1286.