- users.dat
- rating.csv
- movies.csv
All demographic information is provided voluntarily by the users and is not checked for accuracy. Only users who have provided some demographic information are included in this data set.
-
Gender is denoted by a "M" for male and "F" for female
-
Age is chosen from the following ranges:
- 1: "Under 18"
- 18: "18-24"
- 25: "25-34"
- 35: "35-44"
- 45: "45-49"
- 50: "50-55"
- 56: "56+"
-
Occupation is chosen from the following choices:
- 0: "other" or not specified
- 1: "academic/educator"
- 2: "artist"
- 3: "clerical/admin"
- 4: "college/grad student"
- 5: "customer service"
- 6: "doctor/health care"
- 7: "executive/managerial"
- 8: "farmer"
- 9: "homemaker"
- 10: "K-12 student"
- 11: "lawyer"
- 12: "programmer"
- 13: "retired"
- 14: "sales/marketing"
- 15: "scientist"
- 16: "self-employed"
- 17: "technician/engineer"
- 18: "tradesman/craftsman"
- 19: "unemployed"
- 20: "writer"
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
MovieID;Title;Genres
-
UserIDs range between 1 and 6040
-
MovieIDs range between 1 and 3952
-
Ratings are made on a 5-star scale (whole-star ratings only)
-
Timestamp is represented in seconds since the epoch as returned by time(2)
-
Each user has at least 20 ratings
-
Titles are identical to titles provided by the IMDB (includingyear of release)
-
Genres are pipe-separated and are selected from the following genres:
- Action
- Adventure
- Animation
- Children's
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
unames=["user_id", "gender", "age", "occupation", "zip"]
users= pd.read_table("users.dat", sep="::", header=None, names=unames, engine="python")
rnames=["user_id", "movie_id", "rating", "timestamp"]
ratings= pd.read_csv("ratings.csv")
mnames=["movie_id", "title", "genres"]
movies=pd.read_csv("movies.csv")
ratings.rename(columns = {'userId':'user_id', 'movieId':'movie_id'}, inplace = True)
movies.rename(columns = {'movieId':'movie_id'}, inplace = True)
data = pd.merge(pd.merge(ratings, users), movies)
Once the tables are merged it was decided to study the mean films rated by gender, to then plotting the 20 best rated films by gender.
mean_ratings= data.pivot_table("rating", index="title", columns="gender", aggfunc="mean")
indexer=mean_ratings.sum(axis="columns").argsort()
mean_ratings_20 = mean_ratings.take(indexer[-20:])
mean_ratings_20 = mean_ratings_20.stack()
mean_ratings_20.name = "rating_by_gender"
mean_ratings_20 = mean_ratings_20.reset_index()
import seaborn as sns
sns.barplot(x="rating_by_gender",y="title",hue="gender", data= mean_ratings_20)
ratings_by_title=data.groupby("title").size()
most_rated_titles= ratings_by_title.index[ratings_by_title >= 250]
mean_ratings= mean_ratings.loc[most_rated_titles]
mean_ratings
Once we have the film titles sorted by the 250 most rated we can study which has been rated by gender, female or male.
top_female_ratings= mean_ratings.sort_values("F", ascending=False)
top_female_ratings.head()
Once we have the titles' ratings sorted by gender we can measure the difference between them and get to know which titles have the greatest rating difference, sorting them by it.
mean_ratings["diff"]= mean_ratings["M"]-mean_ratings["F"]
sorted_by_diff= mean_ratings.sort_values("diff")
sorted_by_diff[::-1].head()
sorted_by_diff.tail()
Suppose instead you wanted the movies that elicited the most disagreement among viewers, independent of gender identification. Disagreement can be measured by the variance or standard deviation of the ratings. To get this, we first compute the rating standard deviation by title and then filter down to the active titles
rating_std_by_title= data.groupby("title")["rating"].std()
rating_std_by_title=rating_std_by_title.loc[most_rated_titles]
rating_std_by_title.sort_values(ascending=False)[:10]
Another thing we can do with this database is classify the film's title by genre and group of age, to do that first we need to split the "genres" column in the amount of genres every film is classified with. Then once we have split the array of genres using .explode() function we can generate another dataframe with a genre per row.
movies["genres"].head()
movies["genres"].head().str.split("|")
movies["genre"]=movies.pop("genres").str.split("|")
Now, calling movies.explode("genre") generates a new DataFrame with one row for each "inner" element in each list of movie genres. For example, if a movie is classified as both a comedy and a romance, then there will be two rows in the result, one with just "Comedy" and the other with just "Romance":
rating_with_genre= pd.merge(pd.merge(movies_exploded,ratings),users)
rating_with_genre.iloc[0]
genre_ratings= (rating_with_genre.groupby(["genre", "age"])["rating"].mean().unstack("age"))
genre_ratings[:10]
We then decided to do the same with the search engines and count the number of times the different search engines are used.To do this we have to take the data from a column, column a with the Browser's name data, and keep the first word that corresponds to the search engine.