In this project, I have to choose one dataset out of four to investigate. Click here to open a document with links and information about the datasets that I can investigate for this project.
I have chosen the TMDb Movie Data for my investigation in this project.
For this project, I will conduct my own data analysis and create a file to share my findings. I will start by taking a look at the dataset and brainstorming what questions I could answer using it. Then I will use pandas and NumPy to answer the questions that I am most interested in, and create a report sharing the answers. I am not required to use inferential statistics or machine learning to complete this project, but I will make it clear in my communications that my findings are tentative. This project is open-ended: the reviewers are not looking for one right answer.
In this project, I'll go through the data analysis process and see how everything fits together. I'll use the Python libraries NumPy, pandas, and Matplotlib, which make writing data analysis code in Python a lot easier! Not only that, these skills are sought after by employers!
This project contains 2 files and 2 folders:
- data.csv: The dataset file containing the 10k+ movie entries that I have worked on.
- report.ipynb: The Jupyter notebook file in which the investigation of the dataset has been done.
- export/: Folder containing HTML and PDF versions of the notebook.
- plots/: Contains images of all the plots that are displayed in the report.ipynb file.
This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including data such as title, cast, director, runtime, budget, revenue, and release year.
- Certain columns, like 'cast' and 'genres', contain multiple values separated by pipe (|) characters.
- There are some odd characters in the 'cast' column. They are nothing to worry about, so I leave them as is.
- The final two columns, ending with "_adj", show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
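As a minimal sketch of how the pipe-separated columns could be handled with pandas, the example below splits a 'genres' column on the pipe character and gives every genre its own row. The two rows and the `title` column name are made up for illustration and are not taken from data.csv; this is one possible approach, not necessarily the one used in report.ipynb.

```python
import pandas as pd

# Tiny made-up sample that mimics the pipe-separated layout.
# These rows and the "title" column name are illustrative assumptions.
sample = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": ["Action|Adventure", "Drama"],
})

# Split each string on '|' into a list, then give every genre its own row.
genres_long = (
    sample.assign(genres=sample["genres"].str.split("|"))
          .explode("genres")
          .reset_index(drop=True)
)

# Count how often each genre appears across all movies.
genre_counts = genres_long["genres"].value_counts()
print(genre_counts)
```

The same pattern should apply to the 'cast' column. Keeping one value per row makes it straightforward to group or count by genre afterwards.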
This project requires Python 3 and the Python libraries used in the notebook installed: NumPy, pandas, Matplotlib, and Seaborn.
You will also need to have software installed to run and execute a Jupyter Notebook.
If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.
In a terminal or command window, navigate to the top-level project directory Investigate_TMDb_Movies/ (that contains this README) and run one of the following commands:
ipython notebook report.ipynb
or
jupyter notebook report.ipynb
or, if you have Jupyter Lab installed,
jupyter lab
This will open the Jupyter/IPython Notebook software and the project file/folder in your browser.
- Know all the steps involved in a typical data analysis process.
- Be comfortable posing questions that can be answered with a given dataset and then answering those questions.
- Know how to investigate problems in a dataset and wrangle the data into a format that can be used.
- Have practice communicating the results of the analysis.
- Be able to use vectorized operations in NumPy and pandas to speed up data analysis code.
- Be familiar with pandas Series and DataFrame objects, which let you access data more conveniently.
- Last but not least, know how to use Matplotlib and Seaborn to produce plots showing findings.
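To illustrate what a vectorized operation looks like, here is a small sketch that computes an inflation-adjusted profit for every movie in one expression instead of looping over rows. The column names `budget_adj` and `revenue_adj` are assumed from the "_adj" description of the dataset, `profit_adj` is a hypothetical derived column, and the numbers are made up rather than taken from data.csv.

```python
import numpy as np
import pandas as pd

# Made-up values; the column names are assumptions based on the "_adj" columns.
movies = pd.DataFrame({
    "budget_adj": [10.0, 20.0, 5.0],
    "revenue_adj": [30.0, 15.0, 5.0],
})

# Vectorized subtraction: pandas applies it to whole columns at once,
# so no explicit Python loop over rows is needed.
movies["profit_adj"] = movies["revenue_adj"] - movies["budget_adj"]

# NumPy works on the underlying array in the same element-wise way.
mean_profit = np.mean(movies["profit_adj"].to_numpy())

# A boolean mask is also vectorized: keep only movies that made a profit.
profitable = movies[movies["profit_adj"] > 0]

print(movies)
print("Mean adjusted profit:", mean_profit)
```

On a dataset with thousands of rows, column-wise expressions like these are typically much faster and easier to read than an equivalent row-by-row loop.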
My project was reviewed by a Udacity reviewer against the Investigating a Dataset rubric. All criteria found in the rubric must meet specifications for me to pass.
My Project Review by a Udacity Reviewer