This is the repository for the book Practitioner's Guide to Data Science (previously titled "Introduction to Data Science").
- You can access the free online version here. Alternatively, you can purchase physical copies from Amazon or Routledge.
Please note that this work is written under the Contributor Code of Conduct and released under the CC-BY-NC-SA license. By participating in this project (for example, by submitting a pull request with suggestions or edits), you agree to abide by its terms.
This book focuses on data science with an emphasis on industrial experience. Data science is a cross-disciplinary subject that combines hands-on experience with problem-solving in a business context. Most introductory books on data science discuss modeling techniques and implementation using R or Python but lack the industrial context. This book seeks to fill that gap by exploring the art of data science in practice.
Some key features of this book are as follows:

- It covers both technical and soft skills.
- It has a chapter dedicated to the big data cloud environment, since industrial data science is often practiced in such an environment.
- It is hands-on. We provide the data and repeatable R and Python code in notebooks. Readers can reproduce the analyses in the book using the data and code provided. We also encourage readers to modify the notebooks to analyze their own data and problems whenever possible. The best way to learn data science is to do it!
- It focuses on the skills needed to solve real-world industrial problems rather than on academic theory.
Notebooks
| Chapter | R | Python |
| --- | --- | --- |
| Ch4: Big Data Cloud Platform | html, rmd | Create Spark Data, pyspark Notebook |
| Ch5: Data Preprocessing | html, rmd | Notebook |
| Ch6: Data Wrangling | html, rmd | Notebook |
| Ch7: Model Tuning Strategy | html, rmd | Notebook |
| Ch8: Measuring Performance | html, rmd | Notebook |
| Ch9: Regression Models | html, rmd | Notebook |
| Ch10: Regularization Methods | html, rmd | Notebook |
| Ch11: Tree-Based Methods | html, rmd | Notebook |
| Ch12: Deep Learning | html (DNN, CNN, RNN), rmd (DNN, CNN, RNN) | DNN, CNN, RNN, Tokenizing and Padding, MNIST with one hidden layer: step by step |
Use R code

You should be able to run the R code in your local R console or RStudio for all chapters except Chapter 4. The code in each chapter is self-contained, so you don't need to run the code from previous chapters before running the code in the current chapter. Within a chapter, however, run the code from the beginning. Each chapter begins with a code block that installs and loads all required packages. We also provide .rmd notebooks with the same code to make the analyses easier to reproduce.
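For reference, a setup block like the one at the start of each chapter might look roughly like the following; the package names here are illustrative placeholders, not the exact list any chapter uses.

```r
# Minimal sketch of a chapter's setup block (package names are placeholders)
pkgs <- c("dplyr", "ggplot2", "caret")

# Install any missing packages, then load them all
to_install <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(to_install) > 0) install.packages(to_install)
invisible(lapply(pkgs, library, character.only = TRUE))
```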
To run the code on big data and cloud platforms, you need to use Databricks, a cloud data platform. We use Databricks because:

- It provides a user-friendly, web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts.
- It has a free Community Edition that is convenient for teaching purposes.

Follow the instructions in section 4.3 to set up and use the Spark environment.
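As a rough illustration only (section 4.3 has the authoritative steps, and the book's notebooks may use a different approach), here is a hypothetical sparklyr sketch of attaching to Spark from an R notebook running inside Databricks:

```r
# Hypothetical sketch: assumes this runs inside a Databricks R notebook,
# where a Spark session already exists on the cluster
library(sparklyr)

sc <- spark_connect(method = "databricks")

# Copy a small local data frame to Spark and peek at it
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)
head(iris_tbl)
```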
Use Python code

We provide Python notebooks for all chapters on GitHub. As with the R notebooks, you should be able to run all of them on your local machine except for Chapter 4, for the reasons stated above. An easy way to run a notebook is to import it into Google Colab. To use Colab, you only need to log in to your Google account in the Chrome browser. To load a notebook into Colab, do either of the following:

- Click the "Open in Colab" icon at the top of each linked notebook in the Chrome browser. It should load the notebook and open it in Colab.
- In Colab, choose File -> Upload notebook -> GitHub. Paste the notebook's link into the box, search, and select the notebook to load it.

To run the big data code, as with the R notebooks, you need to set up Spark in Databricks. Follow the instructions in section 4.3 to set up and use the Spark environment. Then run the "Create Spark Data" notebook to create the Spark data frames. After that, you can run the pyspark notebook to learn how to use PySpark.
Short links:
- https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv to http://bit.ly/2P5gTw4
- https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/AirlineRating.csv to http://bit.ly/2TNQ6TK
- https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/sim1_da1.csv to http://bit.ly/2KXb1Qi
- https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/topic.csv to http://bit.ly/2zam5ny
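These short links point to the raw CSV files, so they can be read straight into your analysis environment. A quick sketch in R (assuming your R installation follows the HTTP redirect, which base read.csv typically does):

```r
# Read the segmentation data (SegData.csv) directly from the short link
seg_dat <- read.csv("http://bit.ly/2P5gTw4")
str(seg_dat)
```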