Predicting car prices in used-vehicle listings from Craigslist.org
Craigslist is the world's largest collection of used vehicles for sale, yet it's very difficult to collect all of them in the same place. I built a scraper for a school project and expanded upon it later to create this dataset which includes every used vehicle entry within the United States on Craigslist.
This data is scraped every few months and contains almost all of the relevant information that Craigslist provides on car sales, including columns such as price, condition, manufacturer, latitude/longitude, and 18 other categories. For ML projects, consider feature engineering on location columns such as long/lat. For previous listings, check older versions of the dataset.
This is a Kaggle dataset, available at: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
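As an illustration of the location feature engineering suggested in the dataset description, here is a minimal sketch that derives two features from a listing's coordinates: the great-circle distance to a reference point, and a coarse grid cell that can serve as a categorical feature. The column names `lat` and `long`, the reference point, and the helper names `haversine_km` and `grid_cell` are assumptions made for this example; they are not taken from the project's notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def grid_cell(lat, lon, size_deg=1.0):
    """Bucket a coordinate into a coarse grid cell (a categorical feature)."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


# Hypothetical listing and reference point (illustrative values only).
listing = {"lat": 34.05, "long": -118.24}
reference = (36.17, -115.14)

listing["dist_to_ref_km"] = haversine_km(listing["lat"], listing["long"], *reference)
listing["grid_cell"] = grid_cell(listing["lat"], listing["long"])
```

The grid cell is a cheap way to let a tree-based model pick up regional price differences. The distance feature only makes sense if a meaningful reference point exists for the problem, such as the nearest large city.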
In this project I followed the steps of the project management method CRISP-DM. The method has been adapted to the reality of a Data Science project, and the adapted version is called CRISP-DS.
Its main principle is to carry out the project in multiple cycles, as many as necessary.
0.0 - IMPORTS
0.1 - Helper Function
0.2 - Loading Data
1.0 - DESCRIPTION OF DATA
1.1 - Rename Columns
1.2 - Data Dimensions
1.3 - Data Types
1.4 - Check NA
1.5 - Fill Out NA
1.6 - Change Types
1.7 - Descriptive Statistics
- 1.7.1 - Numerical Attributes
- 1.7.2 - Categorical Attributes
2.0 - FEATURE ENGINEERING
2.1 - Creation of Hypotheses
- 2.1.1 - Demographic Hypotheses
- 2.1.2 - Geographic Hypotheses
- 2.1.3 - Sociocultural Hypotheses
2.2 - Final list of Hypotheses
2.3 - Feature Engineering
3.0 - VARIABLE FILTERING
3.1 - Row Filtering
3.2 - Column Selection
4.0 - EXPLORATORY DATA ANALYSIS
4.1 - Univariate Analysis
- 4.1.1 - Response Variable
- 4.1.2 - Numerical Variable
- 4.1.3 - Categorical Variable
4.2 - Bivariate Analysis
- 4.2.1 - Summary of Hypotheses
4.3 - Multivariate Analysis
- 4.3.1 - Numerical Attributes
- 4.3.2 - Categorical Attributes
5.0 - DATA PREPARATION
5.1 - Normalization
5.2 - Rescaling
5.3 - Transformation
- 5.3.1 - Encoding
- 5.3.2 - Response Variable Transformation
- 5.3.3 - Nature Transformation
6.0 - FEATURE SELECTION
6.1 - Split dataframe into training and test dataset
6.2 - Boruta as Feature Selection
- 6.2.1 - Best Feature from Boruta
7.0 - MACHINE LEARNING MODELLING
7.1 - Average Model
7.2 - Linear Regression Model
- 7.2.1 - Linear Regression Model - Cross Validation
7.3 - Linear Regression Regularized Model
- 7.3.1 - Linear Regression - Lasso - Cross Validation
7.4 - Random Forest Regressor
- 7.4.1 - Random Forest Regressor - Cross Validation
7.5 - XGBoost Regressor
- 7.5.1 - XGBoost Regressor - Cross Validation
7.6 - Compare Models' Performance
- 7.6.1 - Single Performance
- 7.6.2 - Real Performance - Cross Validation
8.0 - HYPERPARAMETER FINE TUNING
8.1 - Random Search
8.2 - Final Model
9.0 - TRANSLATION AND INTERPRETATION OF THE ERROR
9.1 - Business Performance
9.2 - Total Performance
9.3 - Machine Learning Performance
10.0 - DEPLOY MODEL TO PRODUCTION
10.1 - Energy Consumption Class
10.2 - API Handler
10.3 - Tester
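
To give a feel for the "Average Model" step in the outline (7.1), here is a minimal sketch of a mean-price baseline together with three common regression error measures. The outline does not say which metrics the project uses, so MAE, MAPE, and RMSE are assumptions here, as are the function names and the toy prices. More complex models are usually worth keeping only if they beat this kind of baseline.

```python
import math


def average_model(train_prices, n_predictions):
    """Baseline: predict the mean training price for every listing."""
    mean_price = sum(train_prices) / len(train_prices)
    return [mean_price] * n_predictions


def regression_errors(y_true, y_pred):
    """Return MAE, MAPE and RMSE (assumed metrics; y_true must be non-zero)."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    mae = sum(abs(t - p) for t, p in pairs) / n
    mape = sum(abs((t - p) / t) for t, p in pairs) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n)
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse}


# Toy prices for illustration only, not taken from the dataset.
train = [5000.0, 10000.0, 15000.0]
test = [8000.0, 12000.0]

pred = average_model(train, len(test))
errors = regression_errors(test, pred)
```

MAPE divides by the true price, so listings with a price of zero would have to be filtered out before evaluating this way.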