Tech layoffs dataset from COVID 2019 to 2023
Step 1: Setup the environment
- Install and setup MySQL workbench.
- Download the dataset: https://www.kaggle.com/datasets/swaptr/layoffs-2022 https://drive.google.com/file/d/12mIcynhjBajGj9zvRNx1HcsKOAVXLwGs/view?usp=sharing
Step 2: Importing the Data into MySQL
-
Create a new database, start by creating a new schema
-
Import the dataset into a table using the “Table Data Import Wizard”
-
Configure the input settings, and make sure the correct data type for each column is selected. We left the ‘date’ column as ‘text’ for now instead of changing it to ‘DATETIME’ type.
Only 564 records were imported out of the 2361 records the .csv file has.
-
Fixing the import in order to be able to import the whole .csv file with all the records, and for this I explored two methods:
-
Changing the data type in the schema in the import wizard such that all columns are imported as text.
But that didn’t fix the issue.
b. Converting the .csv file into a .json file and re-importing using the Wizard.
The entire file was now imported. However, the detected format for all of the columns would be text.
-
-
Now we want to start cleaning the data.
Step 3: Cleaning the data
We’re going to do a few things in this step that include the following:
- Remove duplicates
- Standardize the Data
- handle Null values or Blank values
- Remove any irrelevant columns
Before we start deleting or irreversibly change the data, we want to make a copy of the raw data, and keep one copy untouched, and then do our work in the other table.
Removing duplicates:
The result shows multiple duplicates
However, we need to check and confirm that the result records are actually duplicates.
The result shows that they’re not duplicates and so we need to adjust our criteria to partition by all the columns.
Duplicates were found and now we need to clean them.
We can use the CTE for that.
However, running the code resulted in an error:
Error Code: 1288. The target table new_duplicate_cte of the DELETE is not updatable
Apparently, MySQL doesn’t support this action and instead, to workaround it, we can create a new table with the row_num colum and deleting the record where the value in that column is equal to 2.
That can be easily done if we right-click on the staging table > copy to clipboard > create statement. Then we can paste the code in the editor.
We will add the row_num column and run
Now we’ll check the duplicate values once more
SELECT *
FROM layoffs_staging2
WHERE row_num > 1;
The delete them
DELETE
FROM layoffs_staging2
WHERE row_num > 1;
However, we get an error
Error Code: 1175. You are using safe update mode and you tried to update a table without a WHERE that uses a KEY column. To disable safe mode, toggle the option in Preferences -> SQL Editor and reconnect.
We’ll disable the safe mode from the settings, hit the ‘reconnect to DBMS’ button and try gain.
Standardizing data:
Basically, we want to check columns, and look for things to improve.
We’ll remove extra spaces using the TRIM function, and update the table. Then we’ll find different versions of the same record and update them such that they are standardized.
Fixing the date column format to make it standardized
However, when running the UPDATE command, we get and error:
Error Code: 1411. Incorrect datetime value: 'NULL' for function str_to_date
This error is because the function is trying to update date strings into datetime data type but it’s failing to update NULL values. Upon inspection, these NULL values are just strings of ‘NULL’ rather than a NULL value. This happened because of the way we imported the data.
To fix that, we’ll do a workaround.
Now we have proper date format, but the column data type is still text, so we’ll need to change that.
We will alter the column and change it into a ‘Date’ type.
Handling NULL and Blank values:
SELECT *
FROM layoffs_staging2
WHERE total_laid_off IS NULL
; -- This code returns an empty table, because the NULL values are actually strings
-- 'NULL'. Therefore, this must be addresses, and one way to do that is:
UPDATE layoffs_staging2
SET total_laid_off = NULL
WHERE total_laid_off = 'NULL';
Now the SELECT code works as expected
We’ll redo the same steps above to fix the other columns as well.
Let’s continue working on populating NULL or Missing values
Let’s get records where the industry column is either empty or has a NULL value and see if we can populate these.
The below example shows how we can potentially fill the missing values from the industry column
Or the better method
We’ll also delete records where both the total_laid_off and percentage_laid_off are NULL or missing, because we can’t extrapolate their values
Lastly, we want to remove the row_num column we added earlier
STEP 4:
Data Exploration
- Started by checking MAX values, but noticed that the value that was being returned was incorrect. And after further inspection, realized that the function wasn’t performing the calculation correctly on the column values, because the column data type was text, and so the values in the column were also strings/texts.
Fixing the data type lead to getting the correct result from the max function.
![Screenshot 2024-09-08 at 11.33.11 PM.png](Project%201%20Tech%20layoffs%20data%20cleaning%20and%20data%20expl%20ffc72114d07945f9bfc46bfbef8ad9db/Screenshot_2024-09-08_at_11.33.11_PM.png)
I continue exploring the data, and also find that the funds_raised_millions column is being sorted incorrectly because the values in the column are also stored as strings rather than numbers.
We’ll fix the column and values and try the query again
We continue to explore the data with more queries to find out what companies had the most layoffs, which industries had the most layoffs, find what is the date range we have for the data and more..
Continue exploring to find more insights
More advanced exploration
Find company lay offs, partitioned by year and ordered by dense rank