Enhancement: Data Cleaning and Feature Engineering #3

Open
7 tasks
limwualice opened this issue May 15, 2023 · 0 comments

Labels
good first issue Good for newcomers

Comments

limwualice (Owner) commented May 15, 2023

Description:
We should prioritize data cleaning and feature engineering to improve the quality and usefulness of our dataset. This involves transforming existing columns where needed and deriving new features from them.

Data Cleaning:

  • Handle Missing Values: Identify missing values in the dataset, then either impute them or remove rows/columns where a substantial share of the data is missing.
  • Remove Duplicates: Ensure data integrity by checking for and removing any duplicate records in the dataset.
  • Standardize Data Types: Verify and standardize the data types of each column, ensuring they are appropriate for the respective data.
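
A minimal sketch of the three cleaning steps above, assuming the data lives in a pandas DataFrame. The column names (`name`, `rating`, `price`, `review_date`), the sample rows, and the 50% missing-data threshold are hypothetical placeholders, not taken from the project; median imputation is one possible choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the project's real columns may differ.
raw = pd.DataFrame({
    "name": ["Sushi A", "Sushi A", "Sushi B", "Sushi C", "Sushi D"],
    "rating": ["4.5", "4.5", None, "3.8", "4.1"],   # numbers stored as text
    "price": [25.0, 25.0, 40.0, np.nan, 18.0],
    "review_date": ["2023-01-05", "2023-01-05", "2023-02-11", "2023-03-02", None],
})


def clean(df: pd.DataFrame, max_missing_ratio: float = 0.5) -> pd.DataFrame:
    """Remove duplicates, standardize dtypes, and handle missing values."""
    # Remove exact duplicate records.
    out = df.drop_duplicates().copy()

    # Standardize data types; unparseable values become NaN / NaT.
    out["rating"] = pd.to_numeric(out["rating"], errors="coerce")
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["review_date"] = pd.to_datetime(out["review_date"], errors="coerce")

    # Drop columns whose share of missing values exceeds the threshold.
    keep = out.columns[out.isna().mean() <= max_missing_ratio]
    out = out.loc[:, keep].copy()

    # Impute remaining gaps in numeric columns with the column median.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Missing dates are left as `NaT` here, since there is no obviously sensible value to impute for them; dropping those rows instead would be an equally reasonable choice.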

Feature Engineering:

  • Extract Relevant Information: Extract valuable information from existing columns, such as day, month, or year from date columns.
  • Create Categorical Variables: Transform continuous variables into categorical variables if it provides additional insights or simplifies analysis.
  • Engineer Interaction Features: Create new features that capture interactions or relationships between existing variables, such as ratios or combinations of features.
  • Binning or Grouping: Group continuous variables into bins or categories to simplify analysis or capture non-linear relationships.
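
The feature engineering steps above could look roughly like this in pandas. This is a sketch only: the column names, the sample values, and the price bin edges (20 and 35) are assumptions made for illustration and would need to be replaced with values that fit the real data.

```python
import pandas as pd

# Hypothetical cleaned data; column names are placeholders.
df = pd.DataFrame({
    "review_date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-02"]),
    "price": [18.0, 25.0, 40.0],
    "rating": [4.1, 4.5, 3.8],
})

# Extract relevant information: split the date into its parts.
df["review_year"] = df["review_date"].dt.year
df["review_month"] = df["review_date"].dt.month
df["review_day"] = df["review_date"].dt.day

# Binning / categorical variable: turn the continuous price into labeled ranges.
# Intervals are (0, 20], (20, 35], (35, inf); the edges are illustrative.
df["price_range"] = pd.cut(
    df["price"],
    bins=[0, 20, 35, float("inf")],
    labels=["low", "medium", "high"],
)

# Interaction feature: a simple ratio of two existing columns.
df["rating_per_price"] = df["rating"] / df["price"]

print(df)
```

If fixed bin edges are hard to justify, `pd.qcut` is an alternative that splits the column into equally sized quantile groups instead.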

Examples of Features for Our Project:

  1. Average Rating: Calculate the average rating based on user ratings.
  2. Review Sentiment: Analyze the text of reviews to determine sentiment (positive, negative, neutral).
  3. Price Range: Categorize prices into ranges, such as low, medium, high.
  4. Popularity Score: Combine the number of reviews with the ratings into a single score that measures the popularity of a sushi restaurant.
  5. Location Features: Use latitude and longitude data to derive features like proximity to landmarks or distance from city center.
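
As a starting point, here is a sketch of the average rating, popularity score, and distance-from-center features. Everything specific in it is an assumption: the table layout, the column names, the coordinates, the city-center location, and in particular the popularity formula (average rating weighted by the log of the review count), which is just one plausible definition. Review sentiment is left out because it needs a text model that the issue does not specify.

```python
import numpy as np
import pandas as pd

# Hypothetical per-review and per-restaurant tables.
reviews = pd.DataFrame({
    "restaurant": ["Sushi A", "Sushi A", "Sushi B", "Sushi B", "Sushi B"],
    "user_rating": [5, 4, 3, 4, 5],
})
restaurants = pd.DataFrame({
    "restaurant": ["Sushi A", "Sushi B"],
    "lat": [35.00, 35.10],   # placeholder coordinates
    "lon": [135.00, 135.20],
})

# Placeholder city center; replace with the real reference point.
CENTER_LAT, CENTER_LON = 35.00, 135.00
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Average rating and review count per restaurant.
stats = (
    reviews.groupby("restaurant")["user_rating"]
    .agg(avg_rating="mean", review_count="count")
    .reset_index()
)

features = restaurants.merge(stats, on="restaurant", how="left")

# Assumed popularity definition: the rating counts for more when it is
# backed by more reviews, with diminishing returns from log1p.
features["popularity_score"] = features["avg_rating"] * np.log1p(
    features["review_count"]
)

# Location feature: distance from the (placeholder) city center.
features["dist_to_center_km"] = haversine_km(
    features["lat"], features["lon"], CENTER_LAT, CENTER_LON
)

print(features)
```

The same `haversine_km` helper can be reused for proximity to landmarks by passing a landmark's coordinates instead of the city center.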

These data cleaning and feature engineering steps should give us a more consistent dataset and features that make patterns easier to spot, which in turn supports more accurate analysis and predictions.

Please share your thoughts and any additional suggestions regarding data cleaning and feature engineering for our project.

@limwualice limwualice changed the title from "Clean dataframe" to "Enhancement: Data Cleaning and Feature Engineering" May 16, 2023
@limwualice limwualice added the good first issue Good for newcomers label May 17, 2023