Spaceship Titanic 🚀 Transport Prediction

Overview

This repository contains a machine learning project for the Kaggle competition "Spaceship Titanic." The goal is to predict which passengers were transported to an alternate dimension during a collision with a spacetime anomaly.

Project Description

In this competition, we use machine learning techniques to analyze data from the Spaceship Titanic's damaged computer system and predict whether passengers were transported.

Project Structure

Introduction
Dependencies Installation
Data Loading
Initial Data Exploration
Feature Engineering and PCA
Data Preprocessing
Model Training and Evaluation (Ensemble Learning)
Hyperparameter Optimization
Feature Importance (Random Forest & Gradient Boosting)
Submission
Conclusion

1. Project Structure

The project follows a complete machine learning pipeline, which includes:

Installation of Dependencies: Installing and importing necessary Python libraries.

Data Loading: Loading the training and testing datasets.

Exploratory Data Analysis (EDA): A first look at the data through visualization and summary statistics.

Feature Engineering: Enhancing the dataset by creating new variables to improve prediction.

Preprocessing: Handling missing values, scaling numeric features, and encoding categorical variables.

Model Building: Training different machine learning models and evaluating their performance.

Hyperparameter Optimization: Using grid search to fine-tune the best model.

Submission: Predicting on the test set and creating a submission file for Kaggle.

Getting Started

Prerequisites

Python 3.x
Required Libraries: numpy, pandas, matplotlib, seaborn, scikit-learn

Installation

Install the required libraries using pip:

```
pip install numpy pandas matplotlib seaborn scikit-learn
```

Usage

Clone the Repository

```
git clone https://github.com/yourusername/spaceship-titanic.git
```

Navigate to the Project Directory
```
cd spaceship-titanic
```
Run the Main Script
```
python main.py
```

Code Explanation

1. Introduction

The goal of this project is to predict if a passenger will be transported using machine learning models.

2. Installation of Dependencies

```python
# Notebook cell: the "!" and "%" lines are IPython magics and only work in Jupyter
!pip install numpy pandas matplotlib seaborn scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

%matplotlib inline
plt.style.use('dark_background')  # Setting dark mode for visualizations
```

3. Loading the Data

```python
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head()
```

4. Initial Data Exploration

```python
train_data.info()
plt.figure(figsize=(8, 6))
sns.countplot(x='Transported', data=train_data, palette='cool')
plt.title('Distribution of Transported')
plt.show()  # Dark mode applied
```

Transported Distribution Graphic

5. Feature Engineering and PCA

```python
# The raw data stores the cabin as "deck/num/side"; split it so that
# 'Deck', 'Num' and 'Side' exist as separate columns
train_data[['Deck', 'Num', 'Side']] = train_data['Cabin'].str.split('/', expand=True)

# Feature engineering: Total Spend and Average Spend
train_data['TotalSpend'] = train_data['RoomService'] + train_data['FoodCourt'] + train_data['ShoppingMall'] + train_data['Spa'] + train_data['VRDeck']
train_data['AvgSpend'] = train_data['TotalSpend'] / 5
# Age 0 is replaced with NaN so the division cannot produce inf, which PCA rejects
train_data['CabinNumRatio'] = pd.to_numeric(train_data['Num'], errors='coerce') / train_data['Age'].replace(0, np.nan)

# PCA for dimensionality reduction (missing values are filled with 0)
X = train_data[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalSpend', 'AvgSpend', 'CabinNumRatio']].fillna(0)
y = train_data['Transported']

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolor='k', alpha=0.7)
plt.title('PCA of Features (2 Components) - Dark Mode')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()  # PCA plot in dark mode
```

PCA of Features (2 Components) Graphic
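
PCA is sensitive to feature scale, so large-valued spending columns can dominate the components when the inputs are not standardized. The following is a minimal sketch, assuming synthetic stand-in data rather than the competition files, of standardizing before PCA and checking how much variance the two components retain. The `demo` frame and its values are illustrative assumptions, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a few numeric features (not the real Kaggle data)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'RoomService': rng.exponential(300, size=200),
    'Spa': rng.exponential(500, size=200),
})

# Standardize first so no single large-valued column dominates the components
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
demo_pca = scaled_pca.fit_transform(demo)

# Fraction of the total variance captured by each retained component
explained = scaled_pca.named_steps['pca'].explained_variance_ratio_
print(demo_pca.shape)
print(explained, explained.sum())
```

If the summed ratio is low, two components may be discarding much of the signal, which is worth checking before training models on the PCA output alone.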

6. Data Preprocessing

```python
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']),
        ('cat', categorical_transformer, ['HomePlanet', 'Destination', 'Deck', 'Side'])
    ])
```
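
The models in the next section are trained on the PCA output, so the preprocessor is not attached to a model in this walkthrough. As a rough sketch of how a `ColumnTransformer` like this one might be chained with a classifier, the example below fits a small pipeline on a toy frame. The toy data, the reduced column set, and the step names `preprocessor` and `classifier` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame with missing values (not the real Kaggle data)
toy = pd.DataFrame({
    'Age': [24.0, np.nan, 58.0, 33.0, 16.0, 44.0],
    'Spa': [0.0, 549.0, np.nan, 3329.0, 565.0, 0.0],
    'HomePlanet': ['Europa', 'Earth', np.nan, 'Europa', 'Earth', 'Mars'],
})
toy_target = [False, True, False, False, True, True]

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), ['Age', 'Spa']),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['HomePlanet']),
])

# Step names are hypothetical choices for this sketch
toy_pipeline = Pipeline([
    ('preprocessor', toy_preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Imputation, scaling and encoding are learned inside fit, then reused by predict
toy_pipeline.fit(toy, toy_target)
toy_predictions = toy_pipeline.predict(toy)
print(toy_predictions)
```

Keeping preprocessing inside the pipeline means the same fitted imputers and encoders are applied to validation and test data without extra bookkeeping.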

7. Model Training and Evaluation (Ensemble Learning)

```python
X_train_pca, X_val_pca, y_train_pca, y_val_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Soft voting averages the predicted class probabilities of both models
ensemble_model = VotingClassifier(estimators=[('rf', rf_model), ('gb', gb_model)], voting='soft')
ensemble_model.fit(X_train_pca, y_train_pca)
y_pred_ensemble = ensemble_model.predict(X_val_pca)
y_proba_ensemble = ensemble_model.predict_proba(X_val_pca)[:, 1]  # P(Transported = True)

# Metrics
accuracy = accuracy_score(y_val_pca, y_pred_ensemble)
f1 = f1_score(y_val_pca, y_pred_ensemble)
roc_auc = roc_auc_score(y_val_pca, y_proba_ensemble)  # ROC AUC is computed from scores, not hard labels

print(f"Ensemble Model Accuracy: {accuracy:.4f}")
print(f"Ensemble Model F1 Score: {f1:.4f}")
print(f"Ensemble Model ROC AUC: {roc_auc:.4f}")

# Confusion matrix
cm = confusion_matrix(y_val_pca, y_pred_ensemble)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples')
plt.title('Confusion Matrix - Ensemble Model (Dark Mode)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```

Confusion Matrix - Random Forest Graphic
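
A single 80/20 split gives one estimate that can shift with the random seed. A minimal sketch of k-fold cross-validation with `cross_val_score` follows, assuming synthetic data from `make_classification` in place of the PCA features and smaller ensembles to keep it quick. The `demo_` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic two-feature binary problem standing in for the PCA features
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=42)

demo_ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=20, random_state=42)),
                ('gb', GradientBoostingClassifier(n_estimators=20, random_state=42))],
    voting='soft',
)

# 5-fold cross-validated accuracy; each fold fits a fresh clone of the ensemble
cv_scores = cross_val_score(demo_ensemble, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"CV accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

The spread across folds gives a rough sense of how much a single validation score might vary.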

8. Hyperparameter Optimization

```python
# rf_model is searched directly, so parameter names carry no pipeline
# step prefix such as 'classifier__'
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(rf_model, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_pca, y_train_pca)

print("Best parameters:", grid_search.best_params_)
```
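
scikit-learn's `step__parameter` naming (for example `classifier__max_depth`) is used when the estimator being searched is a `Pipeline`. The sketch below, which assumes a hypothetical step name `classifier`, synthetic data, and a deliberately small grid, shows how such prefixed names line up with a pipeline step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# The step name 'classifier' determines the prefix used in the grid below
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42)),
])

demo_grid = {
    'classifier__n_estimators': [10, 20],
    'classifier__max_depth': [None, 5],
}

# 2 x 2 parameter combinations, each evaluated with 3-fold cross-validation
demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)
print(demo_search.best_params_)
```

A parameter name whose prefix does not match any step raises an error at fit time, so the prefix and the step name need to agree.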

9. Feature Importance (Random Forest & Gradient Boosting)

```python
# VotingClassifier.fit already trained a clone of each model; estimators_
# holds them in the order they were passed in, so no refit is needed
feature_importance_rf = ensemble_model.estimators_[0].feature_importances_  # Random Forest
feature_importance_gb = ensemble_model.estimators_[1].feature_importances_  # Gradient Boosting

# The models were trained on the two principal components, not the raw columns
importance_df = pd.DataFrame({
    'Feature': ['PC1', 'PC2'],
    'RandomForest': feature_importance_rf,
    'GradientBoosting': feature_importance_gb
})

# Long format: one row per (feature, model) pair for grouped bars
importance_df = pd.melt(importance_df, id_vars=['Feature'], var_name='Model', value_name='Importance')

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', hue='Model', data=importance_df, palette='coolwarm')
plt.title('Feature Importance by Model (Random Forest vs Gradient Boosting)')
plt.tight_layout()
plt.show()
```

Feature Importance by Model (Random Forest vs Gradient Boosting) Graphic
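
Importances expressed over PC1 and PC2 do not say directly which original columns matter. One way to relate the components back to the inputs is to inspect the PCA loadings in `components_`. The following is a sketch on synthetic data; the `demo` frame and its column values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for a few numeric features (not the real Kaggle data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=100),
    'FoodCourt': rng.exponential(400, size=100),
    'VRDeck': rng.exponential(300, size=100),
})

demo_pca = PCA(n_components=2).fit(demo)

# components_ has shape (n_components, n_features); transpose so rows are features
loadings = pd.DataFrame(demo_pca.components_.T, index=demo.columns, columns=['PC1', 'PC2'])
print(loadings.round(3))
```

A feature with a large absolute loading on a component contributes strongly to it, so an important component points back to those features.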

10. Submission

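```python
# Rebuild the engineered features on the test set the same way as for training
test_data[['Deck', 'Num', 'Side']] = test_data['Cabin'].str.split('/', expand=True)
test_data['TotalSpend'] = (test_data['RoomService'] + test_data['FoodCourt'] +
                           test_data['ShoppingMall'] + test_data['Spa'] + test_data['VRDeck'])
test_data['AvgSpend'] = test_data['TotalSpend'] / 5
test_data['CabinNumRatio'] = pd.to_numeric(test_data['Num'], errors='coerce') / test_data['Age'].replace(0, np.nan)

X_test = test_data[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalSpend', 'AvgSpend', 'CabinNumRatio']].fillna(0)
X_test_pca = pca.transform(X_test)  # reuse the PCA fitted on the training data

test_predictions = ensemble_model.predict(X_test_pca)

submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Transported': test_predictions})
submission.to_csv('submission.csv', index=False)
```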

11. Conclusion

This project walks through a complete machine learning pipeline: feature engineering, PCA, and a soft-voting ensemble of Random Forest and Gradient Boosting, followed by grid-search hyperparameter tuning of the Random Forest. Visualizations use Matplotlib's dark background style. Accuracy, F1 score, and ROC AUC are reported on a 20% validation split.

Jupyter Notebook

# Spaceship Titanic - Transport Prediction 🚀

## 1. Introduction

This notebook aims to predict whether a passenger aboard the Spaceship Titanic will be transported to another dimension using machine learning algorithms. We will use the Kaggle Spaceship Titanic dataset, explore the data,

Copyright 2024 Mindful-AI-Assistants. Code released under the MIT license.
