
Leveraging machine learning on Medicare claims data to identify and prevent fraudulent practices among healthcare providers.


rashmishreev/healthcare-fraud-detection-ml


Pandas Matplotlib NumPy scikit-learn

Healthcare Provider Fraud Detection Using Machine Learning

Project Overview

This project addresses the critical issue of healthcare provider fraud using machine learning techniques. By analyzing Medicare claims data, it aims to identify potentially fraudulent claims and providers, helping to reduce financial losses and improve the integrity of the healthcare system.

Problem Statement

Healthcare provider fraud, including false claims, unnecessary treatments, and billing for services that were never rendered, costs billions of dollars annually. This project uses machine learning algorithms to classify claims as fraudulent or legitimate and to identify the key features that contribute to accurate fraud prediction.

Data Source

The dataset, from Kaggle, includes:

  • Inpatient and outpatient claims data
  • Beneficiary details (demographics, enrollment info, chronic conditions)
  • Provider information
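As a rough illustration of how these three kinds of tables might be combined into one modeling table, the sketch below merges toy stand-ins with pandas. The column names (`BeneID`, `ClaimID`, `Provider`, `ClaimAmtReimbursed`, `PotentialFraud`) and the `IsInpatient` and `is_fraud` helper columns are assumptions made for this example, not names confirmed by this README.

```python
import pandas as pd

# Toy stand-ins for the real tables; every column name here is an assumption.
beneficiaries = pd.DataFrame({
    "BeneID": ["B1", "B2"],
    "Gender": [1, 2],
    "ChronicCond_Diabetes": [1, 0],
})
inpatient = pd.DataFrame({
    "ClaimID": ["C1"],
    "BeneID": ["B1"],
    "Provider": ["P1"],
    "ClaimAmtReimbursed": [9000],
})
outpatient = pd.DataFrame({
    "ClaimID": ["C2", "C3"],
    "BeneID": ["B2", "B1"],
    "Provider": ["P2", "P1"],
    "ClaimAmtReimbursed": [120, 300],
})
providers = pd.DataFrame({
    "Provider": ["P1", "P2"],
    "PotentialFraud": ["Yes", "No"],
})

# Tag each claim with its setting, then stack inpatient and outpatient claims.
claims = pd.concat(
    [inpatient.assign(IsInpatient=1), outpatient.assign(IsInpatient=0)],
    ignore_index=True,
)

# Attach beneficiary details and the provider-level label to every claim.
# validate= raises an error if a key is unexpectedly duplicated on the right side.
merged = (
    claims
    .merge(beneficiaries, on="BeneID", how="left", validate="many_to_one")
    .merge(providers, on="Provider", how="left", validate="many_to_one")
)

# Binary target: 1 if the claim's provider is labeled as potentially fraudulent.
merged["is_fraud"] = (merged["PotentialFraud"] == "Yes").astype(int)
```

Left joins keep every claim even when a beneficiary or provider record is missing, which makes gaps visible as NaN values rather than silently dropping rows.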

Methodology

  1. Exploratory Data Analysis
  2. Data Preprocessing and Transformation
  3. Feature Engineering
  4. Data Normalization
  5. Feature Selection
  6. Model Training and Testing
    • Algorithms: Logistic Regression, Random Forest, Decision Trees, XGBoost
  7. Hyperparameter Tuning
  8. Model Evaluation
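The later steps (normalization, feature selection, training, and hyperparameter tuning) can be chained in a single scikit-learn pipeline. The sketch below is a minimal example on synthetic data; the choice of `SelectKBest`, the parameter grid, and the 3-fold cross-validation are assumptions for illustration and are not taken from the project's notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered claim features (assumption).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 4-6: normalize, select features, train a model.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Step 7: tune hyperparameters with cross-validated grid search.
param_grid = {
    "select__k": [5, 10],
    "model__max_depth": [3, 5, 10],
    "model__min_samples_split": [2, 20],
}
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Step 8: evaluate the refit best pipeline on held-out data.
best_params = search.best_params_
test_auc = search.score(X_test, y_test)
```

Keeping scaling and selection inside the pipeline means they are refit on each training fold, so the cross-validation scores are not inflated by information from the validation fold.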

Evaluation Metrics

  • AUC (Area Under the Curve)
  • F1-Score
  • Accuracy
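For reference, the three metrics can be computed with scikit-learn as sketched below. The labels and scores are made-up values, and the 0.5 decision threshold is an assumption; AUC is computed from the raw scores, while F1 and accuracy need thresholded predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up ground truth (1 = fraudulent) and predicted fraud probabilities.
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

# Threshold the scores at 0.5 (an assumed cutoff) to get hard predictions.
y_pred = [int(s >= 0.5) for s in y_score]

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
f1 = f1_score(y_true, y_pred)              # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)       # ranking quality, threshold-free
```

On this toy input, four of six predictions are correct, precision and recall are both 2/3, and eight of the nine positive-negative pairs are ranked correctly, so accuracy and F1 are about 0.667 and AUC is about 0.889.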

Key Findings

Procedure and Diagnosis Codes

  • Inpatients: The most common procedure code is 4019 (general surgery), and the most frequent diagnosis code is also 4019 (unspecified essential hypertension).

  • Outpatients: The most common procedure code is 9904 (general medical procedures), and diagnosis code 4019 is again the most frequent.

Claim Reimbursement Patterns

  • For hospital stays (inpatient claims): Most reimbursements are between $0 and $10,000. A few claims have much higher amounts, but these are less common.
  • For outpatient visits (no overnight stay): Almost all claims (99.9%) are $3,500 or less. Any claims above $3,500 are unusual and might be for very expensive procedures or could potentially be fraudulent.

This pattern helps us understand what typical medical costs look like and identify any unusually high claims that might need closer inspection.

Figure: Distribution of Claim Amount Reimbursement for Inpatient (left) and Outpatient (right) services.
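One simple way to turn this observation into a screening rule is to flag claims above a high percentile of the reimbursement distribution. The sketch below uses synthetic, right-skewed amounts; the `ClaimAmtReimbursed` column name, the `unusually_high` flag, and the lognormal parameters are assumptions for illustration, so the resulting threshold will not match the $3,500 figure above.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed reimbursement amounts (assumption, not project data).
rng = np.random.default_rng(0)
outpatient = pd.DataFrame({
    "ClaimAmtReimbursed": rng.lognormal(mean=5.0, sigma=1.0, size=10_000),
})

# The 99.9th percentile plays the role of the "unusual claim" cutoff.
threshold = outpatient["ClaimAmtReimbursed"].quantile(0.999)

# Flag claims strictly above the cutoff for closer inspection.
outpatient["unusually_high"] = outpatient["ClaimAmtReimbursed"] > threshold
share_flagged = outpatient["unusually_high"].mean()
```

A flag like this only marks claims as candidates for review; a high amount alone does not establish fraud.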

Financial Impact

  • In 2009, approximately $290 million was lost to fraud.
  • $241 million was lost in the inpatient setting, and $54 million in the outpatient setting.

Age Distribution

  • Higher concentration of potential fraud cases among patients over 65.

These insights highlight the complexity of healthcare fraud detection and the importance of thorough data analysis and preprocessing in developing effective machine learning models for fraud identification.

Results

Using All Features

| Model | Hyperparameters | Accuracy | F1 Score | AUC |
| --- | --- | --- | --- | --- |
| Logistic Regression | penalty: 'l2', C: 10.0 | 0.6298 | 0.4829 | 0.5875 |
| Decision Tree | max_depth: 50, min_samples_split: 270 | 0.7522 | 0.6951 | 0.8227 |
| Random Forest | criterion: 'gini', max_depth: 8, max_features: 'auto', n_estimators: 300 | 0.6387 | 0.5495 | 0.6576 |
| XGBoost | n_estimators: 100, eta: 0.3 | 0.7623 | 0.6929 | 0.8177 |

Using Important Features

| Model | Hyperparameters | Accuracy | F1 Score | AUC |
| --- | --- | --- | --- | --- |
| Logistic Regression | penalty: 'l2', C: 1000.0 | 0.6287 | 0.48406 | 0.5846 |
| Decision Tree | max_depth: 50, min_samples_split: 270 | 0.7525 | 0.6954 | 0.8227 |
| Random Forest | criterion: 'entropy', max_depth: 8, max_features: 'auto', n_estimators: 500 | 0.6352 | 0.5529 | 0.6615 |
| XGBoost | n_estimators: 50, eta: 0.3 | 0.7519 | 0.6786 | 0.8063 |

  • The Decision Tree model performed consistently well across both feature sets, achieving the highest AUC (0.8227) in each.
  • XGBoost also performed strongly, particularly with all features, where it reached the highest accuracy of any model (0.7623) and an AUC of 0.8177.
  • Feature selection had little effect overall: the Decision Tree's accuracy and F1 score rose marginally (0.7522 to 0.7525 and 0.6951 to 0.6954) with AUC unchanged, Random Forest gained slightly in F1 score and AUC but lost accuracy, and XGBoost declined on all three metrics.
  • Logistic Regression had the lowest accuracy, F1 score, and AUC of the models tested on both feature sets.
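A comparison between all features and a reduced set of important features can be sketched as follows. This README does not say how the important features were chosen, so the use of `SelectFromModel` with Random Forest importances and a median threshold is an assumption, as are the synthetic data and the `holdout_auc` helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered claim features (assumption).
X, y = make_classification(
    n_samples=800, n_features=25, n_informative=6, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# Keep features whose forest importance is at or above the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=1),
    threshold="median",
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)


def holdout_auc(X_tr, X_te):
    """Fit a fixed decision tree and return its AUC on the held-out split."""
    model = DecisionTreeClassifier(max_depth=5, random_state=1)
    model.fit(X_tr, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_te)[:, 1])


auc_all = holdout_auc(X_train, X_test)
auc_selected = holdout_auc(X_train_sel, X_test_sel)
```

Fitting the selector on the training split only, and reusing the same model settings for both runs, keeps the two AUC values comparable.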
