Predicting 30-Day Hospital Readmissions for Diabetic Patients
Overview
This project focuses on predicting whether a diabetic patient will be readmitted to the hospital within 30 days of discharge using real-world clinical data. Hospital readmissions are costly and often indicate gaps in patient care, making accurate prediction especially important in healthcare settings where false negatives can have serious consequences.
The goal of this project was to build an interpretable classification model that prioritizes recall, ensuring that high-risk patients are identified and can receive preventative follow-up care.
Dataset
- Source: UCI Machine Learning Repository
- Dataset: Diabetes 130-US Hospitals for Years 1999–2008
- Size: 101,766 patient encounters
- Features: 50 variables including demographics, diagnoses, lab procedures, medications, and hospital utilization
Target Variable
`readmitted` → transformed into a binary indicator of whether a patient was readmitted at all (within or after 30 days).
Data Cleaning & Preparation
- Replaced missing value placeholders (`?`) with `NaN`
- Dropped features with excessive missingness or limited predictive value
- Imputed missing categorical values (race and diagnosis codes) as `"Unknown"`
- Removed identifier columns to prevent data leakage
- One-hot encoded categorical variables
- Created a binary readmission label for modeling
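The cleaning steps above could be sketched in pandas roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_encounters` is hypothetical, and the specific columns treated as sparse (`weight`, `payer_code`, `medical_specialty`) or as identifiers (`encounter_id`, `patient_nbr`) are assumptions, since the list above does not name them.

```python
import numpy as np
import pandas as pd


def clean_encounters(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw encounters table (sketch)."""
    # Treat the "?" placeholder as a real missing value.
    df = df.replace("?", np.nan)

    # Assumed column choices; the write-up does not say which were dropped.
    sparse_cols = ["weight", "payer_code", "medical_specialty"]
    id_cols = ["encounter_id", "patient_nbr"]
    df = df.drop(columns=sparse_cols + id_cols, errors="ignore")

    # Fill gaps in race and diagnosis codes with an explicit category.
    for col in ["race", "diag_1", "diag_2", "diag_3"]:
        if col in df.columns:
            df[col] = df[col].fillna("Unknown")

    # Binary label: 1 for any readmission ("<30" or ">30"), 0 for "NO".
    df["readmitted"] = (df["readmitted"] != "NO").astype(int)

    # One-hot encode every remaining non-numeric column.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    return pd.get_dummies(df, columns=cat_cols)
```

Creating the label before encoding keeps `readmitted` numeric, so it is not expanded into dummy columns along with the categorical features.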
Exploratory Data Analysis (EDA)
Key insights from the exploratory analysis:
- Average hospital stay is approximately 4 days
- Majority of patients are over age 50
- Emergency and urgent admissions dominate the dataset
- Medication counts and length of stay show modest correlation with readmission
- Insulin prescription patterns suggest potential under-adjustment of treatment plans
- Demographic imbalance may impact model fairness
Visualizations included:
- Correlation heatmaps
- Pair plots of clinical utilization variables
- Demographic distributions (age, race, gender)
- Readmission rates by admission type and insulin status
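A breakdown like the last item can be computed with a simple group-by. The sketch below uses a tiny made-up table rather than the real dataset; the helper `readmission_rate_by` and the `admission_type` column name are illustrative assumptions, and `readmitted` is assumed to be already binary.

```python
import pandas as pd

# Toy encounters, invented for illustration only.
df = pd.DataFrame({
    "admission_type": ["Emergency", "Emergency", "Urgent", "Elective", "Urgent", "Emergency"],
    "insulin": ["Steady", "No", "Up", "No", "Steady", "Down"],
    "readmitted": [1, 0, 1, 0, 0, 1],
})


def readmission_rate_by(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the readmission rate and encounter count for each group."""
    return (
        frame.groupby(column)["readmitted"]
        .agg(rate="mean", encounters="size")
        .sort_values("rate", ascending=False)
    )


print(readmission_rate_by(df, "admission_type"))
print(readmission_rate_by(df, "insulin"))
```

Reporting the encounter count next to each rate helps flag groups that are too small for the rate to be meaningful.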
Modeling Approach
Model: Logistic Regression
Logistic regression was chosen for its interpretability and suitability as a baseline model in a high-dimensional healthcare dataset.
Three model configurations were evaluated:
- Standard Logistic Regression
- Class-Balanced Logistic Regression
- Balanced Logistic Regression with Custom Probability Thresholds
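The three configurations can be illustrated with scikit-learn as below. This is a sketch under stated assumptions, not the project's notebook: the data is a synthetic stand-in generated with `make_classification`, the hyperparameters (`max_iter=1000`, the split ratio) are arbitrary, and `predict_at_threshold` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix (not the project's data).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.6, 0.4], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 1. Standard logistic regression.
standard = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Class-balanced: errors on the rarer class are weighted more heavily.
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)


def predict_at_threshold(model, features, threshold=0.35):
    """Label a sample positive when its predicted probability meets the threshold."""
    return (model.predict_proba(features)[:, 1] >= threshold).astype(int)


# 3. Balanced model with a custom probability threshold instead of 0.5.
preds = {
    "standard": standard.predict(X_test),
    "balanced": balanced.predict(X_test),
    "balanced + threshold 0.35": predict_at_threshold(balanced, X_test, 0.35),
}
for name, y_pred in preds.items():
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}")
```

Lowering the threshold only adds positive predictions, so recall for the thresholded variant can never fall below that of the same model at the default cutoff.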
Results
| Model Variant | Recall (Readmitted) | Precision | Accuracy | ROC-AUC |
|---|---|---|---|---|
| Standard Logistic Regression | 0.49 | 0.62 | 0.63 | 0.67 |
| Balanced Logistic Regression | 0.58 | 0.60 | 0.63 | 0.67 |
| Balanced + Threshold (0.35) | 0.89 | 0.50 | 0.55 | 0.67 |
Key takeaway:
Adjusting class weights and lowering the decision threshold to 0.35 raised recall from 0.49 to 0.89, capturing most readmissions. Precision fell from 0.62 to 0.50 and accuracy from 0.63 to 0.55, while ROC-AUC stayed at 0.67. This tradeoff is appropriate in a healthcare context where missing an at-risk patient is more costly than a false positive.
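The mechanics of this tradeoff can be seen in a small worked example. The labels and probabilities below are made up for illustration and are unrelated to the reported results; `recall_precision_at` is a hypothetical helper.

```python
import numpy as np


def recall_precision_at(y_true, y_prob, threshold):
    """Compute recall and precision for the positive class at a given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy labels and predicted probabilities, invented for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.45, 0.4, 0.55, 0.38, 0.36, 0.1]

for t in (0.5, 0.35):
    r, p = recall_precision_at(y_true, y_prob, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
# threshold=0.5:  2 of 4 positives caught, 1 false positive
# threshold=0.35: 4 of 4 positives caught, 3 false positives
```

In this toy case, moving the cutoff from 0.5 to 0.35 recovers the two positives scored 0.45 and 0.4, at the cost of flagging two additional negatives.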
Technologies Used
- JupyterLab
- Matplotlib
- NumPy
- Pandas
- Python
- Scikit-learn
- Seaborn
Future Improvements
- Evaluate tree-based models (Random Forest, Gradient Boosting)
- Add model explainability (feature importance, SHAP)
- Address demographic imbalance with fairness-aware methods
- Incorporate temporal admission patterns
Why This Project Matters
This project demonstrates applied machine learning in healthcare, emphasizing ethical tradeoffs, evaluation beyond accuracy, and decision-making under uncertainty using real-world clinical data.