Fair Pricing Predictions in the Used Car Market
Overview
This project analyzes a large used vehicle sales dataset to build predictive models that estimate fair vehicle selling prices. The analysis includes extensive data cleaning, feature engineering, and model comparison using multiple regression techniques to ensure that price predictions are realistic and unbiased.
The primary objective was to understand which features influence vehicle prices and to develop models that provide accurate, equitable estimates while avoiding data leakage caused by pre-existing estimated price variables.
The project demonstrates an end-to-end data mining workflow, including exploratory data analysis, data preparation, feature encoding, and predictive modeling, aimed at empowering buyers and sellers with fair market insights.
Dataset
Source: Used car auction dataset
Observations: ~558,000 vehicle sales records
Final cleaned dataset: 544,188 records
Features: 16 original variables including:
- Vehicle information (year, make, model, trim, body)
- Vehicle condition and mileage
- Location and seller information
- Interior and exterior color
- Transmission type
- Estimated market price (MMR)
- Final selling price
Key variables used for modeling:
- Vehicle year
- Vehicle condition
- Odometer reading
- Make, model, trim
- Body type
- Transmission
- State
- Interior and exterior color
- Engineered feature: vehicle age
Target variable:
- Selling Price – used as a proxy for fair market value
Exploratory Data Analysis
Several visualizations were created to better understand relationships between vehicle features and selling price. This analysis helps identify factors that contribute to fair pricing.
Selling Price by Model Year
Boxplots revealed that newer model years generally have higher selling prices, but luxury brands and high-performance trims introduce significant outliers. Understanding these patterns is critical for avoiding over- or underestimation.
Selling Price by Vehicle Condition
Average selling prices increase as condition ratings improve, although the dataset mixes two condition rating scales (1–5 and 1–50). Normalizing these ratings to a single scale helps ensure consistent, fair price estimates.
Correlation Analysis
A correlation heatmap showed:
- Strong correlation between MMR and selling price
- Negative correlation between odometer and vehicle year
- Moderate correlation between vehicle year and price
These observations suggested that additional categorical variables (make, model, trim) could significantly influence price predictions while maintaining fairness.
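A minimal sketch of how such a correlation matrix can be computed with pandas. The data here is synthetic and the column names (`year`, `odometer`, `sellingprice`, `mmr`) are assumptions chosen for illustration, so the numbers will not match the project's heatmap.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the numeric columns; values and names are illustrative only.
rng = np.random.default_rng(0)
n = 500
year = rng.integers(2000, 2016, n)
odometer = (2015 - year) * 12_000 + rng.normal(0, 10_000, n)
sellingprice = 5_000 + 1_200 * (year - 2000) - 0.02 * odometer + rng.normal(0, 2_000, n)
mmr = sellingprice + rng.normal(0, 800, n)  # price estimate that closely tracks the target

df = pd.DataFrame({
    "year": year,
    "odometer": odometer,
    "sellingprice": sellingprice,
    "mmr": mmr,
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))
```

The resulting matrix can be passed to Seaborn's `heatmap` function to draw the plot.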
Scatterplot Matrix
Pairwise feature comparisons helped visualize relationships between numerical variables and confirmed correlations observed in the heatmap.
Data Processing
Feature Removal
Certain variables were removed due to lack of predictive value:
- VIN – unique identifier
- Seller – business names without useful categorization
The MMR (Manheim Market Report) variable was initially retained for comparison but later removed to prevent data leakage and ensure models reflect fair pricing based on real vehicle characteristics.
Feature Engineering
Vehicle Age
A new feature was created to better capture depreciation effects:
Vehicle Age = Sale Year − Model Year
This provides a more meaningful representation of how vehicle age affects price, ensuring equitable estimation across vehicles of different ages.
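The formula can be sketched in pandas as follows. The column names `year` and `saledate` are assumptions, and clipping negative ages to zero (for vehicles sold before their model year begins) is an illustrative choice rather than a step described in this project.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed for illustration.
df = pd.DataFrame({
    "year": [2012, 2014, 2015],
    "saledate": ["2014-12-16", "2015-01-20", "2014-11-05"],
})

# Extract the sale year from the sale date, then subtract the model year.
sale_year = pd.to_datetime(df["saledate"]).dt.year
df["vehicle_age"] = (sale_year - df["year"]).clip(lower=0)

print(df["vehicle_age"].tolist())  # [2, 1, 0]
```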
Data Cleaning
Condition Normalization
The dataset contained multiple condition rating scales (1–5 and 1–50). Values were normalized to a standard 1–5 scale to support consistent and fair price predictions.
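One possible way to do this is sketched below. It assumes that any rating above 5 belongs to the 1–50 scale and can be divided by 10. Ratings of 5 or below are treated as already on the 1–5 scale, which cannot be verified from the value alone. The function name `normalize_condition` is hypothetical.

```python
import pandas as pd

def normalize_condition(cond: pd.Series) -> pd.Series:
    """Map mixed 1-5 and 1-50 condition ratings onto a single 1-5 scale."""
    cond = cond.astype(float)
    # Assumption: values above 5 come from the 1-50 scale, so divide them by 10.
    rescaled = cond.where(cond <= 5, cond / 10)
    # Keep results inside the target range; missing values stay missing.
    return rescaled.clip(lower=1, upper=5)

raw = pd.Series([4.5, 45, 2, 19, 50])
normalized = normalize_condition(raw)
print(normalized.tolist())  # [4.5, 4.5, 2.0, 1.9, 5.0]
```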
Categorical Standardization
Several categorical variables required cleaning:
- Make
- Model
- Trim
- Body
- Transmission
- State
- Color
- Interior
Common fixes included:
- Converting text to lowercase
- Removing whitespace inconsistencies
- Correcting spelling variations
- Consolidating proprietary body types into standardized categories
- Removing extremely rare categories to reduce noise
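The fixes listed above might look like this for a single column. The spelling map, the `MIN_COUNT` threshold, and the helper name `clean_category` are all hypothetical; the actual mappings and cutoffs used in the project are not stated here.

```python
import pandas as pd

df = pd.DataFrame({
    "make": ["Ford", " ford ", "FORD", "chev truck", "Chevrolet", "chevrolet", "Yugo"],
})

# Hypothetical spelling map; a real one would be built by inspecting value counts.
SPELLING_FIXES = {"chev truck": "chevrolet"}

def clean_category(values: pd.Series) -> pd.Series:
    """Lowercase, trim and collapse whitespace, then apply known spelling fixes."""
    cleaned = values.str.lower().str.strip().str.replace(r"\s+", " ", regex=True)
    return cleaned.replace(SPELLING_FIXES)

df["make"] = clean_category(df["make"])

# Drop rows whose category appears fewer than MIN_COUNT times (illustrative threshold).
MIN_COUNT = 2
counts = df["make"].value_counts()
df = df[df["make"].isin(counts[counts >= MIN_COUNT].index)].reset_index(drop=True)

print(df["make"].value_counts().to_dict())
```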
Handling Missing Data
Missing values ranged from 0.02% to 11.6% across columns.
A custom imputation function filled each missing value with the most common value (the mode) among similar vehicles, where similarity is defined by grouping on related features. This keeps imputed values consistent across comparable vehicle types.
Example imputation strategies:
| Feature | Imputation Strategy |
|---|---|
| Transmission | Grouped by make, model, trim |
| Color | Grouped by make, model, trim |
| Interior | Grouped by make, model, trim |
| Condition | Grouped by make, model, odometer or vehicle age |
| Body | Grouped by make, model, trim |
After imputation and removal of minimal remaining null rows, the final dataset contained 544,188 complete records.
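A minimal sketch of group-wise mode imputation, as one way such a function could be written. The name `impute_group_mode` and its signature are hypothetical and are not taken from the project's code. Groups with no observed value are left missing, which corresponds to the remaining null rows that are dropped afterwards.

```python
import numpy as np
import pandas as pd

def impute_group_mode(df: pd.DataFrame, target: str, group_cols: list) -> pd.DataFrame:
    """Fill missing `target` values with the most common value in each group."""
    def group_mode(values: pd.Series):
        modes = values.mode()  # mode() ignores missing values
        return modes.iloc[0] if not modes.empty else np.nan

    out = df.copy()
    # Broadcast each group's mode back to its rows, then fill only the gaps.
    modes = out.groupby(group_cols)[target].transform(group_mode)
    out[target] = out[target].fillna(modes)
    return out

df = pd.DataFrame({
    "make": ["ford", "ford", "ford", "ford", "bmw"],
    "model": ["focus", "focus", "focus", "focus", "3 series"],
    "trim": ["se", "se", "se", "se", "328i"],
    "transmission": ["automatic", "automatic", None, "manual", None],
})

filled = impute_group_mode(df, "transmission", ["make", "model", "trim"])
print(filled["transmission"].tolist())
```

In this example the missing Ford Focus value becomes `automatic`, while the BMW row stays missing because its group has no observed transmission.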
Feature Encoding
Because the dataset contains many categorical features, one-hot encoding was applied using pd.get_dummies().
Categorical variables encoded:
- Make
- Model
- Trim
- Body
- Transmission
- State
- Color
- Interior
This expanded the dataset to:
- 1,822 total features
- ~1 GB in memory after encoding
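The encoding step can be illustrated on a toy frame. The column names are assumptions, and passing `dtype="uint8"` is an optional choice shown here because it keeps the indicator columns small; whether the project used it is not stated.

```python
import pandas as pd

df = pd.DataFrame({
    "odometer": [42_000, 15_500, 88_200],
    "make": ["ford", "bmw", "ford"],
    "transmission": ["automatic", "manual", "automatic"],
})

# Each category value becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["make", "transmission"], dtype="uint8")

print(list(encoded.columns))
print(f"{encoded.memory_usage(deep=True).sum()} bytes")
```

With high-cardinality columns such as model and trim, the number of indicator columns grows with the number of distinct values, which is how a handful of categorical variables expands into well over a thousand features.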
Modeling
Three regression approaches were used to predict fair vehicle selling prices:
- Linear Regression
- Ridge Regression
- Lasso Regression
Data was split into 80% training and 20% testing sets.
Feature scaling was applied using StandardScaler so that coefficients are comparable across features and the Ridge and Lasso penalties treat every feature on the same scale.
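The setup described above can be sketched with scikit-learn as follows. The data is synthetic, and the `alpha` values and `random_state` are placeholder assumptions, since the project's hyperparameters are not listed here. Wrapping the scaler and model in a pipeline is a design choice of this sketch: it ensures the scaler is fit on the training split only.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and selling price.
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(0, 15, n)
odometer = age * 12_000 + rng.normal(0, 8_000, n)
condition = rng.uniform(1, 5, n)
X = np.column_stack([age, odometer, condition])
y = 30_000 - 1_200 * age - 0.05 * odometer + 1_500 * condition + rng.normal(0, 1_500, n)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # alpha is a placeholder, not a tuned value
    "lasso": Lasso(alpha=1.0),   # alpha is a placeholder, not a tuned value
}

results = {}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_test, pred),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```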
Model Comparison
Two modeling scenarios were evaluated:
Model Including MMR
The MMR column (market price estimate) was included as a feature.
Results:
- R² ≈ 0.97
- Very low prediction error
However, coefficient analysis revealed that the model relied heavily on MMR, indicating data leakage and reduced transparency in pricing fairness.
Model Without MMR
After removing the MMR column, the model relied on actual vehicle characteristics such as:
- Condition
- Vehicle age
- Odometer
- Make and model
- Geographic location
This produced more realistic performance and better reflected equitable price determination, free from bias introduced by pre-existing market estimates.
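The effect of a leaky feature can be reproduced on synthetic data. In the sketch below, `mmr` is a hypothetical stand-in built as the target plus a small amount of noise; the column names and all figures are illustrative and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
age = rng.uniform(0, 15, n)
condition = rng.uniform(1, 5, n)
price = 25_000 - 1_500 * age + 2_000 * condition + rng.normal(0, 3_000, n)

features = pd.DataFrame({
    "vehicle_age": age,
    "condition": condition,
    "mmr": price + rng.normal(0, 500, n),  # near copy of the target (leaky)
})

def fit_and_score(cols):
    """Fit a scaled linear model on `cols`; return test R^2 and coefficients."""
    X_train, X_test, y_train, y_test = train_test_split(
        features[cols], price, test_size=0.2, random_state=42
    )
    pipe = make_pipeline(StandardScaler(), LinearRegression())
    pipe.fit(X_train, y_train)
    coefs = dict(zip(cols, pipe.named_steps["linearregression"].coef_))
    return pipe.score(X_test, y_test), coefs

r2_with, coefs_with = fit_and_score(["vehicle_age", "condition", "mmr"])
r2_without, coefs_without = fit_and_score(["vehicle_age", "condition"])

print(f"R^2 with mmr:    {r2_with:.3f}")
print(f"R^2 without mmr: {r2_without:.3f}")
print({k: round(v) for k, v in coefs_with.items()})
```

Because the features are standardized, coefficient magnitudes are directly comparable, and the `mmr` coefficient dominates the others when it is included.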
Key Insights
- Vehicle age and condition are strong predictors of resale value.
- Odometer readings correlate with depreciation but vary significantly by vehicle type.
- Vehicle make and model introduce large price variation due to brand and trim differences.
- Pre-existing price estimates such as MMR can artificially inflate model accuracy, highlighting the importance of fair and transparent modeling.
Technologies Used
- JupyterLab
- Matplotlib
- NumPy
- Pandas
- Python
- Scikit-learn
- Seaborn
Outputs
- Exploratory visualizations
- Cleaned and engineered dataset
- Regression model comparisons
- Model evaluation metrics (MAE, MSE, RMSE, R²)
Why This Project Matters
This project demonstrates the complete data mining pipeline required for real-world predictive modeling, including large-scale data cleaning, feature engineering, categorical encoding, and regression analysis.