Fair Pricing Predictions in the Used Car Market

Overview

This project analyzes a large used vehicle sales dataset to build predictive models that estimate fair vehicle selling prices. The analysis includes extensive data cleaning, feature engineering, and model comparison using multiple regression techniques to ensure that price predictions are realistic and unbiased.

The primary objective was to understand which features influence vehicle prices and to develop models that provide accurate, equitable estimates while avoiding data leakage caused by pre-existing estimated price variables.

The project demonstrates an end-to-end data mining workflow, including exploratory data analysis, data preparation, feature encoding, and predictive modeling, aimed at empowering buyers and sellers with fair market insights.


Dataset

Source: Used car auction dataset

Observations: ~558,000 vehicle sales records

Final cleaned dataset: 544,188 records

Features: 16 original variables including:

Key variables used for modeling:

Target variable:


Exploratory Data Analysis

Several visualizations were created to better understand relationships between vehicle features and selling price. This analysis helps identify factors that contribute to fair pricing.

Selling Price by Model Year

Boxplots revealed that newer model years generally have higher selling prices, but luxury brands and high-performance trims introduce significant outliers. Understanding these patterns is critical for avoiding over- or underestimation.

Selling Price by Vehicle Condition

Average selling prices increase as condition ratings improve, although the dataset revealed multiple condition rating scales. Normalizing these ratings helps ensure consistent, fair price estimates.

Correlation Analysis

A correlation heatmap showed:

These observations suggested that additional categorical variables (make, model, trim) could significantly influence price predictions while maintaining fairness.

Scatterplot Matrix

Pairwise feature comparisons helped visualize relationships between numerical variables and confirmed correlations observed in the heatmap.
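As a rough illustration of the numbers behind a correlation heatmap, the sketch below computes a Pearson correlation matrix with pandas. The column names (`year`, `odometer`, `condition`, `sellingprice`) and the five rows of data are made up for the example and are not taken from the project's dataset.

```python
import pandas as pd

# Tiny made-up sample; column names and values are assumptions for illustration.
cars = pd.DataFrame({
    "year": [2010, 2012, 2013, 2014, 2015],
    "odometer": [120000, 90000, 60000, 40000, 15000],
    "condition": [2.0, 3.0, 3.5, 4.0, 4.5],
    "sellingprice": [4500, 8000, 11000, 14500, 19000],
})

# Pairwise Pearson correlations between all numeric columns.
corr = cars.corr()

# Correlation of each feature with the target, sorted from most negative to most positive.
price_corr = corr["sellingprice"].drop("sellingprice").sort_values()
print(price_corr)
```

A matrix like `corr` is what a heatmap function would render; sorting one column of it is a quick way to see which numerical features move with or against price.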


Data Processing

Feature Removal

Certain variables were removed due to lack of predictive value:

The MMR (Manheim Market Report) variable was initially retained for comparison but later removed to prevent data leakage and ensure models reflect fair pricing based on real vehicle characteristics.


Feature Engineering

Vehicle Age

A new feature was created to better capture depreciation effects:

Vehicle Age = Sale Year − Model Year

This provides a more meaningful representation of how vehicle age affects price, ensuring equitable estimation across vehicles of different ages.
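A minimal sketch of this calculation in pandas is shown here. The column names (`model_year`, `sale_date`) are hypothetical, and clipping negative ages to zero is an assumption added for the example (a next-year model sold early would otherwise get an age of -1); the project may have handled that case differently.

```python
import pandas as pd

# Hypothetical column names; the dataset's real column names are not given in the text.
sales = pd.DataFrame({
    "model_year": [2012, 2014, 2015, 2016],
    "sale_date": ["2014-12-16", "2015-01-20", "2015-06-02", "2015-03-11"],
})

# Extract the calendar year in which each vehicle was sold.
sales["sale_year"] = pd.to_datetime(sales["sale_date"]).dt.year

# Vehicle Age = Sale Year - Model Year
sales["vehicle_age"] = sales["sale_year"] - sales["model_year"]

# Assumption: treat a negative age (model year ahead of sale year) as zero.
sales["vehicle_age"] = sales["vehicle_age"].clip(lower=0)

print(sales[["model_year", "sale_year", "vehicle_age"]])
```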


Data Cleaning

Condition Normalization

The dataset contained multiple condition rating scales (1–5 and 1–50). Values were normalized to a standard 1–5 scale to support consistent and fair price predictions.
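One possible way to do this normalization is sketched below. The rule used here is an assumption, not the project's documented method: any rating above 5 is treated as belonging to the 1–50 scale and divided by 10, while ratings of 5 or below are treated as already on the 1–5 scale. The function name `normalize_condition` is also made up for the example.

```python
import pandas as pd

def normalize_condition(condition: pd.Series) -> pd.Series:
    """Map mixed 1-5 and 1-50 ratings onto a single 1-5 scale.

    Assumption: values above 5 come from the 1-50 scale and are divided by 10.
    Values of 5 or below are ambiguous and are left unchanged.
    """
    condition = condition.astype(float)
    # Keep values <= 5 as they are; rescale everything else by a factor of 10.
    scaled = condition.where(condition <= 5, condition / 10)
    # Guard against out-of-range inputs; missing values stay missing.
    return scaled.clip(lower=1, upper=5)

ratings = pd.Series([3.5, 45, 27, 5, 50, None])
normalized = normalize_condition(ratings)
print(normalized)
```

Missing ratings pass through untouched so that they can be imputed in a later step.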

Categorical Standardization

Several categorical variables required cleaning:

Common fixes included:


Handling Missing Data

Missing values ranged from 0.02% to 11.6% across columns.

A custom imputation function filled each missing value with the most common value among similar vehicles, where similarity was defined by grouping on relevant features. This keeps imputed values consistent across comparable vehicle types.

Example imputation strategies:

Feature      | Imputation Strategy
Transmission | Grouped by make, model, trim
Color        | Grouped by make, model, trim
Interior     | Grouped by make, model, trim
Condition    | Grouped by make, model, odometer or vehicle age
Body         | Grouped by make, model, trim

After imputation and removal of minimal remaining null rows, the final dataset contained 544,188 complete records.
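The project's custom imputation function is not shown, so the following is only a sketch of the general idea: fill missing values with the most frequent value in each group, then drop rows that are still missing. The function name `impute_with_group_mode`, the column names, and the sample rows are all assumptions for illustration.

```python
import pandas as pd

def impute_with_group_mode(df, target, group_cols):
    """Fill missing values in `target` with the most common value in each group.

    Rows whose grouping keys are themselves missing are not handled here.
    """
    def fill_group(values):
        modes = values.mode()  # mode() ignores missing values by default
        if modes.empty:
            return values  # every value in the group is missing; leave as-is
        return values.fillna(modes.iloc[0])

    out = df.copy()
    out[target] = out.groupby(group_cols)[target].transform(fill_group)
    return out

# Made-up sample: one group with a usable mode, one group that is entirely missing.
cars = pd.DataFrame({
    "make": ["Ford", "Ford", "Ford", "Kia", "Kia"],
    "model": ["Focus", "Focus", "Focus", "Soul", "Soul"],
    "trim": ["SE", "SE", "SE", "Base", "Base"],
    "transmission": ["automatic", None, "automatic", None, None],
})

filled = impute_with_group_mode(cars, "transmission", ["make", "model", "trim"])

# Groups with no observed value cannot be imputed; drop those remaining rows.
complete = filled.dropna(subset=["transmission"])
print(complete)
```

Using the group mode rather than a global mode means a missing transmission for one make, model, and trim is filled from vehicles of the same kind, which matches the grouping shown in the table above.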


Feature Encoding

Because the dataset contains many categorical features, one-hot encoding was applied using pd.get_dummies().
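For readers unfamiliar with `pd.get_dummies()`, here is a small sketch of how it expands categorical columns into indicator columns. The column names and rows are invented for the example; the `dtype=int` argument is a choice made here so the indicators print as 0/1 rather than booleans.

```python
import pandas as pd

# Made-up sample with two categorical columns and one numerical column.
cars = pd.DataFrame({
    "make": ["Ford", "Kia", "Ford"],
    "body": ["Sedan", "SUV", "SUV"],
    "odometer": [42000, 15500, 60250],
})

# Each listed categorical column is replaced by one 0/1 column per category.
encoded = pd.get_dummies(cars, columns=["make", "body"], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each distinct category becomes its own column, which is why high-cardinality features such as model or trim can widen a dataset considerably.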

Categorical variables encoded:

This expanded the dataset to:


Modeling

Three regression approaches were used to predict fair vehicle selling prices:

Data was split into 80% training and 20% testing sets.

Feature scaling was applied using StandardScaler to ensure stable coefficient estimates.
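The split and scaling steps can be sketched with scikit-learn as follows. The data here is synthetic, and fitting the scaler on the training set only (then reusing it on the test set) is common practice assumed for this example; the text does not say how the project fit its scaler. The `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix and the selling price target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumption: learn the mean and standard deviation from the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on training data alone keeps test-set statistics out of the model inputs.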


Model Comparison

Two modeling scenarios were evaluated:

Model Including MMR

The MMR column (market price estimate) was included as a feature.

Results:

However, coefficient analysis revealed that the model relied heavily on MMR, indicating data leakage and reduced transparency in pricing fairness.
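To illustrate what this kind of coefficient analysis can look like, the sketch below fits a linear regression on standardized features with and without an MMR-like column. Everything here is an assumption for demonstration: the data is synthetic, `mmr` is simulated as the price plus small noise, `LinearRegression` stands in for whichever models the project used, and the helper name `standardized_coefficients` is made up. The R² values are computed on the same data used for fitting, so they are not a substitute for a held-out evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: price depends on age and mileage; mmr is a noisy copy of price.
rng = np.random.default_rng(0)
n = 500
vehicle_age = rng.integers(0, 15, size=n).astype(float)
odometer = vehicle_age * 12000 + rng.normal(scale=8000, size=n)
price = 30000 - 1500 * vehicle_age - 0.03 * odometer + rng.normal(scale=1500, size=n)
mmr = price + rng.normal(scale=500, size=n)

features = pd.DataFrame(
    {"vehicle_age": vehicle_age, "odometer": odometer, "mmr": mmr}
)

def standardized_coefficients(X, y):
    """Fit a linear model on standardized features; return coefficients and in-sample R^2."""
    X_scaled = StandardScaler().fit_transform(X)
    model = LinearRegression().fit(X_scaled, y)
    return pd.Series(model.coef_, index=X.columns), model.score(X_scaled, y)

coef_with, r2_with = standardized_coefficients(features, price)
coef_without, r2_without = standardized_coefficients(features.drop(columns="mmr"), price)

print("With MMR:\n", coef_with.abs().sort_values(ascending=False))
print("Without MMR:\n", coef_without.abs().sort_values(ascending=False))
```

Because the features are standardized, coefficient magnitudes are directly comparable. When one column is itself an estimate of the target, it tends to absorb most of the weight, which is the pattern described above.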


Model Without MMR

After removing the MMR column, the model relied on actual vehicle characteristics such as:

This produced more realistic performance and better reflected equitable price determination, free from bias introduced by pre-existing market estimates.


Key Insights


Technologies Used


Outputs


Why This Project Matters

This project demonstrates the complete data mining pipeline required for real-world predictive modeling, including large-scale data cleaning, feature engineering, categorical encoding, and regression analysis.