Modeling Steam Game Prices for Indie Developers
Overview
This project focuses on estimating baseline prices for Steam games, helping indie and small development teams make informed pricing decisions. Without access to market research or publisher support, developers may struggle to price games fairly: a price set too high can discourage buyers, while one set too low undervalues their work.
The goal of this project was to build predictive regression models that use game characteristics (genres, categories, release timing) and, optionally, publisher context to provide baseline pricing guidance for developers.
Dataset
- Source: Kaggle – Steam Games Dataset (2021–2025)
- Size: ~65,000 games
- Features:
  - appid → Unique game ID
  - name → Game title
  - release_date → Release date
  - price → Listed price in USD
  - genres → Game genres (multi-label)
  - categories → Game features (e.g., single-player, controller support)
  - developer → Game developer
  - publisher → Game publisher
  - recommendations → Number of user recommendations
Target Variable
price → Continuous numeric variable representing the game’s listed price in USD
Data Cleaning & Preparation
- Removed outliers (games priced above $70)
- Filled missing genre, category, and publisher information
- Converted multi-label genres and categories to binary features using one-hot encoding
- Created numeric features: number of genres, number of categories, release month, and season
- Grouped rare publishers under “Other” to reduce sparsity
- Used sparse matrices to efficiently handle high-dimensional categorical features
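The preparation steps above can be sketched roughly as follows. This is a minimal illustration on a toy frame, not the project's actual pipeline: the `;` delimiter, the "Unknown" fill value, the `min_count` threshold, the month-to-season mapping, and the helper names `split_labels` and `prepare_features` are all assumptions made for the example.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed month-to-season mapping (Northern Hemisphere); the project's own mapping is not stated.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}


def split_labels(series, sep=";"):
    """Turn 'Action;RPG' strings into lists; missing values become ['Unknown']."""
    return series.fillna("Unknown").apply(lambda s: [t.strip() for t in s.split(sep)])


def prepare_features(df, max_price=70.0, min_count=2):
    # Remove outliers: keep games priced at or below the cap.
    df = df[df["price"] <= max_price].copy()

    # One-hot (multi-hot) encode the multi-label columns as sparse matrices.
    genre_lists = split_labels(df["genres"])
    cat_lists = split_labels(df["categories"])
    genre_enc = MultiLabelBinarizer(sparse_output=True)
    cat_enc = MultiLabelBinarizer(sparse_output=True)
    X_genres = genre_enc.fit_transform(genre_lists)
    X_cats = cat_enc.fit_transform(cat_lists)

    # Numeric features: label counts and release month (0 if the date is unparseable).
    dates = pd.to_datetime(df["release_date"], errors="coerce")
    numeric = pd.DataFrame({
        "n_genres": genre_lists.apply(len),
        "n_categories": cat_lists.apply(len),
        "release_month": dates.dt.month.fillna(0).astype(int),
    })
    # Season is kept as a categorical column on the frame for later encoding.
    df["season"] = dates.dt.month.map(SEASONS).fillna("unknown")

    # Group publishers seen fewer than `min_count` times under "Other".
    publisher = df["publisher"].fillna("Unknown")
    counts = publisher.value_counts()
    rare = counts[counts < min_count].index
    df["publisher_grouped"] = publisher.where(~publisher.isin(rare), "Other")

    # Stack everything into one sparse feature matrix.
    X = sparse.hstack(
        [X_genres, X_cats, sparse.csr_matrix(numeric.to_numpy(dtype=float))]
    ).tocsr()
    return df, X, genre_enc.classes_, cat_enc.classes_


# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "price": [9.99, 19.99, 99.99, 0.0],
    "genres": ["Action;RPG", "Indie", "Action", None],
    "categories": ["Single-player", "Single-player;Full controller support",
                   "Multi-player", "Single-player"],
    "publisher": ["BigPub", "BigPub", "BigPub", "TinyPub"],
    "release_date": ["2022-01-15", "2023-07-01", "2021-03-03", "2024-10-31"],
})
prepared, X, genre_names, category_names = prepare_features(toy)
```

Keeping the encoded genres and categories sparse avoids materializing a dense matrix that is mostly zeros, which matters once the number of distinct labels grows.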
Exploratory Data Analysis (EDA)
Key insights:
- Certain genres (e.g., Action, RPG) tend to have higher prices
- Free-to-play titles are present but relatively rare
- User recommendations loosely correlate with price
- Release month and season show minor effects on pricing trends
Visualizations included:
- Price distributions by genre and category
- Scatterplots of price vs. recommendations
- Heatmaps of genre and category co-occurrence
- Boxplots comparing pricing by publisher
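A genre-level price comparison like the one behind the first insight can be approximated by exploding the multi-label genres column so that each game contributes one row per genre. The sketch below assumes a `;` delimiter and uses the median as the summary statistic; the helper name `median_price_by_genre` and the toy rows are hypothetical.

```python
import pandas as pd


def median_price_by_genre(df, sep=";"):
    """Median price per genre, counting a game once for each genre it lists."""
    exploded = df.assign(
        genre=df["genres"].fillna("Unknown").str.split(sep)
    ).explode("genre")
    exploded["genre"] = exploded["genre"].str.strip()
    return exploded.groupby("genre")["price"].median().sort_values(ascending=False)


# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "price": [19.99, 9.99, 4.99, 0.0],
    "genres": ["Action;RPG", "Action", "Indie", "Indie;Casual"],
})
genre_medians = median_price_by_genre(toy)
```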
Modeling Approach
Models: Ridge Regression, Random Forest, Gradient Boosting, LightGBM
Two scenarios were evaluated:
- No-Publisher Model – simulates self-published or first-time developers; uses only game attributes (genres, categories, release features).
- Publisher-Aware Model – simulates games released through established publishers; incorporates publisher, release year, and recommendation counts in addition to game attributes.
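The two scenarios differ only in which feature columns the models see, so a comparison can be set up by fitting the same models on two feature matrices. The sketch below is an assumption-laden outline rather than the project's code: it uses synthetic random data, a single 75/25 split, only two of the four model families (LightGBM is omitted to keep the example dependent on scikit-learn alone), and arbitrary hyperparameters. The function name `compare_scenarios` is hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def compare_scenarios(X_game, X_extra, y, seed=0):
    """Fit the same models on game-only and game-plus-publisher features; return test MAE."""
    scenarios = {
        "no_publisher": X_game,
        "publisher_aware": sparse.hstack([X_game, X_extra]).tocsr(),
    }
    results = {}
    for scenario, X in scenarios.items():
        # The fixed seed gives both scenarios the same train/test rows.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed
        )
        models = {
            "ridge": Ridge(alpha=1.0),
            "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
        }
        for model_name, model in models.items():
            model.fit(X_tr, y_tr)
            results[(scenario, model_name)] = mean_absolute_error(y_te, model.predict(X_te))
    return results


# Synthetic stand-in data: random binary features and random prices, not real games.
rng = np.random.default_rng(0)
X_game = sparse.csr_matrix(rng.integers(0, 2, size=(200, 10)).astype(float))
X_extra = sparse.csr_matrix(rng.integers(0, 2, size=(200, 3)).astype(float))
y = rng.uniform(0, 70, size=200)
mae_by_scenario = compare_scenarios(X_game, X_extra, y)
```

Because the data here is random, the resulting numbers carry no meaning; the point is only the structure of the comparison.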
Evaluation Metrics:
- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R² Score
- Mean Absolute Percentage Error (MAPE)
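For reference, the four metrics can be computed as in the sketch below. One detail is an assumption: MAPE divides by the true price, so it is undefined for free-to-play titles (price 0), and this example simply skips those rows. How the project handled zero prices is not stated. Very low prices also inflate MAPE, since a small absolute error becomes a large percentage of a small denominator.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, R² and MAPE (%) for two equal-length arrays of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))

    # R² = 1 - (residual sum of squares / total sum of squares).
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # Assumption: rows with a true price of 0 are excluded from MAPE.
    nonzero = y_true != 0
    mape = 100.0 * np.mean(np.abs(err[nonzero] / y_true[nonzero]))

    return {"mae": mae, "rmse": rmse, "r2": r2, "mape": mape}


# Small hand-made example, not project data.
metrics = regression_metrics([10, 20, 0, 5], [12, 18, 1, 5])
# MAE = 1.25, RMSE = 1.5, MAPE = 10.0 (over the three non-zero prices)
```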
Results
No-Publisher Model
| Model | MAE ($) | RMSE ($) | R² | MAPE (%) |
|---|---|---|---|---|
| Ridge Regression | 4.37 | 6.86 | 0.254 | 152.3 |
| Random Forest | 3.83 | 6.40 | 0.350 | 108.7 |
| Gradient Boosting | 4.03 | 6.54 | 0.323 | 124.7 |
| LightGBM | 3.88 | 6.36 | 0.359 | 114.3 |
Publisher-Aware Model
| Model | MAE ($) | RMSE ($) | R² | MAPE (%) |
|---|---|---|---|---|
| Ridge Regression | 4.49 | 7.01 | 0.222 | 159.9 |
| Random Forest | 3.64 | 6.02 | 0.425 | 110.2 |
| Gradient Boosting | 3.75 | 6.05 | 0.421 | 120.9 |
| LightGBM | 3.66 | 6.07 | 0.415 | 111.3 |
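Key takeaways:
- Tree-based models (Random Forest, Gradient Boosting, LightGBM) outperform Ridge Regression on every metric in both scenarios.
- Publisher and recommendation data improve the tree-based models (best R² rises from 0.359 to 0.425, best MAE falls from $3.83 to $3.64), while Ridge Regression gets slightly worse (R² drops from 0.254 to 0.222). MAPE stays above 100% for every model in both scenarios.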
- No-Publisher models provide a conservative, realistic baseline for indie developers with limited information.
Technologies Used
- JupyterLab
- LightGBM
- Matplotlib
- NumPy
- Pandas
- Python
- Scikit-learn
- Seaborn
Future Improvements
- Incorporate time-series pricing and sales data
- Develop genre-specific pricing models
- Recommend price ranges instead of single price points
- Evaluate model explainability (feature importance, SHAP)
Why This Project Matters
This project demonstrates applied machine learning for business decision-making in the gaming industry. It highlights how structured historical data can inform fair pricing, particularly for independent developers, while emphasizing ethical use of predictive models to avoid pressuring developers toward market averages.