Predicting Obesity Levels in Latin American Cities

Overview

This project focuses on predicting obesity levels using demographic, dietary, and lifestyle factors collected through survey data. Obesity is associated with numerous chronic health conditions, making early identification of at-risk populations important for preventative health initiatives.

The objective of this project was to develop classification models capable of estimating obesity levels based on behavioral patterns and demographic characteristics. Two modeling scenarios were evaluated: one using both behavioral and demographic features, and another using behavioral factors alone. Comparing these scenarios helps assess how much predictive power demographic information contributes to obesity classification.

These models could be used to support public health monitoring, early risk screening, and targeted health interventions. For example, health organizations or local governments could use similar models to identify behavioral risk patterns within populations, design preventative education programs, or allocate resources toward communities with higher predicted obesity risk.

Dataset

Source: UCI Machine Learning Repository
Dataset: Estimation of Obesity Levels Based on Eating Habits and Physical Condition
Observations: 2,111 survey responses
Population: Individuals from Colombia, Peru, and Mexico

Features include:

Demographics: gender, age, height, weight
Dietary habits: vegetable consumption, high-calorie food intake, meal frequency, snacking habits
Lifestyle behaviors: physical activity, water intake, technology usage
Health behaviors: calorie monitoring, alcohol consumption, smoking
Transportation methods
Family history of overweight

Target Variable

ObesityLevel → categorical obesity classification with seven classes:
- Underweight
- NormalWeight
- OverweightLevel1
- OverweightLevel2
- ObesityType1
- ObesityType2
- ObesityType3

Data Preparation

Several preprocessing steps were applied to prepare the dataset for modeling:

Renamed variables for improved readability and consistency
Rounded numeric survey values to remove fractional noise introduced by synthetic class-balancing samples
Converted numerically encoded survey responses back into categorical labels
Standardized categorical responses across multiple lifestyle variables
Removed height and weight from model inputs to prevent trivial prediction of obesity levels
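
The preprocessing steps above can be sketched on a few toy rows. This is a minimal illustration and not the project's actual code: the raw column names (FCVC, NObeyesdad), the renamed columns, and the label wording are all assumptions. Height and weight are dropped because obesity categories are typically derived from body mass index, which is computed from exactly those two values.

```python
import pandas as pd

# Toy rows standing in for the survey data; column names and codes are assumed.
raw = pd.DataFrame({
    "FCVC": [2.0, 2.93, 1.08],   # vegetable consumption on an assumed 1-3 scale
    "Height": [1.62, 1.80, 1.75],
    "Weight": [64.0, 110.5, 82.3],
    "NObeyesdad": ["Normal_Weight", "Obesity_Type_I", "Overweight_Level_I"],
})

# Rename variables for readability (new names are hypothetical).
df = raw.rename(columns={"FCVC": "VegetableConsumption",
                         "NObeyesdad": "ObesityLevel"})

# Round fractional values back onto the integer survey scale.
df["VegetableConsumption"] = df["VegetableConsumption"].round().astype(int)

# Convert numeric codes to categorical labels (label wording is assumed).
veg_labels = {1: "Never", 2: "Sometimes", 3: "Always"}
df["VegetableConsumption"] = df["VegetableConsumption"].map(veg_labels)

# Drop height and weight so the models cannot reconstruct BMI directly.
model_df = df.drop(columns=["Height", "Weight"])
print(model_df)
```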

Two feature sets were created:

Behavioral + Demographics

Includes age, gender, and behavioral lifestyle features

Behavioral-Only

Excludes age and gender to evaluate prediction performance using only lifestyle factors
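
As a rough sketch, the two feature sets differ only in whether the demographic columns are kept. The column names and toy values below are assumptions made for illustration.

```python
import pandas as pd

# Toy frame after preprocessing; column names are hypothetical.
model_df = pd.DataFrame({
    "Age": [21, 35, 28],
    "Gender": ["Female", "Male", "Female"],
    "PhysicalActivity": ["Low", "None", "High"],
    "WaterIntake": ["Medium", "Low", "High"],
    "ObesityLevel": ["NormalWeight", "ObesityType1", "NormalWeight"],
})

target_col = "ObesityLevel"
demographic_cols = ["Age", "Gender"]

y = model_df[target_col]

# Scenario 1: behavioral + demographic features.
X_full = model_df.drop(columns=[target_col])

# Scenario 2: behavioral features only.
X_behavioral = X_full.drop(columns=demographic_cols)

print(list(X_full.columns))
print(list(X_behavioral.columns))
```

Deriving the second set from the first keeps the behavioral columns identical across scenarios, so any accuracy difference can be attributed to age and gender.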

Exploratory Data Analysis (EDA)

Exploratory analysis was conducted to better understand the distribution of obesity levels and behavioral patterns in the dataset.

Key visualizations included:

Distribution of obesity levels across the dataset
Age distributions by obesity category
Relationships between obesity level and family history
Obesity levels by high-calorie food consumption
Lifestyle comparisons including:
- meal frequency
- water intake
- physical activity
- technology use
Scatter plots of height versus weight by obesity category
Age distribution histograms for each obesity class

These visualizations highlight relationships between lifestyle behaviors and obesity classifications.
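
One way to prepare a comparison such as obesity level by family history is a row-normalized cross-tabulation. The sketch below uses made-up rows and hypothetical column names; it is not the project's plotting code.

```python
import pandas as pd

# Made-up survey rows; column names are assumptions.
df = pd.DataFrame({
    "FamilyHistory": ["yes", "yes", "no", "no", "yes", "no"],
    "ObesityLevel": ["ObesityType1", "ObesityType1", "NormalWeight",
                     "NormalWeight", "OverweightLevel1", "OverweightLevel1"],
})

# Share of each obesity level within each family-history group (rows sum to 1).
shares = pd.crosstab(df["FamilyHistory"], df["ObesityLevel"], normalize="index")
print(shares.round(2))
```

The resulting table can be passed to a stacked bar chart, for example with `shares.plot(kind="bar", stacked=True)` when Matplotlib is available.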

Modeling Approach

This project uses supervised classification models to predict obesity levels.

Three classifiers were implemented:

Logistic Regression – baseline linear classifier
Random Forest – ensemble tree-based model capable of capturing nonlinear patterns
Gradient Boosting – sequential tree-based ensemble in which each new tree is fit to correct the errors of the trees before it

A preprocessing pipeline was constructed using:

StandardScaler for numeric features
OneHotEncoder for categorical variables
ColumnTransformer to combine transformations into a single pipeline
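
A minimal sketch of how these three components fit together is shown below. The column names, toy data, and the `handle_unknown="ignore"` and `max_iter` settings are assumptions for illustration, not settings taken from the project.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one numeric and two categorical columns (names are hypothetical).
X = pd.DataFrame({
    "Age": [21, 35, 28, 44],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Transportation": ["Walking", "Automobile", "Public", "Automobile"],
})
y = ["NormalWeight", "ObesityType1", "NormalWeight", "ObesityType1"]

numeric_cols = ["Age"]
categorical_cols = ["Gender", "Transportation"]

# Scale numeric columns and one-hot encode categorical columns in one step.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Chain preprocessing and the classifier so both are fit on training data only.
clf = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)

# 1 scaled numeric column + 2 gender columns + 3 transportation columns.
encoded = clf.named_steps["preprocess"].transform(X)
print(encoded.shape)
```

Wrapping the transformations in a pipeline means the scaler and encoder learn their parameters from the training split alone, which avoids leaking test-set statistics into the model.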

Models were trained and evaluated using an 80/20 train-test split with stratified sampling so that both sets preserve the class proportions of the full dataset.
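
The training procedure can be sketched as follows. Synthetic data from `make_classification` stands in for the survey features, and the hyperparameters and random seeds are assumptions, so the printed accuracies are unrelated to the results reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic seven-class data used only as a stand-in for the survey features.
X, y = make_classification(
    n_samples=700, n_features=10, n_informative=6,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# 80/20 split; stratify=y keeps class proportions the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters here are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.4f}")
```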

Results

Behavioral + Demographics Models

Model	Accuracy
Logistic Regression	0.6170
Random Forest	0.8274
Gradient Boosting	0.7612

Behavioral-Only Models

Model	Accuracy
Logistic Regression	0.5650
Random Forest	0.7139
Gradient Boosting	0.6643

Key Observations

Tree-based models outperformed logistic regression by roughly 10 to 21 percentage points of accuracy, depending on the model and feature set.
The Random Forest classifier achieved the highest accuracy (82.7%) when demographic information was included.
Removing demographic variables lowered accuracy by roughly 5 to 11 percentage points, depending on the model, but behavioral factors alone still produced reasonable results.
Misclassifications were most common between adjacent obesity categories, such as:
- NormalWeight vs ObesityType1
- OverweightLevel1 vs OverweightLevel2
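
The effect of removing demographic features can be computed directly from the two accuracy tables above. The snippet below only restates the reported numbers; the variable names are illustrative.

```python
# Accuracies copied from the two results tables.
full = {"Logistic Regression": 0.6170, "Random Forest": 0.8274, "Gradient Boosting": 0.7612}
behavioral = {"Logistic Regression": 0.5650, "Random Forest": 0.7139, "Gradient Boosting": 0.6643}

# Accuracy lost, in percentage points, when age and gender are excluded.
drops = {name: round((full[name] - behavioral[name]) * 100, 2) for name in full}

for name, drop in drops.items():
    print(f"{name}: {drop:.2f} percentage points lower without demographics")
```

Random Forest loses the most (about 11.35 points), Gradient Boosting about 9.69 points, and Logistic Regression about 5.20 points.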

Model Insights

Feature importance analysis from the Random Forest models revealed several influential predictors.

Behavioral-Only Model

Top predictors included:

Vegetable consumption frequency
Physical activity levels
Technology use duration
Alcohol consumption
Transportation method
Meal frequency
Water intake

Behavioral + Demographics Model

Age emerged as the most influential feature
Gender also contributed predictive value
Behavioral features remained important, with importance spread more evenly among them

Visualizations

Additional visualizations were created to support model interpretation:

Model performance comparison bar charts
Feature importance rankings for both modeling scenarios
Confusion matrix heatmaps for each classifier
Class-level accuracy comparisons between feature sets

These visualizations provide insights into classification performance and highlight where prediction improvements occur when demographic information is included.
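
The numbers behind a confusion matrix heatmap and a class-level accuracy comparison can be derived as sketched below. The labels and predictions are made up for illustration and cover only three of the seven classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["NormalWeight", "OverweightLevel1", "OverweightLevel2"]

# Made-up true and predicted labels, four examples per class.
y_true = (["NormalWeight"] * 4 + ["OverweightLevel1"] * 4
          + ["OverweightLevel2"] * 4)
y_pred = ["NormalWeight", "NormalWeight", "NormalWeight", "OverweightLevel1",
          "OverweightLevel1", "OverweightLevel1", "OverweightLevel2", "OverweightLevel2",
          "OverweightLevel2", "OverweightLevel2", "OverweightLevel2", "OverweightLevel1"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Class-level accuracy (recall): correct predictions divided by true count.
per_class_acc = cm.diagonal() / cm.sum(axis=1)

# Most frequent misclassification: largest off-diagonal cell.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(cm)
print(dict(zip(labels, per_class_acc)))
print(f"Most common error: {labels[i]} predicted as {labels[j]}")
```

Computing `per_class_acc` once per feature set and comparing the two arrays gives the class-level comparison; the matrix `cm` can be drawn as a heatmap with Seaborn's `heatmap` function.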

Technologies Used

JupyterLab
Matplotlib
NumPy
Pandas
Python
Scikit-learn
Seaborn

Why This Project Matters

Obesity continues to be a growing public health concern in many parts of the world, including Latin America. Predictive models like those developed in this project can help researchers and policymakers better understand how lifestyle behaviors influence obesity risk within populations.

In practical settings, similar models could be used to:

Support population-level health monitoring
Identify behavioral risk patterns associated with obesity
Guide preventative health campaigns and education programs
Assist researchers studying lifestyle and demographic drivers of obesity

While behavioral data alone provides a reasonable baseline for obesity prediction, adding age and gender raised accuracy for every model tested, by about 11 percentage points for the best-performing Random Forest (0.7139 to 0.8274). These findings highlight how combining lifestyle patterns with demographic context can enhance predictive health analytics while still allowing behavioral-only models to support broader community-level assessments.