Understanding how vehicle characteristics affect fuel efficiency is a classic regression problem, and an excellent way to explore tree-based models like Decision Trees, Random Forests, and XGBoost. In this project, I analyzed a dataset of cars and built models to predict fuel efficiency (MPG) with different configurations.
Step 1 – Data Preparation
The dataset contained various vehicle features, including:
- vehicle_weight
- engine_displacement
- horsepower
- acceleration
- model_year
- origin
- fuel_type
To ensure data consistency, all missing values were filled with zeros.
Then I performed a train/validation/test split (60%/20%/20%), using a random_state=1 for reproducibility.
Next, I used DictVectorizer(sparse=True) to convert categorical and numerical features into a format suitable for scikit-learn models.
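The preparation steps above can be sketched as follows. The DataFrame here is a small synthetic stand-in (the original CSV isn't shown), with column names taken from the post; the target column name `fuel_efficiency_mpg` is an assumption.

```python
# Sketch of the preparation step: fill missing values with zeros,
# do a 60/20/20 split, and vectorize records with DictVectorizer.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Tiny synthetic stand-in for the cars dataset (10 rows, a few columns).
df = pd.DataFrame({
    'vehicle_weight':      [2500, 3200, 1800, 4100, 2900, 3500, 2200, 3800, 2600, 3000],
    'engine_displacement': [120, 200, 90, 350, 150, 250, 100, 300, 130, 180],
    'horsepower':          [88, 130, 65, 220, 105, 160, 70, 190, 95, 120],
    'model_year':          [2018, 2015, 2020, 2012, 2017, 2014, 2019, 2013, 2016, 2015],
    'origin':              ['usa', 'europe', 'japan', 'usa', 'japan',
                            'europe', 'japan', 'usa', 'europe', 'usa'],
    'fuel_type':           ['gas', 'gas', 'diesel', 'gas', 'diesel',
                            'gas', 'gas', 'gas', 'diesel', 'gas'],
    'fuel_efficiency_mpg': [30.1, 24.5, 38.2, 15.8, 28.0, 21.3, 34.0, 17.5, 27.2, 23.8],
}).fillna(0)  # missing values -> zeros, as in the post

# 60/20/20: carve off 20% for test, then 25% of the remainder for validation.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

target = 'fuel_efficiency_mpg'
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.drop(columns=target).to_dict(orient='records'))
X_val = dv.transform(df_val.drop(columns=target).to_dict(orient='records'))
print(len(df_train), len(df_val), len(df_test))  # -> 6 2 2
```

DictVectorizer one-hot encodes the string columns (e.g. `origin=usa`) and passes numeric columns through unchanged, which is why no manual encoding step is needed.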
Step 2 – Decision Tree Regressor
I began with a Decision Tree Regressor with max_depth=1.
This simple tree helps visualize which feature the model uses first to split the data, effectively revealing the most influential variable in predicting MPG.
Result:
The feature used for splitting was model_year, showing that newer vehicles tend to have different fuel efficiencies compared to older models.
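A depth-1 tree makes this inspection easy because the whole model is a single split. A minimal sketch with toy data, constructed so that MPG depends only on model_year (the second feature is held constant and so carries no signal):

```python
# A depth-1 regression tree has exactly one split; export_text shows
# which feature it chose. Toy data: mpg depends only on model_year,
# engine_displacement is constant so it cannot reduce variance.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X = np.array([[2010, 120], [2012, 120], [2018, 120], [2020, 120]], dtype=float)
y = np.array([18.0, 19.0, 30.0, 32.0])

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X, y)
tree_text = export_text(dt, feature_names=['model_year', 'engine_displacement'])
print(tree_text)  # the root split is on model_year
```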
Step 3 – Random Forest Model
Next, I trained a Random Forest Regressor with the parameters:
- n_estimators=10
- random_state=1
- n_jobs=-1
Random forests aggregate multiple decision trees to reduce overfitting and improve accuracy.
Validation RMSE: ≈ 4.5
This confirmed the model could capture relationships between engine specs and fuel efficiency quite effectively.
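The training and evaluation step looks roughly like this. Synthetic regression data stands in for the vectorized car features, so the printed RMSE will not match the post's ≈ 4.5:

```python
# Random Forest with the post's parameters, scored by RMSE on a
# held-out validation set. Data here is synthetic, not the cars dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)
X_train, y_train = X[:160], y[:160]
X_val, y_val = X[160:], y[160:]

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)
rmse = mean_squared_error(y_val, rf.predict(X_val)) ** 0.5
print(round(rmse, 3))
```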
Step 4 – Tuning n_estimators
To see how the number of trees affects performance, I trained models with n_estimators ranging from 10 to 200 (step = 10).
After monitoring RMSE, I observed the improvement plateaued after around 80 estimators, indicating that adding more trees didn't significantly enhance accuracy.
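The sweep can be sketched as a simple loop (again on synthetic stand-in data; `warm_start=True` would avoid refitting from scratch each time, but a plain loop mirrors the description above):

```python
# Sweep n_estimators from 10 to 200 in steps of 10 and record
# validation RMSE for each forest size.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)
X_train, y_train, X_val, y_val = X[:160], y[:160], X[160:], y[160:]

scores = {}
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    scores[n] = mean_squared_error(y_val, rf.predict(X_val)) ** 0.5

best_n = min(scores, key=scores.get)
print(best_n, round(scores[best_n], 3))
```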
Step 5 – Tuning max_depth
I then compared four values of max_depth ([10, 15, 20, 25]), each with n_estimators increasing from 10 to 200.
The best mean RMSE occurred at max_depth = 20, which struck the right balance between bias and variance.
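In code, this amounts to averaging the RMSE across the n_estimators sweep for each depth and keeping the depth with the lowest mean (a sketch on synthetic stand-in data, so the winning depth here need not be 20):

```python
# For each max_depth, average validation RMSE over the n_estimators
# sweep and keep the depth with the lowest mean.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)
X_train, y_train, X_val, y_val = X[:160], y[:160], X[160:], y[160:]

mean_rmse = {}
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        rmses.append(mean_squared_error(y_val, rf.predict(X_val)) ** 0.5)
    mean_rmse[depth] = sum(rmses) / len(rmses)

best_depth = min(mean_rmse, key=mean_rmse.get)
print(best_depth)
```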
Step 6 – Feature Importance
Random Forests provide an excellent built-in mechanism for feature importance.
Training the model with:
n_estimators=10, max_depth=20, random_state=1
I found the most influential feature for predicting fuel efficiency to be engine_displacement, followed by vehicle_weight and horsepower.
This aligns well with domain knowledge: larger engines and heavier vehicles typically consume more fuel.
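Reading the ranking off a fitted forest is a one-liner via `feature_importances_`. A sketch where the synthetic data is built so the first feature dominates, mimicking the engine_displacement result above:

```python
# Fit the Step 6 forest and rank features by feature_importances_.
# Synthetic target: the first feature has by far the largest weight.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
names = ['engine_displacement', 'vehicle_weight', 'horsepower']
X = rng.normal(size=(300, 3))
y = 5.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1)
rf.fit(X, y)

# Importances sum to 1; sort descending to get the ranking.
ranking = sorted(zip(names, rf.feature_importances_), key=lambda kv: -kv[1])
for name, imp in ranking:
    print(f'{name}: {imp:.3f}')
```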
Step 7 – XGBoost Experiments
Finally, I trained an XGBoost regressor, comparing two values of the eta (learning rate) parameter: 0.3 and 0.1.
xgb_params = {
    'eta': 0.3,  # compared against 0.1 in a second run
    'max_depth': 6,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1
}
After 100 training rounds, the model with eta = 0.1 delivered slightly better RMSE on the validation set, confirming that a smaller learning rate can yield smoother, more generalized models.
Key Takeaways
- model_year strongly influences fuel efficiency in modern cars.
- Random Forests with n_estimators ≈ 80 and max_depth=20 gave the most balanced performance.
- Engine displacement emerged as the most important predictor of MPG.
- XGBoost with a lower learning rate (eta=0.1) achieved the best validation score.
Final Thoughts
This project demonstrates how iterative experimentation with tree-based models reveals both predictive strength and interpretability.
From simple decision trees to tuned XGBoost models, each step provided insight into how vehicle characteristics drive fuel efficiency, and how model parameters affect performance.
If you're learning machine learning, projects like this are perfect for mastering feature engineering, evaluation metrics, and model tuning.