AWS Machine Learning Blog

Utilizing XGBoost training reports to improve your models

In 2019, AWS unveiled Amazon SageMaker Debugger, a SageMaker capability that enables you to automatically detect a variety of issues that may arise while a model is being trained. SageMaker Debugger captures model state data at specified intervals during a training job. With this data, SageMaker Debugger can detect training issues or anomalies by leveraging built-in or user-defined rules. In addition to detecting issues during the training job, you can analyze the captured state data afterwards to evaluate model performance and identify areas for improvement. This task is made easier with the newly launched XGBoost training report feature. With a minimal amount of code changes, SageMaker Debugger generates a comprehensive report outlining key information that you can use to evaluate and improve the model.

This post shows you an end-to-end example of training an XGBoost model on Amazon SageMaker and enabling the automatic XGBoost report functionality in SageMaker Debugger to quickly and easily evaluate model performance and identify areas of improvement for your model. Even if you don’t have a lot of data science experience, you can still gauge how well the model performs and identify areas of improvement based on the information provided by the report. The code from this post is available in the GitHub repo.

Dataset

For this example, we use the dataset from the Kaggle ATLAS Higgs Boson Machine Learning Challenge 2014. With this dataset, we train a machine learning (ML) model to automatically classify Higgs Boson events from others (such as background noise) generated from simulated proton-proton collisions in CERN’s Large Hadron Collider. The data can be obtained directly from CERN. Let’s go through the steps of obtaining the data and configuring the training job. You can follow along with a Jupyter notebook.

  1. We start with the relevant imports:
    import requests
    from io import BytesIO
    import pandas as pd
    import boto3
    import s3fs
    from datetime import datetime
    import time
    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker import image_uris
    from sagemaker.inputs import TrainingInput
    from sagemaker.debugger import Rule, rule_configs
    
    from IPython.display import FileLink, FileLinks
    
  2. Then we set up the variables we need later to configure the SageMaker training job:
    # setup sagemaker variables
    role = sagemaker.get_execution_role()
    sess = sagemaker.session.Session()
    bucket = sess.default_bucket()
    key_prefix = "higgs-boson"
    region = sess._region_name
    s3 = s3fs.S3FileSystem(anon=False)
    xgboost_container = image_uris.retrieve("xgboost", region, "1.2-1")
    
  3. We obtain data and prepare it for training:
    # obtain data from CERN and load it into a DataFrame
    data_url = "http://opendata.cern.ch/record/328/files/atlas-higgs-challenge-2014-v2.csv.gz"
    gz_file = BytesIO(requests.get(data_url).content)
    gz_file.flush()
    df = pd.read_csv(gz_file, compression="gzip")
    
    # identify feature, label, and unused columns
    non_feature_cols = ["EventId", "Weight", "KaggleSet", "KaggleWeight", "Label"]
    feature_cols = [col for col in df.columns if col not in non_feature_cols]
    label_col = "Label"
    df["Label"] = df["Label"].apply(lambda x: 1 if x=="s" else 0)
    
    # take subsets of data per the original Kaggle competition
    train_data = df.loc[df["KaggleSet"] == "t", [label_col, *feature_cols]]
    test_data = df.loc[df["KaggleSet"] == "b", [label_col, *feature_cols]]
    
    # upload data to S3
    for name, dataset in zip(["train", "test"], [train_data, test_data]):
        sess.upload_string_as_file_body(body=dataset.to_csv(index=False, header=False),
                                       bucket=bucket,
                                       key=f"{key_prefix}/input/{name}.csv"
                                       )
                                       
    # configure data inputs for SageMaker training
    train_input = TrainingInput(f"s3://{bucket}/{key_prefix}/input/train.csv", content_type="text/csv")
    validation_input = TrainingInput(f"s3://{bucket}/{key_prefix}/input/test.csv", content_type="text/csv")
    

Setting up a training job with XGBoost training report

We only need to make one code change to the typical process for launching a training job: adding the create_xgboost_report rule to the Estimator. SageMaker takes care of the rest. A companion SageMaker processing job spins up to analyze the XGBoost model and produce the report. This analysis is done at no additional cost. See the following additional code:

# add a rule to generate the XGBoost Report
rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

hyperparameters={
    "max_depth": "6",
    "eta": "0.1",
    "objective": "binary:logistic",
    "num_round": "100",
}

estimator=Estimator(
    role=role,
    image_uri=xgboost_container,
    base_job_name="higgs-boson-model",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    hyperparameters=hyperparameters,
    rules=rules, 
)

training_job_time = datetime.now()
estimator.fit({'train': train_input, 'validation': validation_input}, 
              wait=True)

Analyzing models with the XGBoost training report

When the training job is complete, SageMaker automatically starts the processing job to generate the XGBoost report. We write a few lines of code to check the status of the processing job and, when it’s complete, download the report to our local drive for further review. The following code waits for the report to be generated, downloads it, and provides a hyperlink directly within the notebook for easy viewing:

import os
#get name of profiler report
profiler_report_name = [rule["RuleConfigurationName"] 
                        for rule in estimator.latest_training_job.rule_job_summary() 
                        if "Profiler" in rule["RuleConfigurationName"]][0]

xgb_profile_job_name = [rule["RuleEvaluationJobArn"].split("/")[-1] 
                        for rule in estimator.latest_training_job.rule_job_summary() 
                        if "CreateXgboostReport" in rule["RuleConfigurationName"]][0]

base_output_path = os.path.dirname(estimator.latest_job_debugger_artifacts_path())
rule_output_path = os.path.join(base_output_path, "rule-output/")
xgb_report_path = os.path.join(rule_output_path, "CreateXgboostReport")
profile_report_path = os.path.join(rule_output_path, profiler_report_name)

# poll the report-generating processing job until it reaches a terminal state
while True:
    xgb_job_info = sess.sagemaker_client.describe_processing_job(ProcessingJobName=xgb_profile_job_name)
    status = xgb_job_info["ProcessingJobStatus"]
    if status in ("Completed", "Failed", "Stopped"):
        break
    print(f"Job Status: {status}")
    time.sleep(30)

s3.download(xgb_report_path, "reports/xgb/", recursive=True)
s3.download(profile_report_path, "reports/profiler/", recursive=True)
display("Click link below to view the profiler report", FileLink("reports/profiler/profiler-output/profiler-report.html"))
display("Click link below to view the XGBoost Training report", FileLink("reports/xgb/xgboost_report.html"))

Before we dive into the training report, let’s take a quick look at the SageMaker Debugger profiling report, which is generated by default after every training job. This report provides key metrics around resource utilization such as network, I/O, and CPU. In the following example, we can see that median CPU utilization was around 55%, while memory utilization was consistently under 5%. This tells us that we can reduce costs by using a smaller training instance.

Now let’s dive into the training report. SageMaker Debugger automatically generates the following key insights on our model:

  • Distribution of labels – Detects imbalanced datasets
  • Loss graph – Detects over-fitting or over-training
  • Feature importance metrics – Identifies redundant or uninformative features
  • Confusion matrix and evaluation metrics – Evaluates performance at the individual class level and identifies concentrations of errors
  • Accuracy rate per iteration – Shows how accuracy improved for each class over each round of boosting
  • Receiver operating characteristic curve – Shows how the model performs under different probability thresholds
  • Distribution of residuals – Helps determine if residuals are a result of random error or missing information

We pick a few items from the report for demonstration purposes.

Distribution of true labels of the dataset

This visualization shows the distribution of labeled classes (for classification) or values (for regression) in your original dataset. An imbalanced dataset could result in poor predictive performance unless properly handled. In this particular example, there’s a slight imbalance between the negative and positive labels.
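As a quick check outside the report, you can measure this imbalance directly from the train_data DataFrame prepared earlier (a minimal sketch, not part of the original walkthrough):

# check the class balance of the training set
label_counts = train_data["Label"].value_counts()
print(label_counts)
print("positive fraction:", label_counts[1] / label_counts.sum())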

Loss vs. step graph

This visualization compares the loss from the training dataset against the validation dataset. For this particular model, it looks like the model is over-fitting on the training set: the validation error remains relatively flat after about 30 boosting rounds, even though the training loss continues to improve.
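If you want to reproduce this plot yourself, one option is to read the metrics that Debugger captured during training with the smdebug library. The following is only a sketch: it assumes smdebug and matplotlib are installed, and that the evaluation metrics are stored as tensors whose names start with train- and validation- (check trial.tensor_names() and adjust the prefixes if they differ):

# sketch: plot the train/validation metrics captured by Debugger
from smdebug.trials import create_trial
import matplotlib.pyplot as plt

trial = create_trial(estimator.latest_job_debugger_artifacts_path())
# tensor names are discovered at run time; the train-/validation- prefixes are an assumption
metric_names = [t for t in trial.tensor_names() if t.startswith(("train-", "validation-"))]

for name in metric_names:
    steps = trial.tensor(name).steps()
    values = [trial.tensor(name).value(s) for s in steps]
    plt.plot(steps, values, label=name)
plt.xlabel("boosting round")
plt.ylabel("metric value")
plt.legend()
plt.show()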

Feature importance

This visualization shows you feature importance by weight, gain, and coverage. Gain, which measures the relative contribution of each feature, is typically the most relevant one for most use cases. For this particular model, we see that a handful of features provide the bulk of the contribution, while a large number contribute little to no gain to the model’s predictive performance. It’s usually a good practice to drop uninformative features from the model because they add noise and may result in over-fitting.
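To confirm this programmatically, one option is to pull the trained model artifact and inspect the gain scores with the open-source xgboost library. This is only a sketch under two assumptions: a local xgboost installation compatible with the 1.2-1 container, and the container's convention of pickling the booster to a file named xgboost-model inside model.tar.gz (later container versions save it in XGBoost's native format instead):

# sketch: list features that never contribute any gain
import pickle as pkl
import tarfile

s3.download(estimator.model_data, "model.tar.gz")
with tarfile.open("model.tar.gz") as tar:
    tar.extractall("model")

with open("model/xgboost-model", "rb") as f:
    booster = pkl.load(f)   # assumption: the 1.2-1 container pickles the booster

gain = booster.get_score(importance_type="gain")             # features absent here were never used in a split
all_features = {f"f{i}" for i in range(len(feature_cols))}   # CSV training yields generic f0, f1, ... names
unused = sorted(all_features - set(gain))
print(f"{len(unused)} of {len(all_features)} features contributed no gain: {unused}")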

Confusion matrix and ROC curve

There are a number of additional visualizations that show you the common things data scientists often look at, such as the confusion matrix, ROC curve, and F1 score. For more information, see Debugger XGBoost Training Report Walkthrough.

From the following confusion matrix, we can see that the model does a better job at predicting class 0 than class 1. This can be explained by the imbalanced label distribution we showed at the beginning (there are more instances of class 0 than class 1). One possible remedy is to make the label distribution more balanced via data resampling techniques.
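As an illustration of that remedy (a sketch, not part of the original walkthrough), you could upsample the minority class with pandas before re-uploading the training set:

# sketch: upsample the positive (signal) class so both classes are equally represented
background = train_data[train_data["Label"] == 0]
signal = train_data[train_data["Label"] == 1]
signal_upsampled = signal.sample(n=len(background), replace=True, random_state=42)
balanced_train = pd.concat([background, signal_upsampled]).sample(frac=1, random_state=42)  # shuffle rows
print(balanced_train["Label"].value_counts())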

SageMaker Debugger automatically generates and reports the performance metrics such as F1 score and accuracy. You can also see a classification report, such as the following.

Fine-tuning performance

From the training report’s outputs, we can see several areas where the model can be fine-tuned to improve performance, notably the following:

  • The loss vs. step graph indicates that the validation error stopped improving after about 30 rounds, so we can reduce the number of boosting rounds or enable early stopping to mitigate over-training.
  • The feature importance graph shows a large number of uninformative features that could potentially be removed to reduce over-fitting and improve predictive performance on unseen datasets.
  • Based on the confusion matrix and the classification report, the recall score is somewhat low, meaning we’ve misclassified a large number of signal events. Tuning the scale_pos_weight parameter to adjust for the imbalance in the dataset could help improve this (see the sketch after this list).
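The following sketch retrains the model with the first and third adjustments applied: early stopping and class weighting. It reuses the container, role, inputs, and rules defined earlier; the specific values (early_stopping_rounds=10 and the computed scale_pos_weight) are illustrative starting points rather than tuned settings, and dropping the uninformative features would happen in the data preparation step:

# sketch: retrain with early stopping and class weighting (illustrative values, not tuned)
scale_pos_weight = (train_data["Label"] == 0).sum() / (train_data["Label"] == 1).sum()

tuned_hyperparameters = {
    "max_depth": "6",
    "eta": "0.1",
    "objective": "binary:logistic",
    "num_round": "100",
    "early_stopping_rounds": "10",                          # stop once the validation metric stops improving
    "scale_pos_weight": str(round(scale_pos_weight, 2)),    # compensate for the label imbalance
}

tuned_estimator = Estimator(
    role=role,
    image_uri=xgboost_container,
    base_job_name="higgs-boson-model-tuned",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    hyperparameters=tuned_hyperparameters,
    rules=rules,
)
tuned_estimator.fit({"train": train_input, "validation": validation_input}, wait=True)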

Conclusion

In this post, we generated an XGBoost training report and a profiler report using SageMaker Debugger, which automatically gave us reports on both model performance and resource utilization during training. We then walked through the XGBoost training report and identified a number of issues that we can alleviate with some hyperparameter tuning.

For more about SageMaker Debugger, see SageMaker Debugger XGBoost Training Report and SageMaker Debugger Profiling Report.


About the Authors

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Lu Huang is a Senior Product Manager on the AWS Deep Engine team, managing SageMaker Debugger.

Satadal Bhattacharjee is Principal Product Manager at AWS AI. He leads the machine learning engine PM team on projects such as SageMaker and optimizes machine learning frameworks such as TensorFlow, PyTorch, and MXNet.

Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial service and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.

Nihal Harish is an engineer at AWS AI. He loves working at the intersection of distributed systems and machine learning. Outside of work, he enjoys long distance running and playing tennis.