The solution presented in this post enables a dynamic EC2 cores allocation limit for each business unit (BU), automatically adapted to the spending of a past time frame (e.g. one week), and adjustable anytime according to workload spikes or priorities.
The EC2 cores allocation limit is enforced in the HPC workload scheduler itself, through its dynamic resources. HPC jobs exceeding the limit are kept in queue/pending status until cores are released by older, completed jobs. With this approach, a related pending reason is also reported to both the business and the job owners, making them aware of the spending implications.
The focus is on real EC2 cores instead of vCPUs, because most HPC applications run better with hyperthreading disabled, and HPC users are more familiar with the core/CPU concept.
The underlying core logic is simple: adapt the scheduler EC2 cores limit for each business unit on a regular basis (e.g. weekly), according to its assigned budget. To calculate past costs, we'll use AWS Cost Explorer.
In this document, we'll refer to a Linux HPC environment with weekly budgets set by Administrators. This time frame, however, can be easily modified, since it is simply driven by a system cron setting.
The overall logic is the following:
- A configuration file, located on the HPC headnode, stores a list of entries defined by Administrators/budget owners:

  <BU1>,<BU1 weekly budget>
  <BU2>,<BU2 weekly budget>

  where BU1 and BU2 are business unit identifiers or labels, each followed by its respective weekly budget, comma separated. This file can be modified anytime by Administrators/budget owners in case of justified workload spikes or business decisions.
- A script, executed on a weekly basis (e.g. Sunday at midnight) on the HPC headnode, reads the configuration file, queries AWS Cost Explorer, and appends a new row to a CSV file (e.g. budget.control.csv) with these contents:
date, BU1, BU1 cost, BU1 budget, BU1 core limit, BU2, BU2 cost, BU2 budget, BU2 core limit,...
  where:
  - date is the day/timestamp this row has been added to the CSV file
  - BU1, BU2, ..., BUn are business unit identifiers, read from the configuration file defined in the previous step
  - cost is the previous-week EC2 cost summary for that specific BU, retrieved from AWS Cost Explorer by this script
  - budget is the weekly budget set for that specific BU, read from the configuration file defined earlier
  - cores limit is calculated by this script from cost and budget; the calculation is described in the next step
- Core limit calculation follows this formula:

  if Cost > Budget: core limit = default core limit / (Cost / Budget)
  if Cost <= Budget: core limit = default core limit

  where the default core limit is derived from the weekly cost of 1 core of a reference EC2 instance type, e.g. 23.688 weekly cost for 1 core of an r5.large in our AWS region. So:

  default core limit = budget / 1 core weekly cost
- For each business unit, a scheduler dynamic resource keeps track of the amount of available EC2 cores, calculated as:

  current BU core limit - currently allocated cores
- Each submitted job must include a cores resource requirement, according to the user request and the BU they belong to.

This way, a BUx job requiring more cores than the available ones stays in pending status until more cores are freed by older, completed BUx jobs.
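As an illustration, the limit calculation above can be sketched in a few lines of shell. The function name and structure here are ours, not necessarily what the real updater script uses:

```shell
#!/usr/bin/env bash
# Sketch of the core-limit formula: when cost exceeds budget,
# scale the default limit down by the cost/budget ratio (floored).
core_limit() {
  local cost=$1 budget=$2 default_limit=$3
  if awk -v c="$cost" -v b="$budget" 'BEGIN { exit !(c > b) }'; then
    # over budget: default limit / (cost / budget), truncated to integer
    awk -v d="$default_limit" -v c="$cost" -v b="$budget" \
        'BEGIN { printf "%d\n", d / (c / b) }'
  else
    # within budget: keep the default limit
    echo "$default_limit"
  fi
}

core_limit 1500 1000 42   # over budget by 1.5x: prints 28
core_limit 500 1000 42    # within budget: prints 42
```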
Example
Configuration:
BU1,1000
BU2,2000
According to an EC2 r5.large weekly core cost of 23.688, the script will calculate default core limits of 42 and 84, respectively.
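These defaults can be reproduced with a quick awk one-liner, truncating to integer and assuming the 23.688 weekly core cost above:

```shell
# Default core limit = floor(budget / single-core weekly cost)
awk 'BEGIN { printf "%d %d\n", 1000/23.688, 2000/23.688 }'   # prints: 42 84
```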
Projecting the usage over 4 weeks, with random example usage, we can have these CSV entries:
week/date | BU | Cost | Budget | Cores limit | BU | Cost | Budget | Cores limit
---|---|---|---|---|---|---|---|---
10/10/2022 | BU1 | 500 | 1000 | 42 | BU2 | 400 | 2000 | 84
10/17/2022 | BU1 | 700 | 1000 | 42 | BU2 | 500 | 2000 | 84
10/21/2022 | BU1 | 1500 | 1000 | 28 | BU2 | 2200 | 2000 | 76
10/28/2022 | BU1 | 600 | 1200 | 42 | BU2 | 2100 | 2000 | 80
In this example:
- The week 3 entry saw a previous-week cost summary that exceeded the BU1 budget by a factor of 1.5 (50%). That ratio was used to update the cores limit accordingly: 42 / 1.5 = 28.
- The week 3 entry also saw a previous-week cost summary that exceeded the BU2 budget by a factor of 1.1 (10%). That ratio was used to update the cores limit accordingly: 84 / 1.1 = 76.
- The week 4 entry saw a previous-week cost summary that exceeded the BU2 budget by a factor of 1.05 (5%). That ratio was used to update the cores limit accordingly: 84 / 1.05 = 80.
- All the other limits refer to the default cores limit.
- The HPC cluster headnode needs the following IAM permissions to query AWS Cost Explorer for costs and EC2 for current cores usage:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "CostExplorerQuery",
"Effect": "Allow",
"Resource": "*",
"Action": [
"ce:GetReservationUtilization",
"ce:GetDimensionValues",
"ce:GetCostAndUsage",
"ce:GetReservationCoverage",
"ce:GetReservationPurchaseRecommendation",
"ce:GetCostForecast",
"ce:GetTags"
]
},
{
"Sid": "DescribeEC2",
"Effect": "Allow",
"Resource": "*",
"Action": [
"ec2:DescribeInstances"
]
}
]
}
- The sum of all BU limits, expressed as vCPUs, must not exceed the AWS account EC2 vCPU limit.
- A (compute) node that belongs to a certain BU must have a related tag, with a specific key (e.g. BusinessUnit) and the BU identifier/name as value, e.g. BusinessUnit=BU1
- The AWS CLI and jq commands must be installed on the headnode to correctly run the budget control scripts. The jq command is used to correctly and safely parse JSON data; the AWS CLI is used to query Cost Explorer (both AWS CLI version 1 and 2 have been tested successfully):
  - AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
  - jq: yum -y install jq (available in the EPEL repository); more info at http://stedolan.github.io/jq/
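As a sketch of how jq is used here, this is how per-BU costs could be extracted from a Cost Explorer get-cost-and-usage response grouped by the BusinessUnit tag. The JSON below is a trimmed, hypothetical sample, only illustrating the response shape:

```shell
# Hypothetical, trimmed sample of an 'aws ce get-cost-and-usage' response,
# grouped by the BusinessUnit cost allocation tag (keys are "TagKey$TagValue").
response='{
  "ResultsByTime": [
    {
      "Groups": [
        {"Keys": ["BusinessUnit$BU1"], "Metrics": {"UnblendedCost": {"Amount": "512.34", "Unit": "USD"}}},
        {"Keys": ["BusinessUnit$BU2"], "Metrics": {"UnblendedCost": {"Amount": "401.00", "Unit": "USD"}}}
      ]
    }
  ]
}'

# Emit one "<tag key>$<BU>,<cost>" line per business unit.
printf '%s' "$response" | \
  jq -r '.ResultsByTime[].Groups[] | .Keys[0] + "," + .Metrics.UnblendedCost.Amount'
```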
Budget control code is made up of the following files. All of them must be present and accessible on the scheduler headnode:
- budget.control.conf
  This configuration file stores the current budget assigned to each BU. File format is:

  <BU>,<Budget>

  e.g.

  BU1,1000
  BU2,2000
  BU3,3000
- budget.control.updater.sh
  This script must be run on a regular basis on the headnode, e.g. by the cron daemon. Every time it runs, it appends a new line to the budget.control.csv file; this line is used by the scheduler dynamic resource described below to provide the available cores for each BU job. Log file: /var/log/budget.control.updater.log
The budget.control.csv file will be created and updated in the same folder; it contains, for each BU, last week's costs, budget, and calculated cores limit.
Format:
date, BU1, BU1 cost, BU1 budget, BU1 core limit, BU2, BU2 cost, BU2 budget, BU2 core limit,..
E.g.
2022-11-24,BU1,500,1000,42,BU2,1200,2000,84,BU3,770,3000,126
Note: Administrators can change the assigned budgets in budget.control.conf anytime. To force the application of new budgets, administrators can manually run budget.control.updater.sh; the last line appended to the file is enforced immediately.
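For illustration, the configuration file format above can be parsed with a standard IFS-based read loop. This is a sketch, not necessarily the exact code used by the updater script:

```shell
# Read <BU>,<Budget> pairs from the configuration format, one per line.
# A here-doc stands in for budget.control.conf in this self-contained sketch.
while IFS=',' read -r bu budget; do
  [ -z "$bu" ] && continue          # skip blank lines
  echo "unit=$bu budget=$budget"
done <<'EOF'
BU1,1000
BU2,2000
BU3,3000
EOF
```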
To enable dynamic budget control, proceed in this way:
- Enable one selected Business Unit tag in AWS Billing → Cost allocation tags (e.g. the BusinessUnit tag). By performing this operation, AWS Cost Explorer starts collecting data about BU usage. Collected data is available from the next day onward.
- Make sure the headnode has the required IAM policies described above in its role/profile, to be able to query Cost Explorer.
- On the headnode, place the Budget Control solution scripts and their configuration file in a location reachable by the HPC scheduler server process (which runs as root). They must all be placed in the same location; also check the execution bits of the .sh scripts. E.g.:

  /opt/slurm/budget.control/budget.control.conf
  /opt/slurm/budget.control/budget.control.updater.sh
- Adapt/update the script parameters in budget.control.updater.sh:
  - Specify the selected Business Unit tag key as value for BU_INSTANCE_TAG, e.g. BU_INSTANCE_TAG="BusinessUnit"
  - Retrieve or update the approximate single-core weekly cost from AWS https://aws.amazon.com/ec2/pricing or utility sites like https://instances.vantage.sh. You can select the reference instance type (e.g. r5) and check the weekly price of 2 vCPUs; as a quick pickup, you can take the smallest instance size (r5.large). Set this value as SINGLE_CORE_WEEKLY_COST, e.g. SINGLE_CORE_WEEKLY_COST=23.6880 # r5.large weekly on-demand cost
- Fill in the budgets for each BU in budget.control.conf, using the format <BU>,<Budget>, e.g.

  BU1,1000
  BU2,2000
  BU3,3000
- Add a cron task to execute budget.control.updater.sh every week, e.g. Sunday at 00:00:

  cat > /tmp/budget.control.cron << EOF
  00 00 * * 0 /opt/slurm/budget.control/budget.control.updater.sh
  EOF
  crontab /tmp/budget.control.cron
The simplest integration method uses SLURM® Licenses management (Local Licenses), as described here: https://slurm.schedmd.com/licenses.html
For SLURM®, there's a slightly different version of the updater script, budget.control.updater.slurm.sh, which updates the Licenses setting in slurm.conf and reloads the slurmctld daemon.
So the only SLURM®-specific setup step for budget control is to point the weekly cron at budget.control.updater.slurm.sh instead of the standard budget.control.updater.sh
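For reference, SLURM® Local Licenses matching the example limits above would appear in slurm.conf as a line like the following (values are illustrative, taken from the earlier example):

```
Licenses=BU1:42,BU2:84,BU3:126
```

After changing slurm.conf, the new totals take effect once slurmctld re-reads its configuration (e.g. via scontrol reconfigure), which is what the SLURM® variant of the updater script takes care of.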
The last step to enable budget control is for each HPC job to request the cores it needs from the pool of the BU that owns it.
You can check the current BU cores limits with the SLURM® command:

scontrol show lic

Users must submit their jobs with this sbatch command option:

-L <BU identifier>:<number of cores>
Full example: given this SLURM® Local License configuration:
$ scontrol show lic
LicenseName=BU1
Total=42 Used=0 Free=42 Reserved=0 Remote=no
LicenseName=BU2
Total=84 Used=0 Free=84 Reserved=0 Remote=no
LicenseName=BU3
Total=126 Used=0 Free=126 Reserved=0 Remote=no
and a user belonging to BU2 wants to submit a job requesting 4 cores (2 tasks on each of 2 nodes), the submission must use this syntax:
$ sbatch --ntasks 4 --ntasks-per-node 2 -L BU2:4 jobscript.sh
Submitted batch job 2
$ scontrol show lic
LicenseName=BU1
Total=42 Used=0 Free=42 Reserved=0 Remote=no
LicenseName=BU2
Total=84 Used=4 Free=80 Reserved=0 Remote=no
LicenseName=BU3
Total=126 Used=0 Free=126 Reserved=0 Remote=no
$ scontrol show job 2
JobId=2 JobName=jobscript.sh
UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
Priority=4294901758 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:21 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2022-11-28T13:41:39 EligibleTime=2022-11-28T13:41:39
AccrueTime=2022-11-28T13:41:39
StartTime=2022-11-28T13:44:33 EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-11-28T13:41:40 Scheduler=Main
Partition=parallel AllocNode:Sid=ip-10-0-0-226:19010
ReqNodeList=(null) ExcNodeList=(null)
NodeList=parallel-dy-m5large-[1-2]
BatchHost=parallel-dy-m5large-1
NumNodes=2 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0::
TRES=cpu=4,node=2,billing=4
Socks/Node= NtasksPerN:B:S:C=2:0:: CoreSpec=
MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=BU2:4 Network=(null)
Command=/home/ec2-user/jobscript.sh
WorkDir=/home/ec2-user
StdErr=/home/ec2-user/slurm-2.out
StdIn=/dev/null
StdOut=/home/ec2-user/slurm-2.out
Power=
In this case SLURM® found that there are enough BU2 cores (Licenses) to run the job, and started it.
We described how to introduce a fine-grained, dynamic, and lightweight way to keep business unit budgets under control. The major advantages of this approach are:
- Administrators can manually update budgets, and even cores limits, in case of priority projects. Budgets can be altered directly in the configuration file, while cores limits can be changed in the CSV file.
- Calculated budget history is available as CSV file, enabling further analysis on the overall budgets and limits over time.
- HPC jobs breaking the limit are kept in pending status until more cores are freed. The scheduler also reports the missing free cores as the pending reason to the user.
The proposed logic and solution can be further improved and customized, e.g.:
- Manage cost defaults for more instance types. The updater script can filter the Cost Explorer query according to the job instance type requirement; Administrators should add multiple per-core instance type costs.
- Send email notifications when a daily cost exceeds the budget (potentially also having 2 limits, soft and hard). The headnode sendmail process can be easily integrated with AWS Simple Email Service, as described here.
- Enforce the job submission cores resource requirement, e.g. using SLURM® PrologSlurmctld. You can find more details in the SLURM® prolog and epilog guide.
- Retrieve the 1-core cost automatically by querying the AWS Price List Query APIs.
- Modify the core logic or timings to better adapt to specific HPC environments, e.g. reacting to cost spikes more promptly, or smoothing the introduction of new limits.
- The BU tag, together with others, might be enforced and validated as suggested in the Enforce and validate AWS resource tags post.