Releases: aws/aws-parallelcluster

AWS ParallelCluster v3.9.1

11 Apr 10:42

We're excited to announce the release of AWS ParallelCluster 3.9.1

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

BUG FIXES

  • Remove recursive deletion of shared storage mountdir when unmounting filesystems as part of update-cluster operation.

AWS ParallelCluster v3.9.0

12 Mar 01:27
0303ec9

We're excited to announce the release of AWS ParallelCluster 3.9.0

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Allow updating external shared storage of type Efs, FsxLustre, FsxOntap, FsxOpenZfs and FileCache
    without replacing the compute and login fleets.
  • Allow updating the MinCount, MaxCount, Queue and ComputeResource configuration parameters without
    stopping the compute fleet. These parameters can now be updated by setting Scheduling/SlurmSettings/QueueUpdateStrategy
    to TERMINATE. ParallelCluster terminates only the nodes removed during a resize of the cluster capacity
    performed through a cluster update.
  • Add support for RHEL9.
  • Add support for Rocky Linux 9 as CustomAmi created through build-image process. No public official ParallelCluster Rocky9 Linux AMI is made available at this time.
  • Remove CommunicationParameters from the Custom Slurm Settings deny list.
  • Add the configuration parameter DeploymentSettings/DefaultUserHome to allow users to move the default user's home directory to /local/home instead of /home (default).
  • Add configuration parameter DeploymentSettings/DisableSudoAccessForDefaultUser to disable sudo access of default user in supported OSes.
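
The parameter paths above are written with slashes that follow the nesting of the cluster configuration file. As a rough sketch, the snippet below models a configuration as a nested Python dictionary and resolves Scheduling/SlurmSettings/QueueUpdateStrategy from it. The helper name get_param and the dictionary contents are assumptions made for illustration; they are not part of the pcluster CLI.

```python
from typing import Any


def get_param(config: dict, path: str, default: Any = None) -> Any:
    """Resolve a slash-separated path such as 'Scheduling/SlurmSettings/QueueUpdateStrategy'."""
    node: Any = config
    for key in path.split("/"):
        # Stop early if the path leaves the nested dictionaries.
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Hypothetical fragment of a cluster configuration, mirroring the YAML nesting.
cluster_config = {
    "Scheduling": {
        "SlurmSettings": {"QueueUpdateStrategy": "TERMINATE"},
    },
}

strategy = get_param(cluster_config, "Scheduling/SlurmSettings/QueueUpdateStrategy")
print(strategy)
```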

CHANGES

  • Upgrade Slurm to 23.11.4 (from 23.02.7).
    • Upgrade Pmix to 4.2.9 (from 4.2.6).
  • Add support for Python 3.11, 3.12 in pcluster CLI and aws-parallelcluster-batch-cli.
  • Build network interfaces using network card index from NetworkCardIndex list of EC2 DescribeInstances response,
    instead of looping over MaximumNetworkCards range.
  • Fail cluster creation when using instance types P3, G3, P2 and G2 because their GPU architecture is not compatible with Open Source Nvidia Drivers (OpenRM) introduced as part of 3.8.0 release.
  • Upgrade the default FSx Lustre server version managed by ParallelCluster to 2.15.
  • Upgrade NVIDIA driver to version 535.154.05.
  • Upgrade EFA installer to 1.30.0.
    • Efa-driver: efa-2.6.0-1
    • Efa-config: efa-config-1.15-1
    • Efa-profile: efa-profile-1.6-1
    • Libfabric-aws: libfabric-aws-1.19.0
    • Rdma-core: rdma-core-46.0-1
    • Open MPI: openmpi40-aws-4.1.6-2 and openmpi50-aws-5.0.0-11
  • Upgrade NICE DCV to version 2023.1-16388.
    • server: 2023.1.16388-1
    • xdcv: 2023.1.565-1
    • gl: 2023.1.1047-1
    • web_viewer: 2023.1.16388-1
  • Upgrade ARM PL to version 23.10.
  • Upgrade third-party cookbook dependencies:
    • nfs-5.1.2 (from nfs-5.0.0)

BUG FIXES

  • Refactor IAM policies defined in CloudFormation template parallelcluster-policies.yaml to prevent ParallelCluster API deployment failure caused by policies exceeding IAM limits.
  • Fix an issue that caused jobs to fail when submitted as an Active Directory user from login nodes. The issue was caused by an incomplete configuration of the integration with the external Active Directory on the head node.
  • Fix an issue that caused login nodes to fail to bootstrap when the head node took longer than expected to write keys.

AWS ParallelCluster v3.8.0

19 Dec 17:40

We're excited to announce the release of AWS ParallelCluster 3.8.0

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Add support for EC2 Capacity Blocks for ML.
  • Add support for Rocky Linux 8 as CustomAmi created through build-image process. No public official ParallelCluster Rocky8 Linux AMI is made available at this time.
  • Add Scheduling/ScalingStrategy parameter to control the cluster scaling strategy to use when launching EC2 instances for Slurm compute nodes.
    Possible values are all-or-nothing, greedy-all-or-nothing, best-effort, with all-or-nothing being the default.
  • Add HeadNode/SharedStorageType parameter to use EFS storage instead of NFS exports from the head node root volume
    for intra-cluster shared file system resources: ParallelCluster, Intel, Slurm, and /home data. This enhancement reduces the load on the head node networking.
  • Allow for mounting home as an EFS or FSx external shared storage via the SharedStorage section of the config file.
  • Add new parameter SlurmSettings/MungeKeySecretArn to allow using an external user-defined MUNGE key from AWS Secrets Manager.
  • Add Monitoring/Alarms/Enabled parameter to toggle Amazon CloudWatch Alarms for the cluster.
  • Add head node alarms to monitor EC2 health checks, CPU utilization and the overall status of the head node, and add them to the CloudWatch Dashboard created with the cluster.
  • Add support for Data Repository Associations when using PERSISTENT_2 as DeploymentType for a managed FSx for Lustre.
  • Add Scheduling/SlurmSettings/Database/DatabaseName parameter to allow users to specify a custom name for the database on the database server to be used for Slurm accounting.
  • Make InstanceType an optional configuration parameter when configuring CapacityReservationTarget/CapacityReservationId in the compute resource.
  • Allow specifying a prefix for the IAM roles and policies created by the ParallelCluster API.
  • Allow specifying a permissions boundary to be applied to the IAM roles and policies created by the ParallelCluster API.
  • Add support for il-central-1 region.
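
To make the Scheduling/ScalingStrategy values above concrete, here is a minimal sketch of how a tool might resolve the setting, falling back to the default named in these notes. The function name resolve_scaling_strategy and the dictionary input are assumptions for illustration, not ParallelCluster's own validation code.

```python
# Values listed in the release notes for Scheduling/ScalingStrategy.
ALLOWED_STRATEGIES = ("all-or-nothing", "greedy-all-or-nothing", "best-effort")
DEFAULT_STRATEGY = "all-or-nothing"


def resolve_scaling_strategy(scheduling: dict) -> str:
    """Return the configured strategy, falling back to the documented default."""
    strategy = scheduling.get("ScalingStrategy", DEFAULT_STRATEGY)
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"Unsupported ScalingStrategy: {strategy!r}")
    return strategy


print(resolve_scaling_strategy({}))
print(resolve_scaling_strategy({"ScalingStrategy": "best-effort"}))
```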

CHANGES

  • Upgrade Slurm to 23.02.7 (from 23.02.6).
  • Upgrade NVIDIA driver to version 535.129.03.
  • Upgrade CUDA Toolkit to version 12.2.2.
  • Use Open Source NVIDIA GPU drivers (OpenRM) as NVIDIA kernel module for Linux instead of NVIDIA closed source module.
  • Remove support of all_or_nothing_batch configuration parameter in the Slurm resume program, in favor of the new Scheduling/ScalingStrategy cluster configuration.
  • Change the cluster alarm naming convention to '[cluster-name]-[component-name]-[metric]'.
  • Change default EBS volume types in ADC regions from gp2 to gp3, for both the root and additional volumes.
  • The optional permissions boundary for the ParallelCluster API is now applied to every IAM role created by the API infrastructure.
  • Upgrade EFA installer to 1.29.1.
    • Efa-driver: efa-2.6.0-1
    • Efa-config: efa-config-1.15-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.19.0-1
    • Rdma-core: rdma-core-46.0-1
    • Open MPI: openmpi40-aws-4.1.6-1
  • Upgrade GDRCopy to version 2.4 in all supported OSes, except for CentOS 7 where version 2.3.1 is used.
  • Upgrade aws-cfn-bootstrap to version 2.0-28.
  • Add support for Python 3.10 in aws-parallelcluster-batch-cli.
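
The alarm naming convention '[cluster-name]-[component-name]-[metric]' listed above can be sketched as a small formatting helper. The function name and the example component and metric values are hypothetical; the actual component and metric names used by ParallelCluster are not stated in these notes.

```python
def alarm_name(cluster_name: str, component_name: str, metric: str) -> str:
    """Build a name following the '[cluster-name]-[component-name]-[metric]' convention."""
    return f"{cluster_name}-{component_name}-{metric}"


# Example values are placeholders chosen for illustration only.
print(alarm_name("my-cluster", "HeadNode", "Cpu"))
```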

BUG FIXES

  • Fix inconsistent scaling configuration after cluster update rollback when modifying the list of instance types declared in the Compute Resources.
  • Fix users SSH keys generation when switching users without root privilege in clusters integrated with an external LDAP server through cluster configuration files.
  • Fix disabling Slurm power save mode when setting ScaledownIdletime = -1.
  • Fix hard-coded path to Slurm installation dir in update_slurm_database_password.sh script for Slurm Accounting.

AWS ParallelCluster v3.7.2

13 Oct 19:36

We're excited to announce the release of AWS ParallelCluster 3.7.2

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

CHANGES

  • Upgrade Slurm to 23.02.6.

AWS ParallelCluster v3.7.1

22 Sep 20:15

We're excited to announce the release of AWS ParallelCluster 3.7.1

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

CHANGES

  • Upgrade Slurm to 23.02.5 (from 23.02.4).
    • Upgrade Pmix to 4.2.6 (from 3.2.3).
    • Upgrade libjwt to 1.15.3 (from 1.12.0).
  • Upgrade EFA installer to 1.26.1, fixing RDMA writedata issue in P5.
    • Efa-driver: efa-2.5.0-1
    • Efa-config: efa-config-1.15-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.18.2-1
    • Rdma-core: rdma-core-46.0-1
    • Open MPI: openmpi40-aws-4.1.5-4

AWS ParallelCluster v3.7.0

30 Aug 12:11
79e139b

We're excited to announce the release of AWS ParallelCluster 3.7.0

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Add support for Ubuntu 22. RSA keys are not supported by default. See this page.
  • Add support for login nodes.
  • Add support to mount existing Amazon File Cache as shared storage.
  • Allow configuration of static and dynamic node priorities in Slurm compute resources via the ParallelCluster configuration YAML file.
  • Add a queue-level parameter (JobExclusiveAllocation) to ensure nodes in the partition are exclusively allocated to a single job at any given time.
  • Allow overriding the aws-parallelcluster-node package at cluster creation and update time (only on the head node during update). Useful for development purposes only.
  • Allow memory-based scheduling when multiple instance types are specified for a Slurm Compute Resource.
  • Avoid starting the NFS server on compute nodes.

CHANGES

  • Deprecate Ubuntu 18.
  • Upgrade Slurm to version 23.02.4.
  • Update the default root volume size to 40 GB to account for limits on CentOS 7.
  • Upgrade NVIDIA driver to version 535.54.03.
  • Upgrade CUDA library to version 12.2.0.
  • Upgrade NVIDIA Fabric manager to nvidia-fabricmanager-535.
  • Upgrade NICE DCV to version 2023.0-15487.
    • server: 2023.0.15487-1
    • xdcv: 2023.0.551-1
    • gl: 2023.0.1039-1
    • web_viewer: 2023.0.15487-1
  • Upgrade EFA installer to 1.25.1.
    • Efa-driver: efa-2.5.0-1
    • Efa-config: efa-config-1.15-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.18.1-1
    • Rdma-core: rdma-core-46.0-1
    • Open MPI: openmpi40-aws-4.1.5-4
  • Upgrade ARM PL to version 23.04.1 for Ubuntu 22.04 only.
  • Assign Slurm dynamic nodes a priority (weight) of 1000 by default. This allows Slurm to prioritize idle static nodes over idle dynamic ones.
  • Change the default value of Imds/ImdsSupport from v1.0 to v2.0.
  • Make aws-parallelcluster-node daemons handle only ParallelCluster-managed Slurm partitions.
  • Create a Slurm partition-nodelist mapping JSON file to be used by the node package daemons to recognize PC-managed Slurm partitions and nodelists.
  • Increase EFS-utils watchdog poll interval to 10 seconds. Note: This change is meaningful only if EncryptionInTransit is set to true, because watchdog does not run otherwise.
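
The effect of the default dynamic node weight of 1000 can be illustrated with a short sketch. Slurm prefers nodes with the lowest weight when all else is equal, so any static node with a smaller weight is picked first. The static weight of 1, the node names, and the allocation_order helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

DYNAMIC_NODE_WEIGHT = 1000  # default stated in the release notes
STATIC_NODE_WEIGHT = 1      # assumption for illustration only


@dataclass(frozen=True)
class Node:
    name: str
    weight: int


def allocation_order(idle_nodes: List[Node]) -> List[Node]:
    """Order idle nodes the way a lowest-weight-first policy would pick them."""
    return sorted(idle_nodes, key=lambda node: node.weight)


# Hypothetical node names.
idle = [
    Node("queue1-dy-cr1-1", DYNAMIC_NODE_WEIGHT),
    Node("queue1-st-cr1-1", STATIC_NODE_WEIGHT),
]
print([node.name for node in allocation_order(idle)])
```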

BUG FIXES

  • Add validation to ScaledownIdletime value, to prevent setting a value lower than -1.
  • Fix issue causing dangling IAM policies to be created when creating ParallelCluster CloudFormation custom resource provider with CustomLambdaRole.
  • Fix an issue that was causing misalignment of compute node DNS names on instances with multiple network interfaces
    when SlurmSettings/Dns/UseEc2Hostnames is set to true.
  • Fix cluster creation failure with Ubuntu Deep Learning AMI on GPU instances and DCV enabled.
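
The ScaledownIdletime rule listed in the bug fixes above amounts to a lower-bound check. A minimal sketch, assuming a hypothetical validate_scaledown_idletime helper rather than ParallelCluster's actual validator:

```python
MIN_SCALEDOWN_IDLETIME = -1  # lowest accepted value per the release notes


def validate_scaledown_idletime(value: int) -> int:
    """Reject values lower than -1 and return the value unchanged otherwise."""
    if value < MIN_SCALEDOWN_IDLETIME:
        raise ValueError(
            f"ScaledownIdletime must be {MIN_SCALEDOWN_IDLETIME} or greater, got {value}"
        )
    return value


print(validate_scaledown_idletime(10))
print(validate_scaledown_idletime(-1))
```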

AWS ParallelCluster v3.6.1

05 Jul 14:22

We're excited to announce the release of AWS ParallelCluster 3.6.1

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Add support for Slurm accounting in US isolated regions.

CHANGES

  • Avoid duplication of nodes seen by clustermgtd if compute nodes are added to multiple Slurm partitions.
  • ParallelCluster AMIs for US isolated regions are now vended with preconfigured CA certificates to speed up node bootstrap.
  • Replace nvidia-persistenced service with parallelcluster_nvidia service to avoid conflicts with DLAMI.

BUG FIXES

  • Remove hardcoding of root volume device name (/dev/sda1 and /dev/xvda) and retrieve it from the AMI(s) used during create-cluster.
  • Fix cluster creation failure when using CloudFormation custom resource with ElasticIp set to True.
  • Fix cluster creation/update failure when using CloudFormation custom resource with large configuration files.
  • Fix an issue that was preventing ptrace protection from being disabled on Ubuntu and was not allowing Cross Memory Attach (CMA) in libfabric.
  • Fix fast insufficient capacity fail-over logic when using multiple instance types and no instances are returned.

AWS ParallelCluster v3.6.0

22 May 15:51

We're excited to announce the release of AWS ParallelCluster 3.6.0

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Add support for RHEL8.7.
  • Add a CloudFormation custom resource for creating and managing clusters from CloudFormation.
  • Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file.
  • Build Slurm with support for LUA.
  • Increase the limit on the maximum number of queues per cluster from 10 to 50. Compute resources can be distributed flexibly across the various queues as long as the cluster contains a maximum of 50 compute resources.
  • Allow specifying a sequence of multiple custom action scripts per event for the OnNodeStart, OnNodeConfigured and OnNodeUpdated parameters.
  • Add new configuration section HealthChecks/Gpu for enabling the GPU Health Check in the compute node before job execution.
  • Add support for Tags in the SlurmQueues and SlurmQueues/ComputeResources section.
  • Add support for DetailedMonitoring in the Monitoring section.
  • Add mem_used_percent and disk_used_percent metrics for head node memory and root volume disk utilization tracking on the ParallelCluster CloudWatch dashboard, and set up alarms for monitoring these metrics.
  • Add log rotation support for ParallelCluster managed logs.
  • Track common errors of compute nodes and the longest dynamic node idle time on the CloudWatch Dashboard.
  • Enforce the DCV Authenticator Server to use at least TLS-1.2 protocol when creating the SSL Socket.
  • Install NVIDIA Data Center GPU Manager (DCGM) package on all supported OSes except for aarch64 centos7 and alinux2.
  • Load kernel module nvidia-uvm by default to provide Unified Virtual Memory (UVM) functionality to the CUDA driver.
  • Install NVIDIA Persistence Daemon as a system service.
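
The queue limits described above (at most 50 queues, and at most 50 compute resources across the whole cluster) can be expressed as a simple check. This is only a sketch: check_limits and its input shape (a mapping from queue name to a list of compute resource names) are assumptions, not the validator shipped with ParallelCluster.

```python
from typing import Dict, List

MAX_QUEUES = 50
MAX_COMPUTE_RESOURCES = 50  # total across all queues in the cluster


def check_limits(queues: Dict[str, List[str]]) -> List[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    errors = []
    if len(queues) > MAX_QUEUES:
        errors.append(f"{len(queues)} queues exceed the limit of {MAX_QUEUES}")
    total = sum(len(resources) for resources in queues.values())
    if total > MAX_COMPUTE_RESOURCES:
        errors.append(
            f"{total} compute resources exceed the limit of {MAX_COMPUTE_RESOURCES}"
        )
    return errors


# Two queues with 26 compute resources each: the queue count is fine,
# but the cluster-wide total of 52 compute resources is over the limit.
example = {f"queue{q}": [f"cr{c}" for c in range(26)] for q in range(2)}
print(check_limits(example))
```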

CHANGES

  • Note: 3.6 will be the last release to include support for Ubuntu 18. Subsequent releases will only support Ubuntu from version 20.
  • Upgrade Slurm to version 23.02.2.
  • Upgrade munge to version 0.5.15.
  • Set Slurm default TreeWidth to 30.
  • Set Slurm prolog and epilog configurations to target a directory, /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/ respectively.
  • Set Slurm BatchStartTimeout to 3 minutes to allow up to 3 minutes of Prolog execution during compute node registration.
  • Increase the default RetentionInDays of CloudWatch logs from 14 to 180 days.
  • Upgrade EFA installer to 1.22.1
    • Dkms: 2.8.3-2
    • Efa-driver: efa-2.1.1g
    • Efa-config: efa-config-1.13-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.17.1-1
    • Rdma-core: rdma-core-43.0-1
    • Open MPI: openmpi40-aws-4.1.5-1
  • Upgrade Lustre client version to 2.12 on Amazon Linux 2 (same version available on Ubuntu 20.04, 18.04 and CentOS >= 7.7).
  • Upgrade Lustre client version to 2.10.8 on CentOS 7.6.
  • Upgrade NVIDIA driver to version 470.182.03.
  • Upgrade NVIDIA Fabric Manager to version 470.182.03.
  • Upgrade NVIDIA CUDA Toolkit to version 11.8.0.
  • Upgrade NVIDIA CUDA sample to version 11.8.0.
  • Upgrade Intel MPI Library to 2021.9.0.43482.
  • Upgrade NICE DCV to version 2023.0-15022.
    • server: 2023.0.15022-1
    • xdcv: 2023.0.547-1
    • gl: 2023.0.1027-1
    • web_viewer: 2023.0.15022-1
  • Upgrade aws-cfn-bootstrap to version 2.0-24.
  • Upgrade image used by CodeBuild environment when building container images for AWS Batch clusters, from
    aws/codebuild/amazonlinux2-x86_64-standard:3.0 to aws/codebuild/amazonlinux2-x86_64-standard:4.0 and from
    aws/codebuild/amazonlinux2-aarch64-standard:1.0 to aws/codebuild/amazonlinux2-aarch64-standard:2.0.
  • OpenSSL version 1.1.1 or later is required for ParallelCluster CLI due to a change in urllib3 2.0. Using an older OpenSSL will trigger an ImportError when executing a pcluster command.
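
One way to check the OpenSSL requirement above before running pcluster is to inspect the OpenSSL version that the Python interpreter was built against, using the standard library ssl module. This is a sketch under that assumption; openssl_is_supported is a hypothetical helper and does not reproduce the exact check performed by urllib3.

```python
import ssl

MIN_OPENSSL = (1, 1, 1)  # minimum version named in the release notes


def openssl_is_supported(version_info=ssl.OPENSSL_VERSION_INFO) -> bool:
    """Compare the (major, minor, fix) part of the version tuple with the minimum."""
    return tuple(version_info[:3]) >= MIN_OPENSSL


# Report the OpenSSL build linked into this interpreter.
print(ssl.OPENSSL_VERSION, "supported:", openssl_is_supported())
```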

BUG FIXES

  • Fix EFS, FSx network security groups validators to avoid reporting false errors.
  • Fix missing tagging of resources created by ImageBuilder during the build-image operation.
  • Fix Update policy for MaxCount to always perform numerical comparisons on MaxCount property.
  • Fix an issue that was causing misalignment of compute nodes IP on instances with multiple network interfaces.
  • Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
  • Fix issue causing cfn-hup daemon to fail when it gets restarted.
  • Fix issue causing dangling security groups to be created when creating a cluster with an existing EFS.
  • Fix issue causing NVIDIA GPU compute nodes not to resume correctly after executing an scontrol reboot command.

AWS ParallelCluster v3.5.1

28 Mar 20:11

We're excited to announce the release of AWS ParallelCluster 3.5.1

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Add a new way to distribute ParallelCluster as a self-contained executable shipped with a dedicated installer.
  • Add support for US isolated region us-isob-east-1.

CHANGES

  • Upgrade EFA installer to 1.22.0
    • Efa-driver: efa-2.1.1g
    • Efa-config: efa-config-1.13-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.17.0-1
    • Rdma-core: rdma-core-43.0-1
    • Open MPI: openmpi40-aws-4.1.5-1
  • Upgrade NICE DCV to version 2022.2-14521.
    • server: 2022.2.14521-1
    • xdcv: 2022.2.519-1
    • gl: 2022.2.1012-1
    • web_viewer: 2022.2.14521-1

BUG FIXES

  • Fix an issue where removing shared EBS volumes through a cluster update could cause node launch failures if the MountDir matched the same pattern in /etc/exports.
  • Fix the compute_console_output log file being truncated at every clustermgtd iteration.

AWS ParallelCluster v3.5.0

20 Feb 11:50

We're excited to announce the release of AWS ParallelCluster 3.5.0

Upgrade

How to upgrade?

sudo pip install --upgrade aws-parallelcluster

ENHANCEMENTS

  • Add official versioned ParallelCluster policies in a CloudFormation template to allow customers to easily reference them in their workloads.
  • Add a Python library to allow customers to use ParallelCluster functionalities in their own code.
  • Add logging of compute node console output to CloudWatch on compute node bootstrap failure.
  • Add failures field containing failure code and reason to describe-cluster output when cluster creation fails.

CHANGES

  • Upgrade Slurm to version 22.05.8.
  • Make Slurm controller logs more verbose and enable additional logging for the Slurm power save plugin.
  • Upgrade EFA installer to 1.21.0
    • Efa-driver: efa-2.1.1-1
    • Efa-config: efa-config-1.12-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.16.1amzn3.0-1
    • Rdma-core: rdma-core-43.0-1
    • Open MPI: openmpi40-aws-4.1.4-3

BUG FIXES

  • Fix cluster DB creation by verifying the cluster name is no longer than 40 characters when Slurm accounting is enabled.
  • Fix an issue in clustermgtd that caused compute nodes rebooted via Slurm to be replaced if the EC2 instance status checks fail.
  • Fix an issue where compute nodes could not launch with capacity reservations shared by other accounts because of a wrong IAM policy on head node.
  • Fix an issue where custom AMI creation failed in Ubuntu 20.04 on MySQL packages installation.
  • Fix an issue where pcluster configure command failed when the account had no IPv4 CIDR subnet.
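
The cluster name check mentioned in the first bug fix above can be sketched as follows. The helper name validate_cluster_name and its signature are assumptions for illustration; only the 40-character limit with Slurm accounting enabled comes from these notes.

```python
MAX_NAME_LENGTH_WITH_ACCOUNTING = 40


def validate_cluster_name(name: str, slurm_accounting_enabled: bool) -> None:
    """Raise ValueError if the name is too long while Slurm accounting is enabled."""
    if slurm_accounting_enabled and len(name) > MAX_NAME_LENGTH_WITH_ACCOUNTING:
        raise ValueError(
            f"Cluster name has {len(name)} characters; the maximum is "
            f"{MAX_NAME_LENGTH_WITH_ACCOUNTING} when Slurm accounting is enabled"
        )


validate_cluster_name("hpc-prod", slurm_accounting_enabled=True)  # passes silently
```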