Release notes for Slurm versions in AWS PCS
This topic describes important changes for each Slurm version currently supported in AWS PCS. We recommend you review the changes between the old and new versions when you upgrade your cluster.
Slurm 25.11
Changes implemented in AWS PCS
-
Scheduler audit logs are now delivered separately through the PCS_SCHEDULER_AUDIT_LOGS log type, simplifying troubleshooting and auditing with independent control over log delivery. For more information, see Scheduler audit logs in AWS PCS.
-
Expedited requeue is enabled by default. Jobs that fail due to node issues (such as insufficient capacity errors) can be requeued with the highest scheduling priority using sbatch --requeue=expedite. This is controlled by the SchedulerParameters=enable_expedited_requeue setting.
-
The requeue_delay parameter is available as a custom cluster setting with a default of 5 seconds. Previously, requeue delay was tied to credential expiration (70 seconds). Administrators can now configure this independently via SchedulerParameters=requeue_delay=<seconds>.
-
HealthCheckNodeState now supports the START_ONLY value, which runs the health check program only at node startup (slurmd start).
-
CommunicationParameters=disable_http is set by default to disable HTTP endpoints (metrics and health probes) introduced in Slurm 25.11. To re-enable these endpoints, set CommunicationParameters=enable_http. For more information, see Slurm metrics in AWS PCS.
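The SchedulerParameters options mentioned above are passed to Slurm as one comma-separated value. The following is a minimal sketch of how such a value could be assembled before you supply it as a custom setting. The helper name scheduler_parameters and the 30-second delay are hypothetical, and the parameterName / parameterValue field names are an assumption about the SlurmCustomSetting shape; check the AWS PCS API Reference before relying on them.

```python
def scheduler_parameters(options):
    """Join scheduler options into a single SchedulerParameters value.

    True yields a bare flag, False or None drops the option, and any
    other value is rendered as key=value.
    """
    parts = []
    for key, value in options.items():
        if value is True:
            parts.append(key)
        elif value is False or value is None:
            continue
        else:
            parts.append(f"{key}={value}")
    return ",".join(parts)


# Hypothetical values: keep expedited requeue on and use a 30-second delay.
value = scheduler_parameters(
    {"enable_expedited_requeue": True, "requeue_delay": 30}
)

# Assumed SlurmCustomSetting shape (parameterName / parameterValue).
setting = {"parameterName": "SchedulerParameters", "parameterValue": value}
print(setting)
```

Building the string from a dictionary keeps each option on its own line in source control, which makes later changes to a single option easier to review.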
Known issues
-
Slurm 25.11 validates AllowQOS and DenyQOS partition settings even when AccountingStorageEnforce=QOS is not set. If a QOS referenced in AllowQOS or DenyQOS doesn't exist in the Slurm accounting database, slurmctld exits with a fatal error. Ensure that all QOS values listed in partition AllowQOS and DenyQOS settings exist in the accounting database before upgrading to or restarting Slurm 25.11.
-
The slurmd log may show the error message error: cannot create url_parser context for http_parser/libhttp_parser. This is a known Slurm issue that occurs even when CommunicationParameters=disable_http is set. The error can be safely ignored and doesn't affect cluster operation.
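The QOS check described in the first known issue can be done before an upgrade. The following is a minimal sketch, assuming you have the partition definitions as slurm.conf-style text and a list of QOS names that exist in the accounting database (for example, collected from sacctmgr show qos output). The function name missing_partition_qos is hypothetical. The sketch treats ALL as a wildcard and compares names case-insensitively, both of which are assumptions.

```python
import re

# Matches AllowQOS=... or DenyQOS=... regardless of letter case.
QOS_KEY = re.compile(r"\b(AllowQOS|DenyQOS)=(\S+)", re.IGNORECASE)


def missing_partition_qos(conf_text, known_qos):
    """Return {partition: [qos, ...]} for referenced QOS that are unknown."""
    known = {name.lower() for name in known_qos}
    missing = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line.lower().startswith("partitionname="):
            continue
        partition = line.split()[0].split("=", 1)[1]
        for _key, value in QOS_KEY.findall(line):
            for qos in value.split(","):
                if qos and qos.upper() != "ALL" and qos.lower() not in known:
                    missing.setdefault(partition, []).append(qos)
    return missing


# Hypothetical partition definitions and accounting database contents.
sample_conf = """
PartitionName=batch Nodes=n[1-4] AllowQOS=normal,high
PartitionName=debug Nodes=n5 DenyQOS=low  # short test jobs
"""
result = missing_partition_qos(sample_conf, ["normal", "low"])
print(result)
```

An empty result means every QOS referenced by a partition was found in the supplied list. Any entry in the result names a QOS to create, or to remove from the partition setting, before you upgrade or restart.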
For more information about Slurm 25.11, see the following publications:
-
SchedMD release announcement: https://www.schedmd.com/slurm-version-25-11-0-is-now-available/
-
SchedMD release notes: https://github.com/SchedMD/slurm/blob/slurm-25.11/RELEASE_NOTES.md
Slurm 25.05
Changes implemented in AWS PCS
-
The Slurm SchedulerParameters option requeue_on_resume_failure is now enabled by default.
-
"stderr" was removed as an option for LogTimeFormat, as it was disabled in Slurm 25.05.
-
AWS PCS supports multi-cluster sackd configuration, which lets a login node access multiple clusters.
For more information about Slurm 25.05, see the following publications:
-
SchedMD release announcement: https://www.schedmd.com/slurm-version-25-05-0-is-now-available/
-
SchedMD release notes: https://github.com/SchedMD/slurm/blob/slurm-25-05-0-1/RELEASE_NOTES.md
Slurm 24.11
Changes implemented in AWS PCS
-
AWS PCS supports Slurm accounting. For more information, see Slurm accounting in AWS PCS.
For more information about Slurm 24.11, see the following publications:
Slurm 24.05
Changes implemented in AWS PCS
-
The new Slurm Step Manager module is now enabled by default in AWS PCS. This module provides significant benefits by offloading step management from the central controller to compute nodes, substantially improving system concurrency in environments with heavy step usage. To support this configuration and better isolate Prolog and Epilog process execution, new prolog flags (Contain, Alloc) are enabled.
-
Hierarchical communication from controller to compute nodes is enabled to optimize Slurm intra-node communication, which improves scalability and performance. Additionally, the routing configuration now uses partition node lists for communications from the controller, instead of the plugin's default routing algorithm, enhancing system resiliency.
-
A new hash plugin, HashPlugin=hash/sha3, replaces the previous hash/k12 plugin. This is now enabled by default in AWS PCS clusters.
-
Slurm controller logs now include enhanced auditing capabilities for all inbound remote procedure calls (RPC) to slurmctld. The logs include the source address, authenticated user, and RPC type before connection processing.
For more information about Slurm 24.05, see the following publications:
Slurm settings you can change in AWS PCS
-
SuspendTime defaults to 60. Use the AWS PCS scaleDownIdleTimeInSeconds configuration parameter to set it. For more information, see the scaleDownIdleTimeInSeconds parameter of the ClusterSlurmConfiguration data type in the AWS PCS API Reference.
-
MaxJobCount and MaxArraySize are based on the size you choose for the cluster. For more information, see the size parameter of the CreateCluster API action in the AWS PCS API Reference.
-
The SelectTypeParameters Slurm setting defaults to CR_CPU. You can provide it as a value for slurmCustomSettings to set it when you create a cluster. For more information, see the slurmCustomSettings parameter of the CreateCluster API action and SlurmCustomSetting in the AWS PCS API Reference.
-
You can set Prolog and Epilog at the cluster level. You can provide them as values for slurmCustomSettings when you create a cluster. For more information, see CreateCluster and SlurmCustomSetting in the AWS PCS API Reference.
-
You can set Weight and RealMemory at the compute node group level. You can provide them as values for slurmCustomSettings when you create a compute node group. For more information, see CreateComputeNodeGroup and SlurmCustomSetting in the AWS PCS API Reference.
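The settings in this list can be collected into request fragments before you call CreateCluster or CreateComputeNodeGroup. The following is a minimal sketch that only builds dictionaries and doesn't call AWS. The helper names, script paths, and numeric values are hypothetical, and the parameterName / parameterValue field names are an assumption about the SlurmCustomSetting shape; confirm them in the AWS PCS API Reference.

```python
def custom_settings(params):
    """Convert {name: value} into a list of assumed SlurmCustomSetting entries."""
    return [
        {"parameterName": name, "parameterValue": str(value)}
        for name, value in params.items()
    ]


def cluster_slurm_configuration(idle_seconds, params):
    """Build a cluster-level Slurm configuration fragment."""
    return {
        "scaleDownIdleTimeInSeconds": idle_seconds,
        "slurmCustomSettings": custom_settings(params),
    }


# Cluster level: idle time before scale-down, plus Slurm settings.
# The prolog and epilog paths are placeholders, not AWS PCS defaults.
cluster_config = cluster_slurm_configuration(
    600,
    {
        "SelectTypeParameters": "CR_CPU_Memory",
        "Prolog": "/opt/example/prolog.sh",
        "Epilog": "/opt/example/epilog.sh",
    },
)

# Compute node group level: Weight and RealMemory (values are examples).
node_group_settings = custom_settings({"Weight": 10, "RealMemory": 60000})

print(cluster_config)
print(node_group_settings)
```

Keeping cluster-level and node-group-level settings in separate structures mirrors the split described above, where each setting is accepted only at its own level.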