
Step design

A step is the smallest unit of durability. It runs your code, checkpoints the return value, and retrieves that value for every subsequent replay. How you divide work into steps determines how many checkpoints an execution writes, how retries behave, and how readable the workflow history is in logs and errors.

Name steps meaningfully

The step name shows up in checkpoints, execution history, CloudWatch logs, and errors. Names like step1, process-data, or do-stuff make failures hard to triage. Prefer names like validate-order, charge-payment, and notify-customer that describe the encapsulated logic.

Keep names static. Names are part of the deterministic identity of the step. Including a timestamp or a random ID in the name breaks replay because the same step resolves to a different name on subsequent invocations.

// Stable, descriptive name.
await context.step("validate-order", async () => validateOrder(order));

// Dynamic but deterministic: include the item ID from the input.
await context.step(`save-item-${item.id}`, async () => saveItem(item));

// Wrong: non-deterministic name, changes on replay.
await context.step(`run-${Date.now()}`, async () => doThing());

# Stable, descriptive name.
context.step(validate_order(order), name="validate-order")

# Dynamic but deterministic: include the item ID from the input.
context.step(save_item(item), name=f"save-item-{item['id']}")

# Wrong: non-deterministic name.
context.step(do_thing(), name=f"run-{time.time()}")

// Stable, descriptive name.
context.step("validate-order", ValidationResult.class,
    ctx -> validateOrder(order));

// Dynamic but deterministic: include the item ID from the input.
context.step("save-item-" + item.id(), Void.class,
    ctx -> { saveItem(item); return null; });

Warning

A step name with a timestamp or random value resolves to a different name on replay. Keep names static.
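
The effect of an unstable name can be sketched with a toy model. The ToyContext class below is hypothetical and is not the SDK's implementation: it assumes a step's checkpoint is looked up by name alone, and it uses a counter as a stand-in for time.time() so the run is repeatable.

```python
import itertools
from typing import Any, Callable, Dict, List


class ToyContext:
    """Hypothetical stand-in for a durable context; caches step results by name."""

    def __init__(self, checkpoints: Dict[str, Any]) -> None:
        self.checkpoints = checkpoints  # shared store that survives invocations
        self.executed: List[str] = []   # steps whose body ran in this invocation

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        if name in self.checkpoints:
            return self.checkpoints[name]  # replay: reuse the stored result
        self.executed.append(name)
        result = fn()
        self.checkpoints[name] = result
        return result


def workflow(ctx: ToyContext, order: dict, clock) -> None:
    ctx.step("validate-order", lambda: True)                # static name
    ctx.step(f"save-item-{order['id']}", lambda: "saved")   # derived from input
    ctx.step(f"run-{next(clock)}", lambda: "side effect")   # changes every call


checkpoints: Dict[str, Any] = {}
clock = itertools.count(1000)  # stand-in for time.time(): new value on each call

first = ToyContext(checkpoints)
workflow(first, {"id": 42}, clock)

replay = ToyContext(checkpoints)
workflow(replay, {"id": 42}, clock)

print(first.executed)   # ['validate-order', 'save-item-42', 'run-1000']
print(replay.executed)  # ['run-1001']
```

In this model the two stable names hit their checkpoints on replay, while the timestamped step misses its checkpoint and runs its side effect a second time.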

Single responsibility

Operations that must succeed or fail together belong in the same step. Split unrelated operations into separate steps so each one has a single logical intent. Keep one external API call per step when that call has side effects.

A step that batches three unrelated side effects reruns all three on retry. If the second call fails, the first runs again.

Related reads against the same resource can be batched into one step because they are safe to re-run.

Info

Pure computation rarely benefits from its own step. Deriving a value from data that is already in memory needs neither durability nor retries, and each extra step adds an unnecessary checkpoint.

// Wrong: one step for three unrelated side effects.
await context.step("process-order", async () => {
  await chargePayment(order);
  await sendConfirmationEmail(order);
  await updateInventory(order);
});

// Right: each side effect gets its own step.
await context.step("charge-payment", async () => chargePayment(order));
await context.step("send-confirmation", async () => sendConfirmationEmail(order));
await context.step("update-inventory", async () => updateInventory(order));

# Right: each side effect gets its own step.
context.step(charge_payment(order), name="charge-payment")
context.step(send_confirmation(order), name="send-confirmation")
context.step(update_inventory(order), name="update-inventory")

// Right: each side effect gets its own step.
context.step("charge-payment", Receipt.class,
    ctx -> chargePayment(order));
context.step("send-confirmation", Void.class,
    ctx -> { sendConfirmationEmail(order); return null; });
context.step("update-inventory", Void.class,
    ctx -> { updateInventory(order); return null; });
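
The cost of batching can be illustrated with a toy retry loop. The run_step helper and the three side-effect functions below are hypothetical stand-ins, not SDK calls. The sketch assumes that a retry reruns the whole step body, and it makes the inventory update fail once before succeeding.

```python
from typing import Callable, List

calls: List[str] = []                    # side effects in the order they happened
inventory_failures = {"remaining": 1}    # update_inventory fails once, then works


def charge_payment() -> None:
    calls.append("charge")


def send_confirmation() -> None:
    calls.append("email")


def update_inventory() -> None:
    if inventory_failures["remaining"] > 0:
        inventory_failures["remaining"] -= 1
        raise ConnectionError("inventory service unavailable")
    calls.append("inventory")


def run_step(fn: Callable[[], None], max_attempts: int = 3) -> None:
    """Toy retry loop: rerun the whole step body until it succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            fn()
            return
        except ConnectionError:
            if attempt == max_attempts:
                raise


def batched() -> None:
    charge_payment()
    send_confirmation()
    update_inventory()


# One step for all three side effects.
run_step(batched)
batched_calls = list(calls)

# One step per side effect.
calls.clear()
inventory_failures["remaining"] = 1
for step_body in (charge_payment, send_confirmation, update_inventory):
    run_step(step_body)
split_calls = list(calls)

print(batched_calls)  # ['charge', 'email', 'charge', 'email', 'inventory']
print(split_calls)    # ['charge', 'email', 'inventory']
```

In the batched run the customer is charged and emailed twice because the failing inventory call drags the earlier side effects into the retry. With one step per side effect, only the failing call is repeated.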

Reuse step logic

Define a reusable step function once and reference it repeatedly from the workflow.

Wrap the core logic in a named function and pass it to context.step.

async function validateOrder(order: Order): Promise<ValidationResult> {
  // ...
}

await context.step("validate-order", () => validateOrder(order));

The @durable_step decorator wraps a callable so that calling it with arguments returns a (StepContext) -> T callable that context.step can run.

@durable_step
def validate_order(ctx: StepContext, order: dict) -> dict:
    return run_validation(order)


context.step(validate_order(order))

Define a method and reference it with a lambda.

private ValidationResult validateOrder(Order order) {
    return new Validator().validate(order);
}

context.step("validate-order", ValidationResult.class,
    ctx -> validateOrder(order));
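
The bind-now, run-later pattern behind a reusable step can be sketched in a few lines. The toy_durable_step decorator and ToyStepContext class below are hypothetical and only imitate the shape described above; they are not the SDK's implementation.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToyStepContext:
    """Hypothetical stand-in for StepContext."""
    attempt: int = 1


def toy_durable_step(
    fn: Callable[..., Any],
) -> Callable[..., Callable[[ToyStepContext], Any]]:
    """Bind the call arguments now; run the body later, given a step context."""

    @functools.wraps(fn)
    def bind(*args: Any, **kwargs: Any) -> Callable[[ToyStepContext], Any]:
        def run(ctx: ToyStepContext) -> Any:
            return fn(ctx, *args, **kwargs)

        return run

    return bind


@toy_durable_step
def validate_order(ctx: ToyStepContext, order: dict) -> dict:
    return {"order_id": order["id"], "valid": order["total"] > 0}


# Calling the decorated function does not run the body; it returns a callable.
bound = validate_order({"id": 7, "total": 30})

# The step runner supplies the context when it decides to execute the step.
result = bound(ToyStepContext())
print(result)  # {'order_id': 7, 'valid': True}
```

Deferring execution this way lets the caller decide whether to run the body or return a checkpointed result, which is why the workflow passes validate_order(order) to the step instead of calling the logic directly.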

Step nesting

A step receives a StepContext, not the full DurableContext. A step is the atomic unit the SDK checkpoints. You cannot call other durable operations such as step or wait inside another step. If you need to group several durable operations, use runInChildContext as described in Code organization instead.

// Wrong: calling context.step inside a step callback is invalid.
await context.step("outer", async (ctx) => {
  await context.step("inner", async () => work()); // ERROR
});

// Right: group durable operations in a child context.
await context.runInChildContext("order-pipeline", async (child) => {
  await child.step("validate", async () => validate());
  await child.step("charge", async () => charge());
  return "done";
});

# Wrong: calling context.step inside a step callback is invalid.
@durable_step
def outer(ctx: StepContext, context: DurableContext) -> None:
    context.step(work())  # ERROR: durable operations cannot run inside a step


# Right: group durable operations in a child context.
def order_pipeline(child: DurableContext) -> str:
    child.step(validate())
    child.step(charge())
    return "done"


context.run_in_child_context(order_pipeline, name="order-pipeline")

// Wrong: calling context.step inside a step callback is invalid.
context.step("outer", Void.class, ctx -> {
    context.step("inner", Void.class, c -> { work(); return null; }); // ERROR
    return null;
});

// Right: group durable operations in a child context.
context.runInChildContext("order-pipeline", String.class, child -> {
    child.step("validate", Void.class, c -> { validate(); return null; });
    child.step("charge", Void.class, c -> { charge(); return null; });
    return "done";
});
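
The nesting rule can be modeled with a toy context. ToyDurableContext below is hypothetical and is not the SDK's implementation. How the real SDK reports a nested call is not covered here, so the RuntimeError and the prefixed history names are assumptions made for illustration.

```python
from typing import Any, Callable, List, Optional


class ToyDurableContext:
    """Hypothetical stand-in for DurableContext."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix
        self.history: List[str] = []  # completed operations, in order
        self._in_step = False

    def _reject_if_in_step(self, name: str) -> None:
        # Assumption: the toy signals nesting with RuntimeError.
        if self._in_step:
            raise RuntimeError(f"cannot start {name!r} inside another step")

    def step(self, name: str, fn: Callable[[], Any]) -> Any:
        self._reject_if_in_step(name)
        self._in_step = True
        try:
            result = fn()
        finally:
            self._in_step = False
        self.history.append(self.prefix + name)
        return result

    def run_in_child_context(
        self, name: str, fn: Callable[["ToyDurableContext"], Any]
    ) -> Any:
        self._reject_if_in_step(name)
        child = ToyDurableContext(prefix=f"{self.prefix}{name}/")
        result = fn(child)
        self.history.extend(child.history)
        self.history.append(self.prefix + name)
        return result


context = ToyDurableContext()

# Wrong: a durable operation started from inside a step body.
nested_error: Optional[str] = None
try:
    context.step("outer", lambda: context.step("inner", lambda: "work"))
except RuntimeError as exc:
    nested_error = str(exc)


# Right: group the durable operations in a child context.
def order_pipeline(child: ToyDurableContext) -> str:
    child.step("validate", lambda: None)
    child.step("charge", lambda: None)
    return "done"


pipeline_result = context.run_in_child_context("order-pipeline", order_pipeline)
print(nested_error)     # cannot start 'inner' inside another step
print(context.history)  # ['order-pipeline/validate', 'order-pipeline/charge', 'order-pipeline']
```

The child context keeps each inner step as its own recorded operation, which is what a single enclosing step cannot do.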

Handle errors explicitly

Code inside a step runs under the step's retry strategy. An unhandled error triggers the strategy's decision function. A returned value is checkpointed as the step's result.

Let errors propagate and use the retry strategy's configuration to decide which error types are retryable. Retry transient failures such as network timeouts, rate limits, and 503s with backoff. Fail the step immediately for permanent failures such as invalid input, 404s, and authentication errors.

Match the retry strategy to the work. Fast idempotent calls get tight retries (a handful of attempts, a few seconds of backoff). Long-running calls to third parties get wide retries (many attempts, minutes of backoff). See Retries for presets and configuration options.
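
To make the difference between tight and wide retries concrete, the sketch below computes a capped exponential backoff schedule. The backoff_delays helper and the specific numbers are illustrative assumptions, not SDK presets, and the schedule omits jitter for readability.

```python
from typing import List


def backoff_delays(
    max_attempts: int,
    initial_delay: float,
    max_delay: float,
    rate: float = 2.0,
) -> List[float]:
    """Seconds to wait before each retry; the first attempt has no delay."""
    retries = max_attempts - 1
    return [min(initial_delay * rate**i, max_delay) for i in range(retries)]


# Tight: fast idempotent call, a handful of attempts, seconds of backoff.
tight = backoff_delays(max_attempts=4, initial_delay=1.0, max_delay=5.0)

# Wide: slow third-party call, many attempts, minutes of backoff.
wide = backoff_delays(max_attempts=8, initial_delay=30.0, max_delay=600.0)

print(tight, sum(tight))  # [1.0, 2.0, 4.0] 7.0
print(wide, sum(wide))    # [30.0, 60.0, 120.0, 240.0, 480.0, 600.0, 600.0] 2130.0
```

The tight schedule gives up after about seven seconds of waiting, while the wide one spreads its retries over roughly 35 minutes, long enough to outlast a typical third-party outage.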

List the retryable error classes in the retry strategy configuration.

import { createRetryStrategy } from "@aws/durable-execution-sdk-js";

class TransientApiError extends Error {}
class RateLimitError extends Error {}

// Retry only transient errors. Anything else (bad input, not-found) fails immediately.
const retryStrategy = createRetryStrategy({
  maxAttempts: 5,
  initialDelay: { seconds: 2 },
  retryableErrorTypes: [TransientApiError, RateLimitError],
});

await context.step(
  "call-api",
  async () => externalApi.get(event.id),
  { retryStrategy },
);

List the retryable error classes in the retry strategy configuration.

from aws_durable_execution_sdk_python import durable_step
from aws_durable_execution_sdk_python.config import StepConfig
from aws_durable_execution_sdk_python.retries import (
    RetryStrategyConfig,
    create_retry_strategy,
)
from aws_durable_execution_sdk_python.types import StepContext


class TransientApiError(Exception):
    pass


class RateLimitError(Exception):
    pass


@durable_step
def call_api(ctx: StepContext, record_id: str) -> dict:
    return external_api.get(record_id)


# Retry only transient errors. Anything else (bad input, not-found) fails immediately.
retry_strategy = create_retry_strategy(
    RetryStrategyConfig(
        max_attempts=5,
        retryable_error_types=[TransientApiError, RateLimitError],
    )
)

context.step(
    call_api(event["id"]),
    config=StepConfig(retry_strategy=retry_strategy),
)

Write a RetryStrategy lambda that checks the error type before delegating to a preset for the delay decision.

import java.time.Duration;
import software.amazon.lambda.durable.config.StepConfig;
import software.amazon.lambda.durable.retry.JitterStrategy;
import software.amazon.lambda.durable.retry.RetryDecision;
import software.amazon.lambda.durable.retry.RetryStrategies;
import software.amazon.lambda.durable.retry.RetryStrategy;

static class TransientApiException extends RuntimeException {}
static class RateLimitException extends RuntimeException {}

// Retry only transient errors. Anything else fails immediately.
RetryStrategy retryStrategy = (error, attempt) -> {
    if (!(error instanceof TransientApiException) && !(error instanceof RateLimitException)) {
        return RetryDecision.fail();
    }
    return RetryStrategies.exponentialBackoff(
            5, Duration.ofSeconds(2), Duration.ofMinutes(1), 2.0, JitterStrategy.FULL)
        .makeRetryDecision(error, attempt);
};

context.step(
    "call-api",
    Record.class,
    ctx -> externalApi.get(input.id()),
    StepConfig.builder().retryStrategy(retryStrategy).build());

Warning

Swallowing an exception inside a step hides the failure from the retry strategy and the caller. Let the error propagate and configure the retry strategy to decide.
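
The difference between swallowing and propagating can be seen in a toy step runner. The run_step, make_flaky_api, and swallowing_step helpers below are hypothetical and are not SDK calls; the sketch assumes the runner retries only the listed error types and treats whatever the body returns as the step's result.

```python
from typing import Any, Callable, Dict, Optional, Tuple, Type


class TransientApiError(Exception):
    pass


def run_step(
    fn: Callable[[], Any],
    retryable: Tuple[Type[Exception], ...],
    max_attempts: int = 3,
) -> Tuple[Any, int]:
    """Toy step runner: retry listed errors, return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except retryable:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")


def make_flaky_api(failures: int) -> Callable[[], Dict[str, str]]:
    """Build a fake API call that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def get() -> Dict[str, str]:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientApiError("503")
        return {"status": "ok"}

    return get


def swallowing_step(api: Callable[[], Dict[str, str]]) -> Callable[[], Optional[dict]]:
    def body() -> Optional[dict]:
        try:
            return api()
        except TransientApiError:
            return None  # Wrong: the runner sees a successful step returning None.

    return body


swallowed = run_step(swallowing_step(make_flaky_api(1)), (TransientApiError,))
propagated = run_step(make_flaky_api(1), (TransientApiError,))

print(swallowed)   # (None, 1)
print(propagated)  # ({'status': 'ok'}, 2)
```

The swallowing version finishes on its first attempt with None as its result, so the retry that would have succeeded never happens. The propagating version fails once, retries, and returns the real response.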

See also