Idempotency and Durable Execution

Life Without Idempotency

Have you ever been working on a feature, or fixing a bug, and when you got to the bottom of it, duplicate records were being created in a database, or duplicate charges or orders were sent out? If so, idempotency is an important topic for you.

Idempotency is a property of functions or operations in programming, and it’s a crucial property when working with durable execution systems such as Temporal. In this post, we’ll explain what it means to be idempotent and how idempotency can make your software systems easier to work with. We’ll show examples and get into some complex cases that can be managed with Temporal.

[Figure: Idempotency diagram]

What is Idempotency and Why Does It Matter?

It’s a fact of life that no system is perfectly reliable, and when things fail, we need to retry what we were doing. Wouldn’t it be nice if we could retry safely without causing unintended duplicate effects? Idempotency is what lets us do this! An operation is idempotent if it produces the same result regardless of how many times it is performed.

If an idempotent request is sent once, it will produce the desired result. If it is sent multiple times, it still produces the desired result.

An example idempotent operation is updating a customer’s address in a table in the database because no matter how many times it is called, the final address is always the same in the database. On the other hand, an example non-idempotent operation is foo += 3 because the value of foo changes based on the number of times it is called.
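To make the contrast concrete, here is a minimal Go sketch. The Customer type and both function names are hypothetical and exist only for illustration.

```go
package main

// Customer is a hypothetical record used only for this illustration.
type Customer struct {
	Address string
}

// SetAddress is idempotent: after one call or many calls with the same
// argument, the stored address is the same.
func SetAddress(c *Customer, addr string) {
	c.Address = addr
}

// AddThree is not idempotent: every additional call changes the value again.
func AddThree(foo *int) {
	*foo += 3
}
```

Calling SetAddress twice with the same address leaves the record exactly as a single call would. Calling AddThree twice on a zero value leaves it at 6 rather than 3, which is the kind of duplicate effect a retry can cause.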

A simple, real-life example of idempotency is the stop button in a media player app. You tap the stop button, and the player stops playing. If you tap it again, the player is still stopped. The result is the same: the player stops. You could hit it a hundred times, and it will still be stopped. Therefore, tapping the stop button is idempotent.

[Figures: media player with pause button; media player with play button]

Compare this to a play/pause button: you tap it and the media starts, and then you tap it again and the media pauses. The play/pause button doesn’t always result in the same state, so it is not idempotent.

Idempotency matters because systems are not perfect, and it is nice to be able to retry things safely without causing unintended duplicate effects. In an ideal world, you could call:

chargeCard(paymentInfo, orderId)

multiple times with the same inputs and only make one charge for the order.

If every operation were idempotent, dealing with transient failures could have a simpler developer experience—just keep calling the operation until it succeeds. Because some operations are not inherently idempotent, we have to consider the complexities of duplicate calls, especially if something fails without an obvious complete success or failure.
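The sketch below illustrates that "keep calling until it succeeds" idea under stated assumptions. Gateway, ChargeCard, RetryUntilSuccess, and ErrResponseLost are all hypothetical names, and the gateway is an in-memory stand-in for a payment provider that deduplicates charges by order ID. It is not a real payment API.

```go
package main

import (
	"errors"
	"sync"
)

// ErrResponseLost simulates a failure where the work happened but the
// caller never saw the reply.
var ErrResponseLost = errors.New("response lost")

// Gateway is a hypothetical in-memory payment provider that deduplicates
// charges by order ID.
type Gateway struct {
	mu      sync.Mutex
	charges map[string]int // order ID -> amount charged, in cents
}

// NewGateway returns an empty Gateway.
func NewGateway() *Gateway {
	return &Gateway{charges: make(map[string]int)}
}

// ChargeCard records at most one charge per order ID. A repeated call with
// the same order ID reports success without charging again.
func (g *Gateway) ChargeCard(orderID string, amountCents int) error {
	if amountCents <= 0 {
		return errors.New("amount must be positive")
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, done := g.charges[orderID]; done {
		return nil // already charged for this order
	}
	g.charges[orderID] = amountCents
	return nil
}

// TotalCharged returns the sum of all recorded charges, in cents.
func (g *Gateway) TotalCharged() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	total := 0
	for _, cents := range g.charges {
		total += cents
	}
	return total
}

// RetryUntilSuccess calls op until it returns nil or maxAttempts is reached,
// and returns the last error if every attempt failed.
func RetryUntilSuccess(maxAttempts int, op func(attempt int) error) error {
	err := errors.New("no attempts made")
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(attempt); err == nil {
			return nil
		}
	}
	return err
}
```

If the first attempt charges the card but its response is lost, the caller sees an error and retries. Because ChargeCard is keyed by order ID, the second attempt succeeds without adding a second charge.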

Temporal and “At Least Once” Execution

Temporal allows developers to avoid coding for infrastructure nuances and other inevitable failures, focusing on what matters: coding the process they are trying to implement. It would be nice if you could code up the high-level process, but not worry about the underlying or external systems, right? Except you have to at some point interact with those systems. So we separate out those two concerns into “Workflows” and “Activities” respectively.

Workflow code concerns itself with the higher-level business logic of the application. Activity code focuses on more granular functions such as writing to a database, calling a service, or processes that could be longer running or even blocking in nature.

Temporal is an event-sourced system and each Workflow has its own event history. Temporal will maintain the state of the Workflow and continually progress each Workflow to completion regardless of any failures. In addition, Temporal provides built-in retry logic for Activities, and since Activities are just events in the Workflow history, Temporal is able to guarantee “at least once” execution.

Why the specificity around “at least once”?

Temporal is a distributed system. The Temporal Server orchestrates and maintains state, while Temporal Workers execute Workflows and Activities. Part of what is great about using Temporal is that by default, Activities are retried until they run successfully. But Activity execution is not atomic: a Worker crash, a timeout, or a lost response can leave an Activity partially complete, or complete without Temporal having recorded the result. Temporal therefore recommends that Activities be idempotent or, if they aren’t, that executing an Activity more than once does not cause any unexpected or undesired side effects.

Practices for Idempotency With Temporal

Temporal can provide a guarantee for “at least once” or “at most once” Activity execution, depending on configuration. By default, Activity retries are unlimited, meaning “at least once” execution, but practically speaking, it means as many times as needed. However, limiting maximumAttempts to 1 and using normal Activities (not Local Activities) will guarantee “at most once” execution, meaning zero times is also possible. There are several practices to help make your Activities idempotent.

Idempotency Keys

Using idempotency keys provides a mechanism to understand if an Activity is called more than once, and if so, take the necessary action, for example, skipping execution if it already occurred. Consider this example SubmitOrder Activity:

Activity Code Example

func SubmitOrder(ctx context.Context, orderId int) (Order, error) {
	order, err := ExternalAPI.SubmitOrder(orderId)
	if err != nil {
		return Order{}, err
	}

	return order, nil
}

If ExternalAPI.SubmitOrder doesn’t do anything when receiving duplicate order IDs, then orderId serves as the idempotency key. For example, if the first thing it attempts is inserting a record into a database with a uniqueness constraint on the order ID, a duplicate call changes no data and just returns an error.

If your use case doesn’t already have a value that can be used as an idempotency key, you can use workflowRunId + ‘-’ + activityId. For example, in the TypeScript SDK, it would be this Activity code:

import { info } from '@temporalio/activity'
…
const idempotencyKey = `${info.workflowRunId}-${info.activityId}`

This value will be constant across Activity retries, and unique among all Workflows.
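For readers working in Go, the same key could be built as in the sketch below. IdempotencyKey is a hypothetical helper. In a real Go SDK Activity, the two inputs would come from activity.GetInfo(ctx), using its WorkflowExecution.RunID and ActivityID fields, rather than being passed in by hand.

```go
package main

// IdempotencyKey joins a Workflow Run ID and an Activity ID into a single
// key. Both inputs stay the same across retries of one Activity, so the key
// does too.
func IdempotencyKey(workflowRunID, activityID string) string {
	return workflowRunID + "-" + activityID
}
```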

SQL Example

If your data model doesn’t have a natural place to use the idempotency key, you can create an operations table with a uniqueness constraint:

CREATE TABLE operations (
    idempotency_key VARCHAR(255) UNIQUE NOT NULL
);

-- Example transaction:
BEGIN;

-- Assuming variable declarations are adapted to your SQL dialect
DECLARE @account_id INT = 1;
DECLARE @debit_amount DECIMAL(10,2) = 100.00;
DECLARE @idempotency_key VARCHAR(255) = 'unique-key-123';
DECLARE @operation_inserted BIT = 0; -- Default to not inserted

-- 1. Attempt to insert the operation first, to check for idempotency early
INSERT INTO operations (idempotency_key)
VALUES (@idempotency_key)
ON CONFLICT (idempotency_key) DO NOTHING
RETURNING 1 INTO @operation_inserted;

-- If the operation was inserted, then check balance and proceed
IF @operation_inserted = 1 THEN
    IF EXISTS (
        SELECT 1 FROM accounts
        WHERE account_id = @account_id AND balance >= @debit_amount
    ) THEN
        -- 2. Update the account balance since this is a new operation
        UPDATE accounts
        SET balance = balance - @debit_amount
        WHERE account_id = @account_id;

        -- Commit if update is successful
        COMMIT;
        -- Return or indicate success here, if needed
    ELSE
        -- Insufficient balance, rollback and indicate the issue
        -- Note: Since the operation was new, it's appropriate to indicate insufficient balance
        ROLLBACK;
        RAISE 'Insufficient balance for debit.';
    END IF;
ELSE
    -- Operation was a duplicate, consider it a success
    -- No need to update balance or check it, just end transaction gracefully
    -- Return a success indicator since this is a repeated but originally successful operation
    COMMIT; -- Or simply end transaction if COMMIT is not required here
    RETURN 1; -- Adjust based on how you handle return values or success indicators
END IF;

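In step 1, if the operation has already run, the idempotency key can’t be inserted into the operations table a second time. With the ON CONFLICT ... DO NOTHING clause, the insert is silently skipped and @operation_inserted keeps its default of 0. Without that clause, the database would instead throw something like: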

ERROR: duplicate key value violates unique constraint

This is a short, simple example. In a production use case, you’d likely want to prevent the table from growing forever by:

adding a timestamp field to the operations table, with an index on it, and
periodically pruning entries from operations that are old enough that there’s no longer a risk of duplicate requests arriving.
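The record-then-prune idea can be sketched in memory as follows. The Operations type and its methods are hypothetical stand-ins for the operations table, and timestamps are plain Unix seconds to keep the sketch small. A real implementation would do the same work with SQL statements against the table.

```go
package main

import "sync"

// Operations is a hypothetical in-memory stand-in for the operations table.
// Each idempotency key is stored with the time it was first recorded.
type Operations struct {
	mu   sync.Mutex
	seen map[string]int64 // idempotency key -> Unix time (seconds) first recorded
}

// NewOperations returns an empty Operations store.
func NewOperations() *Operations {
	return &Operations{seen: make(map[string]int64)}
}

// Record returns true if key is new (the caller should do the work) and
// false if it was already recorded (the caller should skip the work).
func (o *Operations) Record(key string, nowUnix int64) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	if _, ok := o.seen[key]; ok {
		return false
	}
	o.seen[key] = nowUnix
	return true
}

// Prune deletes keys recorded more than maxAgeSeconds before nowUnix and
// returns how many were removed.
func (o *Operations) Prune(nowUnix, maxAgeSeconds int64) int {
	o.mu.Lock()
	defer o.mu.Unlock()
	cutoff := nowUnix - maxAgeSeconds
	removed := 0
	for key, recordedAt := range o.seen {
		if recordedAt < cutoff {
			delete(o.seen, key)
			removed++
		}
	}
	return removed
}
```

The maximum age should be comfortably longer than the longest window in which a retry of the same operation could still arrive. Once a key has been pruned, a late duplicate would be treated as new.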

Check Pre-existing Results

An alternative to idempotency keys is checking for pre-existing results. If a CreateUser Activity inserts a user record into a database, the Activity could start out by checking whether a user record already exists with the given email address. If it does, then you know the Activity has already successfully run, and you can return without inserting.

func CreateUser(ctx context.Context, userData UserInput) (User, error) {
	// Check if user was already created
	if ExternalAPI.UserExists(userData.Email) {
		return User{}, temporal.NewApplicationError("Can’t create user: email already exists", "UserAlreadyExists")
	}

	// Create user
	user, err := ExternalAPI.CreateUser(userData)
	if err != nil {
		return User{}, err
	}

	return user, nil
}

Unfortunately, there is a race condition in this code: it is still possible for a duplicate user to be created. (This assumes that ExternalAPI.CreateUser isn’t idempotent by itself, for example because the underlying user database doesn’t have a unique email constraint. If it were, we wouldn’t need to check whether the user exists at all; we could just attempt creation as we did in SubmitOrder above.)

If the first attempt takes a long time to run ExternalAPI.CreateUser—for example, longer than the Start-To-Close Timeout—then a second attempt will be run, and if the second attempt gets past the UserExists() check before the first attempt’s CreateUser() completes, then both the first and second attempts could successfully run CreateUser().

The likelihood of the race condition can be reduced by:

Setting a longer interval between Activity retries.
Responding to Activity Cancellation by closing the current API request and returning.
Setting a timeout on each API request.

The risk of duplication could be removed if there were a way to get a lock on user operations with a given email. If the API had such a GetLock() function, we could use it—but we should really just get them to accept an idempotency key as input 😄.

On the other hand, if we knew we were the only users of the API, we could implement the lock on our end with a lock workflow:

UserLock Workflow is an entity Workflow that runs forever, receives get-lock and release-lock Signals, and sends pass-lock Signals.
Before running the CreateUser Activity, the UserOnboarding Workflow requests a lock on me@email.com’s user operations by sending this request:

Signal-With-Start:
  WorkflowType: UserLock
  WorkflowId: me@email.com
  Signal: 'get-lock'

The UserLock Workflow keeps a queue of lock requests, and sends pass-lock to the Workflow that made the first request, telling it that it now has the lock.
When the UserOnboarding Workflow receives the pass-lock Signal, it runs the CreateUser Activity. Once the Activity completes, it sends the release-lock Signal to the UserLock Workflow.

For a similar code sample, see samples-typescript ▶️ mutex.
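The queueing logic inside such a lock Workflow can be modeled in plain Go. This is a sketch only: LockQueue and its methods are hypothetical, and no Temporal APIs are used. In a real Workflow, GetLock and ReleaseLock would be driven by the get-lock and release-lock Signals, and a non-empty return value names the Workflow that should be sent pass-lock.

```go
package main

// LockQueue is a hypothetical model of the state a UserLock Workflow keeps:
// the current holder plus a first-in, first-out queue of waiters.
type LockQueue struct {
	holder  string   // Workflow Id that currently holds the lock, "" if free
	waiting []string // Workflow Ids waiting for the lock, oldest first
}

// GetLock handles a get-lock request. It returns the requester's Id if the
// requester holds the lock after the call, or "" if the requester is queued.
// Repeated requests from the same Workflow do not add duplicate entries.
func (q *LockQueue) GetLock(requester string) string {
	if q.holder == "" || q.holder == requester {
		q.holder = requester
		return requester
	}
	for _, id := range q.waiting {
		if id == requester {
			return "" // already queued
		}
	}
	q.waiting = append(q.waiting, requester)
	return ""
}

// ReleaseLock handles a release-lock request. It returns the Id of the next
// Workflow to receive pass-lock, or "" if nobody is waiting. Requests from
// a Workflow that does not hold the lock are ignored.
func (q *LockQueue) ReleaseLock(requester string) string {
	if requester != q.holder {
		return ""
	}
	q.holder = ""
	if len(q.waiting) == 0 {
		return ""
	}
	q.holder = q.waiting[0]
	q.waiting = q.waiting[1:]
	return q.holder
}
```

Signals can be delivered more than once, so both handlers are written to tolerate repeats: a second get-lock from the holder or from a queued Workflow changes nothing, and a release-lock from a non-holder is ignored.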

Stateless Activities

Whenever possible, Activities should be stateless, meaning they rely only on their inputs and outputs rather than on any inter-Activity state, such as global variables or memory shared between Activities. Intra-Activity state can be shared across retries of the same Activity by including data in the Activity Heartbeat details.

Activity inputs and outputs are stored in the Workflow’s Event History. Temporal uses the Event History to reconstruct state in the event of certain failures. During replay, Activities that have already been executed are not re-executed. Instead, Temporal simply takes the Activity input and output from the Event History while continuing with Workflow progression. If Activities were to store some internal state, it would be lost during Workflow replay, as already completed Activities would never be re-executed.

If Activity state needs to be stored, it is recommended to rely on a database, context propagation, or a Custom Search Attribute for storing simple state where eventual consistency is tolerable.
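As a rough illustration of carrying progress across retries through Heartbeat details, the sketch below models an Activity body in plain Go. ProcessItems, ErrTransient, and the callback parameters are hypothetical. In the Go SDK, progress would be reported with activity.RecordHeartbeat and restored on the next attempt with activity.GetHeartbeatDetails.

```go
package main

import "errors"

// ErrTransient is a hypothetical error used to simulate a failed attempt.
var ErrTransient = errors.New("transient failure")

// ProcessItems works through items in order. resumeFrom is the index
// restored from the previous attempt's heartbeat details (0 on the first
// attempt). After each item completes, heartbeat is called with the index
// of the next unprocessed item.
func ProcessItems(
	items []string,
	resumeFrom int,
	handle func(item string) error,
	heartbeat func(next int),
) error {
	for i := resumeFrom; i < len(items); i++ {
		if err := handle(items[i]); err != nil {
			// The next attempt resumes from the last recorded heartbeat.
			return err
		}
		heartbeat(i + 1)
	}
	return nil
}
```

A retry can begin from the last recorded index instead of from zero. Because the most recent heartbeat may not have been recorded before a failure, an item can still be handled twice, so handle itself should be idempotent.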

Idempotent Workflows

Activities aren’t the only thing in Temporal that might be executed twice! It’s possible to unintentionally create two Workflow Executions with the same input, unless:

Both Workflows have the same ID, and the second start request is received by the Server while the first is still running.
Both Workflows have the same ID, and while the first Workflow completed before the second started, the Workflow Id Reuse Policy is set to “Reject Duplicate”, and the first Workflow hasn’t exceeded its Retention Period.

In both cases, the Workflow Id acts as an idempotency key, and the Server will return a duplicate error instead of creating the second Workflow.

An example scenario in which a duplicate Workflow will be created is:

A user double-clicks a submit order button on a website.
The website doesn’t disable the button after the first click, and sends two order requests to the backend.
For each request, the backend generates a random Workflow Id and starts the CreateOrder Workflow.

One way to address this is:

On the backend, deterministically generate the Workflow Id from the input.
Set a Reject Duplicate Reuse Policy. (You can skip this if you’re not concerned about a double click so slow that the first Workflow has completed by the time the second starts.)

The input could include the user, the address, the item, and the quantity. The problem with this is that the user won’t be able to make this order again until the retention period is up, even if they wanted to. Imagine not being able to find the amazing scissors you just bought, and then you keep getting an error when you try to reorder them! We could try to fix this by adding the current date to the Workflow Id generation, but then what if the user double clicks right before and after midnight? 😄

This demonstrates the benefit to the idempotency key being generated as close as possible to the origination of a request. In this case, it’s best for the browser to generate an idempotency key when the cart is created, and to send it to the backend with the checkout request, so the backend can use it as the Workflow Id.
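A minimal sketch of that backend behavior, under stated assumptions: Starter, StartCreateOrder, ErrAlreadyStarted, and the create-order- prefix are hypothetical, and the in-memory map stands in for the Server rejecting a second start request that reuses a Workflow Id. With the Go SDK, the derived ID would be passed as the ID field of client.StartWorkflowOptions.

```go
package main

import (
	"errors"
	"sync"
)

// ErrAlreadyStarted is returned when a Workflow Id has been used before.
var ErrAlreadyStarted = errors.New("workflow already started")

// Starter is a hypothetical in-memory stand-in for the duplicate check the
// Server performs on Workflow Ids.
type Starter struct {
	mu      sync.Mutex
	started map[string]bool
}

// NewStarter returns a Starter with no Workflows recorded.
func NewStarter() *Starter {
	return &Starter{started: make(map[string]bool)}
}

// StartCreateOrder starts the CreateOrder Workflow at most once per
// checkout key. checkoutKey is the idempotency key the browser generated
// when the cart was created.
func (s *Starter) StartCreateOrder(checkoutKey string) (string, error) {
	workflowID := "create-order-" + checkoutKey
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.started[workflowID] {
		return workflowID, ErrAlreadyStarted
	}
	s.started[workflowID] = true
	return workflowID, nil
}
```

A double click sends the same checkout key twice, so the second call returns ErrAlreadyStarted. A later reorder of the same item comes from a new cart with a new key, so it starts a new Workflow.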

Avoiding duplicate Workflows is one reason why we recommend using business-meaningful IDs for Workflow Ids. If there’s already a record in a database with an ID, and you’re starting a Workflow with knowledge of that ID, and there should only be one Workflow running related to that record, then use the record ID as the Workflow Id! It will be guaranteed unique and easy to look up 😊.

Idempotency Code Samples

Many services already implement idempotency. Temporal’s Go starter application uses ReferenceID as an idempotency key. In our “charging a credit card” example, Stripe’s API supports idempotency keys for POST requests: stripe.com/docs/api/idempotent_requests

Finally, here is a sample application that implements a simple form of idempotency:

Order Management Sample

If you attempt to process multiple orders with the same order ID, it only processes the first request.

The idempotency check is done in the process order function.

Dealing with non-idempotent functions

What happens when you have to deal with operations that are not idempotent and don’t support idempotency keys or check-then-set? Such questions may lead to many other reflections:

Is it impossible to retry these operations?
Are we doomed to sometimes create duplicate orders/requests/charges?
Are these functions incompatible with Temporal’s programming model?
Should we just assume these operations always work, or have a human take over this part of the process if their execution is unreliable?

These are great questions that might lead to feeling like operations that aren’t idempotent can’t be used in durable systems like Temporal. But there is hope.

One strategy is to configure Temporal to execute Activities at most once. You can do this by setting the maximum attempts for an Activity to 1. If the Activity then fails, the Workflow can handle the failure itself, for example by taking an alternative path or running a compensating Activity.

Temporal Retry Policy Example

ao := workflow.ActivityOptions{
	StartToCloseTimeout: 2 * time.Minute,
	RetryPolicy: &temporal.RetryPolicy{
		MaximumAttempts: 1, // at most once: Temporal will not retry this Activity
	},
}

You can build on the at-most-once strategy to create an idempotent process that retries until it succeeds. This works for an operation that writes data and also supports a read operation.

The existing process:

Calls an operation that is not idempotent (and hopes it always works or always fails completely).

Instead, wrap the non-idempotent operation in a function that:

1. Calls the operation that is not idempotent.
2. Verifies that it succeeded with a read operation (which is idempotent).
3. If the read shows it was not successful, tries again from step 1.

Idempotency by Validation Example (simplified)

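// Assumes CreateTicket is configured with MaximumAttempts: 1, so the Workflow decides when to try again.
for ticketFound := false; !ticketFound; {
	// Attempt the non-idempotent write. An error here does not prove that no ticket was created.
	ticketCreateErr := workflow.ExecuteActivity(ctx, activities.CreateTicket, order.OrderID, reservation, token).Get(ctx, &order.Ticket)
	if ticketCreateErr != nil {
		// Wait before checking whether the write landed anyway.
		if sleepErr := workflow.Sleep(ctx, 15*time.Second); sleepErr != nil {
			return "", sleepErr
		}
	}
	// Verify with an idempotent read whether the ticket now exists.
	err = workflow.ExecuteActivity(ctx, activities.ValidateTicket, order.OrderID, reservation, token).Get(ctx, &order.Ticket)
	if err != nil {
		return "", err
	}

	// A non-empty ticket ends the loop; otherwise CreateTicket is called again.
	ticketFound = len(order.Ticket) > 0
}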

Working with failures in computer programming is challenging. Temporal Activities make this easier by automatically retrying things that fail. Idempotent operations produce the same result, regardless of how many times they are called. Combine automatic retries with idempotency, and dealing with failures gets much, much easier.

By following the practices outlined above, we hope you can be confident that your application will have no unintended consequences from non-idempotent operations, and that you will enjoy coding and supporting critical systems knowing that their operations are idempotent and automatically retried, so they just work, with Temporal.

Links and Further Reading: