Stop Abusing DynamoDB Transactions. Leverage Idempotency Instead

Matteo Agius
3 min read · Sep 15, 2023
Idziorek, J., et al. (2023). Distributed Transactions at Scale in Amazon DynamoDB. USENIX ATC '23. https://www.usenix.org/system/files/atc23-idziorek.pdf

I often see engineers reach for DynamoDB transactions by instinct as soon as a distributed system needs fault tolerance. In this post, I want to challenge that instinct and show how you can often achieve the same result in a safer, more reliable way.

I’ll go through an example showing the approach I took with a recent Golang service I built with AWS at Twilio.

Contention: A Common Problem with Transactions in a Distributed System

Transactions, particularly those involving concurrent operations on the same data, can quickly become a source of complexity and risk.

Many databases, DynamoDB included, use Optimistic Concurrency Control (OCC) for transactions. With OCC, multiple transactions execute concurrently without locking the data they touch; conflicts are detected when a transaction tries to commit rather than prevented up front.

This can be good for performance, but the problem arises when multiple transactions try to modify the same data. Databases like DynamoDB will reject the conflicting transactions with a transaction conflict error. That forces a choice: retry the failed transaction, which increases latency and can cause even more conflicts, or drop it.

Below are transaction conflict errors as returned by the AWS SDK for Go.

operation error DynamoDB: TransactWriteItems, https response error StatusCode:
400, RequestID: 85C5D8KGRPQI6NNLU0F2UDI3JRPP4KQNSO5AEMVJF21Q9ASUAAJG,
TransactionCanceledException: Transaction cancelled, please refer cancellation
reasons for specific reasons [TransactionConflict]
...
operation error DynamoDB: TransactWriteItems, https response error StatusCode:
400, RequestID: 7FU3S31EMG6LB0FDA0S9VNQKL3PP4KQNSO5AEMVJF21Q9ASUAAJG,
TransactionCanceledException: Transaction cancelled, please refer cancellation
reasons for specific reasons [TransactionConflict]
...

This was an issue I could have run into with my service, because other applications also handle the same data I needed to access.

But fault tolerance was still a crucial requirement. My Golang service had to perform several tasks: continuously querying DynamoDB for items, enqueuing them into SQS, updating a separate item’s status, and deleting the original items.

It could be tempting to use transactions here. But ordering your idempotent operations correctly can do the trick.

Idempotency as a Solution

Simply stated, if an operation is idempotent, you can apply it any number of times and the result will be the same as if it had been done only once. It provides robustness against failures and retries, making it a friend of distributed, fault-tolerant systems.

The DynamoDB and SQS clients in the AWS SDK make it simple to take advantage of idempotency. Repeating the same update on an item, or enqueuing the same message, does not return an error to the caller. This lets us order the operations so that if any call fails, retrying causes no harmful data inconsistencies.

Take a look at this simplified Golang pseudocode from my service below.

func poll() {
	...
	for {
		// get all items with status = "Ready"
		items := ddb.QueryItems(StatusReady)

		// enqueue items into SQS; keep only the ones that succeeded
		enqueuedItems := sqs.SendMessages(items)

		// delete each item's corresponding "task" item
		deletedItems := ddb.DeleteItems(enqueuedItems)

		// last step: update each item with status = "Finished"
		ddb.UpdateItems(deletedItems, StatusFinished)
	}
}

Here, any of the above operations can fail, but we maintain fault tolerance since we will just pick up the non-updated items in the next polling iteration.

In the code, each call returns only the items that succeeded, and that result is passed to the next call. We also make sure the status update is always the last call. Combined with each call being idempotent, this works because the AWS SDK client still reports success for an item even when no new action is taken: the exact same query, deletion, enqueue¹, or update returns a 2xx even when no resource has changed. That lets us append the item to an “operation successful” slice in each call, just as if it were the first time.

Returning the items that had successful operations and passing them into the argument of the next idempotent call is a pattern I have found to be very useful. In addition to reliability, it keeps code cleaner since the calls don’t need extra branches to handle returned failures.
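
Here is a runnable sketch of that pattern with a failure injected mid-pipeline. Everything in it is an assumption for illustration: `Apply`, `Step`, `World`, and `PollOnce` are hypothetical names, and plain maps stand in for DynamoDB and a deduplicating SQS queue. It is not the code from my service.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

var errTransient = errors.New("transient failure")

// Step is a hypothetical idempotent per-item operation (enqueue, delete, update).
type Step func(id string) error

// Apply runs step on every item and returns only the items that succeeded,
// so the result can be passed straight into the next step.
func Apply(items []string, step Step) []string {
	ok := make([]string, 0, len(items))
	for _, id := range items {
		if err := step(id); err != nil {
			continue // left as-is; the next polling iteration picks it up
		}
		ok = append(ok, id)
	}
	return ok
}

// World is toy in-memory state standing in for DynamoDB and SQS.
type World struct {
	Status map[string]string // item id -> status
	Tasks  map[string]bool   // item id -> "task" item still exists
	Queue  map[string]bool   // deduplicated set of enqueued ids
}

func NewWorld(ids ...string) *World {
	w := &World{Status: map[string]string{}, Tasks: map[string]bool{}, Queue: map[string]bool{}}
	for _, id := range ids {
		w.Status[id] = "Ready"
		w.Tasks[id] = true
	}
	return w
}

// Ready returns the ids whose status is still "Ready", in sorted order.
func (w *World) Ready() []string {
	var ids []string
	for id, s := range w.Status {
		if s == "Ready" {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

// PollOnce runs one iteration. The delete step fails for failDelete, if set.
func (w *World) PollOnce(failDelete string) {
	enqueue := func(id string) error { w.Queue[id] = true; return nil }
	remove := func(id string) error {
		if id == failDelete {
			return fmt.Errorf("delete %s: %w", id, errTransient)
		}
		delete(w.Tasks, id) // deleting a missing key is a no-op
		return nil
	}
	finish := func(id string) error { w.Status[id] = "Finished"; return nil }

	enqueued := Apply(w.Ready(), enqueue)
	deleted := Apply(enqueued, remove)
	Apply(deleted, finish) // status update is always last
}

func main() {
	w := NewWorld("a", "b")

	w.PollOnce("b") // delete fails for "b" after it was already enqueued
	fmt.Println(w.Status["a"], w.Status["b"], len(w.Queue)) // Finished Ready 2

	w.PollOnce("") // retry: enqueue of "b" is a no-op, delete and update succeed
	fmt.Println(w.Status["a"], w.Status["b"], len(w.Queue)) // Finished Finished 2
}
```

Because the status update comes last, an item that fails anywhere earlier stays "Ready" and is simply picked up again, and the repeated enqueue does not produce a second message.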

Final Thoughts

Idempotency can often offer a simple and elegant alternative to complex and heavy transaction usage in distributed systems.

¹On SQS FIFO queues with deduplication enabled, enqueuing the same message is idempotent only within the 5-minute deduplication interval.
