Asynchronous commands are dangerous

Diego Martin | 03 Oct 2020

CQRS is a simple but powerful concept that promotes, through design, the separation between write models and read models. I have an introductory video on CQRS (in Spanish).

Read operations

The founding principle of CQRS is that a read operation doesn't affect the state of an application, as it's "just" a thread-safe act of consuming information about existing resources. A read model is something optimized for that, without business logic (i.e., there is no domain layer in a query service): if the query is authorized to proceed, the response is the desired information, delivered synchronously.

Queries are synchronous

It's our duty as software architects and developers to ensure read operations are as quick as possible, and therefore the most suitable read models are those that are persisted in a denormalized way, in the same shape in which they are to be consumed. In other words, a good read model should not require data transformation between its retrieval from persistent storage (e.g., a NoSQL database or an in-memory store) and the response to the querier (e.g., an HTTP response).
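
As a rough illustration of that idea, here is a minimal Python sketch of a query handler over a denormalized read model. The store, the document fields and the function name are hypothetical, chosen only for this example; the point is that the stored document already has the shape of the response, so the handler only authorizes, fetches and returns.

```python
import json

# Hypothetical in-memory read store. Each document is persisted in the exact
# shape the client consumes, so no joins or mapping are needed at query time.
READ_STORE = {
    "order-42": {
        "orderId": "order-42",
        "customerName": "Ada",
        "lines": [{"article": "book", "quantity": 2}],
        "total": "31.80 EUR",
    }
}


def handle_get_order(order_id, authorized):
    """Answer a query synchronously: authorize, fetch, return as stored."""
    if not authorized:
        return 403, None
    document = READ_STORE.get(order_id)
    if document is None:
        return 404, None
    # No business logic and no transformation, only serialization.
    return 200, json.dumps(document)
```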

It's understandable that queries are inherently synchronous messages for read-only operations. But what about the write side?

Write operations

CQRS promotes a different, optimized model for write operations. When an actor wants to do something in the system that would cause a change of state, it sends a message, a command, to communicate its write intention.

The state change is generally represented as an event, an immutable fact that contains the delta between the previous state and the new one. Your system may persist the whole state in a traditional CRUD way. Or, in addition to what is there, it may also care about how it got there and persist every single event in an event-sourced mechanism, so that the current state can be inferred at any time from the event streams. Which of the two it does is irrelevant to this article. We can always assume that a command originates zero, one or more events (e.g., no state change, or some state change).
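
To make the "zero, one or more events" idea concrete, here is a small Python sketch. The command, the event and the function names are hypothetical and only serve the illustration: the handler returns an empty list when nothing changes, and the current state can be inferred by replaying the events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameCustomer:
    """Command: an intention that may or may not change state."""
    customer_id: str
    new_name: str


@dataclass(frozen=True)
class CustomerRenamed:
    """Event: an immutable fact carrying the delta between two states."""
    customer_id: str
    old_name: str
    new_name: str


def handle(command, current_name):
    """Return the events a command originates: none if nothing changes."""
    if command.new_name == current_name:
        return []
    return [CustomerRenamed(command.customer_id, current_name, command.new_name)]


def replay(initial_name, events):
    """Infer the current state from the event stream (event-sourced style)."""
    name = initial_name
    for event in events:
        name = event.new_name
    return name
```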

The debate

There is an ongoing debate amongst CQRS practitioners on whether commands are naturally asynchronous, as Udi Dahan states in this video at minute 01:30, or whether, on the contrary, there is no such thing as an asynchronous command because commands are a synchronous concept, as Greg Young suggests here.

If you don't mind, I will go straight to my conclusion. Commands are naturally synchronous, and viewing them as something async is actually dangerous, as I'll elaborate on shortly.

Although my intention is to focus on the message and not to shoot the messenger (even less when the messenger is who he is), I cannot help thinking that Udi Dahan's view on the subject is biased, as he is the creator of NServiceBus, a magnificent piece of software that allows us to take advantage of messaging mechanisms in a scalable, resilient, well-monitored and secure way without worrying about the transport protocol underneath.

In queue-based messaging systems, everything is obviously likely to be asynchronous, since by definition any message can be queued and its delivery re-attempted or re-routed if for whatever reason it fails to reach its destination. In fact, Udi himself states in the previous video that placing a messaging bus in the middle of a request/response communication is a bad idea and that an old-fashioned RPC would be more convenient. So the only option to keep using a messaging bus for write operations too is to conveniently view them as something asynchronous.

Asynchronous commands

Let's dig a bit deeper into this concept of asynchronous commands. To summarize the above paragraphs, we all agree on the definition of a command as a write operation that can cause state changes (i.e., events), but we may disagree on whether these operations are fire-and-forget or something we must await to obtain the result.

Async command is a dangerous concept

The async-commanders (please allow me the silly tags async-commanders vs sync-commanders) know the web world is driven by HTTP, a synchronous protocol. When a browser sends an HTTP request to a server, it awaits a response. An HTTP request can carry a command intention within a POST, PUT, PATCH or DELETE, and the fact that the browser expects an answer to that request does not mean that it expects an answer to the command this HTTP verb carries, just to the verb itself. So the async-commanders' advice is to respond to the HTTP request with a transaction ID for the command. This informs the sender that the intention has been captured and tagged with an identifier that can be used, later on, to query the current state of the transaction or to correlate a future message with the original intention.

That is a valid point. In fact, newer techniques such as WebSockets allow full-duplex communication, so the server can push a notification with the command result directly to the original sender when it's ready, relieving the sender from continuously annoying the server, transaction ID in hand, with a repetitive "are we there yet?" question.

I agree with separating the HTTP message's intention from the command's intention, but that doesn't mean commands are asynchronous operations transported synchronously over the web.

The flow for async-commanders would look similar to the following.

  1. An actor sends an HTTP request towards a system through a gateway. The request carries a command as an intention to change state.
  2. The gateway generates a new transaction ID.
  3. Asynchronously, the gateway places the command in a message bus in a fire-and-forget fashion.
  4. The gateway returns the transaction ID in the HTTP response.
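
The four steps above could be sketched in Python as follows. This is only an illustration under stated assumptions: the standard library queue stands in for a real message bus, the function name is hypothetical, and answering with the 202 Accepted status code is my own choice for the example rather than something the flow prescribes.

```python
import queue
import uuid

# Stand-in for a real message bus (assumption: any durable queue would do).
bus = queue.Queue()


def gateway_handle_request(command):
    """Capture the intention and answer with a transaction ID, not a result."""
    # Step 2: generate a new transaction ID.
    transaction_id = str(uuid.uuid4())
    # Step 3: place the command on the bus, fire-and-forget.
    bus.put({"transactionId": transaction_id, "command": command})
    # Step 4: the HTTP response only acknowledges that the command was queued.
    return 202, {"transactionId": transaction_id}
```

Note that the response says nothing about whether the command will succeed; that is exactly the gap the sender has to deal with.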

Why is an asynchronous command a dangerous concept?

At this stage, the sender has the transaction identifier to correlate a future command result and hopes that it eventually arrives. What does it show on screen? How long would the user like to see a loading icon? Should it instead show the transaction ID and an invitation for the user to check later? Or maybe guess what the command would eventually cause, take the risk, and show it on screen as if it had already happened?

These are questions any business would have to think of. After all, expecting that the specific result for the command will be received shortly is a fallacy.

An important detail is that, at least, the system waits for the command message to be safely placed in the message bus before responding with the transaction ID to the original sender.

The sender expected an answer and we answered with a "thanks for your request, we'll get back to you soon". We had better make an effort to provide the real command result in a timely manner, because the user probably has some expectations already.

The user's intention is safely queued in our message bus, and there's a handler on the other side that receives the command. If there isn't, the message will be delivered as soon as the command handler is ready again. Messaging systems are great at ensuring at-least-once delivery (not so great at ensuring order, though).

Eventually the message is handled: the current state of the system and the business logic are taken into consideration in order to produce the required changes. If everything goes well, the write operation succeeds and the system is ready to notify the original sender of the result. In the best-case scenario, the wait has been just some milliseconds or a few seconds.

Commands can succeed... or not

Commands can be rejected. It's a possibility and therefore we should expect it to happen. Commands are not facts, they are intentions, and they imply certain expectations that could be unrealistic unless the command sender already knows absolutely everything about the current system state (including what other users are doing at the same time) and about all the business logic in place. As I cannot think of any example of such an exceptional case, I stand by the statement that a command is not certain to succeed. If it were, it'd be a fact, an event, not a command.

A command could be rejected due to business logic. Let's think of a rule that states that a shopping cart cannot contain more than 10 articles, so a command to add the eleventh one would have to fail. Of course you could always implement business logic in the UI and skip the domain as the protector of invariants, but that would open a can of worms.
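
A minimal Python sketch of that rule, with the domain acting as the protector of the invariant. The class, exception and event names are hypothetical; what matters is that a synchronous command tells the sender about the rejection immediately.

```python
class CommandRejected(Exception):
    """Raised when the domain refuses a command to protect an invariant."""


MAX_ARTICLES = 10  # the business rule from the example above


class ShoppingCart:
    def __init__(self):
        self.articles = []

    def add_article(self, article):
        """Handle the command synchronously: return an event or reject."""
        if len(self.articles) >= MAX_ARTICLES:
            raise CommandRejected("a cart cannot contain more than 10 articles")
        self.articles.append(article)
        return {"type": "ArticleAdded", "article": article}
```

With an asynchronous command, the eleventh request would have been acknowledged with a transaction ID first, and the rejection would only surface later.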

Even if the original command sender thinks it knows how things stand and avoids sending commands that are likely to fail, in multi-user applications (any web application?) it's basically impossible to know what other commands are being sent at the same time. Think of highly concurrent systems with some optimistic concurrency in place, where commands issued against a stale state can be rejected because another user modified a resource just milliseconds before.
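
One common way to implement that kind of optimistic concurrency is a version check, sketched below in Python. The class and exception names are hypothetical, and a real system would perform the check atomically in its persistence layer; this only illustrates why a command based on stale state gets rejected.

```python
class StaleState(Exception):
    """Raised when a command was issued against an outdated version."""


class VersionedResource:
    def __init__(self, value):
        self.value = value
        self.version = 0

    def update(self, new_value, expected_version):
        """Apply the change only if the sender saw the latest version."""
        if expected_version != self.version:
            raise StaleState(
                f"expected version {expected_version}, current is {self.version}"
            )
        self.value = new_value
        self.version += 1
        return self.version
```

If two users read version 0 and both send an update, the first one succeeds and moves the resource to version 1, while the second is rejected even though it looked perfectly valid to its sender.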

If those arguments are not enough to convince you that commands can be rejected and that it's dangerous to design your system and user experience around the idea that they can't, let's explore other dangers.

Long running processes are implementation details

Another subject where asynchronous commands gained traction is long running processes: transactions that contain multiple operations (e.g., long calculations, RPC with external services, etc.) to eventually produce a successful or an unsuccessful result.

Async-commanders like to view commands as likely to be long running processes. This could be true when, as part of a transaction (i.e., a single command), the business logic must query some external state by using, for example, a third-party service.

They think that an asynchronous command placed in a message bus is a good solution due to its retry logic. If the third-party service is unavailable, the transaction can be put on hold until it becomes available again, at which point the change of state finally happens and the command result is issued (assuming it didn't fail business logic validation).

There's also the concept of process managers (often referred to as sagas) related to these long running processes, where the process is modeled as a state machine that reacts to state changes with new transactions in the form of commands, including compensating actions. They are a cross-aggregate concept, not because they span cross-border transactions, but because they are aware of the different transactional boundaries and are able to coordinate those transactions to produce a meaningful final state.

A user does not talk to process managers; the user doesn't even know they exist. These are things that react to state changes. In other words, a user sends commands that trigger a transaction in a specific bounded context, expecting a result (the sooner the better). Process managers react to events and also have their own state, and they too send commands that trigger new transactions. In any case, the command always precedes the event, and it's a matter of design and business expectations to define what a transaction is and how to account for eventual consistency, if allowed.
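
A process manager can be pictured as a small state machine, as in the following Python sketch. The order scenario, the state names and the event and command names are all hypothetical, picked only to show the shape: events come in, the manager updates its own state, and commands (including a compensating one) go out.

```python
class OrderProcessManager:
    """Reacts to events and answers with the commands to send next."""

    def __init__(self):
        self.state = "awaiting_payment"

    def on_event(self, event):
        if self.state == "awaiting_payment" and event == "PaymentAccepted":
            self.state = "awaiting_shipment"
            return ["ShipOrder"]
        if self.state == "awaiting_payment" and event == "PaymentRejected":
            self.state = "cancelled"
            return ["CancelOrder"]
        if self.state == "awaiting_shipment" and event == "ShipmentFailed":
            self.state = "refunding"
            return ["RefundPayment"]  # compensating action
        if self.state == "awaiting_shipment" and event == "OrderShipped":
            self.state = "completed"
            return []
        # Events that are irrelevant in the current state are ignored.
        return []
```

Each returned command starts its own transaction inside one boundary; the manager only coordinates them.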

Long running processes are implementation details when the end user who wants to perform a write operation doesn't know about them or simply doesn't care. For a user intending to issue a multi-currency invoice, it's probably irrelevant whether the system has to access an external third-party Forex service to query the current exchange rates or the exchange rates are already available in the system. In the first scenario the operation would be a long running process; in the second it wouldn't, because some other background service would deal with keeping the currency exchange information up to date or caching it.
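
The second scenario could look roughly like this in Python. The cache, the rate value and the function names are made up for the example; the idea is that a background job refreshes the rates, so the command itself never waits on the external service and can answer synchronously.

```python
from decimal import Decimal

# Hypothetical local cache of exchange rates, refreshed in the background.
# The rate below is an illustrative value, not real market data.
RATES = {("EUR", "USD"): Decimal("1.10")}


def refresh_rates(fetch):
    """Background job: `fetch` is any callable returning {(src, dst): rate}."""
    RATES.update(fetch())


def issue_invoice(amount, currency, target):
    """Synchronous command: uses local rates only, no external call."""
    rate = RATES.get((currency, target))
    if rate is None:
        raise LookupError(f"no exchange rate for {currency} to {target}")
    return {
        "amount": amount,
        "currency": currency,
        "converted": amount * rate,
        "target": target,
    }
```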

The danger of seeing commands as naturally asynchronous is that it can make us lazy when designing a system. If all the transactions are asynchronous, why bother so much with response times? We could even make the mistake of splitting a monolith into microservices without caring about transactional boundaries and end up suffering the pain of a distributed monolith, where the code, teams and processes remain highly coupled and we get to experience the complexity of distributed systems in addition. After all, if our modules can communicate asynchronously for both commands and events, why bother with bounded contexts?

If we have a distributed system, well-defined transactional boundaries are of vital importance. Thinking of commands as a synchronous concept will help us shape our system in a more maintainable and resilient way. Different aggregates (or microservices, with the exception of process managers which, as we saw, can be seen as a cross-aggregate concept) should not really know about each other. The only dependency between them should be through schemas or contracts, and the communication must happen in an asynchronous way (yes), with and only with events. Thinking of commands as something asynchronous could lead us to make the mistake of having one microservice send commands to another, thereby coupling them, at least conceptually.

If an aggregate or microservice needs another to do something in order to complete its own transaction, there is a design smell there. It doesn't matter if you sweep it under the carpet by making the command between them asynchronous. If you cannot make distributed modules communicate in a way where the messages between them are facts that can never, ever be rejected, rethink your solution.