Reliably Processing Azure Service Bus Topics with Azure Functions

At ASOS, we’re currently migrating some of our message and event processing applications from Worker Roles (within classic Azure Cloud Services) to Azure Functions.
A significant benefit of using Azure Functions to process our messages is the billing model.
As an example, with our current approach, we use a Worker Role to listen to subscriptions to Azure Service Bus topics. These Worker Roles are classes that inherit `RoleEntryPoint` and behave similar to console apps – they’re always running.

Let’s suppose we get 1 message in an hour. We’re billed for the full hour that worker role was running.
With Azure Functions, we would only have been billed for the time it takes for that 1 message to be processed, when it is received by the function.

Another benefit is a huge reduction in code we have to maintain. Most of our queue listener apps are simply listening to a service bus topic, and calling the appropriate internal API, in order to keep downstream systems in-sync within the ASOS microservices architecture.
With Worker Roles, we manage the whole queue listening, deserialization, error handing, retry policies etc… ourselves.
This has resulted in some quite large applications we needed to maintain.

Some of our `ServiceBusTrigger` Azure Functions could look almost as simple as this example:

In this scenario, our function receives the BrokeredMessage in PeekLock mode.
We can change the BrokeredMessage parameter to a string or a typeof our own class – the SDK will deserialize for us, but the underlying behaviour remains the same:
The code within the Run method is then executed, and if there are no exceptions, <code>.Complete()</code> is called on the message.

Fantastic!

This simplicity, however, comes at a cost.

When an unhandled exception is thrown within the function, then <a href="https://docs.microsoft.com/en-us/dotnet/api/microsoft.servicebus.messaging.brokeredmessage.abandon?view=azure-dotnet">.Abandon()</a> is called instead of .Complete() on the brokered message.
Abandon() doesn’t deadletter the message, but instead releases the lock that the function had on it, returning it to the queue, incrementing the DeliveryCount property on that message.
The function will then receive this message again at some point, and will try again.
Emphasis on at some point because this is undeterminable.
It could be instant, or in a few seconds, depending on how the queue is configured, or how many messages there are, and the scale out settings of your function.

If it fails again, it will again increment the DeliveryCount property, call `.Abandon()` on the message again, returning it to the queue.
This will continue until the MaxDeliveryCount is exceeded.
Both the queue, and the subscription can have their own `MaxDeliveryCount` (The default is 10)
Once this is exceeded, the message is finally sent to the dead letter queue.

There’s a few problems with this approach.
For example, what if the exception was because our API returned a 503 – it may be unavailable, but we expect it to return.
In this scenario, we’d like to delay the retry of the message for 1 minute.
Currently, the message will be retried up to 10 times before being sent to the dead letter queue.
Those 10 executions could happen within a second.

Another scenario could be if our API returns a 40x error.
The data we’re sending it is bad. It’s not going to get any better, it’s immutable.
So we should dead letter that message straight away – with a DeadLetterReason so we can log on it, or something else can pick it up and try again, etc…

I naively tried this:

This causes the message to be sent to the dead letter queue, however the WebJobs SDK is still managing the lifecycle of the message, and since we’re not throwing an exception to the function, still tries to call `.Complete()` on the message.
This results in the following exception:

The lock supplied is invalid. Either the lock expired, or the message has already been removed from the queue.

Under the hood, the Azure Function ServiceBus Trigger does something like this rudimentary example:

  • The runtime receives the message from the topic.
  • It then runs your code, passing the message as a parameter.
  • If there’s an exception, it calls .Abandon() on the message.
  • Otherwise, it marks it as complete if there’s no exception (which it assumes is a success)

This might work in very simple cases, but for our case, we needed to have greater control over the behaviour of the message, particularly with retrying.
Fortunately, this was addressed in the following GitHub issue, and this (now completed) pull request.

This allows us to add autoComplete: false to our host.json file, which alters the way messages are handled by the function runtime.
Should the function complete successfully (ie- there were no unhandled exceptions thrown) then the message will not automatically have Complete() called on it.

This does mean we must manage message lifetime ourselves.

Our previous example will now work, allowing us to `.DeadLetterAsync()` a message:

Retrying messages at some point in the future
Taking things a step further, we can also retry messages at some point in the future, by cloning and re-sending to the topic.
Adding an output binding to our function makes this incredibly easy:

In this example, we first call apiClient.PostAsync(thing);
If that is a success, the message is marked as complete (this is important).
If we receive a 400 Bad Request from our API, we dead letter that message.
If we receive a 503 Unavailable from our API, we retry the message in 30 seconds.
We do this by cloning the original message, setting the `ScheduledEnqueueTimeUtc` of that cloned message to 30 seconds from now, and adding to the output queue.
We track number of retries using a custom “RetryCount” property on the message.
If it exceeds 10, we simply dead letter it.

Note: It’s important to complete your messages!
Note my emphasis on `msg.CompleteAsync();` – on both the successful api call, and the retry.
Because we’ve set `autoComplete: false` the runtime is no longer managing the lifetime of our messages.
While testing, I forgot to add this line, which meant the original message wasn’t completed. It wasn’t abandoned either, which meant it went back on the queue. To be retried, again and again and again. The queue had over 40million messages in when I noticed. Whoops.

Conclusion
Azure Functions makes processing queue messages extremely easy, especially if your scenario is a simple in -> out flow with no error handling.
Using the new autoComplete: false setting allows greater control over the lifetime of the messages received.
We can use a ‘clone and resend’ pattern to offer scheduled retries of messages.