How To Set Up SQS Dead-Letter Queues With Alarm Notifications
Welcome to the first installment in our new engineering content series. These articles will be exploring the work our development team is doing on a daily basis. While the site blog touches on many subjects related to Fidel API, you can head to our new Fidel Technology Blog to subscribe exclusively to our engineering content.
Amazon Simple Queue Service (SQS) is a managed service for setting up message queues between distributed components. A queueing system provides a scalable layer between microservices and other distributed components or systems, which makes it essential in modern serverless and event-driven architectures.
However, problems can arise during normal production use, and when they do we need mechanisms that support monitoring and recovery decisions. In this article, we’ll dive into setting up a backup queue for messages that fail to be consumed, plus an alert to make sure such failures are not forgotten.
Author’s Note: For complete documentation on SQS, check the official developer guide. All code snippets below are written as infrastructure as code using AWS CloudFormation. Some parts are summarized or omitted for simplicity.
An SQS queue can be defined by several characteristics, such as being unordered or FIFO, having limits on maximum message sizes, or even on the length of time messages are visible.
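As a rough illustration of those characteristics, the CloudFormation sketch below declares one standard queue and one FIFO queue. The resource names, the queue name, and all property values are assumptions chosen for the example, not recommendations.

```yaml
Resources:
  # Hypothetical standard (unordered) queue
  MyStandardQueue:
    Type: AWS::SQS::Queue
    Properties:
      MaximumMessageSize: 262144      # bytes (256 KiB)
      MessageRetentionPeriod: 345600  # seconds (4 days) before unconsumed messages are dropped
      VisibilityTimeout: 30           # seconds a received message stays hidden from other consumers

  # Hypothetical FIFO queue; FIFO queue names must end in ".fifo"
  MyFifoQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: my-example-queue.fifo
      FifoQueue: true
      ContentBasedDeduplication: true  # deduplicate on a hash of the message body
```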
In practice, a queue works by having producers publish messages to it for consumers to read and process. The happy path is defined by messages being consumed successfully and consequently being removed from the queue.
Capture Consumption Failures
If a consumer fails to process a message, the message becomes visible in the source queue again once its visibility timeout expires and is available for consumption again, where it may keep failing.
If the consumer hits a transient error, for example a temporary network issue, this retry might be fine. However, if the failure is permanent, the message will throw an error on every attempt and will not be deleted from the queue until the queue’s MessageRetentionPeriod elapses. This can be costly both computationally and financially.
The most common and recommended behavior when consumptions fail is to redirect failed messages to a dead-letter queue (DLQ).
A DLQ is a queue defined solely for the purpose of holding messages that could not be processed successfully.
DLQs are declared in the same way as normal queues and are attached to a source queue through its RedrivePolicy, by setting deadLetterTargetArn and maxReceiveCount. The latter should be a finite value that makes sense for the domain: a message can be received, and potentially fail, up to that many times before it is moved to the DLQ.
deadLetterTargetArn: !GetAtt MyDeadLetterQueue.Arn
Two constraints are worth noting: the DLQ must be in the same account and region as the source queue, and it must be of the same type. The DLQ of a FIFO queue must be a FIFO queue and, analogously, the DLQ of a standard queue must be a standard queue.
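Put together, a source queue and its DLQ might be declared roughly as follows. This is a minimal sketch: MySourceQueue, the retention period, and the maxReceiveCount of 3 are assumptions made for illustration, and only MyDeadLetterQueue is a name taken from the snippet above.

```yaml
Resources:
  # The DLQ is an ordinary queue; nothing in its own definition marks it as a DLQ
  MyDeadLetterQueue:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600  # 14 days (assumed), to leave time for investigation

  # Hypothetical source queue that redirects repeatedly failing messages to the DLQ
  MySourceQueue:
    Type: AWS::SQS::Queue
    Properties:
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt MyDeadLetterQueue.Arn
        maxReceiveCount: 3  # assumed value: move to the DLQ after 3 receives without deletion
```

Both queues here are standard queues, which satisfies the same-type constraint. For a FIFO pair, both resources would need FifoQueue set to true.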
Set Up An Alarm Notification
Now that we’re sure our failed messages are not being processed over and over, we need a way to monitor our DLQs. To this end, we can use Amazon CloudWatch, as all queues emit several CloudWatch metrics at one-minute intervals. It’s important to note that count-related metrics on FIFO queues are exact, whereas on Standard queues the counts are approximate due to SQS’s distributed architecture. All SQS metrics are identified solely by the QueueName dimension.
To be notified of activity in the DLQs, since we should expect the DLQ to be always empty, we can create a CloudWatch alarm for the ApproximateNumberOfMessagesVisible metric (number of messages available for retrieval from the queue).
Metrics can be analyzed using several statistical operations. For this specific alarm, we can use Sum with a threshold of 0 and add an alarm action that targets an SNS topic. The topic is notified when the alarm triggers, i.e. when the total number of visible messages is above 0.
This SNS topic can then deliver a notification to any of the available destinations, for example, an email address. The metric, statistical operation, and threshold can be adapted to better fit the domain where it is to be applied.
deadLetterTargetArn: !GetAtt MyDeadLetterQueue.Arn
TopicArn: !Ref MyAlertTopic
- Name: QueueName
- !Ref MyAlertTopic
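The fragments above are excerpts of a larger template. A fuller sketch of the alarm setup described in this section could look like the following. The resource names MyAlertSubscription and MyDeadLetterQueueAlarm, the email address, and the Period, EvaluationPeriods, and TreatMissingData values are assumptions for the example.

```yaml
Resources:
  MyDeadLetterQueue:
    Type: AWS::SQS::Queue

  MyAlertTopic:
    Type: AWS::SNS::Topic

  # Hypothetical email subscription; the address is a placeholder
  MyAlertSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref MyAlertTopic
      Protocol: email
      Endpoint: alerts@example.com

  # Hypothetical alarm that fires as soon as any message is visible in the DLQ
  MyDeadLetterQueueAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Messages are visible in the dead-letter queue
      Namespace: AWS/SQS
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value: !GetAtt MyDeadLetterQueue.QueueName
      Statistic: Sum
      Period: 60                 # seconds (assumed), matching the one-minute metric interval
      EvaluationPeriods: 1       # assumed: alarm on the first breaching period
      Threshold: 0
      ComparisonOperator: GreaterThanThreshold
      TreatMissingData: notBreaching  # assumed: no data is treated as an empty queue
      AlarmActions:
        - !Ref MyAlertTopic
```

Note that an email subscription has to be confirmed by the recipient before SNS starts delivering notifications to it.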
The final architecture for the example depicted ends up being fairly straightforward.
AWS SQS queue with DLQ specified and corresponding CloudWatch alarm and alert topic subscription.
The purpose of the CloudWatch alarm is to create awareness, so a notification is emitted when the alarm changes state, i.e. when it moves from the “OK” state to “In alarm”. The alarm can also be checked and monitored easily from the CloudWatch alarms section of the AWS console.
CloudWatch alarm switching states as messages arrive in the DLQ.
The alarm notification email is received as soon as the alarm is triggered and thus allows us to be notified that something is not working properly. Unfortunately, at this moment, the alarm email does not show which messages are available in the DLQ, and thus, to check them, we will have to set up a monitoring consumer or check them via the AWS console or CLI.
The DLQ messages need to be taken care of so that the number of available messages eventually drops back to the threshold (zero in this example). Only then will the alarm return to the “OK” state and be able to trigger again.
Example email notification for the alarm created.
Notes On Failure Handling
As this topic goes beyond this article’s main objective, we’ll avoid going into too much detail. Still, it’s vital to stress that once the arrival notification is received, messages in a DLQ should be investigated quickly to understand why they were not consumed correctly. That investigation lets us fix the underlying issue where possible and potentially reprocess the failed messages.
It’s possible to create a Lambda that consumes messages from the DLQ and performs whatever custom recovery logic we define, for example simply storing them in DynamoDB for logging purposes or retrying consumption after corrections. An alternative that helps with feeding failed messages back into the source queue is the enhanced DLQ management experience, which allows unconsumed messages to be redriven easily.
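As a hedged sketch of the Lambda option, the template below wires a small Node.js function to the DLQ through an event source mapping. Everything except MyDeadLetterQueue is hypothetical: the role and function names, the runtime version, the batch size, and the placeholder handler, which only logs each message instead of storing or reprocessing it.

```yaml
Resources:
  MyDeadLetterQueue:
    Type: AWS::SQS::Queue

  # Hypothetical execution role; the AWS managed policy grants SQS polling and log permissions
  MyDlqHandlerRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaSQSQueueExecutionRole

  # Hypothetical handler with placeholder recovery logic
  MyDlqHandlerFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: nodejs20.x  # assumed runtime version
      Handler: index.handler
      Role: !GetAtt MyDlqHandlerRole.Arn
      Code:
        ZipFile: |
          // Placeholder: log each failed message for later inspection
          exports.handler = async (event) => {
            for (const record of event.Records) {
              console.log("DLQ message", record.messageId, record.body);
            }
          };

  # Connects the DLQ to the function so Lambda polls the queue on our behalf
  MyDlqEventSourceMapping:
    Type: AWS::Lambda::EventSourceMapping
    Properties:
      EventSourceArn: !GetAtt MyDeadLetterQueue.Arn
      FunctionName: !Ref MyDlqHandlerFunction
      BatchSize: 10  # assumed value
```

One design consequence to keep in mind: when the function returns successfully, Lambda deletes the batch from the DLQ, so the handler should persist or forward anything worth keeping before it returns.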
At Fidel API
At Fidel API, we are AWS-first with all our cloud components and rely heavily on SQS and its features, namely parameter tuning, enhanced monitoring metrics, and the application of failure handling strategies. SQS is one of the mechanisms we rely on to achieve high scalability, as it allows better resource allocation when handling large volumes of processing tasks.
Specifically, we currently use a few hundred queues, both Standard and FIFO depending on the context, with Node.js Lambda producers and consumers. These contexts range from card transaction filtering and enrichment to background file processing and generation.
If you have any questions regarding the topic above, you can join our Developer Community, where Fidel API engineers will be on hand to answer them.
If you are interested in working with SQS, Lambdas, DynamoDB, and other AWS services, and knowing how they support our real-time financial APIs, have a look at our careers page. We’d love to hear from you!