Have you ever got a call in the middle of the night saying “Infrastructure looks ok, but some service is not consuming data / messages get redelivered. Please figure it out”
Redelivered messages are also called “Poison messages”.
Poison messages = Messages that cause a consumer to repeatedly require a delivery (possibly due to a consumer failure) such that the message is never processed completely and acknowledged so that it can be stopped being sent again to the same consumer.
Example: Some message on an arbitrary queue pushed/pulled to or by a consumer. That consumer, for some reason, doesn’t succeed in handling it. It can be due to a bug, an unknown schema, resources issue, etc…
In RabbitMQ, for example, quorum queues keep track of the number of unsuccessful delivery attempts and expose it in the “x-delivery-count” header that is included with any redelivered message. It is possible to set a delivery limit for a queue using a policy argument, delivery-limit. When a message has been returned more times than the limit the message will be dropped or dead-lettered (if a DLX is configured).
It is known to any developer that uses a queue / messaging bus that poison messages should be taken care of, and it’s the developer’s responsibility to do so.
In RabbitMQ, the most common, simple solution would be to enable DLX (dead-lettered queue), but it doesn’t end there.
Recovering a poison message is just the first part, and the developer must also understand what causes this behavior and mitigate the issue.
While there is the classic solution of committing/acknowledging a message as soon as possible, it’s not the best option for use cases requiring ensuring messages are acknowledged only when finished being handled.
Other approaches –
How to handle unacknowledged messages in RabbitMQ
How to handle unacknowledged messages in Apache Kafka
There is no out-of-the-box redelivery/recovery of such messages.
When we started to refine our approach, understand the needs, validate the experience, and craft the value it will bring to our users – three key objectives led our process –
1 – Define the trigger
In Memphis broker, at the SDK level, we use a parameter called “maxMsgDeliveries.”
Once this threshold is reached, the broker will not repeatedly send the “failed-to-ack” message to the same consumer group.
2 – Notification
3 – Identify the consumer group which didn’t acknowledge the message
Instead of going around through logs and multiple consumer groups, we wanted to narrow the finding to the minimum, so a simple click on the “real-time tracing” will lead to a graph screen showing the CGs which passed the redelivery threshold.
4 – Fix and Debug
After the developer understand what went wrong and creates a fix, before pushing the code which will lead to probably more adjustments when new messages will arrive, we have created the “Resend” mechanism which will push the unacknowledged message as many times as needed (Until ACK) to the consumer group that was not able to acknowledge the message in the first place.
The unacknowledged message will be retrieved from Memphis internal DB, ingested into a dedicated station per station, per CG, and only upon request. Next, it will be pushed to the unacknowledged CG – WITHOUT ANY CODE CHANGE, using the same already-configured emit.
That’s it. No need to create a persistency process, cache DB, DLQ, or massy logic. You can be sure no message is lost.
Still under construction, but if you’re interested – sandbox.memphis.dev