RabbitMQ is an open-source message broker. It is used for implementing messaging architecture between different components of your applications. In order to achieve peak application performance in apps that are built around RabbitMQ messaging, continuous monitoring is required to ensure the smooth functioning of services. Several metrics are evaluated to compute the performance of RabbitMQ. But before we get into monitoring details, let’s take a brief look at how RabbitMQ works so that we can better understand how to monitor the performance of RabbitMQ properly.
Initially, RabbitMQ was based on the Advanced Message Queuing Protocol (AMQP), but it now supports several other protocols as well. For an application to function, its different parts or microservices need to communicate with each other. Some components generate messages, and others consume them. RabbitMQ is responsible for routing these messages between the right components.
Let’s go through some basic RabbitMQ terminologies to understand better how it achieves this.
A queue is a buffer for storing messages until a consumer retrieves them. It stores messages in the order they are received.
Producers are microservices or components that generate messages and send them to the message Queue.
Consumers are microservices or components that collect and process messages from the Queue.
The broker is the RabbitMQ server itself. It ensures that messages are transmitted from producers to consumers by managing queues.
An exchange is the RabbitMQ component responsible for routing messages from producers to queues. It operates on the rules defined by its exchange type. Several exchange types are available, such as direct, fanout, and topic.
Binding is the relation between an Exchange and a Queue. Binding defines the rules for routing messages from the Exchange to a specific Queue.
When a producer generates a message, it is sent to an exchange. Then the exchange routes the message to an appropriate queue, based on the binding rules. Then a consumer fetches the message from the queue and processes it.
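The flow above can be sketched as a small in-memory simulation. This is only a toy model of a direct exchange, not how RabbitMQ is implemented; the class name `DirectExchange`, the routing keys, and the messages are all made up for illustration.

```python
from collections import defaultdict, deque


class DirectExchange:
    """Toy direct exchange: routes a message to every queue bound with a matching key."""

    def __init__(self):
        # routing key -> list of queues bound with that key
        self.bindings = defaultdict(list)

    def bind(self, queue, routing_key):
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        # In this sketch, a message with no matching binding is simply dropped.
        targets = self.bindings.get(routing_key, [])
        for queue in targets:
            queue.append(message)
        return len(targets)


# Queues are plain FIFO buffers here.
orders = deque()
invoices = deque()

exchange = DirectExchange()
exchange.bind(orders, "order.created")
exchange.bind(invoices, "invoice.created")

# Producer side: publish to the exchange, never directly to a queue.
exchange.publish("order.created", "order #1")
exchange.publish("order.created", "order #2")
exchange.publish("invoice.created", "invoice #1")

# Consumer side: messages come out in the order they went in.
first_order = orders.popleft()
```

The producer only knows the exchange and a routing key; the bindings decide which queue each message lands in.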
Suppose a producer generates a message and no component or service is available to consume it. In that case, RabbitMQ holds the message in a queue until a consumer becomes available. In this manner, RabbitMQ ensures that messages are delivered even to consumers that are temporarily unavailable.
Most transaction-based infrastructures use messaging services to send transactions downstream. Even a single lost message could mean a lost transaction, which could lead to monetary loss. A message can be lost for several reasons, such as high CPU usage or unacknowledged messages. Monitoring can help you pinpoint the reasons for lost messages or slow server performance.
Usually, an alert system is also used because monitoring alone is not enough. An alert system notifies you when an error or anomaly is about to happen or has already happened. Anomalies could include queues shutting down or messages piling up in a queue. Continuous monitoring helps ensure that problems are identified and resolved quickly.
By monitoring insights on the message broker, we can identify bottlenecks or issues that may affect its performance. Once such issues are identified, they can be removed or rectified to optimize the configuration and performance of the RabbitMQ server.
Monitoring can also help ensure that the message broker is always up and running. The broker will be reliable when messages are being delivered to consumers in a timely manner. If we continuously monitor the health and status of the broker, issues can be quickly detected and resolved before they impact availability.
Monitoring insights provide real-time data about the state of the message broker. This can help in detecting errors and anomalies. By continuously monitoring the state of the server to detect errors and resolving them as soon as they are discovered, we can ensure that the server remains in a healthy and available state.
Monitoring can also help in ensuring compliance with regulatory requirements and industry standards. The insights available can also be used for auditing and reporting purposes.
Monitoring also provides historical data about the performance of the message broker. This data can be used to analyze trends and patterns in the performance graph of the broker, which can be used to fine-tune the message broker further.
The historical performance data used for trend analysis can also be utilized to plan for future capacity and performance requirements.
The most basic way of monitoring RabbitMQ systems is through health checks. Health checks provide information about whether a node is healthy or not. But depending on your system, the definition of a healthy node can vary.
For example, in one system a node might be classified as healthy if the ABC VM is running on it. In another project the criteria might be stricter: the node qualifies as healthy only if the ABC VM is running and the XYZ service is also running on that virtual machine. So not all nodes can be monitored under the same criteria. RabbitMQ allows you to define what qualifies as a healthy node and what does not.
Typically, health checks monitor system-level parameters, which means they provide very limited information. A simple health check only reveals information about the node it was run on. To assess the health of the overall RabbitMQ deployment, you need to collect metrics from all nodes in the cluster.
The metrics that can be collected from the nodes can be classified into three main types: kernel metrics, application metrics, and RabbitMQ metrics.
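To make the idea of project-specific health criteria concrete, here is a minimal sketch that applies a configurable list of checks to every node and reports the result per node. The node snapshots are hypothetical, and the field names (`running`, `mem_alarm`, `disk_free_alarm`) are assumptions chosen for this example.

```python
def node_is_healthy(node, checks):
    """A node is healthy only if every project-specific check passes."""
    return all(check(node) for check in checks)


def cluster_health(nodes, checks):
    """Map each node name to its health so unhealthy nodes are easy to spot."""
    return {node["name"]: node_is_healthy(node, checks) for node in nodes}


# Hypothetical per-node snapshots; field names are assumptions for this sketch.
nodes = [
    {"name": "rabbit@node1", "running": True, "mem_alarm": False, "disk_free_alarm": False},
    {"name": "rabbit@node2", "running": True, "mem_alarm": True, "disk_free_alarm": False},
    {"name": "rabbit@node3", "running": False, "mem_alarm": False, "disk_free_alarm": False},
]

# One project might only require the node to be running...
basic_checks = [lambda n: n["running"]]
# ...while another also requires that no resource alarm is active.
strict_checks = basic_checks + [
    lambda n: not n["mem_alarm"],
    lambda n: not n["disk_free_alarm"],
]

basic = cluster_health(nodes, basic_checks)
strict = cluster_health(nodes, strict_checks)
```

With the basic criteria, `rabbit@node2` counts as healthy; with the stricter criteria it does not, even though nothing about the node changed.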
One important thing to note here is that you will not get much information by monitoring individual metrics. The trick is to monitor multiple metrics together so that you get enough data to ensure that you can debug any issues that have occurred or may occur.
Let’s take a detailed look at the above classifications one by one to understand better how these metrics can help us in monitoring RabbitMQ.
Kernel metrics only focus on nodes. The information they present is not specific to RabbitMQ; it only tells you about the health of a node. But if we collect this information from all the nodes in a cluster, we can assess the overall health of the RabbitMQ cluster.
A few examples of kernel metrics are CPU usage, I/O, and memory. Because these metrics come from the operating system kernel, they are also referred to as system metrics. Monitoring kernel metrics can help explain hold-ups and slow performance. For example, suppose we observe an increase in I/O operations. The reason could be that messages are piling up in the queues. Adding nodes to the cluster is one way to rectify this problem.
While monitoring Kernel metrics, there are certain things that should be kept in mind to ensure maximum system performance. If you are frequently using a monitoring tool to collect these metrics, the tool could end up consuming too much of your system resources. This means fewer resources for RabbitMQ, which could slow down its performance.
Conversely, if you pull metrics too infrequently, that can also cause problems. For example, you could miss important data about resource usage spikes during peak hours and end up unable to determine what caused them. So take a moderate approach to pulling kernel metrics.
Monitoring application metrics can also highlight underlying issues that hinder the performance of RabbitMQ. Both the producer and consumer applications are important in maintaining cluster health. For example, suppose an application fails to maintain a stable connection with RabbitMQ. This can block the message queue, which could cause the cluster to slow down.
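The trade-off can be illustrated with made-up numbers. In the sketch below, a hypothetical CPU series has one reading per second and a 10-second spike; a poller that samples once a minute never sees the spike, while one that samples every 5 seconds does, at the cost of 12 times as many requests.

```python
def sample(series, interval):
    """Return every `interval`-th reading, as a poller with that period would see it."""
    return series[::interval]


# Hypothetical CPU usage (%), one reading per second for two minutes,
# with a 10-second spike starting at t=65s.
cpu = [20] * 120
for t in range(65, 75):
    cpu[t] = 95

coarse = sample(cpu, 60)  # polls at t=0 and t=60
fine = sample(cpu, 5)     # polls at t=0, 5, 10, ...

coarse_peak = max(coarse)
fine_peak = max(fine)
```

Here `coarse_peak` stays at the baseline while `fine_peak` captures the spike, which is why the polling interval should be shorter than the shortest event you care about, but no shorter than necessary.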
Another example could be an application taking extra time to acknowledge that messages have been delivered. This can also affect the performance of the cluster. This is how the applications producing and consuming messages have an effect on the performance of a cluster and why it is important to monitor application metrics to ensure maximum performance.
Analyzing application metrics and RabbitMQ metrics makes it much easier to identify applications or services slowing down the cluster. Some examples of application metrics that you can monitor are Connection Opening and Failure Rates, Message Publishing and Delivery Rates, and Channel Opening Rates.
There are different approaches that can be taken in order to pull these metrics. One method is using the RabbitMQ Java client. Another method is using libraries to pull application metrics from specific frameworks such as using the Spring AMQP library.
RabbitMQ also provides a number of HTTP APIs that can be used to collect application metrics, along with a UI for viewing the metrics collected via those APIs. However, the UI only shows the most recently collected data, which makes it impractical on its own: analyzing the performance of real-world applications often requires data from days or weeks ago.
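One way to keep older data around is to write each collected sample to your own store. The following is a minimal sketch using SQLite from the Python standard library; the table layout, the metric names (`publish_rate`, `deliver_rate`), and the sample values are all hypothetical, and how the samples are fetched from RabbitMQ is left out.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny time-series table for metric samples."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (ts INTEGER, metric TEXT, value REAL)"
    )
    return conn


def record(conn, ts, metrics):
    """Store one snapshot: a dict of metric name -> value taken at time `ts`."""
    conn.executemany(
        "INSERT INTO samples (ts, metric, value) VALUES (?, ?, ?)",
        [(ts, name, value) for name, value in metrics.items()],
    )
    conn.commit()


def history(conn, metric, since_ts):
    """Return (ts, value) pairs for one metric, oldest first."""
    rows = conn.execute(
        "SELECT ts, value FROM samples WHERE metric = ? AND ts >= ? ORDER BY ts",
        (metric, since_ts),
    )
    return rows.fetchall()


store = open_store()
record(store, 1000, {"publish_rate": 120.0, "deliver_rate": 118.0})
record(store, 1060, {"publish_rate": 95.0, "deliver_rate": 96.0})
record(store, 1120, {"publish_rate": 20.0, "deliver_rate": 21.0})

recent = history(store, "publish_rate", since_ts=1060)
```

Passing a file path instead of `":memory:"` would persist the samples across restarts. Dedicated monitoring tools do the same job with retention policies and dashboards on top.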
These metrics are specific to RabbitMQ only. RabbitMQ metrics may be collected at the exchange level, at the node level, at the cluster level, and at the queue level.
We will take a look at all these metrics separately. Some examples of each metric type will also be discussed so that you can better understand how these different metrics can help assess the performance of a RabbitMQ cluster.
These metrics provide information about the performance of an exchange. Monitoring these metrics could highlight if your messages are facing any routing problems.
Let’s take a look at a few examples of exchange-level metrics and how they can point out hidden performance issues.
The first metric is the rate at which messages are published into an exchange, expressed as messages per second.
Let’s say you notice a decrease in the number representing this metric. The reason behind this could be that a producer application is down. Now you can take timely action.
The second metric is the rate at which messages leave an exchange, also expressed as messages per second.
If this number decreases while the incoming rate stays steady, messages may not be reaching any queue, which points to a routing problem. It could also mean that a consumer application is down or taking too long to process messages.
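A drop in either rate is easy to detect once you have two consecutive samples. The sketch below is one possible approach, not a built-in RabbitMQ feature: the field names `publish_in_rate` and `publish_out_rate`, the 50% threshold, and the sample values are assumptions made for illustration.

```python
def rate_dropped(previous_rate, current_rate, max_drop=0.5):
    """True when the rate fell by more than `max_drop` (a fraction) since the last sample."""
    if previous_rate <= 0:
        return False  # nothing meaningful to compare against
    return (previous_rate - current_rate) / previous_rate > max_drop


def diagnose_exchange(previous, current):
    """Compare two samples of an exchange's in/out rates and list what to investigate."""
    findings = []
    if rate_dropped(previous["publish_in_rate"], current["publish_in_rate"]):
        findings.append("publish-in rate dropped: check producers")
    if rate_dropped(previous["publish_out_rate"], current["publish_out_rate"]):
        findings.append("publish-out rate dropped: check routing and consumers")
    return findings


# Hypothetical samples taken one polling interval apart (messages per second).
previous = {"publish_in_rate": 200.0, "publish_out_rate": 200.0}
current = {"publish_in_rate": 60.0, "publish_out_rate": 190.0}

findings = diagnose_exchange(previous, current)
```

In this example only the incoming rate fell by more than half, so the single finding points at the producer side.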
Monitoring node-level metrics is the best place to look if you want to analyze your RabbitMQ resource usage.
Let’s take a look at some node-level metrics.
Memory usage and disk usage are two different metrics. The memory usage metric represents the amount of RAM being consumed by a RabbitMQ node, while the disk usage metric represents the amount of disk space, in bytes, being used by a node.
Both of these metrics need constant monitoring if you want your cluster to stay up. For example, if the disk space available to a node drops below a certain threshold, RabbitMQ raises an alarm. Even if only a single node hits this threshold, all the nodes in the cluster stop accepting messages, and the whole cluster becomes unavailable to publishers. By regularly monitoring disk usage, you can make sure that all nodes have enough storage space so that your cluster remains available.
Similarly, if the RAM usage of a particular node exceeds a threshold, connections that publish messages will be blocked. By regularly monitoring the memory usage of nodes, you can make sure that the threshold is never crossed and you never face choked connections.
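Rather than waiting for an alarm, you can warn when a node gets close to its threshold. The sketch below assumes per-node snapshots with the fields `mem_used`, `mem_limit`, `disk_free`, and `disk_free_limit`; these names and the 20% safety margin are assumptions for this example, and the values are hypothetical numbers in MiB.

```python
def resource_warnings(node, margin=0.2):
    """List resources that are within `margin` (a fraction) of their alarm threshold."""
    warnings = []
    # Memory alarms fire when usage rises above the limit.
    if node["mem_used"] >= node["mem_limit"] * (1 - margin):
        warnings.append("memory")
    # Disk alarms fire when free space falls below the limit.
    if node["disk_free"] <= node["disk_free_limit"] * (1 + margin):
        warnings.append("disk")
    return warnings


# Hypothetical snapshots, values in MiB.
nodes = [
    {"name": "rabbit@node1", "mem_used": 3600, "mem_limit": 4096,
     "disk_free": 51200, "disk_free_limit": 1024},
    {"name": "rabbit@node2", "mem_used": 1000, "mem_limit": 4096,
     "disk_free": 1100, "disk_free_limit": 1024},
]

report = {node["name"]: resource_warnings(node) for node in nodes}
```

Note that the two comparisons point in opposite directions: memory is a problem when usage grows, disk when free space shrinks.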
The number of sockets available and the number of sockets used are also two different metrics, and comparing them can highlight important information.
The number of sockets available represents the total sockets that are available to a node for connections. While the number of sockets used represents the sockets that are currently being used by a node.
If you notice that the difference between these two numbers is decreasing, it means that the node is reaching the maximum number of connections it can support. This is a sign that you should consider scaling your cluster before it stops functioning.
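The comparison can be expressed as a headroom fraction that is tracked over time. This is a minimal sketch; the function names and the sample values are made up for illustration.

```python
def socket_headroom(sockets_used, sockets_total):
    """Fraction of sockets still free (1.0 = all free, 0.0 = exhausted)."""
    if sockets_total <= 0:
        raise ValueError("sockets_total must be positive")
    return (sockets_total - sockets_used) / sockets_total


def headroom_shrinking(samples):
    """True when headroom got strictly smaller at every consecutive sample."""
    values = [socket_headroom(used, total) for used, total in samples]
    return len(values) > 1 and all(
        later < earlier for earlier, later in zip(values, values[1:])
    )


# Hypothetical (sockets_used, sockets_total) readings, oldest first.
readings = [(200, 1000), (450, 1000), (800, 1000)]

shrinking = headroom_shrinking(readings)
remaining = socket_headroom(*readings[-1])
```

A steadily shrinking headroom combined with a low remaining fraction is the signal to plan for scaling.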
The information in these metrics provides an audit of the entire cluster. Let’s take a look at a few of the most monitored cluster-level metrics and what information they can highlight regarding the performance of a RabbitMQ cluster.
The number of connections metric reports how many TCP connections are open to a cluster.
Let’s say you notice a drop in this number while monitoring a cluster. This could highlight that a consumer is down. And you could rectify the problem before it halts any major process or service.
The number of messages metric is the total number of messages that are being published, delivered, acknowledged, or unacknowledged in a RabbitMQ cluster.
Monitoring this metric regularly along with other metrics can highlight important information that might not be apparent. For example, if the number of messages being published and delivered is not equal, it could mean that a consumer is down or that messages are going unacknowledged. This would require you to further investigate if you want to ensure that your cluster does not go down unexpectedly.
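The comparison described above can be sketched with three cumulative counters. The keys `publish`, `deliver`, and `ack` and the sample numbers are assumptions for this example. The sketch also assumes each published message is delivered to exactly one consumer, which does not hold for fanout-style routing.

```python
def summarize_flow(stats, tolerance=0):
    """Compare cumulative publish/deliver/ack counters and describe any gaps."""
    published = stats["publish"]
    delivered = stats["deliver"]
    acked = stats["ack"]
    issues = []
    if published - delivered > tolerance:
        issues.append(f"{published - delivered} published messages not delivered")
    if delivered - acked > tolerance:
        issues.append(f"{delivered - acked} delivered messages not acknowledged")
    return issues


# Hypothetical cluster-wide counters.
lagging = {"publish": 10000, "deliver": 9400, "ack": 9100}
healthy = {"publish": 500, "deliver": 500, "ack": 500}

lagging_issues = summarize_flow(lagging)
healthy_issues = summarize_flow(healthy)
```

A small, stable gap is normal while messages are in flight, so in practice `tolerance` would be set above zero and the gap watched over time.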
The queues in RabbitMQ are used to receive, store, and deliver messages. Monitoring queues is as important as monitoring exchanges, as the queue is the last stop a message will take inside a RabbitMQ server.
Regularly monitoring queue metrics and analyzing any fluctuations can point out resource-related issues that can cause the whole system to shut down. Making sure that your message queues are working as expected will ensure that your RabbitMQ server is always running without any hindrance.
Let’s take a look at some queue metrics and how you can use the information provided by them to assess the performance of your RabbitMQ server.
The ready messages metric represents the count of messages that are ready to be delivered to consumers.
Suppose you notice an increase in this number while monitoring. The cause is most likely on the consumer side. One reason could be that a consumer is down. Another could be that message processing is taking too long, so the consumer is slow to accept new messages and the backlog grows. Knowing this, you can start rectifying the issue before it causes production to halt.
The unacknowledged messages metric represents the count of messages that have been delivered from a queue but for which the queue has not yet received an acknowledgment from the consumer.
Suppose that while monitoring you notice this number increasing. The most likely reason is an unresponsive consumer.
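Both queue metrics are most useful as trends rather than single readings. The sketch below compares the oldest and newest sample in a window; the function name, the `min_growth` threshold, and the sample values are assumptions for illustration.

```python
def classify_queue(samples, min_growth=100):
    """samples: list of (ready, unacknowledged) counts over time, oldest first."""
    first_ready, first_unacked = samples[0]
    last_ready, last_unacked = samples[-1]
    findings = []
    if last_ready - first_ready >= min_growth:
        findings.append("ready backlog growing: consumers down or too slow")
    if last_unacked - first_unacked >= min_growth:
        findings.append("unacknowledged count growing: consumers not acknowledging")
    return findings


# Hypothetical readings for one queue across three polling intervals.
backlog = [(10, 5), (150, 5), (900, 6)]
stuck = [(10, 5), (12, 400), (11, 950)]

backlog_findings = classify_queue(backlog)
stuck_findings = classify_queue(stuck)
```

The first queue is accumulating ready messages, while the second is delivering messages that never get acknowledged, so the two cases call for different follow-up.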
Now that we’ve covered numerous RabbitMQ monitoring metrics and what they represent, we are better equipped to get the best out of this open-source message broker. But continuous RabbitMQ monitoring is resource and time intensive. Also, human negligence can lead to issues being ignored.
To avoid all these problems, you can decide to go with a modern message broker that has the ability to self-monitor. One such broker is Memphis. It can autonomously run health checks and assess its own performance, instead of someone having to go through each metric one by one.
By deploying periodic self-checks and proactive rebalancing tasks, Memphis can automatically make sure that it is up and running 24/7 in an optimized form.
The first tool that we are going to take a look at is the built-in CLI called RabbitMQ Diagnostics.
RabbitMQ Diagnostics (rabbitmq-diagnostics) is a built-in RabbitMQ command-line tool. It provides very basic monitoring functionality. You can use basic commands such as “ping” and “status” to check very specific aspects of a node.
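If you want to run these commands from a script, a thin wrapper is enough. The sketch below assumes `rabbitmq-diagnostics` is on the PATH of the machine running the script and that a zero exit code means the check passed; the helper names `run_check` and `run_checks` are made up for this example, and the `runner` parameter exists only so the function can be exercised without a live node.

```python
import subprocess


def run_check(check, runner=subprocess.run):
    """Run one rabbitmq-diagnostics subcommand; return (passed, output)."""
    result = runner(
        ["rabbitmq-diagnostics", "-q", check],  # -q suppresses informational banners
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout.strip()


def run_checks(checks, runner=subprocess.run):
    """Run several subcommands and map each one to True (passed) or False."""
    return {check: run_check(check, runner)[0] for check in checks}
```

On a host with RabbitMQ installed, `run_checks(["ping", "status"])` would report whether each command succeeded against the local node.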
If you are getting started with RabbitMQ monitoring, this tool is the best place to start.
Sematext monitoring software collects a wide range of metrics of different systems and applications. And as we have discussed earlier, monitoring sets of metrics is generally the best practice to highlight underlying issues. This is the main reason we recommend Sematext.
Sematext can easily integrate with a RabbitMQ cluster and collect all the metrics provided by RabbitMQ along with all the system metrics.
This tool comes with a powerful visualization tool. This helps when all these different metrics are collected. You can build custom dashboards to plot metrics that are crucial to your business.
|Pros|Cons|
|---|---|
|Easy to set up and comes with automatic detection of RabbitMQ installations.|There are no annual pricing plans available.|
|You can configure which metrics to collect.|Needs more integration with security tools.|
|Pre-defined alert rules.| |
Prometheus is a widely used metrics collection tool, and Grafana is a well-known visualization tool. Prometheus can be hosted anywhere, on-premises or in the cloud, and it can be integrated into applications written in most programming languages.
As Prometheus is not primarily a visualization tool, Grafana is used to visualize the metrics collected from the system.
|Pros|Cons|
|---|---|
|Both tools can be self-hosted.|Needs a lot of maintenance if deployed on-premises.|
|Both are open source.|Setting up both tools takes a considerable amount of time.|
|This combination of tools is highly customizable.|No alerting is available by default; another tool is required for this.|
Comprehensive RabbitMQ monitoring is crucial to ensure the availability, stability, and performance of the RabbitMQ server. Monitoring also helps maintain the reliability of the messaging infrastructure which is crucial for the smooth functioning of your system.
But you will need to spend significant time and resources on RabbitMQ monitoring. If you are looking to save a lot of both, take a look at Memphis. Memphis uses several automatic techniques, such as periodic self-checks, proactive rebalancing tasks, and fencing users from system misuse, to ensure that the server is performing well. This gives Memphis an edge over traditional message brokers like RabbitMQ.