Among the open-source projects my college buddies (and my future co-founders of memphis.dev) and I built, you can find “Makhela”, a Hebrew word for choir.
For the sake of simplicity, we will use “Choir”.
“Choir” was an open-source OSINT (open-source intelligence) project written in Python. It focused on gathering context-based connections between social profiles using topic-modeling techniques such as LDA, in order to explain what the world discusses within a specific domain, what the high-ranking influencers in that domain are saying, and what is going on at the margins. For the proof-of-concept (MVP) we used a single data source that was fairly easy to integrate with: Twitter.
The graph below was the “brain” behind “Choir”. The brain autonomously grows and analyzes new vertices and edges based on incremental changes in the corpus and freshly ingested data.
Each vertex represents a profile (a persona), and each edge shows who connects to whom. Vertices with a similar color share a similar topic:
Purple = Topic 1
Blue = Topic 2
Yellow = Marginal topic
After a reasonable amount of research, dev time, and a lot of troubleshooting and debugging, things started to look good.
Among the issues we needed to solve were:
As with any startup or early-stage project, we built “Choir” as an MVP, working solely with Twitter, and it looked like this –
The “Collector” is a monolithic application written in Python that collects and refines the data for analysis and visualization. It runs in batches on a static schedule, every couple of hours.
However, as the collected data and its complexity grew, problems started to arise. Each batch-processing cycle took hours, which the volume of collected data (hundreds of megabytes at most!) did not justify. More on the rest of the challenges in the next sections.
Fast forward a few months, and users started to use “Choir”!
Not just using, but engaging, paying, and raising feature requests.
Any creator’s dream!
But then it hit us.
(a) “Twitter” is not the center of the universe, and we need to expand “Choir” to more sources.
(b) Any minor change in the code breaks the entire pipeline.
(c) A monolith is a death sentence for a data-driven app, performance-wise.
As with every eager-to-run project that is starting to get good traction, fueling that growth and user base is your number 1, 2, and 3 priority, and the last thing you want to do at this point is to go back and rebuild your framework. You want to continue the momentum.
With that spirit in mind, we said “Let’s add more data sources and refactor in the future”. A big mistake indeed.
While there is no quick fix, what we can do is build a framework to support such requirements.
Option 1 – Duplicate the entire existing process to another source, for example, “Facebook”.
In addition to duplicating the collector, we needed to –
And the list goes on…
As a result, it can’t scale.
Option 2 – Here it comes. Using a message broker!
I want to draw a baseline first. A message broker is not the solution in itself, but a supporting framework: a tool that enables branched, growing data-driven architectures.
What is a message broker?
“A message broker is an architectural pattern for message validation, transformation, and routing. It mediates communication among applications, minimizing the mutual awareness that applications should have of each other in order to be able to exchange messages, effectively implementing decoupling.” (Wikipedia)
Firstly, let’s translate it to something we can grasp better.
A message broker is a temporary data store. Why temporary? Because each piece of data within it will be removed after a certain time, defined by the user. The pieces of data within the message broker are called “messages”. Each message usually weighs a few bytes to a few megabytes.
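To make the “temporary” part concrete, here is a minimal in-memory sketch of a store that drops messages once they are older than a user-defined retention window. It is only an illustration, not how any real broker is implemented: the class name `RetentionStore`, its methods, and the injectable `clock` parameter are all hypothetical names made up for this example.

```python
import time
from collections import deque


class RetentionStore:
    """Toy message store: keeps messages only for `retention_seconds`."""

    def __init__(self, retention_seconds, clock=time.monotonic):
        self.retention_seconds = retention_seconds
        self.clock = clock  # injectable so the demo does not need to sleep
        self._messages = deque()  # (timestamp, payload) pairs, oldest first

    def publish(self, payload):
        """Store a message together with the time it arrived."""
        self._messages.append((self.clock(), payload))

    def _expire(self):
        """Drop every message that has outlived the retention window."""
        cutoff = self.clock() - self.retention_seconds
        while self._messages and self._messages[0][0] <= cutoff:
            self._messages.popleft()

    def read_all(self):
        """Return the payloads that are still within the retention window."""
        self._expire()
        return [payload for _, payload in self._messages]


# Demo with a fake clock (assumption: time is measured in seconds).
fake_now = [0.0]
store = RetentionStore(retention_seconds=60, clock=lambda: fake_now[0])
store.publish("tweet-1")      # arrives at t=0
fake_now[0] = 30.0
store.publish("tweet-2")      # arrives at t=30
fake_now[0] = 70.0            # tweet-1 is now 70s old, tweet-2 is 40s old
print(store.read_all())       # only tweet-2 is still retained
```

Real brokers typically persist messages to disk and often support other retention policies as well (by size or by message count, for example), but the idea is the same: data is kept long enough for consumers to read it, not forever.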
Around the message broker, we can find producers and consumers.
Producer = The “thing” that pushes the messages into the message broker.
Consumer = The “thing” that consumes the messages from the message broker.
“Thing” means any system, service, application, IoT device, or other entity that connects with the message broker and exchanges data.
*Small note*: the same service/system/app can act as a producer and a consumer at the same time.
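The roles above can be sketched with nothing but Python's standard library, using `queue.Queue` as a stand-in for the broker. This is an in-process toy under stated assumptions, not a real broker client: the function names (`producer`, `relay`, `consumer`, `run_demo`) and the `SENTINEL` end-of-stream marker are hypothetical. The `relay` function is the “small note” in action, since it consumes from one queue and produces to another.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker for this demo


def producer(out_q, items):
    """Producer: pushes messages into the 'broker' (here, a queue)."""
    for item in items:
        out_q.put(item)
    out_q.put(SENTINEL)


def relay(in_q, out_q):
    """Acts as a consumer of in_q and a producer for out_q at the same time."""
    while True:
        msg = in_q.get()
        if msg is SENTINEL:
            out_q.put(SENTINEL)  # pass the end-of-stream marker downstream
            break
        out_q.put(msg.upper())   # placeholder "processing" step


def consumer(in_q, results):
    """Consumer: pulls messages from the 'broker' until the stream ends."""
    while True:
        msg = in_q.get()
        if msg is SENTINEL:
            break
        results.append(msg)


def run_demo(items):
    raw, refined = queue.Queue(), queue.Queue()
    results = []
    threads = [
        threading.Thread(target=producer, args=(raw, items)),
        threading.Thread(target=relay, args=(raw, refined)),
        threading.Thread(target=consumer, args=(refined, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_demo(["hello", "choir"]))
```

Note that none of the three functions knows about the others. Each one only knows the queue it reads from or writes to, which is exactly the decoupling the Wikipedia definition describes.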
Messaging queues derive from the same family, but there is a crucial difference between a broker and a queue.
Famous message brokers/queues are Apache Kafka, RabbitMQ, Apache Pulsar, and our own Memphis.dev. Kafka use cases span from event streaming to real-time data processing. One might consider using Memphis.dev instead of Kafka due to the ease of deployment and developer friendliness it provides.
Still with me? Awesome!
So, let’s see how using a message broker helped “Choir” to scale.
Instead of doing things like this –
By decoupling the app into smaller microservices and orchestrating the flow with a message broker, it turned into this –
Starting from the top-left corner, each piece of data (tweet/post) inserted into the system automatically triggers the entire process and flows between the different stages.
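The flow described above can be sketched as a chain of small stages that talk to each other only through topics. Everything here is an assumption made for illustration: the `Broker` class is a tiny synchronous in-process stand-in rather than a real broker, and the topic names (`raw`, `refined`, `analyzed`), the stage functions, and the word-count “analysis” are hypothetical and are not Choir's actual stages.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-process broker: routes each message to its topic's subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers
        self._pending = deque()                # (topic, message), FIFO

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        self._pending.append((topic, message))

    def run(self):
        """Deliver pending messages until the pipeline is drained."""
        while self._pending:
            topic, message = self._pending.popleft()
            for handler in self._subscribers[topic]:
                handler(message)


def build_pipeline(broker, sink):
    """Wire up hypothetical stages; each one knows only its topics."""

    def refine(msg):
        cleaned = {**msg, "text": msg["text"].strip().lower()}
        broker.publish("refined", cleaned)

    def analyze(msg):
        # Placeholder analysis: count the words in the text.
        result = {**msg, "word_count": len(msg["text"].split())}
        broker.publish("analyzed", result)

    broker.subscribe("raw", refine)
    broker.subscribe("refined", analyze)
    broker.subscribe("analyzed", sink.append)


broker = Broker()
results = []
build_pipeline(broker, results)

# Each new piece of data triggers the whole flow. A second source only
# needs to publish to the same "raw" topic; no stage has to change.
broker.publish("raw", {"source": "twitter", "text": "  Hello Choir  "})
broker.publish("raw", {"source": "facebook", "text": "Message brokers decouple services"})
broker.run()

for item in results:
    print(item)
```

The design point is in the last few lines: adding another data source means adding one more publisher to the `raw` topic, instead of duplicating the collector and everything downstream of it.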
The main reason behind my writing is to emphasize the importance of implementing a message broker pattern and technology as early as possible to avoid painful refactoring in the future. Message brokers, by default, enable you to build scalable architectures because they remove the tight coupling constraints.
Yes, your roadmap and added features are important. Yes, there will be a learning curve. Yes, it might look like an overkill solution for your stage. But when it comes to a data-driven use case, the need for scale will reveal itself quickly in performance, agility, feature additions, modifications, and more. Bad design decisions or the lack of a proper framework will burn out your resources. It is better to build agile foundations, not necessarily enterprise-grade, before reaching the phase where you are overwhelmed by users and feature requests.
To conclude, the entry barrier for a message broker is definitely worth your time.