When structuring data with Kafka, first consider how it will be consumed. A topic groups similar kinds of messages that are consumed by the same kind of consumer, so you can start with a single topic for one kind of data and add new topics later as other kinds appear. Keep in mind, however, that topics are registered in ZooKeeper, so you may run into problems if you create too many (for example, one topic per user when you have thousands of users).
Partitions, meanwhile, are used to parallelize message consumption. A topic should have at least as many partitions as there are consumers in its consumer group, because each partition is assigned to exactly one consumer in the group; any surplus consumers sit idle. The consumers in a group split the work of processing the topic among themselves based on this assignment, so each consumer only has to handle the messages in the partitions assigned to it. There are two ways to control partitioning: you can either set it explicitly with a partition key on the producer side, or let the producer select a partition for each message.
The partition structure you select will depend mainly on how you want the event stream to be processed. Using a partition key is usually the best option, because all events with the same key land in the same partition, which makes event processing partition-local.
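The effect of a partition key can be sketched with a toy partitioner. Kafka's actual default partitioner hashes the key bytes with murmur2; here `md5` stands in for illustration, and `pick_partition` is a hypothetical helper, not part of any Kafka client:

```python
import hashlib
import random
from typing import Optional


def pick_partition(key: Optional[str], num_partitions: int) -> int:
    """Sketch of Kafka-style partition selection.

    With a key, hash it and take the remainder, so the same key always
    maps to the same partition (partition-local processing). With no key,
    fall back to picking a partition at random, as a stand-in for the
    client's round-robin/sticky behavior.
    """
    if key is None:
        return random.randrange(num_partitions)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Every event for the same user lands in the same partition:
p1 = pick_partition("user-42", 6)
p2 = pick_partition("user-42", 6)
```

Because the mapping is deterministic, a consumer that owns a partition sees the complete, ordered stream of events for every key routed to it.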
Well-defined schemas and data models are the key to ensuring and controlling the quality of data flowing between different owners. When data pipelines constantly break down, the result is usability and data quality issues and a widening communication gap between data consumers, data engineers, and service implementers.
Memphis’ Schemaverse offers a robust schema management layer and store over the Memphis broker without any dedicated resources or standalone compute. It adopts a unique programmatic and UI approach that allows users to create and define schemas, attach them to different stations, and choose whether the schema should be enforced. Plus, since Memphis has a low-code approach, you also don’t need to worry about serialization, since that’s already taken care of in the producer library.
Other features of the Schemaverse include:
- The ability to render existing consumers and producers in real time
- Immediate dropping of non-conforming messages once a schema is attached
- The ability to modify schemas without rebooting producers
- Producer-level validation to reduce the resources required from the broker
- Zero-trust enforcement, out-of-the-box monitoring, versioning and the ability to import & export schemas
The first step in the Schemaverse is the creation and enforcement of the schema. Once a schema is created based on the most suitable data model, it’s applied to a station. The data format you choose when creating the schema also determines the format of the consumed messages.
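As a concrete example, a JSON-format schema for an orders station might look like the following. The field names are hypothetical, and the schema is written here as a Python dict in JSON Schema style rather than in any Schemaverse-specific syntax:

```python
# Hypothetical JSON Schema for an "orders" station, matching the
# .json data format mentioned later. Field names are illustrative.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number"},
    },
    "required": ["order_id", "amount"],
    "additionalProperties": False,
}
```

Once a schema like this is attached to the station, every producer writing to that station is expected to send messages matching this shape.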
When a schema is attached to a station, the producers must fulfill that schema. Once a producer connects to the station, the current schema is retrieved live and cached in the producer, which constantly listens for updates; every new message sent to the station is then validated against the attached schema. Only valid messages that conform to the attached schema can enter the station.
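The producer-side check can be sketched as follows. This is a minimal stand-in for a full JSON Schema validator, just to show the shape of the step; `validate` and the schema are hypothetical, not the Memphis SDK:

```python
# Minimal sketch of producer-side validation against the cached schema.
# A real client would use a complete JSON Schema validator.
TYPE_MAP = {"string": str, "number": (int, float), "object": dict}

SCHEMA = {
    "required": ["order_id", "amount"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number"},
    },
}


def validate(message: dict, schema: dict) -> bool:
    # Required fields must be present...
    for field in schema.get("required", []):
        if field not in message:
            return False
    # ...and present fields must have the declared type.
    for field, spec in schema.get("properties", {}).items():
        if field in message and not isinstance(message[field], TYPE_MAP[spec["type"]]):
            return False
    return True


ok = validate({"order_id": "a1", "amount": 9.99}, SCHEMA)      # conforms
bad = validate({"order_id": 7}, SCHEMA)                        # wrong type, missing field
```

Running the check in the producer library, before anything is sent, is what lets the broker offload validation work to clients.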
The last step in the process is serialization: converting a data object into a series of bytes so that its state can be saved in an easily transmittable form. If the data object (content and structure) doesn’t match the schema defined in the .json/.avro/.proto file, the process fails and the message isn’t sent to the broker.
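For the .json case, serialization can be sketched with the standard library. The `serialize` helper is hypothetical; the point is simply that a conforming object becomes a byte stream, while an unserializable one raises an error and is never sent:

```python
import json


def serialize(message: dict) -> bytes:
    # Convert the message to a byte stream for transmission to the broker.
    return json.dumps(message).encode("utf-8")


payload = serialize({"order_id": "a1", "amount": 9.99})

# An object json can't represent fails here, so no message reaches the broker:
failed = False
try:
    json.dumps({"blob": b"\x00"})  # raw bytes are not JSON-serializable
except TypeError:
    failed = True
```

Avro and Protobuf follow the same pattern with their own binary encodings, which is why a mismatch between the object and the schema surfaces at this step.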