We can recover lost events if the data source lives in the stack’s persistence layer.
In terms of ingestion latency, increasing CPU resources and fine-tuning batch sizes yield a significant increase in throughput, reducing total lag from several hours to a few minutes.
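As a minimal sketch of the consumer-side tuning described above: the option names are standard Kafka consumer configuration keys, but the specific values are illustrative assumptions, not the ones used in the original deployment.

```python
# Illustrative batch-tuning knobs for a Kafka consumer; the values here
# are assumptions for demonstration, not production recommendations.
consumer_tuning = {
    "max.poll.records": 2000,      # larger batches per poll -> higher throughput
    "fetch.min.bytes": 1_048_576,  # wait for ~1 MB of data before answering a fetch
    "fetch.max.wait.ms": 500,      # ...but never wait longer than 500 ms
}

def polls_per_million_records(records_per_poll: int) -> int:
    """Rough illustration: bigger batches mean fewer poll round-trips."""
    return 1_000_000 // records_per_poll
```

For example, at 2,000 records per poll, a million records need only 500 poll round-trips, which is where most of the throughput gain comes from.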
Most importantly, we can also make changes on the Kafka side to address this problem:
- Changing the Kafka offset reset policy (auto.offset.reset) in several applications to earliest.
For example, suppose ingestion lag is observed again due to a spike in data traffic.
In that case, we attempt to load the oldest available record in Kafka to reduce the blast radius of data loss.
This is especially useful in services where auto commits are disabled and idempotency is ensured.
- Extending the Kafka retention policy by several days to allow the system to re-stabilize and to give development teams enough time to work on fixes without losing data.
Once the system is stable, the retention policy can, of course, be reverted to its default or custom setting.
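The two changes above can be sketched as follows. The consumer option names are standard Kafka configuration keys; the broker address, topic name, and the 7-day retention value are placeholders, not values from the original setup.

```python
# Consumer-side recovery settings described above; key names follow the
# standard Kafka consumer configuration.
recovery_settings = {
    "auto.offset.reset": "earliest",  # fall back to the oldest available record
    "enable.auto.commit": False,      # commit manually so replays stay idempotent
}

# Extending topic retention happens on the broker side, e.g. with the
# stock kafka-configs tool (7 days shown here, value in milliseconds;
# broker address and topic name are placeholders):
#
#   kafka-configs.sh --bootstrap-server localhost:9092 \
#     --entity-type topics --entity-name events \
#     --alter --add-config retention.ms=604800000
#
# Running the same command with --delete-config retention.ms reverts
# the topic to the cluster default.
SEVEN_DAYS_MS = 7 * 24 * 3600 * 1000  # 604_800_000
```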
Each station in Memphis uses a stream file to store the station's messages.
The user can choose whether that stream is held in memory or persisted to a file on disk, from which the messages can later be retrieved.
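To make the memory-versus-file choice concrete, here is a toy sketch of the idea. This is a self-contained illustration, not the Memphis API: the class name, methods, and file path are all hypothetical.

```python
import os
import tempfile
from typing import List, Optional

class StationStore:
    """Toy sketch (not the Memphis API) of a station that keeps its
    message stream either in memory or in a file on disk."""

    def __init__(self, storage_type: str = "file", path: Optional[str] = None):
        if storage_type not in ("memory", "file"):
            raise ValueError("storage_type must be 'memory' or 'file'")
        self.storage_type = storage_type
        self._memory: List[str] = []
        # Hypothetical default location for the on-disk stream file.
        self._path = path or os.path.join(tempfile.gettempdir(), "station.stream")
        if storage_type == "file":
            open(self._path, "w").close()  # start with an empty stream file

    def produce(self, message: str) -> None:
        if self.storage_type == "memory":
            self._memory.append(message)  # fast, but lost on restart
        else:
            with open(self._path, "a") as f:
                f.write(message + "\n")   # durable across restarts

    def retrieve(self) -> List[str]:
        if self.storage_type == "memory":
            return list(self._memory)
        with open(self._path) as f:
            return f.read().splitlines()
```

In-memory storage trades durability for speed, while file storage survives restarts; the retrieval path is the same either way, which mirrors how the user-facing choice is just a storage setting.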