We distilled 160 conversations with tech leaders across enterprises, startups, NASDAQ companies, and legacy organizations, from team leaders to CTOs and VP R&Ds. Here are their Top 5 list of data engineering trends that will likely come to life in 2023.
“Data Contracts are API-based agreements between Software Engineers who own services and Data Consumers that understand how the business works in order to generate well-modeled, high-quality, trusted data.”If you take a good look at it, you have at least ten different data producers and multiple consumers, written in different languages, interacting with various databases, SQL, No-SQL, and the holy grail data models. It's a mess. Data contracts are still a management or operational concept, but we are starting to see more and more traction and conversation around it (Chad Sanderson covers the subject in depth in his newsletter). The end goals of data contracts are to a) increase the quality of produced data. b) Easier maintenance. c) Apply governance and standardization over a federated data platform.
A new role - Data Reliability Engineer (DRE).
One of the most common challenges that leaders have raised is how to narrow the technological gap between the different data stakeholders -
Engineers to Analysts, BIs, and Scientists.
This gap is not only the source of over-complicated architectures but also one of the significant cost generators. The BIs, analysts, and scientists each have a stack with dedicated languages like SQL and R. Besides technical differences, there are also different interests and sort of a bubble-like environment that is much different than any other group of teams that assemble one unit with one clear goal, like the famous triangle - IT, DevOps, and Devs. Due to the growing complexity of data and the increase in in-house investments in making data much more cost-effective, accessible, and a real growth engine, a new position must be filled. Just as the SRE (Site Reliability Engineer) narrowed the gap between the developers and DevOps engineers, so will the DRE, by having a swiss army knife of capabilities starting with business understanding and requirements, on to data structures and SQL, to theory concepts in ML an AI, and lastly in how to create a straight to the point pipelines that will gather the needed data to fulfill the other layers.
Streaming and Real-time.
Data is growing too fast to process as a whole. That’s a simple truth.
These days we can find super intelligent and efficient algorithms that will process output in milliseconds, but in order to bring the data in, each pull will take minutes and hours. That example demonstrates that if the entire process could be refactored and generate results per a single event or in small batches, the output would take a reasonable amount of time. Not hours.
It is one example of many more, but not all use cases can happen in real time, and refactoring is hard. The mindset should be real-time from the get-go.
Tracking stream lineage.
The troubleshooting barriers must be lowered to enable the “streaming” growth and increase usability. Most of the respondents said that they are using a message broker to enable real-time pipelines; for them, a message broker is a black box. Something comes in, something comes out, and at the end of the pipeline, some events are dropped, and some get ingested. The lack of a pattern for failures with debugging experience that requires an assembly of different teams and engineers keeps architects from deepening in real-time. To overcome these barriers, engineers need better observability, context-based, that can display the full evolution of a single event all the way from the first producer (a stage in a pipeline) to the very last consumer. Several products and projects started to address this challenge, including Memphis.dev with embedded event’s journey, Confluent, OpenLineage, monte carlo, and more.
Event sourcing is coming back.
How do you struct or emphasize a user’s journey?
Let’s take for example, the journey of some users in an eCommerce store.
- They entered the store
- They searched
- They found something / They didn't
- They purchase something within two minutes of entrance / They arrived to the checkout and went out
Eventually, you can describe it in 10X different SQL tables or in one No-SQL document, do some joins/aggregations and finally, after the user went far away from your store, perform some actions. Well, there is a better approach for store and action, and it's called event sourcing. In simple words, it means that there is a queue, and into that queue, you push every event that a certain user has made instead into a database. Until now, pretty straightforward, but we want to perform some real-time actions derived from their behavior pattern while the user is in our store. Btw, We at memphis.dev are pushing to also enable this use case.
To conclude, even though it seems we hear it all the time, data keeps growing, arriving from multiple sources in different shapes and sizes.
Mastering the streams and lake can benefit any organization by reducing costs, increasing sales, becoming more efficient, and, most importantly, understanding the customer on the other side.
May the force be with you!