
The Future of SQL: Databases Meet Stream Processing

Idan Asulin · September 09, 2022

Structured Query Language (SQL) was created in 1974 and has become one of the most useful and popular tools for developers building business applications. Even though it is a fairly simple language, it offers significant advantages, which also makes it an essential tool for data scientists. SQL is a query-based programming language used to work with structured data held in a relational database management system (RDBMS).

However, technology has changed a great deal over the past decades, the amount of data keeps growing, and traditional SQL, as a data manipulation language, can no longer provide all the data processing capabilities modern applications need.

In this article, we will explain what streaming SQL is, how it differs from the traditional data manipulation language, and what the advantages of using it are. You will also learn how to write simple streaming queries and what the future holds for the language.


History of databases

Databases have been studied for a long time. After the advent of the relational model, relational databases became very popular, and they are still widely used today. However, with the rapid growth in the amount of data, the relational model alone no longer fully meets the requirements of business.

This is when non-relational databases appeared, which differ from relational databases in their structure and in how they store data. Among them, the most popular are:

  • Document databases, such as MongoDB and CouchDB.
  • Wide-column databases, such as HBase and Cassandra.
  • Distributed, strongly consistent relational databases, represented by Google Spanner and its open-source counterparts CockroachDB and TiDB.
  • Analytical databases built on column storage, such as Apache Druid and ClickHouse.
  • Time-series databases, such as InfluxDB and TimescaleDB.
  • Full-text indexing databases, such as Elasticsearch.
  • Graph databases, such as Neo4j.

However, none of these systems can satisfy all of a company's business needs, because the needs of real businesses constantly change and become more complex and diverse: data storage methods, access patterns, data volumes, retention periods, and so on.

New requirements for data storage and processing, the development of new industries, and the emergence of new needs lead to new databases and shape the future of the data manipulation language.


What are SQL databases used for?

Structured Query Language (SQL) is a query-based programming language used to work with relational databases. With SQL, you can store, retrieve, manage and manipulate large amounts of data in a database management system.

Even though SQL was developed in the early 1970s, it is still one of the most popular programming languages today, and it remains the standard language of relational database management systems.

Almost every business now needs a reliable database along with a database management system. SQL allows you to query, update, and reorganize data, create and modify the structure of a database, and control access to its data. In addition, with SQL you can store data about every client your business has ever dealt with and retrieve that information quickly.

SQL itself is a vendor-neutral standard, and many inexpensive or free open-source implementations exist. Using it, you can work with data in a simple, convenient, and secure way, which is why both large and small businesses use SQL.

Let’s look at a few examples of industries that make extensive use of databases.

  1. Technology sector. Databases underpin day-to-day business operations and power companies’ websites and applications.
  2. Financial industry. Banking applications and payment systems are among the largest users of databases. Banks need additional security measures for online transactions, and SQL databases provide the necessary tools.
  3. Music industry. More and more music streaming companies are being created. To provide an online music service, they need huge libraries of music that hundreds of millions of customers can access at any time.
  4. Social media. Social networks have huge numbers of users who generate an incredible amount of traffic, exchanging photos and messages every day. This requires information to be constantly updated and displayed to millions of users at the same time. In addition, social media databases must store, and be able to query, the personal information of millions of users.

There are many uses for SQL, and the language can be applied in different ways depending on your goals. Let’s look at some examples of using SQL; a short code sketch follows the list.

  1. Change data in a table. You can change a specific record in a table or update an entire group of records that match certain criteria.
  2. Create a table. You can create new tables and load data into them, making the data available to the system for whatever you need.
  3. Retrieve data. You can use SQL to retrieve data from a database for further analysis, evaluation, and other uses.
  4. Change the data structure. You can change the structure of a database or of a table that stores data.
  5. Manipulate data. SQL allows you to store, modify, delete, or update data.
  6. Grant access to data. If you manage the database, you can grant access to other users using SQL. This feature is very useful when storing sensitive data.
  7. Define custom functions and procedures. With SQL, you can add functions and stored procedures based on the specific needs of your database.
  8. Analyze data manually. You can use SQL to inspect the data in your system by hand, which is useful when troubleshooting or resolving database issues. You can also use SQL queries to find the data you need for analysis.
  9. Merge datasets. SQL allows you to join and compare data from two or more tables.
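As a quick illustration, here is a minimal sketch of a few of these operations in standard SQL; the customers and orders tables are hypothetical:

-- Create a table and load data into it.
CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(100), city VARCHAR(50));
INSERT INTO customers (id, name, city) VALUES (1, 'Alice', 'Berlin');

-- Change records that match certain criteria.
UPDATE customers SET city = 'Munich' WHERE id = 1;

-- Merge datasets: join rows from two tables.
SELECT o.id, c.name
FROM orders AS o INNER JOIN customers AS c ON o.customer_id = c.id;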

What benefits do you get when using SQL?

SQL is the most popular database management language because it has a lot of advantages. Below we list the main ones.

  1. Fast processing of requests. SQL allows you to process a large amount of data in a very short time. Operations such as deleting, inserting, and updating data are very fast.
  2. Availability. SQL works with most DBMSs, such as Microsoft SQL Server, MySQL, Microsoft Access, and SAP Adaptive Server; indeed, nearly every database management system supports it.
  3. Good scalability. With SQL, you can easily create new tables and migrate existing tables to new databases.
  4. Excellent transaction support. SQL can handle large record sets and process many transactions at the same time.
  5. Safety. SQL keeps your data safe. With it, you can easily manage permissions.
  6. Easy to learn. To use SQL, you need to learn its functions and syntax. This language is quite simple and understandable. The statements mostly consist of English words, making it easy to learn and write SQL queries.
  7. Standardized language. SQL is a long-established standard with an active developer community; updates and fixes ship frequently in its many implementations, and documentation and troubleshooting tips are easy to find.
  8. Portability. SQL is portable. You can use it on desktops, gaming systems, laptops, servers, tablets, and smartphones. It works on local systems, Intranet, and the Internet. In addition, you can move databases using SQL between devices.
  9. Integration. You can integrate SQL with other programming languages such as Python and R. This makes it easier to manipulate data and manage the database, because you use the same query language throughout the system.
  10. Multiple views of data. Using SQL, you can provide different views of the structure and contents of a database to different users.
  11. Client/server language. SQL is used to connect client computers to internal databases (servers). Thus, a client-server architecture is supported.

Future of databases

Modern software is constantly processing large data flows. Data comes from many different systems and is stored in many different places. At the same time, regardless of where the data is stored, the program must be able to manage it quickly and use it in business logic.

Data is constantly in motion, and the infrastructure must process data flows immediately. Programmers are tasked with ensuring that software interacts with other applications 24 hours a day, 7 days a week. Regular databases were not designed for this purpose.

To use traditional databases and RDBMSs effectively in modern applications, we need new language abstractions and query semantics that work with both data in motion and data at rest.

Databases today give us tables for working with data at rest. To serve modern applications effectively, they also need streaming capabilities. What streaming is and how it works alongside traditional data management is covered in detail in the following sections.


What is streaming SQL?

In today’s software, data is generated in a variety of ways and comes from a variety of sources: sensors, user actions on websites, mobile terminals and smart devices, real-time trading data from financial markets, monitoring programs, and so on. Most often, we cannot control the order in which data enters the system or the speed at which it is generated, and real-time streaming data now makes up an ever-larger share of the total data volume.

Traditional data processing systems typically compute and analyze intact static data that is already stored in databases. This processing method is not suitable for constantly generated, infinite, and dynamic data streams (flows). In addition, traditional batch processing methods usually have a relatively long time interval between data generation and processing. In today’s world, to be competitive, you must make business decisions quickly and accurately. Therefore, the data must be processed in a shorter time, preferably even instantaneously.

For this reason, real-time stream processing techniques are beginning to replace batch processing and take center stage in the modern data analytics stack. Unlike queries and analysis over static data, streaming naturally models data flows and updates computed results instantly as new data becomes available. This lets companies stay agile and ahead of the curve in a highly competitive marketplace.

Traditional databases store data before it becomes available for queries, so by the time a query runs, the data may already be outdated. Real-time data analysis requires data streaming, which creates the need for a streaming extension to the RDBMS that supports processing flows. The requirements of today’s businesses determine the future of streaming data processing.

Continuous Query Language (CQL) extends standard SQL with the ability to process streams of data in real time. It differs from classic SQL in that it continuously consumes data streams: it adds the ability to manipulate streams, which are infinite sequences of tuples, not all of which are available at once. Stream queries are continuous, run for long periods, and return incremental results.

Utilizing streaming systems, organizations can ingest huge amounts of event data from files, databases, and various other sources. You can write SQL-like queries against streaming data without writing any application code. CQL makes it easy for organizations to capture, process, and deliver real-time data across multiple environments, whether in the cloud or on-premises.

It is worth noting that CQL is better suited to working with small subsets of data, where you need fast results and want to act on the value of new data the moment it is created. It should not be used for scanning massive historical databases.

To get the most out of the data constantly arriving in the system, CQL lets you transform, filter, merge, and enrich it.
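For illustration, here is a minimal sketch in a ksqlDB-style dialect; the payments stream, its columns, and the 2% fee are hypothetical. It filters the stream and derives an enriched stream from it:

-- Derive a new stream by filtering and transforming an input stream.
CREATE STREAM large_payments AS
  SELECT customer_id, amount, amount * 0.02 AS fee  -- transform: compute a fee column
  FROM payments
  WHERE amount > 1000;                              -- filter: keep only large payments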


Difference between SQL in streams and databases

There are several key differences between using SQL in streams and databases.

  1. Instant vs. continuous queries. When you execute a SQL query against a traditional database, it returns the static result set that exists at the moment of the query. With streaming SQL, you can run the same query and get an answer for a specific point in time, but since the data changes quickly, the results are probably already out of date. It is therefore much more useful to run a query whose results are continuously updated (a materialized view). When you create a materialized view, the streaming SQL engine processes the entire history of data changes up to the present and then keeps the view updated as new data arrives (see the sketch after this list).
  2. Response time. When you query a traditional database, you get a response after some delay, known as the response time. In streaming databases, an initial response time exists only when the view is first materialized; after that, results arrive with a propagation delay that depends on the volume of input data.
  3. Query frequency. A traditional database receives a query, plans and optimizes it, and returns the result; the engine then sits idle until the next query arrives. In a streaming database, the engine responsible for reading data runs constantly, because it must process new data as it keeps coming in.
  4. Sequential vs. random access to data. With traditional databases, you can query data in any order, regardless of when it entered the system. Streaming databases process data sequentially, as it arrives.
  5. Data update rate. When working with streams, the data update rate is faster than when working with traditional databases.
  6. Storage size. Streaming databases keep only a bounded amount of state on disk, while traditional databases store the full dataset, which can grow without bound.
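To make the first item concrete, here is a minimal sketch of a materialized view in a ksqlDB-style dialect; the orders stream, its columns, and the view name are all hypothetical. The engine computes the counts once over the full history and then keeps them up to date as each new order arrives:

-- A continuously updated materialized view over an event stream.
CREATE TABLE orders_per_customer AS
  SELECT customer_id, COUNT(*) AS order_count
  FROM orders
  GROUP BY customer_id;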

Challenges and opportunities for databases in the age of streaming

A lot of organizations are realizing the value of streaming data processing and are starting to use the appropriate systems.

The following streaming data processing solutions currently exist:

  • Systems for collecting and capturing data in real-time.
  • Real-time data storage systems.
  • Stream computing engines.
  • Downstream data systems and applications.

However, implementing tools for streaming data processing is a complex task. Existing solutions are difficult to use and have problems with deployment, operation, and maintenance. They also have high entry barriers, non-unified APIs, migration difficulties, etc.

An important problem in working with streaming data is the difficulty of development and maintenance and the need for various additional tools, which makes the overall system less reliable. In addition, the components are not always efficient and the integration between them is imperfect, so it is often difficult to meet business needs.

To solve all problems in data flow management, it is important to use professional tools for working with stream data.

Such a tool is a streaming database: a system for storing and processing streams of data in real time, queried with the well-known SQL language. A streaming database uses streams as its main entity and real-time processing as its main feature, unlike other database systems that use static datasets (tables, documents, etc.) as their main unit of storage and processing.

The future of databases must solve all the problems that SQL stream developers face now.


Why do you need SQL for real-time streaming?

Even though traditional data manipulation languages were designed to work with static data, they are also great for working with streaming data. Let’s list the main advantages of using a data manipulation language for real-time streaming.

  • A data manipulation language is declarative, allowing you to write simple queries that yield complex real-time analytics. Developers rarely have to make optimization decisions by hand, which results in better performance.
  • Data manipulation languages are very popular, widely used, and easy to learn, so most developers can work with streaming data without learning additional technologies.
  • Data manipulation languages encapsulate the core complexity of data management operations, making real-time analytics easier. Operations can be optimized automatically, significantly improving performance and reducing the time developers spend on tuning.
  • Data manipulation platforms can be upgraded on the fly, without being taken apart and recompiled. If you cannot express certain operations in the data manipulation language, you can write them in Java and deploy them as part of the data manipulation pipeline.
  • Data manipulation languages can be generated automatically, allowing you to analyze streaming data through a GUI.
  • Data manipulation languages are standardized and familiar from modern DBMSs such as Oracle, only applied here to streaming data. You can use all the standard data types and manipulation statements, which lets you add powerful real-time correlation, query streaming data, and process it in time windows.

How to add SQL stream abstraction?

You should use different semantics for processing streaming data than for data at rest. This is because data streams are inherently ordered in time, being events that occur one after another, whereas tables model the current state of the world.

You can use SQL to query the event stream using the following semantics:

SELECT * FROM Products WHERE Status='Outdated' EMIT CHANGES;

This query returns all products whose status is ‘Outdated’ and then keeps watching the product stream, emitting a new record whenever a product’s status changes to ‘Outdated’.

Using streaming databases, you can get information about all the changes that have happened to the data, as well as the current state of the data.

The stream can also be aggregated into a continuously updated table (a materialized view) to represent the events more compactly.
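Continuing the hypothetical Products example above, here is a minimal sketch in the same ksqlDB-style dialect that aggregates the stream into a continuously updated table:

-- Aggregate the product stream into a compact, always-current table.
CREATE TABLE products_by_status AS
  SELECT Status, COUNT(*) AS ProductCount
  FROM Products
  GROUP BY Status;

Each incoming product event updates only the affected row, so the table stays compact while always reflecting the latest state of the stream.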


What are the types of queries in a streaming database?

In traditional databases, data is stored at rest, and to get it we need to execute a query. Every time we want to know whether the data has changed, we must rerun the query. With the rapid growth in data volume and the high frequency of updates, this approach becomes inefficient. In addition, we receive updated data with a delay, which hurts modern businesses.

Streaming queries are more efficient because results are updated continuously, letting us receive up-to-date data as soon as it enters the system.

Streaming data calls for queries that differ from those used on data at rest: stream queries give you the history of data changes, while standard queries return only the current state of the data.

Stream Query Syntax

To write a stream query in SQL, you can use the same syntax as for a classic query, adding the EMIT CHANGES clause to tell the database to deliver the results as a stream of changes.

SELECT * FROM Cities WHERE CountryId=1 EMIT CHANGES;

This query differs from the static version only by the presence of the EMIT CHANGES clause, but its behavior differs in an important way: it runs until explicitly cancelled, and its result is an endless stream of data. After each change (whether an insert, update, or deletion of a record), a new record is emitted. A query that returns static data, by contrast, terminates as soon as the result is returned.

Differences in Querying Traditional and Streaming Databases

In a conventional database, every query performs a full table scan. To find out whether the data has been updated, you must rerun the query, and if nothing has changed, the scan was wasted work. On top of that, query latency grows.

When executing streaming queries, the result can be computed much more efficiently. A full table scan is performed only once, the first time the query runs, to retrieve the initial data. After that, the query subscribes to changes in the table, so fresh data arrives immediately after each update without rescanning the table. If there are no updates, no computation is needed at all. This reduces latency and ensures that clients never miss an update.


What are materialized views on the data stream?

A materialized view is a database object that contains the result of a query. Materialized views can greatly speed up queries that touch a large number of records and make it possible to process huge amounts of data quickly.

In a streaming database, a materialized view can be a stream or a table, and it is used to store streaming results.

There are two main types of queries in a streaming database:

  • Classic table lookups (pull queries).
  • Continuous streaming queries, which divide into temporary subscriptions (push queries) and persistent queries.

Push queries continuously send new result records to the client; when the client disconnects, the query ends. Events, however, keep flowing into the stream, so they must be processed even without a client attached. That work is done by persistent queries.

Persistent queries run inside the database server and continually update a stream or table as new input arrives. If the same materialized view serves as the data source for multiple queries, any change to the input data affects the results of all of them.
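Here is a hedged sketch of both query types, in a ksqlDB-style dialect, using the hypothetical orders_per_customer view from earlier:

-- Pull query: returns the current state for one key, then terminates.
SELECT * FROM orders_per_customer WHERE customer_id = 42;

-- Push query (temporary subscription): streams every subsequent change
-- until the client disconnects.
SELECT * FROM orders_per_customer EMIT CHANGES;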


How to query SQL stream?

Above, we have already seen some examples of how to query data using SQL stream. Now let’s take a closer look at some of the features of streaming SQL.

All events take place at some point in time, and time is a central concept in stream processing. Data in the stream must be ordered by the time of its occurrence, but in practice this is quite complicated.

An event may arrive at the processing system much later than it occurred at its source, due to an unstable connection, system errors, and so on. Whatever the cause, the result is that events end up out of order in the stream. To solve this problem, each record must carry an event timestamp.

Window for aggregations

To aggregate a data stream into a table over time intervals, you can use the event timestamp that must be present in each record together with the TUMBLING WINDOW expression, which adds a grouping condition along the time axis of the records.

Each tumbling window has a fixed length. They do not overlap. This means that each event is assigned to only one window, and each window is placed on the time axis one after the other.

Tumbling windows are designed around a single attribute: time. Because each event lands in exactly one window, they are a robust choice for simple reporting.

The following example creates a 20-second tumbling window that aggregates all clicks on a website, grouped by the visitor’s country of origin:

SELECT Country, COUNT(*) AS VisitorCount
FROM ClickStream TIMESTAMP BY CreatedAt
GROUP BY Country, TumblingWindow(second, 20);

It is also worth noting how windows differ from plain grouping.

  • Groups are distinct: each record belongs to exactly one group. A record can, however, belong to several windows, because some window types (unlike tumbling windows) overlap.
  • Groups have no temporal semantics and remain “open” as time passes. Windows, by contrast, may close at some point in time, after which they no longer accept records.

Using JOINs for Stream-to-Stream Joins

Joins are used to combine data from multiple datasets, enrich data, and detect correlations. This also works for unbounded streaming data.

Joining two streams means joining all of their events. But streams are endless: storing all of their events would require infinite storage, because you can never safely delete an event that might still match a future event from the other stream.

To solve this problem, you can use the idea of windowed computation together with the WITHIN clause. It applies a sliding window to the join so that the join becomes bounded.

CREATE STREAM joined AS
  SELECT *
  FROM left_stream AS l INNER JOIN right_stream AS r
    WITHIN 1 MINUTES GRACE PERIOD 20 SECONDS
    ON l.attribute1 = r.attribute2;

The WITHIN clause acts as an additional join condition on the event timestamps of the input records (much like the TUMBLING WINDOW expression in an aggregation). In principle, you could express the time bound with an ordinary WHERE condition on the timestamps instead, but that approach has drawbacks:

  • The possibility of failures. If the user forgets to add the required time limit, the query will eventually fail because it runs out of memory.
  • Insufficient semantics. A plain WHERE condition cannot express additional time-related concepts such as GRACE PERIOD.

It is worth noting that a stream-to-stream join produces results in real time. When a new event arrives on either input stream, it passes through the continuous streaming query, which looks for join partners in the other stream’s window. If a matching record is found, the join result is immediately appended to the result stream. This is much faster than the equivalent table-to-table join in a relational database, which is usually very expensive and often delivers results with high latency.

Stream-Table Joins

Another use of joins is to join events in a stream to rows in a table to enrich the event with additional information. Let’s look at the following example:

CREATE STREAM enriched_payments AS
  SELECT *
  FROM payments INNER JOIN customers
    ON payments.customer_id = customers.id;

The result of this query is a stream. For each customer transaction, the stream-side payment event is used to look up the customers table, the event is enriched with the matching customer information, and the enriched event is appended to the result stream.

Here, just as when joining streams, you get results in real time: each time a new event appears on the input stream, it is passed through the query, matched against the table, and the enriched event is immediately appended to the result stream.

Note that while a query runs, not only the stream but also the table can be updated. Table updates are synchronized in time, and such changes are visible only to newer events: already-enriched events are not updated when the table changes, which preserves immutability.

Thus, just like data at rest, data in motion can be combined from different sources. The supported joins are stream-stream and stream-table, using the special streaming SQL semantics designed for data in motion.


Streaming SQL integration with external sources

To integrate streaming data with external sources and make importing and exporting seamless, a streaming database must offer a wide range of connectivity options. This keeps the data in constant motion.

You can use Kafka connectors to integrate with external systems. Use the CREATE CONNECTOR statement to define source connectors for importing data from external sources as streams. You can also define sink connectors to export data as streams to external destinations.

CREATE SOURCE CONNECTOR `customers-from-postgres` WITH (
    'connector.class' = 'connector_class_name',
    'connection.url'  = 'connection_url',
    ...
    'mode'            = 'bulk',
    'topic.prefix'    = 'prefix',
    'table.whitelist' = 'table_name',
    'key'             = 'id');

You can use this example to create a connector that transfers data between databases. All you have to do is set the connector class, the connection URL, and any additional options.
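A sink connector that exports a stream to an external destination looks similar. Here is a sketch with placeholder values for the connector class, URL, and topic:

CREATE SINK CONNECTOR `orders-to-search` WITH (
    'connector.class' = 'connector_class_name',
    'connection.url'  = 'connection_url',
    'topics'          = 'orders');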


Data stream processing

When processing large amounts of data, and especially streaming data, time is a critical factor. This section covers several aspects of time-based stream processing: real-time processing, time-series data processing, and understanding time.

The key factors to consider when deploying a dataflow application are:

  • The ability to maintain low-latency results regardless of the amount of incoming data. This is especially important when data volumes reach hundreds of thousands of records per second.
  • The ability to process incoming data record by record, which is required for real-time analytics, alerting, and real-time processing.
  • A correct understanding of time: the ability to process data by the time it was created, not by the time it was received.

Real-time processing

With an unbounded stream of real-time data coming in, it is important to keep generating responses quickly as the volume grows. A DBMS can process a data stream and deliver low-latency responses when the amount of data is small and latency is measured in minutes. As the volume increases, architectural constraints start to matter, and you must trade off result latency against data volume using flow control.

Processing Time Series Data

Stream processing of time-series data means processing records one by one as they arrive. Some systems use a batch architecture to process streaming data, but batching a stream has several disadvantages when it comes to real-time analytics.

Understanding time

Correct and reliable processing of streaming data requires the ability to reason about time. Most data flows use time windows to process incoming data. One approach is to process data by its arrival time, but arrival time is unsuitable for most real-world streaming applications: in many cases, alerts and patterns are only valid if they are based on when the data was created (see the sketch after the list below).

The time of data entry into the system can be used in very limited cases:

  • When the data does not contain timestamps for any reason.
  • On systems that do not require a data creation timestamp.
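As a sketch of event-time processing, reusing the TIMESTAMP BY syntax from the windowing example above (the sensor stream and its columns are hypothetical), the query below windows events by when they were created rather than by when they arrived:

-- Aggregate by event time (CreatedAt), not arrival time.
SELECT DeviceId, AVG(Temperature) AS AvgTemperature
FROM SensorStream TIMESTAMP BY CreatedAt
GROUP BY DeviceId, TumblingWindow(minute, 5);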

Example use cases for streaming SQL

Using streaming SQL has a lot of advantages, and it is easy to start if you already run a message broker. As software keeps evolving, streaming SQL will only become more widespread. Let’s look at the business sectors where it is useful.

Business Analytics

With streaming SQL, you work with real-time data, which makes business analysis more accurate and valuable. In addition, the materialized view of the primary source data often has a simpler structure than the source itself. Dashboards also load faster, because streaming SQL delivers results quickly, and you can coordinate time intervals and the order of operations in batch data processing.

Microservices

Streaming SQL can replace complex code that coordinates and transforms data in microservices. Any microservice component that essentially acts as a query can be replaced with streaming SQL.

Real-time applications

Streaming SQL is a simple and cheap tool that allows you to deliver data and updates in real-time. It may be the best option if your application is heavily dependent on real-time data.

Business Automation

If you get real-time dashboards with streaming SQL, then the next step could be automated decision-making using that data.

Fraud detection

A stream processor helps developers write applications that react to incoming data. With it, you can detect anomalies that signal fraud in real time and take appropriate action, and you can check, compare, and analyze data to stop fraudulent transactions.

Real-time transaction analysis can surface patterns that reveal fraudulent transactions. In addition, detected fraud can trigger notifications through various channels, such as email, SMS, or social networks.

Cybersecurity

Protecting against cyber attacks takes more than technical security systems; people who make sound cyber-defense decisions also play an important role.

Using streaming data to detect anomalies in the data flow allows you to identify security issues in real time and isolate threats. With stream processing, you can filter data on its way to storage and integrate it into a more robust analytics platform.

With data streaming, you can identify a DDoS attack by analyzing traffic to see whether a suspicious amount of it is coming from a single IP address or from a single user profile.
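Here is a minimal sketch of such a check, in the same windowed style as the earlier examples; the stream, its columns, and the threshold are hypothetical:

-- Flag IPs sending a suspicious volume of requests within one minute.
SELECT SourceIp, COUNT(*) AS RequestCount
FROM TrafficStream TIMESTAMP BY EventTime
GROUP BY SourceIp, TumblingWindow(minute, 1)
HAVING COUNT(*) > 100000;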

Sensor Data Processing

You can process and analyze data from sensors and devices in real time. This helps technical systems gauge their performance and eliminate potential errors before a serious problem arises, resulting in fewer failures and lower maintenance costs.

Online advertising

With streaming data processing, you can analyze user behavior in real time and, based on this data, promote advertisements that are likely to interest each user.


Wrapping up

In this article, we discussed the benefits of using traditional SQL in business, the opportunities it provides, and its use cases. We then looked at the development of modern technologies, new challenges, and new requirements for automated systems, and concluded that traditional SQL can no longer meet modern requirements on its own. At the same time, it is a very powerful and popular tool that should not be abandoned, only adapted to new requirements. The future of SQL therefore depends on the tasks that businesses put before it.

This is where streaming SQL comes in. It allows you to process endless streams of data in real time. We have covered streaming SQL syntax, how to use it to write queries and create joins, the types of queries in a streaming database, and the differences between queries in traditional and streaming databases.

We also discussed why it is important to integrate a streaming database with external sources and how this can be implemented. Finally, we listed the possible use cases for a streaming database in detail.

If your business needs to process large volumes of data in real time and timely information is critical, you should consider using streaming SQL.