SQL was created in 1974 and has become a very useful and popular tool for programmers who develop applications for various businesses. Even though this is a fairly simple programming language, it provides a lot of significant advantages, which makes it an important and necessary tool for data scientists.
However, over the past decades, technology has changed a lot, the amount of data is constantly growing, and traditional SQL can no longer provide all the necessary data processing functions.
In this article, we will explain what streaming SQL is, how it differs from traditional one, and what are the advantages of using it. You will also learn how to write simple queries using streaming SQL and what is the future of SQL.
History of databases
People have studied databases for a long time. After the advent of the relational model, it became very popular and is still widely used today. However, with the rapid growth of the amount of data, it no longer fully meets the requirements of the business.
At this time, non-relational databases appeared, which differ in their structure and method of storing data from relational ones. Among them, the most popular are:
- Document bases are provided by MongoDB and CouchDB.
- Wide column databases introduced by HBase, Cassandra.
- Distributed, a strictly consistent relational database provided by Google Spanner, including its open-source implementations CockroachDB, and TiDB.
- Analytical databases based on column storage represented by Apache Druid, and ClickHouse.
- Time series databases are provided by InfluxDB and TimescaleDB.
- Full-text indexing databases provided by Elasticsearch.
- Graph databases provided by Neo4j.
However, none of these systems can satisfy all the business needs of companies. This is because the needs of real businesses are constantly changing and becoming more complex and diverse. Data storage methods, access models, data storage size and time, etc. are changing.
New requirements for data storage and processing, the development of new industries, and the emergence of new needs lead to the emergence of new databases and influence to future of SQL.
What are SQL databases used for?
Structured Query Language (SQL) is a query-based programming language used to work with relational databases. With SQL, you can store, retrieve, manage and manipulate large amounts of data in a database management system.
Even though SQL was developed in the early 1970s, it is still one of the most popular programming languages today. It is used as a standard relational database management system.
Almost every business now needs a reliable database along with a database management system. SQL allows you to query, update, and reorganize data, create and modify the structure of databases, and control access to its data. In addition, with the help of SQL, you can store data about every client your business has ever dealt with, and quickly retrieve this information.
SQL is inexpensive open-source software. Using it, you can work with data in a simple, convenient, and secure way. Therefore, both large and small businesses use SQL.
Let's look at a few examples of industries that make extensive use of databases.
- Technology sector. Here, databases are used in business operations to support websites and businesses.
- Financial industry. Banking applications and payment systems are among the largest users of databases. Banks require additional security measures to make online transactions. SQL provides all the necessary tools for this.
- Music industry. There are more and more music streaming companies being created now. To provide online music service, they need huge libraries of music that hundreds of millions of customers can access at any time.
- Social media. Social networks have a huge number of users who constantly consume an incredible amount of traffic. They exchange photos and messages every day. This requires constant updating and display of information to millions of users at the same time. In addition, social media databases must store the personal information of millions of users that can be tracked.
There are many uses for SQL. This language can be used in different ways depending on your goals. Let's look at some examples of using SQL.
- Change the data in the table. You can change a specific record in a table or update an entire group of records that match certain criteria.
- Create a table. You can create new tables and enter data into them. This allows you to enter data into the system and use it according to your needs.
- To get data. People can use SQL to retrieve data from a database for further analysis, evaluation, and other uses.
- Change the data structure. You can change the structure of a database or a table that stores data.
- Manipulate data. SQL allows you to store, modify, delete or update data.
- Grant access to data. If you manage the database, you can grant access to other users using SQL. This feature is very useful when storing sensitive data.
- Define custom functions and procedures. With SQL, you can add functions and procedures based on the specific needs of your database.
- Analyze data manually. You can use SQL to manually parse the data on your system. This can be useful when troubleshooting or resolving database issues. In addition, you can use SQL queries to find the necessary data, which you can then use for analysis.
- Merge datasets. SQL allows you to join and compare data from two tables.
What benefits do you get when using SQL?
SQL is the most popular database management language because it has a lot of advantages. Below we list the main ones.
- Fast processing of requests. SQL allows you to process a large amount of data in a very short time. Operations such as deleting, inserting, and updating data are very fast.
- Availability. SQL is compatible with most DBMS such as Microsoft SQL Server, MySQL, Microsoft Access, SAP Adaptive Server, etc. In addition, most database management systems support SQL.
- Good scalability. With SQL, you can easily create new tables and migrate existing tables to new databases.
- Excellent transaction support. SQL can support huge records and process many transactions at the same time.
- Safety. SQL keeps your data safe. With it, you can easily manage permissions.
- Easy to learn. To use SQL, you need to learn its functions and syntax. This language is quite simple and understandable. The statements mostly consist of English words, making it easy to learn and write SQL queries.
- Standardized language. SQL is open-source and comes with an active developer community. Update and troubleshooting releases happen very frequently. In addition, you can easily find documentation and troubleshooting tips.
- Portability. SQL is portable. You can use it on desktops, gaming systems, laptops, servers, tablets, and smartphones. It works on local systems, Intranet, and the Internet. In addition, you can move databases using SQL between devices.
- Integration. You can integrate SQL with other programming languages such as Python and R. This allows you to manipulate data and manage the database more easily because you use the same coding language throughout the system.
- Multiple views of data. Using SQL, you can provide different views of the structure and contents of a database to different users.
- Client/server language. SQL is used to connect client computers to internal databases (servers). Thus, a client-server architecture is supported.
Future of databases
Modern software is constantly processing large data streams. Data comes from many different systems and is stored in many different places. At the same time, regardless of where the data is stored, the program must be able to manage it quickly and use it in business logic.
Data is constantly in motion, and the infrastructure must process data flows immediately. Programmers are tasked with ensuring that software interacts with other applications 24 hours a day, 7 days a week. Relational databases and SQL were not designed for this purpose. So the future of databases is different from their present.
To effectively use relational databases and SQL in modern applications, it is necessary to create new language abstractions and query semantics that will work with both data in motion and data at rest.
Databases now provide us with tables to work with data at rest. To work effectively with modern applications in SQL, you need to add streaming. What it is and how it works together with SQL will be covered in detail in the following sections.
What is streaming SQL?
In today's software, data is generated in a variety of ways and comes from a variety of sources. This can be data from sensors, user actions on websites, data from mobile terminals and smart devices, real-time trading data from financial markets, data from various monitoring programs, and so on. Most often, we cannot control the order in which data enters the system and the speed of their generation. Now, real-time streaming data is taking up an increasing portion of the total data volume.
Traditional data processing systems typically compute and analyze intact static data that is already stored in databases. This processing method is not suitable for constantly generated, infinite, and dynamic data streams. In addition, traditional batch processing methods usually have a relatively long time interval between data generation and processing. In today's world, to be competitive, you must make business decisions quickly and accurately. Therefore, the data must be processed in a shorter time, preferably even in real-time.
For this reason, real-time streaming data processing techniques are beginning to replace batch processing and occupy a central place in the stack of modern data analysis technologies. Unlike queries and analysis of static data, streaming can naturally simulate data flows and update computational results in real-time as new data becomes available. This allows companies to stay active and stay ahead of the curve in a highly competitive marketplace.
Relational databases store data before it is available for query. By the time the request is made, the data is outdated. Real-time data analysis requires data streaming. This raises the need for an SQL stream extension that supports stream processing. The requirements of today's business determine the future of SQL.
Streaming SQL is a query language that extends standard SQL with the ability to process streams of data in real time. It differs from old SQL in that it constantly receives data streams. It adds the ability to manipulate streams, which are infinite sequences of tuples, not all of which are available simultaneously. Thread queries are continuous, run for long periods, and return incremental results.
Using streaming systems, organizations can enter huge amounts of data into event tables from files, databases, and various other sources. You can write SQL-like queries to stream data without having to write any code. Streaming SQL makes it easy for organizations to capture, process, and deliver real-time data across multiple environments — whether they're in the cloud or on-premises.
It is worth noting that Streaming SQL is better suited for working with small subsets of data, where you need to get fast results and immediately determine the value of new data that is being created. It should not be used when dealing with massive databases.
To get the most out of the data that is constantly coming into the system, Streaming SQL allows you to transform, filter, merge and enrich the data.
Difference between SQL in streams and databases
There are several key differences between using SQL in streams and databases.
- Instant and continuous requests. When executing an SQL query against a traditional database, it returns a static result set that exists at the time of the query. With streaming SQL, you can run the same query and get a response at a specific point in time. However, given that the data changes quickly, you get results that are probably out of date. Therefore, it is much more useful to execute a query that is constantly updated (materialized view). When you create a materialized view, the SQL stream engine processes the entire history of data changes up to the present and then continues to update as new data comes in.
- Response time. When you query traditional databases, you get a response after a while. This time is called response time. In streaming databases, the initial response time only exists when the view is first materialized. Further, when streaming data, a delay in time is possible, which depends on the amount of input data.
- Frequency of queries to the database. A traditional database, after receiving a query, plans, optimizes it and returns the result. After that, the engine is idle until it receives a new request. In streaming databases, the engine that is responsible for reading data is constantly running, as it must process new data that are constantly coming in.
- Sequential and random access to data. With traditional databases, you can query data in any order, regardless of when the data entered the system. Streaming databases process data sequentially as it arrives.
- Data update rate. When working with streams, the data update rate is faster than when working with traditional databases.
- Storage size. Streaming databases use limited disk storage while traditional databases use unlimited ones.
Challenges and opportunities for Databases in the Age of Streaming
A lot of organizations are realizing the value of streaming data processing and are starting to use the appropriate systems.
The following streaming data processing solutions currently exist:
- Systems for collecting and capturing data in real-time.
- Real-time data storage systems.
- Stream computing engines.
- Subsequent data systems and applications.
However, implementing tools for streaming data processing is a complex task. Existing solutions are difficult to use and have problems with deployment, operation, and maintenance. They also have high entry barriers, non-unified APIs, migration difficulties, etc.
An important problem in working with streaming data is the difficulty in development and maintenance, and the need for various additional tools, which leads to a less reliable system. In addition, components and systems are not always efficient and the integration between them is not perfect. Therefore, it is often difficult to meet business needs.
To solve all problems in data flow management, it is important to use professional tools for working with stream data.
Such tools are a database system designed for streaming data, a streaming database for storing and processing streams of data in real-time, as well as the well-known SQL programming language. A streaming database uses streams as its main entity and real-time as its main feature, unlike other database systems that use static datasets (tables, documents, etc.) as their main unit of storage and processing.
The future of databases must solve all the problems that SQL stream developers face now.
Why do you need SQL for real-time streaming?
Even though SQL was designed to work with static data, it is also great for working with streaming data. Let's list the main advantages of using SQL for real-time streaming.
- SQL is a declarative programming language. It allows you to write simple queries to get complex real-time analytics. Developers have little to no need to make optimization decisions, resulting in improved performance.
- SQL is a very popular, widely used, and easy-to-learn language. This means that most developers can use it to work with streaming data without having to learn additional technologies.
- SQL allows the core complexity of data management operations to be encapsulated, making real-time analytics easier
- SQL queries can be automatically optimized for significant performance improvements. This reduces the time developers spend on query optimization.
- SQL platforms can be upgraded on the fly without the need to take them apart and recompile them.
- If you cannot implement certain operations using SQL, then you can write them in Java and deploy them in a SQL query.
- SQL can be generated automatically, allowing you to analyze streaming data using a GUI and visualize streaming analytics.
- SQL is a standardized programming language that can be used with modern DBMS, Oracle, but for working with streaming unstructured data. In doing so, you can use all standard data types and SQL statements. This allows you to add powerful real-time correlation, query streaming data, and process streaming data in time windows.
How to add SQL stream abstraction?
You should use different semantics for processing streaming data and data at rest. This is because data streams are initially ordered in time, since they are events that occur one after another, and tables model the current state of the world.
You can use SQL to query the event stream using the following semantics:
SELECT * FROM Products WHERE Status=‘Outdated’ EMIT CHANGES;
This query returns all added products and also keeps track of the product flow and adds a new entry whenever a product's status changes to '‘Outdated’'.
Using streaming databases, you can get information about all the changes that have happened to the data, as well as the current state of the data.
The stream can also be aggregated into a constantly updated table (materialized view) to provide events more compactly.
What are the types of queries in a streaming database?
In traditional databases, data is stored at rest. To get this data, we need to execute a query. Every time we want to know if this data has changed, we need to rerun the query. With the rapid growth of the amount of data and the high frequency of their updates, this approach becomes inefficient. In addition, we receive updated data with a delay in time, which harms modern business.
Streaming queries are more efficient because the data is updated continuously. This allows us to receive up-to-date data as soon as it enters the system.
Streaming data uses queries that are different from those that work with data at rest. Stream requests allow you to get information about the history of data changes, while standard ones return only the current state of the data.
Stream Request Syntax
To write a stream query using SQL, you can use the same syntax as for a classic query. However, the EMIT CHANGES syntax is used to tell the database to provide a change stream of results.
SELECT * FROM Cities WHERE CountryId=1 EMIT CHANGES;
This query differs from the one that returns static data only by the presence of the EMIT CHANGES statement. However, the query behavior has an important difference. This request is executed until explicitly aborted. The result of query execution is an endless stream of data. After each change (whether it's an insert, update, or deletion of a record), a new record is sent. A query that returns static data terminates as soon as the result is returned.
Differences in Querying Traditional and Streaming Databases
In a traditional database, every query performs a full table scan. To find out whether the data has been updated or not, you need to re-run the query. However, there may not be updated. This means that scanning is done in vain, which is too wasteful. In addition, the delay in query execution increases.
When executing streaming queries, the result can be computed much more efficiently. A full table scan is performed only once, the first time the query is run to retrieve the original data. After that, the query subscribes to changes in the table. This allows you to receive new data immediately after updating the table, without having to rescan the table each time. If there are no updates, then no calculations need to be done. This reduces latency and ensures that clients do not miss any updates.
What are materialized views on the data stream?
A materialized view is a database object that contains the result of a query. They allow you to greatly speed up the execution of queries that access a large number of records and quickly process huge amounts of data.
In a streaming database, a materialized view can be a stream or a table. They are used to store streaming data.
There are two main types of queries in a streaming database:
- Classic table lookup queries (pull queries).
- Continuous streaming requests. We can divide this type into two groups: temporary subscriptions (push requests) and persistent requests.
Push requests constantly send new records of results to the client. After disconnecting the client, such a request ends. However, events can continue to flow into the stream. Therefore, there is a need to process these events even without the participation of the client. This work is done by Persistent requests.
Persistent requests run inside the database server and continually update a stream or table as new input comes in. If the same materialized view is used as the data source for multiple queries, then any change to the input data will affect the results of all queries.
How to query SQL stream?
Above, we have already seen some examples of how to query data using SQL stream. Now let's take a closer look at some of the features of streaming SQL.
All events take place at a certain point in time. Time is a very important concept in the streaming data processing. The data in the stream must be placed depending on the time of their occurrence. However, in practice, this is a rather complicated process.
An event may arrive at the processing system much later than it occurred at its source. The reasons for this may be an unstable connection, system errors, etc. But regardless of the cause, this leads to the fact that the events in the stream are placed in the wrong order. To solve this problem, each record must have an event timestamp.
Window for aggregations
To combine data streams into a table by time intervals, you can use the timestamp of the events, which must be present in each record. To do this, use the TUMBLING WINDOW expression. It allows you to set additional grouping conditions on the timeline of data records.
Each tumbling window has a fixed length. They do not overlap. This means that each event is assigned to only one window, and each window is placed on the time axis one after the other.
Tumbling windows are designed around one attribute: time. Therefore, each window contains only one event, providing a robust use case for simple reporting.
The following example creates a 20-second rollover window to aggregate and group all clicks on a website based on the visitor's country of origin. The following SQL code is used for this.
SELECT Country, Count (*) AS VisitorCount
FROM ClickStream TIMESTAMP BY CreatedAt
GROUP BY Country, TumblingWindow(second, 20);
It is worth noting a few more features of tumbling windows.
- The groups are distinct, so each entry belongs to exactly one group. At the same time, one entry can belong to several windows, because the windows can overlap.
- Groups have no temporal semantics and are always "open" as time passes. Windows may close at some point in time and after that, they will no longer accept writes.
Using JOINs for Stream-to-Stream Joins
Joins are used to combine data from multiple datasets, enrich data, and detect correlation. This also works for connecting with unlimited streaming data.
Merging two threads means merging all their events. The streams are endless. Infinite storage is required to store all of their events, as you can not safely delete an event, since it could be connected to a future event from another thread.
To solve this problem, you can use the idea of window calculations and the WITHIN statement. This allows a sliding window to be applied to the connection so that the connection becomes restricted.
CREATE STREAM joined AS
FROM left_stream AS l INNER JOIN right_stream AS r</span>
WITHIN 1 MINUTES GRACE PERIOD 20 SECONDS
ON l.attribute1 = r.attribute2;
You can use the WITHIN statement can as an additional join condition for input record event timestamps (just like the TUMBLING WINDOW statement in an aggregation). However, there are certain drawbacks to using the WHERE statement.
- The possibility of failures. If the user forgets to add the required time limit, then the request will fail because they will run out of memory.
- Insufficient semantics. You cannot express additional time-related concepts such as GRACE PERIOD.
It is worth noting that the use of a stream-to-stream association gives real-time results. When a new event occurs in any input stream, it goes through the continuous streaming request and tries to find merge partners in the other stream's window. If any entry to attach is found, the result of the attachment is immediately added to the result stream. This process is much faster than the equivalent join in relational databases with table-to-table joins, which are usually very expensive and often produce results with high latency.
Another use of joins is to join events in a stream to rows in a table to enrich the event with additional information. Let's look at the following example:
CREATE STREAM enriched_payments AS
FROM payments INNER JOIN customers
ON payments.customer_id = customers.id;
The result of this request is a stream. For each customer transaction, the stream-side payment event is used to look up the customer table to enrich the event with the appropriate customer information, and the enriched event is added to the result stream.
Here, as well as in the process of combining streams, you get results in real-time. Each time a new event appears in the input stream, it is passed through the query to search this table, and the enriched event is immediately added to the result stream.
It is important that during the execution of a query, not only the stream but also the table can be updated. In this case, updates to the table are synchronized in time, and these changes will only be visible to newer events. Already enriched events will not be updated when the table changes to ensure immutability.
Thus, just like data at rest, data in motion can be combined from different sources. Permitted connections are stream-stream and stream-table. For this, you can use special SQL stream semantics, designed to work with streaming data.
Streaming SQL integration with external sources
To integrate streaming data with external sources, for a seamless process of importing and exporting streaming data, the streaming database must offer a wide range of options. This will keep the data in constant motion.
You can use Kafka connectors to integrate with external systems. Use the CREATE CONNECTOR statement to define source connectors for importing data from external sources as streams. You can also define sink connectors to export data as streams to external destinations.
CREATE SOURCE CONNECTOR `customers-from-postgres` WITH (
'mode' = 'bulk',
'topic.prefix' = 'prefix',
'table.whitelist" = 'table_name',
'key' = 'id');
You can use this code example to create a connector that will transfer data between different databases. All you have to do is set the connector class, select connections, and configure advanced options.
Data stream processing
For processing a large amount of data, and especially for processing streaming data, time is a very important factor. There are several aspects of time-based streaming data processing that we will cover in this section: real-time processing, time series data processing, and understanding time.
The key factors to consider when deploying a dataflow application are:
- The ability to maintain low latency results regardless of the amount of incoming data. This factor is especially important when data volumes grow by hundreds of thousands of records per second.
- Ability to process incoming data record by record. This factor is required for real-time analytics, alerts, and real-time processing.
- Correct understanding of time and the ability to process by the time the data was created, and not by the time it was received.
With an unlimited stream of real-time data coming in, it is important to be able to generate responses fairly quickly as the amount of data grows. The DBMS can process the data stream and generate low latency responses if the amount of data is small and the latency is measured in minutes. As the number of data increases, architectural constraints are important. You need to find a compromise between the delay of the result and the amount of data using data flow control.
Processing Time Series Data
Stream processing of time series data means processing record by record as it comes in. Some systems use a batch architecture to process streaming data. However, stream batching has several disadvantages when it comes to real-time data stream analytics.
For the correct and reliable processing of streaming data, the ability to understand time is essential. Most data flows use time windows to process incoming data. There is an approach that is to process the data using the time of arrival of the data. However, the data arrival time is not suitable for most real-world applications that use streaming data. In many cases, alerts and templates are only valid if they are based on when the data was created.
The time of data entry into the system can be used in very limited cases:
- When the data does not contain timestamps for any reason.
- On systems that do not require a data creation timestamp.
Example use cases for streaming SQL
Using streaming SQL has a lot of advantages. You can easily start using it if you are running a message broker. With the development of software, using streaming SQL will become even more difficult. Let's look at which business sectors it will be useful to apply it.
When using streaming SQL, you get real-time data, which makes it more accurate and valuable for business analysis. In addition, often the materialized representation of the primary source data has a simpler data structure. You can also speed up the process of loading data panels for analysis because streaming SQL is faster. It is also possible to coordinate time intervals and the order of operations in the batch data processing.
Streaming SQL is used to replace complex code that coordinates and transforms data in microservices. Any microservice component that looks like a query can be replaced with streaming SQL.
Streaming SQL is a simple and cheap tool that allows you to deliver data and updates in real-time. It may be the best option if your application is heavily dependent on real-time data.
If you get real-time dashboards with streaming SQL, then the next step could be automated decision-making using that data.
A stream processor should help developers write applications that respond to incoming data. With this data, you can detect anomalies that signal fraud in real-time and take appropriate action. You can also check, compare and analyze data to stop fraudulent transactions.
You can show patterns to detect fraudulent transactions with real-time transaction analysis. In addition, it is possible to send notifications about detected fraud using various forms of communication, such as email, SMS, social networks, etc.
To protect against cyber attacks, it is not enough to use only technical security systems. An important role is played by people who make reasonable decisions on cyber defense.
Using streaming data to detect anomalies in the data stream allows you to identify security issues in real-time to isolate threats. With streaming data processing, you can filter data from storage and integrate it into a more robust analytics platform.
With data streaming, you can identify a DDoS attack by analyzing the traffic to see if a suspicious amount of traffic is coming from a single IP address or huge traffic coming from a user with a single profile.
Sensor Data Processing
You can process and analyze data from sensors and devices in real-time. This can be useful for technical systems to get an idea of their performance and eliminate possible errors before a serious problem arises. This results in fewer failures and reduced maintenance costs.
With streaming data processing, you can analyze user behavior in real-time. Based on this data, advertisements that may be of interest to users are promoted.
In this article, we talked about the benefits of using traditional SQL in business, the opportunities it provides, and its use cases. Then we analyzed the development of modern technologies, new challenges, and requirements for automated systems, and concluded that traditional SQL can no longer meet modern requirements. At the same time, it is a very powerful and popular tool that should not be abandoned but only adapted to new requirements. Therefore the future of SQL depends on those tasks which will be put before it by businesses.
In this regard, streaming SQL has appeared. It allows you to process endless streams of data in real time. We have learned the SQL stream syntax, and how to use it to write queries, and create joins. We also talked about what types of queries are in a streaming database and examined the difference between queries in traditional and streaming databases.
Also, we mentioned that it is very important to integrate a streaming database with external sources and talked about how this can be implemented. In the end, we listed in detail the possible use cases for the streaming database.
If your business needs to process large volumes of data in real-time, if timely information is critical, then you should consider using a streaming SQL.