Predicting Univariate Time Series in Stream Processing Systems

My work this semester falls into two parts. I had little experience with streaming data or with applying machine learning techniques to it, so I first had to research the topic thoroughly. The first part of my work was therefore theoretical, and the second part was practical.

Machine learning-based solutions are already widespread in industry. They are applied in various fields to solve complex problems such as image recognition, detecting fraudulent banking transactions, and predicting heart disease. In many real-world applications it is essential to make inferences as quickly as possible, since fast reactions could save a lot of money.

There are many architectural approaches to real-time machine learning problems, and I want to present the main types. The first is the request-driven architecture, which has two main components: a client and a model server. The prediction model runs in the model server. The second is the event-driven architecture, which consists of a data store and a data processing application. The data store holds the data created by events; the processing application reads the data from the store, processes it, and makes predictions. In this case the ML model is embedded in the stream processing application.

In many cases it is not clear which architecture to use, and it would be best to combine elements of both. The good news is that the two are not mutually exclusive: using elements of the event-driven architecture does not prevent you from using components of the request-driven architecture, and vice versa. This combination is called a hybrid architecture. For example, if you want to use the event-driven architecture because you are working with streaming data, but you do not want to embed the model in the data processing application because a model server would be more practical in your case, then a hybrid architecture is the right choice.

This semester I built applications that solve a real-time machine learning problem on streaming data for a specific use case: making traffic-related predictions. I made two applications for the same use case, one built on the event-driven architecture and the other on the hybrid architecture. Since we are working with streaming data and latency is crucial, the event-driven architecture is the best choice according to my research. I implemented the hybrid version as well because I wanted to measure the latency difference between an embedded prediction model and a model server.
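The difference between the embedded and the hybrid variant can be illustrated with a minimal Python sketch. It is not the code of my applications: the moving-average model, the class and function names, the sample values, and the in-process "model server" that only simulates a request by serializing to JSON are all assumptions made for illustration. A real deployment would read events from a streaming platform and call the model server over the network, which adds further latency.

```python
import json
import time
from collections import deque


class MovingAverageModel:
    """Hypothetical stand-in model: forecasts the mean of the last `window` values."""

    def __init__(self, window=3):
        self.history = deque(maxlen=window)

    def predict(self, value):
        self.history.append(value)
        return sum(self.history) / len(self.history)


class ModelServer:
    """Request-driven component: takes a serialized request, returns a serialized response.

    Assumption: the network hop is replaced by JSON (de)serialization only.
    """

    def __init__(self, model):
        self.model = model

    def handle(self, request_body):
        value = json.loads(request_body)["value"]
        return json.dumps({"prediction": self.model.predict(value)})


def embedded_processor(events, model):
    """Event-driven: the stream processing application calls the model in-process."""
    return [model.predict(event["value"]) for event in events]


def hybrid_processor(events, server):
    """Hybrid: events still come from the stream, but each prediction is a server request."""
    predictions = []
    for event in events:
        response = server.handle(json.dumps({"value": event["value"]}))
        predictions.append(json.loads(response)["prediction"])
    return predictions


def timed(fn, *args):
    """Run fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical univariate traffic measurements, e.g. vehicles per minute.
events = [{"value": v} for v in [10.0, 12.0, 14.0, 16.0]]

embedded_preds, embedded_secs = timed(embedded_processor, events, MovingAverageModel())
hybrid_preds, hybrid_secs = timed(hybrid_processor, events, ModelServer(MovingAverageModel()))

print("embedded:", embedded_preds, f"{embedded_secs:.6f}s")
print("hybrid:  ", hybrid_preds, f"{hybrid_secs:.6f}s")
```

Both variants produce the same predictions; only the path to the model differs. In the hybrid variant every event pays for building a request and parsing a response, and that extra cost is the kind of latency difference the comparison is meant to measure.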

Kismóni Botond

2021-06-09

Sponsor: Cloudera