Concept Drift in Streamed Big Data

This section presents the definition of concept drift and how to detect, understand and react to it. Related real-world applications are also discussed.


Learning with concept drift

Learning with concept drift is an auxiliary research field of continuous learning, as discussed in, and has also been referred to as learning under a dynamic environment, or learning in a non-stationary environment. The research objective is to determine whether the model learnt from historical data is still the model in the hypothesis set that performs best on the current concept, where a concept is a mapping from the input space to labels or target values. Concept drift can be caused by changes in the data distribution or by training with misleading samples. Learning with high-volume streaming data requires particular attention to be paid to concept drift.
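
One common way to make this precise, stated here as an assumption because the text above introduces no notation, is to treat the data at time t as drawn from a joint distribution P_t(X, y) over inputs X and targets y:

```latex
\text{drift between } t \text{ and } t+1
\iff P_t(X, y) \neq P_{t+1}(X, y),
\qquad
P_t(X, y) = P_t(X)\, P_t(y \mid X)
```

Under this formalization, a change in P_t(y | X) alters the input-to-label mapping itself, while a change in P_t(X) alone moves where the data lies without necessarily changing the mapping.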

Concept drift can be categorized as sudden/abrupt drift, incremental drift, gradual drift or recurrent drift, according to When, How and Where: (1) When the drift occurs and how long it lasts; (2) How severe the drift is; and (3) Where the drift region is. These three criteria provide a three-dimensional perspective to describe concept drift. Drift adaptation strategies are thus specifically designed and applied to update models experiencing different types of drift.
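
The four drift types can be illustrated with a toy one-dimensional stream whose generating mean moves between two concepts. This is only a sketch: the names `ramp`, `concept_mean` and `make_stream`, the two means, and the transition shapes are hypothetical choices made for illustration, not definitions taken from the literature.

```python
import random

OLD_MEAN, NEW_MEAN = 0.0, 3.0  # two hypothetical concepts


def ramp(t, start, width):
    """Linear ramp from 0 to 1 over the interval [start, start + width]."""
    return min(1.0, max(0.0, (t - start) / width))


def concept_mean(t, kind, rng, start=100, width=100, period=50):
    """Mean of the distribution that generates the sample at time t."""
    if kind == "sudden":
        # the concept switches at a single point in time
        return NEW_MEAN if t >= start else OLD_MEAN
    if kind == "incremental":
        # the concept itself moves through intermediate states
        return OLD_MEAN + ramp(t, start, width) * (NEW_MEAN - OLD_MEAN)
    if kind == "gradual":
        # samples come from either concept; the new one becomes more likely
        return NEW_MEAN if rng.random() < ramp(t, start, width) else OLD_MEAN
    if kind == "recurrent":
        # the old concept reappears periodically
        return NEW_MEAN if (t // period) % 2 == 1 else OLD_MEAN
    raise ValueError(f"unknown drift kind: {kind}")


def make_stream(kind, n=300, seed=0):
    """Generate n noisy samples under the chosen drift type."""
    rng = random.Random(seed)
    return [rng.gauss(concept_mean(t, kind, rng), 1.0) for t in range(n)]
```

In this toy setting, `start` and `width` correspond to the When criterion and the gap between the two means to the How criterion; Where has no counterpart because the stream has a single feature.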

Early concept drift studies mainly focused on drift point detection, addressing the When criterion by identifying when the empirical error exceeded the upper bound of an established model. The corresponding adaptation methods either relearned the models or used ensemble algorithms to adapt to new concepts. In recent years, drift point detection has been developed to cover more complicated cases, such as feature selection drift, region selection drift and the detection of multi-layer drift. These developments address the Where criterion. Some drift detection techniques have similar objectives to multivariate two-sample tests, which compare the similarity between two distributions according to the available samples. A number of recent publications have considered the test statistics applied in two-sample tests as a measure for quantifying drift severity, addressing the How criterion (how severe the drift is). However, very few have proposed drift adaptation strategies that use the severity information to learn new concepts.
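
The error-bound idea behind early drift point detection can be sketched as follows. The sketch is loosely modeled on the Drift Detection Method (DDM) style of error-rate monitoring; the class name `ErrorRateDriftDetector`, the 30-sample warm-up and the factor 3.0 are assumptions chosen for illustration, not values prescribed by the text above.

```python
import math


class ErrorRateDriftDetector:
    """Signals drift when the running error rate exceeds a bound derived
    from the best (lowest) error level seen so far."""

    def __init__(self, min_samples=30, threshold=3.0):
        self.min_samples = min_samples  # warm-up before any decision
        self.threshold = threshold      # width of the bound, in std units
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed 1 for a wrong prediction, 0 for a correct one.

        Returns True when drift is signalled; the statistics are then reset.
        """
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n                 # empirical error rate
        s = math.sqrt(p * (1 - p) / self.n)      # its standard deviation
        if self.n < self.min_samples:
            return False
        if p + s < self.p_min + self.s_min:
            # remember the best error level observed so far
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.threshold * self.s_min:
            self.reset()
            return True
        return False
```

A detector of this kind answers only the When question: it reports a time step, not how severe the change is or which region of the input space it affects.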

Learning with concept drift has three steps: drift detection, understanding drift, and drift adaptation. We will discuss the challenges of each step in the paragraphs that follow.


Drift detection

A wide range of algorithms for concept drift detection have been developed to identify the inconsistency between historical data and newly available data. False-positive and false-negative criteria are used to evaluate the performance of drift detection algorithms. A Type I error is a false-positive detection (a drift is signalled when none has occurred), and a Type II error is a false-negative detection (a real drift is missed); a good detector keeps both rates low. In the case of high-volume streaming data, these two criteria may be inadequate, since Velocity means that data arrives at a very fast pace and there may be insufficient time to collect labels or target values for drift detection. Drift detection algorithms must detect drift with a limited quantity of labeled samples, so solutions that achieve the desired drift detection accuracy with the fewest samples are preferable. In other words, the convergence rate of algorithms should also be considered as an evaluation metric. Although active learning has been applied to solve this problem, solving the issue of Velocity is still an open question.
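
The evaluation criteria above can be made concrete with a small scoring routine. This is a minimal sketch under stated assumptions: the function name `evaluate_detections`, the matching rule (each true drift is paired with the first unused alarm within `tolerance` steps after it) and the default tolerance are hypothetical, and detection delay is used here as a simple proxy for how many samples a detector needs.

```python
def evaluate_detections(true_drifts, alarms, tolerance=50):
    """Count false positives, false negatives and mean detection delay.

    true_drifts: time steps at which drift really occurred.
    alarms:      time steps at which the detector signalled drift.
    """
    alarms = sorted(alarms)
    used = set()
    delays = []
    for d in sorted(true_drifts):
        # first unused alarm that falls inside the tolerance window
        match = next(
            (a for a in alarms if a not in used and d <= a <= d + tolerance),
            None,
        )
        if match is not None:
            used.add(match)
            delays.append(match - d)
    return {
        "false_positives": len(alarms) - len(used),        # Type I errors
        "false_negatives": len(true_drifts) - len(delays),  # Type II errors
        "mean_delay": sum(delays) / len(delays) if delays else None,
    }
```

For example, with true drifts at steps 200 and 500 and alarms at steps 120, 209 and 800, the routine reports two false positives, one false negative and a mean delay of 9 steps.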


Drift understanding

Understanding drift is another key stage of learning under concept drift. It refers to retrieving information about the When, How, and Where of concept drift and is used to describe the status of concept drift. This information is learned and integrated after drift has been confirmed by drift detection methods or algorithms and is used as the input for knowledge adaptation. The need to understand drift has increasingly gained attention, as mentioned in, but very few concrete methods have been developed to quantify this information.
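
One simple way to quantify How and Where, sketched here under clearly stated assumptions, is to compare a reference window with a current window feature by feature using a two-sample statistic. The sketch uses the univariate Kolmogorov-Smirnov statistic per feature rather than a multivariate test, and treats "the feature with the largest statistic" as a crude proxy for the drift region; the names `ks_statistic` and `describe_drift` are hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # advance both CDFs past the value x
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def describe_drift(reference, current):
    """Summarize severity (How) per feature and the most affected feature
    (a rough stand-in for Where). Inputs are lists of equal-length rows."""
    n_features = len(reference[0])
    severity = [
        ks_statistic([row[k] for row in reference],
                     [row[k] for row in current])
        for k in range(n_features)
    ]
    where = max(range(n_features), key=severity.__getitem__)
    return {"severity_per_feature": severity, "most_affected_feature": where}
```

Per-feature statistics miss drift that only shows up in the joint behaviour of several features, which is one reason multivariate tests are of interest.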


Drift adaptation

How to update existing learning models according to the characteristics of the drift is critical to achieving consistently high performance. This is called drift adaptation (or knowledge adaptation). Some adaptation methods explicitly rely on drift detection algorithms and adopt a variety of retraining strategies to better handle different types of drift. Others, mainly decision-tree-based methods, may not include a global drift detection procedure but can partially update models according to changes in some leaf nodes based on the newly available data. Ensemble learning for streaming data with concept drift has also achieved remarkable results; however, integrating concept drift adaptation into incremental learning is still a challenging problem. Making better use of How and Where drift information in high-volume streaming data learning, rather than only When, is the next step in boosting learning performance.
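
The detection-driven retraining strategy can be sketched with a deliberately simple model. Everything here is an illustrative assumption: the "model" is a constant prediction, the detection rule (mean absolute error over a window exceeding a threshold) is a crude stand-in for a real drift detector, and the name `run_with_adaptation` and its parameters are hypothetical.

```python
from collections import deque


def run_with_adaptation(targets, detect_window=20, retrain_size=5,
                        threshold=1.0):
    """Process a stream of numeric targets and retrain on recent data
    whenever the recent mean absolute error exceeds `threshold`.

    Returns the time steps at which retraining happened and the
    per-step absolute errors.
    """
    recent = deque(maxlen=retrain_size)   # newest targets, used to retrain
    errors = deque(maxlen=detect_window)  # newest errors, used to detect
    prediction = None
    retrain_points = []
    abs_errors = []
    for t, y in enumerate(targets):
        if prediction is None:
            prediction = y  # bootstrap the model from the first sample
        err = abs(y - prediction)
        abs_errors.append(err)
        errors.append(err)
        recent.append(y)
        # detection step: recent error level is too high
        if len(errors) == detect_window and sum(errors) / detect_window > threshold:
            # adaptation step: discard the old model, refit on recent data
            prediction = sum(recent) / len(recent)
            errors.clear()
            retrain_points.append(t)
    return retrain_points, abs_errors
```

This sketch uses only the When information. A severity-aware variant could, for instance, choose `retrain_size` according to how large the detected change is, which is the kind of use of How information the paragraph above calls for.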


Concept drift applications

Handling concept drift is highly important in real-world practice; for example, in traffic networks, telecommunications, and financial transactions. Machine learning tasks in these systems will inevitably encounter the problem of concept drift, and in some cases, the ability to handle concept drift will be the key factor in improving system performance.

A discussion of concept drift applications in industry can be found in. Drift detection applications in this context refer to the industrial requirement to diagnose significant internal and external environmental changes in industry trends or customer preferences, such as using drift detection technology to identify changes in the news preferences of users. Similar tasks include fraud detection in finance, intrusion detection in computer security, mobile masquerade detection in telecommunications, topic changes in information document organization, and clinical studies in biomedicine. The aim of drift adaptation applications is to maintain a continuously effective evaluation and prediction system for industry. This may also involve using drift detection technologies to achieve greater accuracy. A real case example, in which a credit risk assessment framework for dynamic credit scoring was designed, is presented in. Other real-world drift adaptation applications can be found in transportation traffic management, production and service monitoring, customer recommendation, bankruptcy prediction, and so on.

With the latest developments in technology, data streams have become both larger and faster. The new challenges posed by high-volume streaming data require the development of more advanced concept drift applications. One such challenge is how to handle concept drift problems in the Internet of Things (IoT), since the huge quantity of streaming data from the IoT requires deeper insight into, and a better understanding of, concept drift.