Data-Driven Decision Support: Concept Drift in Streamed Big Data

Concept Drift in Streamed Big Data

This section presents the definition of concept drift and how to detect, understand and react to it. Related real-world applications are also discussed.

Learning with concept drift

Learning with concept drift is an auxiliary research field of continuous learning and has also been referred to as learning under a dynamic environment, or learning in a non-stationary environment. The research objective is to identify whether the model learnt from historical data is still the member of the hypothesis set that performs best on the current concept, where a concept is a mapping from input space to labels or target values. Concept drift can be caused by changes in data distribution, or by training with misleading samples. Learning with high-volume streaming data requires particular attention to concept drift.

Concept drift can be categorized as sudden/abrupt drift, incremental drift, gradual drift or recurrent drift, according to When, How and Where: (1) When the drift occurs and how long it lasts; (2) How severe the drift is; and (3) Where the drift region is. These three criteria provide a three-dimensional perspective to describe concept drift. Drift adaptation strategies are thus specifically designed and applied to update models experiencing different types of drift.
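The four drift types can be made concrete with a small sketch. The function below returns how much of the new concept is in effect at time step t. It is an illustration only: the function name, the parameters (start, width, period) and their defaults are assumptions introduced here, not part of any published definition.

```python
def new_concept_weight(t, kind, start=100, width=50, period=100):
    """Return the weight (0.0 to 1.0) of the new concept at time step t.

    All parameter names and default values are illustrative assumptions.
    """
    if kind == "sudden":
        # The old concept is replaced at a single point in time.
        return 0.0 if t < start else 1.0
    if kind in ("gradual", "incremental"):
        # The weight ramps linearly from 0 to 1 over `width` steps.
        # gradual: read it as the probability that a sample comes from the new concept.
        # incremental: read it as how far the concept itself has moved.
        return min(1.0, max(0.0, (t - start) / width))
    if kind == "recurrent":
        # The new concept appears, then the old one returns, alternating every `period` steps.
        if t < start:
            return 0.0
        return 1.0 if ((t - start) // period) % 2 == 0 else 0.0
    raise ValueError(f"unknown drift kind: {kind}")
```

In this sketch the start argument corresponds to When and width to how long the drift lasts. Gradual and incremental drift share the same ramp and differ only in how the weight is interpreted.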

Early concept drift studies mainly focused on drift point detection, addressing the When criterion by identifying when the empirical error exceeded the upper bound of an established model. The adaptation methods were to relearn the model or to use ensemble algorithms to adapt to new concepts. In recent years, drift point detection has been developed to cover more complicated cases, such as feature selection drift, region selection drift and the detection of multi-layer drift. These developments address the Where criterion. Some drift detection techniques have objectives similar to those of multivariate two-sample tests, which compare the similarity between two distributions according to the available samples. A number of recent publications have considered the test statistics applied in two-sample tests as a measure for quantifying drift severity, addressing the issue of How (How severe the drift is). However, very few have proposed drift adaptation strategies that use the severity information to learn new concepts.
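As an illustration of error-based drift point detection, the following minimal sketch flags drift when the running error rate rises well above its best observed level. It is loosely modeled on the rule used by the Drift Detection Method (DDM). The class name, the threshold of three standard deviations and the warm-up length are assumptions made for this example, not necessarily what the studies referred to above used.

```python
import math


class ErrorRateDriftDetector:
    """Signals drift when the running error rate exceeds an upper bound."""

    def __init__(self, threshold=3.0, min_samples=30):
        self.threshold = threshold      # assumed: width of the bound in standard deviations
        self.min_samples = min_samples  # assumed: warm-up length before testing
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.best_p = float("inf")  # error rate at the best point seen so far
        self.best_s = float("inf")  # its standard deviation

    def update(self, is_error):
        """Feed one prediction outcome (True if wrong); return True if drift is flagged."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n < self.min_samples:
            return False
        # Remember the point where error rate plus deviation was lowest.
        if p + s < self.best_p + self.best_s:
            self.best_p, self.best_s = p, s
        # Drift: the current error rate is above the bound set by the best point.
        if p + s > self.best_p + self.threshold * self.best_s:
            self.reset()
            return True
        return False
```

Calling update once per labeled sample yields a drift signal at the step where the bound is first exceeded, which addresses only the When criterion.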

Learning with concept drift has three steps: drift detection, understanding drift, and drift adaptation. We will discuss the challenges of each step in the paragraphs that follow.

Drift detection

A wide range of algorithms for concept drift detection have been developed to identify the inconsistency between historical data and newly available data. False-positive and false-negative criteria are used to evaluate the performance of drift detection algorithms. A lower Type I error rate means fewer false-positive detections (false alarms), and a lower Type II error rate means fewer false-negative detections (missed drifts). In the case of high-volume streaming data, these criteria alone may be inadequate, since Velocity means that data arrive at a very fast pace and there may be insufficient time to collect labels or target values for drift detection. Drift detection algorithms must detect drift with a limited quantity of labeled samples, so solutions that achieve the desired drift detection accuracy with the fewest samples are preferable. In other words, the convergence rate of algorithms should also be considered as an evaluation metric. Although active learning has been applied to this problem, handling Velocity is still an open question.
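The evaluation criteria above can be sketched as a small scoring function that counts false alarms and missed drifts and also reports the detection delay, a rough proxy for how many samples a detector needs. The matching rule (a detection counts if it falls within tolerance steps after a true drift point), the function name and the default tolerance are assumptions chosen for illustration.

```python
def evaluate_detections(true_drifts, detections, tolerance=50):
    """Score a detector against known drift points.

    A detection is a true positive if it falls within `tolerance` steps
    after a true drift point that has not been matched yet (assumed rule).
    Returns (false_positives, false_negatives, mean_delay).
    """
    matched = set()
    delays = []
    false_positives = 0
    for d in sorted(detections):
        # Find the earliest unmatched true drift that this detection could explain.
        hit = next(
            (t for t in sorted(true_drifts)
             if t not in matched and 0 <= d - t <= tolerance),
            None,
        )
        if hit is None:
            false_positives += 1  # false alarm (Type I error)
        else:
            matched.add(hit)
            delays.append(d - hit)
    false_negatives = len(set(true_drifts)) - len(matched)  # missed drifts (Type II)
    mean_delay = sum(delays) / len(delays) if delays else None
    return false_positives, false_negatives, mean_delay
```

For example, with true drifts at steps 200, 600 and 1000 and detections at 223, 450 and 1010, the detection at 450 is a false alarm, the drift at 600 is missed, and the two matched detections have delays of 23 and 10 steps.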

Drift understanding

Understanding drift is another key stage of learning under concept drift. It refers to retrieving information about the When, How, and Where of concept drift, which together describe the status of the drift. This information is learned and integrated after drift has been confirmed by drift detection methods or algorithms and is used as the input for knowledge adaptation. The need to understand drift has increasingly gained attention, but very few concrete methods have been developed to quantify this information.
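One possible way to quantify How and Where is sketched below, assuming a two-sample Kolmogorov-Smirnov statistic computed per feature as the severity measure. This is a simplification made for this example: it treats Where as "which feature shifted most" rather than a region of the feature space, and it cannot see changes that only affect the relationship between inputs and labels. The function names are hypothetical.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def describe_drift(reference, recent):
    """Compare two windows of rows (equal-length tuples of numeric features).

    Returns (severity, where): the KS statistic of each feature (How)
    and the index of the feature that shifted most (a crude Where).
    """
    n_features = len(reference[0])
    severity = [
        ks_statistic([row[k] for row in reference], [row[k] for row in recent])
        for k in range(n_features)
    ]
    where = max(range(n_features), key=severity.__getitem__)
    return severity, where
```

A severity near 0 means the two windows look alike for that feature, and a value near 1 means they barely overlap.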

Drift adaptation

Updating existing learning models according to the characteristics of the drift is critical to achieving consistently high performance. This is called drift adaptation (or knowledge adaptation). Some adaptation methods explicitly rely on drift detection algorithms and adopt a variety of retraining strategies to better handle different types of drift. Others, mainly decision-tree-based methods, may not include a global drift detection procedure but can partially update models according to changes in some leaf nodes based on the newly available data. Ensemble learning for streaming data with concept drift has also achieved remarkable results; however, integrating concept drift adaptation into incremental learning is still a challenging problem. Making better use of How and Where drift information in high-volume streaming data learning, rather than only When, is the next step in boosting learning performance.
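The simplest detection-driven retraining strategy can be sketched as follows: when recent errors pile up, discard the old model and relearn from a window of recent data. The toy majority-class learner, the window size and the error trigger are all assumptions made to keep the example self-contained. A real system would plug in its own learner and drift detector.

```python
from collections import Counter, deque


class MajorityClassModel:
    """Toy stand-in for a real learner: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, label):
        self.counts[label] += 1

    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None


def run_with_adaptation(labels, window=20, max_error=0.5):
    """Process a label stream, rebuilding the model when recent errors are too high.

    Drift is flagged when the error rate over the last `window` predictions
    exceeds `max_error` (an assumed, deliberately simple trigger).
    Returns (number of mistakes, time steps at which the model was rebuilt).
    """
    model = MajorityClassModel()
    recent = deque(maxlen=window)    # most recent labels, kept for retraining
    outcomes = deque(maxlen=window)  # True = wrong prediction
    mistakes, rebuilds = 0, []
    for t, label in enumerate(labels):
        wrong = model.predict() != label  # test first, then train
        mistakes += wrong
        outcomes.append(wrong)
        recent.append(label)
        model.learn(label)
        if len(outcomes) == window and sum(outcomes) / window > max_error:
            # Drift adaptation: drop the old model, relearn from recent data only.
            model = MajorityClassModel()
            for y in recent:
                model.learn(y)
            outcomes.clear()
            rebuilds.append(t)
    return mistakes, rebuilds
```

On a stream of 100 "a" labels followed by 100 "b" labels, this sketch rebuilds the model shortly after the change instead of waiting for the new label to outnumber the old one in the full history. It uses only When information. Severity or region information would be needed to decide how much of the old model to keep.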

Concept drift applications

Handling concept drift is highly important in real-world practice; for example, in traffic networks, telecommunications, and financial transactions. Machine learning tasks in these systems will inevitably encounter the problem of concept drift, and in some cases, the ability to handle concept drift will be the key factor in improving system performance.

Concept drift applications in industry fall into drift detection applications and drift adaptation applications. Drift detection applications in this context refer to the industrial requirement to diagnose significant internal and external environmental changes in industry trends or customer preferences, such as using drift detection technology to identify changes in the news preferences of users. Similar tasks include fraud detection in finance, intrusion detection in computer security, mobile masquerade detection in telecommunications, topic changes in information document organization, and clinical studies in biomedicine. The aim of drift adaptation applications is to maintain a continuously effective evaluation and prediction system for industry. This may also involve using drift detection technologies to achieve greater accuracy. One real case example is a credit risk assessment framework designed for dynamic credit scoring. Other real-world drift adaptation applications can be found in transportation traffic management, production and service monitoring, customer recommendation, bankruptcy prediction, and so on.

With the latest developments in technology, data streams have become larger in size and faster. The new challenges posed by high-volume streaming data require the development of more advanced concept drift applications. One such challenge is how to handle concept drift problems in the Internet of Things (IoT), since the huge quantity of streaming data from the IoT requires deeper insight and better understanding of concept drift.
