Data Mining Analytics for BI and Decision Support
Summary
Current Status and Future Directions
In the marketplace today, data mining is primarily an enabler for business intelligence systems. Data mining algorithm suites are available as software packages, some loosely coupled with database technology. Building a successful data mining application usually involves a heavy emphasis on data warehousing, followed by exploratory data mining. The analysis and application building are typically conducted by consultants or in-house analytic teams. The key challenges to the successful completion of a data mining project are the data warehousing requirements and the sophisticated analytics requirements.
To address these challenges, key research trends in data mining include systems research, to enable transparent and pervasive usage; algorithms research, to provide scalable, optimized, and robust mining; and solutions research, to embed data mining into vertically integrated applications.
The key goal in systems research is to enable transparent usage of data mining in the environments where data typically resides and is managed. This includes building database extenders (e.g., User Defined Functions for model training/scoring and for sufficient statistics such as histograms, counts, and samples), parallel and distributed data mining (to support scalability via parallelization and built-in sampling), XML-based APIs for database coupling and application embedding (to enable interoperability and training/scoring in different environments), and intelligent or semi-automated data warehousing for mining (by providing industry-specific templates and meta-data mining).
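As a rough illustration of the database-extender idea, the sketch below registers a scoring function as a User Defined Function in an in-memory SQLite database and computes per-segment counts as a simple sufficient statistic. It uses Python's standard sqlite3 module; the table layout, the column names, and the logistic-regression coefficients are all hypothetical and stand in for a model trained elsewhere. It is a minimal sketch under those assumptions, not a description of any particular product.

```python
import math
import sqlite3

# Hypothetical logistic-regression coefficients, assumed to come from a
# model trained elsewhere; they are illustrative values only.
INTERCEPT = -3.0
W_AGE = 0.04
W_VISITS = 0.5


def score(age, visits):
    """Return the modeled response probability for one row."""
    z = INTERCEPT + W_AGE * age + W_VISITS * visits
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")

# Register the Python function as a two-argument SQL function named "score".
conn.create_function("score", 2, score)

# Hypothetical customer table used only for this example.
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER PRIMARY KEY, age REAL, visits REAL, segment TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, 25, 1, "A"),
        (2, 40, 3, "A"),
        (3, 60, 6, "B"),
        (4, 35, 0, "B"),
    ],
)

# Model scoring runs inside the query, where the data resides.
scored = conn.execute(
    "SELECT id, score(age, visits) FROM customers ORDER BY id"
).fetchall()

# A simple sufficient statistic: row counts per segment.
counts = dict(
    conn.execute(
        "SELECT segment, COUNT(*) FROM customers GROUP BY segment"
    ).fetchall()
)
```

The point of the design is that the application only issues SQL; the model is applied where the rows live, and aggregate statistics are computed by the database rather than by exporting the data.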
The key goal in algorithms research is to enable robust and automated data mining, thereby making it easier for non-experts to build and run data mining applications. This includes building better techniques for automated evaluation metrics, automated feature extraction/transformation/selection, discovering relational and hierarchical structures among attributes, incorporating prior knowledge to account for costs, benefits, uncertainty, and missing values, incremental and on-line mining, privacy-preserving data mining, and heterogeneous data mining.
The key goal in solutions research is to develop solution-specific data mining components that are optimized for the application at hand and can be embedded into a vertically integrated application. Key application areas driving this research include business processes such as risk management, targeted marketing, and portfolio management; systems processes such as computer and network performance management; and Internet processes such as site profiling, performance tuning, and user personalization.
We see two areas in which data mining and operations research (and optimization techniques) will begin to intersect and interact more frequently as data mining technology matures. The first is acting on mined insights: while data mining can assist in the automated discovery of actionable insights from data, the resulting actions can be executed efficiently only by coupling the output of data mining with optimization methods.
Very often, actionable insights must be acted upon subject to business constraints such as budgets and schedules. This can be done effectively only by applying optimization techniques to the outputs of data mining. A classic example is the Mail Stream Optimization solution. The second area in which the two disciplines will interact more frequently and productively is the design of the data mining algorithms themselves. Many data mining algorithms rely on heuristic search techniques that try to optimize some objective function (e.g., minimizing predictive error on hold-out data). Optimization methods can naturally contribute here to the design of more effective data mining algorithms.
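To make the first point concrete, the sketch below takes model scores (response probabilities) and selects which customers to contact under a fixed budget. It is a simple greedy heuristic for a knapsack-style selection, ranking offers by expected profit per unit of contact cost; it is not guaranteed to be optimal, and it is not the Mail Stream Optimization solution mentioned above. The Offer fields, the profit formula, and all numbers are hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    customer_id: int
    response_prob: float  # assumed output of a data mining model
    revenue: float        # assumed revenue if the customer responds
    cost: float           # assumed cost of contacting the customer

    @property
    def expected_profit(self):
        return self.response_prob * self.revenue - self.cost


def select_offers(offers, budget):
    """Greedy heuristic: pick offers by expected profit per unit cost.

    Offers with non-positive expected profit are never selected. The
    result respects the budget but is not guaranteed to be optimal.
    """
    candidates = [o for o in offers if o.cost > 0 and o.expected_profit > 0]
    candidates.sort(key=lambda o: o.expected_profit / o.cost, reverse=True)

    chosen, spent = [], 0.0
    for offer in candidates:
        if spent + offer.cost <= budget:
            chosen.append(offer)
            spent += offer.cost
    return chosen, spent


# Hypothetical scored customers and a hypothetical budget.
offers = [
    Offer(1, 0.10, 100.0, 2.0),
    Offer(2, 0.02, 100.0, 2.0),
    Offer(3, 0.30, 50.0, 5.0),
    Offer(4, 0.05, 400.0, 4.0),
]
chosen, spent = select_offers(offers, budget=7.0)
```

A production system would more likely hand the same scores to an integer programming solver, which can also handle schedule and capacity constraints; the greedy rule is used here only because it is short and easy to follow.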
In conclusion, given the current research and solution development directions, we see two key trends emerging: data mining will join Mathematical Programming and Optimization as a key scientific technology for building decision support systems, and using data mining should eventually become as easy and pervasive as working with databases and spreadsheets is today.