BUS610 Study Guide

Unit 3: Data Mining and Text Mining

3a. Choose appropriate datasets to meet the requirement 

  • What is big data, and how is it used?
  • How are data mining systems used to extract data from the data warehouse?
  • What are the differences between big data produced intentionally or unintentionally by humans and machines?

Big data typically describes data sets so large or complex that traditional data-processing techniques often prove inadequate. The structure of big data is described by:

  • Volume: the amount of data, measured in gigabytes or terabytes
  • Velocity: the rate at which data arrives, from one-time snapshots to continuous high-frequency streams
  • Variety: structured data (numeric, alphabetic) and unstructured data (text, sound, image or video, genomics)
  • Veracity: validation, noise level, deception detection, relevance, and ranking
  • Value: the usefulness of the data in supporting decisions that add economic value

We store big data in the data warehouse and use data mining techniques to extract data for use by business intelligence systems. Data mining systems are designed to find patterns and correlations in data from data warehouses and generally prepare data for use in the decision support systems used by decision-makers. This means that they facilitate decision-making but are not directly involved in the decision-making process.
 
The vast majority of the data in existence today was created in just the past few years. The challenge is to extract value from it and put it to work for organizations and individuals. The vast amount of personal data produced by citizens can be of value to both the public and private sectors.
 
To review, see Big Data.
 

3b. Describe the four stages of the data mining process: data generation, data acquisition, data storage, and data analytics 

  • How are data mining systems used to extract data from the data warehouse?
  • What is involved in the data preparation process?
  • Why is the data preparation and cleaning process important in supporting a BI system?

Data mining is a data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes. Data mining arose primarily along with data warehouses to address some of the limitations in OLTP systems.
 
Data mining is often implemented to populate a data warehouse. It evolved to address the limitations of transaction processing systems in dealing with massive data sets that have high dimensionality, new data types, and multiple heterogeneous data sources.
 
The data preparation process follows a lifecycle. Data must first be gathered from both internal and external sources, and it is likely to be stored in a wide variety of formats. We then use data discovery processes, like data mining, to understand the information and insights the data might provide to the managerial decision-maker. After this, we must clean the data: data gathered from a wide variety of sources is likely to contain inaccuracies and inconsistencies and to lack the degree of integrity needed to provide a reliable source of actionable information. Once the data has been cleaned, it will likely need to be transformed into formats and structures better suited to supporting a business intelligence system. We may also enrich the data by providing additional insights, expansions, and clarifications. Finally, we store the data in the data warehouse.
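The same lifecycle can be sketched in code. The snippet below is a minimal, hypothetical illustration (the column names, cleaning rules, and file name are invented for this example) of gathering, cleaning, transforming, and enriching a small dataset with pandas before it would be loaded into a warehouse.

```python
import pandas as pd

# Gather: combine data from two hypothetical internal and external sources.
internal = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "region": ["north", "North", "North", None],
    "sales": [1200.0, 850.0, 850.0, None],
})
external = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "wholesale", "retail"],
})

# Clean: remove duplicate records, standardize text, and impute missing values.
clean = (
    internal.drop_duplicates()
    .assign(region=lambda d: d["region"].str.lower().fillna("unknown"))
    .assign(sales=lambda d: d["sales"].fillna(d["sales"].mean()))
)

# Transform and enrich: join the external attribute and add a derived column.
prepared = clean.merge(external, on="customer_id", how="left")
prepared["high_value"] = prepared["sales"] > 1000

# Store: in practice this would be loaded into the data warehouse;
# writing a file here is only a stand-in for that step.
prepared.to_csv("prepared_sales.csv", index=False)
print(prepared)
```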
 
This data preparation process is fundamental to the ultimate success of the business intelligence system. Again, the adage of "garbage in, garbage out" comes into play: if we do not remove the biases and inconsistencies from the source data, they will find their way into our decision-making process, and the quality of our decisions will suffer.


Four leadership roles are needed to take on the challenges of implementing big-data analytics in an organization:

  1. Chief Data Officer: the data owner and architect who sets the data definitions and strategies
  2. Chief Analytics Officer: holds board-level responsibility for keeping the organization's analytics efforts forward-thinking
  3. Data Scientist: brings strong technical skills along with a proficient understanding of the business
  4. Data Manager: serves as the organizer and architect of the data

Online Analytical Processing (OLAP) systems use computing methods that enable users to easily and selectively extract and query data in order to analyze it from different points of view. These systems are recipients of data provided through data mining. They are also capable of dealing with high dimensionality, new data types, and multiple heterogeneous data sources.
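As a rough illustration of what "analyzing data from different points of view" means, the hedged sketch below uses a pandas pivot table to slice the same made-up sales data by region and product, and then by quarter. The column names and values are assumptions for this example, not the interface of any particular OLAP product.

```python
import pandas as pd

# Hypothetical fact table of the kind an OLAP system might query.
sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "south"],
    "product": ["A", "B", "A", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2"],
    "revenue": [100, 150, 120, 130, 90],
})

# One point of view: total revenue by region and product.
by_region_product = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)

# Another point of view: the same data sliced by quarter instead.
by_quarter = sales.pivot_table(index="quarter", values="revenue", aggfunc="sum")

print(by_region_product)
print(by_quarter)
```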
 
To review, see Practical Real-world Data Mining.
 

3c. Standardize and exploit text and develop a taxonomy 

  • What are some of the reasons we use text mining?
  • How is text mining accomplished?

Much of the information and data that we are interested in using to support decision-making will take the form of natural language text. A natural language is any human language – English, Spanish, German, Chinese, etc. Natural languages are inherently complicated and create myriad problems for the text refinement methods used to identify textual relationships; one example is words that share the same spelling but have divergent meanings, such as "live" (the verb, to be alive) and "live" (the adjective, as in seeing a performance in person). Text mining treats the two as the same word, even though one is a verb and the other an adjective.
 
Text mining is the process of transforming unstructured natural language text into a structured format to identify meaningful patterns and new insights. As these systems continue to evolve, the next major innovation is likely to incorporate some recently developed AI systems. Technological advances will likely enhance analysts' ability to standardize and exploit text. AI techniques, especially those that can speed up the data mining process, are some of the most recent developments in the field, and their inclusion in advanced text mining systems is likely to be the most transformational.
 
Text analytics enables businesses to discover insights and meaning from unstructured text-based data. Through the analytic processing of unstructured text, the underlying facts of the situation are discovered.
 
Text analysis is the process by which information is automatically extracted from and classified within text data. This text could take the form of survey responses, emails, support tickets, call center notes, product reviews, social media posts, or any other feedback given in free-text format. Text analytics enables businesses to discover insights within this unstructured data.
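As a minimal, hypothetical sketch of moving from unstructured text to a structured form, the snippet below tokenizes a few invented survey responses and counts term frequencies using only the Python standard library. Real text mining systems add much more (stemming, larger stop-word lists, classification, taxonomies), but the basic step from free text to a structured count table is the same.

```python
import re
from collections import Counter

# Invented free-text feedback of the kind a survey might return.
responses = [
    "The delivery was fast and the support team was helpful.",
    "Support was slow to respond, but the product itself is great.",
    "Great product, fast delivery, unhelpful support.",
]

# A tiny stop-word list; production systems use much larger ones.
stop_words = {"the", "and", "was", "is", "to", "but", "a", "itself"}

def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphabetic tokens, dropping stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]

# Structured result: term frequencies across all responses.
term_counts = Counter(token for r in responses for token in tokenize(r))
print(term_counts.most_common(5))
```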
 
To review, see Introduction to Text Mining.
 

3d. Evaluate data quality based on source reliability, accuracy, timeliness, and application to the requirement 

  • What factors constitute data quality?
  • How can data quality be evaluated?

Data is obtained from a wide variety of sources and is widely diverse in terms of reliability, accuracy, timeliness, and appropriateness to the application.
 
Quantitative data is information that can be tabulated and measured; it is expressed in numbers and is clearly defined. For example, researchers can count the number of specific responses to a multiple-choice or yes/no question. Qualitative data is descriptive and can tell researchers how respondents feel about a particular product or service and what influences their purchase decisions.
 
Qualitative data are measures of 'types' and may be represented by a name, symbol, or number code. Qualitative data are data about categorical variables (what type or name). Quantitative data are measures of values or counts and are expressed as numbers. Quantitative data are about numeric variables (how many, how much, or how often).
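This distinction maps directly onto how analysis tools treat columns. The short, hedged example below (with invented survey columns) separates numeric (quantitative) variables from categorical (qualitative) ones so each can be summarized appropriately.

```python
import pandas as pd

# Hypothetical survey results mixing quantitative and qualitative variables.
survey = pd.DataFrame({
    "age":             [34, 45, 29, 52],           # quantitative: how many / how much
    "spend_per_month": [120.5, 80.0, 200.0, 60.0],
    "favorite_brand":  ["A", "B", "A", "C"],        # qualitative: what type / name
    "would_recommend": ["yes", "no", "yes", "yes"],
})

quantitative = survey.select_dtypes(include="number")
qualitative = survey.select_dtypes(exclude="number")

print(quantitative.describe())   # means, counts, ranges
print(qualitative.describe())    # counts of categories and most frequent values
```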
 
There are some common issues when dealing with Big Data. Two critical ones are data quality and data variety (such as multiple formats within a dataset). Deep learning techniques, such as dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with these issues because they cannot handle complex data within the framework of Big Data, which further supports the case for deep learning.
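Dimension reduction itself can be illustrated without deep learning. The sketch below uses a classical principal component analysis (via numpy's singular value decomposition) on made-up data to compress five correlated columns into two; this is the same basic idea that deep learning approaches, such as autoencoders, scale up to much larger and messier data. The data and dimensions here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 records with 5 correlated features built from 2 underlying factors.
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])
X = X - X.mean(axis=0)  # center the data before PCA

# Classical PCA via singular value decomposition.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
reduced = X @ Vt[:2].T  # project onto the top 2 principal components

explained = (S**2) / (S**2).sum()
print("shape after reduction:", reduced.shape)
print("variance explained by 2 components:", explained[:2].sum())
```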
 
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. The overall goal is to extract information from a data set and transform it into an understandable structure for further use. Data mining is just one step of the knowledge discovery process, although it is the core step. The steps that follow include pattern evaluation, which interprets mined patterns and relationships and is akin to your analytic process, and knowledge consolidation, which is similar to reporting your findings, although your reports should be more robust than a simple consolidation of knowledge if they are to respond responsibly to your requirements. Like analysis, KDD is an iterative process: if the patterns evaluated after the data mining step are not useful, the process can begin again from an earlier step. Use the learning recorded in your journal in a similar fashion, building on the most relevant pieces as your understanding grows to arrive at the most useful and relevant knowledge.
 
To review, see Big Data Analytics for Disparate Data.
 

3e. Identify methods for optimization, filtering, or "cleaning" data for standardization and effective comparison 

  • What are some ways we can optimize or filter data for standardization?

Raw data is usually not suitable for direct analysis. This is because the data might come from different sources in different formats. Therefore, data preparation is an essential task that transforms or prepares data into a form that's suitable for analysis.
 
The following are some of the more common methods of preparing data; a brief sketch of several of them follows the list:

  1. Aggregation – Multiple columns are reduced to fewer columns. Records are summarized
  2. Normalization – Data is scaled or shifted, perhaps to a range of 0-1
  3. Augmentation – Expand the dataset size without collecting more data. For example, in image data via cropping or rotating
  4. Formatting – Data is modified to a consistent form
  5. Imputation – Fill missing values using estimates from available data
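
As a hedged sketch of a few of these methods on invented values: imputation fills a missing value from the available data, normalization rescales a column to the 0-1 range, and aggregation summarizes records. The column names and data below are assumptions for illustration only.

```python
import pandas as pd

orders = pd.DataFrame({
    "store":  ["A", "A", "B", "B"],
    "amount": [10.0, None, 30.0, 50.0],
})

# Imputation: fill the missing amount with the mean of the available values.
orders["amount"] = orders["amount"].fillna(orders["amount"].mean())

# Normalization: rescale amounts to the 0-1 range.
lo, hi = orders["amount"].min(), orders["amount"].max()
orders["amount_scaled"] = (orders["amount"] - lo) / (hi - lo)

# Aggregation: summarize the records per store.
summary = orders.groupby("store", as_index=False)["amount"].sum()

print(orders)
print(summary)
```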

Data lineage includes the data's origin, what happens to it, and where it moves over time (essentially the whole journey of a piece of data). Tracing that journey is useful for tracking errors back to their root cause in the data process. Data lineage is a way of debugging Big Data pipelines, but the process is not simple; there are many challenges, such as scalability, fault tolerance, and anomaly detection. For each of the challenges listed, write your own definition.
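To make the idea concrete, the minimal sketch below (all names and counts are invented) records a simple lineage log alongside each pipeline step. This is the kind of trail that lets you trace a bad value back to the step that produced it.

```python
from datetime import datetime, timezone

lineage = []  # an append-only record of where the data came from and what was done to it

def log_step(step: str, source: str, detail: str) -> None:
    """Record one hop in the data's journey."""
    lineage.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "source": source,
        "detail": detail,
    })

# Hypothetical pipeline steps, each logged as it happens.
log_step("extract", "crm_export.csv", "loaded 10,000 raw customer rows")
log_step("clean", "crm_export.csv", "dropped 120 duplicate rows")
log_step("load", "warehouse.customers", "inserted 9,880 rows")

# Tracing an error: review the data's journey in order.
for entry in lineage:
    print(entry["timestamp"], entry["step"], "-", entry["detail"])
```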
 
Your data must be rigorous and contain a highly representative sample to achieve the most relevant, reliable, and reflective insights. It is pointless to collect data from only one subset of a large population when you wish to market to the whole.
 
The database administration process and the database administrator are responsible for the design and administration of data models and the data integrity constraints included in those models. Missing data elements are likely caused by poor data integrity controls and would thus represent a result of poor administration. Database administration is the function of managing and maintaining database management systems (DBMS) software. As a part of this, database administrators are responsible for the data modeling and design process and ensure that operational databases are designed to high professional standards.
 
To review, see Capturing Value from Big Data.
 

Unit 3 Vocabulary 

This vocabulary list includes the terms that you will need to know to successfully complete the final exam.

  • aggregation
  • augmentation
  • big data
  • big data analytics
  • business intelligence
  • business intelligence architecture
  • business knowledge
  • chief analytics officer
  • chief data officer
  • cognitive
  • data collection
  • data integration
  • data lineage
  • data manager
  • data mining
  • data quality
  • data scientist
  • data warehouse
  • formatting
  • imputation
  • knowledge discovery
  • natural language
  • normalization
  • text mining
  • transaction processing system
  • value
  • variety
  • velocity
  • veracity
  • volume