• Unit 3: Data Mining and Text Mining

    How do business intelligence teams get the information they need to support management teams' decisions? The human brain cannot process even a fraction of the vast amount of information available to it, so technology has evolved to let us access, organize, and filter massive datasets. What exactly are data and text mining? One academic definition describes the field as "a multi-disciplinary field based on information retrieval, data mining, machine learning, statistics, and computational linguistics". Essentially, data mining is the process of analyzing a large dataset to identify relevant patterns, while text mining is the process of analyzing unstructured text data and mapping it into a structured format to derive relevant insights. This unit looks at some common uses and techniques for data and text mining.

    Completing this unit should take you approximately 12 hours.

    • 3.1: Understanding Big Data

      In the most basic terms, big data refers to larger, more complex data sets, especially from new data sources. These data sets are so large that "traditional" processing software cannot manage them, yet they are valuable because they can be used to address problems that were previously unaddressable.

      • 3.1.1: What is Big Data?

        Data has intrinsic value, but nothing can be gleaned from it until the large volume of data arriving at high velocity from various sources has been preprocessed. Once that value is ascertained, the data must also hold veracity.

      • 3.1.2: Where Does Big Data Live?

        Consider how much data is produced and how it is used, and how much "space" is needed to store that data and make it available for processing.
    • 3.2: Data and Text Mining

      Text mining is analyzing unstructured text data and mapping it into a structured format to derive relevant insights. Data mining relies mostly on statistical techniques and algorithms, while text mining depends on statistical analysis and also draws on linguistic analysis techniques.
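
      As a minimal sketch of what "mapping unstructured text into a structured format" can look like in practice (the sample sentences below are invented for illustration and are not from the readings), the Python example turns free-text reviews into term-frequency records using only the standard library.

      ```python
      from collections import Counter
      import re

      # A few unstructured free-text documents (illustrative sample data).
      reviews = [
          "The delivery was fast and the packaging was excellent.",
          "Slow delivery, but the product quality is excellent.",
          "Excellent product and fast, friendly customer service.",
      ]

      def tokenize(text):
          """Lowercase the text and split it into alphabetic word tokens."""
          return re.findall(r"[a-z]+", text.lower())

      # Map each unstructured document to a structured record:
      # a bag-of-words dictionary of term -> frequency.
      structured = [Counter(tokenize(doc)) for doc in reviews]

      # Aggregate term frequencies across the collection to surface the
      # most common terms -- a first, purely statistical mining step.
      totals = Counter()
      for counts in structured:
          totals.update(counts)

      for term, frequency in totals.most_common(5):
          print(f"{term}: {frequency}")
      ```
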
      • 3.2.1: Data Mining Techniques

        Data mining is a process that is automated in various ways to allow analysts to exploit large datasets. The data's initial comparability and "cleanliness" will determine how complex the process needs to be. The process will vary with the type, level of existing structure, size, and complexity of your datasets.
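
        As one illustration of an automated mining technique, the sketch below applies k-means clustering to a small, invented customer dataset (assuming the scikit-learn library is installed; the attributes and the choice of three clusters are hypothetical).

        ```python
        import numpy as np
        from sklearn.cluster import KMeans

        # Invented dataset: [annual_spend, visits_per_month] for each customer.
        customers = np.array([
            [200, 1], [220, 2], [250, 1],    # low spend, infrequent visits
            [900, 8], [950, 9], [1000, 7],   # high spend, frequent visits
            [500, 4], [520, 5],              # a middle group
        ])

        # k-means groups similar rows together; k (here 3) is an analyst's choice.
        model = KMeans(n_clusters=3, n_init=10, random_state=0)
        labels = model.fit_predict(customers)

        for point, label in zip(customers, labels):
            print(f"customer {point.tolist()} -> cluster {label}")
        ```
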
      • 3.2.2: Text Mining and the Complications of Language

    • 3.3: Evaluating Source Data

      Using credible sources gives your research credibility: high-quality resources are more likely to translate into better results, while poor-quality sources are likely to affect your results adversely. It is always best to remember these universally accepted criteria when sourcing: accuracy, authority, objectivity, currency, and coverage. Using poor-quality data that yields not-so-valuable findings is commonly called "garbage in, garbage out".
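
      One simple way to put these five criteria to work is to rate each source against them and compare the totals. The sketch below is only illustrative: the sources, ratings, and 1-to-5 scale are hypothetical.

      ```python
      # Rate each source from 1 (poor) to 5 (excellent) on the five criteria.
      # The sources and ratings below are hypothetical examples.
      CRITERIA = ("accuracy", "authority", "objectivity", "currency", "coverage")

      sources = {
          "Government statistics portal": {
              "accuracy": 5, "authority": 5, "objectivity": 4, "currency": 4, "coverage": 3,
          },
          "Anonymous blog post": {
              "accuracy": 2, "authority": 1, "objectivity": 2, "currency": 5, "coverage": 2,
          },
      }

      for name, ratings in sources.items():
          average = sum(ratings[c] for c in CRITERIA) / len(CRITERIA)
          print(f"{name}: average score {average:.1f} out of 5")
      ```
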
      • 3.3.1: Identifying Data Sources

        Where your data originated is vital. What data was entered, where, and how it was converted into a machine-readable format are among the most hotly discussed aspects of error tracing. The comprehensive article below provides a deep dive into the issues, such as providing an audit trail. Knowing where your data originated lets you confirm which topics your records relate to.
      • 3.3.2: Source Evaluation Trust Matrix

        These articles describe two types of trust evaluation models for specific processes. To standardize how sources are validated and evaluated in your organization, you should rely on a trust matrix that is already in widespread use or develop one if none exists, so that every team member who handles data understands how to judge whether to trust it. Your matrix may be similar to these examples or quite different, depending on your field, the source requirements of your discipline, and your organization.
    • 3.4: Data Optimization

      Traditional techniques can no longer handle complex optimization problems, especially as datasets get larger and more disparate. The research world is moving toward reducing the computational resources required in various ways, including through artificial intelligence (AI), which essentially teaches machines (computers) to "think" like humans. While the efficiency benefits of AI are obvious, its development raises numerous ethical and soundness issues that will be debated as new technologies are created, tested, and deployed for various purposes in industry and even in consumer products. Do you want your freezer to decide for you how much and what kind of ice to make, for instance? Maybe you do. Others may find this intrusive and "creepy".

      • 3.4.1: Preparing Data

        Now that you have all that data, how do you make it useful? It must be cleaned and enriched before it can provide relevant insights. These articles highlight how that can be achieved through confirmatory and exploratory approaches.
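
        As a minimal sketch of the cleaning step (assuming the pandas library; the column names and values are invented), the code below removes duplicate rows, drops records missing key fields, and coerces text columns to proper numeric and date types before analysis.

        ```python
        import pandas as pd

        # Hypothetical raw export with the usual problems: duplicates,
        # missing values, and numbers/dates stored as text.
        raw = pd.DataFrame({
            "order_id": [1, 1, 2, 3, 4],
            "amount": ["10.50", "10.50", "7.25", None, "3.00"],
            "order_date": ["2024-01-05", "2024-01-05", "2024-01-06", "2024-01-07", "not a date"],
        })

        clean = (
            raw.drop_duplicates()              # remove exact duplicate rows
               .dropna(subset=["amount"])      # drop rows missing the amount
               .assign(
                   amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
                   order_date=lambda df: pd.to_datetime(df["order_date"], errors="coerce"),
               )
               .dropna()                       # drop rows that failed conversion
        )

        print(clean)
        ```
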
      • 3.4.2: Standardization

        This page explains the concept of data lineage and its utility in tracing errors back to their root cause in the data process. Data lineage is a way of debugging Big Data pipelines, but the process is not simple. Many challenges exist, such as scalability, fault tolerance, anomaly detection, and more. For each of the challenges listed, write your own definition.
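
        As a small illustration of the lineage idea (the pipeline steps, field names, and file name here are hypothetical), the sketch below appends a lineage entry to a record at each processing step so that an error can later be traced back to the step and source that introduced it.

        ```python
        from datetime import datetime, timezone

        def with_lineage(record, step, source):
            """Return a copy of the record with one more lineage entry appended."""
            entry = {
                "step": step,
                "source": source,
                "at": datetime.now(timezone.utc).isoformat(),
            }
            return {**record, "_lineage": record.get("_lineage", []) + [entry]}

        # A record passing through a hypothetical two-step pipeline.
        record = {"customer_id": 42, "country": "de"}
        record = with_lineage(record, step="ingest", source="crm_export.csv")
        record["country"] = record["country"].upper()
        record = with_lineage(record, step="standardize_country", source="pipeline")

        # If "country" later looks wrong, the trail shows where it was touched.
        for entry in record["_lineage"]:
            print(entry["step"], "from", entry["source"], "at", entry["at"])
        ```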

      • 3.4.3: Combining Data from Different Sources

        Your data must be rigorous and drawn from a highly representative sample to yield the most relevant, reliable, and reflective insights. Collecting data from only one subset of a large population is pointless when you wish to market to the whole.
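
        As a minimal sketch of combining records from two systems (assuming pandas; the schemas and values are invented), the code below reconciles a differently named column, tags each row with its source, and removes duplicate customers.

        ```python
        import pandas as pd

        # Two hypothetical source systems with slightly different schemas.
        web_signups = pd.DataFrame({
            "email": ["a@example.com", "b@example.com"],
            "age": [34, 28],
        })
        store_cards = pd.DataFrame({
            "email": ["b@example.com", "c@example.com"],
            "customer_age": [29, 51],
        })

        # Reconcile the schema difference, then combine and record provenance.
        store_cards = store_cards.rename(columns={"customer_age": "age"})
        combined = pd.concat(
            [web_signups.assign(source="web"), store_cards.assign(source="store")],
            ignore_index=True,
        )

        # Keep one row per customer, preferring the first source encountered.
        combined = combined.drop_duplicates(subset="email", keep="first")
        print(combined)
        ```
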
    • Study Guide: Unit 3

      We recommend reviewing this Study Guide before taking the Unit 3 Assessment.

    • Unit 3 Assessment

      • Receive a grade