Data Preparation by Developers

This article provides a concise summary of the data preparation process (gather → discover → cleanse → transform → enrich → store). It touches upon tasks involved in preparing data (aggregation, formatting, normalization, and labelling), the concept of data quality (as a measure of the success of the preparation pipeline), and challenges you may run into when preparing data (diversity of data sources, time required, lack of confidence in quality). What are the most common languages/libraries used for data preparation? What can be used for fast in-memory data processing in a distributed architecture?

Data Preparation

Data preparation enables data analytics.

Raw data is usually not suitable for direct analysis. This is because the data might come from different sources in different formats. Moreover, real-world data is not clean. Some data points might be missing. Some others might be out of range. There could be duplicates. Data preparation is therefore an essential task that transforms or prepares data into a form that's suitable for analysis.

Data preparation assumes that data has already been collected. However, some practitioners consider data collection and data ingestion to be part of data preparation. Within data preparation, it's common to identify sub-stages such as data pre-processing, data wrangling, and data transformation.

Deriving useful insights from data via analytics is the final goal in today's data-driven world. Data preparation is an important prerequisite: poorly prepared data makes analytics more difficult and less effective.


Discussion


What's the typical pipeline for data preparation?

Typical pipeline for data preparation.

The first step of a data preparation pipeline is to gather data from various sources and locations. Before any processing is done, we wish to discover what the data is about. At this stage, we understand the data within the context of business goals. Visualization of the data is also helpful here. The next stage is to cleanse the data of missing values and invalid values. We also reformat data to standard forms. Next we transform the data for a specific outcome or audience. We can enrich data by merging different datasets to enable richer insights. Finally, we store the data or directly send it out for analytics.
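As a minimal sketch, here's what such a pipeline might look like in pandas. The file names, column names and merge key below are hypothetical placeholders, not part of any standard:

```python
import pandas as pd

# Gather: load data from two hypothetical sources.
orders = pd.read_csv("orders.csv")            # transactional data
customers = pd.read_json("customers.json")    # CRM export

# Discover: inspect structure, types and summary statistics.
print(orders.info())
print(orders.describe())

# Cleanse: drop duplicates, fill missing amounts, standardize dates.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(orders["amount"].median())
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")

# Transform: derive a column suited to the intended analysis.
orders["order_month"] = orders["order_date"].dt.to_period("M")

# Enrich: merge with customer data to enable richer insights.
enriched = orders.merge(customers, on="customer_id", how="left")

# Store: persist the prepared data (needs pyarrow or fastparquet).
enriched.to_parquet("prepared_orders.parquet")
```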

In the context of machine learning, data pre-processing converts raw data into clean data. This involves iterative cleaning, integration, transformation and reduction of data. When an ML model is being prototyped, data scientists often go back and convert the data into a more useful form. This is called data wrangling or data munging. It involves filtering, grouping, sorting, aggregating or decomposing data as required for modelling.
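A small sketch of these wrangling operations in pandas, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "product": ["A", "A", "B", "B"],
    "sales": [100.0, 80.0, None, 120.0],
})

# Filtering: keep only rows usable for modelling.
df = df[df["sales"].notna()]

# Grouping and aggregating: summarize sales per region.
summary = df.groupby("region", as_index=False)["sales"].sum()

# Sorting: order regions by total sales.
summary = summary.sort_values("sales", ascending=False)
print(summary)
```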


What are some common tasks involved in data preparation?

Some data preparation tasks to solve specific problems.

Data preparation involves one or more of the following tasks (a few of them are illustrated in the code sketch after this list):

  • Aggregation: Multiple columns are reduced to fewer columns. Records are summarized.
  • Anonymization: Sensitive values are removed for the sake of privacy.
  • Augmentation: Expand the dataset size without collecting more data. For example, image data is augmented via cropping or rotating.
  • Blending: Combine and link related data from various sources. For example, combine an employee's HR data with payroll data.
  • Decomposing: Decompose a data column that has sub-fields. For example, "6 ounces butter" is decomposed into three columns representing value, unit and ingredient.
  • Deletion: Duplicates and outliers are removed. Exploratory Data Analysis (EDA) may be used to identify outliers.
  • Formatting: Data is modified to a consistent form. For example, 2019-Jul-01, 2019-07-1, and 1/7/19 are changed to a single form, such as 2019-07-01.
  • Imputation: Fill missing values using estimates from available data.
  • Labelling: Data is labelled for supervised machine learning.
  • Normalization: Data is scaled or shifted, perhaps to a range of 0-1.
  • Sampling: For a quick analysis, select a small representative sample.
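The following sketch illustrates anonymization, formatting, imputation, normalization and sampling on a toy dataset. The data and column names are invented for the example, and the format="mixed" argument to pd.to_datetime requires pandas 2.0 or later:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "joined": ["2019-Jul-01", "2019-07-1", "1/7/19"],
    "score": [4.0, None, 9.0],
})

# Anonymization: drop the sensitive column.
df = df.drop(columns=["name"])

# Formatting: parse mixed date strings into a single ISO form.
# (format="mixed" needs pandas >= 2.0; dayfirst resolves "1/7/19".)
df["joined"] = pd.to_datetime(df["joined"], format="mixed", dayfirst=True)

# Imputation: fill the missing score with the column mean.
df["score"] = df["score"].fillna(df["score"].mean())

# Normalization: scale scores to the range 0-1.
df["score"] = (df["score"] - df["score"].min()) / (df["score"].max() - df["score"].min())

# Sampling: select a small representative sample (here, 2 of 3 rows).
sample = df.sample(n=2, random_state=42)
print(df, sample, sep="\n")
```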


What attributes are important in the context of data preparation?

Results of a survey of 384 respondents about data preparation attributes

It's important to measure the data preparation pipeline and assess how well it delivers data for analytics. To start with, data quality is measured by its accuracy, completeness, consistency, timeliness, believability, added value, interpretability, and accessibility. The pipeline must be able to validate data, spot problems, and provide tools to understand why they occur.
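Some of these attributes can be quantified with simple checks. Below is one possible sketch; the columns and the valid age range are assumptions made for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 130, 42],
    "country": ["DE", "DE", None, "FR"],
})

# Completeness: share of non-missing cells across the dataset.
completeness = df.notna().mean().mean()

# Accuracy proxy: share of values inside a plausible range.
valid_age = df["age"].between(0, 120).mean()

print(f"completeness={completeness:.2f}, valid ages={valid_age:.2f}")
```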

With the growing demand for near real-time analytics, we expect frequent data refreshing. Ideally, data quality is maintained even when refresh rate is high. In practice, there may be a trade-off.

Data must be easy to access even with growing data variety and volume. Data lakes and Hadoop are enablers in this regard.

Data must also conform to data models, domain models and database schemas. Data preparation must check for conformance.

Data preparation must ensure consistency across various datasets. For example, variations due to spelling or abbreviations must be handled. One sports dataset may use the term 'soccer' while another may use the term 'football'.
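One simple way to handle such variations is to map every variant to a canonical term before datasets are merged. A sketch with made-up data:

```python
import pandas as pd

# Hypothetical datasets that name the same sport differently.
a = pd.Series(["soccer", "tennis", "soccer"])
b = pd.Series(["Football ", "tennis", "football"])

# Normalize spelling and case, then map variants to one canonical term.
canonical = {"football": "soccer"}
b = b.str.strip().str.lower().replace(canonical)
assert a.equals(b)
```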

Unlike ETL systems, a data preparation pipeline must be more flexible to support ad hoc processing.


What are some challenges involved in preparing data?

Results of a survey of 311 respondents about data preparation challenges

Common challenges faced by businesses include the diversity of data sources, siloed or proprietary tools, tedious processes, and regulatory compliance. Incompatible data formats, messy data and unbalanced data are further challenges.

When manual processes are used, businesses spend more time preparing data than analyzing it. Therefore, organizations must invest in tools and automation. Indeed, data scientists should get involved with teams tasked with data preparation.

It's been said that "bad data is bad for business". Organizations are unsure of their data's quality and hence lack confidence in using it for decision making. This can be solved by investing early in data collection and preparation. Profile the data landscape. Improve data quality at source.

Some applications (such as fraud detection or industrial IoT apps) may require data preparation in real time. With large data volumes, collecting, preparing and storing data at scale is a challenge. In production, the preparation pipeline should be repeatable, handle errors gracefully, and be tuned for performance. It should work for initial data, incremental data and streamed data.


Could you share some best practices for data preparation?

Check data types and formats. Check if data is accurate. Graph the data to get a sense of the distribution and outliers. If these are not as expected, question your assumptions. Label and annotate the graphs. Back up the data. Document or record the steps so that they can be repeated when new data comes in.
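In pandas, these practices might look like the sketch below. The input file and the numeric 'value' column are hypothetical:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")   # hypothetical input

# Check data types and formats.
print(df.dtypes)

# Graph the data to get a sense of distribution and outliers.
ax = df["value"].plot.hist(bins=30)    # assumes a numeric 'value' column
ax.set_title("Distribution of value")  # label and annotate the graph
ax.set_xlabel("value")
plt.savefig("value_hist.png")

# Back up the data before modifying it.
df.to_csv("data_backup.csv", index=False)
```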

Data professionals shouldn't rely too much on IT departments. Adopt data systems and tools that are user friendly.

Data literacy, which is the ability to read, analyze and argue from data, is an essential skill today. Engineers, data scientists and business users must talk in a common language. Prepare data with a good understanding of the context and the business goals. 

Profile the data first. Start with a small sample. Iteratively, try different cleaning strategies and discuss the results with business stakeholders. Keep in mind that data may change in the future. Data preparation is therefore not a do-once-and-forget task.
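For instance, a cleaning strategy can be trialled on a small sample before it's applied to the full dataset. The file and column names here are placeholders:

```python
import pandas as pd

df = pd.read_csv("big_dataset.csv")   # hypothetical large input

# Start with a small representative sample for quick iteration.
sample = df.sample(frac=0.01, random_state=0)

# Try a cleaning strategy on the sample and inspect its effect.
before = sample["price"].isna().mean()
cleaned = sample.dropna(subset=["price"])
print(f"missing before: {before:.1%}, rows kept: {len(cleaned)}/{len(sample)}")
```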


Which programming languages are well suited for data preparation tasks?

Querying, filtering, and sampling in R and Pandas

Two well-known libraries for data preparation are pandas (Python) and dplyr (R). Apache Spark, more a framework than a language, is also suited for data preparation. Apache Spark enables fast in-memory data processing in a distributed architecture.

The main data structure in pandas is the DataFrame. The method df.describe() is a quick way to summarize the data. A more detailed profile can be obtained using the pandas-profiling package. Missing values are represented as NaN or "not a number". Methods df.fillna() and df.dropna() help in dealing with NaN values. A couple of useful tutorials using pandas are by Jaiswal and by Osei.
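A brief illustration of these methods on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

# Quick statistical summary of numeric columns.
print(df.describe())

# Deal with NaN: fill numeric gaps, drop rows still incomplete.
df["x"] = df["x"].fillna(df["x"].mean())
df = df.dropna(subset=["y"])
print(df)
```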

In R, the equivalent data structure is data.frame. The structure of the data can be inspected using str(). Missing data is marked as NA. Packages dplyr and tidyr can be used for data preparation.

Spark provides APIs in Python, R, Java, Scala and SQL. When Spark is used, it's possible to convert between Spark DataFrames and Pandas DataFrames using Apache Arrow.
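A sketch of this interoperability in PySpark, assuming Spark 3.x (where the Arrow setting is named spark.sql.execution.arrow.pyspark.enabled):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("prep").getOrCreate()

# Enable Arrow-based conversion between Spark and pandas DataFrames.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

# pandas -> Spark: distribute the data for in-memory processing at scale.
sdf = spark.createDataFrame(pdf)
sdf = sdf.filter(sdf.value > 10.0)

# Spark -> pandas: collect results back for local analysis.
result = sdf.toPandas()
print(result)
```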


Could you list some tools that aid data preparation with minimal coding?

Market map of self-service data preparation tools

Data analysis can't happen without data preparation. To enable this with minimal or no coding, it's best to use a self-service data preparation tool. This helps data analysts, business owners and data scientists. There are many of these in the market today: Altair Monarch, Alteryx, ClearStory, Datameer, DataWatch, Dialogue, Improvado, LavaStorm, Microstrategy, Oracle, Paxata, Qlik, Quest, SAP, SAS, Tableau Prep, TIBCO, Trifacta, and Zaloni.

Many tools can combine data from different sources. They include visualization and exploratory data analysis. For data preparation, they can clean, augment, transform, or anonymize data. Some self-service tools include analytics and cataloguing. Good tools manage metadata and offer a search feature. They can track data sources and data transformations.

While developing data preparation pipelines, tools should support real-time visualizations. This facilitates quick debugging of the processing logic. However, it's been noted that many tools are unable to handle big data or are not fast enough on complex queries.

When choosing a suitable tool, some aspects to consider are features, pricing, performance, usability, collaboration, licensing model, vendor viability, customer support, enterprise integration, security and ecosystem.


Milestones


2000

This decade sees the growing use of data in organizations. However, data is controlled and managed by IT departments. Data scientists work with IT staff. Data preparation involves coding and specialized expertise. Meanwhile, Business Intelligence (BI) tools show the benefits of data visualization and reporting. Users shift from spreadsheets to BI tools. This leads to more and more requests for data access and analysis. IT is soon overwhelmed by the workload.


May
2005

In a report titled Competing on Analytics, researchers note that companies are competing based on data and analytics. Statistical analysis and predictive modelling have become important. However, the report makes no mention of data preparation.


2010

To cope with the shift towards data-centric operations, IT opens up data to other departments. While this brings speed, data also gets siloed and inconsistent across datasets. Spreadsheets are error prone. ETL tools are rigid and therefore not suited for rapid prototyping. In this context, self-service data preparation tools emerge. These require almost no coding and enable rapid iterations.


Sep
2018

A screenshot of Google Cloud Dataprep by Trifacta.

Google Cloud Platform announces the general availability of Google Cloud Dataprep, a cloud offering from Trifacta. This is a self-service data preparation tool that integrates well with other parts of Google Cloud such as BigQuery and Cloud Dataflow. Prepared data can then be analyzed in Google Data Studio.


2019

A study by the International Data Corporation (IDC) finds that 33% of time is spent on data preparation, 32% on analytics and only 13% on data science. Respondents note that "too much time is spent on data preparation". They also recognize that they need to automate data preparation to accelerate their analytics pipeline.


Source: gurumoorthyP, parvezi123, arvindpdmn and pavan.mt9, https://devopedia.org/data-preparation
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.
