Normalizing Data Formats

Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:

Standardizing Dates: Converting dates to a consistent format (e.g., YYYY-MM-DD).
Text Case Standardization: Ensuring that text data is consistently in uppercase or lowercase.
Unit Conversion: Converting measurements to a common unit (e.g., converting all weights to kilograms).

Case 1: Standardizing Dates

Let's use this dataset as an example:

Event Dates in Multiple Formats

Event ID	Event Date
1	12/01/2023
2	December 2, 2023
3	2023-12-03
4	12.04.2023
5	5th Dec 2023

Google Sheets and Microsoft Excel

Note that in both spreadsheet applications, there are instances where dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a dot . or comma ,) instead of a dash - or backslash /, or when they include articles like "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters, or by using the "Find and Replace" feature.

Python

import pandas as pd
from dateutil.parser import parse

# Sample data
data = {
    'Event ID': [1, 2, 3, 4, 5],
    'Event Date': ['01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}

df = pd.DataFrame(data)

print("Original Dataset")
print(df, '\n')

# Custom function to parse dates
def parse_date(date_str):
    try:
        return parse(date_str, dayfirst=False)
    except ValueError:
        return pd.NaT

# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)

# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')

print("Result")
print(df)

R

library(tidyverse)
library(lubridate)

# Sample data
data <- tibble(
  `Event ID` = 1:5,
  `Event Date` = c('01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")

# Convert 'Event Date' to Date format
data <- data %>%
  mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("dmy", "mdy", "ymd")))

# Standardize format to YYYY-MM-DD
data <- data %>%
  mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))

cat("Result")
print(data)

Result:

Normalized Event Dates (YYYY-MM-DD Format)

Event ID	Event Date
1	2023-12-01
2	2023-12-02
3	2023-12-03
4	2023-12-04
5	2023-12-05

Case 2: Text Case Standardization

Let's use this dataset as an example:

Product Inventory (Case Variants Preserved)

Product ID	Product Name
101	apple
102	Banana
103	ORANGE
104	baNana
105	Apple

Google Sheets and Microsoft Excel

Assuming Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications starting in Cell C3:

=LOWER(B2)

You can then drag it down to apply the same formula to all other rows (i.e., from B3 continuing through Bn) multiple columns.

Python

import pandas as pd

# Sample data
data = {
    'Product ID': [101, 102, 103, 104, 105],
    'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}

df = pd.DataFrame(data)

# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()

print(df)

R

library(tidyverse)

# Sample data
data <- tibble(
  `Product ID` = c(101, 102, 103, 104, 105),
  `Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)

# Convert 'Product Name' to lowercase
data <- data %>%
  mutate(`Product Name` = tolower(`Product Name`))

print(data)

Result:

Product List (with Duplicate Entries)

Product ID	Product Name
101	apple
102	banana
103	orange
104	banana
105	apple

Case 3: Unit Conversion

Let's use this dataset as an example:

Sample Weights with Varying Unit Formats

Sample ID	Weight
A1	70 kg
A2	154 lbs
A3	68 KG
A4	160 pound
A5	72 kilograms

Google Sheets and Microsoft Excel

Formulas:

1. Extracting numeric value from Weight (Column B) and placing it in Column C:

Google Sheets (using REGEXEXTRACT)

=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))

For both Google Sheets and Microsoft Excel:

=VALUE(LEFT(B2, FIND(" ", B2) - 1))

2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:

Google Sheets (using REGEXEXTRACT)

=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))

For both Google Sheets and Microsoft Excel:

=LOWER(TRIM(MID(B2, FIND(" ", B2) + 1, LEN(B2))))

3. Converting the weight to kg based on the original unit and placing it in Column E:

For both Google Sheets and Microsoft Excel:

=IF(OR(D2="kg", D2="kilogram", D2="kilograms"),C2, IF(OR(D2="lb", D2="lbs", D2="pound", D2="pounds"), CONVERT(C2, "lbm", "kg"), "Unknown Unit"))

Python

import pandas as pd

# Sample data
data = {
    'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}

df = pd.DataFrame(data)

# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)

# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series

# Conversion function
def convert_to_kg(row):
    if row['Unit'] in ['kg', 'kilograms']:
        return row['Numeric Value']
    elif row['Unit'] in ['lbs', 'pound']:
        return row['Numeric Value'] * 0.453592
    else:
        return None  # or handle unknown units

df['Weight in kg'] = df.apply(convert_to_kg, axis=1)

print(df[['Sample ID', 'Weight', 'Weight in kg']])

R

library(tidyverse)

# Sample data
data <- tibble(
  `Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
  Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)

# Extract numeric value and unit
data <- data %>%
  mutate(
    `Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
    Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
  )

# Conversion function
convert_to_kg <- function(value, unit) {
  if (unit %in% c('kg', 'kilograms')) {
    return(value)
  } else if (unit %in% c('lbs', 'pound')) {
    return(value * 0.453592)
  } else {
    return(NA)
  }
}

# Apply conversion
data <- data %>%
  rowwise() %>%
  mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
  ungroup()

print(data %>% select(`Sample ID`, Weight, `Weight in kg`))

Result:

Weight Normalization Table (All Values Converted to Kilograms)

Sample ID	Weight	Numeric Value	Unit	Weight_kg
A1	70 kg	70	kg	70
A2	154 lbs	154	lbs	69.85322498
A3	68 KG	68	kg	68
A4	160 pound	160	pound	72.5747792
A5	72 kilograms	72	kilograms	72