Normalizing Data Formats

Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:

  1. Standardizing Dates: Converting dates to a consistent format (e.g., YYYY-MM-DD).
  2. Text Case Standardization: Ensuring that text data is consistently in uppercase or lowercase.
  3. Unit Conversion: Converting measurements to a common unit (e.g., converting all weights to kilograms).

 

Case 1: Standardizing Dates

Let's use this dataset as an example:

Event Dates in Multiple Formats
Event ID Event Date
1 12/01/2023
2 December 2, 2023
3 2023-12-03
4 12.04.2023
5 5th Dec 2023

 

Google Sheets and Microsoft Excel

Google Sheets and Microsoft Excel

Google Sheets and Microsoft Excel

Note that in both spreadsheet applications, there are instances where dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a dot . or comma ,) instead of a dash - or backslash /, or when they include articles like "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters, or by using the "Find and Replace" feature.

 

Python
import pandas as pd
from dateutil.parser import parse

# Sample data
data = {
    'Event ID': [1, 2, 3, 4, 5],
    'Event Date': ['01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}

df = pd.DataFrame(data)

print("Original Dataset")
print(df, '\n')

# Custom function to parse dates
def parse_date(date_str):
    try:
        return parse(date_str, dayfirst=False)
    except ValueError:
        return pd.NaT

# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)

# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')

print("Result")
print(df)

 

R
library(tidyverse)
library(lubridate)

# Sample data
data <- tibble(
  `Event ID` = 1:5,
  `Event Date` = c('01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")

# Convert 'Event Date' to Date format
data <- data %>%
  mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("dmy", "mdy", "ymd")))

# Standardize format to YYYY-MM-DD
data <- data %>%
  mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))

cat("Result")
print(data)

 

Result:
Normalized Event Dates (YYYY-MM-DD Format)
Event ID Event Date
1 2023-12-01
2 2023-12-02
3 2023-12-03
4 2023-12-04
5 2023-12-05

 

Case 2: Text Case Standardization

Let's use this dataset as an example:

Product Inventory (Case Variants Preserved)
Product ID Product Name
101 apple
102 Banana
103 ORANGE
104 baNana
105 Apple

 

Google Sheets and Microsoft Excel

Assuming Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications starting in Cell C3:

=LOWER(B2)

You can then drag it down to apply the same formula to all other rows (i.e., from B3 continuing through Bn) multiple columns.

 

Python
import pandas as pd

# Sample data
data = {
    'Product ID': [101, 102, 103, 104, 105],
    'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}

df = pd.DataFrame(data)

# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()

print(df)

 

R
library(tidyverse)

# Sample data
data <- tibble(
  `Product ID` = c(101, 102, 103, 104, 105),
  `Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)

# Convert 'Product Name' to lowercase
data <- data %>%
  mutate(`Product Name` = tolower(`Product Name`))

print(data)

 

Result:
Product List (with Duplicate Entries)
Product ID Product Name
101 apple
102 banana
103 orange
104 banana
105 apple

Case 3: Unit Conversion

Let's use this dataset as an example:

Sample Weights with Varying Unit Formats
Sample ID Weight
A1 70 kg
A2 154 lbs
A3 68 KG
A4 160 pound
A5 72 kilograms

 

Google Sheets and Microsoft Excel

Google Sheets and Microsoft Excel

Formulas:

1. Extracting numeric value from Weight (Column B) and placing it in Column C:

Google Sheets (using REGEXEXTRACT)

=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))


For both Google Sheets and Microsoft Excel:

=VALUE(LEFT(B2, FIND(" ", B2) - 1))


2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:

Google Sheets (using REGEXEXTRACT)

=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))


For both Google Sheets and Microsoft Excel:

=LOWER(TRIM(MID(B2, FIND(" ", B2) + 1, LEN(B2))))


3. Converting the weight to kg based on the original unit and placing it in Column E:

For both Google Sheets and Microsoft Excel:

=IF(OR(D2="kg", D2="kilogram", D2="kilograms"),C2, IF(OR(D2="lb", D2="lbs", D2="pound", D2="pounds"), CONVERT(C2, "lbm", "kg"), "Unknown Unit"))

 

Python
import pandas as pd

# Sample data
data = {
    'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}

df = pd.DataFrame(data)

# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)

# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series

# Conversion function
def convert_to_kg(row):
    if row['Unit'] in ['kg', 'kilograms']:
        return row['Numeric Value']
    elif row['Unit'] in ['lbs', 'pound']:
        return row['Numeric Value'] * 0.453592
    else:
        return None  # or handle unknown units

df['Weight in kg'] = df.apply(convert_to_kg, axis=1)

print(df[['Sample ID', 'Weight', 'Weight in kg']])

 

R
library(tidyverse)

# Sample data
data <- tibble(
  `Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
  Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)

# Extract numeric value and unit
data <- data %>%
  mutate(
    `Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
    Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
  )

# Conversion function
convert_to_kg <- function(value, unit) {
  if (unit %in% c('kg', 'kilograms')) {
    return(value)
  } else if (unit %in% c('lbs', 'pound')) {
    return(value * 0.453592)
  } else {
    return(NA)
  }
}

# Apply conversion
data <- data %>%
  rowwise() %>%
  mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
  ungroup()

print(data %>% select(`Sample ID`, Weight, `Weight in kg`))

 

Result:
Weight Normalization Table (All Values Converted to Kilograms)
Sample ID Weight Numeric Value Unit Weight_kg
A1 70 kg 70 kg 70
A2 154 lbs 154 lbs 69.85322498
A3 68 KG 68 kg 68
A4 160 pound 160 pound 72.5747792
A5 72 kilograms 72 kilograms 72