Normalizing Data Formats
Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:
- Standardizing Dates: Converting dates to a consistent format (e.g., YYYY-MM-DD).
- Text Case Standardization: Ensuring that text data is consistently in uppercase or lowercase.
- Unit Conversion: Converting measurements to a common unit (e.g., converting all weights to kilograms).
Case 1: Standardizing Dates
Let's use this dataset as an example:
| Event ID | Event Date |
|---|---|
| 1 | 12/01/2023 |
| 2 | December 2, 2023 |
| 3 | 2023-12-03 |
| 4 | 12.04.2023 |
| 5 | 5th Dec 2023 |
Google Sheets and Microsoft Excel


Note that in both spreadsheet applications, there are instances where dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a dot . or comma ,) instead of a dash - or backslash /, or when they include articles like "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters, or by using the "Find and Replace" feature.
Python
import pandas as pd
from dateutil.parser import parse
# Sample data
data = {
'Event ID': [1, 2, 3, 4, 5],
'Event Date': ['01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Custom function to parse dates
def parse_date(date_str):
try:
return parse(date_str, dayfirst=False)
except ValueError:
return pd.NaT
# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)
# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')
print("Result")
print(df)
R
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
`Event ID` = 1:5,
`Event Date` = c('01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")
# Convert 'Event Date' to Date format
data <- data %>%
mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("dmy", "mdy", "ymd")))
# Standardize format to YYYY-MM-DD
data <- data %>%
mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))
cat("Result")
print(data)
Result:
| Event ID | Event Date |
|---|---|
| 1 | 2023-12-01 |
| 2 | 2023-12-02 |
| 3 | 2023-12-03 |
| 4 | 2023-12-04 |
| 5 | 2023-12-05 |
Case 2: Text Case Standardization
Let's use this dataset as an example:
| Product ID | Product Name |
|---|---|
| 101 | apple |
| 102 | Banana |
| 103 | ORANGE |
| 104 | baNana |
| 105 | Apple |
Google Sheets and Microsoft Excel
Assuming Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications starting in Cell C3:
=LOWER(B2)
You can then drag it down to apply the same formula to all other rows (i.e., from B3 continuing through Bn) multiple columns.
Python
import pandas as pd
# Sample data
data = {
'Product ID': [101, 102, 103, 104, 105],
'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}
df = pd.DataFrame(data)
# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()
print(df)
R
library(tidyverse)
# Sample data
data <- tibble(
`Product ID` = c(101, 102, 103, 104, 105),
`Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)
# Convert 'Product Name' to lowercase
data <- data %>%
mutate(`Product Name` = tolower(`Product Name`))
print(data)
Result:
| Product ID | Product Name |
|---|---|
| 101 | apple |
| 102 | banana |
| 103 | orange |
| 104 | banana |
| 105 | apple |
Case 3: Unit Conversion
Let's use this dataset as an example:
| Sample ID | Weight |
|---|---|
| A1 | 70 kg |
| A2 | 154 lbs |
| A3 | 68 KG |
| A4 | 160 pound |
| A5 | 72 kilograms |
Google Sheets and Microsoft Excel

Formulas:
1. Extracting numeric value from Weight (Column B) and placing it in Column C:
Google Sheets (using REGEXEXTRACT)
=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))
For both Google Sheets and Microsoft Excel:
=VALUE(LEFT(B2, FIND(" ", B2) - 1))
2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:
Google Sheets (using REGEXEXTRACT)
=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))
For both Google Sheets and Microsoft Excel:
=LOWER(TRIM(MID(B2, FIND(" ", B2) + 1, LEN(B2))))
3. Converting the weight to kg based on the original unit and placing it in Column E:
For both Google Sheets and Microsoft Excel:
=IF(OR(D2="kg", D2="kilogram", D2="kilograms"),C2, IF(OR(D2="lb", D2="lbs", D2="pound", D2="pounds"), CONVERT(C2, "lbm", "kg"), "Unknown Unit"))
Python
import pandas as pd
# Sample data
data = {
'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}
df = pd.DataFrame(data)
# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)
# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series
# Conversion function
def convert_to_kg(row):
if row['Unit'] in ['kg', 'kilograms']:
return row['Numeric Value']
elif row['Unit'] in ['lbs', 'pound']:
return row['Numeric Value'] * 0.453592
else:
return None # or handle unknown units
df['Weight in kg'] = df.apply(convert_to_kg, axis=1)
print(df[['Sample ID', 'Weight', 'Weight in kg']])
R
library(tidyverse)
# Sample data
data <- tibble(
`Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)
# Extract numeric value and unit
data <- data %>%
mutate(
`Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
)
# Conversion function
convert_to_kg <- function(value, unit) {
if (unit %in% c('kg', 'kilograms')) {
return(value)
} else if (unit %in% c('lbs', 'pound')) {
return(value * 0.453592)
} else {
return(NA)
}
}
# Apply conversion
data <- data %>%
rowwise() %>%
mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
ungroup()
print(data %>% select(`Sample ID`, Weight, `Weight in kg`))
Result:
| Sample ID | Weight | Numeric Value | Unit | Weight_kg |
|---|---|---|---|---|
| A1 | 70 kg | 70 | kg | 70 |
| A2 | 154 lbs | 154 | lbs | 69.85322498 |
| A3 | 68 KG | 68 | kg | 68 |
| A4 | 160 pound | 160 | pound | 72.5747792 |
| A5 | 72 kilograms | 72 | kilograms | 72 |