Feature Extraction
In practice, data rarely comes in the form of ready-to-use matrices. That's why every task begins with feature extraction. Sometimes, it can be enough to read the csv file and convert it into numpy.array, but this is a rare exception. Let's look at some of the popular types of data from which features can be extracted.
Texts
Text is a type of data that can come in different formats; there are so many text processing methods that they cannot fit in a single article. Nevertheless, we will review the most popular ones.
Before working with text, one must tokenize it. Tokenization means splitting the text into units (hence, tokens). In the simplest case, tokens are just the words. But splitting by word can lose some of the meaning: "Santa Barbara" is one token, not two, and "rock'n'roll" should not be split into several tokens either. There are ready-to-use tokenizers that take into account the peculiarities of the language, but they make mistakes as well, especially when you work with specific sources of text (newspapers, slang, misspellings, typos).
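As a rough illustration of the point above, here is a minimal tokenizer sketch in plain Python. The regular expression and the `PHRASES` set are assumptions made for this example, not part of any library; a real project would more likely rely on a ready-made tokenizer.

```python
import re

# Hypothetical list of multi-word expressions to keep as single tokens.
PHRASES = {("santa", "barbara")}


def tokenize(text, phrases=PHRASES):
    """Lowercase, split into word tokens, then merge known two-word phrases."""
    # Letters with optional inner apostrophes, so "rock'n'roll" stays whole.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in phrases:
            tokens.append(" ".join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


print(tokenize("I love rock'n'roll in Santa Barbara"))
# ['i', 'love', "rock'n'roll", 'in', 'santa barbara']
```

A plain `str.split()` would have produced two separate tokens for "Santa Barbara"; the phrase list is the simplest possible way to prevent that.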
After tokenization, you will normalize the data. For text, this is about stemming and/or lemmatization; these are similar processes used to process different forms of a word. One can read about the difference between them here.
So, now that we have turned the document into a sequence of words, we can represent it with vectors. The easiest approach is called Bag of Words: we create a vector with the length of the vocabulary, compute the number of occurrences of each word in the text, and place that number of occurrences in the appropriate position in the vector. The process described looks simpler in code:
import numpy as np

texts = ["i have a cat", "you have a dog", "you and i have a cat and a dog"]

# Assign an index to every distinct word in the corpus
vocabulary = list(
    enumerate(set([word for sentence in texts for word in sentence.split()]))
)
print("Vocabulary:", vocabulary)


def vectorize(text):
    # Count how many times each vocabulary word occurs in the tokenized text
    vector = np.zeros(len(vocabulary))
    for i, word in vocabulary:
        num = 0
        for w in text:
            if w == word:
                num += 1
        if num:
            vector[i] = num
    return vector


print("Vectors:")
for sentence in texts:
    print(vectorize(sentence.split()))
Vocabulary: [(0, 'i'), (1, 'and'), (2, 'you'), (3, 'a'), (4, 'have'), (5, 'dog'), (6, 'cat')]
Vectors:
[1. 0. 0. 1. 1. 0. 1.]
[0. 0. 1. 1. 1. 1. 0.]
[1. 2. 1. 2. 1. 1. 1.]
Here is an illustration of the process:

This is an extremely naive implementation. In practice, you need to consider stop words, the maximum length of the vocabulary, more efficient data structures (usually text data is converted to a sparse vector), etc.
When using algorithms like Bag of Words, we lose the order of the words in the text, which means that the texts "i have no cows" and "no, i have cows" will appear identical after vectorization when, in fact, they have opposite meanings. To avoid this problem, we can revisit our tokenization step and use N-grams (sequences of N consecutive tokens) instead.
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 1))
vect.fit_transform(["no i have cows", "i have no cows"]).toarray()
array([[1, 1, 1],
[1, 1, 1]])
vect.vocabulary_
{'no': 2, 'have': 1, 'cows': 0}
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit_transform(["no i have cows", "i have no cows"]).toarray()
array([[1, 1, 1, 0, 1, 0, 1],
[1, 1, 0, 1, 1, 1, 0]])
vect.vocabulary_
{'no': 4,
'have': 1,
'cows': 0,
'no have': 6,
'have cows': 2,
'have no': 3,
'no cows': 5}
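To make the mechanics explicit, here is a small pure-Python sketch of word N-gram generation; the helper name `word_ngrams` is made up for this example. Note one difference from the output above: CountVectorizer's default token pattern only keeps tokens of two or more characters, so it drops "i" (hence the bigram 'no have'), whereas this sketch keeps every token.

```python
def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of n_min..n_max consecutive tokens, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


a = word_ngrams("no i have cows".split())
b = word_ngrams("i have no cows".split())
print(sorted(set(a) - set(b)))  # ['have cows', 'no i']
print(sorted(set(b) - set(a)))  # ['have no', 'no cows']
```

The unigrams of the two texts coincide; only the bigrams tell them apart.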
Also note that one does not have to use only words. In some cases, it is possible to generate N-grams of characters. This approach would be able to account for similarity of related words or handle typos.
from scipy.spatial.distance import euclidean
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(3, 3), analyzer="char_wb")
n1, n2, n3, n4 = vect.fit_transform(
["andersen", "petersen", "petrov", "smith"]
).toarray()
euclidean(n1, n2), euclidean(n2, n3), euclidean(n3, n4)
(np.float64(2.8284271247461903), np.float64(3.1622776601683795), np.float64(3.3166247903554))
Building on the Bag of Words idea: words that are rarely found in the corpus (in all the documents of this dataset) but are present in this particular document might be more important. It then makes sense to increase the weight of more domain-specific words to separate them from common words. This approach is called TF-IDF (term frequency-inverse document frequency). It cannot be fully described in a few lines, so you should look into the details in references such as this wiki. The default option is as follows:
\(\large idf(t,D) = \log\frac{\mid D\mid}{df(t,D)+1}\)
\(\large tfidf(t,d,D) = tf(t,d) \times idf(t,D)\)
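The two formulas can be sketched directly in plain Python. This is a minimal illustration that assumes tf is the raw count of the term in the document and uses whitespace tokenization; libraries apply their own variants (scikit-learn, for example, uses a differently smoothed idf by default), so the numbers will not match theirs.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with tf = raw count and idf = log(|D| / (df + 1))."""
    tokenized = [doc.split() for doc in docs]
    # df: number of documents that contain each term at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / (df[t] + 1)) for t in tf})
    return weights


texts = ["i have a cat", "you have a dog", "you and i have a cat and a dog"]
for doc_weights in tfidf(texts):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

With this particular smoothing, a term that occurs in every document ("have", "a") gets a slightly negative weight, while the rare word "and" gets the largest one.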
Ideas similar to Bag of Words can also be found outside of text problems, e.g., bag of sites in the Catch Me If You Can competition, bag of apps, bag of events, etc.

Using these algorithms, it is possible to obtain a working solution for a simple problem, which can serve as a baseline. However, for those who do not like the classics, there are new approaches. The most popular method in the new wave is Word2Vec, but there are a few alternatives as well (GloVe, Fasttext, etc.).
Word2Vec is a special case of the word embedding algorithms. Using Word2Vec and similar models, we can not only vectorize words in a high-dimensional space (typically a few hundred dimensions) but also compare their semantic similarity. This is a classic example of operations that can be performed on vectorized concepts: king - man + woman = queen.
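The arithmetic behind such analogies can be shown on a toy example. The two-dimensional vectors below are hand-made for illustration (one axis for "royalty", one for "gender") and are not the output of any trained model; real embeddings have hundreds of dimensions and only approximately satisfy such relations.

```python
import math

# Hand-made 2-D "embeddings": (royalty, gender). Purely illustrative.
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.5, 1.0),
    "princess": (0.5, -1.0),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```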

It is worth noting that this model does not comprehend the meaning of the words but simply tries to position the vectors such that words used in common context are close to each other. If this is not taken into account, a lot of fun examples will come up.
Such models need to be trained on very large datasets in order for the vector coordinates to capture the semantics. A pretrained model for your own tasks can be downloaded here.
Similar methods are applied in other areas such as bioinformatics. An unexpected application is food2vec. You can probably think of a few other fresh ideas; the concept is universal enough.
Images
Working with images is easier and harder at the same time. It is easier because it is possible to just use one of the popular pretrained networks without much thinking but harder because, if you need to dig into the details, you may end up going really deep. Let's start from the beginning.
At a time when GPUs were weaker and the "renaissance of neural networks" had not happened yet, feature generation from images was its own complex field. One had to work at a low level, detecting corners, borders of regions, color distribution statistics, and so on. Experienced specialists in computer vision could draw a lot of parallels between older approaches and neural networks; in particular, convolutional layers in today's networks are similar to Haar cascades. If you are interested in reading more, here are a couple of links to some interesting libraries: skimage and SimpleCV.
Often for problems associated with images, a convolutional neural network is used. You do not have to come up with the architecture and train a network from scratch. Instead, download a pretrained state-of-the-art network with the weights from public sources. Data scientists often do so-called fine-tuning to adapt these networks to their needs by "detaching" the last fully connected layers of the network, adding new layers chosen for a specific task, and then training the network on new data. If your task is to just vectorize the image (for example, to use some non-network classifier), you only need to remove the last layers and use the output from the previous layers:
# # Install Keras and tensorflow (https://keras.io/)
# from keras.applications.resnet50 import ResNet50, preprocess_input
# from keras.preprocessing import image
# from scipy.misc import face
# import numpy as np
# resnet_settings = {'include_top': False, 'weights': 'imagenet'}
# resnet = ResNet50(**resnet_settings)
# # What a cute raccoon!
# img = image.array_to_img(face())
# img
# # In real life, you may need to pay more attention to resizing
# img = img.resize((224, 224))
# x = image.img_to_array(img)
# x = np.expand_dims(x, axis=0)
# x = preprocess_input(x)
# # Need an extra dimension because the model is designed to work with an array
# # of images - i.e. a tensor shaped (batch_size, height, width, n_channels)
# features = resnet.predict(x)

Here's a classifier trained on one dataset and adapted for a different one by "detaching" the last layer and adding a new one instead.
Nevertheless, we should not focus too much on neural network techniques. Features generated by hand are still very useful: for example, for predicting the popularity of a rental listing, we can assume that bright apartments attract more attention and create a feature such as "the average value of the pixel". You can find some inspiring examples in the documentation of relevant libraries.
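A hand-made brightness feature of this kind could look roughly as follows. This is a sketch that assumes the image is already loaded as an RGB uint8 NumPy array; the function name and the synthetic "photos" are made up for the example.

```python
import numpy as np


def brightness_features(img):
    """img: uint8 array of shape (height, width, 3) in RGB order."""
    pixels = img.astype(np.float64) / 255.0
    # Standard Rec. 601 luma weights for converting RGB to grayscale
    gray = pixels @ np.array([0.299, 0.587, 0.114])
    return {
        "mean_brightness": float(gray.mean()),
        "brightness_std": float(gray.std()),
    }


# Synthetic stand-ins for real photos: a dark and a bright "room"
dark = np.full((4, 4, 3), 30, dtype=np.uint8)
bright = np.full((4, 4, 3), 220, dtype=np.uint8)
print(brightness_features(dark))
print(brightness_features(bright))
```

For a listing with several photos, one could aggregate these values (mean, minimum, maximum) across the photos.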
If there is text on the image, you can read it without unraveling a complicated neural network. For example, check out pytesseract.
# requires `tesseract` to be installed in the system
import pytesseract
from PIL import Image
import requests
from io import BytesIO

# Just a random picture from search
img = "http://ohscurrent.org/wp-content/uploads/2015/09/domus-01-google.jpg"
img = requests.get(img)
img = Image.open(BytesIO(img.content))
text = pytesseract.image_to_string(img)
print(text)
It's good to keep in mind that pytesseract is not a "silver bullet".
img = "https://habrastorage.org/webt/mj/uv/6o/mjuv6olsh1x9xxe1a6zjy79u1w8.jpeg"
img = requests.get(img)
img = Image.open(BytesIO(img.content))
print(pytesseract.image_to_string(img))
‘BEDROOM 1exI2 DINING AREA 11" 100" uvinc Room 120" 182" KITCHEN 102" x T10" x1t0"
Another case where neural networks cannot help is extracting features from meta-information. For images, EXIF stores a lot of useful meta-information: manufacturer and camera model, resolution, use of the flash, geographic coordinates of shooting, software used to process the image, and more.
Geospatial data
Geographic data is not so often found in problems, but it is still useful to master the basic techniques for working with it, especially since there are quite a number of ready-to-use solutions in this field.
Geospatial data is often presented in the form of addresses or (latitude, longitude) coordinates. Depending on the task, you may need two mutually inverse operations: geocoding (recovering a point from an address) and reverse geocoding (recovering an address from a point). Both operations are accessible in practice via external APIs from Google Maps or OpenStreetMap. Different geocoders have their own characteristics, and the quality varies from region to region. Fortunately, there are universal libraries like geopy that act as wrappers for these external services.
If you have a lot of data, you will quickly reach the limits of an external API. Besides, receiving information via HTTP is not always the fastest option. Therefore, it is worth considering a local version of OpenStreetMap.
If you have a small amount of data, enough time, and no desire to extract fancy features, you can use reverse_geocoder in lieu of OpenStreetMap:
from pprint import pprint

import reverse_geocoder as revgc

# df is assumed to be a DataFrame with latitude and longitude columns
results = revgc.search(list(zip(df.latitude, df.longitude)))
pprint(results[:2])
Loading formatted geocoded file...
{'admin1': 'New York',
'admin2': 'Queens County',
'cc': 'US',
'lat': '40.74482',
'lon': '-73.94875',
'name': 'Long Island City'},
{'admin1': 'New York',
'admin2': 'Queens County',
'cc': 'US',
'lat': '40.74482',
'lon': '-73.94875',
'name': 'Long Island City'}
When working with geocoding, we must not forget that addresses may contain typos, which makes the data cleaning step necessary. Coordinates contain fewer misprints, but their position can be incorrect due to GPS noise or poor accuracy in places like tunnels, downtown areas, etc. If the data source is a mobile device, the geolocation may be determined not by GPS but by WiFi networks in the area, which leads to holes in space and teleportation: while traveling through Manhattan, a point can suddenly appear with a WiFi location from Chicago.
WiFi location tracking is based on the combination of SSID and MAC address, which may correspond to different points, e.g., a federal provider standardizes the firmware of its routers down to the MAC address and places them in different cities. Even a company's move to another office with its routers can cause issues.
The point is usually located among infrastructure. Here, you can really unleash your imagination and invent features based on your life experience and domain knowledge: the proximity of a point to the subway, the number of stories in the building, the distance to the nearest store, the number of ATMs around, etc. For any task, you can easily come up with dozens of features and extract them from various external sources. For problems outside an urban environment, you may consider features from more specific sources e.g. the height above sea level.
If two or more points are interconnected, it may be worthwhile to extract features from the route between them. In that case, distances (great circle distance and road distance calculated from the routing graph), the number of turns with the ratio of left to right turns, and the number of traffic lights, junctions, and bridges will be useful. In one of my own tasks, I generated a feature called "the complexity of the road", which was the graph-calculated road distance divided by the great circle distance.
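A minimal sketch of such a ratio feature, using the haversine formula for the great circle distance. The function names are made up for this example, and the road distance is assumed to come from an external routing engine; the value used below is a placeholder, not a measurement.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def road_complexity(road_km, lat1, lon1, lat2, lon2):
    """Ratio of road distance to great circle distance (>= 1 for real routes)."""
    gc_km = great_circle_km(lat1, lon1, lat2, lon2)
    return road_km / gc_km if gc_km > 0 else float("nan")


# Hypothetical road distance of 5.0 km between two points in New York
print(road_complexity(5.0, 40.7448, -73.9488, 40.7580, -73.9855))
```

A ratio close to 1 suggests an almost straight route, while larger values indicate detours such as rivers, one-way streets, or missing bridges.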
Date and time
You would think that date and time are standardized because of their prevalence, but, nevertheless, some pitfalls remain.
Let's start with the day of the week, which is easy to turn into 7 dummy variables using one-hot encoding. In addition, we will also create a separate binary feature for the weekend called is_weekend.
df['dow'] = df['created'].apply(lambda x: x.date().weekday())
df['is_weekend'] = df['created'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
Some tasks may require additional calendar features. For example, cash withdrawals can be linked to a pay day; the purchase of a metro card, to the beginning of the month. In general, when working with time series data, it is a good idea to have a calendar with public holidays, abnormal weather conditions, and other important events.
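Such calendar features can be sketched with the standard library alone. Everything configurable here is an assumption for illustration: the `SPECIAL_DATES` calendar, the `PAY_DAYS` tuple, and the "first three days" definition of the beginning of the month would all have to come from your own domain knowledge.

```python
from datetime import date

# Hypothetical calendar of special dates; in practice, load one for your region.
SPECIAL_DATES = {date(2017, 1, 1): "new_year", date(2017, 1, 20): "inauguration"}
PAY_DAYS = (10, 25)  # assumed days of the month when salaries are paid


def calendar_features(d):
    """Build simple calendar features for a datetime.date."""
    return {
        "dow": d.weekday(),  # Monday = 0 ... Sunday = 6
        "is_weekend": int(d.weekday() >= 5),
        "is_month_start": int(d.day <= 3),
        "is_pay_day": int(d.day in PAY_DAYS),
        "is_special": int(d in SPECIAL_DATES),
    }


print(calendar_features(date(2017, 1, 20)))
```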
Q: What do Chinese New Year, the New York marathon, and the Trump inauguration have in common?
A: They all need to be put on the calendar of potential anomalies.
Dealing with the hour (minute, day of the month, ...) is not as simple as it seems. If you use the hour as a real variable, you slightly contradict the nature of the data: 0 < 23, while 02.01 0:00:00 > 01.01 23:00:00. For some problems, this can be critical. At the same time, if you encode them as categorical variables, you'll breed a large number of features and lose information about proximity: the difference between 22 and 23 will be the same as the difference between 22 and 7.
There also exist some more esoteric approaches to such data like projecting the time onto a circle and using the two coordinates.
import numpy as np

def make_harmonic_features(value, period=24):
    # Map the value to an angle on a circle of the given period
    value *= 2 * np.pi / period
    return np.cos(value), np.sin(value)
This transformation preserves the distance between points, which is important for algorithms that estimate distance (kNN, SVM, k-means …)
from scipy.spatial.distance import euclidean

euclidean(make_harmonic_features(23), make_harmonic_features(1))
0.5176380902050424
euclidean(make_harmonic_features(9), make_harmonic_features(11))
0.5176380902050414
euclidean(make_harmonic_features(9), make_harmonic_features(21))
2.0
However, in practice the difference between these encoding methods usually shows up only in the third decimal place of the quality metric.
Time series, web, etc.
Regarding time series, we will not go into too much detail here (mostly due to my personal lack of experience), but I will point you to a useful library that automatically generates features for time series.
If you are working with web data, then you usually have information about the user's User Agent. It is a wealth of information. First, extract the operating system from it. Second, create an is_mobile feature. Third, look at the browser.
# Install pyyaml ua-parser user-agents
import user_agents
ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/56.0.2924.76 Chrome/56.0.2924.76 Safari/537.36"
ua = user_agents.parse(ua)
print("Is a bot? ", ua.is_bot)
print("Is mobile? ", ua.is_mobile)
print("Is PC? ", ua.is_pc)
print("OS Family: ", ua.os.family)
print("OS Version: ", ua.os.version)
print("Browser Family: ", ua.browser.family)
print("Browser Version: ", ua.browser.version)
Is a bot? False
Is mobile? False
Is PC? True
OS Family: Ubuntu
OS Version: ()
Browser Family: Chromium
Browser Version: (56, 0, 2924)
As in other domains, you can come up with your own features based on intuition about the nature of the data. At the time of this writing, Chromium 56 was new, but, after some time, only users who haven't restarted their browser for a long time will have this version. In this case, why not introduce a feature called "lag behind the latest version of the browser"?
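One possible shape for such a feature is sketched below. The `LATEST_MAJOR` table is hypothetical and would have to be maintained per log date from release histories; the version argument is a tuple like the one printed for the parsed User Agent above.

```python
# Hypothetical "latest major version" table at the time the logs were collected.
LATEST_MAJOR = {"Chromium": 56, "Firefox": 51}


def browser_version_lag(family, version):
    """Major versions behind the latest release; None if it cannot be computed."""
    latest = LATEST_MAJOR.get(family)
    if latest is None or not version:
        return None  # unknown browser family or empty version tuple
    return max(latest - version[0], 0)


print(browser_version_lag("Chromium", (56, 0, 2924)))  # 0
print(browser_version_lag("Chromium", (50, 0, 2661)))  # 6
```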
In addition to the operating system and browser, you can look at the referrer (not always available), http_accept_language, and other meta information.
The next useful piece of information is the IP address, from which you can extract the country and possibly the city, provider, and connection type (mobile/stationary). You need to understand that there are all kinds of proxies and outdated databases, so this feature can contain noise. Network administration gurus may try to extract even fancier features, such as signs of VPN use. By the way, the data from the IP address combines well with http_accept_language: if the user is sitting behind a Chilean proxy and the browser locale is ru_RU, something is off and worth flagging in a corresponding column of the table (is_traveler_or_proxy_user).
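A rough sketch of such a flag, assuming the country code has already been obtained from some GeoIP lookup (not shown here) and that the locale string carries a region part. The parsing rules are a simplification made for this example.

```python
def is_traveler_or_proxy_user(ip_country, accept_language):
    """1 if the browser locale's region differs from the IP country, else 0.

    ip_country: ISO 3166-1 alpha-2 code from a GeoIP lookup, e.g. "CL".
    accept_language: locale string such as "ru_RU" or "ru-RU,ru;q=0.9".
    """
    # Keep only the first (most preferred) locale and drop any q-value
    primary = accept_language.split(",")[0].split(";")[0].strip()
    parts = primary.replace("-", "_").split("_")
    if len(parts) < 2 or not ip_country:
        return 0  # no region in the locale: not enough evidence
    return int(parts[1].upper() != ip_country.upper())


print(is_traveler_or_proxy_user("CL", "ru_RU"))            # 1
print(is_traveler_or_proxy_user("RU", "ru-RU,ru;q=0.9"))   # 0
```

The flag is noisy by design: expatriates and tourists will trigger it as well, which is why the column name mentions both travelers and proxy users.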
Any given area has so many specifics that it is too much for an individual to absorb completely. Therefore, I invite everyone to share their experiences and discuss feature extraction and generation in the comments section.