Searching and Hashing

It is important to understand what search is and when it is appropriate. This page explains sequential and binary search, and their implementation. There is also the matter of hashing as a storage and search technique. In so doing, we introduce the unordered map and how to implement a map abstract data type using hashing. Read Sections 6.1-6.5. You may also find Sections 6.6-6.11 useful for practice.

5. Hashing

5.1. Hash Functions

Given a collection of items, a hash function that maps each item into a unique slot is referred to as a perfect hash function. If we know the items and the collection will never change, then it is possible to construct a perfect hash function (refer to the exercises for more about perfect hash functions). Unfortunately, given an arbitrary collection of items, there is no systematic way to construct a perfect hash function. Luckily, we do not need the hash function to be perfect to still gain performance efficiency.

One way to always have a perfect hash function is to increase the size of the hash table so that each possible value in the item range can be accommodated. This guarantees that each item will have a unique slot. Although this is practical for small numbers of items, it is not feasible when the number of possible items is large. For example, if the items were nine-digit Social Security numbers, this method would require almost one billion slots. If we only want to store data for a class of 25 students, we will be wasting an enormous amount of memory.

Our goal is to create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes the items in the hash table. There are a number of common ways to extend the simple remainder method. We will consider a few of them here.

The folding method for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value. For example, if our item was the phone number 436-555-4601, we would take the digits and divide them into groups of 2 (43,65,55,46,01). After the addition, , we get 210. If we assume our hash table has 11 slots, then we need to perform the extra step of dividing by 11 and keeping the remainder. In this case is 1, so the phone number 436-555-4601 hashes to slot 1. Some folding methods go one step further and reverse every other piece before the addition. For the above example, we get which gives .

Another numerical technique for constructing a hash function is called the mid-square method. We first square the item, and then extract some portion of the resulting digits. For example, if the item were 44, we would first compute

By extracting the middle two digits, 93, and performing the remainder step, we get 5 ( ). Table 5 shows items under both the remainder method and the mid-square method. You should verify that you understand how these values were computed.

We can also create hash functions for character-based items such as strings. The word "cat" can be thought of as a sequence of int values. The corresponding int value can be found by declaring an int and using it to store a char. You can also cast the value as an int using int()

string h = "hello";
char c = h[0];
int i = c;

cout<<h<<endl;
cout<<c<<endl;
cout<<i<<endl;

>>hello
>>h
>>104

We can then take these three ordinal values, add them up, and use the remainder method to get a hash value (see Figure 6). Listing 1 shows a function called hash that takes a string and a table size and returns the hash value in the range from 0 to tablesize-1.

../_images/stringhash.png

Figure 6: Hashing a String Using Ordinal Values

Listing 1

#include <iostream>
#include <string>
using namespace std;

// uses ordinal values, of strings and using positional values to weight them
//to generate A hash value
int hashfunc(string a, int tablesize) {
int sum=0;
for (unsigned int pos=0; pos<a.length(); pos++) {
sum += int(a[pos]); // getting ordinal values, and using positional values to weight them
//adding them up, and using the remainder method to get a hash value.
}

return sum%tablesize;
}

int main() {
cout<<hashfunc("First!" , 10)<<endl;
cout<<hashfunc("Second!", 10)<<endl;
cout<<hashfunc("Third!" , 10)<<endl;

return 0;
}

A simple C++ string hash function (simplehash)

It is interesting to note that when using this hash function, anagrams will always be given the same hash value. To remedy this, we could use the position of the character as a weight. Figure 7 shows one possible way to use the positional value as a weighting factor. The modification to the hash function is left as an exercise.

../_images/stringhash2.png

Figure 7: Hashing a String Using Ordinal Values with Weighting

You may be able to think of a number of additional ways to compute hash values for items in a collection. The important thing to remember is that the hash function has to be efficient so that it does not become the dominant part of the storage and search process. If the hash function is too complex, then it becomes more work to compute the slot name than it would be to simply do a basic sequential or binary search as described earlier. This would quickly defeat the purpose of hashing.