Searching and Hashing
Description
It is important to understand what search is and when it is appropriate. This page explains sequential and binary search and their implementations. It also covers hashing as a storage and search technique, introducing the unordered map and how to implement a map abstract data type using hashing. Read Sections 6.1-6.5. You may also find Sections 6.6-6.11 useful for practice.
1. Objectives
- To understand what search is and when it is appropriate.
- To be able to explain and to implement sequential search and binary search.
- To understand the idea of hashing as a search technique.
- To introduce the unordered map abstract data type.
- To implement a map abstract data type using hashing.
Source: Runestone Academy, https://runestone.academy/runestone/books/published/cppds/SearchHash/toctree.html
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.
2. Searching
In the next two chapters we will turn our attention to some of the most common problems that arise in computing: searching and sorting. In this chapter we will study searching and return to sorting in the next chapter.
Searching is the algorithmic process of finding a particular item in a collection of items. A search typically answers either True or False as to whether the item is present. On occasion it may be modified to return where the item is found. For our purposes here, we will simply concern ourselves with the question of membership.

In many languages, libraries provide very easy ways to ask whether an item is in a container of items. In Python one uses the in operator on a list:
>>> 15 in [3,5,2,4,1]
False
>>> 3 in [3,5,2,4,1]
True
>>>
In C++, the STL algorithm library (header <algorithm>) provides the find function, which works on vectors and even on arrays (although much more awkwardly). find returns an iterator to the first matching element, or the end iterator if the item is not present, so a membership test compares the result against end(). Here is an example using a vector:

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<int> myints = {3, 5, 2, 4, 1};
    // comparing find's result against end() yields a boolean (printed as 0 or 1)
    cout << (find(myints.begin(), myints.end(), 15) != myints.end()) << endl; // 0
    cout << (find(myints.begin(), myints.end(), 3) != myints.end()) << endl;  // 1
    return 0;
}
How does this work? Even though this is easy to write, an underlying process must be carried out to do the search.
A function can be created for C++ arrays by passing in the array, the size of the array, and the value to search for as arguments.

#include <iostream>
using namespace std;

// Returns true if value appears anywhere in the first size elements of alist
bool isIn(int alist[], int size, int value) {
    for (int i = 0; i < size; i++) {
        if (alist[i] == value) {
            return true;
        }
    }
    return false;
}

int main() {
    int myarr[] = {3, 5, 2, 4, 1};
    cout << isIn(myarr, 5, 15) << endl; // 0
    cout << isIn(myarr, 5, 3) << endl;  // 1
    return 0;
}
This works, but it is not the only way to search! It turns out that there are multiple different ways to search for the item. What we are interested in here is how these algorithms work and how they compare to one another.
3. The Sequential Search
When data items are stored in a container type such as a Python list or a C++ array/vector, we say that they have a linear or sequential relationship. Each data item is stored in a position relative to the others. In Python lists, these relative positions are the index values of the individual items. In C++ arrays, these are simply adjacent memory locations each equally sized to fit the data type of the container. Since these index values are ordered, it is possible for us to visit them in sequence. This process gives rise to our first searching technique, the sequential search.
Figure 1 shows how this search works. Starting at the first item in the list, we simply move from item to item, following the underlying sequential ordering until we either find what we are looking for or run out of items. If we run out of items, we have discovered that the item we were searching for was not present.
Figure 1: Sequential Search of a List of Integers
The C++ implementation of this algorithm is shown below. The function needs the vector and the item we are looking for and returns a boolean value as to whether it is present. The boolean variable found is initialized to false and is assigned the value true if we discover the item in the vector.
#include <iostream>
#include <vector>
using namespace std;

// Checks to see if item is in a vector;
// returns true or false (1 or 0)
// using sequential search
bool sequentialSearch(vector<int> avector, int item) {
    unsigned int pos = 0;
    bool found = false;
    while (pos < avector.size() && !found) {
        if (avector[pos] == item) {
            found = true;
        } else {
            pos++;
        }
    }
    return found;
}

int main() {
    // Vector initialized using an array
    int arr[] = {1, 2, 32, 8, 17, 19, 42, 13, 0};
    vector<int> testvector(arr, arr + (sizeof(arr) / sizeof(arr[0])));
    cout << sequentialSearch(testvector, 3) << endl;  // 0
    cout << sequentialSearch(testvector, 13) << endl; // 1
    return 0;
}

C++ Sequential Search of an Unordered Vector (search1_cpp)
3.1. Analysis of Sequential Search
If the item is not in the list, the only way to know it is to compare it against every item present. If there are n items, then the sequential search requires n comparisons to discover that the item is not there. In the case where the item is in the list, the analysis is not so straightforward. There are actually three different scenarios that can occur. In the best case we will find the item in the first place we look, at the beginning of the list. We will need only one comparison. In the worst case, we will not discover the item until the very last comparison, the nth comparison.

What about the average case? On average, we will find the item about halfway into the list; that is, we will compare against n/2 items. Recall, however, that as n gets large, the coefficients, no matter what they are, become insignificant in our approximation, so the complexity of the sequential search is O(n). Table 1 summarizes these results.

Table 1: Comparisons Used in a Sequential Search of an Unordered List

| Case | Best Case | Worst Case | Average Case |
|---|---|---|---|
| item is present | 1 | n | n/2 |
| item is not present | n | n | n |
We assumed earlier that the items in our collection had been randomly placed so that there is no relative order between the items. What would happen to the sequential search if the items were ordered in some way? Would we be able to gain any efficiency in our search technique?
Assume that the list of items was constructed so that the items were in ascending order, from low to high. If the item we are looking for is present in the list, the chance of it being in any one of the n positions is still the same as before. We will still have the same number of comparisons to find the item. However, if the item is not present there is a slight advantage. Figure 2 shows this process as the algorithm looks for the item 50. Notice that items are still compared in sequence until 54. At this point, however, we know something extra. Not only is 54 not the item we are looking for, but no other elements beyond 54 can work either since the list is sorted. In this case, the algorithm does not have to continue looking through all of the items to report that the item was not found. It can stop immediately. CodeLens 2 shows this variation of the sequential search function.

Figure 2: Sequential Search of an Ordered List of Integers
#include <iostream>
#include <vector>
using namespace std;

// Checks to see if item is in a vector;
// returns true or false (1 or 0)
// using ordered sequential search
bool orderedSequentialSearch(vector<int> avector, int item) {
    unsigned int pos = 0;
    bool found = false;
    bool stop = false;
    while (pos < avector.size() && !found && !stop) {
        if (avector[pos] == item) {
            found = true;
        } else {
            if (avector[pos] > item) {
                stop = true; // past where the item would be: stop early
            } else {
                pos++;
            }
        }
    }
    return found;
}

int main() {
    // Vector initialized using an array
    int arr[] = {0, 1, 2, 8, 13, 17, 19, 32, 42};
    vector<int> testvector(arr, arr + (sizeof(arr) / sizeof(arr[0])));
    cout << orderedSequentialSearch(testvector, 3) << endl;  // 0
    cout << orderedSequentialSearch(testvector, 13) << endl; // 1
    return 0;
}
C++ Sequential Search of an Ordered vector (search2_cpp)
Table 2 summarizes these results. Note that in the best case we might discover that the item is not in the vector by looking at only one item. On average, we will know after looking through only n/2 items. However, this technique is still O(n). In summary, a sequential search is improved by ordering the vector only in the case where we do not find the item.

Table 2: Comparisons Used in a Sequential Search of an Ordered List

| Case | Best Case | Worst Case | Average Case |
|---|---|---|---|
| item is present | 1 | n | n/2 |
| item is not present | 1 | n | n/2 |
4. The Binary Search
It is possible to take greater advantage of the ordered vector if we are clever with our comparisons. In the sequential search, when we compare against the first item, there are at most n - 1 more items to look through if the first item is not what we are looking for. Instead of searching the vector in sequence, a binary search will start by examining the middle item. If that item is the one we are searching for, we are done. If it is not the correct item, we can use the ordered nature of the vector to eliminate half of the remaining items. If the item we are searching for is greater than the middle item, we know that the entire lower half of the vector as well as the middle item can be eliminated from further consideration. The item, if it is in the vector, must be in the upper half.
We can then repeat the process with the upper half. Start at the middle item and compare it against what we are looking for. Again, we either find it or split the vector in half, therefore eliminating another large part of our possible search space. Figure 3 shows how this algorithm can quickly find the value 54. The complete function is shown in CodeLens 3.
Figure 3: Binary Search of an Ordered vector of Integers
#include <iostream>
#include <vector>
using namespace std;

// Checks to see if item is in a vector;
// returns true or false (1 or 0)
// using binary search
bool binarySearch(vector<int> avector, int item) {
    int first = 0;
    int last = avector.size() - 1;
    bool found = false;

    while (first <= last && !found) {
        int midpoint = (first + last) / 2;
        if (avector[midpoint] == item) {
            found = true;
        } else {
            if (item < avector[midpoint]) {
                last = midpoint - 1;  // item can only be in the lower half
            } else {
                first = midpoint + 1; // item can only be in the upper half
            }
        }
    }
    return found;
}

int main() {
    // Using a static array to initialize a vector
    static const int arr[] = {0, 1, 2, 8, 13, 17, 19, 32, 42};
    vector<int> avector(arr, arr + sizeof(arr) / sizeof(arr[0]));
    cout << binarySearch(avector, 3) << endl;  // 0
    cout << binarySearch(avector, 13) << endl; // 1
    return 0;
}
C++ has a vector constructor that can copy a range of an existing vector, much like Python's slice operator; however, it can only be used when a new vector is created. The recursive version below uses it to split the vector into halves.
#include <iostream>
#include <vector>
using namespace std;

// Checks to see if item is in a vector;
// returns true or false (1 or 0)
// using binary search and
// separating the vector into halves
bool binarySearch(vector<int> alist, int item) {
    if (alist.size() == 0) {
        return false;
    } else {
        int midpoint = alist.size() / 2;
        if (alist[midpoint] == item) {
            return true;
        } else {
            if (item < alist[midpoint]) {
                // copy the lower half into a new vector and recurse
                vector<int> lefthalf(alist.begin(), alist.begin() + midpoint);
                return binarySearch(lefthalf, item);
            } else {
                // copy the upper half into a new vector and recurse
                vector<int> righthalf(alist.begin() + midpoint + 1, alist.end());
                return binarySearch(righthalf, item);
            }
        }
    }
}

int main() {
    // Using a static array to initialize a vector
    static const int arr[] = {0, 1, 2, 8, 13, 17, 19, 32, 42};
    vector<int> alist(arr, arr + sizeof(arr) / sizeof(arr[0]));
    cout << binarySearch(alist, 3) << endl;  // 0
    cout << binarySearch(alist, 13) << endl; // 1
    return 0;
}
A Recursive Binary Search (binary_search_cpp_recursive)
4.1. Analysis of Binary Search
To analyze the binary search algorithm, we need to recall that each comparison eliminates about half of the remaining items from consideration. What is the maximum number of comparisons this algorithm will require to check the entire list? If we start with n items, about n/2 items will be left after the first comparison. After the second comparison, there will be about n/4. Then n/8, n/16, and so on. How many times can we split the list? Table 3 helps us to see the answer.

Table 3: Tabular Analysis for a Binary Search

| Comparisons | Approximate Number of Items Left |
|---|---|
| 1 | n/2 |
| 2 | n/4 |
| 3 | n/8 |
| ... | ... |
| i | n/2^i |

When we split the list enough times, we end up with a list that has just one item. Either that is the item we are looking for or it is not. Either way, we are done. The number of comparisons necessary to get to this point is i where n/2^i = 1. Solving for i gives us i = log n. The maximum number of comparisons is logarithmic with respect to the number of items in the list: a list of 1,000,000 items, for example, needs at most about 20 comparisons, since 2^20 = 1,048,576. Therefore, the binary search is O(log n).
One additional analysis issue needs to be addressed. In the recursive solution shown above, each recursive call constructs a new vector holding half of the list (the Python equivalent, binarySearch(alist[:midpoint], item), uses the slice operator to do the same thing). The analysis above assumed that creating that half takes constant time, but copying half of the list is actually proportional to the number of items copied, so the binary search using slicing or copying will not perform in strict logarithmic time. Luckily this can be remedied by passing the list along with the starting and ending indices. To truly optimize this algorithm, we could use an array and manually keep track of the start and end indices of our array. Below is an example of such an implementation.
#include <iostream>
using namespace std;

// Checks to see if item is in an array;
// returns true or false (1 or 0)
// using binary search and
// start and end indices instead of copies
bool binarySearch(int arr[], int item, int start, int end) {
    if (end >= start) {
        int mid = start + (end - start) / 2;
        if (arr[mid] == item)
            return true;
        if (arr[mid] > item)
            return binarySearch(arr, item, start, mid - 1);
        else
            return binarySearch(arr, item, mid + 1, end);
    }
    return false;
}

bool binarySearchHelper(int arr[], int size, int item) {
    // the last valid index is size - 1
    return binarySearch(arr, item, 0, size - 1);
}

int main(void) {
    int arr[] = {0, 1, 2, 8, 13, 17, 19, 32, 42};
    int arrLength = sizeof(arr) / sizeof(arr[0]);
    cout << binarySearchHelper(arr, arrLength, 3) << endl;  // 0
    cout << binarySearchHelper(arr, arrLength, 13) << endl; // 1
    return 0;
}
Optimized Binary Search (binary_search_cpp_array)
Even though a binary search is generally better than a sequential search, it is important to note that for small values of n, the additional cost of sorting is probably not worth it. In fact, we should always consider whether it is cost effective to take on the extra work of sorting to gain searching benefits. If we can sort once and then search many times, the cost of the sort is not so significant. However, for large lists, sorting even once can be so expensive that simply performing a sequential search from the start may be the best choice.
5. Hashing
In previous sections we were able to make improvements on our search algorithms by taking advantage of information about where items are stored in the collection with respect to one another. For example, by knowing that a list was ordered, we could search in logarithmic time using a binary search. In this section we will attempt to go one step further by building a data structure that can be searched in O(1) time by using hashing.
In order to do this, we will need to know even more about where the items might be when we go to look for them in the collection. If every item is where it should be, then the search can use a single comparison to discover the presence of an item. We will see, however, that this is typically not the case.
A hash table is a collection of items which are stored in such a way as to make it easy to find them later. Each position of the hash table, often called a slot, can hold an item and is named by an integer value starting at 0. For example, we will have a slot named 0, a slot named 1, a slot named 2, and so on. Initially, the hash table contains no items, so every slot is empty. We can implement a hash table by using a list with each value initialized to an empty string. Figure 4 shows a hash table of size m = 11. In other words, there are m slots in the table, named 0 through 10.
Figure 4: Hash Table with 11 Empty Slots
The mapping between an item and the slot where that item belongs in the hash table is called the hash function. The hash function will take any item in the collection and return an integer in the range of slot names, between 0 and m-1.
Assume that we have the set of integer items 54, 26, 93, 17, 77, and 31. Our first hash function, sometimes referred to as the "remainder method," simply takes an item and divides it by the table size, returning the remainder as its hash value (h(item) = item % 11). Table 4 gives all of the hash values for our example items. Note that this remainder method (modulo arithmetic) will typically be present in some form in all hash functions, since the result must be in the range of slot names.

Table 4: Simple Hash Function Using Remainders

| Item | Hash Value |
|---|---|
| 54 | 10 |
| 26 | 4 |
| 93 | 5 |
| 17 | 6 |
| 77 | 0 |
| 31 | 9 |
You can probably already see that this technique is going to work only if each item maps to a unique location in the hash table. For example, if the item 44 had been the next item in our collection, it would have a hash value of 0 (44 % 11 == 0). Since 77 also had a hash value of 0, we would have a problem. According to the hash function, two or more items would need to be in the same slot. This is referred to as a collision (it may also be called a "clash"). Clearly, collisions create a problem for the hashing technique. We will discuss them in detail later.
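To make the remainder method concrete, here is a minimal sketch (the code is my own illustration, not part of the text) that computes the slot for each example item and exposes the 44/77 collision:

#include <iostream>
using namespace std;

int main() {
    // remainder-method hash: slot = item % tablesize, with a table of size 11
    int items[] = {54, 26, 93, 17, 77, 31, 44};
    for (int item : items) {
        cout << item << " -> slot " << item % 11 << endl;
    }
    // both 77 and 44 print slot 0: the collision described above
    return 0;
}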
5.1. Hash Functions
Given a collection of items, a hash function that maps each item into a unique slot is referred to as a perfect hash function. If we know the items and the collection will never change, then it is possible to construct a perfect hash function (refer to the exercises for more about perfect hash functions). Unfortunately, given an arbitrary collection of items, there is no systematic way to construct a perfect hash function. Luckily, we do not need the hash function to be perfect to still gain performance efficiency.
One way to always have a perfect hash function is to increase the size of the hash table so that each possible value in the item range can be accommodated. This guarantees that each item will have a unique slot. Although this is practical for small numbers of items, it is not feasible when the number of possible items is large. For example, if the items were nine-digit Social Security numbers, this method would require almost one billion slots. If we only want to store data for a class of 25 students, we will be wasting an enormous amount of memory.
Our goal is to create a hash function that minimizes the number of collisions, is easy to compute, and evenly distributes the items in the hash table. There are a number of common ways to extend the simple remainder method. We will consider a few of them here.
The folding method for constructing hash functions begins by dividing the item into equal-size pieces (the last piece may not be of equal size). These pieces are then added together to give the resulting hash value. For example, if our item was the phone number 436-555-4601, we would take the digits and divide them into groups of 2 (43, 65, 55, 46, 01). After the addition, 43 + 65 + 55 + 46 + 01 = 210, we get 210. If we assume our hash table has 11 slots, then we need to perform the extra step of dividing by 11 and keeping the remainder. In this case 210 % 11 is 1, so the phone number 436-555-4601 hashes to slot 1. Some folding methods go one step further and reverse every other piece before the addition. For the above example, we get 43 + 56 + 55 + 64 + 01 = 219 which gives 219 % 11 = 10.
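As an illustration of the basic (non-reversing) folding method, here is a minimal sketch; foldHash is a hypothetical name, not a function defined in the text:

#include <iostream>
#include <string>
using namespace std;

// Split a digit string into two-digit pieces, sum the pieces,
// and take the remainder by the table size.
int foldHash(const string& digits, int tablesize) {
    int sum = 0;
    for (size_t i = 0; i < digits.length(); i += 2) {
        sum += stoi(digits.substr(i, 2)); // substr also handles a short final piece
    }
    return sum % tablesize;
}

int main() {
    // the phone number 436-555-4601 with punctuation removed:
    // pieces 43 + 65 + 55 + 46 + 01 = 210, and 210 % 11 = 1
    cout << foldHash("4365554601", 11) << endl; // 1
    return 0;
}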
Another numerical technique for constructing a hash function is called the mid-square method. We first square the item, and then extract some portion of the resulting digits. For example, if the item were 44, we would first compute 44^2 = 1,936. By extracting the middle two digits, 93, and performing the remainder step, we get 5 (93 % 11). Table 5 shows items under both the remainder method and the mid-square method. You should verify that you understand how these values were computed.

Table 5: Comparison of Remainder and Mid-Square Methods

| Item | Remainder | Mid-Square |
|---|---|---|
| 54 | 10 | 3 |
| 26 | 4 | 7 |
| 93 | 5 | 9 |
| 17 | 6 | 8 |
| 77 | 0 | 4 |
| 31 | 9 | 6 |
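Here is a hedged sketch of the middle-two-digit extraction for items whose squares have four digits; midSquareHash is my own name, and items with shorter squares (such as 26 or 31 in Table 5) need a slightly different extraction:

#include <iostream>
using namespace std;

// Square the item, drop the last digit, keep the next two digits,
// then take the remainder by the table size.
int midSquareHash(int item, int tablesize) {
    int squared = item * item;          // e.g. 44 * 44 = 1936
    int middle = (squared / 10) % 100;  // 1936 -> 93
    return middle % tablesize;          // 93 % 11 = 5
}

int main() {
    cout << midSquareHash(44, 11) << endl; // 5
    cout << midSquareHash(93, 11) << endl; // 9, matching Table 5
    cout << midSquareHash(77, 11) << endl; // 4, matching Table 5
    return 0;
}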
We can also create hash functions for character-based items such as strings. The word "cat" can be thought of as a sequence of int values. The corresponding int value can be found by declaring an int and using it to store a char; you can also cast the value as an int using int().

string h = "hello";
char c = h[0];
int i = c;
cout << h << endl; // hello
cout << c << endl; // h
cout << i << endl; // 104
We can then take these three ordinal values, add them up, and use the remainder method to get a hash value (see Figure 6). For "cat", the ordinal values 99 + 97 + 116 sum to 312, and 312 % 11 = 4. Listing 1 shows a function called hashfunc that takes a string and a table size and returns the hash value in the range from 0 to tablesize - 1.
Figure 6: Hashing a String Using Ordinal Values
Listing 1
#include <iostream>
#include <string>
using namespace std;

// Adds up the ordinal values of the characters in a string
// and uses the remainder method to generate a hash value
int hashfunc(string a, int tablesize) {
    int sum = 0;
    for (unsigned int pos = 0; pos < a.length(); pos++) {
        sum += int(a[pos]); // accumulate each character's ordinal value
    }
    return sum % tablesize;
}

int main() {
    cout << hashfunc("First!", 10) << endl;
    cout << hashfunc("Second!", 10) << endl;
    cout << hashfunc("Third!", 10) << endl;
    return 0;
}
A simple C++ string hash function (simplehash)
It is interesting to note that when using this hash function, anagrams will always be given the same hash value. To remedy this, we could use the position of the character as a weight. Figure 7 shows one possible way to use the positional value as a weighting factor. The modification to the hashfunc function is left as an exercise.
You may be able to think of a number of additional ways to compute hash values for items in a collection. The important thing to remember is that the hash function has to be efficient so that it does not become the dominant part of the storage and search process. If the hash function is too complex, then it becomes more work to compute the slot name than it would be to simply do a basic sequential or binary search as described earlier. This would quickly defeat the purpose of hashing.
5.2. Collision Resolution
We now return to the problem of collisions. When two items hash to the same slot, we must have a systematic method for placing the second item in the hash table. This process is called collision resolution. As we stated earlier, if the hash function is perfect, collisions will never occur. However, since this is often not possible, collision resolution becomes a very important part of hashing.
One method for resolving collisions looks into the hash table and tries to find another open slot to hold the item that caused the collision. A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty. Note that we may need to go back to the first slot (circularly) to cover the entire hash table. This collision resolution process is referred to as open addressing in that it tries to find the next open slot or address in the hash table. By systematically visiting each slot one at a time, we are performing an open addressing technique called linear probing.
Figure 8 shows an extended set of integer items under the simple remainder method hash function (54,26,93,17,77,31,44,55,20). Table 4 above shows the hash values for the original items. Figure 5 shows the original contents. When we attempt to place 44 into slot 0, a collision occurs. Under linear probing, we look sequentially, slot by slot, until we find an open position. In this case, we find slot 1.
Again, 55 should go in slot 0 but must be placed in slot 2 since it is the next open position. The final value of 20 hashes to slot 9. Since slot 9 is full, we begin to do linear probing. We visit slots 10, 0, 1, and 2, and finally find an empty slot at position 3.
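The whole insertion sequence can be reproduced with a short sketch (my own illustration, not code from the text); a slot value of -1 marks an empty position:

#include <iostream>
#include <vector>
using namespace std;

int main() {
    vector<int> table(11, -1); // 11 empty slots, as in the running example
    int items[] = {54, 26, 93, 17, 77, 31, 44, 55, 20};
    for (int item : items) {
        int pos = item % 11;       // simple remainder hash
        while (table[pos] != -1) { // collision: probe the next slot, wrapping around
            pos = (pos + 1) % 11;
        }
        table[pos] = item;
    }
    for (int i = 0; i < 11; i++) { // 77, 44, 55, and 20 land in slots 0 through 3
        cout << i << ": " << table[i] << endl;
    }
    return 0;
}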
Once we have built a hash table using open addressing and linear probing, it is essential that we utilize the same methods to search for items. Assume we want to look up the item 93. When we compute the hash value, we get 5. Looking in slot 5 reveals 93, and we can return True. What if we are looking for 20? Now the hash value is 9, and slot 9 is currently holding 31. We cannot simply return False since we know that there could have been collisions. We are now forced to do a sequential search, starting at position 10, looking until either we find the item 20 or we find an empty slot.
A disadvantage to linear probing is the tendency for clustering; items become clustered in the table. This means that if many collisions occur at the same hash value, a number of surrounding slots will be filled by the linear probing resolution. This will have an impact on other items that are being inserted, as we saw when we tried to add the item 20 above. A cluster of values hashing to 0 had to be skipped to finally find an open position. This cluster is shown in Figure 9.
One way to deal with clustering is to extend the linear probing technique so that instead of looking sequentially for the next open slot, we skip slots, thereby more evenly distributing the items that have caused collisions. This will potentially reduce the clustering that occurs. Figure 10 shows the items when collision resolution is done with a "plus 3" probe. This means that once a collision occurs, we will look at every third slot until we find one that is empty.
The general name for this process of looking for another slot after a collision is rehashing. With simple linear probing, the rehash function is newhashvalue = rehash(oldhashvalue) where rehash(pos) = (pos + 1) % sizeoftable. The "plus 3" rehash can be defined as rehash(pos) = (pos + 3) % sizeoftable. In general, rehash(pos) = (pos + skip) % sizeoftable. It is important to note that the size of the "skip" must be such that all the slots in the table will eventually be visited. Otherwise, part of the table will be unused. To ensure this, it is often suggested that the table size be a prime number. This is the reason we have been using 11 in our examples.
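A quick demonstration of why the skip and the table size must not share a common factor (this example is my own, not from the text): with a table of size 10 and a skip of 2, probing from slot 0 only ever visits the even slots.

#include <iostream>
using namespace std;

int main() {
    int size = 10, skip = 2, pos = 0;
    // ten probes never reach an odd slot: prints 0 2 4 6 8 0 2 4 6 8
    for (int i = 0; i < 10; i++) {
        cout << pos << " ";
        pos = (pos + skip) % size;
    }
    cout << endl;
    return 0;
}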
A variation of the linear probing idea is called quadratic probing. Instead of using a constant "skip" value, we use a rehash function that increments the hash value by 1, 3, 5, 7, 9, and so on. This means that if the first hash value is h, the successive values are h+1, h+4, h+9, h+16, and so on. In other words, quadratic probing uses a skip consisting of successive perfect squares. Figure 11 shows our example values after they are placed using this technique.

An alternative method for handling the collision problem is to allow each slot to hold a reference to a collection (or chain) of items. Chaining allows many items to exist at the same location in the hash table. When collisions happen, the item is still placed in the proper slot of the hash table. As more and more items hash to the same location, the difficulty of searching for the item in the collection increases. Figure 12 shows the items as they are added to a hash table that uses chaining to resolve collisions.
When we want to search for an item, we use the hash function to generate the slot where it should reside. Since each slot holds a collection, we use a searching technique to decide whether the item is present. The advantage is that on the average there are likely to be many fewer items in each slot, so the search is perhaps more efficient. We will look at the analysis for hashing at the end of this section.
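Here is a minimal sketch of chaining built on the standard library (my own illustration, not the text's implementation): each slot holds a list, and a lookup sequentially searches only the one chain that the hash function selects.

#include <iostream>
#include <list>
#include <vector>
using namespace std;

int main() {
    vector<list<int>> table(11); // each of the 11 slots holds a chain of items
    int items[] = {54, 26, 93, 17, 77, 31, 44, 55, 20};
    for (int item : items) {
        table[item % 11].push_back(item); // a collision simply extends the chain
    }
    // membership test for 20: search only the chain at slot 20 % 11 = 9
    bool found = false;
    for (int x : table[20 % 11]) {
        if (x == 20) found = true;
    }
    cout << found << endl; // 1
    return 0;
}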
5.3. Implementing the Map Abstract Data Type
One of the most useful C++ data structures is the map. Recall that a map is an associative data type where you can store key–data pairs. The key is used to look up the associated data value.
The map abstract data type is defined as follows. The structure is an unordered collection of associations between a key and a data value. The keys in a map are all unique so that there is a one-to-one relationship between a key and a value. The operations are given below.
Table 1: Map Operations

| Map Operation | Description |
|---|---|
| Map() | Create a new, empty map. It returns an empty map collection. |
| put(key, val) | Add a new key-value pair to the map. If the key is already in the map then replace the old value with the new value. |
| get(key) | Given a key, return the value stored in the map or None otherwise. |
| del map[key] | Delete the key-value pair from the map. |
| len() | Return the number of key-value pairs stored in the map. |
| key in map | Return True if the given key is in the map, False otherwise. |
One of the great benefits of a map is the fact that given a key, we can look up the associated data value very quickly. In order to provide this fast look up capability, we need an implementation that supports an efficient search. We could use a list with sequential or binary search, but it would be even better to use a hash table as described above, since looking up an item in a hash table can approach O(1) performance.
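For comparison, the C++ standard library already ships a hash-based map, std::unordered_map, whose operations mirror the table above; a brief usage sketch:

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main() {
    unordered_map<int, string> m;
    m[54] = "cat";                     // put: add a key-value pair
    m[26] = "dog";
    m[54] = "lion";                    // put with an existing key replaces the value
    cout << m[54] << endl;             // get -> lion
    cout << m.size() << endl;          // number of key-value pairs -> 2
    cout << (m.count(93) > 0) << endl; // membership test -> 0
    m.erase(26);                       // delete the key-value pair
    return 0;
}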
In Listing 2 we use two parallel arrays (lists, in the Python version) to create a HashTable class that implements the Map abstract data type. One array, called slots, will hold the key items, and a parallel array, called data, will hold the data values. When we look up a key, the corresponding position in the data array will hold the associated data value. We will treat the key array as a hash table using the ideas presented earlier. Note that the initial size for the hash table has been chosen to be 11. Although this is arbitrary, it is important that the size be a prime number so that the collision resolution algorithm can be as efficient as possible.
Listing 2
class HashTable {
public:
    static const int size = 11; // prime table size helps collision resolution
    int slots[size];            // holds the key items
    string data[size];          // holds the data values
In Listing 3, hashfunction implements the simple remainder method. The collision resolution technique is linear probing with a "plus 1" rehash function. The put function assumes that there will eventually be an empty slot unless the key is already present in the slots. It computes the original hash value and, if that slot is not empty, iterates the rehash function until an empty slot occurs. If a nonempty slot already contains the key, the old data value is replaced with the new data value. Dealing with the situation where there are no empty slots left is an exercise.
Listing 3
def put(self, key, data):
    hashvalue = self.hashfunction(key, len(self.slots))

    if self.slots[hashvalue] == None:
        self.slots[hashvalue] = key
        self.data[hashvalue] = data
    else:
        if self.slots[hashvalue] == key:
            self.data[hashvalue] = data  # replace
        else:
            nextslot = self.rehash(hashvalue, len(self.slots))
            while self.slots[nextslot] != None and \
                    self.slots[nextslot] != key:
                nextslot = self.rehash(nextslot, len(self.slots))

            if self.slots[nextslot] == None:
                self.slots[nextslot] = key
                self.data[nextslot] = data
            else:
                self.data[nextslot] = data  # replace

def hashfunction(self, key, size):
    return key % size

def rehash(self, oldhash, size):
    return (oldhash + 1) % size
The corresponding C++ implementation:

int hashfunction(int key) {
    return key % size;
}

int rehash(int oldhash) {
    return (oldhash + 1) % size;
}

void put(int key, string val) {
    int hashvalue = hashfunction(key);
    int count = 0;
    if (data[hashvalue] == "") {       // empty slot: store directly
        slots[hashvalue] = key;
        data[hashvalue] = val;
    } else {
        if (slots[hashvalue] == key) { // same key: replace the value
            data[hashvalue] = val;
        } else {                       // collision: probe with rehash
            int nextslot = rehash(hashvalue);
            while (data[nextslot] != "" && slots[nextslot] != key) {
                nextslot = rehash(nextslot);
                count++;
                if (count > size) {
                    cout << "TABLE FULL" << endl;
                    return;
                }
            }
            if (data[nextslot] == "") {
                slots[nextslot] = key;
                data[nextslot] = val;
            } else {
                data[nextslot] = val;  // replace
            }
        }
    }
}
Likewise, the get function (see Listing 4) begins by computing the initial hash value. If the value is not in the initial slot, rehash is used to locate the next possible position. Notice that the check position == startslot guarantees that the search will terminate: if we have returned to the initial slot, we have exhausted all possible slots and the item must not be present.

The final methods of the HashTable class provide additional map functionality. In the Python version we overload the __getitem__ and __setitem__ methods to allow access using []. This means that once a HashTable has been created, the familiar index operator will be available. We leave the remaining methods as exercises.
Listing 4
def get(self, key):
    startslot = self.hashfunction(key, len(self.slots))

    data = None
    stop = False
    found = False
    position = startslot
    while self.slots[position] != None and not found and not stop:
        if self.slots[position] == key:
            found = True
            data = self.data[position]
        else:
            position = self.rehash(position, len(self.slots))
            if position == startslot:
                stop = True
    return data

def __getitem__(self, key):
    return self.get(key)

def __setitem__(self, key, data):
    self.put(key, data)
And in C++:

string get(int key) {
    int startslot = hashfunction(key);

    string val;
    bool stop = false;
    bool found = false;
    int position = startslot;
    while (data[position] != "" && !found && !stop) {
        if (slots[position] == key) {
            found = true;
            val = data[position];
        } else {
            position = rehash(position);
            if (position == startslot) {
                stop = true;
            }
        }
    }
    return val;
}
The following session shows the HashTable class in action. First we will create a hash table and store some items with integer keys and string data values. (Printing the table with cout << h assumes the << operator has been overloaded for HashTable; that overload is one of the programming exercises below.)

int main() {
    HashTable h;
    h.put(54, "cat");
    h.put(26, "dog");
    h.put(93, "lion");
    h.put(17, "tiger");
    h.put(77, "bird");
    h.put(31, "cow");
    h.put(44, "goat");
    h.put(55, "pig");
    h.put(20, "chicken");
    cout << h << endl;
    return 0;
}
>> Output:
77: bird
44: goat
55: pig
20: chicken
26: dog
93: lion
17: tiger
0:
0:
31: cow
54: cat
Next we will access and modify some items in the hash table. Note that the value for the key 20 is being replaced.
...
h.put(20,"chicken");
h.put(17,"tiger");
h.put(20,"duck");
cout<<h<<endl;
...
>> Output:
77: bird
44: goat
55: pig
20: duck
26: dog
93: lion
17: tiger
65535:
0:
31: cow
54: cat
The complete hash table example can be found in ActiveCode 1.
class HashTable:
    def __init__(self):
        self.size = 11  # initial size of hash table is prime to help with collision resolution
        self.slots = [None] * self.size  # list to hold key items
        self.data = [None] * self.size   # list to hold data values

    def put(self, key, data):
        hashvalue = self.hashfunction(key, len(self.slots))

        if self.slots[hashvalue] == None:
            self.slots[hashvalue] = key
            self.data[hashvalue] = data
        else:
            if self.slots[hashvalue] == key:
                self.data[hashvalue] = data  # replace
            else:
                nextslot = self.rehash(hashvalue, len(self.slots))
                while self.slots[nextslot] != None and \
                        self.slots[nextslot] != key:
                    nextslot = self.rehash(nextslot, len(self.slots))

                if self.slots[nextslot] == None:
                    self.slots[nextslot] = key
                    self.data[nextslot] = data
                else:
                    self.data[nextslot] = data  # replace

Complete Hash Table Example (complete_hash_py)
5.4. Analysis of Hashing
We stated earlier that in the best case hashing would provide an O(1), constant time search technique. However, due to collisions, the number of comparisons is typically not so simple. Even though a complete analysis of hashing is beyond the scope of this text, we can state some well-known results that approximate the number of comparisons necessary to search for an item.

The most important piece of information we need to analyze the use of a hash table is the load factor, λ, the number of items in the table divided by the table size. Conceptually, if λ is small, then there is a lower chance of collisions, meaning that items are more likely to be in the slots where they belong. If λ is large, meaning that the table is filling up, then there are more and more collisions. This means that collision resolution is more difficult, requiring more comparisons to find an empty slot. With chaining, increased collisions means an increased number of items on each chain.

As before, we will have a result for both a successful and an unsuccessful search. For a successful search using open addressing with linear probing, the average number of comparisons is approximately (1/2)(1 + 1/(1 - λ)), and an unsuccessful search gives (1/2)(1 + (1/(1 - λ))^2). If we are using chaining, the average number of comparisons is 1 + λ/2 for the successful case, and simply λ comparisons if the search is unsuccessful. For example, at λ = 0.5 a successful search under linear probing takes about (1/2)(1 + 1/0.5) = 1.5 comparisons on average.
6. Summary
- A sequential search is O(n) for ordered and unordered lists.
- A binary search of an ordered list is O(log n) in the worst case.
- Hash tables can provide constant time, O(1), searching.
7. Discussion Questions
1. Using the hash table performance formulas given in the chapter, compute the average number of comparisons necessary when the table is
   - 10% full
   - 25% full
   - 50% full
   - 75% full
   - 90% full
   - 99% full
   At what point do you think the hash table is too small? Explain.
2. Modify the hash function for strings to use positional weightings.
3. We used a hash function for strings that weighted the characters by position. Devise an alternative weighting scheme. What are the biases that exist with these functions?
4. Research perfect hash functions. Using a list of names (classmates, family members, etc.), generate the hash values using the perfect hash algorithm.
8. Programming Exercises
1. Set up a random experiment to test the difference between a sequential search and a binary search on a list of integers.
2. Use the binary search functions given in the text (recursive and iterative). Generate a random, ordered list of integers and do a benchmark analysis for each one. What are your results? Can you explain them?
3. Implement the binary search using recursion without the slice operator. Recall that you will need to pass the list along with the starting and ending index values for the sublist. Generate a random, ordered list of integers and do a benchmark analysis.
4. Overload the cout operator (<<) for the hash table Map ADT implementation.
5. Overload the cin operator (>>) for the hash table Map ADT implementation.
6. How can you delete items from a hash table that uses chaining for collision resolution? How about if open addressing is used? What are the special circumstances that must be handled? Implement the del (~) operator for the HashTable class.
7. In the hash table map implementation, the hash table size was chosen to be 11. If the table gets full, this needs to be increased. Re-implement the put method so that the table will automatically resize itself when the loading factor reaches a predetermined value (you can decide the value based on your assessment of load versus performance).
8. Implement quadratic probing as a rehash technique.
9. Using a random number generator, create a list of 500 integers. Perform a benchmark analysis using some of the sorting algorithms from this chapter. What is the difference in execution speed?
10. Implement the bubble sort using simultaneous assignment.
11. A bubble sort can be modified to "bubble" in both directions. The first pass moves "up" the list, and the second pass moves "down". This alternating pattern continues until no more passes are necessary. Implement this variation and describe under what circumstances it might be appropriate.
12. Implement the selection sort using simultaneous assignment.
13. Perform a benchmark analysis for a shell sort, using different increment sets on the same vector.
14. One way to improve the quick sort is to use an insertion sort on lists that have a small length (call it the "partition limit"). Why does this make sense? Re-implement the quick sort and use it to sort a random list of integers. Perform an analysis using different list sizes for the partition limit.
15. Implement the median-of-three method for selecting a pivot value as a modification to quickSort. Run an experiment to compare the two techniques.
9. Glossary
- binary search
search method in which one repeatedly divides a sorted data structure in half, determining which half the item must lie in, until the item is found or deemed not to be in the data
- chaining
a method of collision resolution in which each slot in a hash table holds a reference to a collection of items
- collision
a conflict of having two or more items sharing the same slot in a hash table
- collision resolution
a systematic method for resolving hash table collisions
- clustering
the tendency of items to be mapped near each other in a hash table when collisions are resolved with linear probing, so that colliding items bunch together in neighboring slots
- folding method
a method for constructing a hash function by dividing the item into equally sized pieces and then adding the pieces together to get a hash value. The value is then divided by the size of the hash table and the remainder becomes the slot for that item.
- hashing
generating a value from an input that can later be used to find the input by searching for the value
- hash function
the mapping between an item and its slot in a hash table
- hash table
a collection of items which are stored in such a way as to make it easy to find them
- linear probing
an open addressing technique in which each slot is visited one at a time systematically
- load factor
represents how full a hash table is; the number of items in a hash table divided by the size of the table
- map
an associative data type that stores key-data pairs
- mid-square method
a method for constructing a hash function by squaring the item and then using some portion of the result.
- open addressing
a collision resolution that tries to find the next open slot/address in the hash table
- perfect hash function
a hash function that maps each item to a unique hash slot
- quadratic probing
a variation of linear probing in which rehashing is done using successive squared values
- rehashing
putting an item into a hash table after a collision
- searching
the algorithmic process of finding a particular item in a collection of items
- sequential search
search method in which one follows the underlying ordering of items in a collection of data to find a specific item
- slot
a position in a hash table