Hashing
Searching techniques:
- Sequential search: O(n)
- Binary search: O(log2(n))
There is at least one condition where a binary search does not work very well – when the data that you are searching for is on a disk, and not in main memory. When accessing data on a disk, the slowest part is accessing the disk drive, and the goal is to minimize the number of disk accesses needed to find the desired data.
Obviously, the best possible case would be if you could find the data on the first try every time! There is no general method for doing this, but there is a way of coming close: hashing.
Hashing: the process of taking a key and applying an algorithm to it to come up with an address.
Ideal hashing
Consider a town with a population of 9,000 where everybody's telephone number begins with the same 3-digit prefix. We want to be able to use everybody's phone number as a key to look up the person's name and address. And we want to find the name and address on the first guess.
One solution is to create an array of 10,000 items (numbered from 0 to 9,999) and use the last 4 digits of the phone numbers (phone number mod 10,000) as keys.
So if your number is 555-1234, your information would be stored in location 1234 in the array. If the number of names and addresses is close to 10,000 (as it is in this case), almost all of the array would be full, and we would have very little wasted space. In return for the wasted space, we would be able to go directly to the item that we are looking for on the first try!
It doesn't get any better than this! We will always find the item we are looking for on the first try.
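The idea can be sketched in a few lines of Java. This is only an illustration: the class name PhoneDirectory, its methods, and the sample record are made up for this example, and phone numbers are assumed to be stored as 7-digit integers.

```java
class PhoneDirectory {
    // One slot for every possible 4-digit suffix: 0000 through 9999.
    private final String[] entries = new String[10_000];

    // The "hash function": keep only the last 4 digits of the phone number.
    static int indexFor(int phoneNumber) {
        return phoneNumber % 10_000;
    }

    void put(int phoneNumber, String nameAndAddress) {
        entries[indexFor(phoneNumber)] = nameAndAddress;
    }

    String get(int phoneNumber) {
        return entries[indexFor(phoneNumber)];
    }

    public static void main(String[] args) {
        PhoneDirectory directory = new PhoneDirectory();
        directory.put(5551234, "Pat Example, 1 Sample Lane"); // made-up record
        System.out.println(indexFor(5551234));      // prints 1234
        System.out.println(directory.get(5551234)); // prints Pat Example, 1 Sample Lane
    }
}
```

Both put and get touch exactly one array location, which is what "finding it on the first try" means here.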
However, not all keys will convert into an address (or array index) as easily.
Typical Hashing
In real life, things don't always work out this easily. Consider the same example as above, but for a much smaller town, say with a population of 700. If we allocate 10,000 memory locations, we will be wasting 9,300 of them (93%).
One possible solution is to apply the same algorithm that we did before, except make our array size 700 (0 to 699). Since we need to map (convert) each key into a number in the range 0 to 699, we could take the last 4 digits of the phone number (mod 10,000) and divide it by 700 and take the remainder (mod 700) to get a number in the range 0 to 699. However, unless the phone numbers are 555-0000 through 555-0699 (or something similar), we are going to have a problem. It is very likely that we will have two phone numbers that, when divided by 700, produce the same remainder.
Example
The phone numbers 555-0123 and 555-0823 both produce a remainder of 123 when their last four digits (0123 and 0823) are divided by 700. Only one of them can be put in location 123 in the array. So what happens when we try to put another one in the array? We get a collision.
Collision: the mapping of two keys into the same hash index.
Collisions are bad.
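To see the collision concretely, the short sketch below applies the two-step mapping (mod 10,000, then mod 700) to both phone numbers from the example. The class and method names are hypothetical.

```java
class CollisionDemo {
    // Map a 7-digit phone number to an index in a table of the given size.
    static int index(int phoneNumber, int tableSize) {
        int lastFour = phoneNumber % 10_000; // e.g. 5550823 -> 823
        return lastFour % tableSize;         // e.g. 823 % 700 -> 123
    }

    public static void main(String[] args) {
        System.out.println(index(5550123, 700)); // prints 123
        System.out.println(index(5550823, 700)); // prints 123 (a collision)
    }
}
```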
In the real world, perfect data occurs very infrequently. In the real world, collisions are going to occur. So we need to decide:
- How to minimize their occurrence.
- How to handle them when they do occur.
General characteristics of a hashing function
A good hashing function should:
- Minimize collisions
- Distribute entries uniformly throughout the hash table
- Be fast to compute
Java's built-in hashCode method
The Object class has a built-in method called hashCode, which returns an integer. By default, this integer is typically based on the object's identity (often its address in main memory) rather than on its contents. However, this is not a good hashing algorithm to use for objects that are compared by value. The reason is that we could have two different objects that have the same values. These objects would hash into different locations in a hash table (because their addresses are different), even though they should hash into the same location in the hash table (because their values are the same).
So we should override the built-in hashCode method with one of our own. Our own hash function should:
- Provide equal hash codes for Objects that have equal values.
- Always produce the same hash code for the same data values.
- Evenly distribute the keys through the range of possible hash indexes.
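As a rough sketch of what such an override might look like, the hypothetical PhoneEntry class below bases both equals and hashCode on the object's field values, using the standard java.util.Objects.hash helper. The class and its fields are assumptions made for illustration only.

```java
import java.util.Objects;

final class PhoneEntry {
    private final String phoneNumber;
    private final String address;

    PhoneEntry(String phoneNumber, String address) {
        this.phoneNumber = phoneNumber;
        this.address = address;
    }

    // Two entries are equal when their field values are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PhoneEntry)) return false;
        PhoneEntry that = (PhoneEntry) other;
        return phoneNumber.equals(that.phoneNumber) && address.equals(that.address);
    }

    // The hash code depends only on the same field values used by equals.
    @Override
    public int hashCode() {
        return Objects.hash(phoneNumber, address);
    }

    public static void main(String[] args) {
        PhoneEntry a = new PhoneEntry("555-1234", "150 Main Street");
        PhoneEntry b = new PhoneEntry("555-1234", "150 Main Street");
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```

Because the hash code is computed from the values and not from the address, two separate objects with the same values land in the same location of a hash table.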
Hash codes for strings
A hashing function is supposed to convert a key into an integer. If the key is a string, we need to find a way to convert the string into an integer. The most common way is to take the ASCII or Unicode value of each character (note that for ASCII characters, the Unicode value is the same as the ASCII code). We could just add up the values, but unfortunately this doesn't distribute the keys very evenly; for example, any two strings made of the same characters in a different order would get the same sum. A better solution is to use the following formula:
$u_0 g^{n-1} + u_1 g^{n-2} + \cdots + u_{n-2} g^{1} + u_{n-1} g^{0}$
where the u's represent the codes of the characters of the string, and the g is some constant, and n is the number of characters in the string.
This formula can be converted into the following which reduces the number of arithmetic operations that need to be done:
$(\cdots((u_0 g + u_1) g + u_2) g + \cdots + u_{n-2}) g + u_{n-1}$
This is written in Java like this:
import java.util.Scanner;

public class HashTest
{
    public static void main (String[] args)
    {
        Scanner input = new Scanner(System.in);
        System.out.print("Enter a string: ");
        String s = input.nextLine();
        int g = 31;     // the constant multiplier
        int hash = 0;
        int n = s.length();
        // Multiply the running hash by g, then add the next character code.
        for (int i = 0; i < n; i++)
        {
            System.out.println("Code = " + (int) s.charAt(i));
            hash = g * hash + s.charAt(i);
            System.out.println("Hash = " + hash);
        }
    }
}
Note:
- The variable g could be any number; this program uses 31.
- The int value of the char returned by charAt is its Unicode (UTF-16) code, which for ASCII characters is the same as the ASCII code value.
- You can use charAt(i) in an arithmetic expression and Java will use its integer value!
The output from this program is:
Enter a string: Hello
Code = 72
Hash = 72
Code = 101
Hash = 2333
Code = 108
Hash = 72431
Code = 108
Hash = 2245469
Code = 111
Hash = 69609650
Note that for long strings, you can get an overflow. Java will ignore integer overflows and just keep the low-order bits of the result! You do not get an error message!
The String class has a built-in hashCode function that uses a value of 31 for g.
The hash code for the string "Hello" is 69,609,650.
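A quick way to check this is to compare the loop from the program above against String.hashCode directly. The class name and the helper name polynomialHash are made up for this sketch.

```java
class StringHashCheck {
    // The same loop as in HashTest, wrapped in a method.
    static int polynomialHash(String s, int g) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = g * hash + s.charAt(i); // int overflow silently wraps around
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("Hello", 31)); // prints 69609650
        System.out.println("Hello".hashCode());          // prints 69609650

        // A longer string overflows the int range, but both computations
        // overflow in the same way, so the results still agree.
        String longText = "a much longer string that overflows the int range";
        System.out.println(polynomialHash(longText, 31) == longText.hashCode()); // prints true
    }
}
```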
Hash codes for primitive types
If the type is byte, short, or char, cast it into an int.
If the type is long, you can convert it into an int by dividing by 2^32 and taking the remainder (this will give you the low-order 32 bits of the number, which will fit into an int). Casting a long to an int in Java keeps exactly these low-order 32 bits.
If the type is double, you can first convert its bit pattern into a long like this:
long bits = Double.doubleToLongBits(n);
Hashing method: Folding
Another alternative is folding. Folding involves dividing the key into several parts and then combining the parts.
Example
We can get the leftmost 32 bits by using the right-shift operator in Java: >>. So to shift a long variable called key to the right 32 bits, we would write this:
key >> 32
We can then add these bits to the rightmost bits and cast it to a 32-bit integer:
hashCode = (int) (key + (key >> 32));
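Here is a small sketch of this additive folding. Note that the cast must be applied to the whole sum (hence the extra parentheses); otherwise the result is still a long and does not fit in an int. The method name foldByAddition and the sample key are assumptions, chosen so that the two halves are easy to see.

```java
class FoldLong {
    // Add the leftmost 32 bits to the rightmost 32 bits and keep 32 bits of the sum.
    static int foldByAddition(long key) {
        return (int) (key + (key >> 32));
    }

    public static void main(String[] args) {
        long key = 0x0000000100000002L; // leftmost half is 1, rightmost half is 2
        System.out.println(foldByAddition(key)); // prints 3
    }
}
```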
Example
We can also create a hash code for doubles. However, casting a double to a long or an int would keep only the integer part of the value, and would discard the fractional part. Keeping the fractional part would be more likely to generate a "more random" number (it seems like the more bits, the more random the resulting hash value should be). The Double wrapper class has the following method that returns the 64-bit pattern of a double as a long integer, which we can then fold:
long bits = Double.doubleToLongBits(key);
int hashCode = (int) (bits ^ (bits >> 32));
Things to note:
- A double value in Java is represented by a 64-bit bit pattern.
- A long value in Java is also represented by a 64-bit bit pattern.
- Since they are both the same size, we can convert the double to a long without losing any bits. The result will be a 64-bit long value.
- We can then shift the number 32 bits to the right and perform an exclusive-or operation on the leftmost 32 bits with the rightmost 32 bits.
There are two types of or operations – an inclusive or and an exclusive-or. Inclusive-or means one or the other or both. The exclusive-or means one or the other, but not both.
The truth table for the exclusive-or function is:
A | B | A xor B
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
The last row is where it differs from the inclusive or.
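The exclusive-or folding of a double described above can be tried out with the sketch below. The class and method names are hypothetical; Double.doubleToLongBits and Double.hashCode are standard library methods, and the comparison with Double.hashCode is included only as a sanity check.

```java
class DoubleHash {
    static int hash(double key) {
        long bits = Double.doubleToLongBits(key); // the 64-bit pattern of the double
        return (int) (bits ^ (bits >> 32));       // XOR the leftmost half into the rightmost half
    }

    public static void main(String[] args) {
        System.out.println(hash(1.0));            // prints 1072693248
        System.out.println(Double.hashCode(1.0)); // prints 1072693248
        System.out.println(1 ^ 1);                // prints 0 (last row of the truth table)
        System.out.println(1 | 1);                // prints 1 (inclusive or, for comparison)
    }
}
```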
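Hashing Method: Digit Analysis
Table: frequency of occurrence of the digits 0..9 in each key position, for 2,800 five-digit keys
Digit | Pos. 1 | Pos. 2 | Pos. 3 | Pos. 4 | Pos. 5
0 | 2026 | 250 | 218 | 1012 | 260
1 | 618 | 395 | 391 | 185 | 382
2 | 128 | 263 | 389 | 299 | 271
3 | 23 | 298 | 330 | 52 | 302
4 | 5 | 298 | 330 | 52 | 302
5 | 0 | 335 | 299 | 101 | 387
6 | 0 | 303 | 339 | 18 | 199
7 | 0 | 289 | 308 | 124 | 301
8 | 0 | 267 | 267 | 999 | 245
9 | 0 | 400 | 259 | 0 | 353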
Analysis:
"By this method a frequency count is performed in regard to the number of times each of the 10 digits occurs in each of the positions included in the record key. For example, consider the following table showing the number of times each digit occurred in a five-position numeric key for 2,800 records. In this tabulation we can observe that digits 0..9 occur with approximately uniform distribution in key positions 2, 3, and 5; therefore, if a 3-digit address were required, the digits in these three positions in the record keys could be used. Given that there are 2,800 records, however, a four-digit address would be required. Suppose we desire the first digit to be a 0, 1, 2, or 3 only. Such assignment can be made with about equal frequency for each digit by using a rule such as the following: assign a 0 when digits in positions 2 and 3 both contain odd numbers, a 1 if position 2 is odd and position 3 is even, a 2 if position 2 is even and position 3 is odd, or a 3 if positions 2 and 3 both contain even numbers. Thus, the address for key 16258 would be 3628: the 3 from the fact that positions 2 and 3 both contain even numbers and the 628 from key positions 2, 3, and 5. Other rules for prefixing additional digits can be formulated for different circumstances. In any event, the digit analysis method relies on the digits in some of the key positions being approximately equally distributed. If such is not the case, the method cannot be used with good results. " (Philippakis)
Hashing Method: Mid-Square Method
The record key is multiplied by itself, and the product is truncated from both left and right so as to form a number equal to the desired address length. Thus, key 36,258 would be squared to give 1,314,642,564. To form a four-digit address, this number would be truncated from both the left and the right, resulting in the address 4642.
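The arithmetic in this example can be reproduced with the sketch below. It assumes a five-digit key whose square has ten digits, so that dropping three digits on each side leaves the middle four; other key lengths would need different divisors. The class and method names are made up.

```java
class MidSquare {
    static int address(long key) {
        long square = key * key;                  // 36258 * 36258 = 1314642564
        return (int) ((square / 1_000) % 10_000); // drop 3 right digits, keep the next 4
    }

    public static void main(String[] args) {
        System.out.println(address(36258)); // prints 4642
    }
}
```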
Converting a Hash Code into an Index for the Hash Table
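The most common way to convert a hash code into an index into a hash table is to divide the hash code by the size of the hash table and take the remainder:
index = hashCode % tableSize;
Note that in Java the % operator gives a negative result when hashCode is negative, so a negative remainder must be adjusted (for example, by adding tableSize) before it can be used as an index.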
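A minimal sketch of this conversion, using the standard Math.floorMod method so that the index always falls in the range 0 to tableSize - 1, even for negative hash codes. The class and method names are hypothetical.

```java
class IndexFromHash {
    static int indexFor(int hashCode, int tableSize) {
        return Math.floorMod(hashCode, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(69609650, 700)); // prints 250 (hash code of "Hello")
        System.out.println(-7 % 5);                  // prints -2 (plain % can be negative)
        System.out.println(indexFor(-7, 5));         // prints 3
    }
}
```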
Resolving collisions
The problem with hashing is that we are mapping a LARGE key space into a SMALL address space. Whenever we do this, we are bound to have COLLISIONS (sometimes called clashes). When a collision occurs, we have a problem. We still have to put the item into the table (or dictionary), but we can't put it into the location that it hashed to because it's already occupied. So the challenge is to figure out where to put it.
COLLISION/CLASH: when 2 records hash into the same address.
So, how do you resolve the problem?
You need to make sure your table is big enough for all of the data that you will want to put into it. That means that you must have some idea of how much data you will be storing. If we leave more than enough space, all we have to do is find an extra space. One solution to this problem is a Linear probe.
Open Addressing with a Linear probe
LINEAR PROBE: a search of subsequent memory locations until an open slot is found.
A linear probe simply looks for the next available location. If the key hashes into location index, then we look at location index + 1. If that is occupied, we look at location index + 2, etc., until we find a free location.
Note that if we are at the end of the array, we must "wrap around" to position 0 and continue from there. That is, we treat the array as if it were circular.
For a linear probe to work, at creation time, mark each hash table element as being unused (e.g. by putting the value null there).
Now assume that two values hash into the same location. The first one will get to occupy that location. The second one, however, must look at the next (hopefully few) places until it finds an unoccupied one.
Example: Adding and Retrieving (from Carrano)
Assume that all 4 of the following key/data pairs hash to the same location (52):
- "555-1214", "150 Main Street"
- "555-8132", "75 Center Court"
- "555-4294", "205 Ocean Road"
- "555-2072", "82 Campus Way"
We will put "150 Main Street" into location 52.
We will put "75 Center Court" into location 53 (if it is available).
We will put "205 Ocean Road" into location 54 (if it is available).
We will put "82 Campus Way" into location 55 (if it is available).
Now, consider what happens when we want to retrieve the data associated with phone number 555-2072.
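Our hash function will take us to location 52, which holds the data for 555-1214! How do we know whether the data at a location is the data we want? From the data alone, we don't!
The only way that you can know that you have retrieved the correct data is if you also store the key with the data! Then you can compare the stored key with the search key at locations 52, 53, 54, and so on until they match.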
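The adding and retrieving steps can be sketched as follows. This is a simplified illustration, not Carrano's implementation: the class name LinearProbeTable is made up, the table has no removal, and the hash function is passed in so that the demo can force every key to location 52 as in the example.

```java
import java.util.function.ToIntFunction;

class LinearProbeTable {
    private final String[] keys;   // the key is stored alongside the data
    private final String[] values;
    private final ToIntFunction<String> hash;

    LinearProbeTable(int size, ToIntFunction<String> hash) {
        keys = new String[size];   // null marks an unused location
        values = new String[size];
        this.hash = hash;
    }

    // Stores the pair and returns the location that was used.
    int put(String key, String value) {
        int home = Math.floorMod(hash.applyAsInt(key), keys.length);
        for (int n = 0; n < keys.length; n++) {
            int i = (home + n) % keys.length; // wrap around at the end of the array
            if (keys[i] == null || keys[i].equals(key)) {
                keys[i] = key;
                values[i] = value;
                return i;
            }
        }
        throw new IllegalStateException("table is full");
    }

    // Probes from the home location, comparing stored keys with the search key.
    String get(String key) {
        int home = Math.floorMod(hash.applyAsInt(key), keys.length);
        for (int n = 0; n < keys.length; n++) {
            int i = (home + n) % keys.length;
            if (keys[i] == null) return null;          // unused location: not found
            if (keys[i].equals(key)) return values[i]; // keys match: found
        }
        return null;
    }

    public static void main(String[] args) {
        LinearProbeTable table = new LinearProbeTable(101, key -> 52);
        System.out.println(table.put("555-1214", "150 Main Street")); // prints 52
        System.out.println(table.put("555-8132", "75 Center Court")); // prints 53
        System.out.println(table.put("555-4294", "205 Ocean Road"));  // prints 54
        System.out.println(table.put("555-2072", "82 Campus Way"));   // prints 55
        System.out.println(table.get("555-2072"));                    // prints 82 Campus Way
    }
}
```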
Example: Deleting
Suppose we remove the objects in locations 53 and 54 by replacing their values with null. Now when we go to retrieve the value for the key "555-2072", if we stop when we find a null value, we won't find it!
There are two kinds of empty spaces in a hash table:
- Spaces that have never been occupied (and should end a search), and
- Spaces that were occupied at some point but have since been emptied (and should NOT end a search).
Therefore, when we remove an item from a hash table, we should NOT replace it with null, but we should mark it in some way as being available.
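One possible way to mark removed entries is to keep a separate state for each location. The sketch below is an assumption-laden illustration: the class name TombstoneTable, the three state names, and the pluggable hash function (used here to force every key to location 52) are all made up for this example.

```java
import java.util.Arrays;
import java.util.function.ToIntFunction;

class TombstoneTable {
    private enum State { EMPTY, OCCUPIED, AVAILABLE }

    private final State[] states;
    private final String[] keys;
    private final String[] values;
    private final ToIntFunction<String> hash;

    TombstoneTable(int size, ToIntFunction<String> hash) {
        states = new State[size];
        Arrays.fill(states, State.EMPTY); // nothing has ever been stored yet
        keys = new String[size];
        values = new String[size];
        this.hash = hash;
    }

    private int home(String key) {
        return Math.floorMod(hash.applyAsInt(key), keys.length);
    }

    // Returns the location holding key, or -1 if the key is not in the table.
    private int find(String key) {
        for (int n = 0; n < keys.length; n++) {
            int i = (home(key) + n) % keys.length;
            if (states[i] == State.EMPTY) return -1; // never occupied: end the search
            if (states[i] == State.OCCUPIED && keys[i].equals(key)) return i;
            // AVAILABLE: an item was removed here, so keep probing
        }
        return -1;
    }

    void put(String key, String value) {
        int slot = find(key);
        for (int n = 0; slot < 0 && n < keys.length; n++) {
            int i = (home(key) + n) % keys.length;
            if (states[i] != State.OCCUPIED) slot = i; // reuse EMPTY or AVAILABLE
        }
        if (slot < 0) throw new IllegalStateException("table is full");
        states[slot] = State.OCCUPIED;
        keys[slot] = key;
        values[slot] = value;
    }

    String get(String key) {
        int slot = find(key);
        return slot < 0 ? null : values[slot];
    }

    void remove(String key) {
        int slot = find(key);
        if (slot >= 0) {
            states[slot] = State.AVAILABLE; // mark as available, NOT as never used
            keys[slot] = null;
            values[slot] = null;
        }
    }

    public static void main(String[] args) {
        TombstoneTable table = new TombstoneTable(101, key -> 52);
        table.put("555-1214", "150 Main Street");
        table.put("555-8132", "75 Center Court");
        table.put("555-4294", "205 Ocean Road");
        table.put("555-2072", "82 Campus Way");
        table.remove("555-8132"); // location 53 becomes AVAILABLE
        table.remove("555-4294"); // location 54 becomes AVAILABLE
        System.out.println(table.get("555-2072")); // prints 82 Campus Way
    }
}
```

The search for "555-2072" passes over locations 53 and 54 because they are marked as available, not as never used, and so it still reaches location 55.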
Clustering
When you add records and a lot of them end up in the same part of the table, this can severely slow down your searching. You will end up with areas of your table that have clusters of filled entries, and other areas of your table that have very few entries. This is called clustering.
A way to scatter out the records that hash into the same location is to use a quadratic probe instead of a linear probe. Instead of looking at position k+1, k+2, k+3, etc., a quadratic probe looks at positions k+1, k+4, k+9, k+16, k+25, etc.
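The probe positions can be listed with a short sketch like the one below. The names are hypothetical, and the positions wrap around the end of the table in the same way as for a linear probe.

```java
import java.util.Arrays;

class QuadraticProbe {
    // Returns the first 'count' locations probed after the home location k.
    static int[] probeSequence(int k, int tableSize, int count) {
        int[] sequence = new int[count];
        for (int n = 1; n <= count; n++) {
            sequence[n - 1] = (k + n * n) % tableSize; // k+1, k+4, k+9, ...
        }
        return sequence;
    }

    public static void main(String[] args) {
        // prints [53, 56, 61, 68, 77]
        System.out.println(Arrays.toString(probeSequence(52, 101, 5)));
        // prints [96, 99, 3, 10, 19] (wraps around past location 100)
        System.out.println(Arrays.toString(probeSequence(95, 101, 5)));
    }
}
```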
A Potential Problem with Open Addressing
It is possible with the methods that we just described that, after many additions and deletions, ALL of the records in a table will be marked either as occupied or available and that no table entries are marked as empty, or null. Then, if we have a clash and have to search for another location to put the data, we may actually end up searching the entire table! This is not good – it will be very slow.
Separate Chaining
Another alternative is to allow more than one data item to go in a single table entry. When you do this, each location in the table is called a bucket and each table entry points to a linked list of data items that all hashed into this spot.
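A bare-bones sketch of separate chaining is shown below. The class name ChainedTable and its methods are assumptions for illustration; the demo uses a table with a single bucket so that every key is guaranteed to collide.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    // One key/value pair stored in a bucket's linked list.
    private static final class Entry {
        final String key;
        String value;

        Entry(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) {
            buckets.add(new LinkedList<Entry>());
        }
    }

    private LinkedList<Entry> bucketFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    void put(String key, String value) {
        LinkedList<Entry> bucket = bucketFor(key);
        for (Entry e : bucket) {
            if (e.key.equals(key)) { // key already present: replace its value
                e.value = value;
                return;
            }
        }
        bucket.add(new Entry(key, value)); // collision or not, just extend the list
    }

    String get(String key) {
        for (Entry e : bucketFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(1); // one bucket, so every key collides
        table.put("555-1214", "150 Main Street");
        table.put("555-8132", "75 Center Court");
        System.out.println(table.get("555-1214")); // prints 150 Main Street
        System.out.println(table.get("555-8132")); // prints 75 Center Court
    }
}
```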
The Load Factor
Our first example used a perfect hashing function: there were never any collisions, and there were very few unused table locations. In real life, this almost never happens.
The load factor of a hash table is the number of entries stored in the table divided by the number of table locations. We will have collisions, but we want to minimize the number of collisions that will occur. There are several things we can do to keep the number of collisions low:
- Make the table larger than the number of keys (so there will ALWAYS be unused locations).
- Try to develop a hashing function that will truly evenly distribute the keys over the hash table.
- Use a prime number for the size of the hash table.
Hash Table Size
How big should you make the hash table?
If you use open addressing, you should try to keep the table less than half full.
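If you use separate chaining, the table size is less critical, because each bucket can hold any number of items; however, a smaller table means longer chains and slower lookups.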
It is also recommended that the size of the hash table be a prime number.
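These two recommendations can be combined in a small helper. The sketch below, with made-up names, picks the smallest prime table size that keeps an open-addressing table less than half full, using the 700-entry town from the earlier example.

```java
class TableSizing {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Smallest prime that is greater than or equal to n.
    static int nextPrime(int n) {
        int candidate = Math.max(n, 2);
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }

    // Smallest prime size that leaves the table less than half full.
    static int openAddressingSize(int expectedEntries) {
        return nextPrime(2 * expectedEntries + 1);
    }

    // Fraction of the table that is in use.
    static double loadFactor(int entries, int tableSize) {
        return (double) entries / tableSize;
    }

    public static void main(String[] args) {
        int size = openAddressingSize(700);
        System.out.println(size);                  // prints 1409
        System.out.println(loadFactor(700, size)); // just under 0.5
    }
}
```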
Advantage of Hashing: Lookup is fast.
Disadvantage of Hashing: The data cannot be retrieved in sorted order.
1/8/2017