Summary
Open Addressing
In Hashing I we used chaining (linked lists at each bucket). What if we were constrained to just a single array? Instead of getting a single hash value, each key $k$ gets a probing sequence, a permutation of the table indices $\{0, 1, \dots, m-1\}$:

$$h(k, 0),\ h(k, 1),\ \dots,\ h(k, m-1)$$

where $h(k,0)$ is the most preferred slot, $h(k,1)$ is next, etc. We try each in order until we find an empty spot.
- Insert(k,v) - find the smallest $i$ with $T[h(k,i)]$ empty. Put $(k,v)$ there.
- Search(k) - walk through $h(k,0), h(k,1), \dots$ until we hit an empty slot (key not found) or find $k$ (return it).
- Delete(k) - find the key and set that slot back to empty.
The problem with Delete is that setting a slot back to empty can break later searches: if key $k'$ was inserted after key $k$ and was placed past $k$'s slot (because $k$ was in the way), deleting $k$ creates a gap that causes Search(k') to stop too early.
The fix is to introduce a DELETED flag. Only an empty slot terminates a search, but DELETED slots can be reused by Insert:
- Insert(k,v) - find the smallest $i$ with $T[h(k,i)]$ empty or DELETED. Put $(k,v)$ there.
- Search(k) - walk the probing sequence, skipping DELETED slots, until an empty slot (key not found) or the key is found.
- Delete(k) - find the key and set that slot to DELETED.
Caveat
The naive Insert above can create duplicates: if key $k$ already exists further along the probe sequence (past a DELETED slot), inserting $k$ into the DELETED slot places a second copy. The correct Insert must scan the entire probe sequence (up to the first empty slot) to confirm $k$ is absent, then go back and write into the first empty or DELETED slot it saw.
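The scheme above can be sketched in code. This is a minimal illustrative implementation (the class and helper names are my own), using linear probing as a stand-in for a general probe sequence, and performing the full-scan Insert from the caveat so no duplicates arise:

```python
# Sentinels for the two special slot states.
EMPTY, DELETED = object(), object()

class OpenAddressingTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, k):
        # Probe sequence h(k, 0), h(k, 1), ..., h(k, m-1) (linear probing).
        start = hash(k) % self.m
        for i in range(self.m):
            yield (start + i) % self.m

    def search(self, k):
        # Skip DELETED slots; only an EMPTY slot terminates the search.
        for idx in self._probe(k):
            slot = self.slots[idx]
            if slot is EMPTY:
                return None
            if slot is not DELETED and slot[0] == k:
                return slot[1]
        return None

    def insert(self, k, v):
        # Scan up to the first EMPTY slot so an existing copy of k is
        # updated in place, remembering the first reusable slot seen.
        first_free = None
        for idx in self._probe(k):
            slot = self.slots[idx]
            if slot is EMPTY:
                if first_free is None:
                    first_free = idx
                break
            if slot is DELETED:
                if first_free is None:
                    first_free = idx
            elif slot[0] == k:        # key already present: overwrite
                self.slots[idx] = (k, v)
                return
        assert first_free is not None, "table full"
        self.slots[first_free] = (k, v)

    def delete(self, k):
        # Tombstone the slot instead of emptying it.
        for idx in self._probe(k):
            slot = self.slots[idx]
            if slot is EMPTY:
                return
            if slot is not DELETED and slot[0] == k:
                self.slots[idx] = DELETED
                return
```

Note how `delete` writes DELETED rather than EMPTY, so a later `search` for a key placed past the tombstone keeps probing instead of stopping early.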
A bad probing sequence is one where all keys receive the same permutation. In that case, all keys compete for the same ordering of slots, and runtime degrades to $\Theta(n)$ per operation.
Definition
Uniform Permutation Hashing Assumption: For every key, its probing sequence is an independently generated, uniformly random permutation of $\{0, 1, \dots, m-1\}$.
We claim that under this assumption, if the load factor $\alpha = \frac{n}{m} < 1$, then all operations run in expected time $O\!\left(\frac{1}{1-\alpha}\right)$. Intuitively, if $\alpha$ is close to $1$, the table is nearly full, so we probe many slots before finding an empty one. We prove this for Insert; the others follow similarly.
Let $X$ be the number of slots probed by Insert. We want to show $\mathbb{E}[X] \le \frac{1}{1-\alpha}$.

Step 1: rewrite the expectation using a tail sum. For any nonnegative integer-valued random variable:

$$\mathbb{E}[X] = \sum_{i=1}^{\infty} \Pr[X \ge i]$$
Step 2: bound each tail probability. For $j \ge 1$, let $A_j$ be the event that the $j$-th probed slot is occupied. Then $\{X \ge i\} = A_1 \cap A_2 \cap \cdots \cap A_{i-1}$ (the first $i-1$ probed slots were all full). We use the chain rule for conditional probability:

$$\Pr[X \ge i] = \Pr[A_1] \cdot \Pr[A_2 \mid A_1] \cdots \Pr[A_{i-1} \mid A_1, \dots, A_{i-2}]$$

Now bound each factor. There are $n$ occupied slots out of $m$ total. Since the probing sequence is a random permutation:

- $\Pr[A_1] = \frac{n}{m}$ (the first probed slot is equally likely to be any of the $m$ slots, of which $n$ are occupied)
- $\Pr[A_2 \mid A_1] \le \frac{n-1}{m-1}$ (given the first slot was occupied, we've removed it from consideration; of the remaining $m-1$ slots, at most $n-1$ are occupied)
- In general, $\Pr[A_j \mid A_1, \dots, A_{j-1}] \le \frac{n-j+1}{m-j+1}$

Each factor is at most $\alpha = \frac{n}{m}$ (subtracting the same amount from numerator and denominator only decreases the ratio when $n \le m$). So:

$$\Pr[X \ge i] \le \alpha^{i-1}$$
Step 3: sum the geometric series.

$$\mathbb{E}[X] = \sum_{i=1}^{\infty} \Pr[X \ge i] \le \sum_{i=1}^{\infty} \alpha^{i-1} = \frac{1}{1-\alpha}$$
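A quick simulation can sanity-check this bound. The setup below (function name and parameters are illustrative) draws a fresh random permutation per trial, matching the uniform permutation assumption, and counts probes until the first empty slot:

```python
import random

def expected_probes(m, n, trials=20000):
    """Average number of probes to insert into a table with n of m slots full."""
    total = 0
    for _ in range(trials):
        occupied = set(random.sample(range(m), n))  # which n slots are full
        perm = random.sample(range(m), m)           # random probe sequence
        probes = 1
        for slot in perm:
            if slot not in occupied:
                break                               # found an empty slot
            probes += 1
        total += probes
    return total / trials

m = 100
for n in (50, 90):
    alpha = n / m
    print(f"alpha={alpha}: simulated {expected_probes(m, n):.2f},"
          f" bound {1 / (1 - alpha):.2f}")
```

At $\alpha = 0.5$ the bound is $2$ probes; at $\alpha = 0.9$ it is $10$, and the simulated averages land just below these values.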
In practice, uniform permutation hashing is not efficiently implementable. So we use approximations:
- Linear Probing: $h(k, i) = (h'(k) + i) \bmod m$. Great cache locality, but one collision leads to a cascade of collisions (clustering).
- Double Hashing: $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$. Worse cache locality but closer to random permutations.
- Quadratic Probing: $h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$, in some sense “in between”.
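The three schemes can be sketched as follows (here `h1` and `h2` are hypothetical stand-ins for ordinary hash functions into $[0, m)$; for double hashing, the step $h_2(k)$ must be coprime with $m$ so the sequence visits every slot):

```python
def linear_probe(h1, m):
    # h(k, i) = (h1(k) + i) mod m
    return lambda k, i: (h1(k) + i) % m

def double_hash_probe(h1, h2, m):
    # h(k, i) = (h1(k) + i * h2(k)) mod m; requires gcd(h2(k), m) == 1,
    # e.g. m a power of two with h2 always odd.
    return lambda k, i: (h1(k) + i * h2(k)) % m

def quadratic_probe(h1, m, c1=1, c2=1):
    # h(k, i) = (h1(k) + c1*i + c2*i^2) mod m
    return lambda k, i: (h1(k) + c1 * i + c2 * i * i) % m

m = 8
h1 = lambda k: hash(k) % m
h2 = lambda k: 1 + 2 * (hash(k) % (m // 2))  # always odd, so coprime with 8
seq = [double_hash_probe(h1, h2, m)("key", i) for i in range(m)]
# seq visits all m slots exactly once, because the step is coprime with m
```

Forcing the step odd is the standard trick for power-of-two table sizes; with a prime $m$, any nonzero step works.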
When $\alpha$ gets too large or too small, we double or halve $m$, respectively, and rebuild the table. The amortized analysis remains the same.
Static Dictionaries
Here, our input is $n$ arbitrary key-value pairs given up front. We output a dictionary that only supports Search (it is immutable). The goal: preprocessing in expected $O(n)$ time, space $O(n)$, and worst-case $O(1)$ search, always. This is perfect hashing, where no collisions occur at all.
Attempt 1: Large Table
Lemma
Suppose $m \ge n^2$ and $\mathcal{H}$ is universal. Then for any set of $n$ keys,

$$\Pr_{h \sim \mathcal{H}}\left[\exists\, k \ne k' : h(k) = h(k')\right] \le \frac{1}{2}$$
Proof
We upper bound the probability of any collision via a union bound. There are $\binom{n}{2}$ pairs of keys, and by universality each pair collides with probability at most $\frac{1}{m}$:

$$\Pr[\text{some pair collides}] \le \binom{n}{2} \cdot \frac{1}{m} \le \frac{n^2}{2} \cdot \frac{1}{n^2} = \frac{1}{2}$$

So $\Pr[\text{no collisions}] \ge \frac{1}{2}$.
Algorithm: sample $h$ from a universal family $\mathcal{H}$ with $m = n^2$. Build an array $T$ of length $m$ and set $T[h(k)] = (k, v)$ for each pair. If a collision occurs, resample $h$ and retry (expected $\le 2$ tries). Search is just a direct lookup: check $T[h(k)]$.
This achieves $O(1)$ worst-case search, but uses $O(n^2)$ space. Can we do better?
Two-Level Hashing (FKS Scheme)
The idea: use a first-level hash $h$ with $m = n$ buckets (like chaining), then within each bucket $b$, build a separate collision-free table of size $\ell_b^2$ using the lemma above, where $\ell_b$ is the number of keys landing in bucket $b$. Each bucket gets its own hash function $h_b$.
To search for key $k$: compute $b = h(k)$ to find the bucket, then look up $h_b(k)$ in that bucket's table. Both steps are $O(1)$.
The total space is $O\!\left(m + \sum_b \ell_b^2\right)$. The first term is $O(n)$ since $m = n$. The second term is the sum over all buckets of the size of each bucket's collision-free sub-table. Recall from the lemma above that a bucket with $\ell_b$ keys needs a table of size $\ell_b^2$ to guarantee no collisions. So we need $\sum_b \ell_b^2 = O(n)$:
Note that $\sum_b \ell_b^2$ can be much larger than $n$. For example, if all keys land in one bucket, $\sum_b \ell_b^2 = n^2$. But if keys spread evenly ($\ell_b = 1$ for each bucket), then $\sum_b \ell_b^2 = n$. The following lemma shows that with a universal hash, we're close to the even case:
Lemma
Let $m = n$ and $\mathcal{H}$ be universal. Define the load of bucket $b$ as

$$\ell_b = \left|\{k : h(k) = b\}\right|$$

Then

$$\Pr\left[\sum_b \ell_b^2 > 4n\right] \le \frac{1}{2}$$
Proof
We show $\mathbb{E}\left[\sum_b \ell_b^2\right] \le 2n$, which gives the result by Markov's inequality ($\Pr[Y > 2\,\mathbb{E}[Y]] \le \frac{1}{2}$).
The key trick is rewriting $\ell_b^2$ as a count of ordered pairs. If $\ell_b$ keys land in bucket $b$, then $\ell_b^2$ is the number of ordered pairs $(k, k')$ (allowing $k = k'$) where both $k$ and $k'$ hash to $b$. We can split this into:

- Diagonal ($k = k'$): there are $\ell_b$ such pairs (each key paired with itself)
- Off-diagonal ($k \ne k'$): these are actual two-key collisions in bucket $b$

So:

$$\ell_b^2 = \ell_b + \left|\{(k, k') : k \ne k',\ h(k) = h(k') = b\}\right|$$

Sum over all buckets:

$$\sum_b \ell_b^2 = \sum_b \ell_b + \sum_{k \ne k'} \mathbf{1}[h(k) = h(k')] = n + \sum_{k \ne k'} \mathbf{1}[h(k) = h(k')]$$

The first sum is $n$ (every key lands in exactly one bucket), and summing the off-diagonal counts over buckets just checks whether $k$ and $k'$ collide anywhere. Take expectations and apply universality to each ordered pair:

$$\mathbb{E}\left[\sum_b \ell_b^2\right] = n + \sum_{k \ne k'} \Pr[h(k) = h(k')] \le n + n(n-1) \cdot \frac{1}{m}$$

Since $m = n$: $\mathbb{E}\left[\sum_b \ell_b^2\right] \le n + \frac{n(n-1)}{n} \le 2n$.
So with probability at least $\frac{1}{2}$, the total space for the second-level tables is $O(n)$, giving $O(n)$ overall space. If we're unlucky, resample $h$ and rebuild (expected $\le 2$ tries, each taking $O(n)$ time).
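Putting the two levels together, an FKS construction might look like the following sketch (assuming integer keys below the chosen prime; `P`, `rand_hash`, and the other names are illustrative):

```python
import random

P = (1 << 31) - 1  # prime larger than every key

def rand_hash(m):
    # One draw from the universal family ((a*k + b) mod P) mod m.
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def build_fks(pairs):
    n = len(pairs)
    # Level 1: hash into n buckets; retry until the sum of squared loads
    # is at most 4n (each try succeeds with probability >= 1/2).
    while True:
        h = rand_hash(n)
        buckets = [[] for _ in range(n)]
        for k, v in pairs:
            buckets[h(k)].append((k, v))
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Level 2: a collision-free table of size l_b^2 per bucket.
    subtables = []
    for bucket in buckets:
        m2 = len(bucket) ** 2
        if m2 == 0:
            subtables.append((None, []))
            continue
        while True:  # expected <= 2 attempts per bucket
            g = rand_hash(m2)
            table = [None] * m2
            ok = True
            for k, v in bucket:
                if table[g(k)] is not None:  # collision: resample g
                    ok = False
                    break
                table[g(k)] = (k, v)
            if ok:
                subtables.append((g, table))
                break
    return h, subtables

def fks_search(h, subtables, k):
    # Two O(1) lookups: bucket, then slot within the bucket's table.
    g, table = subtables[h(k)]
    if g is None:
        return None
    entry = table[g(k)]
    return entry[1] if entry is not None and entry[0] == k else None
```

Searches are worst-case constant time; all the randomness is spent at build time.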
Recitation
Universal Hashing Family
It is hard to find one function that satisfies the simple uniform hashing assumption. As a result, we instead find a family $\mathcal{H}$ of functions with the following property: for all $k \ne k'$,

$$\Pr_{h \sim \mathcal{H}}[h(k) = h(k')] \le \frac{1}{m}$$
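One classic universal family is $h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$ for prime $p$, whose collision probability for any fixed pair of distinct keys is at most $\frac{1}{m}$. The snippet below (with illustrative parameters) enumerates the entire family for one small prime and measures that probability exactly:

```python
# Enumerate every (a, b) in the family ((a*k + b) mod p) mod m and count
# how often a fixed pair of distinct keys collides.
p, m = 101, 10        # small prime and table size, chosen for illustration
k1, k2 = 7, 42        # any two distinct keys in [0, p)

collisions, total = 0, 0
for a in range(1, p):
    for b in range(p):
        total += 1
        if ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m:
            collisions += 1

print(collisions / total)  # at most 1/m = 0.1
```

Since the loop covers the whole family, the printed ratio is the true collision probability, not an estimate.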
Summary of Solutions
In computer science, we generally want Insert(k), Delete(k), and Search(k). Some implementations include:
- Doubly linked list - Insert and Delete are constant, Search is $O(n)$. Space is $O(n)$. We don't like linear time for Search.
- Direct Access Array - Everything takes constant time, but space is $O(u)$, the size of the key universe. This is way too huge. This is why we need hashing: we want near-constant time for everything. We always assume the universe is much larger than $n$.
- Chaining - Insert and Delete are $O(1)$. Search is $O(1 + \alpha)$ in expectation, where $\alpha = \frac{n}{m}$ is our load factor. Space is $O(n + m)$.
- Open Addressing - We define a probing scheme $h(k, i)$. This basically means we check the next spot if the first one is taken, etc. All operations now run in expected $O\!\left(\frac{1}{1-\alpha}\right)$ with $O(m)$ space.
- In order to resolve the issue of deleting elements, we introduce DELETED, which is used when we remove elements. So when we search, we skip over DELETED elements.
- We also have a permutation assumption, which means each permutation is equally likely.
- We also have this idea of perfect hashing. None of the solutions so far were optimal, since we only had expected constant time depending on the load factor. We can do better ($O(1)$ worst case) if we work with a static dictionary, stored in $O(n)$ space. We do this with nested hash tables:
- Run the top-level hash function $h$ to find the bucket.
- Within a bucket holding $\ell$ keys, we create another hash table of size $\ell^2$. The probability that any two things collide in it is at most $\frac{1}{2}$.