Summary


Open Addressing

In Hashing I we used chaining (linked lists at each bucket). What if we were constrained to just a single array? Instead of getting a single hash value, each key gets a probing sequence, a permutation of the table indices:

⟨h(k, 0), h(k, 1), …, h(k, m−1)⟩, a permutation of {0, …, m−1},

where h(k, 0) is most preferred, h(k, 1) is next, etc. We try each in order until we find an empty spot.

  • Insert(k,v) - find the smallest 0 ≤ p ≤ m−1 with T[h(k, p)] = NULL. Put (k, v) there.
  • Search(k) - walk through h(k, 0), h(k, 1), … until we hit NULL (key not found) or find T[h(k, p)].key = k (return it).
  • Delete(k) - find the key and set that slot to NULL.

The problem with Delete is that setting a slot to NULL can break later searches: if key k′ was inserted after key k and was placed past k's slot (because k was in the way), deleting k creates a NULL gap that causes Search(k′) to stop too early.

The fix is to introduce a DELETED flag. Only NULL terminates a search, but DELETED slots can be reused by Insert:

  • Insert(k,v) - find the smallest p with T[h(k, p)] = NULL or DELETED. Put (k, v) there.
  • Search(k) - walk the probing sequence, skipping DELETED slots, until NULL or the key is found.
  • Delete(k) - find the key and set that slot to DELETED.

Caveat

The naive Insert above can create duplicates: if key 𝑘 already exists further along the probe sequence (past a DELETED slot), inserting into the DELETED slot places a second copy. The correct Insert must scan the entire probe sequence (up to NULL) to confirm 𝑘 is absent, then go back and write into the first DELETED or NULL slot it saw.
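The corrected scheme can be sketched as follows. This is a minimal illustration, with linear probing standing in for the probe sequence h(k, p); the class and method names are my own, not from the notes:

```python
# Minimal open-addressing table with NULL/DELETED sentinels. Linear probing
# stands in for the probe sequence h(k, p); names are illustrative.

class OpenAddressingTable:
    _NULL = object()     # empty slot: terminates a search
    _DELETED = object()  # tombstone: skipped by search, reusable by insert

    def __init__(self, m=8):
        self.m = m
        self.slots = [self._NULL] * m

    def _probe(self, k):
        for p in range(self.m):
            yield (hash(k) + p) % self.m  # h(k, p) via linear probing

    def insert(self, k, v):
        first_free = None  # earliest DELETED slot seen along the probe path
        for i in self._probe(k):
            slot = self.slots[i]
            if slot is self._NULL:
                # reached NULL, so k is definitely absent: reuse the
                # earliest free slot (a tombstone if we passed one)
                self.slots[i if first_free is None else first_free] = (k, v)
                return
            if slot is self._DELETED:
                if first_free is None:
                    first_free = i
            elif slot[0] == k:
                self.slots[i] = (k, v)  # key exists: update, no duplicate
                return
        if first_free is None:
            raise RuntimeError("table full")
        self.slots[first_free] = (k, v)

    def search(self, k):
        for i in self._probe(k):
            slot = self.slots[i]
            if slot is self._NULL:
                return None  # only NULL terminates the search
            if slot is not self._DELETED and slot[0] == k:
                return slot[1]
        return None

    def delete(self, k):
        for i in self._probe(k):
            slot = self.slots[i]
            if slot is self._NULL:
                return  # key not present
            if slot is not self._DELETED and slot[0] == k:
                self.slots[i] = self._DELETED  # tombstone, not NULL
                return
```

Deleting a key leaves a tombstone, so a later Search for a key placed past it still succeeds, and a later Insert can reclaim the slot.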

A bad probing scheme is one where all keys receive the same permutation. In that case, all keys compete for the same ordering of slots, and runtime degrades to O(n).

Definition

Uniform Permutation Hashing Assumption: For every key, its probing sequence is an independently generated uniformly random permutation.

We claim that under this assumption, if the load factor α = n/m < 1, then all operations run in expected O(1/(1−α)) time. Intuitively, if α is close to 1, the table is nearly full, so we probe many slots before finding an empty one. We prove this for Insert; the others follow similarly.

Let X be the number of slots we probe until we find an empty one (including that final probe). We want to show E[X] ≤ 1/(1−α).

Step 1: rewrite the expectation using a tail sum. For any nonnegative integer-valued RV:

E[X] = Σ_{j=1}^{m} Pr[X ≥ j]

Step 2: bound each tail probability. For 0 ≤ p ≤ m−1, let E_p be the event that slot h(k, p) is occupied. Then {X ≥ j} = E_0 ∩ E_1 ∩ ⋯ ∩ E_{j−2} (the first j−1 probed slots were all full). We use the chain rule for conditional probability:

Pr[X ≥ j] = Pr[E_0] · Pr[E_1 | E_0] · Pr[E_2 | E_0 ∩ E_1] ⋯ Pr[E_{j−2} | E_0 ∩ ⋯ ∩ E_{j−3}]

Now bound each factor. There are 𝑛 occupied slots out of 𝑚 total. Since the probing sequence is a random permutation:

  • Pr[E_0] = n/m (the first probed slot is equally likely to be any of the m slots, n of which are occupied)
  • Pr[E_1 | E_0] ≤ (n−1)/(m−1) (given the first slot was occupied, we’ve removed it from consideration; of the remaining m−1 slots, at most n−1 are occupied)
  • In general, Pr[E_p | E_0 ∩ ⋯ ∩ E_{p−1}] ≤ (n−p)/(m−p)

Each factor is at most n/m = α (since (n−p)/(m−p) ≤ n/m when n ≤ m). So:

Pr[X ≥ j] ≤ α^{j−1}

Step 3: sum the geometric series.

E[X] ≤ Σ_{j=1}^{m} α^{j−1} ≤ Σ_{j=0}^{∞} α^{j} = 1/(1−α)
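As a sanity check on this bound, here is a small simulation under the uniform permutation assumption (a sketch; the table sizes and trial count are arbitrary choices, not from the notes):

```python
# Empirical check of E[X] <= 1/(1-alpha): n of m slots are occupied, and
# each trial draws a fresh uniformly random probe permutation (the uniform
# permutation hashing assumption), counting probes until an empty slot.
import random

def avg_probes(n, m, trials=20000):
    occupied = set(range(n))  # which slots are full is irrelevant by symmetry
    total = 0
    for _ in range(trials):
        probes = 0
        for slot in random.sample(range(m), m):  # random probe sequence
            probes += 1
            if slot not in occupied:
                break  # found an empty slot
        total += probes
    return total / trials

m = 100
for n in (50, 75, 90):
    alpha = n / m
    print(f"alpha={alpha}: avg={avg_probes(n, m):.2f}, bound={1/(1-alpha):.2f}")
```

The averages stay below the corresponding 1/(1−α) bounds, and the gap closes as α approaches 1.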

In practice, uniform permutation hashing is not efficiently implementable. So we use approximations:

  • Linear Probing: h(k, p) = (h₁(k) + p) mod m. Great cache locality, but one collision leads to a cascade of collisions (clustering).
  • Double Hashing: h(k, p) = (h₁(k) + p·h₂(k)) mod m. Worse cache locality but closer to random permutations (we need h₂(k) relatively prime to m for the sequence to visit every slot).
  • Quadratic Probing: in some sense “in between”.
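A quick sketch of the first two probe functions; h₁ and h₂ below are illustrative stand-ins (the notes do not fix them), and m is kept a power of two with h₂ forced odd so that double hashing yields a full permutation:

```python
# Illustrative probe functions. The hashes are stand-ins; for double
# hashing, h2(k) must be relatively prime to m so the probe sequence is a
# full permutation (here: m a power of two, h2 forced odd).

def linear_probe(k, p, m):
    # h(k, p) = (h1(k) + p) mod m
    return (hash(k) + p) % m

def double_hash_probe(k, p, m):
    # h(k, p) = (h1(k) + p * h2(k)) mod m
    h2 = (hash((k, "salt")) << 1) | 1  # forced odd => gcd(h2, m) = 1
    return (hash(k) + p * h2) % m

m = 8
lin = [linear_probe("key", p, m) for p in range(m)]
dbl = [double_hash_probe("key", p, m) for p in range(m)]
print(sorted(lin), sorted(dbl))  # both visit all of 0..7 exactly once
```

Both sequences are genuine permutations of {0, …, m−1}, but double hashing spreads consecutive probes across the table instead of scanning adjacent slots.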

When α gets too large or too small, we double or halve m, respectively, and rebuild. The amortized analysis remains the same as for table doubling.

Static Dictionaries

Here, our input is n arbitrary key-value pairs (k₁, v₁), …, (k_n, v_n) given up front. We output a dictionary that only supports Search (immutable). The goal: preprocessing O(n), space O(n), and worst-case O(1) search. This is perfect hashing, where no collisions occur at all.

Attempt 1: Large Table

Lemma

Suppose m ≥ n² and ℋ is a universal family. Then for all distinct k₁, …, k_n ∈ 𝒰,

Pr_{h∼ℋ}[no collisions] ≥ 1/2

Algorithm: sample h from a universal family ℋ with m = n². Build an array T of length m and set T[h(k_i)] = (k_i, v_i). If a collision occurs, resample h and retry (expected ≤ 2 tries). Search is just a direct lookup: T[h(k)].

This achieves 𝑂(1) search, but uses 𝑂(𝑛2) space. Can we do better?
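A sketch of this construction for integer keys, using the classic universal family h_{a,b}(k) = ((a·k + b) mod p) mod m as an illustrative choice (any universal family works; the notes don't prescribe one):

```python
# Sketch of the large-table construction for integer keys: m = n^2 slots,
# h drawn from the universal family h_{a,b}(k) = ((a*k + b) mod prime) mod m,
# resampled on any collision.
import random

def build_perfect_table(pairs):
    n = len(pairs)
    m = n * n
    prime = 2**61 - 1  # a prime larger than any key we will hash
    while True:  # expected <= 2 iterations by the lemma
        a = random.randrange(1, prime)
        b = random.randrange(prime)
        h = lambda k, a=a, b=b: ((a * k + b) % prime) % m
        T = [None] * m
        ok = True
        for k, v in pairs:
            i = h(k)
            if T[i] is not None:  # collision: throw h away and resample
                ok = False
                break
            T[i] = (k, v)
        if ok:
            return T, h

pairs = [(7, "a"), (42, "b"), (100, "c"), (5, "d")]
T, h = build_perfect_table(pairs)
print(T[h(42)])  # (42, 'b'), found with a single array access
```

Search is a single array access, but the table has n² slots, matching the O(n²) space cost noted above.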

Two-Level Hashing (FKS Scheme)

The idea: use a first-level hash h₁ with m = n buckets (like chaining), then within each bucket l, build a separate collision-free table of size N_l² using the lemma above. Each bucket gets its own hash function h_{2,l}.

To search for key k: compute l = h₁(k) to find the bucket, then look up T_l[h_{2,l}(k)] in that bucket’s table. Both steps are O(1).

The total space is O(m + Σ_{l=0}^{m−1} N_l²). The first term is O(n) since m = n. The second term is the sum over all buckets of the size of each bucket’s collision-free sub-table. Recall from the lemma above that a bucket with N_l keys needs a table of size N_l² to guarantee no collisions. So we need Σ_l N_l² = O(n).

Note that Σ_l N_l² can be much larger than n. For example, if all n keys land in one bucket, Σ_l N_l² = n². But if keys spread evenly (N_l ≈ 1 for each bucket), then Σ_l N_l² ≈ n. The following lemma shows that with a universal hash, we’re close to the even case:

Lemma

Let m = n and let h₁ be drawn from a universal family ℋ. Define the load of bucket l as

N_l ≜ #{i : h₁(k_i) = l}

Then

Pr_{h₁∼ℋ}[Σ_{l=0}^{m−1} N_l² > 4n] < 1/2

So with probability ≥ 1/2, the total space for the second-level tables is ≤ 4n, giving O(n) overall space. If we’re unlucky, resample h₁ and rebuild (expected ≤ 2 tries, each taking O(n) time).
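Putting both levels together, a sketch of the scheme for integer keys (again using the illustrative family h_{a,b}(k) = ((a·k + b) mod prime) mod m, which the notes don't prescribe):

```python
# Sketch of the FKS scheme for integer keys. Level one is resampled until
# sum(N_l^2) <= 4n; each bucket l then gets its own collision-free table
# of size N_l^2, resampled per bucket until no collisions occur.
import random

PRIME = 2**61 - 1  # a prime larger than any key we will hash

def sample_hash(m):
    a = random.randrange(1, PRIME)
    b = random.randrange(PRIME)
    return lambda k: ((a * k + b) % PRIME) % m

def build_fks(pairs):
    n = len(pairs)
    while True:  # level 1: expected <= 2 tries
        h1 = sample_hash(n)
        buckets = [[] for _ in range(n)]
        for k, v in pairs:
            buckets[h1(k)].append((k, v))
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []
    for bucket in buckets:  # level 2: collision-free table per bucket
        size = len(bucket) ** 2
        if size == 0:
            tables.append((None, []))
            continue
        while True:  # expected <= 2 tries per bucket
            h2 = sample_hash(size)
            T = [None] * size
            ok = True
            for k, v in bucket:
                i = h2(k)
                if T[i] is not None:
                    ok = False
                    break
                T[i] = (k, v)
            if ok:
                tables.append((h2, T))
                break
    return h1, tables

def search(h1, tables, k):
    h2, T = tables[h1(k)]  # step 1: find the bucket, O(1)
    if h2 is None:
        return None
    entry = T[h2(k)]       # step 2: direct lookup in the sub-table
    return entry[1] if entry is not None and entry[0] == k else None

pairs = [(i * i + 3, str(i)) for i in range(10)]
h1, tables = build_fks(pairs)
print(search(h1, tables, 12))  # key 12 = 3*3 + 3 -> "3"
```

Each search performs exactly two hash evaluations and two array accesses, giving worst-case O(1), while the resampling loops keep the expected build time at O(n).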

Recitation

Universal Hashing Family

It is hard to find one function that satisfies the simple uniform hashing assumption. As a result, we instead find a family of functions with the following property:

Pr_{h∼H}[h(k) = h(k′)] ≤ 1/m  for all distinct keys k ≠ k′
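For concreteness, here is an empirical check of that property for the standard family h_{a,b}(k) = ((a·k + b) mod p) mod m with p prime. The family and the particular keys are illustrative choices, not prescribed by the notes:

```python
# Empirical check of the universal property for the family
# h_{a,b}(k) = ((a*k + b) mod p) mod m: over random (a, b), the fraction
# of draws on which two fixed distinct keys collide should be <= 1/m.
import random

p, m = 101, 10   # p prime and larger than any key
k1, k2 = 7, 55   # an arbitrary fixed pair of distinct keys
trials = 50000
collisions = 0
for _ in range(trials):
    a = random.randrange(1, p)
    b = random.randrange(p)
    h = lambda k: ((a * k + b) % p) % m
    if h(k1) == h(k2):
        collisions += 1
print(collisions / trials, "vs bound", 1 / m)
```

The observed collision rate sits at or below the 1/m bound, which is exactly what the family guarantees for every pair of distinct keys.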

Summary of Solutions

In computer science, we generally want Insert(k), Delete(k), and Search(k). Some implementations include:

  • Doubly linked list - Insert and Delete are constant, Search is O(n). Space is O(n). We don’t like linear time for Search.
  • Direct Access Array - Everything takes constant time, but space is O(|U|). This is way too huge. This is why we need hashing. We want to have near-constant time for everything. We always assume m = Θ(n).
  • Chaining - Insert and Delete are O(1). Search is O(1+α) in expectation, where α = n/m is our load factor. Space is O(n).
  • Open Addressing - We define a probing scheme h : 𝒰 × {0, …, m−1} → {0, …, m−1}. This basically means we check the next spot if the first one is taken, etc. All operations now run in expected O(1/(1−α)) time with O(n) space.
    • In order to resolve the issue of deleting elements, we introduce DELETED which is used when we remove elements. So when we search, we skip over DELETED elements.
    • We also have the uniform permutation assumption, which means each key’s probe sequence is an equally likely permutation. We also have this idea of perfect hashing. None of the solutions so far were optimal, since we only had expected constant time under a load-factor assumption. We can do better (O(1) worst case) if we work with a static dictionary, stored in O(n) space. We do this with nested hash tables.
  1. Run the top-level hash function to find a bucket i.
  2. Give bucket i its own hash table of size N_i². By the lemma, the probability that any two keys in the bucket collide is < 1/2, so resampling quickly gives a collision-free table.