Summary

Hash tables implement dictionaries in expected $O(1)$ time per operation by using a random hash function to map keys into buckets. Deterministic hash functions always have adversarial worst cases, so we randomize. A universal hash family guarantees low collision probability for any pair of keys, which (combined with chaining) gives $O(1 + n/m)$ expected time. We construct such a family using modular arithmetic with random coefficients over a prime modulus.


We will discuss implementing dictionaries with hashing. This will combine Randomized Algorithms with Amortized Analysis.

### Dictionaries/Maps

Dictionaries involve associations between pairs of objects.

$$
"Key" space k quad arrow.r quad "Value" space v
$$

We want to support the following interface:

- `Insert(k,v)` - add the key-value pair $(k, v)$ at index $k$
- `Delete(k)` - if there is an item with key $k$, set index $k$ to NULL
- `Search(k)` - return the entry at index $k$

We define the following notation:

- $cal(U)$ = universe of all possible keys
- $n$ = # of objects in the dictionary
- $m$ = "size" of the dictionary - what does this mean? We will explore.

We will assume that $|cal(U)|$ is way bigger than $n$, and for convenience, that every key in $cal(U)$ is a nonnegative integer.

#### Direct addressing

Allocate an array of size $|cal(U)|$. Each key $k$ maps directly to index $k$. All 3 operations are $O(1)$, but the space is $O(|cal(U)|)$, which is huge.
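
A minimal sketch of direct addressing (my addition; the universe size `U_SIZE` and function names are illustrative, not from the notes):

```python
# Direct addressing: one array slot per possible key in the universe.
U_SIZE = 1000            # |U|; the table uses O(|U|) space regardless of n
table = [None] * U_SIZE

def insert(k, v):
    table[k] = v         # O(1): key k maps directly to index k

def delete(k):
    table[k] = None      # O(1): set index k to NULL

def search(k):
    return table[k]      # O(1): read index k
```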

#### Doubly Linked List

- `Insert(k,v)` - add a node with $(k, v)$ at the head
- `Delete(k)` - traverse the list to find $x$ s.t. `x.KEY` $= k$ and update pointers
- `Search(k)` - traverse the list to find $x$

Constant insertion but linear deletion and search. Space is better: $O(n)$.

#### Balanced Search Trees

Create a balanced binary search tree. All operations take $O(log n)$, a decent compromise.

| Approach | Insert | Search/Delete | Space |
| --- | --- | --- | --- |
| Direct addressing | $O(1)$ | $O(1)$ | $O(abs(cal(U)))$ |
| Doubly linked list | $O(1)$ | $O(n)$ | $O(n)$ |
| Balanced BST | $O(log n)$ | $O(log n)$ | $O(n)$ |
| Hashing (goal) | $O(1)$ exp. | $O(1)$ exp. | $O(n)$ |

### Hashing

We like that array lookups only take constant time. What if we kept that but used a hash function to reduce the universe size?

$$
h: cal(U) -> {0, ..., m-1}
$$

Each operation involves computing $h(k)$. But how do we handle collisions, i.e. distinct keys with the same hash value? Additionally, how do we choose a "good" hash function? What even makes it "good"?

### Collision Resolution

#### Chaining

We combine the array approach with the doubly linked list approach.

- `Insert(k,v)` - store $(k, v)$ at the head of the doubly linked list at index $h(k)$
- `Delete(k)` - go to index $h(k)$ and delete the node with key $k$
- `Search(k)` - go to index $h(k)$ and search inside the list
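
A minimal chaining sketch (my addition; the table size `m` and the placeholder hash `k % m` are illustrative, and plain Python lists stand in for the doubly linked lists):

```python
# Chaining: an array of m buckets, each holding a chain of (key, value) pairs.
m = 8

def h(k):
    return k % m                     # placeholder hash; a real one is randomized

buckets = [[] for _ in range(m)]

def insert(k, v):
    delete(k)                        # keep keys unique within the chain
    buckets[h(k)].append((k, v))     # O(1) without the duplicate check

def search(k):
    for key, val in buckets[h(k)]:   # O(1 + length of the chain at h(k))
        if key == k:
            return val
    return None

def delete(k):
    buckets[h(k)][:] = [(key, val) for key, val in buckets[h(k)] if key != k]
```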

Example

But what is the complexity of Delete and Search?

$$
O(1 + alpha) quad "where" quad alpha := n/m
$$

$alpha = n/m$ is the **load factor**. This is $O(1)$ when $n = O(m)$. However, an adversary can realize what the hash function is and add elements that all hash to the same place. Then, the worst-case runtime of `Search` and `Delete` is $O(n)$.

> [!tip] Corollary
> 
> Any deterministically chosen hash function will fail in the worst case.

The solution to this problem is randomness. We assume the adversary can see our algorithm, but cannot see the random bits it uses. We aim to prove the following: for every $k_1, ..., k_n in cal(U)$,

$$
PP["all operations on" space k_1, ..., k_n space "take time" space O(1 + alpha)] >= 0.999
$$

Why not strengthen the quantifiers to something like

$$
PP[forall k_1, ..., k_n in cal(U), space "all operations on" space k_1, ..., k_n space "take time" space O(1 + alpha)] >= 0.999?
$$

Because this is impossible. $cal(U)$ is much larger than $m$, so by the pigeonhole principle there must exist keys that collide under any fixed $h$. No single hash function can spread every possible set of $n$ keys evenly. The original guarantee is still strong: the adversary picks the keys before seeing our random choice of $h$, so for any fixed input the hash function works well with high probability.

We will prove an easier goal:

$$
EE["running time of any operation"] <= O(1 + alpha)
$$

### Random Hash Function

Let's try picking a uniformly random function $h: cal(U) -> {0, ..., m-1}$. In other words, for every $k in cal(U)$, independently sample $h(k) ~ "Unif"{0, ..., m-1}$.

We claim that for all buckets $0 <= l <= m-1$ and all sequences of $n$ keys, if we let $N_l = #{j : h(k_j) = l}$ be the number of keys landing in bucket $l$, then

$$
EE[N_l] = alpha = n/m
$$

To prove it, let $X_j = bb(1)[h(k_j) = l]$. Using the same indicator variable technique from Quick Sort:
$$
EE[N_l] = EE[sum_(j=1)^n X_j] = sum_(j=1)^n EE[X_j] = n/m
$$
by linearity of expectation, since each $EE[X_j] = PP[h(k_j) = l] = 1/m$.
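
A quick empirical check of this claim (my addition; the parameters are arbitrary): simulate many uniformly random hash functions and average the load $N_l$ of one fixed bucket.

```python
# Average N_l over many random h; it should concentrate around alpha = n/m.
import random

random.seed(0)                  # deterministic run for reproducibility
n, m, trials = 100, 50, 2000
l = 0                           # the bucket we watch
total = 0
for _ in range(trials):
    # a fresh uniformly random h: each key independently gets a uniform bucket
    h = [random.randrange(m) for _ in range(n)]
    total += sum(1 for bucket in h if bucket == l)   # N_l for this h

avg = total / trials
alpha = n / m                   # here alpha = 2.0
```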

The practical problem with this is that we need an efficient way to store and query the mapping $k arrow.r.bar h(k)$. This is as hard as the original problem!

> [!info] Definition
> 
> A family of hash functions $cal(H) subset.eq {"all" h: cal(U) -> {0,...,m-1}}$ is **uniform** if
> $$
> PP_(h ~ cal(H)) [h(k) = l] = 1/m quad forall k in cal(U), quad forall 0 <= l <= m-1
> $$

Is this enough to guarantee few collisions and a fast running time? No. Consider the family of constant functions $cal(H) = {h_0, h_1, ..., h_(m-1)}$ where $h_l (k) = l$ for all $k$. This family is uniform (a random $h ~ cal(H)$ sends any key to each bucket with probability $1/m$), but once a single $h_l$ is chosen, every key lands in bucket $l$. Uniformity only constrains individual keys, not the relationship between different keys.
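
The failure mode is easy to see in code (my addition; the keys are arbitrary): the draw of $l$ is uniform, yet the chosen function sends everything to one bucket.

```python
# The constant family: uniform over the random choice of l, useless afterwards.
import random

m = 10
l = random.randrange(m)   # drawing h_l uniformly gives P[h_l(k) = b] = 1/m per bucket b

def h(k):
    return l              # h_l ignores its input entirely

keys = [17, 42, 99, 123]
occupied = {h(k) for k in keys}   # every key collides in the single bucket l
```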

> [!info] Definition
> 
> Fix an arbitrary sequence of keys $k_1, ..., k_n in cal(U)$. Let
> $$
> C_j := #{j' != j : h(k_(j')) = h(k_j)}
> $$

`Search` and `Delete` run in $O(1 + C_j)$ time.

We claim that for a uniformly random function, we have

$$
EE[C_j] = (n-1)/m <= alpha quad forall j = 1, ..., n
$$

To prove it, let $Y_(j,j') = bb(1)[h(k_(j')) = h(k_j)]$. Then
$$
EE[C_j] &= EE[sum_(j' != j) Y_(j,j')] \
&= sum_(j' != j) EE[Y_(j,j')] \
&= sum_(j' != j) PP[h(k_(j')) = h(k_j)] \
&= sum_(j' != j) 1/m \
&= (n-1)/m
$$

> [!info] Definition
> 
>A hash family $cal(H) subset.eq {"all" h: cal(U) -> {0,...,m-1}}$ is **universal** if
>$$
>PP_(h ~ cal(H)) [h(k) = h(k')] <= 1/m quad forall k != k' in cal(U)
>$$
>
>This is the key upgrade over uniformity: it constrains the joint behavior of *pairs* of keys (related to [[Conditional Probability & Independence#independence|pairwise independence]]).

> [!tip] Corollary
> 
>Let $cal(H)$ be universal and assume every $h in cal(H)$ requires $O(1)$ time to evaluate on any key $k in cal(U)$. Then for any key, the expected runtime of `Search` and `Delete` is $O(1 + alpha)$. 

The space complexity of this method is $O(m + n)$. Assuming we have a universal hash family, we can guarantee: 
![[attachments/Pasted image 20260224120547.png]]

Our new goal is to build a universal hash family such that evaluating $h(k)$ takes constant time and any $h$ can be represented in $O(n + m)$ space. In practice, we pick a "complicated enough" deterministic function. 

### Universal Hash Families
Since keys in $cal(U)$ are nonnegative integers, how about we use *modular arithmetic*? The simplest choice would be $h(k) = k mod m$. This is deterministic, so it doesn't work. Let's try $h_s (k) = (k + s) mod m$ for a uniformly random shift $0 <= s <= m-1$. This is still bad: it produces the same collision pattern, since $s$ just permutes the buckets (i.e. if two keys collide under some function in the family, they collide under all of them).
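
That last parenthetical can be checked directly (my addition; $m$ and the keys are arbitrary):

```python
# A random shift s permutes buckets but never separates colliding keys:
# (k1 + s) mod m == (k2 + s) mod m holds for every s iff it holds for s = 0.
m = 7
k1, k2 = 3, 10   # 3 == 10 (mod 7), so they collide under s = 0

collide = [(k1 + s) % m == (k2 + s) % m for s in range(m)]
# collide is True for every shift: randomizing s buys us nothing
```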

What if we try $h_t (k) = (k dot t) mod m$ for a uniformly random $0 <= t <= m-1$? Notice that for many $t$, only a fraction of the buckets are used. This seems pretty wasteful. But this only happens when the multiplier $t$ shares common factors with $m$. How about we use a prime $m$? 

> [!tip] Theorem
> 
>**Bertrand's Postulate**: For all $n > 3$ there exists a prime $p$ satisfying $n < p < 2n - 2$.
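
Bertrand's postulate means rounding the table size up to a prime costs less than a factor of 2 in space. A naive way to find such a prime (my addition; trial division is fine at this scale):

```python
# Find the smallest prime greater than n; Bertrand guarantees it is below 2n.
def is_prime(p):
    if p < 2:
        return False
    d = 2
    while d * d <= p:          # trial division up to sqrt(p)
        if p % d == 0:
            return False
        d += 1
    return True

def next_prime(n):
    p = n + 1
    while not is_prime(p):
        p += 1
    return p
```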

Let $m$ be prime. We will use a mix of addition and multiplication with random coefficients, modulo $m$. 

> [!example] Example
> 
>$m=7$ and $|cal(U)| = 343 = 7^3$. All keys in $cal(U)$ can be written in base 7. In general, write $k = (k^((2)), k^((1)), k^((0)))$. 
>
>Sample $a ~ "Unif"{0,...,6}^3$. Set $h_a (k) := (a^((2)) dot k^((2)) + a^((1)) dot k^((1)) + a^((0)) dot k^((0))) mod 7$.

In general, let $m$ be prime and $|cal(U)| = m^r$. Write each key in base $m$: $k = (k^((r-1)), ..., k^((1)), k^((0)))$. Sample $a ~ "Unif"{0,...,m-1}^r$ and define

$$
h_a (k) := (sum_(i=0)^(r-1) a^((i)) dot k^((i))) mod m
$$
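
This family can be sketched in a few lines of Python (my addition; assumes $m$ is prime and every key is below $m^r$, and the function names are mine):

```python
# Sample h_a from the universal family: r uniform coefficients mod a prime m.
import random

def make_hash(m, r):
    a = [random.randrange(m) for _ in range(r)]   # a ~ Unif{0,...,m-1}^r

    def h(k):
        # write k in base m and take the inner product with a, mod m
        # (assumes 0 <= k < m**r so all digits of k are captured)
        total = 0
        for i in range(r):
            total = (total + a[i] * (k % m)) % m  # term a^(i) * k^(i)
            k //= m
        return total

    return h
```

Each sampled `h` is represented by just the $r$ coefficients, so storage is $O(r) = O(log_m |cal(U)|)$ and evaluation is constant time for fixed $r$.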


> [!tip] Theorem
> 
> This family $cal(H) = {h_a : a in {0,...,m-1}^r}$ is universal.

**Proof.** Fix two distinct keys $k != k'$. Since $k != k'$, there exists some coordinate $i^*$ where $k^((i^*)) != k^((i^*)')$. We want to show $PP_a [h_a (k) = h_a (k')] <= 1/m$.

The collision condition $h_a (k) = h_a (k')$ means

$$
sum_(i=0)^(r-1) a^((i)) (k^((i)) - k^((i)')) equiv 0 mod m
$$


Isolate the coordinate $i^*$:

$$
a^((i^*)) (k^((i^*)) - k^((i^*)')) equiv -sum_(i != i^*) a^((i)) (k^((i)) - k^((i)')) mod m
$$


Since $m$ is prime and $k^((i^*)) - k^((i^*)') != 0 mod m$, the factor $(k^((i^*)) - k^((i^*)'))$ has a multiplicative inverse mod $m$. So for any fixed choice of all the other $a^((i))$ with $i != i^*$, there is exactly one value of $a^((i^*)) in {0,...,m-1}$ that satisfies the equation. Since $a^((i^*))$ is uniform over $m$ values:

$$
PP_a [h_a (k) = h_a (k') | a^((i)), i != i^*] = 1/m
$$


This holds for every choice of the other coordinates, so averaging over those choices gives $PP_a [h_a (k) = h_a (k')] = 1/m <= 1/m$, which is exactly the universality bound. $square$
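
For a tiny instance we can even verify the proof exhaustively (my addition; the parameters and key pair are arbitrary): enumerate every coefficient vector $a$ and count how many produce a collision for one fixed pair of distinct keys.

```python
# Exhaustively count collisions of h_a over all a for one pair of keys.
from itertools import product

m, r = 5, 2
k1, k2 = 3, 7          # distinct keys, both below m**r = 25

def h(a, k):
    # h_a(k) = sum_i a[i] * (i-th base-m digit of k), mod m
    return sum(a[i] * ((k // m**i) % m) for i in range(r)) % m

collisions = sum(1 for a in product(range(m), repeat=r) if h(a, k1) == h(a, k2))
prob = collisions / m**r   # exactly 1/m, as the proof predicts
```

The count comes out to $m^(r-1)$: for each setting of the other coordinates, exactly one value of $a^((i^*))$ causes a collision.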

> [!question] Why do we need $m$ prime?
> 
> The inverse of $(k^((i^*)) - k^((i^*)'))$ only exists mod $m$ when $gcd(k^((i^*)) - k^((i^*)'), m) = 1$. If $m$ is prime, every nonzero element has an inverse. If $m$ is composite, some nonzero differences would lack inverses, and the equation could have more (or fewer) than one solution for $a^((i^*))$, breaking the $1/m$ bound.