Gaurav's Blog

return rand();

Birthday Paradox and Hash Functions

| Comments

The Birthday Problem is concerned with computing the probability that, in a set of $n$ randomly chosen people, at least one pair of people will have the same birthday (let’s call it a ‘collision’).

Intuitively, what would be your guess for the probability of collision to become > 0.5? Given that there are 365 possible days, I would imagine $n$ to be quiet large for this to happen. Let’s compute it by hand.

What is the probability that there isn’t a collision?

  • The total number of combinations of birthdays possible for the set of $n$ people is $365^n$.
  • However, for the birthdays to not overlap, each one should pick $n$ different birthdays out of the 365 possible ones. This is equal to $_{365}P_{n}$ (pick any $n$ out of the 365 days, and allow permutations).

  • $P(\text{no collision}) = \frac{_{365}P_{n}}{365^n}$.

  • $P(\text{collision}) = 1 - \frac{_{365}P_{n}}{365^n}$.

Plotting the graph for collision to happen, let’s see where this becomes true.

So it seems that the collision happens with a probability >= 0.5 after $n$ is greater than 23. The paradox in the Birthday Paradox is that we expect $n$ to be quite large, where as it seems you need only $\approx \sqrt{365}$ people.

In general, it has been proven that the if there are $n$ balls, and $b$ bins, then the probability of any bin having > 1 ball is >= 0.5, when $n \approx \sqrt{b}$.

Considering hash functions to be placing balls in bins, if the number of distinct elements that could be fed to the hash function are $n$, to ensure that the probability of collision remains < 0.5, the length of the hash in number of bits required for the hash function, $l$, should be such that $2^l > n^2$.

This means, if you expect $2^{32}$ distinct elements, make sure to use at least a 64 bit hash, if you want the probability of collision to be < 0.5

Deathbed Formulae

| Comments

Professor Bender asked us to remember three ‘Deathbed Formulae’, i.e., if someone asks you about them on your deathbed, you should be able to recall these.

  1. $\large{\left(\frac{y}{x}\right)^{x}} \leq \large{y \choose x} \leq \large{\left(\frac{ey}{x}\right)^{x}}$.

  2. For a large value of $n$, $\left(1 - \large{\frac{1}{n}}\right)^{n} \approx \large{\frac{1}{e}}$.

  3. For a large value of $n$, $\left(1 + \large{\frac{1}{n}}\right)^{n} \approx e$.

These were super-helpful in the Graduate-level Algorithms and Advanced Algorithms courses, which were heavy on randomized algorithms.

I might post about some interesting bits related to randomized algorithms some time soon, so wanted to share these pre-emptively.

Modulo a Large Prime: Why?

| Comments

The Karp-Rabin algorithm for string matching (which is pretty elegant, and I strongly prefer it to KMP in most settings), uses the modulo operator when computing the hash of the strings. What is interesting is, it recommends using a prime number as the modulus for doing this.

Similarly, when computing which hash-table bucket a particular item goes to, the common way to do it is: $b = h(x)\bmod n$. Where $h(x)$ is the hash function output, $n$ is the number of buckets you have.

Why do we need a prime modulus?

In hash functions, one should expect to receive pathological inputs. Assume, $n = 8$. What happens, if we receive $h(x)$ such as that they are all multiples of $4$? That is, $h(x)$ is in $[4, 8, 12, 16, 20, …]$, which in $\bmod 8$ arithmetic will be $[4, 0, 4, 0, 4, …]$. Clearly, only 2 buckets will be used, and the rest 6 buckets will be empty, if the input follows this pattern. There are several such examples.

As a generalization, if the greatest common factor of $h(x)$ and $n$ is $g$, and the input is going to be of the form $[h(x), 2h(x), 3h(x), …]$, then the number of buckets that will be used is $\large \frac{n}{g}$. This is easily workable on paper.

We ideally want to be able to use all the buckets. Hence, the number of buckets used, $\large \frac{n}{g}$ $= n$, which implies $g = 1$.

This means, the input and the modulus ($n$) should be co-prime (i.e., share no common factors). Given, we can’t change the input, we can only change the modulus. So we should choose the modulus such that it is co-prime to the input.

For the co-prime requirement to hold for all inputs, $n$ has to be a prime. Now it will have no common factors with any input (except it’s own multiples), and $g$ would be 1.

Therefore, we need the modulus to be prime in such settings.

Let me know if I missed out on something, or my intuition here is incorrect.