Gaurav's Blog

return rand();

Dynamic Programming: You Can Do It Half Asleep!


That was a click-baity title. :)

But seriously, people make a big deal out of ‘Dynamic Programming’ in the context of software engineering interviews. The name sounds fancy, but for most problems in such interviews, you can go from a naive recursive solution to an efficient solution pretty easily.

Any problem that has the following properties can be solved with Dynamic Programming:

  1. Overlapping sub-problems: the same sub-problems show up again and again, so their solutions are worth reusing.
  2. Optimal substructure: the optimal solution to the bigger problem can be built from the optimal solutions to its sub-problems.

You just have to do two things here:

  1. Check if you can apply the above criteria.
  2. Get a recursive solution to the problem.

That’s it.

Usually the second part is harder. After that, it is like clockwork, and the steps remain the same almost all the time.

Example

Assume your recursive solution to, say, computing the n-th Fibonacci number, is:

Step 1: Write this as a recursive solution first

// Naive recursion, with fib(0) = fib(1) = 1.
int fib(int n) {
  if (n == 0 || n == 1) {
    return 1;
  } else {
    return fib(n-1) + fib(n-2);
  }
}

Now, this is an exponential-time solution. Most of the inefficiency comes from recomputing the same solutions again and again. Draw the recursion tree as an exercise, and convince yourself that this is true.

Also, when you do this, you at least get a naive solution out of the way. The interviewer now knows that you can solve the problem (perhaps not efficiently, yet).

Step 2: Let’s just simply cache everything

Store every value ever computed.

#include <cstring>  // for memset

int cache[20];  // Hard-coded limit; use a std::vector for arbitrary n.
int fibDP(int n);

int fib(int n) {
  // Pre-fill all the values of cache as -1 ("not computed yet").
  memset(cache, -1, sizeof(cache));
  return fibDP(n);
}

int fibDP(int n) {
  // Check if we have already computed this value before.
  if (cache[n] != -1) {
    // Yes, we have. Return that.
    return cache[n];
  }

  // This part is identical to the solution before.
  // Just make sure that we store the value in the cache after computing it.
  if (n == 0 || n == 1) {
    cache[n] = 1;
  } else {
    cache[n] = fibDP(n-1) + fibDP(n-2);
  }

  return cache[n];
}
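
As a quick sanity check, a hypothetical driver (with fib(0) = fib(1) = 1, as defined above):

#include <cstdio>

int main() {
  // The sequence is 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
  // so this prints 89.
  printf("%d\n", fib(10));
  return 0;
}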

Let us compute how many unique calls we can make to fibDP:

  • There is one parameter, n.
  • Hence, only the values $0$ through $n$ can ever be passed to fibDP.
  • Hence, $O(n)$ unique calls.

Now, realize two things:

  1. We never compute the value of the function twice for the same input.
    • So, given $n$, the function body does real work $O(n)$ times, as seen above.
    • Each call does $O(1)$ work on top of the recursive calls.
    • Total = $O(n)$ calls with $O(1)$ work each => $O(n)$ total time complexity.
  2. We are using up extra space.
    • We use as much extra space as:
    • (number of possible unique calls to the recursive function) * (space required for each value).
    • Since there are $O(n)$ unique calls, each storing an int, the space is $O(n)$.
    • I have hard-coded a limit of 20 in my code. We could also use a std::vector, etc.

That’s it. We just optimized the recursive code from $O(2^n)$ time, $O(n)$ space (recursive call-stack space) to $O(n)$ time, $O(n)$ space (recursion + extra space).

Example with a higher number of parameters

// A toy recursion with two parameters; each call branches three ways.
int foo(int n, int m) {
  if (n <= 0 || m <= 0) {
    return 1;
  }

  return foo(n-1, m) + foo(n, m-1) + foo(n-1, m-1);
}

Time complexity: $O(3^{n+m})$. [Work out on paper why this is the complexity, if you are not sure: each call makes three recursive calls, and the recursion depth is at most $n+m$.]

DP Code

#include <cstring>  // for memset

int cache[100][100];  // Again, a hard-coded limit for illustration.
int fooDP(int n, int m);

int foo(int n, int m) {
  // Pre-fill the cache with -1 ("not computed yet").
  memset(cache, -1, sizeof(cache));
  return fooDP(n, m);
}

int fooDP(int n, int m) {
  if (n <= 0 || m <= 0) {
    return 1;
  }

  if (cache[n][m] == -1) {
    cache[n][m] = fooDP(n-1, m) + fooDP(n, m-1) + fooDP(n-1, m-1);
  }
  return cache[n][m];
}

  • Number of unique calls: $O(nm)$
  • Space Complexity: $O(nm)$
  • Time Complexity: $O(nm)$

Assume I tweak foo to do $O(n \log m)$ work inside each call; that would just get multiplied into the time complexity, i.e., $O(nm) \times O(n \log m) = O(n^2 m \log m)$ total. In general:

$\text{Time complexity} = O(\text{unique calls}) \times O(\text{work per call})$

$\text{Space complexity} = O(\text{unique calls}) \times O(\text{space per call})$

Now just reinforce these ideas with this question

  • Given a rectangular grid of size N x M,
  • What is the length of the longest path from the bottom-left corner (0, 0) to the top-right corner (N - 1, M - 1), assuming you can move up, right, or diagonally (up and right)? (One possible memoized solution is sketched below, if you want to check your approach.)
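
Here is a minimal sketch of how the memoized solution could look, assuming the path length counts moves, and using hypothetical names (longestPath, cache) and a hard-coded limit as before:

#include <algorithm>
#include <cstring>

int cache[100][100];  // Hypothetical hard-coded limit, as before.
int longestPathDP(int i, int j);

int longestPath(int n, int m) {
  memset(cache, -1, sizeof(cache));
  return longestPathDP(n - 1, m - 1);
}

// Longest number of moves to reach (i, j) from (0, 0), where each
// move goes up, right, or diagonally (up and right).
int longestPathDP(int i, int j) {
  if (i == 0 && j == 0) {
    return 0;
  }
  if (cache[i][j] != -1) {
    return cache[i][j];
  }
  int best = -1;
  if (i > 0) best = std::max(best, longestPathDP(i - 1, j));
  if (j > 0) best = std::max(best, longestPathDP(i, j - 1));
  if (i > 0 && j > 0) best = std::max(best, longestPathDP(i - 1, j - 1));
  cache[i][j] = best + 1;  // At least one branch above fired, so best >= 0.
  return cache[i][j];
}

Number of unique calls: $O(NM)$; with $O(1)$ work per call, that is $O(NM)$ time and space, just like before.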

Extra Credit

What we saw is called top-down DP, because we take the bigger problem, break it down into sub-problems, and solve those first. This is basically recursion with memoization: we ‘memoize’ (a fancy word for caching) the solutions of the sub-problems.

When you absolutely, totally nail the recursive solution, some interviewers might want a solution without recursion, or might want to optimize the space complexity even further (which is often not possible in the recursive version). In that case, we want bottom-up DP, which is slightly more involved. It starts by solving the smallest sub-problems iteratively, and builds the solutions to bigger problems from those.
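
For instance, a bottom-up version of fib could look like the sketch below. Since each value depends only on the previous two, the cache can shrink to two variables, taking the extra space from $O(n)$ to $O(1)$:

// Bottom-up fib: solve the smallest sub-problems first, iteratively.
int fibBottomUp(int n) {
  int prev = 1, curr = 1;  // fib(0) and fib(1).
  for (int i = 2; i <= n; i++) {
    int next = prev + curr;
    prev = curr;
    curr = next;
  }
  return curr;
}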

Go down this path only if you have time. Otherwise, even if you just mention to the interviewer that you know there is something called bottom-up DP which can be used to do this iteratively, they should be at least somewhat okay. I did a short blog post on converting a top-down DP to a bottom-up DP, if that sounds interesting.

Back to Basics: Reservoir Sampling


If you have a dataset that is either too large to fit in memory, or to which new data is being added on an ongoing basis (effectively, we don’t know its size), you need algorithms that have:

  1. Space Complexity that is independent of the size of the dataset, and,
  2. Time Complexity per item that is either constant, or otherwise very small.

Examples of such datasets could be clicks on Google.com, friend requests on Facebook.com, etc.

One simple problem that could be posed with such datasets is:

Pick a random element from the given dataset, ensuring that the likelihood of picking any element is the same.

Of course, this is trivial to solve if we know the size of the dataset: simply pick a random integer in $[0, n-1]$ ($n$ being the size of the dataset), and index into that element. That is not possible with a stream, where we don’t know $n$ upfront.
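
A minimal C++ sketch of that known-size case (the function name and the use of a std::vector are just for illustration):

#include <random>
#include <vector>

// Uniform sampling when the dataset fits in memory and n is known:
// pick a random index in [0, n - 1] and return that element.
int uniformSample(const std::vector<int>& data, std::mt19937& rng) {
  std::uniform_int_distribution<size_t> idx(0, data.size() - 1);
  return data[idx(rng)];
}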

Reservoir Sampling does this pretty elegantly for a stream $S$:

sample, len := null, 0   # sample: current pick; len: elements seen so far
while !(S.Done()):
  x := S.Next()
  len = len + 1
  # Rand(len) returns a uniform integer in [0, len - 1], so the new
  # element replaces the current sample with probability 1/len.
  if (Rand(len) == 0):
    sample = x
return sample

The idea is simple: for every new element you encounter, replace your current choice with the new element with probability $\large\frac{1}{l}$, where $l$ is the total length of the stream (including this element) encountered so far. When the stream ends, return the element you are currently holding.

However, we need to make sure that the probability of each element being picked is the same. Let’s do a rough sketch:

  1. When the length of the stream is $1$, the probability of the first element being picked is $1.0$ (trivial to check).
  2. Assuming the idea works for a stream of size $l$, we can show that it works for a stream of size $l+1$:
    • Since it holds for a stream of size $l$, each of the first $l$ elements is currently selected with the same probability, i.e., $\large\frac{1}{l}$.
    • The new element replaces the selected element with probability $\large\frac{1}{l+1}$, so the existing selection survives with probability $\large\frac{l}{l+1}$.
    • Hence, the probability of each of the first $l$ elements being selected becomes $\large\frac{1}{l} \cdot \large\frac{l}{l+1} = \large\frac{1}{l+1}$.
    • The probability of the new element being selected, as stated earlier, is $\large\frac{1}{l+1}$.
    • Hence, the probability is the same for all $l+1$ elements, assuming it holds for a stream of size $l$.
  3. Given that the property holds for $n = 1$, and (2) is true, we can prove by induction that the property holds for all $n \gt 1$.

There is also a weighted variant of this problem, where each element has an associated weight. At the end, the probability of item $i$ being selected should be $\large\frac{w_i}{W}$, where $w_i$ is the weight of the $i$-th element, and $W$ is the sum of the weights of all the elements.

It is straightforward to extend our algorithm to respect the weights of the elements (a code sketch follows below):

  1. Instead of len, keep a running sum of weights, W.
  2. Increment W by $w_i$ instead of incrementing it by 1.
  3. The new element replaces the existing element with a probability of $\large\frac{w_i}{W}$.

In a manner similar to the proof above, we can show that this algorithm also respects the condition we imposed on it. In fact, the previous algorithm is a special case of this one, with all $w_i = 1$.
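
Here is a minimal C++ sketch of the weighted variant. For simplicity, the ‘stream’ is just a vector of (value, weight) pairs, and the names are for illustration:

#include <random>
#include <utility>
#include <vector>

// Weighted reservoir sampling: at the end, element i has been selected
// with probability w_i / W, where W is the total weight of the stream.
int weightedReservoirSample(const std::vector<std::pair<int, double>>& stream,
                            std::mt19937& rng) {
  std::uniform_real_distribution<double> unit(0.0, 1.0);
  int sample = 0;  // Placeholder; always overwritten on the first element.
  double W = 0.0;  // Running sum of weights seen so far.
  for (const auto& item : stream) {
    W += item.second;
    // The new element replaces the current sample with probability w_i / W.
    if (unit(rng) < item.second / W) {
      sample = item.first;
    }
  }
  return sample;
}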

Credits:

[1] Jeff Erickson’s notes on streaming algorithms for the general idea about streaming algorithms.

Paper: Neural Graph Machines


Semi-Supervised Learning is useful in scenarios where you have some labeled data, and lots of unlabeled data. If there is a way to cluster training examples by some measure of similarity, and we can assume that examples close to each other in these clusters are likely to have the same labels, then Semi-Supervised Learning can be pretty effective.

Essentially, the premise is that labeling all the data is expensive, and we should learn as much as we can from as small a labeled dataset as possible. For their data-labeling needs, the industry relies either on services like Amazon’s Mechanical Turk, or on full-time labelers on contract (Google Search is one example, with a large team of human raters). Overall, it is costly to build a large labeled dataset. Therefore, if we can minimize our dependence on labeled data, and learn from known or inferred similarity within the dataset, that would be great. That’s where Semi-Supervised Learning helps.

Assume there is a graph structure to our data, where each node is a datum / row that needs to be labeled, and a weighted edge exists between two nodes if they are similar. In this setting, Label Propagation is a classic technique which has been used commonly.

I read this paper from Google Research, which does a good job of generalizing and summarizing similar work (Weston et al., 2012) around training Neural Nets augmented by such a graph-based structure.

To summarize the work very quickly, the network tries to do two things:

a. For labeled data, try to predict the correct label (of course),

b. For the entire data set, try to learn a representation of each datum (embedding) in the hidden layers of the neural net.

For nodes which are adjacent to each other, the distance between their respective embeddings should be small, and the importance of keeping this distance small is proportional to the edge weight. That is, for two adjacent nodes with a high edge weight, the penalty for the neural net failing to place their embeddings close to each other is larger than it would be if the edge weight were lower.

In the figure in the paper, the blue layer is the hidden layer used for generating the embedding, whose output is represented by $h_{\theta}(X_i)$, and $y$ is the final output. If $X_i$ and $X_j$ are adjacent in the graph, the distance between their embeddings is represented by $d(h_{\theta}(X_i), h_{\theta}(X_j))$.

The cost function is below. Don’t let it scare you: the first term is just the total loss from the predictions of the network on the labeled data, and the next three terms tweak the importance of the distances between the various (labeled / unlabeled)-(labeled / unlabeled) pairs, weighed by their respective edge weights, $w_{uv}$.
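
Roughly reconstructing it from the paper’s notation (treat this as a paraphrase; $E_{LL}$, $E_{LU}$, $E_{UU}$ denote the labeled-labeled, labeled-unlabeled, and unlabeled-unlabeled edge sets, and the $\alpha$’s are tunable weights):

$$\sum_{i \in L} c(g_{\theta}(X_i), y_i) + \alpha_1 \sum_{(u,v) \in E_{LL}} w_{uv}\, d(h_{\theta}(X_u), h_{\theta}(X_v)) + \alpha_2 \sum_{(u,v) \in E_{LU}} w_{uv}\, d(h_{\theta}(X_u), h_{\theta}(X_v)) + \alpha_3 \sum_{(u,v) \in E_{UU}} w_{uv}\, d(h_{\theta}(X_u), h_{\theta}(X_v))$$

Here, $c$ is the usual supervised loss over the labeled set $L$, and $g_{\theta}(X_i)$ is the network’s prediction.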

I skimmed through the paper to get the gist, but I wonder whether the core contribution of such a network is just to go from a graph-based structure to an embedding. If we were to construct an embedding directly from the graph structure, and train a NN separately using that embedding as the input, my guess is that it would fetch similar results with a less complex objective function.