# A Fortune Cookie Server in Go

I really like the concept of Fortune Cookies in *nix systems, and I absolutely love the Hindi movie Andaz Apna Apna. So, I thought I would write up a simple fortune-cookie server which serves random quotes from the movie. A lot of my friends liked it.

So, I thought it would be even nicer if I could generalize it and add a bunch of other popular movies and TV shows. So I wrote Elixir, a generic fortune-cookie server written in Go. This new fortune-cookie server was hosted on rand(quotes), and had quotes from Quentin Tarantino’s movies, The Lord of the Rings, and the popular TV shows Breaking Bad and Game of Thrones.

Using Elixir, it is extremely simple to write a fortune-cookie server which serves quotes from multiple quote databases. All you need to do is create a file that contains the quotes you want to serve, one per line, and give it a name like foo.quotes. Place it in the directory where the server was started from, and those quotes will be served from the /foo endpoint.

To make it more fun, /foo?f=cowsay returns the quote in the cowsay format! Something like this:

You can create many such quote databases, and also add/delete/modify them while the server is running, and the server will pick up the changes.

A full-featured fortune-cookie server would look something like this:

Implementation Note: To keep tabs on the quote databases without having to restart the server, one option was to use the inotify subsystem in Linux from Go. But this didn’t work on OS X. So I wrote a quick and dirty implementation which calls ioutil.ReadDir(".") periodically, and filters for files which have a .quotes extension.

# More Writing This Year

It has been almost a year since I wrote something. For the most part, I have been too busy to share anything. For the past year, I have been working on HBase at Facebook, which is the backbone of Facebook Messages and many other important services at Facebook. Also, I’ve moved to Mountain View, California, and I absolutely love the surroundings.

I have been trying my hand at different things: learning some ML, Go, trying to learn to play the electric guitar, and so on. One thing I want to keep up this year is to continue trying new things, and keep sharing my experiences.

Finally, I have also ditched WordPress in favor of Octopress, a Markdown-based blogging framework built on top of Jekyll. What this means is that I just need to worry about the content, which I can write in simple Markdown. Octopress generates a completely static website for me. I don’t have to set up the Apache-MySQL monstrosity to serve a simple blog.

However, the transition wasn’t very smooth:

• I had to get my posts from WordPress into a Markdown format. For this, I used exitwp.
• I had to copy the images I had uploaded to my WP setup into my Octopress sources directory, and manually change the image tags in all the Markdown files to point to the right location.
• For LaTeX, I am using MathJax with Octopress.

I have only lost two things in this transition:

• Obviously, I lost the ability to receive comments natively, along with the older comments on my posts. This is fine by me, since I don’t receive too many comments anyway. I will enable Disqus on the blog for future comments.
• Also, WP has a ridiculous URL scheme for posts, something like yourblog.com/?p=XYZ, where XYZ is a number, while Octopress has a more sensible :year/:month/:date/:title scheme. Google had indexed my blog according to the older scheme, so anybody who has linked to a specific blog post will now be redirected to the main page. In short, it’s not pleasant.

However, the big win is that it is super easy for me to write posts and host a blog. Earlier, this blog was on a free shared-hosting site, and it was very clumsy to manage. As of the time of writing this post, I am hosting multiple blogs on a very lightweight VPS, and as long as I don’t get DDoS-ed, this machine is more than capable of hosting several such static-only blogs. Because, after all, how hard is it to serve static HTML :)

I have a lot to share about what I learnt in the last year, so keep looking :)

# Putting My Twitter Friends and Followers on the Map - II (Using D3.js)

I had done a visualisation using R, where I plotted the locations of my friends and followers on Twitter on a map. I re-did this visualisation using D3.js. It is a fantastic visualisation tool with a lot of features. I could only explore a fraction of the API for my purpose, but you can see a lot of cool examples on their page.

The final output was something like this (a better look here):

I wanted to resize the bubbles when there are a lot of points in the vicinity, but I’ll leave that for later.

You can have a look at the D3.js code, and the Python script which gets the lat-long pairs using the Twitter API and the Google GeoCoding API (sadly, Yahoo! decided to commercialize their GeoCoding API).

(The map data and some inspiration for the code, is from here)

# Scalable Bloom Filters in Go

I have been trying to learn some Go in my free time. While I was trying to code up a simple Bloom Filter, I realized that once a Bloom Filter gets too full (i.e., the false positive rate becomes high), we cannot add more elements to it. We cannot simply resize the original Bloom Filter, since we would need to rehash all the elements that were inserted, and we obviously don’t maintain that list of elements.

A solution to this is to create a new Bloom Filter of twice the size (or for that matter, any multiple >=2), and add new elements to the new filter. When we need to check if an element exists, we need to check in the old filter, as well as the new filter. If the new filter gets too full, we create another filter which is a constant factor (>=2) greater in size than the second filter. bit.ly uses a solution similar to this (via Aditya).

We can see that if we have N elements to insert into our ‘Scalable’ Bloom Filter, we would need about log N filters (to the base r, where r is the multiple mentioned above). It is also easy to derive the cumulative false positive rate for this new Bloom Filter.

If the false positive rate of each of the individual constituent bloom filters is f, the probability that we do not get a false positive in one of the filters is (1-f).

Therefore the probability that we do not get a false positive in any of the q filters is, (1-f)^q.

Hence, the probability that we get a false positive in any of these q filters is:

1 - (1-f)^q.

Some rough estimates show that this cumulative false positive rate is around q * f (only if f is small), where q is about log N, as we noted above. For example, if you have four filters, each with a false positive rate of 0.01, the cumulative false positive rate is about 4 * 0.01 = 0.04, which is close to the exact value 1 - (0.99)^4 ≈ 0.039.

What is beautiful in this construction is that the false positive rate is independent of how fast the filter sizes grow. If you maintain a good (small) false positive rate in each of the constituent filters, you can simply add up their false positive rates to get an estimate of the cumulative false positive rate (again, only if f is small).

You can simply grow your filters fast (say, around 5x) each time one becomes too full, so as to keep the number of filters small.

I have implemented this ‘Scalable Bloom Filter’ (along with the vanilla Bloom Filter and the Counting Bloom Filter) in Go. Please have a look, and share any feedback.

# Linking C++ Template Code

Today I was trying to link together code which uses C++ templates. The usual accepted pattern is to put the declarations in a header, and the definitions in a separate .hpp / .cpp file. However, I was unable to get them to link together.

To my surprise, I discovered (or re-discovered, probably) that when dealing with C++ templates, you need to put the definitions in the header file. A good explanation of why this is so is here.

# Five Awesome Shell Utilities You Might Not Know About

pgrep I always used to do “ps -ax | awk ‘/procname/{print $1}’”, until I learnt that we could simply do “pgrep procname”, and it will list the PIDs of all the processes with ‘procname’ in their names.

pkill Similarly, I used to do “ps -ax | awk ‘/procname/{print $1}’ | xargs kill”. As you must have guessed, this kills all the processes with names containing ‘procname’. But a much simpler way is to just do “pkill procname”.

zcat (and other related utilities) A lot of times, we need to grep through compressed files. For this, we usually uncompress the archive somewhere, grep through the resulting files, and then delete them. zcat is much simpler: it uncompresses an archive and writes the result to standard output, so you can pipe the output to grep. Or, you can directly use zgrep! See some other related utilities here.

netcat netcat is a damn neat networking utility, which reads and writes data across the network using TCP. This is pretty nifty because we can pipe the output of a command to a process running on a different machine. This is extremely useful for monitoring. Thanks to Dhruv for introducing this one.

strace This utility can be used to print the list of system calls (along with their arguments) being made by a program while it is running. How cool is that! See the output of ‘strace ls’ here.

# Latency Numbers Every Programmer Should Know

Here are some latency numbers that Peter Norvig thinks every engineer should know, and I wholeheartedly agree (I was recently asked questions which required knowledge of these numbers, and my guesstimates were quite off the mark).

Edit: Here is a version with latency figures in ms, where appropriate.

Update: Here is a version showing the change in the numbers over the years.

# Putting My Twitter Friends and Followers on the Map

I was quite impressed by the Visualizing Friendships post on the Facebook Engineering blog. So I decided to try out some data visualization myself. Of course, I am no longer an intern at Facebook (I interned there this summer. A post coming up soon), so I don’t have access to the millions of edges used in the map. So, I decided to do something similar for Twitter.

To plot anything, I needed to find out where my friends and followers were located. I used the Twitter API to get the list of my friends and followers. Then, for each of these users, I found where they were located. This is not quite simple, since I want the location in a latitude-longitude format, and not everyone mentions their real location in their Twitter profile. Moreover, the Twitter API has two problems:

1. It is slow.
2. It places a lot of tight restrictions on how many results you can get at once.

But I waded through both of them by (a) being patient, and (b) batching up requests in groups of 100. This got me the public data about my friends whose profiles were publicly accessible. Now, a lot of them live in places which might have multiple names, can be misspelled, do not accurately pinpoint the location of the person, etc. For example, a friend who lives in Mumbai can write ‘Mumbai’, ‘Bombay’ (the old name of Mumbai), ‘Aamchi Mumbai’ (a popular phrase which translates to ‘Our Mumbai’), or ‘Maharashtra’ (the state), etc. Thankfully, I found the Yahoo! Placefinder API, which solves this problem more or less. We can query for the lat-long pair of a place, and it will return its best guesses for the same.

Once I did that, I could use R (thanks to Aditya for the suggestion) to plot the lat-long pairs on the World Map. The output isn’t quite pleasing to the eye, but it does the job.

You can find the script that gets the Lat-Long pairs here.

Edit: Since Yahoo! is commercializing their PlaceFinder API, I think the Google GeoCoding API should be a suitable replacement. (Thanks to Dhruv for the suggestion).

# Aliases

If one uses the terminal a lot, it is essential to find fast ways of doing routine stuff. Aliases are one of the many tricks up your sleeve.

Here is how you can set up a bash alias:

alias smallcmd='very long and verbose command'


You can then type

smallcmd


and the shell will replace it with the longer version. To make this permanent, add the aliases to the end of your .bashrc file.

I have even aliased ‘git pull’, ‘git push’, ‘git commit’, and ‘git status’ to three letter mnemonics, which are really handy since I do a lot of work using git.

Secondly, SSH aliases are something I didn’t know about until very recently. Instead of typing long, unwieldy usernames and hostnames, one can simply add an alias for the machine to their SSH config file, usually at ~/.ssh/config. For example:

Host ubuntuvm
HostName 172.16.29.143
User reddragon


Now, I can just do

ssh ubuntuvm


to SSH to my Virtual Machine!

# Profiling in Python

I was trying to implement the Miller-Rabin algorithm for one of my homeworks, in Python. I had already done this in C++ to solve PON, but I wanted to try out Python and see how fast I could get it. I wrote a function to compute $a^b \bmod n$ which used repeated squaring to get the result. However, the implementation wasn’t fast enough. Dhruv suggested that I profile the code.

Here is how you profile a Python script: python -m cProfile script.py

And this is what I came to know:

Note how the fastModularExp() function took a staggering 16 seconds. We then learnt that the built-in pow function in Python already does fast modular exponentiation, and since it is part of a C library, it is quite fast. After switching to it, I submitted a solution for PON and got accepted in 0.35 seconds, which was faster than my 0.90-second submission in C++ :-)
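For reference, the repeated-squaring idea can be sketched as follows (in Go, to match the other posts here; this is a generic illustration of the algorithm, not my original Python code):

```go
package main

import "fmt"

// modExp computes a^b mod n by repeated squaring, using O(log b)
// multiplications. It assumes n < 2^32 so that the intermediate
// products fit in a uint64 without overflow.
func modExp(a, b, n uint64) uint64 {
	result := uint64(1)
	a %= n
	for b > 0 {
		if b&1 == 1 { // this bit of the exponent is set
			result = result * a % n
		}
		a = a * a % n // square the base
		b >>= 1       // move to the next bit
	}
	return result
}

func main() {
	fmt.Println(modExp(2, 10, 1000)) // 24, since 2^10 = 1024
	// 561 is a Carmichael number: a^560 mod 561 == 1 for every a coprime
	// to 561, which is exactly why plain Fermat tests need Miller-Rabin.
	fmt.Println(modExp(7, 560, 561)) // 1
}
```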