Gaurav's Blog

How I stopped worrying about determinism and started loving randomness.

Lambda: A LISP Interpreter Written in Go for an iOS App

| Comments

In the summer of 2015, I wanted to work on a side-project, that I can quickly use, instead of spending lots of weekends, and it getting no where. Luckily, I found the right project on Peter Norvig’s website. Here I will try to describe, how to write a simple LISP interpreter, and how to use that in your iOS app.

To those who are new to LISP, it is pretty simple to explain. LISP programs are based on something called ‘s-expressions’. An s-expression looks like this: $(\mathrm{operator}\ \mathrm{operand_1}\ \mathrm{operand_2}\ \mathrm{operand_3}\ …)$.

For example:

  • $(+\ 1\ 2)$ is an s-expression, where $+$ is the operator, and $1$ and $2$ are the operands. Hence, it is equivalent to $1\ +\ 2$.
  • Similarly, $(*\ 2\ 3\ 4)$ in LISP evaluates to $2\ *\ 3\ *\ 4$.

The operands can themselves be recursively computed too.

For example, $(+$ $1$ $(*$ $2$ $3))$ is a valid expression. First we evaluate the inner $(*$ $2$ $3)$ part, then the original expression resolves to $(+$ $1$ $6)$, which then evaluates to $7$. This can go on recursively.

For someone who wants to design an interpreter, LISP is the ideal real-life language to start with. This is for two reasons:

  1. There are minimal symbols to interpret. Only $($ and $)$ and the operators that you define.
  2. The parsing is straight-forward and recursive. Zero syntactic-sugar.

I could only stay motivated, and bring this project to a closure, because you can pick a very small subset of LISP, and still do a lot of interesting things.

What I Built

To keep you motivated about reading the article, lets do a demo first and then we can go into details about how I built this.

Here is the GitHub repository for the interpreter, and the app on iTunes (Lambda Lisp). Feel free to file issues / contribute.

If you are still reading: Let’s build an interpreter!

Lexing

Lexing involves finding lexemes, or syntactical tokens, which can then be combined to interpret a grammatical sentence. In the expression $(+$ $1$ $2)$, the tokens are [$($, $+$, $1$, $2$, $)$]. Sophisticated compilers use lex or flex for finding these tokens, handling white-space, attaching a token type to each of them, etc.

I did not want to bloat up my simple interpreter by using lex / flex. I found this nifty one-line bare-bones Lexer in Peter Norvig’s article:

Peter Norvig’s one-line lexer
1
2
3
def tokenize(chars):
    "Convert a string of characters into a list of tokens."
    return chars.replace('(', ' ( ').replace(')', ' ) ').split()

Essentially, what this does is to handle white-space (somewhat). It basically adds spaces around the brackets, and then splits the expression on white-space.

We need to do the replacement for all operators, but otherwise it works well. This is because LISP is simple enough that attaching types to tokens (and error-ing out, if required) can be done at the time of parsing. This is how I did it in Go, just for completeness sake.

My one-line lexer in Go
1
2
3
4
5
func tokenize(exp string) []string {
  return strings.Fields(
    strings.Replace(strings.Replace(exp, "(", " ( ", -1), ")", " ) ", -1),
  )
}

The recurrent theme in the design of this interpreter, is to be lazy and push the harder problems to the next layer.

Constructing an AST

Given an expression, we would need to make sure that the expression follows a structure, or a Grammar. This means two things in our case:

  1. The s-expression should be well formed. The brackets should be balanced, should not be out of order, etc.
  2. Operators such as $+$, $*$, etc. get the right number of operands, have the right type of operands, etc.

At this stage, we are only concerned about well-formedness of the s-expression. We don’t care if the $+$ operator received incompatible operands, for instance. This means that given an expression like $(+$ $1)$, we would mark this expression to be okay at this point, because the expression is well-formed. We will catch the problem of too few operands to $+$, at a later time.

We can start solving the problem of checking well-formedness of the expression by using an Abstract Syntax Tree (or AST). An AST is a way of representing the syntactic structure of code. In this tree, the leaf nodes are atomic values, and all the non-leaf nodes are operators. Recursion can be naturally expressed using an AST.

This is how we can represent a node of this tree in the code:

Definition of ASTNode
1
2
3
4
5
type ASTNode struct {
  value    string     // Only valid for leaf nodes.
  isValue  bool       // Checks if this is a value (also if children == nil).  
  children []*ASTNode // Children of this AST Node.
}

To actually verify the well-formedness of the expression and build the AST, we would go about it this way:

Building the AST
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
// This method gets a list of tokens, and returns:
// 1. The ASTNode of the tree so formed.
// 2. Unused tokens in the end of the array.
// 3. Any error while constructing the tree.
// Removed some error handling to make it a little brief.
func buildAST(tokens []string) (*ASTNode, []string, error) {
  var token = ""
  tokensLen := len(tokens)
  // If it is an empty list of tokens, the AST is a nil node
  if tokensLen == 1 {
    token, tokens = pop(tokens)
    if !isValue(token) {
      return nil, tokens, errStr("value", token)
    }

    // Create a one-node AST.
    node := new(ASTNode)
    node.isValue = true
    node.value = token
    node.children = nil
    return node, tokens, nil
  } else {
    token, tokens = pop(tokens)
    // Expect an opening bracket to start.
    if token != openBracket {
      return nil, tokens, errStr(openBracket, token)
    }

    node := new(ASTNode)
    node.isValue = false
    // Create a slice with 0 length initially.
    node.children = make([]*ASTNode, 0)

    tokensLen = len(tokens)
    for len(tokens) != 0 && tokens[0] != closedBracket {
      var childNode *ASTNode = nil
      var err error = nil
      // If we get an opening brace, it means we need to recursively get the
      // AST of the sub-expression from here on.
      if tokens[0] != openBracket {
        token, tokens = pop(tokens)
        childNode, _, err = buildAST([]string{token})
      } else {
        childNode, tokens, err = buildAST(tokens)
      }

      if err != nil {
        return nil, tokens, err
      }
      node.children = append(node.children, childNode)
    }

    // Expect a closing bracket when ending.
    token, tokens = pop(tokens)
    if token != closedBracket {
      return nil, tokens, errStr(token, closedBracket)
    }
    return node, tokens, nil
  }
}

You can see how the grammar for interpreting the s-expression grammar is hard-coded in the code here. We expect the expression to be either a single value, or something like $(\mathrm{operator}\ \mathrm{o_1}\ \mathrm{o_2}\ …\ )$, where $\mathrm{o_i}$ can be an atomic value, or a nested expression.

Note that we construct the AST slightly differently. The operator is also part of the children.

Parsing & Evaluation

We combine the parsing and evaluation of the AST into one stage. The result of evaluating an AST is an Atom, which can either have a Value or an errror.

Defining Atom
1
2
3
4
type Atom struct {
  Err error
  Val Value
}

Here is a stripped down AST evaluation code:

Evaluating the AST
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
func evalAST(env *LangEnv, node *ASTNode) Atom {
  var retVal Atom
  if node.isValue {
    retVal.Value, retVal.Err = getValue(env, node.value)
    return retVal
  }

  // Assuming that the first child is an operand
  symbol := node.children[0].value
  operator := env.getOperator(symbol)

  // Skipped error handling related to operator & arguments for brevity.
  operands := make([]Atom, 0)
  for i := 1; i < len(node.children); i++ {
    v := evalAST(env, node.children[i])
    if v.Err != nil {
      return v
    }
    operands = append(operands, v)
  }

  v := operator.handler(env, operands)
  if v.Err != nil {
    return v
  }
  retVal.Val = v.Val
  return retVal
}

Basic evaluation is very simple. We have a struct called LangEnv, which is the ‘environment’ data-structure storing amongst other things, defined operators. When evaluating an AST, if it is a single node, the value of the node is the result. Otherwise, we simply lookup the operator in the environment using getOperator, then resolve the operands recursively, and pass the operands to the operator. The operand deals with making sure that the operands are sane.

An operator looks something like this:

Defining Operator
1
2
3
4
5
6
type Operator struct {
  symbol           string
  minArgCount      int
  maxArgCount      int
  handler          (func(*LangEnv, []Atom) Atom)
}

As seen, symbol is the name of operator, so for the binary addition it can be “+”. handler is the function which will actually do the heavy lifting we have been avoiding all this while. It takes in a slice of Atoms (and a LangEnv, more on that later) and returns an Atom as a result.

Now, the fun stuff.

Type System

Remember Atom has a Value inside? Value is an interface, and any type which wants to be a Value, needs to implement the following methods.

What should a Value look like?
1
2
3
4
5
6
7
type Value interface {
  Str() string                 // Returns a string representation of the value.
  getValueType() valueType     // Return the valueType (enum of all Values).
  to(valueType) (Value, error) // Convert to a different Value type.  
  ofType(string) bool          // Check if a given string is of this type.
  newValue(string) Value       // Create a new Value of this type.
}

This is enough power to figure out which value is of which type. In LangEnv we keep a list of builtin Value types, such as intValue, floatValue, stringValue, etc.

To deduce the type of a value, we simply do this:

Type deduction
1
2
3
4
5
6
7
8
9
func getValue(env *LangEnv, token string) (Value, error) {
  types := builtinTypes()
  for _, t := range types {
    if t.ofType(token) {
      return t.newValue(token), nil
    }
  }
  return nil, errors.New(fmt.Sprintf("Could not get type for token: %s", token))
}

Now imagine an expression like $(+$ $1.2$ $3)$.

$1.2$ resolves to floatValue, and $3$ resolves to intValue. How would we implement the handler method for the $+$ operator to add these two different types? You might say, that this would involve casting of $3$ to floatValue. Now how do we decide what values to cast, and what type should we cast them to?

This is how we do it. In the method, typeCoerce, we try to find out which is the right value type to cast all our values to. It is declared like:

1
2
func typeCoerce(operatorName string, operands *[]Atom, typePrecendenceMap map[valueType]int)
  (valueType, error)

This is what we do inside typeCoerce:

  1. We get the type : count mapping for all the param values.
  2. If there is only one type, there is nothing to do, return the corresponding type. Every value belongs to the same type.
  3. If there are multiple types in step 1, pick the one with the highest precedence.
  4. Try and cast all operand values to that type. Error out if any of them resists.

Hence, the $+$ operator could be implemented this way:

Defining the ‘+’ operator
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
op :=
  &Operator{
    symbol:      add,
    minArgCount: 2,
    maxArgCount: 100,     // Just an arbit large value.
    handler: func(env *LangEnv, operands []Atom) Atom {
      var retVal Atom
      var finalType valueType
      finalType, retVal.Err = typeCoerce(add, &operands, numValPrecedenceMap)
      if retVal.Err != nil {
        return retVal
      }

      switch finalType {
      case intType:
        var finalVal intValue
        finalVal.value = 0
        for _, o := range operands {
          v, _ := o.Val.(intValue)
          finalVal.value = finalVal.value + v.value
        }
        retVal.Val = finalVal
        break

      case floatType:
        var finalVal floatValue
        finalVal.value = 0
        for _, o := range operands {
          v, _ := o.Val.(floatValue)
          finalVal.value = finalVal.value + v.value
        }
        retVal.Val = finalVal
        break
      }
      return retVal
    },
  }
// Add the operator to the LangEnv's opMap (operator map)
addOperator(env.opMap, op)

Here we basically just call typeCoerce on the operands, and if its possible to cast them to one single type, we do that casting, and actually perform the addition in the new type.

The $+$ operator can be used to add strings as well. However, we don’t want something like $($$+$ $3.14\ \mathrm{“foo”})$. The typeCoerce method can be trivially extended to support a list of type valid type precedence maps, and all operands need to belong to the same precedence map. In this case, it could be { {intType: 1, floatType: 2}, {stringType: 1 } }. This particular list ensures that we don’t add ints and strings for example, because they don’t belong to the same precedence map.

Note that the entire implementation of the operator is defined in the Operator struct’s handler method. Whether or not the operator supports this sort of casting, or decides to roll its own, or not use it at all, is the prerogative of the operator.

Defining Variables

A typical variable definition could look like this: $(\mathrm{defvar}\ \mathrm{x}\ 3.0)$.

Now, defvar is an operator too. It expects the first argument to be of varType (matches the regex of a variable), and the value can be anything (except varType). We just need to check if the type conditions match, and the variable is not a defined operator. If both are okay, we define the variable in our LangEnv’s varMap.

We need to change the part in our evalAST method which to support variable lookup.

Changes in evalAST to support variables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
func evalAST(env *LangEnv, node *ASTNode) Atom {
  // ...
    for i := 1; i < len(node.children); i++ {
      v := evalAST(env, node.children[i])
      if v.Err != nil {
        return v
      }
      // <!-- Lookup code begins
      if v.Val.getValueType() == varType {
        v.Val, v.Err = getVarValue(env, v.Val)
        if v.Err != nil {
          return v
        }
      }
      // Lookup code ends --!>
      operands = append(operands, v)
    }
  // ...
}

Here we can assume we have a helper method called getVarValue, which looks up the value of a variable in the varMap, or throws an error if required (this is also simple to implement).

Defining Methods

Defining methods is even more fun! We use the defun operator. Consider this:

1
(defun circle-area (r) (* 3.14 r r))

The first operand is the name of the method, the second is a list of variables that you would pass to the method, and the last is the actual method, assuming that the variables you need are already defined. (circle-area 10) after this definition should return 314.

Calculating the area of a rectangle is pretty much similar.

1
(defun rect-area (x y) (* x y))

We need a couple of things for function calls to work fine:

  • Create a new astValue which can be used for keeping ASTs. So far we were keeping ints, floats and so on.
  • Have a way to tell evalAST to not evaluate the AST in the defun arguments. This is because in circle-area, the (* 3.14 r r) itself is the value (AST value).
  • The defun operator needs to add an operator to the opMap, with the same name as the method, and define its handler method.
  • The handler would need to expect the same number of params as specified in the definition.
  • Up until now, our variables have been global in scope. Assume that there is a (defvar x 3.0) defining the variable x, followed by (defun foo (x) (+ 1 x)) which defines a method uses a param labelled x. The interpreter may look at the varMap and think that the programmer wants to use the global x, which is $3.0$. The actual intention of the programmer is to use the parameter x.

For this to work correctly, we would need: * A new LangEnv to be created, inside the handler. * First copy the same varMap as the parent LangEnv passed to the handler. * Then copy the params passed to the handler. Any duplicates will be overwritten, but all global definitions would be preserved. The variable defined in the inner scope would be visible. * Inside the handler, we will call evalAST to evaluate the AST we were provided in the method definition with the new LangEnv * We also keep track of the recursion depth in LangEnv, and it is incremented every time a recursive call is made. If it exceeds a large value (100000 for now), we can error out, so as to salvage the interpreter at least.

This is the only complicated part of the interpreter. Those interested in the code can check it out here.

Recursion

Apart from making sure that we have some sort of recursion depth limit enforced, recursion does not need any special handling. Except, defining some new operators like cond (the if-else equivalent), which are required for writing something useful.

Here is the implementation of the factorial function:

1
2
3
4
5
6
7
8
(defun fact (x)
  (cond                     ; conditional block
    ((= x 0) 1)             ; if x == 0, return 1
    (true                   ; else
      (* x (fact (- x 1)))  ; return x * fact(x - 1)
    )
  )
)

fact(10) returns 3628800 as expected.

Porting to iOS

Once I had the interpreter working fine, I wanted to run this on an iOS app. Why? Just for fun. It turns out with Go 1.5, a new tool called gomobile. Hana Kim gave a talk about this at GopherCon 2015.

What it does is, it compiles your Go package into a static library, and generates Objective-C bindings for it, and wraps them together in a nice iOS friendly .framework package.

There are a few restrictions regarding not being able to return complex types such as structs within structs, but apart from that it was fairly easy to use in my bare-bones app. Here is the code for the app, and we have already seen the demo earlier.

(Thanks to Dhruv Matani for reviewing this blog-post.)

A Gentle Intro to Libevent

| Comments

These past two days in some free time, I decided to explore this nifty C library called libevent.

Following the theme from the previous post, the first question is: ‘Why do we need it?’. If you are familiar with network programming, or any multithreaded programming which involves blocking IO, you already know the problem at hand. Right from the hardware level to the software level, a common problem that happens is: IO is slower than the CPU. If we have several tasks to finish, and the current task being executed is waiting for a blocking IO to finish, we should ideally work on the other tasks and let that blocking IO finish in the background, and check on it later.

When we have several such operations happening in the background, we need a way to figure out when a particular operation (such as read, write, accept a connection), can be performed without blocking, so that we can quickly do that, and return to other things. select(2), poll(2), epoll(4), kqueue(2) (on *BSD systems), are one of the several ways to do it. In essence, you register a set of file descriptors that you are interested in, and then call one of these ‘backends’. They would usually block until either one of the fd-s that you are interested in, is ready for data to be read or written to it. If none of them is ready, it would block for a configured amount of time and then return.

The problem that libevent solves is, it provides an easy to use library for notifying when an event happens on the file descriptors which you consider interesting. It also hides the real backend (select, epoll, kqueue) being used, and this helps you avoid writing platform-dependent code (eg., kqueue works only on *BSD) and if there were a new and improved backend in the future, your code would not change. It is like the JVM for asynchronous event notification system.

I only have experience with select, so my context is limited. Using select is very tedious.

select.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// Essentially a bitmask
fd_set readfds;

while (true) {
  // Zero out the bitmask
  FD_ZERO(&readfds);

  // Enable the fd that we are interested in
  FD_SET(sockfd, &readfds);

  int ret = select(sockfd + 1, &readfds, NULL, NULL, &timeout);
  if (ret < 0) {
    fprintf(stderr, "select() returned %d\n", ret);
  }

  // Is the bit corresponding to the fd enabled?
  // If yes, that fd is ready to be read from.
  if (FD_ISSET(sockfd, &readfds)) {
    // Do what we want here
  }
}

In essence, what we do here is to create a set of file descriptors (fd_set). We then run a loop where every time we set all the file descriptors we are interested in, into that set. Then we call select(), and it either times out, or one of the bits in that set would be set. We have to check for each of the file descriptors. This makes it a little ugly. Other backends might be more pleasant to use, but libevent is way ahead of select in terms of usability. Here is some sample code:

libevent-1.cpp
1
2
3
4
event_base* eb = event_base_new();
event* t = event_new(eb, sockfd, EV_READ | EV_PERSIST, acceptConnCob, NULL);
event_add(t, NULL);
event_base_dispatch(eb);

An event_base represents a single event loop like the one that we saw in the select example. To subscribe to changes in a file descriptor, we will first create an event. This can be done using event_new, which takes in the event base we created, the file descriptor, flags signalling when the event is active, the callback method and arguments to the callback method. In this particular example, we ask that the acceptConnCob method be called when the file descriptor is ready for reading (EV_READ) and persist this event, i.e, even when the event fires, automatically add it for the next time (EV_PERSIST is to be used here). Note that we had to add the file descriptors in every iteration of the while loop of the select example, so using the EV_PERSIST flag is a slight convenience here. Once, I have created the event, I need to add that event to the event_base it should be processed by, along with a timeout, using the event_add method. If the fd doesn’t become active by the specified time, the callback will be called anyways. Finally, to get things running, we will ‘dispatch’ the event base, which will spawn a new thread to run the event loop.

Note that nowhere have I specified which backend to use. I don’t need to. However, there are ways to prefer or avoid certain backends in libevent, using the event_config struct. Refer to the link in the end.

I can add multiple events, and there are a lot of options that you can use. For instance, I can create a timer by passing -1 as a file descriptor with the EV_TIMEOUT and EV_PERSIST flags, and the required timeout in event_add. This would call the timeout callback every ‘timeout’ seconds. Example:

libevent-2.cpp
1
2
3
event* e = event_new(eb, -1, EV_TIMEOUT | EV_PERSIST, pickFortuneCookieCob, NULL);
timeval twoSec = {2, 0};
event_add(e, &twoSec);

I created a simple fortune cookie server (one of my favorite demo examples), where I have a set of messages, and if someone asks me for a fortune cookie, I will give them the current fortune cookie. Every few seconds, I will pick a new fortune cookie to be returned. This is implemented by having two events, one for accepting connections and the other for having a timer. The code is here.

One small thing to keep in mind is that if the callbacks themselves to do something heavy, then it defeats the purpose of using libevent. This is because the callbacks are called from the same thread as the actual event loop. The longer the callback runs, the longer the event loop is not able to process other events.

libevent allows you to do a lot of customizations. In the above example, I have added callbacks to override the logging mechanism, so that I can use glog (Google’s logging library). There are several other features such buffered events and a lot of utility methods (such as creating simple http servers), that you can find in the libevent book.

There are other similar async event notification systems such as libev, and libuv, etc. but I haven’t had the time to figure out the differences. I hope to cover the very interesting wrapper around libevent in folly, Facebook’s open source C++ library, in a future post.

Notes of a Software Engineer - Understand the Problem

| Comments

This is my first post on this blog, which doesn’t discuss any specific problem / technology. As I have been spending time writing code, I feel that it is a good practice to sit back and introspect once in a while. This post has some thoughts, which I hope will make people sit up and notice if they are making the same mistakes. Because this is a long-ish post, some might not read it completely, or some might not be able to relate to the content right now. Regardless, I feel eventually a significant number of us will make the mistakes that I have made in the past, and learn from it first hand. Whatever the case may be, if this makes any sense to you, please share your thoughts and experiences with me :)

As Engineers I feel we are often excited to work on new and ambitious projects. I am specifically talking about non-trivial projects which break new ground, and/or have a reasonable change of not succeeding. The latter could be because it is often the case that, these projects are complicated enough and its hard to be exact with respect to the benefits. These projects might also touch certain areas of the system which are hazy in general.

‘Hazy’ doesn’t really imply that that particular area / part of the system, is naturally hard to understand. It could be just that we don’t know the problem well enough, and how it interacts with those ‘hazy’ areas. I cannot stress enough that it is critical to understand the problem really well before hand. It seems clichéd, and has been repeated so many times, that it will probably not make a good enough impact. So, I will repeat this again in detail, so it stays with you and me, a little longer.

Understand The Problem

As per Prof. Bender, when giving a presentation, making sure that people understand why we did what we did, is the most important thing. Extending this backwards, ever wondered if that problem really needs to be solved in the first place? A lot of times, as a new CS graduate, working on my first full-time unsupervised big tasks, I would really be in awe of the supposed problem. Looking with rose-tinted glasses, you feel that this is what you had told the recruiter and interviewers that you wanted to do in the job. Excellent, lets start working on it. And if you do this, and just jump into this directly, you are going to have a bad time.

Often I did not spend enough time understanding why exactly was I doing what I was doing. Do benchmarks show that this is really needed? Do I have a good enough prototype which shows that if I do what I am going to do, it will give us significant benefits? Do people need this? Has this problem been solved before? What is the minimum I can do to solve this reasonably, and move on to other bigger problems?

This proactive research is what I feel is the difference between new and experienced engineers. In fact, I think, in some cases senior engineers write LESS code than the less experienced ones and still get more things done. Its now clear to me, that the actual coding should only take 10% of the time allocated to the project. If I spend enough time doing my due-diligence and am ‘lazy’, I can simply prune some potential duds much before they turn into huge time sinks. If I spend some more time on the problems which actually require my time, I can figure out things I can do to reduce the scope of the problem, or cleverly use pre-built solutions to do part/most of the work. All this can only come if we (and I will repeat again) Understand. The. Problem.

(Please let me know if you agree or disagree with me about what I said. I would love to hear back).

[0] Sloth picture courtesy: http://en.wikipedia.org/wiki/File:SlothDWA.jpg