# A Gentle Intro to PyTorch

PyTorch is a fairly new deep-learning framework released by Facebook, and the steady stream of new frameworks reminds me of the JS framework frenzy. But having played around with PyTorch a bit, I already find it fun.

To keep things short, I liked it because:

1. Unlike TensorFlow it allows me to easily print Tensors on the screen (no seriously, this is a big deal for me since I usually take several iterations to get a DL implementation right).
2. TensorFlow adds a layer between Python and TensorFlow, and even has its own variable scope. That is more abstraction than I appreciate for my experimental interests.
3. Interop with numpy is easy in PyTorch, with the simple .numpy() suffix to convert a Tensor to a numpy array.
4. Unlike Torch, it is not in Lua (also doesn’t need the LuaRocks package manager).
5. Unlike Caffe2, I don’t have to write C++ code and write build scripts.

PyTorch’s website has a 60 min. blitz tutorial, which is laid out pretty well.

Here is the summary to get you started on PyTorch:

• torch.Tensor is your np.array (the NumPy array). torch.Tensor(3,4) will create an uninitialized Tensor of shape (3,4).
• All the functions are pretty standard. For example, torch.rand can be used to generate random Tensors.
• Indexing in Tensors is pretty similar to NumPy as well.
• .numpy() allows converting Tensor to a numpy array.
• For the purpose of a compute graph, PyTorch lets you create Variables, which are similar to placeholders in TF.
• Creating a compute graph and computing the gradient is pretty easy (and automatic).
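
The tensor basics from the list above can be sketched in a few lines. This is only an illustration with made-up values; the variable names are mine and not from the tutorial.

```python
import numpy as np
import torch

x = torch.rand(3, 4)            # values drawn uniformly from [0, 1)
print(x)                        # tensors print directly, no session needed

first_row = x[0]                # NumPy-style indexing, shape (4,)
second_col = x[:, 1]            # shape (3,)

arr = x.numpy()                 # NumPy array sharing memory with x (CPU tensors)
back = torch.from_numpy(arr)    # and back to a tensor

arr[0, 0] = 42.0                # the change is visible through x as well
print(x[0, 0])
```

Note that .numpy() does not copy: the array and the Tensor share the same memory, so writing to one changes the other.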

Computing the gradient takes only a few lines when x is a variable: calling the backward method on the scalar out computes gradients all the way back to x. This is amazing!
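
A minimal sketch of such a gradient computation, modeled on the kind of example the tutorial uses. One assumption to flag: since PyTorch 0.4, Variable has been merged into Tensor, so this sketch uses a plain tensor with requires_grad=True instead of wrapping it in a Variable.

```python
import torch

# x plays the role of the Variable; requires_grad=True asks autograd to track it.
x = torch.ones(2, 2, requires_grad=True)

y = x + 2
z = y * y * 3
out = z.mean()          # scalar: mean of 3 * (x + 2)^2 over 4 elements

out.backward()          # backprop from out all the way to x

# d(out)/dx = 6 * (x + 2) / 4 = 4.5 for every element when x = 1
print(x.grad)
```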

• For NN, there is an nn.Module which wraps around the boring boilerplate.
• I wrote a simple NN for classifying the MNIST dataset, instead of the CIFAR-10 dataset that the tutorial uses.

In the __init__ method, we just need to define the various NN layers we are going to use. The forward method then just runs the input through them. The view method is analogous to the NumPy reshape method.

The backward pass is computed automatically, and the gradients are applied after it. The code is fairly easy to understand.
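
Here is a sketch of what such a network could look like for MNIST. It is an assumption on my part that the layer sizes follow the tutorial's small convolutional net, adapted to one input channel and 28x28 images; the exact architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Only the layers are declared here.
        self.conv1 = nn.Conv2d(1, 6, 5)        # 1 input channel (grayscale MNIST)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)           # 10 digit classes

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # 28x28 -> 24x24 -> 12x12
        x = self.pool(F.relu(self.conv2(x)))   # 12x12 -> 8x8 -> 4x4
        x = x.view(-1, 16 * 4 * 4)             # flatten, like NumPy's reshape
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                     # raw class scores


net = Net()
scores = net(torch.zeros(2, 1, 28, 28))        # a dummy batch of two images
print(scores.shape)
```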

• PyTorch (through the torchvision package) also keeps track of how to retrieve standard datasets such as CIFAR-10, MNIST, etc.
• After getting the data from the data loader, you can proceed to play with it. The remaining pieces required to complete the implementation come almost entirely from the tutorial.

The criterion object is used to compute your loss function. optim has a bunch of gradient-based optimization algorithms such as vanilla SGD, Adam, etc. As promised, simply calling the backward method on the loss object computes the gradients.
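
The criterion / optimizer / backward pattern can be sketched as follows. To keep the sketch self-contained, a random batch stands in for the MNIST data loader and a single linear layer stands in for the network; both are placeholders of mine, as are the learning rate and momentum values.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-ins (assumptions): in practice the batch would come from a DataLoader
# over torchvision.datasets.MNIST, and net would be the real network.
net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
inputs = torch.randn(64, 1, 28, 28)
labels = torch.randint(0, 10, (64,))

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

losses = []
for step in range(20):
    optimizer.zero_grad()               # clear gradients from the previous step
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()                     # compute gradients
    optimizer.step()                    # apply them to the parameters
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Swapping in another optimizer, say optim.Adam(net.parameters()), changes only the line that constructs it; the loop stays the same.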

• If you are working through the tutorial, try to solve it for MNIST data instead of CIFAR-10.
• Instead of the 3-channel (RGB) 32x32 pixel images of CIFAR-10, the MNIST images are single-channel 28x28 pixel images.

Overall, I could get to 96% accuracy, with the current setup. The complete gist is here.

# Back to Basics: Linear Regression

Taking inspiration from Werner Vogels’ ‘Back to Basics’ blog posts, here is one of my own posts about fundamental topics. While on a long-haul flight with no internet connectivity, having exhausted the books on my Kindle, and with hardly any in-flight entertainment, I decided to code up Linear Regression in Python. Let’s look at both the theory and the implementation.

## Theory

Essentially, in Linear Regression, we try to estimate a dependent variable $y$ from independent variables $x_1$, $x_2$, $x_3$, $\dots$, using a linear model.

More formally, $y = b + W_1 x_1 + W_2 x_2 + \dots + W_n x_n + \epsilon$, where $W$ is the weight vector, $b$ is the bias term, and $\epsilon$ is the noise in the data.

It can be used when there is a linear relationship between the input $X$ (the vector containing all the $x_i$) and $y$.
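
To make the model concrete, here is a minimal sketch that generates data from $y = b + W \cdot x + \epsilon$ and recovers $W$ and $b$. It uses NumPy's least-squares solver as a stand-in, which is an assumption and not necessarily how the implementation in this post goes; the true weights, bias, and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up ground truth for the illustration.
true_W = np.array([2.0, -1.0, 0.5])
true_b = 4.0

X = rng.normal(size=(200, 3))                       # 200 samples, 3 features
noise = rng.normal(scale=0.1, size=200)             # the epsilon term
y = true_b + X @ true_W + noise

# Prepend a column of ones so the bias is learned as an extra weight.
X_aug = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # minimizes squared error

b_hat, W_hat = params[0], params[1:]
print("bias:", b_hat)
print("weights:", W_hat)
```

With low noise and a truly linear relationship, the estimates land close to the true values; when the relationship is not linear, the fit degrades no matter how much data is available.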

# Paper: Wide & Deep Learning for Recommender Systems

There seems to be an interesting new model architecture for ranking & recommendation, developed by Google Research. It uses Logistic Regression & Deep Learning in a single model.

This is different from ensemble models, where a sub-model is trained separately and its score is used as a feature for the parent model. In this paper, the authors jointly learn a wide model (Logistic Regression, which is trying to “memorize”) and a deep model (Deep Neural Network, which is trying to “generalize”).

The inputs to the wide network are standard features, while the deep network takes dense embeddings of the document to be scored as input.
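
The joint setup can be sketched roughly as below. This is my own simplified reading and not the authors' implementation: the class name, layer sizes, and toy inputs are hypothetical, and a single loss is backpropagated through both parts.

```python
import torch
import torch.nn as nn


class WideAndDeep(nn.Module):
    """Hypothetical minimal model: a linear 'wide' part plus an embedding-based 'deep' part."""

    def __init__(self, num_wide_features, num_items, embedding_dim=8, hidden_dim=16):
        super().__init__()
        self.wide = nn.Linear(num_wide_features, 1)             # logistic-regression part
        self.embedding = nn.Embedding(num_items, embedding_dim)  # dense item embeddings
        self.deep = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, wide_x, item_ids):
        # The two parts contribute to one logit, so they are trained jointly.
        logit = self.wide(wide_x) + self.deep(self.embedding(item_ids))
        return logit.squeeze(1)


torch.manual_seed(0)
model = WideAndDeep(num_wide_features=5, num_items=100)

wide_x = torch.rand(4, 5)                       # toy "standard" features
item_ids = torch.randint(0, 100, (4,))          # toy ids of the items to score
clicked = torch.tensor([1.0, 0.0, 0.0, 1.0])    # toy labels

logits = model(wide_x, item_ids)
loss = nn.BCEWithLogitsLoss()(logits, clicked)
loss.backward()                                 # gradients reach both parts at once
```

Because the two logits are summed before the loss, each part only has to explain what the other does not, which is the difference from an ensemble of separately trained models.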

The main benefits as per the authors, are:

1. DNNs can learn to over-generalize, while LR models are limited in how much they can memorize from the training data.

2. Learning the models jointly means that the ‘wide’ and ‘deep’ parts are aware of each other, and the ‘wide’ part only needs to augment the ‘deep’ part.

3. Also, training jointly helps reduce the size of the individual models.

They also have a TensorFlow implementation, and there is a talk on this topic.

The authors employed this model to recommend apps to users in Google Play, where they drove up app installs by 3.9% using the Wide & Deep model.

However, the Deep model by itself drove up installs by 2.9%. It is natural to expect that the ‘wide’ part of the model should help further improve the metric being optimized, but it is unclear to me whether the same delta could have been achieved by further bulking up the ‘deep’ part (i.e., adding more layers or training higher-dimensional embeddings as inputs to the DNN).