By Adit Deshpande, UCLA.
Starting this week, I’ll be doing a new series called Deep Learning Research Review. Every couple weeks or so, I’ll be summarizing and explaining research papers in specific subfields of deep learning. This week I’ll begin with Generative Adversarial Networks.
According to Yann LeCun, “adversarial training is the coolest thing since sliced bread”. I’m inclined to believe so because I don’t think sliced bread ever created this much buzz and excitement within the deep learning community. In this post, we’ll be looking at 3 papers that built on the pioneering work of Ian Goodfellow in 2014.
Quick Summary of GANs
I briefly mentioned Ian Goodfellow’s Generative Adversarial Network paper in one of my prior blog posts, 9 Deep Learning Papers You Should Know About. The basic idea of these networks is that you have 2 models, a generative model and a discriminative model. The discriminative model has the task of determining whether a given image looks natural (an image from the dataset) or looks like it has been artificially created. The task of the generator is to create natural looking images that are similar to the original data distribution. This can be thought of as a zero-sum or minimax two player game. The analogy used in the paper is that the generative model is like “a team of counterfeiters, trying to produce and use fake currency” while the discriminative model is like “the police, trying to detect the counterfeit currency”. The generator is trying to fool the discriminator while the discriminator is trying to not get fooled by the generator. As the models train through alternating optimization, both methods are improved until a point where the “counterfeits are indistinguishable from the genuine articles”.
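To make the minimax game a bit more concrete, here is a minimal sketch of the value function from Goodfellow's paper, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator tries to maximize and the generator tries to minimize. The function name `gan_value` and the sample discriminator outputs are made up for illustration; this is a toy estimate over a handful of numbers, not a training loop.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real images, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated images, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Illustrative numbers only: a confident, correct discriminator pushes the value toward 0.
strong_d = gan_value(d_real=[0.9, 0.95], d_fake=[0.05, 0.1])

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives log(0.5) + log(0.5) = -log(4).
fooled_d = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The second case corresponds to the point where the counterfeits are indistinguishable from the genuine articles: the discriminator can do no better than guessing.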
Laplacian Pyramid of Adversarial Networks
So, one of the most important uses of adversarial networks is the ability to create natural looking images after training the generator for a sufficient amount of time. These are some samples of what the generator outputted in Goodfellow’s 2014 paper.
As you can see, the generator worked well with digits and faces, but it created very fuzzy and vague images when using the CIFAR-10 dataset.
In order to fix this problem, Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus published the paper titled “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks”. The main contribution of the paper is a type of network architecture that produces high-quality generated images that are mistaken for real images almost 40% of the time when assessed by human evaluators.
Before getting into the paper, let’s think about the job of the generator in a GAN. It has to produce a large, complex, and natural image that is good enough to convince a trained discriminator. Not such an easy task in one shot. The way the authors combat this is by using multiple CNN models to sequentially generate images at increasing scales. As Emily Denton said in her talk on LAPGANs,
The approach of this paper is to build a Laplacian Pyramid of generative models. For those who aren’t familiar, a Laplacian pyramid is basically an image representation that consists of a series of filtered images at successively sparser densities. The idea is that each level of this pyramid representation contains information about the image at a particular scale. It is a sort of decomposition of the original image.
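Here is a small sketch of that decomposition, assuming a square image stored as nested lists whose side length is divisible by 2 at every level. Each level keeps the detail that is lost when the image is shrunk, and the last entry is the small leftover image. To keep it short, I use 2x2 averaging and pixel repetition in place of the Gaussian filtering a real Laplacian pyramid would use, and all of the function names are my own.

```python
def downsample(img):
    """Halve each dimension by averaging 2x2 blocks (a simple stand-in for blur + decimate)."""
    h, w = len(img), len(img[0])
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def upsample(img):
    """Double each dimension by repeating pixels (nearest neighbour)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def build_laplacian_pyramid(img, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarse]: per-scale residuals plus the smallest image."""
    pyramid = []
    current = img
    for _ in range(levels):
        smaller = downsample(current)
        # The residual is whatever the shrunken image cannot represent at this scale.
        pyramid.append(subtract(current, upsample(smaller)))
        current = smaller
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition: start from the coarse image and add detail back level by level."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = add(upsample(current), residual)
    return current

# Toy 4x4 "image" with pixel values 0..15.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pyramid = build_laplacian_pyramid(image, levels=2)
restored = reconstruct(pyramid)
```

The useful property is that nothing is lost: adding the residuals back onto the upsampled coarse image recovers the original, which is why generating one level at a time can still produce a full-resolution image.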
Let’s review the inputs and outputs of a simple GAN. The generator takes in an input noise vector from a distribution and outputs an image. The discriminator takes in this image (or a real image from the training data) and outputs a scalar describing how “real” the image is. Now, let’s look at a conditional GAN (CGAN). Everything remains the same, except that both the discriminator and the generator receive another piece of information as an input. This information typically takes the form of a class label or another image.
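The difference between the two interfaces can be sketched with toy models. Everything here is an assumption made for illustration: the sizes (`NOISE_DIM`, `IMG_DIM`, `NUM_CLASSES`), the single random linear layer standing in for a real network, and the choice of a one-hot class label as the conditioning information. The only point is to show what goes in and what comes out.

```python
import math
import random

NOISE_DIM, IMG_DIM, NUM_CLASSES = 8, 16, 10  # illustrative sizes, not from any paper

def make_weights(rng, n_out, n_in):
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def linear(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def one_hot(label):
    return [1.0 if i == label else 0.0 for i in range(NUM_CLASSES)]

class Generator:
    """Noise (plus optional conditioning vector) in, flattened image out."""
    def __init__(self, rng, cond_dim=0):
        self.w = make_weights(rng, IMG_DIM, NOISE_DIM + cond_dim)

    def __call__(self, z, cond=()):
        return [math.tanh(v) for v in linear(self.w, list(z) + list(cond))]

class Discriminator:
    """Image (plus optional conditioning vector) in, scalar "realness" score in (0, 1) out."""
    def __init__(self, rng, cond_dim=0):
        self.w = make_weights(rng, 1, IMG_DIM + cond_dim)

    def __call__(self, image, cond=()):
        logit = linear(self.w, list(image) + list(cond))[0]
        return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]

# Plain GAN: the generator sees only noise, the discriminator sees only the image.
g, d = Generator(rng), Discriminator(rng)
fake = g(z)
score = d(fake)

# Conditional GAN: both networks also receive the same extra information.
label = one_hot(3)
cg, cd = Generator(rng, cond_dim=NUM_CLASSES), Discriminator(rng, cond_dim=NUM_CLASSES)
cond_fake = cg(z, label)
cond_score = cd(cond_fake, label)
```

These models are untrained, so the scores mean nothing. What matters is that the conditional versions have the same outputs as before and simply take one more input.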
The authors propose a set of convnet models, one for each level of the pyramid. The change to the traditional GAN structure is that instead of having just one generator CNN that creates the whole image, we have a series of CNNs that create the image sequentially by slowly increasing the resolution (aka going along the pyramid) and refining images in a coarse to fine fashion. Each level has its own CNN and is trained on two inputs. One is a low resolution image and the other is a noise vector (which was the only input in traditional GANs). This is where the idea of CGANs comes into play, as there are multiple inputs. The output will be a generated image that is then upsampled and used as input to the next level of the pyramid. This method is effective because the generators in each level are able to utilize information from different resolutions in order to create more finely grained outputs in the successive levels.
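The coarse to fine sampling loop can be sketched as follows. The two generator functions are hypothetical stand-ins for trained CNNs and simply transform noise; in a real LAPGAN each level would have its own trained network, whereas this sketch reuses one stub for every refinement step. I am also assuming, as I read the paper, that each conditional generator outputs the detail to add on top of the upsampled image rather than a whole new image. The sizes are illustrative.

```python
import math
import random

def upsample(img):
    """Double each dimension by repeating pixels (nearest neighbour)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def noise_image(size, rng):
    """Square grid of Gaussian noise, standing in for the noise input z."""
    return [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]

def coarse_generator(z):
    """Hypothetical stand-in for the coarsest generator: noise -> small image."""
    return [[math.tanh(v) for v in row] for row in z]

def residual_generator(z, low_res):
    """Hypothetical stand-in for a conditional generator: (noise, upsampled image) -> detail."""
    return [[0.1 * math.tanh(n + p) for n, p in zip(z_row, img_row)]
            for z_row, img_row in zip(z, low_res)]

def sample_coarse_to_fine(coarse_size, num_refinements, rng):
    # Coarsest level: an ordinary GAN step, where noise is the only input.
    image = coarse_generator(noise_image(coarse_size, rng))
    for _ in range(num_refinements):
        # Each finer level is a conditional step: noise plus the upsampled image so far.
        upsampled = upsample(image)
        z = noise_image(len(upsampled), rng)
        detail = residual_generator(z, upsampled)
        image = [[u + d for u, d in zip(u_row, d_row)]
                 for u_row, d_row in zip(upsampled, detail)]
    return image

# Two refinement steps take an 8x8 starting image to 16x16 and then 32x32.
sample = sample_coarse_to_fine(coarse_size=8, num_refinements=2, rng=random.Random(0))
```

Each generator only has to solve a small problem: the first invents a tiny image, and every later one adds plausible detail to an image that already exists.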