Best of arXiv.org for AI, Machine Learning, and Deep Learning – December 2017

In this recurring monthly feature, we filter recent research papers appearing on the arXiv.org preprint server for compelling subjects relating to AI, machine learning and deep learning, drawn from disciplines including statistics, mathematics and computer science, and provide you with a useful “best of” list for the past month. Researchers from all over the world contribute to this repository as a prelude to the peer review process for publication in traditional journals. arXiv contains a veritable treasure trove of learning methods you may one day use to solve data science problems. We hope to save you some time by picking out the articles that show the most promise for the typical data scientist. The articles listed below represent a fraction of all articles appearing on the preprint server. They are listed in no particular order, each with a link to the paper and a brief overview. Especially relevant articles are marked with a “thumbs up” icon. Keep in mind that these are academic research papers, typically geared toward graduate students, postdocs, and seasoned professionals. They generally contain a high degree of mathematics, so be prepared. Enjoy!

Spatial PixelCNN: Generating Images from Patches

This is a very cool paper in the computer vision space that proposes Spatial PixelCNN, a conditional autoregressive model that generates images from small patches. By conditioning on a grid of pixel coordinates and global features extracted from a Variational Autoencoder (VAE), the authors are able to train on patches of images and reproduce the full-sized image. They show that the technique not only generates high quality samples at the same resolution as the underlying dataset, but is also capable of up-scaling images to arbitrary resolutions (tested at resolutions up to 50×) on the MNIST dataset. Compared to a PixelCNN++ baseline, Spatial PixelCNN achieves similar performance on the MNIST dataset, both quantitatively and qualitatively.
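To make the idea of training on patches plus a coordinate grid a little more concrete, here is a minimal sketch of how such a training example might be assembled. The function name, the patch size, and the choice to normalize coordinates to [0, 1] are assumptions made for illustration; the VAE features and the autoregressive model itself are left out.

```python
import random

def sample_patch_with_coords(image, patch_size, rng):
    """Cut a random square patch and pair each pixel with its normalized (row, col) position."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)
    patch, coords = [], []
    for r in range(top, top + patch_size):
        patch.append(image[r][left:left + patch_size])
        # Assumption: normalize to [0, 1] so one coordinate system covers any resolution.
        coords.append([(r / (height - 1), c / (width - 1))
                       for c in range(left, left + patch_size)])
    return patch, coords

# Synthetic 28x28 image standing in for an MNIST digit.
image = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
patch, coords = sample_patch_with_coords(image, patch_size=8, rng=random.Random(0))
```

Because each pixel carries its own position, a model conditioned this way could in principle be queried on a denser coordinate grid than it was trained on, which is one plausible reading of how the up-scaling works.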

Adversarial Patch

A group of Google researchers presents a method to create universal, robust, targeted adversarial image patches in the real world. The patches are universal because they can be used to attack any scene, robust because they work under a wide variety of transformations, and targeted because they can cause a classifier to output any target class. These adversarial patches can be printed, added to any scene, photographed, and presented to image classifiers; even when the patches are small, they cause the classifiers to ignore the other items in the scene and report a chosen target class.
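As a rough illustration of the "added to any scene" step, the sketch below pastes a patch into an image at a random location. It is a simplified assumption of what a patch application step could look like: rotation, scaling, and the optimization of the patch against a classifier are all omitted, and the names are hypothetical.

```python
import random

def apply_patch(image, patch, rng):
    """Paste `patch` over `image` at a random location; return (new_image, (top, left))."""
    height, width = len(image), len(image[0])
    p_height, p_width = len(patch), len(patch[0])
    top = rng.randrange(height - p_height + 1)
    left = rng.randrange(width - p_width + 1)
    patched = [row[:] for row in image]  # copy so the original scene is untouched
    for i in range(p_height):
        for j in range(p_width):
            patched[top + i][left + j] = patch[i][j]
    return patched, (top, left)

scene = [[0.0] * 32 for _ in range(32)]  # blank grayscale scene
patch = [[1.0] * 6 for _ in range(6)]    # placeholder; a real patch would be optimized
patched, (top, left) = apply_patch(scene, patch, random.Random(0))
```

Sampling a fresh location (and, in a fuller version, a fresh transformation) on every training step is one natural way to encourage a patch that does not depend on where it ends up in the scene.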

Visualizing the Loss Landscape of Neural Nets

Neural network training relies on our ability to find “good” minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that are easier to train, and that well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, are not well understood. This paper explores the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods.
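One common way to look at a loss landscape is to evaluate the loss on a 2D slice through parameter space, spanned by two random directions around a trained point. The sketch below does this for a toy loss. The loss function, the parameter values, and the plain unit-norm directions are assumptions for illustration; whatever direction normalization the paper uses is not reproduced here.

```python
import math
import random

def toy_loss(theta):
    """A small non-convex stand-in for a network's training loss."""
    return sum(t * t + math.sin(3.0 * t) for t in theta)

def random_unit_direction(dim, rng):
    """Draw a Gaussian random vector and scale it to unit length."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]

def loss_slice(loss, theta, d1, d2, steps, radius):
    """Evaluate loss(theta + a*d1 + b*d2) on a (2*steps+1)^2 grid of (a, b) values."""
    offsets = [radius * i / steps for i in range(-steps, steps + 1)]
    return [[loss([t + a * u + b * v for t, u, v in zip(theta, d1, d2)])
             for b in offsets] for a in offsets]

rng = random.Random(0)
theta = [0.5, -0.2, 0.1, 0.3]  # hypothetical trained parameters
d1 = random_unit_direction(len(theta), rng)
d2 = random_unit_direction(len(theta), rng)
surface = loss_slice(toy_loss, theta, d1, d2, steps=10, radius=1.0)
```

The resulting grid can be handed to any contour or surface plotting tool; the center cell corresponds to the unperturbed parameters.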

Ray RLLib: A Composable and Scalable Reinforcement Learning Library

Reinforcement learning (RL) algorithms involve the deep nesting of distinct components, where each component typically exhibits opportunities for distributed computation. Current RL libraries offer parallelism at the level of the entire program, coupling all the components together and making existing implementations difficult to extend, combine, and reuse. This paper argues for building composable RL components by encapsulating parallelism and resource requirements within individual components, which can be achieved by building on top of a flexible task-based programming model. The authors demonstrate this principle by building Ray RLLib on top of Ray and show how to implement a wide range of state-of-the-art algorithms by composing and reusing a handful of standard components. Ray RLLib is available as part of Ray on GitHub.

Gradients explode – Deep Networks are shallow – ResNet explained

Techniques such as Adam, batch normalization and, more recently, SeLU nonlinearities are widely believed to “solve” the exploding gradient problem. This paper shows that this is not the case in general: in a range of popular MLP architectures, exploding gradients exist and limit the depth to which networks can be effectively trained, both in theory and in practice. The authors explain why exploding gradients occur and highlight the collapsing domain problem, which can arise in architectures that avoid exploding gradients. ResNets have significantly lower gradients and can thus circumvent the exploding gradient problem, enabling the effective training of much deeper networks; the authors show that this is a consequence of a surprising mathematical property. By noticing that any neural network is a residual network, the paper devises the residual trick, which reveals that introducing skip connections simplifies the network mathematically, and that this simplicity may be the major cause of their success.

Deep Extreme Cut: From Extreme Points to Object Segmentation

This paper explores the use of extreme points in an object (left-most, right-most, top, bottom pixels) as input to obtain precise object segmentation for images and videos. The authors do so by adding an extra channel to the image in the input of a convolutional neural network (CNN), which contains a Gaussian centered in each of the extreme points. The CNN learns to transform this information into a segmentation of an object that matches those extreme points. The paper demonstrates the usefulness of this approach for guided segmentation (grabcut-style), interactive segmentation, video object segmentation, and dense segmentation annotation.
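The extra input channel described above can be sketched in a few lines. In this illustrative version the extreme points are derived from a toy binary mask rather than clicked by a user, and the Gaussian width (`sigma`) and the use of a pixel-wise maximum to combine the four Gaussians are assumptions, not details taken from the paper.

```python
import math

def extreme_points(mask):
    """Return the left-most, right-most, top, and bottom pixels (row, col) of a binary mask."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    return [min(pixels, key=lambda p: p[1]), max(pixels, key=lambda p: p[1]),
            min(pixels, key=lambda p: p[0]), max(pixels, key=lambda p: p[0])]

def gaussian_heatmap(height, width, points, sigma):
    """Build one channel holding the pixel-wise max of 2D Gaussians centered on each point."""
    heatmap = [[0.0] * width for _ in range(height)]
    for pr, pc in points:
        for r in range(height):
            for c in range(width):
                value = math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2.0 * sigma ** 2))
                heatmap[r][c] = max(heatmap[r][c], value)
    return heatmap

# Toy object: a filled rectangle spanning rows 3..6 and columns 2..8 of a 10x12 image.
mask = [[1 if 3 <= r <= 6 and 2 <= c <= 8 else 0 for c in range(12)] for r in range(10)]
points = extreme_points(mask)
extra_channel = gaussian_heatmap(10, 12, points, sigma=2.0)
```

The `extra_channel` grid would then be stacked with the image's color channels before being fed to the CNN.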

Bayesian GAN

Generative adversarial networks (GANs) can implicitly learn rich distributions over images, audio, and data that are hard to model with an explicit likelihood. This paper presents a practical Bayesian formulation for unsupervised and semi-supervised learning with GANs. Within this framework, the authors use stochastic gradient Hamiltonian Monte Carlo to marginalize the weights of the generator and discriminator networks. The resulting approach is straightforward and obtains good performance without any standard interventions such as feature matching or mini-batch discrimination. By exploring an expressive posterior over the parameters of the generator, the Bayesian GAN avoids mode-collapse, produces interpretable and diverse candidate samples, and provides state-of-the-art quantitative results for semi-supervised learning on benchmarks including SVHN, CelebA, and CIFAR-10, outperforming DCGAN, Wasserstein GANs, and DCGAN ensembles.
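For readers unfamiliar with stochastic gradient Hamiltonian Monte Carlo, the toy sketch below applies a commonly used form of the update (a momentum step with friction and injected Gaussian noise) to a one-dimensional standard normal target. This is only meant to show the shape of the sampler: the exact variant used in the paper is not reproduced, the gradient here is exact rather than estimated from mini-batches, and the step size and friction values are arbitrary assumptions.

```python
import math
import random

def sghmc_samples(grad_u, theta0, n_steps, step_size, friction, rng):
    """Draw samples from exp(-U(theta)) for scalar theta using SGHMC-style updates."""
    theta, velocity = theta0, 0.0
    noise_std = math.sqrt(2.0 * friction * step_size)
    samples = []
    for _ in range(n_steps):
        theta += velocity
        # Gradient step, friction on the velocity, and injected noise.
        velocity += (-step_size * grad_u(theta) - friction * velocity
                     + rng.gauss(0.0, noise_std))
        samples.append(theta)
    return samples

# Toy target: standard normal, U(theta) = theta^2 / 2, so grad U(theta) = theta.
samples = sghmc_samples(lambda t: t, theta0=0.0, n_steps=20000,
                        step_size=0.01, friction=0.1, rng=random.Random(0))
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

In the Bayesian GAN setting, `theta` would be the full weight vector of the generator or discriminator, and averaging over the collected samples is what marginalizing the weights amounts to in practice.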

Deep Unsupervised Clustering Using Mixture of Autoencoders

Unsupervised clustering is one of the most fundamental challenges in machine learning. A popular hypothesis is that data are generated from a union of low-dimensional nonlinear manifolds; thus an approach to clustering is identifying and separating these manifolds. This paper presents a novel approach to solve this problem by using a mixture of autoencoders. The model consists of two parts: 1) a collection of autoencoders where each autoencoder learns the underlying manifold of a group of similar objects, and 2) a mixture assignment neural network, which takes the concatenated latent vectors from the autoencoders as input and infers the distribution over clusters. By jointly optimizing the two parts, the authors simultaneously assign data to clusters and learn the underlying manifolds of each cluster.

Non-convex Optimization for Machine Learning

A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed, or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks. This paper leads the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal is both to introduce the rich literature in this area and to equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.
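As one standard example of a simple procedure for a problem with a sparsity constraint, the sketch below implements iterative hard thresholding: gradient descent on a least-squares objective, followed at each step by a projection onto vectors with at most k nonzero entries. It is offered as a generic illustration of the kind of technique involved, not necessarily in the form the paper presents it, and the toy problem is deliberately trivial.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def iterative_hard_thresholding(A, y, k, step_size, n_iters):
    """Projected gradient descent on 0.5 * ||A x - y||^2 with at most k nonzeros in x."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iters):
        residual = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
                    for i in range(n_rows)]
        gradient = [sum(A[i][j] * residual[i] for i in range(n_rows))
                    for j in range(n_cols)]
        x = hard_threshold([x[j] - step_size * gradient[j] for j in range(n_cols)], k)
    return x

# Toy problem with an identity measurement matrix, so the answer is easy to check by hand.
A = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
x_true = [0.0, 3.0, 0.0, -2.0, 0.0]
y = x_true[:]  # equals A times x_true because A is the identity
x_hat = iterative_hard_thresholding(A, y, k=2, step_size=1.0, n_iters=5)
```

The constraint set (k-sparse vectors) is non-convex, yet the projection onto it is cheap, which is what makes this style of method attractive.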

Improving Generalization Performance by Switching from Adam to SGD

Despite superior training outcomes, adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. This paper investigates a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate. Concretely, the authors propose SWATS, a simple strategy that switches from Adam to SGD when a triggering condition is satisfied. The proposed condition relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer.
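The summary does not spell out the triggering condition, so the sketch below is an assumption about how a projection-based switch could work rather than a statement of the paper's algorithm. It estimates the SGD learning rate gamma for which the SGD step -gamma * g, projected onto the Adam step p, equals p (which gives gamma = (p . p) / -(p . g)), tracks a bias-corrected running average of that estimate, and switches once the average and the current estimate agree. All function names, the averaging constant, and the tolerance are hypothetical.

```python
def sgd_lr_estimate(adam_step, gradient):
    """Estimate gamma so that the SGD step -gamma * g projects onto the Adam step p as p."""
    p_dot_p = sum(p * p for p in adam_step)
    p_dot_g = sum(p * g for p, g in zip(adam_step, gradient))
    return p_dot_p / -p_dot_g  # assumed form; caller must skip steps where p . g == 0

def find_switch_point(gammas, beta2=0.999, tol=1e-9):
    """Return (iteration, sgd_lr) once the bias-corrected average of gamma settles, else None."""
    running = 0.0
    for k, gamma in enumerate(gammas, start=1):
        running = beta2 * running + (1.0 - beta2) * gamma
        corrected = running / (1.0 - beta2 ** k)
        if k > 1 and abs(corrected - gamma) < tol:
            return k, corrected
    return None

# Hypothetical Adam step that happens to equal -0.1 times the gradient.
gamma = sgd_lr_estimate(adam_step=[-0.1, -0.2], gradient=[1.0, 2.0])
switch = find_switch_point([gamma] * 5)
```

Because the estimate reuses quantities the optimizer already computes (the step and the gradient), monitoring of this kind costs only a couple of dot products per iteration, which fits the low-overhead claim above.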
