Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions

Neural networks rely on convolutions to aggregate spatial information. However, spatial convolutions are expensive in terms of model size and computation, both of which grow quadratically with respect to kernel size. This paper presents a parameter-free, FLOP-free “shift” operation as an alternative to spatial convolutions. The authors report that it increases accuracy by up to 8% at the same FLOPs and recovers accuracy with 1/3 of the FLOPs for ResNets.
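
To make the idea concrete, here is a minimal sketch of a per-channel shift, assuming each channel is moved by its own fixed offset and vacated cells are filled with zeros. The function names, the nested-list tensor layout, and the way offsets are assigned to channels are illustrative assumptions, not the paper's implementation. Mixing information across channels (for example with a 1x1 convolution) is left out.

```python
def shift_channel(channel, dy, dx):
    """Shift one H x W channel by (dy, dx); vacated cells become 0 (assumed padding)."""
    height, width = len(channel), len(channel[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            # Copy only when the source position lies inside the channel.
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = channel[src_y][src_x]
    return out


def shift(tensor, offsets):
    """Apply a fixed (dy, dx) offset to each channel of a C x H x W nested list.

    The operation only moves values around: no multiplications, no learned weights.
    """
    return [shift_channel(ch, dy, dx) for ch, (dy, dx) in zip(tensor, offsets)]


# Hypothetical example: two 2 x 2 channels, one shifted right, one shifted down.
tensor = [[[1, 2], [3, 4]], [[1, 2], [3, 4]]]
shifted = shift(tensor, offsets=[(0, 1), (1, 0)])
```

Because different channels move in different directions, a later per-pixel layer can see values from neighboring positions without any spatial kernel.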

MinimalRNN: Toward More Interpretable and Trainable Recurrent Neural Networks

This article introduces MinimalRNN, a new recurrent neural network architecture that achieves performance comparable to the popular gated RNNs with a simplified structure. It employs minimal updates within the RNN, which leads not only to efficient learning and testing but, more importantly, to better interpretability and trainability. It also captures longer-range dependencies than existing RNNs.
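
The summary does not spell out the update rule. As a rough sketch, assume a single update gate that blends the previous hidden state with a transformed input. The parameter names (`W_x`, `U_h`, `U_z`, `b_z`, `b_u`), the tanh input map, and the helper functions are illustrative assumptions and should be checked against the paper.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def minimal_rnn_step(x, h_prev, params):
    """One recurrent step with a single update gate (assumed form)."""
    # Map the input into the hidden space: z = tanh(W_x x + b_z).
    z = [math.tanh(a + b) for a, b in zip(matvec(params["W_x"], x), params["b_z"])]
    # Update gate: u = sigmoid(U_h h_prev + U_z z + b_u).
    pre_gate = [
        a + b + c
        for a, b, c in zip(
            matvec(params["U_h"], h_prev), matvec(params["U_z"], z), params["b_u"]
        )
    ]
    u = [sigmoid(p) for p in pre_gate]
    # New state is an element-wise blend of the old state and the mapped input.
    return [ui * hi + (1.0 - ui) * zi for ui, hi, zi in zip(u, h_prev, z)]


# Hypothetical parameters: all zeros, so z = 0 and the gate sits at 0.5.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {"W_x": zeros, "U_h": zeros, "U_z": zeros, "b_z": [0.0, 0.0], "b_u": [0.0, 0.0]}
h_next = minimal_rnn_step([1.0, 1.0], [1.0, -2.0], params)
```

In this form each hidden unit is a gated running average of its own history and the mapped input, which is one way to read the "minimal updates" the summary mentions.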

Knowledge Concentration: Learning 100K Object Classifiers in a Single CNN

To address the challenges of many computer vision applications, this article proposes a Knowledge Concentration method, which effectively transfers the knowledge from dozens of specialists (multiple teacher networks) into one single model (one student network) to classify 100K object categories.

S4Net: Single Stage Salient-Instance Segmentation

This paper considers an interesting vision problem: salient instance segmentation. In addition to producing approximate bounding boxes, the described network also outputs high-quality instance-level segments. Taking into account the category-independent property of each target, the authors design a single-stage salient instance segmentation framework with a novel segmentation branch. The new branch considers not only the local context inside each detection window but also its surrounding context, enabling the network to distinguish instances in the same scope even with obstruction.

Deep Gaussian Mixture Models

Deep learning is a hierarchical inference method formed by multiple successive layers of learning, which can describe complex relationships more efficiently. In this paper, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture Model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions.

Beyond Sparsity: Tree Regularization of Deep Models for Interpretability

The lack of interpretability remains a key barrier to the adoption of deep models in many applications. In this paper, the authors show how to explicitly regularize deep models so human users might step through the process behind their predictions in little time. Specifically, they train deep time-series models so their class-probability predictions have high accuracy while being closely modeled by decision trees with few nodes.

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

This article demonstrates that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, the authors employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule.

Data Augmentation Generative Adversarial Networks

This paper shows that a Data Augmentation Generative Adversarial Network (DAGAN) works well for augmenting standard classifiers. It also shows that a DAGAN can enhance few-shot learning systems such as Matching Networks.

Fixing Weight Decay Regularization in Adam

This paper notes that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. The authors propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function.
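
The difference can be sketched for a single scalar weight. The function below is an illustrative Adam step, not the authors' code: with `decoupled=False` the decay term is folded into the gradient (the common L2 approach), and with `decoupled=True` it is applied directly to the weight. The function name, the default hyperparameters, and the choice to scale the decoupled decay by `lr` are assumptions.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam update for a scalar weight, with coupled or decoupled decay."""
    if not decoupled:
        # Coupled: the decay term passes through Adam's adaptive rescaling.
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled: shrink the weight multiplicatively, outside the adaptive step.
        w_new -= lr * weight_decay * w
    return w_new, m, v


# Zero loss gradient, so only the decay term acts on the weight.
w_decoupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                              weight_decay=0.1, decoupled=True)
w_coupled, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                            weight_decay=0.1, decoupled=False)
```

In this toy case the decoupled step shrinks the weight in proportion to its size (1.0 to 0.99), while the coupled step moves it by roughly `lr` regardless of the decay coefficient (1.0 to about 0.9), because the adaptive normalization cancels the magnitude of the decay term.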

Don’t Decay the Learning Rate, Increase the Batch Size

It is common practice to decay the learning rate. This paper shows how one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam.
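
A small sketch of the two schedules side by side, assuming that multiplying the batch size by a factor plays the role of dividing the learning rate by the same factor. The function name, base values, factor, and milestone epochs are hypothetical and chosen only for illustration.

```python
def schedules(epoch, base_lr=0.1, base_batch=128, factor=5, milestones=(30, 60, 80)):
    """Return (lr, batch_size) under each strategy for the given epoch."""
    # Count how many milestones have been reached so far.
    steps = sum(1 for milestone in milestones if epoch >= milestone)
    # Conventional strategy: shrink the learning rate, keep the batch size.
    decay_lr = (base_lr / factor ** steps, base_batch)
    # Alternative strategy: keep the learning rate, grow the batch size.
    grow_batch = (base_lr, base_batch * factor ** steps)
    return decay_lr, grow_batch


for epoch in (0, 30, 60):
    print(epoch, schedules(epoch))
```

With larger batches later in training, fewer parameter updates are needed per epoch, which is the practical appeal of the second strategy.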
