# On Herding in Deep Networks

```bibtex
@inproceedings{Maaten2010OnHI,
  title  = {On Herding in Deep Networks},
  author = {L. V. D. Maaten},
  year   = {2010}
}
```

Maximum likelihood learning in Markov Random Fields (MRFs) with multiple layers of hidden units is typically performed using contrastive divergence or one of its variants. After learning, samples from the model are generally used to estimate expectations under the model distribution. Recently, Welling proposed a new approach to working with MRFs with a single layer of hidden units. The approach, called herding, tries to combine the two stages, learning and sampling, into a single stage. Herding…
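Welling's herding replaces the learn-then-sample pipeline with a deterministic dynamical system: at each step a pseudo-sample is chosen to maximize the current weighted feature sum, and the weights are then nudged toward the data moments. A minimal sketch, assuming binary visible variables with only first-moment features (phi(s) = s, so the inner maximization reduces to an elementwise threshold; the function and variable names are illustrative):

```python
import numpy as np

def herd(data_means, n_steps):
    """Generate herding pseudo-samples whose empirical moments
    track the observed data moments."""
    w = data_means.astype(float).copy()   # initialize weights at the moments
    samples = []
    for _ in range(n_steps):
        s = (w > 0).astype(float)         # pseudo-sample: argmax_s <w, phi(s)>
        samples.append(s)
        w += data_means - s               # herding update: pull w toward data moments
    return np.array(samples)

data_means = np.array([0.2, 0.7, 0.5])
samples = herd(data_means, 10000)
print(samples.mean(axis=0))               # empirical means approach data_means
```

Because the update is deterministic and the weights stay bounded, the running averages of the pseudo-samples converge to the data moments at a fast O(1/T) rate, with no step size to tune.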

#### References

Showing 1–10 of 34 references.

Exploring Strategies for Training Deep Neural Networks

- Computer Science
- J. Mach. Learn. Res.
- 2009

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy helps the optimization by initializing weights in a region near a good local minimum, but also implicitly acts as a sort of regularization that brings better generalization and encourages internal distributed representations that are high-level abstractions of the input.

Using fast weights to improve persistent contrastive divergence

- Mathematics, Computer Science
- ICML '09
- 2009

It is shown that the weight updates force the Markov chain to mix fast, and using this insight, an even faster mixing chain is developed that uses an auxiliary set of "fast weights" to implement a temporary overlay on the energy landscape.

A Fast Learning Algorithm for Deep Belief Nets

- Mathematics, Computer Science
- Neural Computation
- 2006

A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

Deep Boltzmann Machines

- Computer Science
- AISTATS
- 2009

A new learning algorithm for Boltzmann machines that contain many layers of hidden variables, made more efficient by a layer-by-layer "pre-training" phase that allows variational inference to be initialized with a single bottom-up pass.

Herding Dynamic Weights for Partially Observed Random Field Models

- Computer Science, Mathematics
- UAI
- 2009

An algorithm to generate complex dynamics for parameters and (both visible and hidden) state vectors is introduced, and it is shown that under certain conditions averages computed over trajectories of the proposed dynamical system converge to averages computed over the data.

Modeling image patches with a directed hierarchy of Markov random fields

- Computer Science, Mathematics
- NIPS
- 2007

An efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets is described, and it is shown that this type of model is good at capturing the statistics of patches of natural images.

Herding dynamical weights to learn

- Mathematics, Computer Science
- ICML '09
- 2009

A new "herding" algorithm is proposed which directly converts observed moments into a sequence of pseudo-samples. The pseudo-samples respect the moment constraints and may be used to estimate… Expand

Learning Deep Architectures for AI

- Computer Science
- Found. Trends Mach. Learn.
- 2007

The motivations and principles regarding learning algorithms for deep architectures are discussed, in particular those exploiting unsupervised learning of single-layer models, such as Restricted Boltzmann Machines, as building blocks for constructing deeper models such as Deep Belief Networks.

Scaling learning algorithms towards AI

- Computer Science
- 2007

It is argued that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence.

Training Products of Experts by Minimizing Contrastive Divergence

- Mathematics, Computer Science
- Neural Computation
- 2002

A product of experts (PoE) is an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary, because it is hard even to approximate the derivatives of the renormalization term in the combination rule.