Samy Bengio - Publications

Some of the files below are copyrighted. They are provided for your convenience, yet you may download them only if you are entitled to do so by your arrangements with the various publishers.

By Year 2024 2023 2022 2021 2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 before 2000

2024

[1] E. Boix-Adsera, O. Saremi, E. Abbe, S. Bengio, E. Littwin, and J. Susskind.
When can transformers reason with abstract symbols.
In International Conference on Learning Representations, ICLR, 2024.
.pdf | weblink | abstract]
We investigate the capabilities of transformer large language models (LLMs) on relational reasoning tasks involving abstract symbols. Such tasks have long been studied in the neuroscience literature as fundamental building blocks for more complex abilities in programming, mathematics, and verbal reasoning. For (i) regression tasks, we prove that transformers generalize when trained, but require astonishingly large quantities of training data. For (ii) next-token-prediction tasks with symbolic labels, we show an "inverse scaling law": transformers fail to generalize as their embedding dimension increases. For both settings (i) and (ii), we propose subtle transformer modifications which can reduce the amount of data needed by adding two trainable parameters per head.
[2] H. Zhou, A. Bradley, E. Littwin, N. Razin, O. Saremi, J. Susskind, S. Bengio, and P. Nakkiran.
What algorithms can transformers learn: A study in lenght generalization.
In International Conference on Learning Representations, ICLR, 2024.
.pdf | weblink | abstract]
Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a unifying framework to understand when and how Transformers can exhibit strong length generalization on a given task. Specifically, we leverage RASP (Weiss et al., 2021) -- a programming language designed for the computational model of a Transformer -- and introduce the RASP-Generalization Conjecture: Transformers tend to length generalize on a task if the task can be solved by a short RASP program which works for all input lengths. This simple conjecture remarkably captures most known instances of length generalization on algorithmic tasks. Moreover, we leverage our insights to drastically improve generalization performance on traditionally hard tasks (such as parity and addition). On the theoretical side, we give a simple example where the "min-degree-interpolator" model of learning from Abbe et al. (2023) does not correctly predict Transformers' out-of-distribution behavior, but our conjecture does. Overall, our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.

2023

[1] E. Abbe, S. Bengio, A. Lotfi, and K. Rizk.
Generalization on the unseen, logic reasoning and degree curriculum.
In International Conference on Machine Learning, ICML, 2023.
Outstanding Paper Award [ .ps.gz | .pdf | weblink | abstract]
This paper considers the learning of logical (Boolean) functions with focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator (MDI) is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky MDIs. These findings lead to two implications: (1) we provide an explanation to the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.
[2] S. Abnar, O. Saremi, L. Dinh, S. Wison, M. A. Bautista, C. Huang, V. Thilak, E. Littwin, J. Gu, J. Susskind, and S. Bengio.
Adaptivity and modularity for efficient generalization over task complexity.
ArXiv, 2310.08866, 2023.
.ps.gz | .pdf | weblink | abstract]
Can transformers generalize efficiently on problems that require dealing with examples with different levels of difficulty? We introduce a new task tailored to assess generalization over different complexities and present results that indicate that standard transformers face challenges in solving these tasks. These tasks are variations of pointer value retrieval previously introduced by Zhang et al. (2021). We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph). Based on our observations, we propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hyper networks with adaptive depth from Universal Transformers. This model demonstrates higher accuracy and a fairer allocation of computational resources when generalizing to higher numbers of computation steps. We conclude that mechanisms for adaptive depth and modularity complement each other in improving efficient generalization concerning example complexity. Additionally, to emphasize the broad applicability of our findings, we illustrate that in a standard image recognition task, Hyper- UT's performance matches that of a ViT model but with considerably reduced computational demands (achieving over 70% average savings by effectively using fewer layers).
[3] D. Berrebbi, R. Collobert, S. Bengio, N. Jaitly, and T. Likhomanenko.
Continuous pseudo-labeling from the start.
In International Conference on Learning Representations, ICLR, 2023.
.ps.gz | .pdf | weblink | abstract]
Self-training (ST), or pseudo-labeling has sparked significant interest in the automatic speech recognition (ASR) community recently because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform `continuous training' where PLs are generated using a very recent version of the model being trained. Nevertheless, these approaches still rely on bootstrapping the ST using an initial supervised learning phase where the model is trained on labeled data alone. We believe this has the potential for over-fitting to the labeled dataset in low resource settings and that ST from the start of training should reduce over-fitting. In this paper we show how we can do this by dynamically controlling the evolution of PLs during the training process in ASR. To the best of our knowledge, this is the first study that shows the feasibility of generating PLs from the very start of the training. We are able to achieve this using two techniques that avoid instabilities which lead to degenerate models that do not generalize. Firstly, we control the evolution of PLs through a curriculum that uses the online changes in PLs to control the membership of the cache of PLs and improve generalization. Secondly, we find that by sampling transcriptions from the predictive distribution, rather than only using the best transcription, we can stabilize training further. With these techniques, our ST models match prior works without an external language model.
[4] E. Boix-Adsera, E. Littwin, E. Abbe, S. Bengio, and J. Susskind.
Transformers learn through gradual rank increase.
In Advances In Neural Information Processing Systems, NeurIPS, 2023.
.ps.gz | .pdf | weblink | abstract]
We identify incremental learning dynamics in transformers, where the difference between trained and initial weights progressively increases in rank. We rigorously prove this occurs under the simplifying assumptions of diagonal weight matrices and small initialization. Our experiments support the theory and also show that phenomenon can occur in practice without the simplifying assumptions.
[5] S. d'Ascoli, S. Bengio, J. Susskind, and E. Abbe.
Boolformer: Symbolic regression of logic functions with transformers.
ArXiv, 2309.12207, 2023.
.ps.gz | .pdf | weblink | abstract]
In this work, we introduce Boolformer, the first Transformer architecture trained to perform end-to-end symbolic regression of Boolean functions. First, we show that it can predict compact formulas for complex functions which were not seen during training, when provided a clean truth table. Then, we demonstrate its ability to find approximate expressions when provided incomplete and noisy observations. We evaluate the Boolformer on a broad set of real-world binary classification datasets, demonstrating its potential as an interpretable alternative to classic machine learning methods. Finally, we apply it to the widespread task of modelling the dynamics of gene regulatory networks. Using a recent benchmark, we show that Boolformer is competitive with state-of-the art genetic algorithms with a speedup of several orders of magnitude. Our code and models are available publicly.

2022

[1] E. Abbe, S. Bengio, E. Cornacchia, J. Kleinberg, A. Lotfi, M. Raghu, and C. Zhang.
Learning to reason with neural networks: Generalization, unseen data and boolean measures.
In Advances In Neural Information Processing Systems, NeurIPS, 2022.
.ps.gz | .pdf | weblink | abstract]
This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a "reasoning" function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
[2] T. Likhomanenko, R. Collobert, N. Jaitly, and S. Bengio.
Continuous pseudo-labeling in ASR.
ArXiv, 2211.06007, 2022.
.ps.gz | .pdf | weblink | abstract]
Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
[3] C. Zhang, S. Bengio, and Y. Singer.
Are all layers created equal?
Journal of Machine Learning Research, JMLR, 23:1--28, 2022.
.ps.gz | .pdf | weblink | abstract]
Understanding deep neural networks is a major research objective with notable experimental and theoretical attention in recent years. The practical success of excessively large networks underscores the need for better theoretical analyses and justifications. In this paper we focus on layer-wise functional structure and behavior in overparameterized deep models. To do so, we study empirically the layers' robustness to post-training re-initialization and re-randomization of the parameters. We provide experimental results which give evidence for the heterogeneity of layers. Morally, layers of large deep neural networks can be categorized as either "robust" or "critical". Resetting the robust layers to their initial values does not result in adverse decline in performance. In many cases, robust layers hardly change throughout training. In contrast, re-initializing critical layers vastly degrades the performance of the network with test error essentially dropping to random guesses. Our study provides further evidence that mere parameter counting or norm calculations are too coarse in studying generalization of deep models, and "flatness" and robustness analysis of trained models need to be examined while taking into account the respective network architectures.

2021

[1] M. L. Iuzzolino, M. C. Mozer, and S. Bengio.
Improving anytime prediction with parallel cascaded networks and a temporal-difference loss.
In Advances In Neural Information Processing Systems, NeurIPS, 2021.
.ps.gz | .pdf | weblink | abstract]
Although deep feedforward neural networks share some characteristics with the primate visual system, a key distinction is their dynamics. Deep nets typically operate in serial stages wherein each layer completes its computation before processing begins in subsequent layers. In contrast, biological systems have cascaded dynamics: information propagates from neurons at all layers in parallel but transmission occurs gradually over time, leading to speed-accuracy trade offs even in feedforward architectures. We explore the consequences of biologically inspired parallel hardware by constructing cascaded ResNets in which each residual block has propagation delays but all blocks update in parallel in a stateful manner. Because information transmitted through skip connections avoids delays, the functional depth of the architecture increases over time, yielding anytime predictions that improve with internal-processing time. We introduce a temporal-difference training loss that achieves a strictly superior speed-accuracy profile over standard losses and enables the cascaded architecture to outperform state-of-the-art anytime-prediction methods. The cascaded architecture has intriguing properties, including: it classifies typical instances more rapidly than atypical instances; it is more robust to both persistent and transient noise than is a conventional ResNet; and its time-varying output trace provides a signal that can be exploited to improve information processing and inference.
[2] Y. Jiang, P. Natekar, M. Sharma, S. K. Aithal, D. Kashyap, N. Subramanyam, C. Lassance, D. M. Roy, G. K. Dziugaite, S. Gunasekar, I. Guyon, P. Foret, S. Yak, H. Mobahi, B. Neyshabur, and S. Bengio.
Methods and analysis of the first competition in predicting generalization of deep learning.
In Proceedings of Machine Learning Research, volume 133, pages 170--190, 2021.
.ps.gz | .pdf | weblink | abstract]
Deep learning has been recently successfully applied to an ever larger number of problems, ranging from pattern recognition to complex decision making. However, several concerns have been raised, including guarantees of good generalization, which is of foremost importance. Despite numerous attempts, conventional statistical learning approaches fall short of providing a satisfactory explanation on why deep learning works. In a competition hosted at the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS 2020), we invited the community to design robust and general complexity measures that can accurately predict the generalization of models. In this paper, we describe the competition design, the protocols, and the solutions of the top-three teams at the competition in details. In addition, we discuss the outcomes, common failure modes, and potential future directions for the competition.
[3] B. Kim, E. Reif, M. Wattenberg, S. Bengio, and M. C. Mozer.
Neural networks trained on natural scenes exhibit gestalt closure.
Computational Brain and Behavior, 4(3):251--263, 2021.
.ps.gz | .pdf | weblink | abstract]
The Gestalt laws of perceptual organization, which describe how visual elements in an image are grouped and interpreted, have traditionally been thvalidity, we investigate whether deep learning methods infer Gestalt laws froml he statistics of natural scenes. We examine the law of closure, which asserts tat human visual perception tends to "close the gap" by assembling elements thathcan jointly be interpreted as a complete figure or object. We demonstrate thatt state-of-the-art convolutional neural network, trained to classify natural im aes, exhibits closure on synthetic displays of edge fragments, as assessed by sagilarity of internal representations. This finding provides further support forimhe hypothesis that the human perceptual system is even more elegant than the G ttaltists imagined: a single law--adaptation to the statistical structure of theesnvironment--might suffice as fundamental.
[4] Y. Li, S. Si, G. Li, C.-J. Hsieh, and S. Bengio.
Learnable fourier features for multi-dimensional spatial positional encoding.
In Advances In Neural Information Processing Systems, NeurIPS, 2021.
.ps.gz | .pdf | weblink | abstract]
Attentional mechanisms are order-invariant. Positional encoding is a crucial component to allow attention-based deep model architectures such as Transformer to address sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for a spatial multi-dimensional position, e.g., pixel positions on an image, where L2 distances or more complex positional relationships need to be captured. Our experiments based on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods by both improving the accuracy and allowing faster convergence.
[5] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals.
Understanding deep learning (still) requires rethinking generalization.
Communications of the ACM, 64(3):107--115, 2021.
.ps.gz | .pdf | weblink | abstract]
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progresses in the field since the original version of this paper.
[6] C. Zhang, M. Raghu, J. Kleinberg, and S. Bengio.
Pointer value retrieval: A new benchmark for understanding the limits of neural network generalization.
ArXiv, 2107.12580, 2021.
.ps.gz | .pdf | weblink | abstract]
The successes of deep learning critically rely on the ability of neural networks to output meaningful predictions on unseen data -- generalization. Yet despite its criticality, there remain fundamental open questions on how neural networks generalize. How much do neural networks rely on memorization -- seeing highly similar training examples -- and how much are they capable of human-intelligence styled reasoning -- identifying abstract rules underlying the data? In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explore the limits of neural network generalization. While PVR tasks can consist of visual as well as symbolic inputs, each with varying levels of difficulty, they all have a simple underlying rule. One part of the PVR task input acts as a pointer, giving the location of a different part of the input, which forms the value (and output). We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture. The interaction of position, values and the pointer rule also allow the development of nuanced tests of generalization, by introducing distribution shift and increasing functional complexity. These reveal both subtle failures and surprising successes, suggesting many promising directions of exploration on this benchmark.

2020

[1] Y. Guo, J. Choi, M. Moczulski, S. Feng, S. Bengio, M. Norouzi, and H. Lee.
Memory based trajectory-conditioned policies for learning from sparse rewards.
In Advances In Neural Information Processing Systems, NeurIPS, 2020.
.ps.gz | .pdf | weblink | abstract]
Reinforcement learning with sparse rewards is challenging because an agent can rarely obtain non-zero rewards and hence, gradient-based optimization of parameterized policies can be incremental and slow. Recent work demonstrated that using a memory buffer of previous successful trajectories can result in more effective policies. However, existing methods may overly exploit past successful experiences, which can encourage the agent to adopt sub-optimal and myopic behaviors. In this work, instead of focusing on good experiences with limited diversity, we propose to learn a trajectory-conditioned policy to follow and expand diverse past trajectories from a memory buffer. Our method allows the agent to reach diverse regions in the state space and improve upon the past trajectories to reach new states. We empirically show that our approach significantly outperforms count-based exploration methods (parametric approach) and self-imitation learning (parametric approach with non-parametric memory) on various complex tasks with local optima. In particular, without using expert demonstrations or resetting to arbitrary states, we achieve the state-of-the-art scores under five billion number of frames, on challenging Atari games such as Montezuma¿s Revenge and Pitfall.
[2] S. Hooker, N. Moorosi, G. Clark, S. Bengio, and E. Denton.
Characterising bias in compressed models.
ArXiv, 2010.03058, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
The popularity and widespread use of pruning and quantization is driven by the severe resource constraints of deploying deep neural networks to environments with strict latency, memory and energy requirements. These techniques achieve high levels of compression with negligible impact on top-line metrics (top-1 and top-5 accuracy). However, overall accuracy hides disproportionately high errors on a small subset of examples; we call this subset Compression Identified Exemplars (CIE). We further establish that for CIE examples, compression amplifies existing algorithmic bias. Pruning disproportionately impacts performance on underrepresented features, which often coincides with considerations of fairness. Given that CIE is a relatively small subset but a great contributor of error in the model, we propose its use as a human-in-the-loop auditing tool to surface a tractable subset of the dataset for further inspection or annotation by a domain expert. We provide qualitative and quantitative support that CIE surfaces the most challenging examples in the data distribution for human-in-the-loop auditing.
[3] Y. Jiang, P. Foret, S. Yak, D. M. Roy, H. Mobahi, G. K. Dziugaite, S. Bengio, S. Gunasekar, I. Guyon, and B. Neyshabur.
NeurIPS 2020 competition: Predicting generalization in deep learning.
ArXiv, 2012.07976, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
Understanding generalization in deep learning is arguably one of the most important questions in deep learning. Deep learning has been successfully adopted to a large number of problems ranging from pattern recognition to complex decision making, but many recent researchers have raised many concerns about deep learning, among which the most important is generalization. Despite numerous attempts, conventional statistical learning approaches have yet been able to provide a satisfactory explanation on why deep learning works. A recent line of works aims to address the problem by trying to predict the generalization performance through complexity measures. In this competition, we invite the community to propose complexity measures that can accurately predict generalization of models. A robust and general complexity measure would potentially lead to a better understanding of deep learning's underlying mechanism and behavior of deep models on unseen data, or shed light on better generalization bounds. All these outcomes will be important for making deep learning more robust and reliable.
[4] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio.
Fantastic generalization measures and where to find them.
In International Conference on Learning Representations, ICLR, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusion drawn from those experiments would remain valid in other settings. We present the first large scale study of generalization in deep networks. We investigate more then 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
[5] Y. Li, J. Amelot, X. Zhou, S. Bengio, and S. Si.
Auto completion of user interface layout design using transformer-based tree decoders.
ArXiv, 2001.05308, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
It has been of increasing interest in the field to develop automatic machineries to facilitate the design process. In this paper, we focus on assisting graphical user interface (UI) layout design, a crucial task in app development. Given a partial layout, which a designer has entered, our model learns to complete the layout by predicting the remaining UI elements with a correct position and dimension as well as the hierarchical structures. Such automation will significantly ease the effort of UI designers and developers. While we focus on interface layout prediction, our model can be generally applicable for other layout prediction problems that involve tree structures and 2-dimensional placements. Particularly, we design two versions of Transformer-based tree decoders: Pointer and Recursive Transformer, and experiment with these models on a public dataset. We also propose several metrics for measuring the accuracy of tree prediction and ground these metrics in the domain of user experience. These contribute a new task and methods to deep learning research.
[6] C. Luo, H. Mobahi, and S. Bengio.
Data augmentation via structured adversarial perturbations.
ArXiv, 2011.03010, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
Data augmentation is a major component of many machine learning methods with state-of-the-art performance. Common augmentation strategies work by drawing random samples from a space of transformations. Unfortunately, such sampling approaches are limited in expressivity, as they are unable to scale to rich transformations that depend on numerous parameters due to the curse of dimensionality. Adversarial examples can be considered as an alternative scheme for data augmentation. By being trained on the most difficult modifications of the inputs, the resulting models are then hopefully able to handle other, presumably easier, modifications as well. The advantage of adversarial augmentation is that it replaces sampling with the use of a single, calculated perturbation that maximally increases the loss. The downside, however, is that these raw adversarial perturbations appear rather unstructured; applying them often does not produce a natural transformation, contrary to a desirable data augmentation technique. To address this, we propose a method to generate adversarial examples that maintain some desired natural structure. We first construct a subspace that only contains perturbations with the desired structure. We then project the raw adversarial gradient onto this space to select a structured transformation that would maximally increase the loss when applied. We demonstrate this approach through two types of image transformations: photometric and geometric. Furthermore, we show that training on such structured adversarial images improves generalization.
[7] A. Raghu, M. Raghu, S. Bengio, and O. Vinyals.
Rapid learning or feature reuse? towards understanding the effectiveness of MAML.
In International Conference on Learning Representations, ICLR, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
An important research direction in machine learning has centered around developing meta-learning algorithms to tackle few-shot learning. An especially successful algorithm has been Model Agnostic Meta-Learning (MAML), a method that consists of two optimization loops, with the outer loop finding a meta-initialization, from which the inner loop can efficiently learn new tasks. Despite MAML's popularity, a fundamental open question remains -- is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta initialization already containing high quality features? We investigate this question, via ablation studies and analysis of the latent representations, finding that feature reuse is the dominant factor. This leads to the ANIL (Almost No Inner Loop) algorithm, a simplification of MAML where we remove the inner loop for all but the (task-specific) head of a MAML-trained network. ANIL matches MAML's performance on benchmark few-shot image classification and RL and offers computational improvements over MAML. We further study the precise contributions of the head and body of the network, showing that performance on the test tasks is entirely determined by the quality of the learned features, and we can remove even the head of the network (the NIL algorithm). We conclude with a discussion of the rapid learning vs feature reuse question for meta-learning algorithms more broadly.
[8] C. Zhang, S. Bengio, M. Hardt, M. C. Mozer, and Y. Singer.
Identity crisis: Memorization and generalization under extreme overparameterization.
In International Conference on Learning Representations, ICLR, 2020.
.ps.gz | .pdf | .djvu | weblink | abstract]
We study the interplay between memorization and generalization of overparametrized networks in the extreme case of a single training example. The learning task is to predict an output which is as similar as possible to the input. We examine both fully-connected and convolutional networks that are initialized randomly and then trained to minimize the reconstruction error. The trained networks take one of the two forms: the constant function ("memorization") and the identity function ("generalization"). We show that different architectures exhibit vastly different inductive bias towards memorization and generalization. An important consequence of our study is that even in extreme cases of overparameterization, deep learning can result in proper generalization.

2019

[1] S. Bengio, K. Dembczynski, T. Joachims, M. Kloft, and M. Varma.
Extreme Classification (Dagstuhl Seminar 18291).
Dagstuhl Reports, 8(7):62--80, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
Extreme classification is a rapidly growing research area within machine learning focusing on multi-class and multi-label problems involving an extremely large number of labels (even more than a million). Many applications of extreme classification have been found in diverse areas ranging from language modeling to document tagging in NLP, face recognition to learning universal feature representations in computer vision, gene function prediction in bioinformatics, etc. Extreme classification has also opened up a new paradigm for key industrial applications such as ranking and recommendation by reformulating them as multi-label learning tasks where each item to be ranked or recommended is treated as a separate label. Such reformulations have led to significant gains over traditional collaborative filtering and content-based recommendation techniques. Consequently, extreme classifiers have been deployed in many real-world applications in industry. Extreme classification has raised many new research challenges beyond the pale of traditional machine learning including developing log-time and log-space algorithms, deriving theoretical bounds that scale logarithmically with the number of labels, learning from biased training data, developing performance metrics, etc. The seminar aimed at bringing together experts in machine learning, NLP, computer vision, web search and recommendation from academia and industry to make progress on these problems. We believe that this seminar has encouraged the inter-disciplinary collaborations in the area of extreme classification, started discussion on identification of thrust areas and important research problems, motivated to improve the algorithms upon the state-of-the-art, as well to work on the theoretical foundations of extreme classification.
[2] V. Birodkar, H. Mobahi, and S. Bengio.
Semantic redundancies in image-classification datasets: The 10% you don't need.
ArXiv, 1901.11409, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
Large datasets have been crucial to the success of deep learning models in the recent years, which keep performing better as they are trained with more labelled data. While there have been sustained efforts to make these models more data-efficient, the potential benefit of understanding the data itself, is largely untapped. Specifically, focusing on object recognition tasks, we wonder if for common benchmark datasets we can do better than random subsets of the data and find a subset that can generalize on par with the full dataset when trained on. To our knowledge, this is the first result that can find notable redundancies in CIFAR-10 and ImageNet datasets (at least 10%). Interestingly, we observe semantic correlations between required and redundant images. We hope that our findings can motivate further research into identifying additional redundancies and exploiting them for more efficient training or data-collection.
[3] V. Birodkar, H. Mobahi, D. Krishnan, and S. Bengio.
A closed-form learned pooling for deep classification networks.
ArXiv, 1906.03808, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
In modern computer vision tasks, convolutional neural networks (CNNs) are indispensable for image classification tasks due to their efficiency and effectiveness. Part of their superiority compared to other architectures, comes from the fact that a single, local filter is shared across the entire image. However, there are scenarios where we may need to treat spatial locations in non-uniform manner. We see this in nature when considering how humans have evolved foveation to process different areas in their field of vision with varying levels of detail. In this paper we propose a way to enable CNNs to learn different pooling weights for each pixel location. We do so by introducing an extended definition of a pooling operator. This operator can learn a strict super-set of what can be learned by average pooling or convolutions. It has the benefit of being shared across feature maps and can be encouraged to be local or diffuse depending on the data. We show that for fixed network weights, our pooling operator can be computed in closed-form by spectral decomposition of matrices associated with class separability. Through experiments, we show that this operator benefits generalization for ResNets and CNNs on the CIFAR-10, CIFAR-100 and SVHN datasets and improves robustness to geometric corruptions and perturbations on the CIFAR-10-C and CIFAR-10-P test sets.
[4] Z. Chen, Y. Li, S. Bengio, and S. Si.
You look twice: Gaternet for dynamic filter selection in CNNs.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
The concept of conditional computation for deep nets has been proposed previously to improve model performance by selectively using only parts of the model conditioned on the sample it is processing. In this paper, we investigate input-dependent dynamic filter selection in deep convolutional neural networks (CNNs). The problem is interesting because the idea of forcing different parts of the model to learn from different types of samples may help us acquire better filters in CNNs, improve the model generalization performance and potentially increase the interpretability of model behavior. We propose a novel yet simple framework called GaterNet, which involves a backbone and a gater network. The backbone network is a regular CNN that performs the major computation needed for making a prediction, while a global gater network is introduced to generate binary gates for selectively activating filters in the backbone network based on each input. Extensive experiments on CIFAR and ImageNet datasets show that our models consistently outperform the original models with a large margin. On CIFAR-10, our model also improves upon state-of-the-art results.
[5] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C.-J. Hsieh.
Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks.
In Conference on Knowledge Discovery and Data Mining, KDD, 2019.
Frontiers of Science Award from ICBS 2023 [ .ps.gz | .pdf | .djvu | weblink | abstract]
Graph convolutional network (GCN) has been successfully applied to many graph-based applications; however, training a large-scale GCN remains challenging. Current SGD-based algorithms suffer from either a high computational cost that exponentially grows with number of GCN layers, or a large space requirement for keeping the entire graph and the embedding of each node in memory. In this paper, we propose Cluster-GCN, a novel GCN algorithm that is suitable for SGD-based training by exploiting the graph clustering structure. Cluster-GCN works as the following: at each step, it samples a block of nodes that associate with a dense subgraph identified by a graph clustering algorithm, and restricts the neighborhood search within this subgraph. This simple but effective strategy leads to significantly improved memory and computational efficiency while being able to achieve comparable test accuracy with previous algorithms. To test the scalability of our algorithm, we create a new Amazon2M data with 2 million nodes and 61 million edges which is more than 5 times larger than the previous largest publicly available dataset (Reddit). For training a 3-layer GCN on this data, Cluster-GCN is faster than the previous state-of-the-art VR-GCN (1523 seconds vs 1961 seconds) and using much less memory (2.2GB vs 11.2GB). Furthermore, for training 4 layer GCN on this data, our algorithm can finish in around 36 minutes while all the existing GCN training algorithms fail to train due to the out-of-memory issue. Furthermore, Cluster-GCN allows us to train much deeper GCN without much time and memory overhead, which leads to improved prediction accuracy---using a 5-layer Cluster-GCN, we achieve state-of-the-art test F1 score 99.36 on the PPI dataset, while the previous best result was 98.71 by [16].
[6] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord.
Unsupervised speech representation learning using wavenet autoencoders.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27:2041--2053, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. The behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
[7] D. Duckworth, A. Neelakantan, B. Goodrich, L. Kaiser, and S. Bengio.
Parallel scheduled sampling.
ArXiv, 1906.04331, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
Auto-regressive models are widely used in sequence generation problems. The output sequence is typically generated in a predetermined order, one discrete unit (pixel or word or character) at a time. The models are trained by teacher-forcing where ground-truth history is fed to the model as input, which at test time is replaced by the model prediction. Scheduled Sampling aims to mitigate this discrepancy between train and test time by randomly replacing some discrete units in the history with the model's prediction. While teacher-forced training works well with ML accelerators as the computation can be parallelized across time, Scheduled Sampling involves undesirable sequential processing. In this paper, we introduce a simple technique to parallelize Scheduled Sampling across time. We find that in most cases our technique leads to better empirical performance on summarization and dialog generation tasks compared to teacher-forced training. Further, we discuss the effects of different hyper-parameters associated with Scheduled Sampling on the model performance.
[8] Y. Guo, J. Choi, M. Moczulski, S. Bengio, M. Norouzi, and H. Lee.
Efficient exploration with self-imitation learning via trajectory-conditioned policy.
ArXiv, 1907.10247, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper proposes a method for learning a trajectory-conditioned policy to imitate diverse demonstrations from the agent's own past experiences. We demonstrate that such self-imitation drives exploration in diverse directions and increases the chance of finding a globally optimal solution in reinforcement learning problems, especially when the reward is sparse and deceptive. Our method significantly outperforms existing self-imitation learning and count-based exploration methods on various sparse-reward reinforcement learning tasks with local optima. In particular, we report a state-of-the-art score of more than 25,000 points on Montezuma's Revenge without using expert demonstrations or resetting to arbitrary states.
[9] Y. Jiang, D. Krishnan, H. Mobahi, and S. Bengio.
Predicting the generalization gap in deep networks with margin distributions.
In International Conference on Learning Representations, ICLR, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
As shown in recent research, deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how generalization gap should be predicted from the training data and network parameters. In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of margin distribution, which are the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and the CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be of importance: normalizing margin values for scale independence, using characterizations of margin distribution rather than just the margin (closest distance to decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.
[10] B. Kim, E. Reif, M. Wattenberg, and S. Bengio.
Do neural networks show gestalt phenomena? an exploration of the law of closure.
ArXiv, 1903.01069, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
One characteristic of human visual perception is the presence of “Gestalt phenomena,” that is, that the whole is something other than the sum of its parts. A natural question is whether image-recognition networks show similar effects. Our paper investigates one particular type of Gestalt phenomenon, the law of closure, in the context of a feedforward image classification neural network (NN). This is a robust effect in human perception, but experiments typically rely on measurements (e.g., reaction time) that are not available for artificial neural nets. We describe a protocol for identifying closure effect in NNs, and report on the results of experiments with simple visual stimuli. Our findings suggest that NNs trained with natural images do exhibit closure, in contrast to networks with randomized weights or networks that have been trained on visually random data. Furthermore, the closure effect reflects something beyond good feature extraction; it is correlated with the network's higher layer features and ability to generalize.
[11] Y. Li, L. Kaiser, S. Bengio, and S. Si.
Area attention.
In International Conference on Machine Learning, ICML, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
Existing attention mechanisms, are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks: neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter free. In addition to proposing the novel concept of area attention, we contribute an efficient way for computing it by leveraging the technique of summed area tables.
[12] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio.
Transfusion: Understanding transfer learning with applications to medical imaging.
In Advances In Neural Information Processing Systems, NeurIPS, 2019.
.ps.gz | .pdf | .djvu | weblink | abstract]
With the increasingly varied applications of deep learning, transfer learning has emerged as a critically important technique. However, the central question of how much feature reuse in transfer is the source of benefit remains unanswered. In this paper, we present an in-depth analysis of the effects of transfer, focusing on medical imaging, which is a particularly intriguing setting. Here, transfer learning is extremely popular, but data differences between pretraining and finetuing are considerable, reiterating the question of what is transferred. With experiments on two large scale medical imaging datasets, and CIFAR-10, we find transfer has almost negligible effects on performance, but significantly helps convergence speed. However, in all of these settings, convergence without transfer can be sped up dramatically by using only mean and variance statistics of the pretrained weights. Visualizing the lower layer filters shows that models trained from random initialization do not learn Gabor filters on medical images. We use CCA (canonical correlation analysis) to study the learned representations of the different models, finding that pretrained models are surprisingly similar to random initialization at higher layers. This similarity is evidenced both through model learning dynamics and a transfusion experiment, which explores the convergence speed using a subset of pretrained weights.

2018

[1] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio.
On using backpropagation for speech texture generation and voice conversion.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Inspired by recent work on neural network image generation which rely on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and on matching statistics of neuron activations between different source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice.
[2] G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio.
Large margin deep networks for classification.
In Advances In Neural Information Processing Systems, NeurIPS, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
We present a formulation of deep learning that aims at producing a large margin classifier. The notion of margin, minimum distance to a decision boundary, has served as the foundation of several theoretically profound and empirically successful results for both classification and regression tasks. However, most large margin algorithms are applicable only to shallow models with a preset feature representation; and conventional margin methods for neural networks only enforce margin at the output layer. Such methods are therefore not well suited for deep networks. In this work, we propose a novel loss function to impose a margin on any chosen set of layers of a deep network (including input and hidden layers). Our formulation allows choosing any norm on the metric measuring the margin. We demonstrate that the decision boundary obtained by our loss has nice properties compared to standard classification loss functions. Specifically, we show improved empirical results on the MNIST, CIFAR-10 and ImageNet datasets on multiple tasks: generalization from small training sets, corrupted labels, and robustness against adversarial perturbations. The resulting loss is general and complementary to existing data augmentation (such as random/adversarial input transform) and regularization techniques (such as weight decay, dropout, and batch norm).
[3] S. Escalera, M. Weimer, M. Burtsev, V. Malykh, V. Logacheva, R. Lowe, I. V. Serban, Y. Bengio, A. Rudnicky, A. W. Black, S. Prabhumoye, ¿. Kidzi¿ski, S. P. Mohanty, C. F. Ong, J. L. Hicks, S. Levine, M. Salathé, S. Delp, I. Huerga, A. Grigorenko, L. Thorbergsson, A. D. Nemitz, J. Sandker, S. King, A. S. Ecker, L. A. Gatys, M. Bethge, J. Boyd-Graber, S. Feng, P. Rodriguez, M. Iyyer, H. He, H. Daumé III, S. McGregor, A. Banifatemi, A. Kurakin, I. Goodfellow, and S. Bengio.
Introduction to nips 2017 competition track.
In S. Escalera and M. Weimer, editors, The NIPS '17 Competition: Building Intelligent Systems. Springer, 2018.
weblink ]
[4] L. Kaiser and S. Bengio.
Discrete autoencoders for sequence models.
ArXiv, 1801.09797, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Recurrent models for sequences have been recently successful at many tasks, especially for language modeling and machine translation. Nevertheless, it remains challenging to extract good representations from these models. For instance, even though language has a clear hierarchical structure going from characters through words to sentences, it is not apparent in current language models. We propose to improve the representation in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space. In order to propagate gradients though this discrete representation we introduce an improved semantic hashing technique. We show that this technique performs well on a newly proposed quantitative efficiency measure. We also analyze latent codes produced by the model showing how they correspond to words and phrases. Finally, we present an application of the autoencoder-augmented model to generating diverse translations.
[5] L. Kaiser, A. Roy, A. Vaswani, N. Parmar, S. Bengio, J. Uszkoreit, and N. Shazeer.
Fast decoding in sequence models using discrete latent variables.
In International Conference on Machine Learning, ICML, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Autoregressive sequence models based on deep neural networks, such as RNNs, Wavenet and the Transformer attain state-of-the-art results on many tasks. However, they are difficult to parallelize and are thus slow at processing long sequences. RNNs lack parallelism both during training and decoding, while architectures like WaveNet and Transformer are much more parallelizable during training, yet still operate sequentially during decoding. Inspired by [arxiv:1711.00937], we present a method to extend sequence models using discrete latent variables that makes decoding much more parallelizable. We first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel. To this end, we introduce a novel method for constructing a sequence of discrete latent variables and compare it with previously introduced methods. Finally, we evaluate our model end-to-end on the task of neural machine translation, where it is an order of magnitude faster at decoding than comparable autoregressive models. While lower in BLEU than purely autoregressive models, our model achieves higher scores than previously proposed non-autogregressive translation models.
[6] A. Kurakin, I. Goodfellow, S. Bengio, Y. Dong, F. Liao, M. Liang, T. Pang, J. Zhu, X. Hu, C. Xie, J. Wang, Z. Zhang, Z. Ren, A. Yuille, S. Huang, Y. Zhao, Y. Zhao, Z. Han, J. Long, Y. Berdibekov, T. Akiba, S. Tokui, and M. Abe.
Adversarial attacks and defences competition.
ArXiv, 1804.00097, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
To accelerate research on adversarial examples and robustness of machine learning classifiers, Google Brain organized a NIPS 2017 competition that encouraged researchers to develop new methods to generate adversarial examples as well as to develop new ways to defend against them. In this chapter, we describe the structure and organization of the competition and the solutions developed by several of the top-placing teams.
[7] Y. Li, S. Bengio, and G. Bailly.
Predicting human performance in vertical menu selection using deep learning.
In ACM CHI Conference, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Predicting human performance in interaction tasks allows designers or developers to understand the expected performance of a target interface without actually testing it with real users. In this work, we present a deep neural net to model and predict human performance in performing a sequence of UI tasks. In particular, we focus on a dominant class of tasks, i.e., target selection from a vertical list or menu. We experimented with our deep neural net using a public dataset collected from a desktop laboratory environment and a dataset collected from hundreds of touchscreen smartphone users via crowdsourcing. Our model significantly outperformed previous methods on these datasets. Importantly, our method, as a deep model, can easily incorporate additional UI attributes such as visual appearance and content semantics without changing model architectures. By understanding about how a deep learning model learns from human behaviors, our approach can be seen as a vehicle to discover new patterns about human behaviors to advance analytical modeling.
[8] Y. Li, N. Du, and S. Bengio.
Time-dependent representation for neural event sequence prediction.
In Workshop Track of the International Conference on Learning Representations, ICLR, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Existing sequence prediction methods are mostly concerned with time-independent sequences, in which the actual time span between events is irrelevant and the distance between events is simply the difference between their order positions in the sequence. While this time-independent view of sequences is applicable for data such as natural languages, e.g., dealing with words in a sentence, it is inappropriate and inefficient for many real world events that are observed and collected at unequally spaced points of time as they naturally arise, e.g., when a person goes to a grocery store or makes a phone call. The time span between events can carry important information about the sequence dependence of human behaviors. In this work, we propose a set of methods for using time in sequence prediction. Because neural sequence models such as RNN are more amenable for handling token-like input, we propose two methods for time-dependent event representation, based on the intuition on how time is tokenized in everyday life and previous work on embedding contextualization. We also introduce two methods for using next event duration as regularization for training a sequence prediction model. We discuss these methods based on recurrent neural nets. We evaluate these methods as well as baseline models on five datasets that resemble a variety of sequence prediction tasks. The experiments revealed that the proposed methods offer accuracy gain over baseline models in a range of settings.
[9] L. Logeswaran, H. Lee, and S. Bengio.
Content preserving text generation with attribute controls.
In Advances In Neural Information Processing Systems, NeurIPS, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this work, we address the problem of modifying textual attributes of sentences. Given an input sentence and a set of attribute labels, we attempt to generate sentences that are compatible with the conditioning information. To ensure that the model generates content compatible sentences, we introduce a reconstruction loss which interpolates between auto-encoding and back-translation loss components. We propose an adversarial loss to enforce generated samples to be attribute compatible and realistic. Through quantitative, qualitative and human evaluations we demonstrate that our model is capable of generating fluent sentences that better reflect the conditioning information compared to prior methods. We further demonstrate that the model is capable of simultaneously controlling multiple attributes.
[10] A. S. Morcos, M. Raghu, and S. Bengio.
Insights on representational similarity in neural networks with canonical correlation.
In Advances In Neural Information Processing Systems, NeurIPS, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Comparing different neural network representations and determining how representations evolve over time remain challenging open questions in our understanding of the function of neural networks. Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA, a recently proposed method. We first improve the core method, showing how to differentiate between signal and noise, and then apply this technique to compare across a group of CNNs, demonstrating that networks which generalize converge to more similar representations than networks which memorize, that wider networks converge to more similar solutions than narrow networks, and that trained networks with identical topology but different learning rates converge to distinct clusters with diverse representations. We also investigate the representational dynamics of RNNs, across both training and sequential timesteps, finding that RNNs converge in a bottom-up pattern over the course of training and that the hidden state is highly variable over the course of a sequence, even when accounting for linear transforms. Together, these results provide new insights into the function of CNNs and RNNs, and demonstrate the utility of using CCA to understand representations.
[11] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, L. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uskoreit.
Tensor2tensor for neural machine transalation.
ArXiv, 1803.07416, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
[12] C. Zhang, O. Vinyals, R. Munos, and S. Bengio.
A study on overfitting in deep reinforcement learning.
ArXiv, 1804.06893, 2018.
.ps.gz | .pdf | .djvu | weblink | abstract]
Recent years have witnessed significant progresses in deep Reinforcement Learning (RL). Empowered with large scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen “robustly”: commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. The observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.

2017

[1] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio.
Neural combinatorial optimization with reinforcement learning.
In Workshop Track of the International Conference on Learning Representations, ICLR, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
We present a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent neural network using a policy gradient method. Without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. These results, albeit still quite far from state-of-the-art, give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems.
[2] C. Chelba, M. Norouzi, and S. Bengio.
N-gram language modeling using recurrent neural network estimation.
ArXiv, 1703.10724, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
We investigate the effective memory depth of RNN models by using them for n-gram language model (LM) smoothing. Experiments on a small corpus (UPenn Treebank, one million words of training data and 10k vocabulary) have found the LSTM cell with dropout to be the best model for encoding the n-gram state when compared with feed-forward and vanilla RNN models. When preserving the sentence independence assumption the LSTM n-gram matches the LSTM LM performance for n=9 and slightly outperforms it for n=13. When allowing dependencies across sentence boundaries, the LSTM 13-gram almost matches the perplexity of the unlimited history LSTM LM. LSTM n-gram smoothing also has the desirable property of improving with increasing n-gram order, unlike the Katz or Kneser-Ney back-off estimators. Using multinomial distributions as targets in training instead of the usual one-hot target is only slightly beneficial for low n-gram orders. Experiments on the One Billion Words benchmark show that the results hold at larger scale. Building LSTM n-gram LMs may be appealing for some practical situations: the state in a n-gram LM can be succinctly represented with (n-1)*4 bytes storing the identity of the words in the context and batches of n-gram contexts can be processed in parallel. On the downside, the n-gram context encoding computed by the LSTM is discarded, making the model more expensive than a regular recurrent LSTM LM.
[3] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio.
Sharp minima can generalize for deep nets.
In International Conference on Machine Learning, ICML, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.
[4] L. Dinh, J. Sohl-Dickstein, and S. Bengio.
Density estimation using real NVP.
In International Conference on Learning Representations, ICLR, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using real-valued non-volume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact log-likelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, log-likelihood evaluation and latent variable manipulations.
[5] L. Kaiser, O. Nachum, A. Roy, and S. Bengio.
Learning to remember rare events.
In International Conference on Learning Representations, ICLR, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
Despite recent advances, memory-augmented deep neural networks are still limited when it comes to life-long and one-shot learning, especially in remembering rare events. We present a large-scale life-long memory module for use in deep learning. The module exploits fast nearest-neighbor algorithms for efficiency and thus scales to large memory sizes. Except for the nearest-neighbor query, the module is fully differentiable and trained end-to-end with no extra supervision. It operates in a life-long manner, i.e., without the need to reset it during training. Our memory module can be easily added to any part of a supervised neural network. To show its versatility we add it to a number of networks, from simple convolutional ones tested on image classification to deep sequence-to-sequence and recurrent-convolutional models. In all cases, the enhanced network gains the ability to remember and do life-long one-shot learning. Our module remembers training examples shown many thousands of steps in the past and it can successfully generalize from them. We set new state-of-the-art for one-shot learning on the Omniglot dataset and demonstrate, for the first time, life-long one-shot learning in recurrent neural networks on a large-scale machine translation task.
[6] A. Kurakin, I. Goodfellow, and S. Bengio.
Adversarial examples in the physical world.
In Workshop Track of the International Conference on Learning Representations, ICLR, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work have assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those which are using signals from cameras and other sensors as an input. This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from cell-phone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera.
[7] A. Kurakin, I. Goodfellow, and S. Bengio.
Adversarial machine learning at scale.
In International Conference on Learning Representations, ICLR, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black box attacks without knowledge of the target model¿s parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet (Russakovsky et al., 2014). Our contributions include: (1) recommendations for how to succesfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than singlestep attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a ¿label leaking¿ effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process.
[8] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean.
Device placement optimization with reinforcement learning.
In International Conference on Machine Learning, ICML, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-tosequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.
[9] R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik.
Context-aware captions from context-agnostic supervision.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
We introduce a technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation). For example, given images and captions of "siamese cat" and "tiger cat", our system generates language that describes the "siamese cat" in a way that distinguishes it from "tiger cat". We start with a generic language model that is context-agnostic and add a listener to discriminate between closely-related concepts. Our approach offers two key advantages over previous work: 1) our listener does not need separate training, and 2) allows joint inference to decode sentences that satisfy both the speaker and listener -- yielding an introspective speaker. We first apply our introspective speaker to a justification task, i.e. to describe why an image contains a particular fine-grained category as opposed to another closely related category in the CUB-200-2011 dataset. We then study discriminative image captioning to generate language that uniquely refers to one out of two semantically similar images in the COCO dataset. Evaluations with discriminative ground truth for justification and human studies for discriminative image captioning reveal that our approach outperforms baseline generative and speaker-listener approaches for discrimination.
[10] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan.
Show and tell: Lessons learned from the 2015 mscoco image captioning challenge.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 39(4):652--663, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research, and provide an open source implementation in TensorFlow.
[11] Y. Wang, R.J. Skerry-Ryan, D. Stanton, Y. Wu, R.J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R.A. Saurous.
Tacotron: A fully end-to-end text-to-speech synthesis model.
In Proceedings of Interspeech, 2017.
.ps.gz | .pdf | .djvu | weblink | abstract]
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.
[12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals.
Understanding deep learning requires rethinking generalization.
In International Conference on Learning Representations, ICLR, 2017.
Best Paper Award [ .ps.gz | .pdf | .djvu | weblink | abstract]
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.

2016

[1] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio.
Generating sentences from a continuous space.
In SIGNLL Conference on Computational Natural Language Learning, CONLL, 2016.
.ps.gz | .pdf | .djvu | abstract]
The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnn-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.
[2] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz.
Revisiting distributed synchronous SGD.
In Workshop Track of the International Conference on Learning Representations, ICLR, 2016.
.ps.gz | .pdf | .djvu | weblink ]
[3] M. Cissé, M. Al-Shedivat, and S. Bengio.
ADIOS: Architectures deep in output space.
In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.
.ps.gz | .pdf | .djvu | abstract]
Multi-label classification is a generalization of binary classification where the task consists in predicting sets of labels. With the availability of ever larger datasets, the multi-label setting has become a natural one in many applications, and the interest in solving multi-label problems has grown significantly. As expected, deep learning approaches are now yielding state-of-the-art performance for this class of problems. Unfortunately, they usually do not take into account the often unknown but nevertheless rich relationships between labels. In this paper, we propose to make use of this underlying structure by learning to partition the labels into a Markov Blanket Chain and then applying a novel deep architecture that exploits the partition. Experiments on several popular and large multi-label datasets demonstrate that our approach not only yields significant improvements, but also helps to overcome trade-offs specific to the multi-label classification setting.
[4] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer.
End-to-end text-dependent speaker verification.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2016.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.
[5] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio.
An online sequence-to-sequence model using partial conditioning.
In Advances In Neural Information Processing Systems, NIPS, 2016.
.ps.gz | .pdf | .djvu | weblink | abstract]
Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.
[6] L. Kaiser and S. Bengio.
Can active memory replace attention?
In Advances In Neural Information Processing Systems, NIPS, 2016.
.ps.gz | .pdf | .djvu | weblink | abstract]
Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech recognition, generative models, and learning algorithmic tasks, but it had probably the largest impact on neural machine translation. Recently, similar improvements have been obtained using alternative mechanisms that do not focus on a single part of a memory but operate on all of it in parallel, in a uniform way. Such mechanism, which we call active memory, improved over attention in algorithmic tasks, image processing, and in generative modelling. So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice.
[7] J. Lee, S. Kim, G. Lebanon, Y. Singer, and S. Bengio.
LLORMA: Local low-rank matrix approximation.
Journal of Machine Learning Research, JMLR, 17:1--24, 2016.
.ps.gz | .pdf | .djvu | weblink | abstract]
Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is low-rank. In this paper, we propose, analyze, and experiment with two procedures, one parallel and the other global, for constructing local matrix approximations. The two approaches approximate the observed matrix as a weighted sum of low-rank matrices. These matrices are limited to a local region of the observed matrix. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.
[8] M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster, Y. Wu, and D. Schuurmans.
Reward augmented maximum likelihood for neural structured prediction.
In Advances In Neural Information Processing Systems, NIPS, 2016.
.ps.gz | .pdf | .djvu | weblink | abstract]
A key problem in structured output prediction is direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient approach to incorporate task reward into a maximum likelihood framework. We establish a connection between the log-likelihood and regularized expected reward objectives, showing that at a zero temperature, they are approximately equivalent in the vicinity of the optimal solution. We show that optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated (temperature adjusted) rewards. Based on this observation, we optimize conditional log-probability of edited outputs that are sampled proportionally to their scaled exponentiated reward. We apply this framework to optimize edit distance in the output label space. Experiments on speech recognition and machine translation for neural sequence to sequence models show notable improvements over a maximum likelihood baseline by using edit distance augmented maximum likelihood
[9] O. Vinyals, S. Bengio, and M. Kudlur.
Order matters: Sequence to sequence for sets.
In International Conference on Learning Representations, ICLR, 2016.
.ps.gz | .pdf | .djvu | weblink | abstract]
Sequences have become first class citizens in supervised learning thanks to the resurgence of recurrent neural networks. Many complex tasks that require mapping from or to a sequence of observations can now be formulated with the sequence-to-sequence (seq2seq) framework which employs the chain rule to efficiently represent the joint probability of sequences. In many cases, however, variable sized inputs and/or outputs might not be naturally expressed as sequences. For instance, it is not clear how to input a set of numbers into a model where the task is to sort them; similarly, we do not know how to organize outputs when they correspond to random variables and the task is to model their unknown joint probability. In this paper, we first show using various examples that the order in which we organize input and/or output data matters significantly when learning an underlying model. We then discuss an extension of the seq2seq framework that goes beyond sequences and handles input sets in a principled way. In addition, we propose a loss which, by searching over possible orders during training, deals with the lack of structure of output sets. We show empirical evidence of our claims regarding ordering, and on the modifications to the seq2seq framework on benchmark language modeling and parsing tasks, as well as two artificial tasks -- sorting numbers and estimating the joint probability of unknown graphical models.

2015

[1] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer.
Scheduled sampling for sequence prediction with recurrent neural networks.
In Advances In Neural Information Processing Systems, NIPS, 2015.
.ps.gz | .pdf | .djvu | weblink | abstract]
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists of maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, the unknown previous token is then replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence. We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. Experiments on several sequence prediction tasks show that this approach yields significant improvements. Moreover, it was used successfully in our winning entry to the MSCOCO image captioning challenge, 2015.
[2] V. Ramanathan, J. Deng, C. Li, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio, C. Rosenberg, and F.-F. Li.
Learning semantic relationships for better action retrieval in images.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
.ps.gz | .pdf | .djvu | abstract]
Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationship between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We would like our method to reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationship between actions and uses them for training better action recognition models. Our model incorporates linguistic, visual and logical consistency based cues to effectively identify theses relationships. We train and test our model on a new largescale image dataset of human actions under two settings with 27K and 2K actions. We show a significant improvement in mean AP compared to different baseline methods including the state-of-the-art HEX-graph approach from Deng et al.
[3] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan.
Show and tell: A neural image caption generator.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
.ps.gz | .pdf | .djvu | weblink | abstract]
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

2014

[1] S. Bengio and G. Heigold.
Word embeddings for speech recognition.
In Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech, 2014.
.ps.gz | .pdf | .djvu | abstract]
Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still allow to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.
[2] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam.
Large-scale object classification using label relation graphs.
In Proceedings of the European Conference on Computer Vision, ECCV, 2014.
Best Paper Award [ .ps.gz | .pdf | .djvu | abstract]
In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.
[3] M. R. Gupta, S. Bengio, and J. Weston.
Training highly multiclass classifiers.
Journal of Machine Learning Research, JMLR, 15:1461--1492, 2014.
.ps.gz | .pdf | .djvu | weblink | abstract]
Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the more-confusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to one-vs-all linear SVMs and Wsabie.
[4] I. Lapidot, J.-F. Bonastre, and S. Bengio.
Telephone conversation speaker diarization using mealy-HMMs.
In Proceedings of Speaker Odyssey, 2014.
weblink | abstract]
When Hidden Markov Models (HMMs) were first introduced, two competing representation models were proposed, the Moore model, with separate emission and transition distributions, which is commonly used in speech technologies, and the Mealy model, with a single emission-transition distribution. Since then the literature has mostly focused on the Moore model. In this paper, we would like to show the use of MealyHMMs for telephone conversation speaker diarization task. We present the Viterbi training and decoding for Mealy-HMMs and show that it yields similar performance compared to MooreHMMs with a fewer number of parameters.
[5] J. Lee, S. Bengio, S. Kim, G. Lebanon, and Y. Singer.
Local collaborative ranking.
In International World Wide Web Conference, WWW, 2014.
.ps.gz | .pdf | .djvu | abstract]
Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Collaborative filtering approaches to recommendation systems typically assume that the rating matrix (e.g., movie ratings by viewers) is low-rank. In this paper, we examine an alternative approach in which the rating matrix is locally low-rank. Concretely, we assume that the rating matrix is low-rank within certain neighborhoods of the metric space defined by (user, item) pairs. We combine a recent approach for local low-rank approximation based on the Frobenius norm with a general empirical risk minimization for ranking losses. Our experiments indicate that the combination of a mixture of local low-rank matrices each of which was trained to minimize a ranking loss outperforms many of the currently used state-of-the-art recommendation systems. Moreover, our method is easy to parallelize, making it a viable approach for large scale real-world rank-based recommendation systems.
[6] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean.
Zero-shot learning by convex combination of semantic embeddings.
In International Conference on Learning Representations, ICLR, 2014.
.ps.gz | .pdf | .djvu | abstract]
Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional classification framing of image understanding, particularly in terms of the promise for zero-shot learning -- the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing image classifier and a semantic word embedding model, which contains the class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of and indeed outperforms state of the art methods on the ImageNet zero-shot learning task.

2013

[1] S. Bengio, J. Dean, D. Erhan, E. Ie, Q. Le, A. Rabinovich, J. Shlens, and Y. Singer.
Using web co-occurrence statistics for improving image categorization.
ArXiv, 1312.5697, 2013.
.ps.gz | .pdf | .djvu | weblink | abstract]
Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural to expect not to see an elephant to appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using co-occurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) co-occur in web documents, we obtain a good estimate of expected co-occurrences in visual data. We then cast the problem of combining textual co-occurrence statistics with the predictions of image-based classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Albeit the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets.
[2] S. Bengio, L. Deng, H. Larochelle, H. Lee, and R. Salakhutdinov.
Guest editors' introduction: Special section on learning deep architectures.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 35:1795--1797, 2013.
.ps.gz | .pdf | .djvu | weblink ]
[3] H. Elmlund, D. Elmlund, and S. Bengio.
PRIME: Probabilistic initial 3d model generation for single-particle cryo-electron microscopy.
Structure, 21:1299--1306, 2013.
.ps.gz | .pdf | .djvu | weblink | abstract]
Low-dose electron microscopy of cryo-preserved individual biomolecules (single-particle cryo-EM) is a powerful tool for obtaining information about the structure and dynamics of large macromolecular assemblies. Acquiring images with low dose reduces radiation damage, preserves atomic structural details, but results in low signal-to-noise ratio of the individual images. The projection directions of the two-dimensional images are random and unknown. The grand challenge is to achieve the precise three-dimensional (3D) alignment of many (tens of thousands to millions) noisy projection images, which may then be combined to obtain a faithful 3D map. An accurate initial 3D model is critical for obtaining the precise 3D alignment required for high-resolution (<10 Å) map reconstruction. We report a method (PRIME) that, in a single step and without prior structural knowledge, can generate an accurate initial 3D map directly from the noisy images.
[4] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
DeViSE: A deep visual-semantic embedding model.
In Advances In Neural Information Processing Systems, NIPS, 2013.
.ps.gz | .pdf | .djvu | abstract]
Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources -- such as text data -- both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.
[5] M. Stevens, S. Bengio, and Y. Singer.
Efficient learning of sparse ranking functions.
In B. Scholkopf, Z. Luo, and V. Vovk, editors, Empirical Inference. Springer, 2013.
weblink | abstract]
Algorithms for learning to rank can be inefficient when they employ risk functions that use structural information. We describe and analyze a learning algorithm that efficiently learns a ranking function using a domination loss. This loss is designed for problems in which we need to rank a small number of positive examples over a vast number of negative examples. In that context, we propose an efficient coordinate descent approach that scales linearly with the number of examples. We then present an extension that incorporates regularization thus extending Vapnik¿s notion of regularized empirical risk minimization to ranking learning. We also discuss an extension to the case of multi-values feedback. Experiments performed on several benchmark datasets and large scale Google internal dataset demonstrate the effectiveness of learning algorithm in constructing compact models while retaining the empirical performance accuracy.

2012

[1] S. Bengio.
Large scale visual semantic extraction.
In Frontiers of Engineering - Reports on Leading-Edge Engineering from the 2011 Symposium, 2012.
weblink | abstract]
Image annotation is the task of providing textual semantic to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a so-called "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts.

2011

[1] C. Dimitrakakis and S. Bengio.
Phoneme and sentence-level ensembles for speech recognition.
EURASIP Journal on Audio, Speech, and Music Processing, 2011, 2011.
.ps.gz | .pdf | .djvu | weblink | abstract]
We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
[2] J. Weston, S. Bengio, and P. Hamel.
Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval.
Journal of New Music Research, 40:337--348, 2011.
.ps.gz | .pdf | .djvu | weblink | abstract]
Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single low-dimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multi-task learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.
[3] J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation.
In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011.
.ps.gz | .pdf | .djvu | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

2010

[1] S. Bengio.
Statistical machine learning for HCI.
In J.-P. Thiran, F. Marqués, and H. Bourlard, editors, Multimodal Signal Processing: Theory and Applications for Human-Computer Interaction, pages 7--23. Academic Press, 2010.
weblink | abstract]
This chapter introduces the main concepts of statistical machine learning, as they are pivot in most algorithms tailored for multimodal signal processing. In particular, the chapter will cover a general introduction to machine learning and how it is used in classification, regression and density estimation. Following this introduction, two particularly well known models will be presented, together with their associated learning algorithm: support vector machines, which are well-known for classification tasks, and hidden Markov models, which are tailored for sequence processing tasks such as speech recognition.
[2] S. Bengio, J. Weston, and D. Grangier.
Label embedding trees for large multi-class tasks.
In Advances in Neural Information Processing Systems, NIPS, 2010.
.ps.gz | .pdf | .djvu | abstract]
Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster.
[3] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking.
Journal of Machine Learning Research, JMLR, 11:1109--1135, 2010.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.
[4] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pre-training help deep learning?
Journal of Machine Learning Research, JMLR, 11:625--660, 2010.
.ps.gz | .pdf | .djvu | weblink | abstract]
Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.
[5] R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik.
Sound retrieval and ranking using sparse auditory representations.
Neural Computation, 22(9):2390--2416, 2010.
.ps.gz | .pdf | .djvu | weblink | abstract]
To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large scale task. We have adapted a machine-vision method, the “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use adaptive pole--zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional Mel-Frequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top-1 was 73% and the average precision was 35%, reflecting a 18% improvement over the best competing MFCC frontend.
[6] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD, 2010.
Best Paper Award in Machine Learning [ .ps.gz | .pdf | .djvu | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.
[7] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
Machine Learning Journal, 81(1):21--35, 2010.
.ps.gz | .pdf | .djvu | weblink | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

2009

[1] S. Bengio and J. Keshet.
Introduction.
In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 3--10. Wiley, 2009.
weblink | abstract]
One of the most natural communication tools used by humans is their voice. It is hence natural that a lot of research has been devoted to analyzing and understanding human uttered speech for various applications. The most obvious one is automatic speech recognition, where the goal is to transcribe a recorded speech utterance into its corresponding sequence of words. Other applications include speaker recognition, where the goal is to determine either the claimed identity of the speaker (verification) or who is speaking (identification), and speaker segmentation or diarization, where the goal is to segment an acoustic sequence in terms of the underlying speakers (such as during a dialog). Although an enormous amount of research has been devoted to speech processing, there appears to be some form of local optimum in terms of the fundamental tools used to approach these problems. The aim of this book is to introduce the speech researcher community to radically different approaches based on more recent kernel based machine learning methods. In this introduction, we first briefly review the predominant speech processing approach, based on hidden Markov models, as well as its known problems; we then introduce the most well known kernel based approach, the Support Vector Machine (SVM), and finally outline the various contributions of this book.
[2] S. Bengio, F. Pereira, Y. Singer, and D. Strelow.
Group sparse coding.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
.ps.gz | .pdf | .djvu | abstract]
Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.
[3] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large-scale online learning of image similarity through ranking: Extended abstract.
In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address real-world large-scale AI problem, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants, provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that query-independent similarity could be accurately learned even for large-scale datasets that could not be handled before.
[4] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.
[5] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent.
The difficulty of training deep architectures and the effect of unsupervised pre-training.
In D. van Dyk and M. Wellings, editors, Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 5 of JMLR Workshop and Conference Procedings, pages 153--160, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.
[6] D. Grangier, J. Keshet, and S. Bengio.
Discriminative keyword spotting.
In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 175--194. Wiley, 2009.
weblink | abstract]
This chapter introduces a discriminative method for detecting and spotting keywords in spoken utterances. Given a word represented as a sequence of phonemes and a spoken utterance, the keyword spotter predicts the best time span of the phoneme sequence in the spoken utterance along with a confidence. If the prediction confidence is above certain level the keyword is declared to be spoken in the utterance within the predicted time span, otherwise the keyword is declared as not spoken. The problem of keyword spotting training is formulated as a discriminative task where the model parameters are chosen so the utterance in which the keyword is spoken would have higher confidence than any other spoken utterance in which the keyword is not spoken. It is shown theoretically and empirically that the proposed training method resulted with a high area under the Receiver Operating Characteristic (ROC) curve, the most common measure to evaluate keyword spotters. We present an iterative algorithm to train the keyword spotter efficiently. The proposed approach contrasts with standard spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a loss directly related to the spotting performance. Several experiments performed on TIMIT and WSJ corpora show the advantage of our approach over HMM-based alternatives.
[7] J. Keshet and S. Bengio, editors.
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. Wiley, 2009.
weblink | abstract]
This is the first book dedicated to uniting research related to speech and speaker recognition based on the recent advances in large margin and kernel methods. The first part of the book presents theoretical and practical foundations of large margin and kernel methods, from Support Vector Machines to large margin methods for structured learning. The second part of the book is dedicated to acoustic modeling of continuous speech recognizers, where the grounds for practical large margin sequence learning are set. The third part introduces large margin methods for discriminative language modeling. The last part of the book is dedicated to the application of keyword spotting, speaker verification and spectral clustering. The book is an important reference to researchers and practitioners in the field of modern speech and speaker recognition. The purpose of the book is twofold: first, to set the theoretical foundation of large margin and kernel methods relevant to the speech recognition domain; second, to propose a practical guide on implementation of these methods to the speech recognition domain. The reader is presumed to have basic knowledge of large margin and kernel methods and of basic algorithms in speech and speaker recognition.
[8] J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
Speech Communication, 51:317--329, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulted discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI-Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system.
[9] J. Mariéthoz, S. Bengio, and Y. Grandvalet.
Kernel-based text-independent speaker verification.
In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 195--220. Wiley, 2009.
weblink | abstract]
The goal of a person authentication system is to certify/attest the claimed identity of a user. When this authentication is based on the voice of the user, without respect to what the user exactly said, the system is called a text-independent speaker verification system. Speaker verification systems are increasingly often used to secure personal information, particularly for mobile phone based applications. Furthermore, text-independent versions of speaker verification systems are most used for their simplicity, as they do not require complex speech recognition modules. The most common approach to this task is based on Gaussian Mixture Models (GMMs) (Reynolds et al. 2000), which do not take into account any temporal information. GMMs have been intensively used thanks to their good performance, especially with the use of the Maximum a posteriori (MAP) (Gauvain and Lee 1994) adaptation algorithm. This approach is based on the density estimation of an impostor data distribution, followed by its adaptation to a specific client data set. Note that the estimation of these densities is not the final goal of speaker verification systems, which is rather to discriminate the client and impostor classes; hence discriminative approaches might appear good candidates for this task as well. As a matter of fact, Support Vector Machine (SVM) based systems have been the subject of several recent publications in the speaker verification community, in which they obtain Ma performance similar to or even better than GMMs on several text-independent speaker verification tasks. In order to use SVMs or any other discriminant approaches for speaker verification, several modifications of the classical techniques need to be performed. The purpose of this chapter is to present an overview of discriminant approaches that have been used successfully for the task of text-independent speaker verification, to analyze their differences from and their similarities to each other and to classical generative approaches based on GMMs. An open-source version of the C++ source code used to performed all experiments described in this chapter can be found at http://speaker.abracadoudou.com.
[10] J.-F. Paiement, S. Bengio, and D. Eck.
Probabilistic models for melodic prediction.
Artificial Intelligence Journal, 173(14):1266--1274, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.
[11] J.-F. Paiement, Y. Grandvalet, and S. Bengio.
Predictive models for music.
Connection Science, 21(2 & 3):253--272, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce predictive models for melodies. We decompose melodic modeling into two subtasks. We first propose a rhythm model based on the distributions of distances between subsequences. Then, we define a generative model for melodies given chords and rhythms based on modeling sequences of Narmour features. The rhythm model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. Using a similar evaluation procedure, the proposed melodic model consistently outperforms an Input/Output Hidden Markov Model. Furthermore, these models are able to generate realistic melodies given appropriate musical contexts.
[12] M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparse-code representations.
In ICML 2009 Workshop on Sparse Method for Music Audio, 2009.
.ps.gz | .pdf | .djvu | abstract]
The task of ranking sounds from text queries is a good test application for machine-hearing techniques, and particularly for comparison and evaluation of alternative sound representations in a large-scale setting. We have adapted a machine-vision system, “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns, using a ranking-based cost function, a linear mapping from a very large sparse feature space to a large query-term space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. In addition to two main auditory-image models, we also include and compare a family of more conventional Mel-Frequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. The two auditory models tested use the adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precision-at-top-k performance measures, the best results are about 72% top-1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

2008

[1] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Large-scale content-based audio retrieval from text queries.
In ACM International Conference on Multimedia Information Retrieval, MIR, 2008.
.ps.gz | .pdf | .djvu | abstract]
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather sound sample based queries, (2) searches by audio content rather than via textual meta data, and (3) can scale to very large number of audio documents and very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches; Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed user-labeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
[2] D. Grangier and S. Bengio.
A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 30(8):1371--1384, 2008.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.
[3] J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A distance model for rhythms.
In International Conference on Machine Learning, ICML, 2008.
.ps.gz | .pdf | .djvu | abstract]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
[4] H. Paugam-Moisy, R. Martinez, and S. Bengio.
Delay learning and polychronization for reservoir computing.
Neurocomputing, 71(7--9):1143--1158, 2008.
.ps.gz | .pdf | .djvu | weblink | abstract]
We propose a multi-timescale learning rule for spiking neuron networks, in the line of the recently emerging field of reservoir computing. The reservoir is a network model of spiking neurons, with random topology and driven by STDP (Spike-Time-Dependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that affects the synaptic delays linking the network to the readout neurons, with classification as a goal task. The network processing and the resulting performance can be explained by the concept of polychronization, proposed by Izhikevich (2006, Neural Computation, 18:2), on physiological grounds. The model emphasizes that polychronization can be used as a tool for exploiting the computational power of synaptic delays and for monitoring the topology and activity of a spiking neuron network.

2007

[1] S. Bengio and J. Mariéthoz.
Biometric person authentication is a multiple classifier problem.
In M. Haindl, J. Kittler, and F. Roli, editors, 7th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 4472. Springer-Verlag, 2007.
.ps.gz | .pdf | .djvu | abstract]
Several papers have already shown the interest of using multiple classifiers in order to enhance the performance of biometric person authentication systems. In this paper, we would like to argue that the core task of Biometric Person Authentication is actually a multiple classifier problem as such: indeed, in order to reach state-of-the-art performance, we argue that all current systems , in one way or another, try to solve several tasks simultaneously and that without such joint training (or sharing), they would not succeed as well. We explain hereafter this perspective, and according to it, we propose some ways to take advantage of it, ranging from more parameter sharing to similarity learning.
[2] D. Grangier and S. Bengio.
Learning the inter-frame distance for discriminative template-based keyword detection.
In Proceedings of the 10th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2007.
.ps.gz | .pdf | .djvu | abstract]
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.
[3] J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
In ISCA Research Workshop on Non Linear Speech Processing, NOLISP, 2007.
.ps.gz | .pdf | .djvu | abstract]
This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on non-linearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.
[4] J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to text-independent speaker verification systems.
Pattern Recognition, 40:2315--2324, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper present a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to state-of-the-art GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.
[5] J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A generative model for rhythms.
In NIPS Workshop on Brain, Music and Cognition, 2007.
.ps.gz | .pdf | .djvu | abstract]
Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
[6] H. Paugam-Moisy, R. Martinez, and S. Bengio.
A supervised learning approach based on STDP and polychronization in spiking neuron networks.
In European Symposium on Artificial Neural Networks, ESANN, 2007.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
We propose a network model of spiking neurons, without preimposed topology and driven by STDP (Spike-Time-Dependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that has effect on the synaptic delays linking the network to the output neurons, with classification as a goal task. The network processing and the resulting performance are completely explainable by the concept of polychronization, proposed by Izhikevich [?]. The model emphasizes the computational capabilities of this concept.
[7] N. Poh and S. Bengio.
Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2007.
.ps.gz | .pdf | .djvu | abstract]
Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, composition and the choice of samples. We propose to reduce the performance dependency of these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test class-conditional (client and impostor) score distributions are unknown and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them on the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance of an unseen database with two times more users with an average coverage of 95% (over all the 24 systems). A coverage is the proportion of the unseen EPC covered by the derived confidence interval.
[8] N. Poh, A. Martin, and S. Bengio.
Performance generalization in biometric authentication using joint user-specific and sample bootstraps.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 29(3):492--498, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
Biometric authentication performance is often depicted by a decision error trade-off (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a two-step bootstrap procedure to take into account of the three mentioned sources of variability. This is an extension to the Bolle 's bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75% of DET coverage, an unseen DET curve with 8 times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and hence more useful.
[9] S. Renals, S. Bengio, and J. G. Fiscus, editors.
Machine Learning for Multimodal Interaction: Third International Workshop, MLMI'2006.
volume 4299 of Lecture Notes in Computer Science. Springer-Verlag, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the 3rd Workshop on Machine Learning for Multimodal Interaction (MLMI 2006), held in Bethesda MD, USA during May 1­4, 2006. The workshop was organized and sponsored jointly by the US National Institute for Standards and Technology (NIST), three projects supported by the European Commission (Information Society Technologies priority of the sixth Framework Programme) -- the AMI and CHIL Integrated Projects, and the PASCAL Network of Excellence -- and the Swiss National Science Foundation national research collaboration, IM2. In addition to the main workshop, MLMI 2006 was co-located with the 4th NIST Meeting Recognition Workshop. This workshop was centered on the Rich Transcription 2006 Spring Meeting Recognition (RT-06) evaluation of speech technologies within the meeting domain. Building on the success of previous evaluations in this domain, the RT-06 evaluation continued evaluation tasks in the areas of speech-to-text, who-spoke-when, and speech activity detection. The conference program featured invited talks, full papers (subject to careful peer review, by at least three reviewers), and posters (accepted on the basis of abstracts) covering a wide range of areas related to machine learning applied to multimodal interaction -- and more specifically to multimodal meeting processing, as addressed by the various sponsoring projects. These areas included human­human communication modeling, speech and visual processing, multimodal processing, fusion and fission, human­computer interaction, and the modeling of discourse and dialog, with an emphasis on the application of machine learning. Out of the submitted full papers, about 50% were accepted for publication in the present volume, after authors had been invited to take review comments and conference feedback into account. The workshop featured invited talks from Roderick Murray-Smith (University of Glasgow), Tsuhan Chen (Carnegie Mellon University) and David McNeill (University of Chicago), and a special session on projects in the area of multimodal interaction including presentations on the VACE, CHIL and AMI projects.
[10] S. Sonnenburg, M. L. Braun, C. Soon Ong, S Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson.
The need for open source software in machine learning.
Journal of Machine Learning Research, JMLR, 8:2443--2466, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not utilized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.
[11] D. Zhang and S. Bengio.
Exploring contextual information in a layered framework for group action recognition.
In IEEE International Conference on Multimedia & Expo, ICME, 2007.
.ps.gz | .pdf | .djvu | abstract]
Contextual information is important for sequence modeling. Hidden Markov models (HMMs) and extensions, which have been widely used for sequence modeling, make simplifying, often unrealistic assumptions on the conditional independence of observations given the class labels, thus cannot accommodate overlapping features or long-term contextual information. In this paper, we introduce a principled layered framework with three implementation methods that take into account contextual information (as available in the whole or part of the sequence). The first two methods are based on state alpha and gamma posteriors (as usually referred to in the HMM formalism). The third method is based on conditional random fields (CRFs), a conditional model that relaxes the independent assumption on the observations required by HMMs for computational tractability. We illustrate our methods with the application of recognizing group actions in meetings. Experiments and comparison with standard HMM baseline showed the validity of the proposed approach.

2006

[1] F. Cardinaux, C. Sanderson, and S. Bengio.
User authentication via adapted statistical models of face images.
IEEE Transactions on Signal Processing, 54(1):361--373, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition. Recently, a simpler statistical model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM and HMM based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available. Considerably more precise models can be obtained through the use of Maximum a Posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo-2D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best trade-off in terms of authentication time, robustness and discrimination performance is achieved by the extended GMM approach.
[2] O. Glickman, I. Dagan, M. Keller, S. Bengio, and W. Daelemans.
Investigating lexical substitution scoring for subtitle generation.
In Tenth Conference on Computational Natural Language Learning, CONLL, 2006.
.ps.gz | .pdf | .djvu | abstract]
This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state of the art scoring methods that predict the validity of a given substitution. The paper evaluates two context independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two Wordnet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.
[3] D. Grangier and S. Bengio.
A neural network to retrieve images from text queries.
In Proceedings of the 16th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 4132. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work presents a neural network for the retrieval of images from text queries. The proposed network is composed of two main modules: the first one extracts a global picture representation from local block descriptors while the second one aims at solving the retrieval problem from the extracted representation. Both modules are trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.
[4] D. Grangier, F. Monay, and S. Bengio.
A discriminative approach for the retrieval of images from text queries.
In European Conference on Machine Learning, ECML, Lecture Notes in Computer Science, volume LNCS 4212. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.
[5] D. Grangier, F. Monay, and S. Bengio.
Learning to retrieve images from text queries with a discriminative model.
In International Workshop on Adaptive Multimedia Retrieval, AMR, 2006.
.ps.gz | .pdf | .djvu | abstract]
This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the state-of-the-art (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).
[6] J. Keshet, S. Shalev-Shwartz, S. Bengio, Y. Singer, and D. Chazan.
Discriminative kernel-based phoneme sequence recognition.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
.ps.gz | .pdf | .djvu | abstract]
We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMM-based approaches, our method uses a discriminative kernel-based training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance along with a proposed phoneme sequence to a vector-space endowed with an inner-product that is realized by a Mercer kernel. Building on large margin techniques for predicting whole sequences, we are able to devise a learning algorithm which distills to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and further describe an efficient implementation of it. We present initial encouraging experimental results with the TIMIT and compare the proposed method to an HMM-based approach.
[7] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Posterior based keyword spotting with a priori thresholds.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we propose a new posterior based scoring approach for keyword and non keyword (garbage) elements. The estimation of these scores is based on HMM state posterior probability definition, taking into account long contextual information and the prior knowledge (e.g. keyword model topology). The state posteriors are then integrated into keyword and garbage posteriors for every frame. These posteriors are used to make a decision on detection of the keyword at each frame. The frame level decisions are then accumulated (in this case, by counting) to make a global decision on having the keyword in the utterance. In this way, the contribution of possible outliers are minimized, as opposed to the conventional Viterbi decoding approach which accumulates likelihoods. Experiments on keywords from the Conversational Telephone Speech (CTS) and Numbers'95 databases are reported. Results show that the new scoring approach leads to better trade off between true and false alarms compared to the Viterbi decoding approach, while also providing the possibility to precalculate keyword specific spotting thresholds related to the length of the keywords.
[8] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Using more informative posterior probabilities for speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we present initial investigations towards boosting posterior probability based speech recognition systems by estimating more informative posteriors taking into account acoustic context (e.g., the whole utterance), as well as possible prior information (such as phonetic and lexical knowledge). These posteriors are estimated based on HMM state posterior probability definition (typically used in standard HMMs training). This approach provides a new, principled, theoretical framework for hierarchical estimation/use of more informative posteriors integrating appropriate context and prior knowledge. In the present work, we used the resulting posteriors as local scores for decoding. On the OGI numbers database, this resulted in significant performance improvement, compared to using MLP estimated posteriors for decoding (hybrid HMM/ANN approach) for clean and more specially for noisy speech. The system is also shown to be much less sensitive to tuning factors (such as phone deletion penalty, language model scaling) compared to the standard HMM/ANN and HMM/GMM systems, thus practically it does not need to be tuned to achieve the best possible performance.
[9] M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, and J. Richiardi.
Writer identification for smart meeting room systems.
In H. Bunke and A. L. Spitz, editors, Document Analysis Systems VII: 7th International Workshop, DAS, Lecture Notes in Computer Science, volume LNCS 3872, pages 186--195. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we present a text independent on-line writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets used in signature verification systems before. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the stroke-based feature set outperforms the point-based feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.
[10] J. Mariéthoz and S. Bengio.
A max kernel for text-independent speaker verification systems.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we present a principled SVM based speaker verification system. A general approach is developed that enables the use of any kernel at the frame level. An extension of his approach using the Max operator is then proposed. The new system is then compared to state-of-the-art GMM and other SVM based systems found in the literature on the Polyvar database. It is found that the new system outperforms, most of the time, the other systems, statistically significantly.
[11] J.-F. Paiement, D. Eck, and S. Bengio.
Probabilistic melodic harmonization.
In L. Lamontagne and M. Marchand, editors, Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, volume LNCS 4013, pages 218--229. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural components in chord notation. We show empirically that root notes progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root notes progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.
[12] N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue -- the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a real-user database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database?. Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.
[13] N. Poh and S. Bengio.
Database, protocol and tools for evaluating score-level fusion algorithms in biometric authentication.
Pattern Recognition, 39(2):223--233, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinate d effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.
[14] N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess user-dependent variabilities.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue -- the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a real-user database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database?. Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [?] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.
[15] A. Pozdnoukhov and S. Bengio.
Graph-based invariant manifolds for invariant pattern recognition with kernel methods.
In International Conference on Pattern Recognition, ICPR, 2006.
.ps.gz | .pdf | .djvu | abstract]
We present here an approach for applying the technique of modeling data transformation manifolds for invariant learning with kernel methods. The approach is based on building a kernel function on the graph modeling the invariant manifold. It provides a way for taking into account nearly arbitrary transformations of the input samples. The approach is verified experimentally on the task of optical character recognition, providing state-of-the-art performance on harder problem settings.
[16] A. Pozdnoukhov and S. Bengio.
Invariances in kernel methods: From samples to objects.
Pattern Recognition Letters, 27(10):1087--1097, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a general method for incorporating prior knowledge into kernel methods such as support vector machines. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. A number of implementation techniques of this method, based on hard geometrical objects and soft objects based on distributions are considered. Tangent vectors are extensively used for object construction. Empirical results on one artificial dataset and two real datasets of electro-encephalogram signals and face images demonstrate the usefulness of the proposed method. The method could establish a foundation for an information retrieval and person identification systems.
[17] A. Pozdnoukhov and S. Bengio.
Semi-supervised kernel methods for regression estimation.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
The paper presents a semi-supervised kernel method for regression estimation in the presence of unlabeled patterns. The method exploits a recently proposed data-dependent kernel which is constructed in order to represent the inner geometry of the data. This kernel is implemented into Kernel Regression methods (SVR, KRR). Experimental results aim to highlight the properties of the method and its advantages as compared to fully supervised approaches. The influence of the parameters on the model properties was evaluated experimentally. One artificial and two real-world datasets were used to demonstrate the performance of the proposed algorithm.
[18] S. Renals and S. Bengio, editors.
Machine Learning for Multimodal Interaction: Second International Workshop, MLMI'2005.
volume 3869 of Lecture Notes in Computer Science. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the Second Workshop on Machine Learning for Multimodal Interaction (MLMI 2005), held in Edinburgh, Scotland, during 11-13 July 2005. The workshop was organized and sponsored jointly by two European integrated projects, three European Networks of Excellence and a Swiss national research network: AMI, CHIL, HUMAINE, PASCAL, SIMILAR, and IM2. In addition to the main workshop, MLMI 2005 hosted the NIST (US National Institute of Standards and Technology) Meeting Recognition Workshop. This workshop (the third such sponsored by NIST) was centered on the Rich Transcription 2005 Spring Meeting Recognition (RT-05) evaluation of speech technologies within the meeting domain. Building on the success of the RT-04 spring evaluation, the RT-05 evaluation continued the speech-to-text and speaker diarization evaluation tasks and added two new evaluation tasks: speech activity detection and source localization. Given the multiple links between the above projects and several related research areas, and the success of the first MLMI 2004 workshop, it was decided to organize once again a joint workshop bringing together researchers working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The areas covered included: Human-human communication modeling, Speech and visual processing, Multimodal processing, fusion and fission, Multimodal dialog modeling, Human-human interaction modeling, Multimodal data structuring and presentation, Multimedia indexing and retrieval, Meeting structure analysis, Meeting summarizing, Multimodal meeting annotation, and Machine learning applied to the above.
[19] Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Measuring the performance of face localization systems.
Image and Vision Computing, 24(8):882--893, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
The purpose of Face localization is to determine the coordinates of a face in a given image. It is a fundamental research area in computer vision because it serves, as a necessary first step in any face processing system, such as automatic face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performances depend widely on the accuracy of the face localization process. The purpose of this paper is to mainly show that the error made during the localization process may have different impacts on the final application. We first show the influence of localization errors on the face verification task and then empirically demonstrate the problems of current localization performance measures when applied to this task. In order to properly evaluate the performance of a face localization algorithm, we then propose to embed the final application (here face verification) into the performance measuring process. Using two benchmark databases, BANCA and XM2VTS, we proceed by showing empirically that our proposed method to evaluate localization algorithms better matches the final verification performance.
[20] C. Sanderson, S. Bengio, and Y. Gao.
On transforming statistical models for non-frontal face verification.
Pattern Recognition, 39(2):288--302, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of gaussians, the problem is tackled through extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCA-derived features) and a local feature system (based on DCT-derived features). Experiments on the FERET database suggest that for the holistic system, the LinReg based technique is more suited than the MLLR based techniques; for the local feature system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts based representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.
[21] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Modeling individual and group actions in meetings with layered HMMs.
IEEE Transactions on Multimedia, 8(3):509--520, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.
[22] D. Zhang, D. Gatica-Perez, D. Roy, and S. Bengio.
Modeling interactions from email communication.
In IEEE International Conference on Multimedia & Expo, ICME, 2006.
.ps.gz | .pdf | .djvu | abstract]
Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topic-based interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization.

2005

[1] S. Bengio and H. Bourlard, editors.
Machine Learning for Multimodal Interaction: First International Workshop, MLMI'2004.
volume 3361 of Lecture Notes in Computer Science. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the First Workshop on Machine Learning for Multimodal Interaction (MLMI'04), held in Martigny, Switzerland, from June 21-23, 2004. The workshop was organized and sponsored jointly by three European projects, AMI, PASCAL and M4, as well as a Swiss national research network, IM2. It brings together researchers from different communities working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction in meetings. The motivation for creating such forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multi-disciplinary workshop. The conference program covered a wide range of areas related to machine learning applied to multimodal interaction - and more specifically to multi-modal meeting processing. These areas included human-human communication modeling, speech and visual processing, multi-modal processing, fusion and fission, multi-modal dialog modeling, human-human interaction modeling, multi-modal data structuring and presentation, multimedia indexing and retrieval, meeting structure analysis, meeting summarizing, multimodal meeting annotation, and machine learning applied to the above.
[2] S. Bengio and H. Bourlard.
Multi channel sequence processing.
In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning: First International Workshop, Lecture Notes in Artificial Intelligence, volume LNAI 3635, pages 22--36. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper summarizes some of the current research challenges arising from multi-channel sequence processing. Indeed, multiple real life applications involve simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches towards sequence modeling. However, several challenging research issues are still open, such as taking into account asynchrony and correlation between several feature streams, or handling the underlying growing complexity. In this framework, we discuss here two novel approaches, which recently started to be investigated with success in the context of large multimodal problems. These include the asynchronous HMM, providing a principled approach towards the processing of multiple feature streams, and the layered HMM approach, providing a good formalism for decomposing large and complex (multi-stream) problems into layered architectures. As briefly reported here, combination of these two approaches yielded successful results on several multi-channel tasks, ranging from audio-visual speech recognition to automatic meeting analysis.
[3] S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve.
In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the so-called break-even point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to use adequately a non-parametric statistical test in order to produce EPCs with confidence intervals or assess the statistical significant difference between two models under various settings.
[4] C. Dimitrakakis and S. Bengio.
Boosting word error rates.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 501--504, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the well-known Numbers 95 database that indicate the importance of this temporal probability distribution.
[5] C. Dimitrakakis and S. Bengio.
Gradient-based estimates of return distributions.
In PASCAL Workshop on Principled Methods of Trading Exploration and Exploitation, 2005.
.ps.gz | .pdf | .djvu | abstract]
We present a general method for maintaining estimates of the distribution of parameters in arbitrary models. This is then applied to the estimation of probability distributions over actions in value-based reinforcement learning. While this approach is similar to other techniques that maintain a confidence measure for action-values, it nevertheless offers an insight into current techniques and hints at potential avenues of further research.
[6] C. Dimitrakakis and S. Bengio.
Online adaptive policies for ensemble classifiers.
Neurocomputing, 64:211--221, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper we attempt to train and combine the base classifiers using an adaptive policy. This policy is learnt through a Q-learning inspired technique. Its effectiveness for an essentially supervised task is demonstrated by experimental results on several UCI benchmark databases.
[7] D. Gatica-Perez, D. Zhang, and S. Bengio.
Extracting information from multimedia meeting collections.
In 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR, 2005.
.ps.gz | .pdf | .djvu | abstract]
Multimedia meeting collections, composed of unedited audio and video streams, handwritten notes, slides, and electronic documents that jointly constitute a raw record of complex human interaction processes in the workplace, have attracted interest due to the increasing feasibility of recording them in large quantities, by the opportunities for information access and retrieval applications derived from the automatic extraction of relevant meeting information, and by the challenges that the extraction of semantic information from real human activities entails. In this paper, we present a succint overview of recent approaches in this field, largely influenced by our own experiences. We first review some of the existing and potential needs for users of multimedia meeting information systems. We then summarize recent work on various research areas addressing some of these requirements. In more detail, we describe our work on automatic analysis of human interaction patterns from audio-visual sensors, discussing open issues in this domain.
[8] D. Gatica-Perez, I. McCowan D. Zhang, and S. Bengio.
Detecting group interest-level in meetings.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 489--492, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Finding relevant segments in meeting recordings is important for summarization, browsing, and retrieval purposes. In this paper, we define relevance as the interest-level that meeting participants manifest as a group during the course of their interaction (as perceived by an external observer), and investigate the automatic detection of segments of high-interest from audio-visual cues. This is motivated by the assumption that there is a relationship between segments of interest to participants, and those of interest to the end user, e.g. of a meeting browser. We first address the problem of human annotation of group interest-level. On a 50-meeting corpus, recorded in a room equipped with multiple cameras and microphones, we found that the annotations generated by multiple people exhibit a good degree of consistency, providing a stable ground-truth for automatic methods. For the automatic detection of high-interest segments, we investigate a methodology based on Hidden Markov Models (HMMs) and a number of audio and visual features. Single- and multi-stream approaches were studied. Using precision and recall as performance measures, the results suggest that (i) the automatic detection of group interest-level is promising, and (ii) while audio in general constitutes the predominant modality in meetings, the use of a multi-modal approach is beneficial.
[9] N. Gilardi and S. Bengio.
Machine learning for automatic environmental mapping: when and how?
In G. Dubois, editor, Automatic mapping algorithms for routine and emergency monitoring data. Report on the Spatial Interpolation Comparison (SIC2004) exercise, pages 123--138. Office for Official Publications of the European Communities, Luxembourg, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper discusses the opportunity of using Machine Learning techniques in an automatic environmental mapping context, as was the case for the SIC2004 exercise. First, the Machine Learning methodology is quickly described and compared to Geostatistics. From there, some clues about when to apply Machine Learning are proposed, and what outcomes can be expected from this choice. Finally, three well known regression algorithms: K-Nearest Neighbors, Multi Layer Perceptron and Support Vector Regression, are used on SIC2004 data in a Machine Learning context, and compared to Ordinary Kriging. This illustrates some potential drawbacks of SVR and MLP for applications such as SIC2004.
[10] Y. Grandvalet, J. Mariéthoz, and S. Bengio.
A probabilistic interpretation of SVMs with an application to unbalanced classification.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem when decisions result in unequal losses. Experiments on an unbalanced classification loss show improvements over state-of-the-art procedures.
[11] D. Grangier and S. Bengio.
Exploiting hyperlinks to learn a retrieval model.
In Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005.
.ps.gz | .pdf | .djvu | abstract]
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d inC according to their similarity with respect to q, sim (q,d). The identification of an effective function a,b ->sim(a,b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied over a non-hyperlinked retrieval corpus.
[12] D. Grangier and S. Bengio.
Inferring document similarity from hyperlinks.
In Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function a,b ->sim(a,b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.
[13] M. Keller and S. Bengio.
A neural network for text representation.
In Proceedings of the 15th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 3697, pages 667--672. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, non-probabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.
[14] M. Keller, S. Bengio, and S. Y. Wong.
Benchmarking non-parametric statistical tests.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions.
[15] H. Ketabdar, H. Bourlard, and S. Bengio.
Hierarchical multi-stream posterior based speech recognition system.
In Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3869, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on “state gamma posterior” definition (typically used in standard HMMs training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to the state-of-the-art Tandem systems.
[16] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Developing and enhancing posterior based speech recognition systems.
In Proceedings of the 9th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Local state or phone posterior probabilities are often investigated as local scores (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) to improve speech recogni tion systems. In this paper, we present initial results towards boosting these approaches by improving posterior estimat es, using acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). In the present work, the enhanced posterior distribution is associated with the “gamma” distribution typically used in standard HMMs training, and estimated from local likelihoods (GMM) or local posteriors (ANN). This approach results in a family of new HMM based systems, where only posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the phone and word levels, and integrating the appropriate context and prior knowledge in each level. In the present work, we used the resulting posteriors as local scores in a Viter bi decoder. On the OGI Numbers'95 database, this resulted in improved recognition performance, compared to a state-of-the-art hybrid HMM/ANN system.
[17] J. Mariéthoz and S. Bengio.
A unified framework for score normalization techniques applied to text independent speaker verification.
IEEE Signal Processing Letters, 12(7):532--535, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The purpose of this paper is to unify several of the state-of-the-art score normalization techniques applied to text-independent speaker verification systems. We propose a new framework for this purpose. The two well-known Z- and T-normalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these well-known score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework.
[18] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 27(3):305--317, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings. A statistical framework is proposed in which group actions result from the interactions of the individual participants. The group actions are modelled using different HMM-based approaches, where the observations are provided by a set of audio-visual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modelling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.
[19] J.-F. Paiement, D. Eck, and S. Bengio.
A probabilistic model for chord progressions.
In International Conference on Music Information Retrieval, ISMIR, 2005.
.ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
[20] J.-F. Paiement, D. Eck, S. Bengio, and D. Barber.
A graphical model for chord progressions embedded in a psychoacoustic space.
In International Conference on Machine Learning, ICML, 2005.
.ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
[21] N. Poh and S. Bengio.
Can chimeric persons be used in multimodal biometric authentication experiments?
In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, volume LNCS 3869. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extents in various real-life applications. A system that uses more than one behavioral and physiological characteristics to verify whether a person is who he/she claims to be is called a multimodal biometric authentication system. Due to lack of large true multimodal biometric datasets, the biometric trait of a user from a database is often combined with another different biometric trait of yet another user, thus creating a so-called a chimeric user. In the literature, this practice is justified based on the fact that the underlying biometric traits to be combined are assumed to be independent of each other given the user. To the best of our knowledge, there is no literature that approves or disapproves such practice. We study this topic from two aspects: 1) by clarifying the mentioned independence assumption and 2) by constructing a pool of chimeric users from a pool of true modality matched users (or simply “true users”) taken from a bimodal database, such that the performance variability due to chimeric user can be compared with that due to true users. The experimental results suggest that for a large proportion of the experiments, such practice is indeed questionable.
[22] N. Poh and S. Bengio.
EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks.
In N. C. Oza, R. Polikar, and J. Kittler, editors, 6th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 3541, pages 74--85. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Biometric authentication is a process of verifying an identity claim using a person's behavioural and physiological characteristics. Due to the vulnerability of the system to environmental noise and variation caused by the user, fusion of several biometric-enabled systems is identified as a promising solution. In the literature, various fixed rules (e.g. min, max, median, mean) and trainable classifiers (e.g. linear combination of scores or weighted sum) are used to combine the scores of several base-systems. How exactly do correlation and imbalance nature of base-system performance affect the fixed rules and trainable classifiers? We study these joint aspects using the commonly used error measurement in biometric authentication, namely Equal Error Rate (EER). Similar to several previous studies in the literature, the central assumption used here is that the class-dependent scores of a biometric system are approximately normally distributed. However, different from them, the novelty of this study is to make a direct link between the EER measure and the fusion schemes mentioned. Both synthetic and real experiments (with as many as 256 fusion experiments carried out on the XM2VTS benchmark score-level fusion data sets) verify our proposed theoretical modeling of EER of the two families of combination scheme. In particular, it is found that weighted sum can provide the best generalisation performance when its weights are estimated correctly. It also has the additional advantage that score normalisation prior to fusion is not needed, contrary to the rest of fixed fusion rules.
[23] N. Poh and S. Bengio.
F-ratio client-dependent normalisation for biometric authentication tasks.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 721--724, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new client-dependent normalisation to improve biometric authentication systems. There exists many client-de-pendent score normalisation techniques applied to speaker authentication, such as Z-Norm, D-Norm and T-Norm. Such normalisation is intended to adjust the variation across different client models. We propose “F-ratio” normalisation, or F-Norm, applied to face and speaker authentication systems. This normalisation requires only that as few as two client-dependent accesses are available (the more the better). Different from previous normalisation techniques, F-Norm considers the client and impostor distributions simultaneously. We show that F-ratio is a natural choice because it is directly associated to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that F-Norm actually “interpolates” between client-independent and client-dependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over Z-Norm, client-dependent threshold normalisation or no normalisation.
[24] N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks?
IEEE Transactions on Signal Processing, 53(11):4384--4396, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources such as subbands, streams (with different features) and multi modal data has been shown to be a very promising trend, both in experiments and to some extents in real-life biometric authentication applications. Despite considerable efforts in fusions, there is a lack of understanding on the roles and effects of correlation and variance (of both the client and impostor scores of base-classifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance Reduction-Equal Error Rate (VR-EER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we showed that Equal Error Rate (EER) can be modeled as a function of F-ratio, which itself is a function of 1) correlation, 2) variance of base-experts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of base-experts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be miss-leading. Experimental results on the BANCA multimodal database confirm our findings using VR-EER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated base-experts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score), improves fusion. However, by linking the concept of ambiguity decomposition to classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (class-dependent) variance. Moreover, by linking the concept of bias-variance-covariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.
[25] N. Poh and S. Bengio.
Improving fusion with margin-derived confidence in biometric authentication tasks.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1059--1068. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new confidence criterion to improve fusion via a linear combination of scores of several biometric authentication systems. This confidence is based on the margin of making a decision, which answers the question, “after observing the score of a given system, what is the confidence (or risk) associated to that given access?”. In the context of multimodal and intramodal fusion, such information proves valuable because the margin information can determine which of the systems should be given higher weights. Finally, we propose a linear discriminative framework to fuse the margin information with an existing global fusion function. The results of 32 fusion experiments carried out on the XM2VTS multimodal database show that fusion using margin (product of margin and expert opinion) is superior over fusion without the margin information (i.e., the original expert opinion). Furthermore, combining both sources of information increases fusion performance further.
[26] N. Poh and S. Bengio.
A novel approach to combining client-dependent and confidence information in multimodal biometrics.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1120--1129. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The issues of fusion with client-dependent and confidence information have been well studied separately in biometric authentication. In this study, we propose to take advantage of both sources of information in a discriminative framework. Initially, each source of information is processed on a per expert basis (plus on a per client basis for the first information and on a per example basis for the second information). Then, both sources of information are combined using a second-level classifier, across different experts. Although the formulation of such two-step solution is not new, the novelty lies in the way the sources of prior knowledge are incorporated prior to fusion using the second-level classifier. Because these two sources of information are of very different nature, one often needs to devise special algorithms to combine both information sources. Our framework that we call “Prior Knowledge Incorporation” has the advantage of using the standard machine learning algorithms. Based on 10 times 32=320 intramodal and multimodal fusion experiments carried out on the publicly available XM2VTS score-level fusion benchmark database, it is found that the generalisation performance of combining both information sources improves over using either or none of them, thus achieving a new state-of-the-art performance on this database.
[27] N. Poh and S. Bengio.
A score-level fusion benchmark database for biometric authentication.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 474--483. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.
[28] V. Popovici, S. Bengio, and J.-P. Thiran.
Kernel matching pursuit for large datasets.
Pattern Recognition, 38(12):2385--2390, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Kernel Matching Pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of some basis functions selected from a kernel-induced dictionary. Here we propose a modification of the Kernel Matching Pursuit algorithm that aim s at making the method practical for large datasets. Starting from an approximating algorithm, the Weak Greedy Algorithm, we introduce a stochastic method for reducing the search space at each iteration. Then we study the implications of using an approximate algorithm and we show how one can control the trade-off between the accuracy and the need for resources. Finally we present some experiments performed on a large dataset that support our approach and illustrate its applicability.
[29] A. Pozdnoukhov and S. Bengio.
Improving kernel classifiers for object categorization problems.
In International Conference on Machine Learning, ICML, Workshop on Learning with Partially Classified Training Data, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents an approach for improving the performance of kernel classifiers applied to object categorization problems. The approach is based on the use of distributions centered around each training points, which are exploited for inter-class invariant image representation with local invariant features. Furthermore, we propose an extensive use of unlabeled images for improving the SVM-based classifier.
[30] C. Sanderson, F. Cardinaux, and S. Bengio.
On accuracy/robustness/complexity trade-offs in face verification.
In IEEE International Conference on Information Technology and Applications, ICITA, pages 638--643, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In much of the literature devoted to face recognition, experiments are performed with controlled images (e.g. manual face localization, controlled lighting, background and pose). However, a practical recognition system has to be robust to more challenging conditions. In this paper we first evaluate, on the relatively difficult BANCA database, the discrimination accuracy, robustness and complexity of Gaussian Mixture Model (GMM), 1D- and pseudo-2D Hidden Markov Model (HMM) based systems, using both manual and automatic face localization. We also propose to extend the GMM approach through the use of local features with embedded positional information, increasing accuracy without sacrificing its low complexity. Experiments show that good accuracy on manually located faces is not necessarily indicative of good accuracy on automatically located faces (which are imperfectly located). The deciding factor is shown to be the degree of constraints placed on spatial relations between face parts. Methods which utilize rigid constraints have poor robustness compared to methods which have relaxed constraints. Furthermore, we show that while the pseudo-2D HMM approach has the best overall accuracy, classification time on current hardware makes it impractical. The best trade-off in terms of complexity, robustness and discrimination accuracy is achieved by the extended GMM approach.
[31] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised adapted HMMs for unusual event detection.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of temporal unusual event detection. Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised model-based approaches. We propose a semi-supervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner. The proposed framework has an iterative structure, which adapts a new unusual event model at each iteration. We show that such a framework can address problems due to the scarcity of training data and the difficulty in pre-defining unusual events. Experiments on audio, visual, and audio-visual data streams illustrate its effectiveness, compared with both supervised and unsupervised baseline methods.
[32] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised meeting event recognition with adapted HMMs.
In IEEE International Conference on Multimedia Expo, ICME, pages 611--618, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
This paper investigates the use of unlabeled data to help labeled data for audio-visual event recognition in meetings. To deal with situations in which it is difficult to collect enough labeled data to capture event characteristics, but collecting a large amount of unlabeled data is easy, we present a semi-supervised framework using HMM adaptation techniques. Instead of directly training one model for each event, we first train a well-estimated general event model for all events using both labeled and unlabeled data, and then adapt the general model to each specific event model using its own labeled data. We illustrate the proposed approach with a set of eight audio-visual events defined in meetings. Experiments and comparison with the fully-supervised baseline method show the validity of the proposed semi-supervised approach.
[33] D. Zhang, D. Gatica-Perez, S. Bengio, and D. Roy.
Learning influence among interacting markov chains.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | abstract]
We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. Individual level models actions of each player, and the group-level models actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model.

2004

[1] S. Bengio.
Multimodal speech processing using asynchronous hidden markov models.
Information Fusion, 5(2):81--89, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.
[2] S. Bengio and J. Mariéthoz.
The expected performance curve: a new assessment measure for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
ROC and DET curves are often used in the field of person authentication to assess the quality of a model or even to compare several models. We argue in this paper that this measure can be misleading as it compares performance measures that cannot be reached simultaneously by all systems. We propose instead new curves, called Expected Performance Curves (EPC). These curves enable the comparison between several systems according to a criterion, decided by the application, which is used to set thresholds according to a separate validation set. A free sofware is available to compute these curves. A real case study is used throughout the paper to illustrate it. Finally, note that while this study was done on an authentication problem, it also applies to most 2-class classification tasks.
[3] S. Bengio and J. Mariéthoz.
A statistical significance test for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but most being aggregates of two measures (such as the false acceptance rate and the false rejection rate), simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significantness of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they yield often too optimistic results (resulting in false statements about statistical significantness).
[4] H. Bourlard, S. Bengio, M. Magimai Doss, Q. Zhu, B. Mesot, and N. Morgan.
Towards using hierarchical posteriors for flexible automatic speech recognition systems.
In Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) Rich Transcription (RT'04) Workshop, 2004.
.ps.gz | .pdf | .djvu | abstract]
Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task.
[5] F. Cardinaux, C. Sanderson, and S. Bengio.
Face verification using adapted generative models.
In International Conference on Automatic Face and Gesture Recognition, FG, pages 825--830, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has been shown previously that systems based on local features and relatively complex generative models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition (here we mean both identification and verification). Recently a simpler generative model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In this paper we first propose to increase the performance of the GMM approach (without sacrificing its simplicity) through the use of local features with embedded positional information; we show that the performance obtained is comparable to 1D HMMs. Secondly, we evaluate different training techniques for both GMM and HMM based systems. We show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available; we propose to tackle this problem through the use of Maximum a Posteriori (MAP) training, where the lack of data problem can be effectively circumvented; we show that models estimated with MAP are significantly more robust and are able to generalize to adverse conditions present in the BANCA database.
[6] S. Chiappa and S. Bengio.
HMM and IOHMM modeling of EEG rhythms for asynchronous BCI systems.
In European Symposium on Artificial Neural Networks, ESANN, 2004.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
We compare the use of two Markovian models, HMMs and IOHMMs, to discriminate between three mental tasks for brain computer interface systems using an asynchronous protocol. We show that IOHMMs outperform HMMs but that, probably due to the lack of any prior information on the state dynamics, no practical advantage in the use of these models over their static counterparts is obtained.
[7] R. Collobert and S. Bengio.
A gentle hessian for efficient gradient descent.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 517--520, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Several second-order optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n2) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a block-diagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of block-diagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.
[8] R. Collobert and S. Bengio.
Links between perceptrons, MLPs and SVMs.
In International Conference on Machine Learning, ICML, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performances in generalization and speed than the other models.
[9] F. de Wet, K. Weber, L. Boves, B. Cranen, S. Bengio, and H. Bourlard.
Evaluation of formant-like features for automatic speech recognition.
Journal of the Acoustical Society of America, JASA, 116(3):1781--1792, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates possibilities to find a low-dimensional, formant-related physical representation of speech signals, which is suitable for automatic speech recognition. This aim is motivated by the fact that formants are known to be discriminant features for speech recognition. Combinations of automatically extracted formant-like features and state-of-the-art, noise-robust features have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formant-like features, i.e. robust formants and HMM2 features, and to compare these features to hand-labeled formants as well as to mel-frequency cepstral coefficients in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in [Hillenbrand et al., J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data as well as in (simulated) adverse conditions. In combination with standard automatic speech recognition methods, the classification performance of the robust formant and HMM2 features compare very well to the performance of the hand-labeled formants.
[10] C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 621--624, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.
[11] C. Dimitrakakis and S. Bengio.
Online policy adaptation for ensemble classifiers.
In European Symposium on Artificial Neural Networks, ESANN, 2004.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.
[12] M. Magimai Doss, S. Bengio, and H. Bourlard.
Joint decoding for phoneme-grapheme continuous speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 177--180, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Standard ASR systems typically use phoneme as the subword units. Preliminary studies have shown that the performance of the ASR system could be improved by using grapheme as additional subword units. In this paper, we investigate such a system where the word models are defined in terms of two different subword units, i.e., phoneme and grapheme. During training, models for both the subword units are trained, and then during recognition either both or just one subword unit is used. We have studied this system for a continuous speech recognition task in American English language. Our studies show that grapheme information used along with phoneme information improves the performance of ASR.
[13] M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation.
In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Automatic Text Processing tasks, documents are usually represented in the bag-of-word space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.
[14] I. McCowan, D. Gatica-Perez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions.
In Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 56--75. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.
[15] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, F. Cardinaux, S. Marcel, S. Bengio, C. Sanderson, N. Poh, Y. Rodriguez, J. Czyz, L. Vandendorpe, C. McCool, S. Lowther, S. Sridharan, V. Chandran, R. Paredes, E. Vidal, L. Bai, L. Shen, Y. Wang, C. Yueh-Hsuan, L. Hsien-Chang, H. Yi-Ping, A. Heinrichs, M. Muller, A. Tewes, C. von der Malsburg, R. Wurtz, Z. Wang, F. Xue, Y. Ma, Q. Yang, C. Fang, X. Ding, S. Lucey, R. Goss, and H. Schneiderman.
Face authentication test on the BANCA database.
In International Conference on Pattern Recognition, ICPR, volume 4, pages 523--532, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper details the results of a Face Authentication Test (FAT2004) held in conjunction with the 17th International Conference on Pattern Recognition. The contest was held on the publicly available BANCA database according to a defined protocol. The competition also had a sequestered part in which institutions had to submit their algorithms for independent testing. 13 different verification algorithms from 10 institutions submitted results. Also, a standard set of face recognition software packages from the Internet were used to provide a baseline performance measure.
[16] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, S. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, N. Poh, Y. Rodriguez, K. Kryszczuk, J. Czyz, L. Vandendorpe, J. Ng, H. Cheung, and B. Tang.
Face authentication competition on the BANCA database.
In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 8--15. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper details the results of a face verification competition held in conjunction with the First International Conference on Biometric Authe ntication. The contest was held on the publically available BANCA database according to a defined protocol. Six different verification algorithms from 4 academic and commercial institutions submitted results. Also, a standard set of face recognition software from the internet was used to provide a baseline performance measure.
[17] N. Poh and S. Bengio.
Noise-robust multi-stream fusion for text-independent speaker authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-stream approaches have proven to be very successful in speech recognition tasks and to a certain extent in speaker authentication tasks. In this study we propose a noise-robust multi-stream text-independent speaker authentication system. This system has two steps: first train the stream experts under clean conditions and then train the combination mechanism to merge the scores of the stream experts under both clean and noisy conditions. The idea here is to take advantage of the rather predictable reliability and diversity of streams under different conditions. Hence, noise-robustness is mainly due to the combination mechanism. This two-step approach offers several practical advantages: the stream experts can be trained in parallel (e.g., by using several machines); heterogeneous types of features can be used and the resultant system can be robust to different noise types (wide bands or narrow bands) as compared to sub-streams. An important finding is that a trade-off is often necessary between the overall good performance under all conditions (clean and noisy) and good performance under clean conditions. To reconcile this trade-off, we propose to give more emphasis or prior to clean conditions, thus, resulting in a combination mechanism that does not deteriorate under clean conditions (as compared to the best stream) yet is robust to noisy conditions.
[18] N. Poh and S. Bengio.
Towards predicting optimal subsets of base classifiers in biometric authentication tasks.
In S. Bengio and H. Bourlard, editors, Machine Learning for Multimodal Interactions: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 159--172. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extend in various real-life applications. However, combining too many systems (base-experts) will also increase both hardware and computation costs. One way to selecting a subset of optimal base-experts out of N is to carry out the experiments explicitly. There are 2N-1 possible combinations. In this paper, we propose an analytical solution to this task when weighted sum fusion mechanism is used. The proposed approach is at least valid in the domain of person authentication. It has a complexity that is additive between the number of examples and the number of possible combinations while the conventional approach, using brute-force experimenting, is multiplicative between these two terms. Hence, our approach will scale better with large fusion problems. Experiments on the BANCA multi-modal database verified our approach. While we will consider here fusion in the context of identity verification via biometrics, or simply biometric authentication, it can also have an important impact in meetings because this a priori information can assist in retrieving highlights in meeting analysis as in “who said what”. Furthermore, automatic meeting analysis also requires many systems working together and involves possibly many audio-visual media streams. Development in fusion of identity verification will provide insights into how fusion in meetings can be done. The ability to predict fusion performance is another important step towards understanding the fusion problem.
[19] N. Poh and S. Bengio.
Why do multi-stream, multi-band and multi-modal approaches work on biometric user authentication tasks?
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 893--896, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-band, multi-stream and multi-modal approaches have proven to be very successful both in experiments and in real-life applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of a theoretical study to justify why and how they work, when one combines the streams at the feature or classifier score levels. In this paper, we attempt to cast a light onto the latter subject. While there exists literature discussing this aspect, a study on the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, Multi-Layer-Perceptrons and Support Vector Machines always perform better than the average performance of the underlying experts. Furthermore, in practice, most combined experts using the methods mentioned above perform better than the best underlying expert.
[20] N. Poh, C. Sanderson, and S. Bengio.
Spectral subband centroids as complementary features for speaker authentication.
In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 631--639. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Most conventional features used in speaker authentication are based on estimation of spectral envelopes in one way or another, e.g., Mel-scale Filterbank Cepstrum Coefficients (MFCCs), Linear-scale Filterbank Cepstrum Coefficients (LFCCs) and Relative Spectral Perceptual Linear Prediction (RASTA-PLP). In this study, Spectral Subband Centroids (SSCs) are examined. These features are the centroid frequency in each subband. They have properties similar to formant frequencies but are limited to a given subband.Empirical experiments carried out on the NIST2001 database using SSCs, MFCCs, LFCCs and their combinations by concatenation suggest that SSCs are somewhat more robust compared to conventional MFCC and LFCC features as well as being partially complementary.
[21] A. Pozdnoukhov and S. Bengio.
Tangent vector kernels for invariant image classification with SVMs.
In International Conference on Pattern Recognition, ICPR, volume 3, pages 486--489, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents an application of the general sample-to-object approach to the problem of invariant image classification. The approach results in defining new SVM kernels based on tangent vectors that take into account prior information on known invariances. Real data of face images are used for experiments. The presented approach integrates virtual sample and tangent distance methods. We observe a significant increase in performance with respect to standard approaches. The experiments also illustrate (as expected) that prior knowledge becomes more important as the amount of training data decreases.
[22] Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Estimating the quality of face localization for face verification.
In IEEE International Conference on Image Processing, ICIP, pages 581--584, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Face localization is the process of finding the exact position of a face in a given image. This can be useful in several applications such as face tracking or person authentication. The purpose of this paper is to show that the error made during the localization process may have different impacts depending on the final application. Hence in order to evaluate the performance of a face localization algorithm, we propose to embed the final application (here face verification) into the performance measuring process. Moreover, in this paper, we estimate this embedding using either a multilayer perceptron or a K nearest neighbor algorithm in order to speedup the evaluation process. We show on the BANCA database that our proposed measure best matches the final verification results when comparing several localization algorithms, on various performance measures currently used in face localization.
[23] C. Sanderson and S. Bengio.
Extrapolating single view face models for multi-view recognition.
In International Conference on Intelligente Sensors, Sensor Networks and Information Processings, ISSNIP, pages 581--586, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Performance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when there is only one training image available. We address this problem by extending each statistical frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors.
[24] C. Sanderson and S. Bengio.
Statistical transformations of frontal models for non-frontal face verification.
In IEEE International Conference on Image Processing, ICIP, pages 585--588, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In the framework of a face verification system using local features and a Gaussian Mixture Model based classifier, we address the problem of non-frontal face verification (when only a single (frontal) training image is available) by extending each client's frontal face model with artificially synthesized models for non-frontal views. Furthermore, we propose the Maximum Likelihood Shift (MLS) synthesis technique and compare its performance against a Maximum Likelihood Linear Regression (MLLR) based technique (originally developed for adapting speech recognition systems) and the recently proposed "difference between two Universal Background Models" (UBMdiff) technique. All techniques rely on prior information and learn how a generic face model for the frontal view is related to generic models at non-frontal views. Experiments on the FERET database suggest that that the proposed MLS technique is more suitable than MLLR (due to a lower number of free parameters) and UBMdiff (due to lack of heuristics). The results further suggest that extending frontal models considerably reduces errors.
[25] A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of unconstrained handwritten texts using HMMs and statistical language models.
IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI, 26(6):709--720, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, error rate is reduced by ~50% for single writer data and by ~25% for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.
[26] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud.
Modeling individual and group actions in meetings: a two-layer hmm framework.
In IEEE Workshop on Event Mining at the Conference on Computer Vision and Pattern Recognition, CVPR, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that between has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.
[27] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud.
Multimodal group action clustering in meetings.
In ACM Multimedia Workshop on Video Surveillance and Sensor Networks, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of clustering multimodal group actions in meetings using a two-layer HMM framework. Meetings are structured as sequences of group actions. Our approach aims at creating one cluster for each group action, where the number of group actions and the action boundaries are unknown a priori. In our framework, the first layer models typical actions of individuals in meetings using supervised HMM learning and low-level audio-visual features. A number of options that explicitly model certain aspects of the data (e.g., asynchrony) were considered. The second layer models the group actions using unsupervised HMM learning. The two layers are linked by a set of probability-based features produced by the individual action layer as input to the group action layer. The methodology was assessed on a set of multimodal turn-taking group actions, using a public five-hour meeting corpus. The results show that the use of multiple modalities and the layered framework are advantageous, compared to various baseline methods.

2003

[1] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran.
The BANCA database and evaluation protocol.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625--638. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we describe the acquistion and content of a new large, realistic and challenging multi-modal database intended for training and testing multi-modal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios, controlled, degraded and adverse over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.
[2] M. Barnard, J.-M. Odobez, and S. Bengio.
Multi-modal audio-visual event recognition for football analysis.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 469--478, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The recognition of events within multi-modal data is a challenging problem. In this paper we focus on the recognition of events by using both audio and video data. We investigate the use of data fusion techniques in order to recognise these sequences within the framework of Hidden Markov Models (HMM) used to model audio and video data sequences. Specifically we look at the recognition of play and break sequences in football and the segmentation of football games based on these two events. Recognising relatively simple semantic events such as this is an important step towards full automatic indexing of such video material. These experiments were done using approximately 3 hours of data from two games of the Euro96 competition. We propose that modelling the audio and video streams separately for each sequence and fusing the decisions from each stream should yield an accurate and robust method of segmenting multi-modal data.
[3] S. Bengio.
An asynchronous hidden markov model for audio-visual speech recognition.
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15, pages 1237--1244. MIT Press, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions.
[4] S. Bengio.
Multimodal authentication using asynchronous HMMs.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770--777. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audio-visual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other state-of-the-art techniques.
[5] H. Bourlard, S. Bengio, and K. Weber.
Towards robust and adaptive speech recognition models.
In M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld, editors, Mathematical Foundations of Speech and Language Processing, Institute for Mathematics and its Applications (IMA) Series, Volume 138, pages 169--189. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches, which somewhat deviate from the usual ASR approaches but which have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multi-style” training. More specifically, we will motivate and briefly describe new approaches based on multi-stream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.
[6] R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures.
International Journal on Pattern Recognition and Artificial Intelligence, IJPRAI, 17(3):349--365, 2003.
.ps.gz | .pdf | .djvu | weblink | abstract]
A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.
[7] J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audio-visual person identity verification.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752--760. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system conciliates the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited like in a smart card. Our experimental results show that the multimodal fusion allows to reduce significantly the user template size while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.
[8] M. Magimai Doss, T. A. Stephenson, H. Bourlard, and S. Bengio.
Phoneme-grapheme based speech recognition system.
In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU, pages 94--98, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
State-of-the-art ASR systems typically use phoneme as the subword units. In this paper, we investigate a system where the word models are defined in-terms of two different subword units, i.e., phonemes and graphemes. We train models for both the subword units, and then perform decoding using either both or just one subword unit. We have studied this system for American English language where there is weak correspondence between the grapheme and phoneme. The results from our studies show that there is good potential in using grapheme as auxiliary subword units.
[9] D. Gatica-Perez, I. McCowan, M. Barnard, S. Bengio, and H. Bourlard.
On automatic annotation of meeting databases.
In IEEE International Conference on Image Processing, ICIP, volume 3, pages 629--632, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we discuss meetings as an application domain for multimedia content analysis. Meeting databases are a rich data source suitable for a variety of audio, visual and multi-modal tasks, including speech recognition, people and action recognition, and information retrieval. We specifically focus on the task of semantic annotation of audio-visual (AV) events, where annotation consists of assigning labels (event names) to the data. In order to develop an automatic annotation system in a principled manner, it is essential to have a well-defined task, a standard corpus and an objective performance measure. In this work we address each of these issues to automatically annotate events based on participant interactions.
[10] N. Gilardi and S. Bengio.
Comparison of four machine learning algorithms for spatial data analysis.
In G. Dubois, J. Malczewski, and M. DeCort, editors, Mapping radioactivity in the environment - Spatial Interpolation Comparison 97, pages 222--237. Office for Official Publications of the European Communities, Luxembourg, 2003.
.ps.gz | .pdf | .djvu | abstract]
This chapter proposes a clear methodology on how to use machine learning algorithms for spatial data analysis in order to avoid any bias and eventually obtain fair estimation of their performance on new data. Four different machine learning algorithms are presented, namely multilayer perceptrons (MLP), mixture of experts (ME), support vector regression (SVR) and a local version of the latter (local SVR). Evaluation criteria adapted to geostatistical problems are also presented in order to compare adequately different models on the same dataset. Finally, an experimental comparison is given on the SIC97 dataset as well as an analysis of the results.
[11] Q. Le and S. Bengio.
Client dependent GMM-SVM models for speaker verification.
In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443--451. Springer Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability in handling variable size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A cross-validation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.
[12] I. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D. Moore, P. Wellner, and H. Bourlard.
Modeling human interaction in meetings.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 4, pages 748--751, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings by modeling the joint behaviour of participants. Many meeting actions, such as presentations, discussions and consensus, are characterised by similar or complementary behaviour across participants. Recognising these meaningful actions is an important step towards the goal of providing effective browsing and summarisation of processed meetings. In this work, a corpus of meetings was collected in a room equipped with a number of microphones and cameras. The corpus was labeled in terms of a predefined set of meeting actions characterised by global behaviour. In experiments, audio and visual features for each participant are extracted from the raw data and the interaction of participants is modeled using HMM-based approaches. Initial results on the corpus demonstrate the ability of the system to recognise the set of meeting actions.
[13] I. McCowan, D. Gatica-Perez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions.
In Ambient Intelligence, Lecture Notes in Computer Science, volume LNCS 2875, pages 235--251, Eindhoven, 2003. Springer-Verlag.
.ps.gz | .pdf | .djvu | weblink | abstract]
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.
[14] K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar, F. Deravi, and N. Mavity.
Face verification competition on the XM2VTS database.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 964--974. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink ]
[15] N. Poh and S. Bengio.
Non-linear variance reduction techniques in biometric authentication.
In IEEE Multimodal User Authentication Workshop, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple synthetic samples, different extractors (features) and biometric modalities. The scores are combined using the average operator, Multi-Layer Perceptron and Support Vector Machines. It is found empirically that VR via modalities is the best technique, followed by VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.
[16] N. Poh, S. Marcel, and S. Bengio.
Improving face authentication using virtual samples.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 3, pages 233--236, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we present a simple yet effective way to improve a face verification system by generating multiple virtual samples from the unique image corresponding to an access request. These images are generated using simple geometric transformations. This method is often used during training to improve accuracy of a neural network model by making it robust against minor translation, scale and orientation change. The main contribution of this paper is to introduce such method during testing. By generating N images from one single image and propagating them to a trained network model, one obtains N scores. By merging these scores using a simple mean operator, we show that the variance of merged scores is decreased by a factor between 1 and N. An experiment is carried out on the XM2VTS database which achieves new state-of-the-art performances.
[17] C. Sanderson and S. Bengio.
Augmenting frontal face models for non-frontal verification.
In IEEE Multimodal User Authentication Workshop, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this work we propose to address the problem of non-frontal face verification when only a frontal training image is available (e.g. a passport photograph) by augmenting a client's frontal face model with artificially synthesized models for non-frontal views. In the framework of a Gaussian Mixture Model (GMM) based classifier, two techniques are proposed for the synthesis: UBMdiff and LinReg. Both techniques rely on a priori information and learn how face models for the frontal view are related to face models at a non-frontal view. The synthesis and augmentation approach is evaluated by applying it to two face verification systems: Principal Component Analysis (PCA) based and DCTmod2 (Sanderson et al, 2003) based; the two systems are a representation of holistic and non-holistic approaches, respectively. Results from experiments on the FERET database suggest that in almost all cases, frontal model augmentation has beneficial effects for both systems; they also suggest that the LinReg technique (which is based on multivariate regression of classifier parameters) is more suited to the PCA based system and that the UBMdiff technique (which is based on differences between two general face models) is more suited to the DCTmod2 based system. The results also support the view that the standard DCTmod2/GMM system (trained on frontal faces) is less affected by out-of-plane rotations than the corresponding PCA/GMM system;moreover, the DCTmod2/GMM system using augmented models is, in almost all cases, more robust than the corresponding PCA/GMM system.
[18] C. Sanderson and S. Bengio.
Robust features for frontal authentication in difficult image conditions.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 495--504. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper we extend the recently proposed DCT-mod2 feature extraction technique (which utilizes polynomial coefficients derived from 2D DCT coefficients obtained from horizontally & vertically neighbouring blocks) via the use of various windows and diagonally neighbouring blocks. We also propose enhanced PCA, where traditional PCA feature extraction is combined with DCT-mod2. Results using test images corrupted by a linear and a non-linear illumination change, white Gaussian noise and compression artefacts, show that use of diagonally neighbouring blocks and windowing is detrimental to robustness against illumination changes while being useful for increasing robustness against white noise and compression artefacts. We also show that the enhanced PCA technique retains all the positive aspects of traditional PCA (that is, robustness against white noise and compression artefacts) while also being robust to illumination changes; moreover, enhanced PCA outperforms PCA with histogram equalisation pre-processing.
[19] C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F. BenZeghiba, F. Cardinaux, and S. Marcel.
Speech & face based biometric authentication at IDIAP.
In International Conference on Multimedia and Expo, ICME, volume 3, pages 1--4, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We present an overview of recent research at IDIAP on speech & face based biometric authentication. This paper covers user-customised passwords, adaptation techniques, confidence measures (for use in fusion of audio & visual scores), face verification in difficult image conditions, as well as other related research issues. We also overview the open source Torch library, which has aided in the implementation of the above mentioned techniques.
[20] A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of large vocabulary cursive handwritten text.
In International Conference on Document Analysis and Recognition, ICDAR, pages 1101--1105, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a system for the offline recognition of cursive handwritten lines of text. The system is based on continuous density HMMs and Statistical Language Models. The system recognizes data produced by a single writer. No a-priori knowledge is used about the content of the text to be recognized. Changes in the experimental setup with respect to the recognition of single words are highlighted. The results show a recognition rate of ~85% with a lexicon containing 50'000 words. The experiments were performed over a publicly available database.
[21] K. Weber, S. Ikbal, S. Bengio, and H. Bourlard.
Robust speech recognition and feature extraction using HMM2.
Computer, Speech and Language, 17(2-3):195--211, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents the theoretical basis and preliminary experimental results of a new HMM model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state specific, HMMs working in the acoustic feature space. Thus, while the primary HMM is performing the usual time warping and integration, the secondary HMMs are responsible for extracting/modeling the possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as a more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it will be shown that HMM2 can be used to extract noise robust features, supposed to correspond to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database.

2002

[1] S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification.
Information Fusion, 3(4):267--276, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures did improve statistically significantly the performance, while other measures produced reliable confidence levels over the fusion decisions.
[2] H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, editors.
Proceedings of the Twelfth IEEE Workshop on Neural Networks for Signal Processing (NNSP). IEEE Press, 2002.
[3] H. Bourlard and S. Bengio.
Hidden markov models and other finite state automata for sequence processing.
In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, Second Edition. The MIT Press, 2002.
.ps.gz | .pdf | .djvu | idiap-RR ]
[4] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
Neural Computation, 14(5):1105--1114, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.
[5] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633--640. MIT Press, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest.
[6] R. Collobert, S. Bengio, and J. Mariéthoz.
Torch: a modular machine learning software library.
IDIAP RR, 02-46, 2002.
.ps.gz | .pdf | .djvu | abstract]
Many scientific communities have expressed a growing interest in machine learning algorithms recently, mainly due to the generally good results they provide, compared to traditional statistical or AI approaches. However, these machine learning algorithms are often complex to implement and to use properly and efficiently. We thus present in this paper a new machine learning software library in which most state-of-the-art algorithms have already been implemented and are available in a unified framework, in order for scientists to be able to use them, compare them, and even extend them. More interestingly, this library is freely available under a BSD license and can be retrieved on the web by everyone.
[7] R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures.
In S. Lee and A. Verri, editors, International Workshop on Pattern Recognition with Support Vector Machines, SVM, Lecture Notes in Computer Science, volume LNCS 2388, pages 8--23. Springer-Verlag, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.
[8] N. Gilardi, S. Bengio, and M. Kanevski.
Conditional gaussian mixture models for environmental risk mapping.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 777--786, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper proposes the use of Gaussian Mixture Models to estimate conditional probability density functions in an environmental risk mapping context. A conditional Gaussian Mixture Model has been compared to the geostatistical method of Sequential Gaussian Simulations and shows good performances in reconstructing local PDF. The data sets used for this comparison are parts of the digital elevation model of Switzerland.
[9] S. Marcel and S. Bengio.
Improving face verification using skin color information.
In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 2, pages 11--15. IEEE Computer Society Press, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. State-of-the-art methods often use the gray-scale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance.
[10] S. Marcel, C. Marcel, and S. Bengio.
A state-of-the-art neural network for robust face verification.
In COST275 Workshop on the advent of Biometrics on the Internet, 2002.
.ps.gz | .pdf | .djvu | abstract]
The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. State-of-the-art methods often use the gray-scale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance and that the proposed model achieves robust state-of-the-art results.
[11] J. Mariéthoz and S. Bengio.
A comparative study of adaptation methods for speaker verification.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Real-life speaker verification systems are often implemented using client model adaptation methods, since the amount of data available for each client is often too low to consider plain Maximum Likelihood methods. While the Bayesian Maximum A Posteriori (MAP) adaptation method is commonly used in speaker verification, other methods have proven to be successful in related domains such as speech recognition. This paper proposes an experimental comparison between three well-known adaptation methods, namely MAP, Maximum Likelihood Linear Regression, and finally EigenVoices. All three methods are compared to the more classical Maximum Likelihood method, and results are given for a subset of the 1999 NIST Speaker Recognition Evaluation database.
[12] N. Poh, S. Bengio, and J. Korczak.
A multi-sample multi-source model for biometric authentication.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 375--384, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this study, two techniques that can improve the authentication process are examined: (i) multiple samples and (ii) multiple biometric sources. We propose the fusion of multiple samples obtained from multiple biometric sources at the score level. By using the average operator, both the theoretical and empirical results show that integrating as many samples and as many biometric sources as possible can improve the overall reliability of the system. This strategy is called multi-sample multi-source approach. This strategy was tested on a real-life database using neural networks trained in one-versus-all configuration.
[13] A. Vinciarelli and S. Bengio.
Offline cursive word recognition using continuous density hidden markov models trained with PCA or ICA features.
In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 3, pages 81--84. IEEE Computer Society Press, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This work presents an Offline Cursive Word Recognition System dealing with single writer samples. The system was a continuous density hiddden Markov model trained using either the raw data, or data transformed using Principal Component Analysis or Independent Component Analysis. Both techniques significantly improved the recognition rate of the system. Preprocessing, normalization and feature extraction are described in detail as well as the training technique adopted. Several experiments were performed using a publicly available database. The accuracy obtained is the highest presented in the literature over the same data.
[14] A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based off-line cursive script recognition.
In Proceedings of the 8th International Conference on Frontiers in Handwriting Recognition, pages 287--291, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work presents the application of HMM adaptation techniques to the problem of Off-Line Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when less than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.
[15] A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based off-line cursive script recognition.
Pattern Recognition Letters, 23(8):905--916, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This work presents the application of HMM adaptation techniques to the problem of Off-Line Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when less than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.
[16] K. Weber, S. Bengio, and H. Bourlard.
Increasing speech recognition noise robustness with HMM2.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 929--932, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
The purpose of this paper is to investigate the behavior of HMM2 models for the recognition of noisy speech. It has previously been shown that HMM2 is able to model dynamically important structural information inherent in the speech signal, often corresponding to formant positions/tracks. As formant regions are known to be robust in adverse conditions, HMM2 seems particularly promising for improving speech recognition robustness. Here, we review different variants of the HMM2 approach with respect to their application to noise-robust automatic speech recognition. It is shown that HMM2 has the potential to tackle the problem of mismatch between training and testing conditions, and that a multi-stream combination of (already noise-robust) cepstral features and formant-like features (extracted by HMM2) improves the noise robustness of a state-of-the-art automatic speech recognition system.
[17] K. Weber, F. de Wet, B. Cranen, L. Boves, S. Bengio, and H. Bourlard.
Evaluation of formant-like features for ASR.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002.
.ps.gz | .pdf | .djvu | abstract]
This paper investigates possibilities to automatically find a low-dimensional, formant-related physical representation of the speech signal, which is suitable for automatic speech recognition (ASR). This aim is motivated by the fact that formants have been shown to be discriminant features for ASR. Combinations of automatically extracted formant-like features and `conventional', noise-robust, state-of-the-art features (such as MFCCs including spectral subtraction and cepstral mean subtraction) have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formant-like features, and to compare these features to hand-labeled formant tracks as well as to standard MFCCs in terms of their performance on a vowel classification task

2001

[1] S. Bengio and J. Mariéthoz.
Learning the decision function for speaker verification.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 425--428, 2001.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper explores the possibility to replace the usual thresholding decision rule of log likelihood ratios used in speaker verification systems by more complex and discriminant decision functions based for instance on Linear Regression models or Support Vector Machines. Current speaker verification systems, based on generative models such as HMMs or GMMs, can indeed easily be adapted to use such decision functions. Experiments on both text dependent and text independent tasks always yielded performance improvements and sometimes significantly.
[2] H. Bourlard, S. Bengio, and K. Weber.
New approaches towards robust and adaptive speech recognition.
In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, NIPS 13, pages 751--757. MIT Press, 2001.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.
[3] R. Collobert and S. Bengio.
SVMTorch: Support vector machines for large-scale regression problems.
Journal of Machine Learning Research, JMLR, 1:143--160, 2001.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l square memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for large-scale regression problems from Flake and Lawrence (2000) yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.
[4] J.-L. DesGranges, P. Agin, and S. Bengio.
The use of predictive models of breeding bird assemblages for assessing and monitoring forest bird diversity.
In A. Franc, O. Laroussinie, and T. Karjalainen, editors, Criteria and Indicators for Sustainable Forest Management at the Forest Management Unit Level, volume 38, pages 181--200. European Forest Institute Proceedings, 2001.
[5] K. Weber, S. Bengio, and H. Bourlard.
HMM2- extraction of formant features and their use for robust ASR.
In Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 2001.
.ps.gz | .pdf | .djvu | abstract]
As recently introduced, an HMM2 can be considered as a particular case of an HMM mixture in which the HMM emission probabilities (usually estimated through Gaussian mixtures or an artificial neural network) are modeled by state-dependent, feature-based HMM (referred to as frequency HMM). A general EM training algorithm for such a structure has already been developed. Although there are numerous motivations for using such a structure, and many possible ways to exploit it, this paper will mainly focus on one particular instantiation of HMM2 in which the frequency HMM will be used to extract formant structure information, which will then be used as additional acoustic features in a standard Automatic Speech Recognition (ASR) system. While the fact that this architecture is able to automatically extract meaningful formant information is interesting by itself, empirical results will also show the robustness of these features to noise, and their potential to enhance state-of-the-art noise-robust HMM-based ASR.
[6] K. Weber, S. Bengio, and H. Bourlard.
Speech recognition using advanced HMM2 features.
In Proceedings of the Automatic Speech Recognition and Understanding Workshop, ASRU, pages 65--68, 2001.
.ps.gz | .pdf | .djvu | weblink | abstract]
HMM2 is a particular hidden Markov model where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) state-dependent frequency-based HMMs. As shown previously, a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach towards using a full HMM2 as feature extractor, working in the spectral domain, and extracting robust formant-like features for standard ASR system. HMM2 performs a nonlinear, state-dependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To further improve the HMM2 system, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where word error rates of 15%, using only a 4-dimensional feature vector (3 formant-like parameters and one time index) were obtained.

2000

[1] S. Bengio and Y. Bengio.
Taking on the curse of dimensionality in joint distributions using neural networks.
IEEE Transaction on Neural Networks, special issue on data mining and knowledge discovery, 11(3):550--557, 2000.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables (thus reducing significantly the number of parameters). Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.
[2] Y. Bengio and S. Bengio.
Modeling high-dimensional discrete data with multi-layer neural networks.
In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, NIPS 12, pages 400--406. MIT Press, 2000.
.ps.gz | .pdf | .djvu | weblink | abstract]
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.
[3] N. Gilardi and S. Bengio.
Local machine learning models for spatial data analysis.
Journal of Geographic Information and Decision Analysis, 4(1):11--28, 2000.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper, we compare different machine learning algorithms applied to non stationary spatial data analysis. We show that models taking into account local variability of the data are better than models which are trained globally on the whole dataset. Two global models (Support Vector Regression and Multilayer Perceptrons) and two local models (a local version of Support Vector Regression and Mixture of Experts) were compared over the Spatial Interpolation Comparison 97 (SIC97) dataset, and the results are presented and compared to previous results obtained on the same dataset.
[4] T. A. Stephenson, H. Bourlard, S. Bengio, and A. C. Morris.
Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables.
In Proceedings of the International Conference on Speech and Language Processing, ICSLP, Beijing, China, October 2000.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Current technology for automatic speech recognition (ASR) uses hidden Markov models (HMMs) that recognize spoken speech using the acoustic signal. However, no use is made of the causes of the acoustic signal: the articulators. We present here a dynamic Bayesian network (DBN) model that utilizes an additional variable for representing the state of the articulators. A particular strength of the system is that, while it uses measured articulatory data during its training, it does not need to know these values during recognition. As Bayesian networks are not used often in the speech community, we give an introduction to them. After describing how they can be used in ASR, we present a system to do isolated word recognition using articulatory information. Recognition results are given, showing that a system with both acoustics and inferred articulatory positions performs better than a system with only acoustics.
[5] K. Weber, S. Bengio, and H. Bourlard.
HMM2- a novel approach to HMM emission probability estimation.
In Proceedings of the International Conference on Speech and Language Processing, ICSLP, Beijing, China, October 2000.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we discuss and investigate a new method to estimate local emission probabilities in the framework of hidden Markov models (HMM). Each feature vector is considered to be a sequence and is supposed to be modeled by yet another HMM. Therefore, we call this approach `HMM2'. There is a variety of possible topologies of such HMM2 systems, e.g. incorporating trellis or ergodic HMM structures. Preliminary HMM2 speech recognition experiments on cepstral and spectral features yielded worse results than state-of-the-art systems. However, we believe that HMM2 systems have a lot of potential advantages and are therefore worth investigating further.

Before 2000

[1] S. Bengio.
Intégration des systèmes tutoriels traditionnels et des systèmes tutoriels intelligents.
Master's thesis, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 1989.
[2] S. Bengio.
Optimisation d'une règle d'apprentissage pour réseaux de neurones artificiels.
PhD thesis, Département d'Informatique et Recherche Opérationnelle. Université de Montréal, 1993.
.ps.gz | .pdf | .djvu ]
[3] S. Bengio and Y. Bengio.
An EM algorithm for asynchronous input/output hidden markov models.
In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996.
.ps.gz | .pdf | .djvu | abstract]
In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend here this model to the case of asynchronous sequences, and show an Expectation-Maximization algorithm for training such models.
[4] S. Bengio, Y. Bengio, and J. Cloutier.
Use of genetic programming for the search of a new learning rule for neural networks.
In Proceedings of the First Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, volume 1, pages 324--327, 1994.
.ps.gz | .pdf | .djvu | weblink ]
[5] S. Bengio, Y. Bengio, and J. Cloutier.
On the search for new learning rules for ANNs.
Neural Processing Letters, 2(4):26--30, 1995.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper, we present a framework where a learning rule can be optimized within a parametric learning rule space. We define what we callparametric learning rules and present a theoretical study of theirgeneralization properties when estimated from a set of learning tasks and tested over another set of tasks. We corroborate the results of this study with practical experiments.
[6] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
Aspects théoriques de l'optimisation d'une règle d'apprentissage.
In Actes de la conférence Neuro-Nîmes 1992, Nîmes, France, 1992.
.ps.gz | .pdf | .djvu ]
[7] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
On the optimization of a synaptic learning rule.
In Conference on Optimality in Biological and Artificial Networks, Dallas, USA, 1992.
.ps.gz | .pdf | .djvu ]
[8] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
Generalization of a parametric learning rule.
In S. Gielen and B. Kappen, editors, Proceedings of the International Conference on Artificial Neural Networks, ICANN'93, pages 502--502, Amsterdam, Nederlands, 1993. Springer-Verlag.
.ps.gz | .pdf | .djvu ]
[9] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
On the optimization of a synaptic learning rule.
In D. S. Levine and W. R. Elsberry, editors, Optimality in Biological and Artificial Networks?, pages 265--287. Lawrence Erlbaum Associates, 1997.
.ps.gz | .pdf | .djvu ]
[10] S. Bengio, Y. Bengio, J. Robert, and G. Bélanger.
Stochastic learning of strategic equilibria for auctions.
Neural Computation, 11(5):1199--1209, 1999.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a new application of stochastic adaptive learning algorithms to the computation of strategic equilibria in auctions. The proposed approach addresses the problems of tracking a moving target and balancing exploration (of action space) versus exploitation (of better modeled regions of action space). Neural networks are used to represent a stochastic decision model for each bidder. Experiments confirm the correctness and usefulness of the approach.
[11] S. Bengio, G. Brassard, Y. Desmedt, C. Goutier, and J.-J. Quisquater.
Secure implementation of identification systems.
Journal of Cryptology, 4(3):175--183, 1991.
weblink | abstract]
In this paper we demonstrate that widely known identification systems, such as the public-file-based Feige-Fiat-Shamir scheme, can be insecure if proper care is not taken with their implementation. We suggest possible solutions. On the other hand, identity-based versions of the Feige-Fiat-Shamir scheme are conceptually more complicated than necessary.
[12] S. Bengio, F. Clerot, A. Gravey, and D. Collobert.
Dynamical resource reservation schemes in an ATM network using neural network-based traffic prediction.
In D. D. Kouvatsos, editor, Proceedings of the IFIP Fifth International Workshop on Performance Modelling and Evaluation of ATM Networks. Kluwer B. V., 1997.
.ps.gz | .pdf | .djvu | weblink | abstract]
Using real traffic data, we show that neural network-based prediction techniques can be used to predict the queuing behaviour of highly bursty traffics typical of LAN interconnection in a way accurate enough so as to allow dynamical renegotiation of a DBR traffic contract at the edge of an ATM network. The performances of predictor-based in service renegotiation are evaluated in terms of renegotiation errors and reserved bandwidth for the the DBR traffic handling capability and are shown to be very encouraging for the use of connectionist prediction techniques for the management of bursty traffics in ATM networks.
[13] S. Bengio, F. Fessant, and D. Collobert.
A connectionist system for medium-term horizon time series prediction.
In International Workshop on Applications of Neural Networks to Telecommunications, IWANNT, Stockholm, Sweden, 1995.
.ps.gz | .pdf | .djvu ]
[14] S. Bengio, F. Fessant, and D. Collobert.
Use of modular architectures for time series prediction.
Neural Processing Letters, 3(2):101--106, 1996.
.ps.gz | .pdf | .djvu | weblink ]
[15] S. Bengio, C. Frasson, and J. Gecsei.
Integrating traditional and intelligent computerized tutoring.
In Fourth International Symposium on Computer and Information Sciences, Cesme, Turkey, 1989.
[16] S. Bengio, C. Frasson, and J. Gecsei.
Utilisation de systèmes d'EAO dans des systèmes d'EIAO.
In Sixième symposium canadien sur la technologie pédagogique, Halifax, Canada, 1989.
[17] Y. Bengio and S. Bengio.
Training asynchronous input/output hidden markov models.
In AAAI Spring Symposium on Computational Issues in Learning Models of Dynamical Systems, 1996.
.ps.gz | .pdf | .djvu ]
[18] Y. Bengio, S. Bengio, and J. Cloutier.
Learning a synaptic learning rule.
In Proceedings of the International Joint Conference on Neural Networks, IJCNN, volume 2, pages 969--974, Seattle, USA, 1991.
.ps.gz | .pdf | .djvu | weblink ]
[19] Y. Bengio, S. Bengio, J.-F. Isabelle, and Y. Singer.
Shared context probabilistic transducers.
In Advances in Neural Information Processing Systems, NIPS 10, 1998.
.ps.gz | .pdf | .djvu | weblink | abstract]
Recently, a model for supervised learning of probabilistic transducers represented by suffix trees was introduced. However, this algorithm tends to build very large trees, requiring very large amounts of computer memory. In this paper, we propose a new, more compact, transducer model in which one shares the parameters of distributions associated to contexts yielding similar conditional output distributions. We illustrate the advantages of the proposed algorithm with comparative experiments on inducing a noun phrase recognizer.
[20] Y. Bengio, S. Bengio, Y. Pouliot, and P. Agin.
A neural network to detect homologies in proteins.
In Advances in Neural Information Processing Systems, NIPS 2, San Mateo, CA, USA, 1990. Morgan Kaufmann.
.ps.gz | .pdf | .djvu | weblink ]
[21] Y. Desmedt, C. Goutier, and S. Bengio.
Special uses and abuses of the Fiat-Shamir passport protocol.
In Advances in Cryptology, Crypto, Lecture Notes in Computer Science, volume LNCS 293, pages 21--39, Santa Barbara, USA, 1988. Springer Verlag.
.ps.gz | .pdf | .djvu | weblink | abstract]
If the physical description of a person would be unique and adequately used and tested, then the security of the fiat-shamir scheme is not based on zero-knowledge. otherwise some new frauds exist. the feige-fiat-shamir scheme always suffers from these frauds. using an extended notion of subliminal channels, several other undetectable abuses of the fiat-shamir protocol, which are not possible with ordinary passports, are discussed. this technique can be used by a terrorist sponsoring country to communicate 500 new words of secret information each time a tourist passport is verified. a non-trivial solution to avoid these subliminal channel problems is presented. the notion of relative zero-knowledge is introduced.
[22] F. Fessant, S. Bengio, and D. Collobert.
On the prediction of solar activity using different neural network models.
Annales Geophysicae, 14:20--26, 1996.
.ps.gz | .pdf | .djvu | weblink ]
[23] J.-Y. Potvin and S. Bengio.
The vehicle routing problem with time windows - part II: Genetic search.
INFORMS Journal on Computing, 8(2):165--172, 1996.
.ps.gz | .pdf | .djvu | weblink ]