Samy Bengio - Publications

Some of the files below are copyrighted. They are provided for your convenience, yet you may download them only if you are entitled to do so by your arrangements with the various publishers.

Refereed Publications

2014

[1] S. Bengio and G. Heigold.
Word embeddings for speech recognition.
In Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech, 2014.
Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still be used to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.

[2] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam.
Large-scale object classification using label relation graphs.
In Proceedings of the European Conference on Computer Vision, ECCV, 2014.
In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.

[3] M. R. Gupta, S. Bengio, and J. Weston.
Training highly multiclass classifiers.
Journal of Machine Learning Research, JMLR, 15:1461-1492, 2014.
Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the more-confusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to one-vs-all linear SVMs and Wsabie.
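The abstract above mentions applying the adagrad method to decrease step sizes automatically. As a rough illustration only, the following minimal sketch shows the standard per-coordinate AdaGrad update on a toy one-dimensional problem. The function name `adagrad_step`, the learning rate, and the toy gradient are assumptions made for this example and are not taken from the paper.

```python
import math


def adagrad_step(weights, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; returns new (weights, accum) without mutating inputs.

    accum holds the running sum of squared gradients per coordinate, so the
    effective step size of each coordinate shrinks as gradients accumulate.
    """
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_weights = [
        w - lr * g / (math.sqrt(a) + eps)
        for w, g, a in zip(weights, grads, new_accum)
    ]
    return new_weights, new_accum


# Toy usage (hypothetical values): two updates with the same gradient.
w0, acc0 = [1.0], [0.0]
w1, acc1 = adagrad_step(w0, [2.0], acc0)
w2, acc2 = adagrad_step(w1, [2.0], acc1)
first_step = w0[0] - w1[0]
second_step = w1[0] - w2[0]
print(first_step, second_step)
```

With an identical gradient on both updates, the second step is smaller than the first, which is the automatic step-size decrease the abstract refers to.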

[4] J. Lee, S. Bengio, S. Kim, G. Lebanon, and Y. Singer.
Local collaborative ranking.
In International World Wide Web Conference, WWW, 2014.
Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Collaborative filtering approaches to recommendation systems typically assume that the rating matrix (e.g., movie ratings by viewers) is low-rank. In this paper, we examine an alternative approach in which the rating matrix is locally low-rank. Concretely, we assume that the rating matrix is low-rank within certain neighborhoods of the metric space defined by (user, item) pairs. We combine a recent approach for local low-rank approximation based on the Frobenius norm with a general empirical risk minimization for ranking losses. Our experiments indicate that the combination of a mixture of local low-rank matrices each of which was trained to minimize a ranking loss outperforms many of the currently used state-of-the-art recommendation systems. Moreover, our method is easy to parallelize, making it a viable approach for large scale real-world rank-based recommendation systems.

[5] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean.
Zero-shot learning by convex combination of semantic embeddings.
In International Conference on Learning Representations, ICLR, 2014.
Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional classification framing of image understanding, particularly in terms of the promise for zero-shot learning - the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing image classifier and a semantic word embedding model, which contains the class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of the advantages associated with more complex image embedding schemes, and indeed outperforms state of the art methods on the ImageNet zero-shot learning task.
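The convex-combination mapping described in the abstract above can be illustrated with a small sketch. It assumes a classifier that outputs per-label probabilities and a dictionary of label embedding vectors. The function names, the optional truncation to the `top_t` most probable labels, the use of cosine similarity for the nearest-label lookup, and all toy values are assumptions of this example, not details confirmed by the abstract.

```python
import math


def convex_embed(class_probs, label_embeddings, top_t=2):
    """Embed an image as a convex combination of seen-label embeddings.

    class_probs: dict mapping seen label -> classifier probability.
    label_embeddings: dict mapping seen label -> embedding (list of floats).
    """
    top = sorted(class_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_t]
    total = sum(p for _, p in top)
    dim = len(next(iter(label_embeddings.values())))
    vec = [0.0] * dim
    for label, p in top:
        weight = p / total  # weights are non-negative and sum to 1
        for i, x in enumerate(label_embeddings[label]):
            vec[i] += weight * x
    return vec


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_vec, unseen_embeddings):
    """Return the unseen label whose embedding is most similar to image_vec."""
    return max(unseen_embeddings, key=lambda name: cosine(image_vec, unseen_embeddings[name]))


# Toy data (hypothetical): three seen labels, two unseen labels, 2-D embeddings.
seen = {"lion": [1.0, 0.2], "tiger": [0.9, 0.3], "truck": [0.0, 1.0]}
unseen = {"liger": [0.95, 0.25], "bus": [0.1, 0.9]}
probs = {"lion": 0.6, "tiger": 0.3, "truck": 0.1}

image_vec = convex_embed(probs, seen, top_t=2)
print(zero_shot_predict(image_vec, unseen))
```

No parameters are trained in this sketch: the classifier probabilities and the label embeddings are both taken as given, which matches the abstract's statement that the mapping requires no additional training.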

2013

[1] S. Bengio, L. Deng, H. Larochelle, H. Lee, and R. Salakhutdinov.
Guest editors' introduction: Special section on learning deep architectures.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35:1795-1797, 2013.

[2] H. Elmlund, D. Elmlund, and S. Bengio.
PRIME: Probabilistic initial 3d model generation for single-particle cryo-electron microscopy.
Structure, 21:1299-1306, 2013.
Low-dose electron microscopy of cryo-preserved individual biomolecules (single-particle cryo-EM) is a powerful tool for obtaining information about the structure and dynamics of large macromolecular assemblies. Acquiring images with low dose reduces radiation damage, preserves atomic structural details, but results in low signal-to-noise ratio of the individual images. The projection directions of the two-dimensional images are random and unknown. The grand challenge is to achieve the precise three-dimensional (3D) alignment of many (tens of thousands to millions) noisy projection images, which may then be combined to obtain a faithful 3D map. An accurate initial 3D model is critical for obtaining the precise 3D alignment required for high-resolution (<10 Å) map reconstruction. We report a method (PRIME) that, in a single step and without prior structural knowledge, can generate an accurate initial 3D map directly from the noisy images.

[3] A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
DeViSE: A deep visual-semantic embedding model.
In Advances In Neural Information Processing Systems, NIPS, 2013.
Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources - such as text data - both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.

[4] M. Stevens, S. Bengio, and Y. Singer.
Efficient learning of sparse ranking functions.
In B. Schölkopf, Z. Luo, and V. Vovk, editors, Empirical Inference. Springer, 2013.
Algorithms for learning to rank can be inefficient when they employ risk functions that use structural information. We describe and analyze a learning algorithm that efficiently learns a ranking function using a domination loss. This loss is designed for problems in which we need to rank a small number of positive examples over a vast number of negative examples. In that context, we propose an efficient coordinate descent approach that scales linearly with the number of examples. We then present an extension that incorporates regularization thus extending Vapnik's notion of regularized empirical risk minimization to ranking learning. We also discuss an extension to the case of multi-valued feedback. Experiments performed on several benchmark datasets and a large scale Google internal dataset demonstrate the effectiveness of the learning algorithm in constructing compact models while retaining the empirical performance accuracy.

2012

[1] S. Bengio.
Large scale visual semantic extraction.
In Frontiers of Engineering - Reports on Leading-Edge Engineering from the 2011 Symposium, 2012.
Image annotation is the task of providing textual semantics to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a so-called "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts.

2011

[1] C. Dimitrakakis and S. Bengio.
Phoneme and sentence-level ensembles for speech recognition.
EURASIP Journal on Audio, Speech, and Music Processing, 2011, 2011.
We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.

[2] J. Weston, S. Bengio, and P. Hamel.
Multi-tasking with joint semantic spaces for large-scale music annotation and retrieval.
Journal of New Music Research, 40:337-348, 2011.
Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single low-dimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multi-task learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.

[3] J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation.
In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011.
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

2010

[1] S. Bengio.
Statistical machine learning for HCI.
In J.-P. Thiran, F. Marqués, and H. Bourlard, editors, Multimodal Signal Processing: Theory and Applications for Human-Computer Interaction, pages 7-23. Academic Press, 2010.
This chapter introduces the main concepts of statistical machine learning, as they are pivotal in most algorithms tailored for multimodal signal processing. In particular, the chapter will cover a general introduction to machine learning and how it is used in classification, regression and density estimation. Following this introduction, two particularly well known models will be presented, together with their associated learning algorithms: support vector machines, which are well-known for classification tasks, and hidden Markov models, which are tailored for sequence processing tasks such as speech recognition.

[2] S. Bengio, J. Weston, and D. Grangier.
Label embedding trees for large multi-class tasks.
In Advances in Neural Information Processing Systems, NIPS, 2010.
Multi-class classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree-structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally, we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including One-vs-Rest while being orders of magnitude faster.

[3] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking.
Journal of Machine Learning Research, JMLR, 11:1109-1135, 2010.
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.
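The abstract above describes a bilinear similarity trained online with a passive-aggressive update, a large margin criterion and a hinge loss. The following is a minimal sketch of one such update on a single triplet (query, relevant item, irrelevant item), written with plain lists for readability. The function names, the margin of 1, the aggressiveness parameter `C`, and the toy vectors are assumptions of this example. It is not the authors' implementation and does not exploit sparsity.

```python
def bilinear(W, p, q):
    """Bilinear similarity S_W(p, q) = p^T W q."""
    return sum(p[i] * W[i][j] * q[j] for i in range(len(p)) for j in range(len(q)))


def triplet_hinge(W, p, p_pos, p_neg):
    """Hinge loss asking that S(p, p_pos) exceed S(p, p_neg) by a margin of 1."""
    return max(0.0, 1.0 - bilinear(W, p, p_pos) + bilinear(W, p, p_neg))


def passive_aggressive_step(W, p, p_pos, p_neg, C=0.1):
    """Return an updated copy of W after one passive-aggressive step.

    If the triplet already satisfies the margin, W is returned unchanged
    (the "passive" case); otherwise W moves along V = p (p_pos - p_neg)^T
    with a step size capped by C (the "aggressive" case).
    """
    loss = triplet_hinge(W, p, p_pos, p_neg)
    diff = [a - b for a, b in zip(p_pos, p_neg)]
    V = [[pi * dj for dj in diff] for pi in p]
    norm_sq = sum(v * v for row in V for v in row)
    if loss == 0.0 or norm_sq == 0.0:
        return [row[:] for row in W]
    tau = min(C, loss / norm_sq)
    return [[W[i][j] + tau * V[i][j] for j in range(len(diff))] for i in range(len(p))]


# Toy triplet (hypothetical values), starting from the identity matrix.
W0 = [[1.0, 0.0], [0.0, 1.0]]
query, relevant, irrelevant = [1.0, 0.0], [0.5, 0.5], [1.0, 0.0]

W1 = passive_aggressive_step(W0, query, relevant, irrelevant, C=1.0)
print(triplet_hinge(W0, query, relevant, irrelevant),
      triplet_hinge(W1, query, relevant, irrelevant))
```

A small `C` limits how far a single, possibly noisy, triplet can move `W`. With a large enough `C`, one step is sufficient to satisfy the margin on the current triplet.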

[4] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pre-training help deep learning?
Journal of Machine Learning Research, JMLR, 11:625-660, 2010.
Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

[5] R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik.
Sound retrieval and ranking using sparse auditory representations.
Neural Computation, 22(9):2390-2416, 2010.
To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large scale task. We have adapted a machine-vision method, the “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional Mel-Frequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC frontend.

[6] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD, 2010.
Best Paper Award in Machine Learning.
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[7] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
Machine Learning Journal, 81(1):21-35, 2010.
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

2009

[1] S. Bengio and J. Keshet.
Introduction.
In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 3-10. Wiley, 2009.
One of the most natural communication tools used by humans is their voice. It is hence natural that a lot of research has been devoted to analyzing and understanding human uttered speech for various applications. The most obvious one is automatic speech recognition, where the goal is to transcribe a recorded speech utterance into its corresponding sequence of words. Other applications include speaker recognition, where the goal is to determine either the claimed identity of the speaker (verification) or who is speaking (identification), and speaker segmentation or diarization, where the goal is to segment an acoustic sequence in terms of the underlying speakers (such as during a dialog). Although an enormous amount of research has been devoted to speech processing, there appears to be some form of local optimum in terms of the fundamental tools used to approach these problems. The aim of this book is to introduce the speech researcher community to radically different approaches based on more recent kernel based machine learning methods. In this introduction, we first briefly review the predominant speech processing approach, based on hidden Markov models, as well as its known problems; we then introduce the most well known kernel based approach, the Support Vector Machine (SVM), and finally outline the various contributions of this book.

[2] S. Bengio, F. Pereira, Y. Singer, and D. Strelow.
Group sparse coding.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.

[3] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large-scale online learning of image similarity through ranking: Extended abstract.
In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009.
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address real-world large-scale AI problems, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that query-independent similarity could be accurately learned even for large-scale datasets that could not be handled before.

[4] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.

[5] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent.
The difficulty of training deep architectures and the effect of unsupervised pre-training.
In D. van Dyk and M. Welling, editors, Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 5 of JMLR Workshop and Conference Proceedings, pages 153-160, 2009.
Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.

[6] D. Grangier, J. Keshet, and S. Bengio.
Discriminative keyword spotting.
In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 175-194. Wiley, 2009.
This chapter introduces a discriminative method for detecting and spotting keywords in spoken utterances. Given a word represented as a sequence of phonemes and a spoken utterance, the keyword spotter predicts the best time span of the phoneme sequence in the spoken utterance along with a confidence. If the prediction confidence is above a certain level, the keyword is declared to be spoken in the utterance within the predicted time span; otherwise the keyword is declared as not spoken. The problem of keyword spotting training is formulated as a discriminative task where the model parameters are chosen so that the utterance in which the keyword is spoken has higher confidence than any other spoken utterance in which the keyword is not spoken. It is shown theoretically and empirically that the proposed training method results in a high area under the Receiver Operating Characteristic (ROC) curve, the most common measure to evaluate keyword spotters. We present an iterative algorithm to train the keyword spotter efficiently. The proposed approach contrasts with standard spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a loss directly related to the spotting performance. Several experiments performed on the TIMIT and WSJ corpora show the advantage of our approach over HMM-based alternatives.

[7] J. Keshet and S. Bengio, editors.
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. Wiley, 2009.
This is the first book dedicated to uniting research related to speech and speaker recognition based on the recent advances in large margin and kernel methods. The first part of the book presents theoretical and practical foundations of large margin and kernel methods, from Support Vector Machines to large margin methods for structured learning. The second part of the book is dedicated to acoustic modeling of continuous speech recognizers, where the grounds for practical large margin sequence learning are set. The third part introduces large margin methods for discriminative language modeling. The last part of the book is dedicated to the application of keyword spotting, speaker verification and spectral clustering. The book is an important reference to researchers and practitioners in the field of modern speech and speaker recognition. The purpose of the book is twofold: first, to set the theoretical foundation of large margin and kernel methods relevant to the speech recognition domain; second, to propose a practical guide on implementation of these methods to the speech recognition domain. The reader is presumed to have basic knowledge of large margin and kernel methods and of basic algorithms in speech and speaker recognition.

[8] J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
Speech Communication, 51:317-329, 2009.
This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulting discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT-trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI-Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system.
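For readers unfamiliar with the evaluation measure named in this abstract, the area under the ROC curve can be read as the fraction of (keyword present, keyword absent) utterance pairs that a spotter ranks in the right order. The sketch below computes that pairwise statistic from confidence scores; the function name and the convention of counting ties as one half are illustrative choices, not taken from the paper.

```python
def area_under_roc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one score of each kind")
    correct = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                correct += 1.0   # utterance with the keyword scored higher
            elif pos == neg:
                correct += 0.5   # tie: half credit
    return correct / (len(positive_scores) * len(negative_scores))
```

A training criterion that pushes every positive utterance above every negative one by a margin therefore targets this quantity directly.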

[9] J. Mariéthoz, S. Bengio, and Y. Grandvalet.
Kernel-based text-independent speaker verification.
In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 195-220. Wiley, 2009.
The goal of a person authentication system is to certify/attest the claimed identity of a user. When this authentication is based on the voice of the user, without respect to what the user exactly said, the system is called a text-independent speaker verification system. Speaker verification systems are increasingly often used to secure personal information, particularly for mobile phone based applications. Furthermore, text-independent versions of speaker verification systems are most used for their simplicity, as they do not require complex speech recognition modules. The most common approach to this task is based on Gaussian Mixture Models (GMMs) (Reynolds et al. 2000), which do not take into account any temporal information. GMMs have been intensively used thanks to their good performance, especially with the use of the Maximum a posteriori (MAP) (Gauvain and Lee 1994) adaptation algorithm. This approach is based on the density estimation of an impostor data distribution, followed by its adaptation to a specific client data set. Note that the estimation of these densities is not the final goal of speaker verification systems, which is rather to discriminate the client and impostor classes; hence discriminative approaches might appear good candidates for this task as well. As a matter of fact, Support Vector Machine (SVM) based systems have been the subject of several recent publications in the speaker verification community, in which they obtain performance similar to or even better than GMMs on several text-independent speaker verification tasks. In order to use SVMs or any other discriminant approaches for speaker verification, several modifications of the classical techniques need to be performed.
The purpose of this chapter is to present an overview of discriminant approaches that have been used successfully for the task of text-independent speaker verification, to analyze their differences from and their similarities to each other and to classical generative approaches based on GMMs. An open-source version of the C++ source code used to perform all experiments described in this chapter can be found at http://speaker.abracadoudou.com.

[10] J.-F. Paiement, S. Bengio, and D. Eck.
Probabilistic models for melodic prediction.
Artificial Intelligence Journal, 173(14):1266-1274, 2009.
Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.

[11] J.-F. Paiement, Y. Grandvalet, and S. Bengio.
Predictive models for music.
Connection Science, 21(2 & 3):253-272, 2009.
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce predictive models for melodies. We decompose melodic modeling into two subtasks. We first propose a rhythm model based on the distributions of distances between subsequences. Then, we define a generative model for melodies given chords and rhythms based on modeling sequences of Narmour features. The rhythm model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. Using a similar evaluation procedure, the proposed melodic model consistently outperforms an Input/Output Hidden Markov Model. Furthermore, these models are able to generate realistic melodies given appropriate musical contexts.

[12] M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparse-code representations.
In ICML 2009 Workshop on Sparse Methods for Music Audio, 2009.
The task of ranking sounds from text queries is a good test application for machine-hearing techniques, and particularly for comparison and evaluation of alternative sound representations in a large-scale setting. We have adapted a machine-vision system, “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns, using a ranking-based cost function, a linear mapping from a very large sparse feature space to a large query-term space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. In addition to two main auditory-image models, we also include and compare a family of more conventional Mel-Frequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. The two auditory models tested use the adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precision-at-top-k performance measures, the best results are about 72% top-1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

2008

[1] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Large-scale content-based audio retrieval from text queries.
In ACM International Conference on Multimedia Information Retrieval, MIR, 2008.
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed user-labeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.

[2] D. Grangier and S. Bengio.
A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):1371-1384, 2008.
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

[3] J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A distance model for rhythms.
In International Conference on Machine Learning, ICML, 2008.
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
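As a rough illustration of the quantity this abstract builds on, the sketch below cuts a rhythm into equal-length subsequences and computes the Hamming distance between every pair. The binary onset encoding, the non-overlapping fixed-length segmentation, and the function names are assumptions made for the example; the abstract does not specify the representation or how the distance distributions are modeled.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def subsequence_distances(rhythm, segment_length):
    """Hamming distance between every pair of non-overlapping segments.

    A trailing segment shorter than segment_length is dropped.
    """
    last_start = len(rhythm) - segment_length
    segments = [rhythm[i:i + segment_length]
                for i in range(0, last_start + 1, segment_length)]
    return {(i, j): hamming_distance(segments[i], segments[j])
            for i, j in combinations(range(len(segments)), 2)}
```

For example, with 1 marking a note onset, `subsequence_distances([1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0], 4)` compares three one-bar segments and shows that the first and third bars are identical.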

[4] H. Paugam-Moisy, R. Martinez, and S. Bengio.
Delay learning and polychronization for reservoir computing.
Neurocomputing, 71(7-9):1143-1158, 2008.
We propose a multi-timescale learning rule for spiking neuron networks, in the line of the recently emerging field of reservoir computing. The reservoir is a network model of spiking neurons, with random topology and driven by STDP (Spike-Time-Dependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that affects the synaptic delays linking the network to the readout neurons, with classification as a goal task. The network processing and the resulting performance can be explained by the concept of polychronization, proposed by Izhikevich (2006, Neural Computation, 18:2), on physiological grounds. The model emphasizes that polychronization can be used as a tool for exploiting the computational power of synaptic delays and for monitoring the topology and activity of a spiking neuron network.

2007

[1] S. Bengio and J. Mariéthoz.
Biometric person authentication is a multiple classifier problem.
In M. Haindl, J. Kittler, and F. Roli, editors, 7th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 4472. Springer-Verlag, 2007.
Several papers have already shown the interest of using multiple classifiers in order to enhance the performance of biometric person authentication systems. In this paper, we would like to argue that the core task of Biometric Person Authentication is actually a multiple classifier problem as such: indeed, in order to reach state-of-the-art performance, we argue that all current systems, in one way or another, try to solve several tasks simultaneously and that without such joint training (or sharing), they would not succeed as well. We explain hereafter this perspective, and according to it, we propose some ways to take advantage of it, ranging from more parameter sharing to similarity learning.

[2] D. Grangier and S. Bengio.
Learning the inter-frame distance for discriminative template-based keyword detection.
In Proceedings of the 10th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2007.
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.

[3] J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
In ISCA Research Workshop on Non Linear Speech Processing, NOLISP, 2007.
This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on non-linearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.

[4] J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to text-independent speaker verification systems.
Pattern Recognition, 40:2315-2324, 2007.
This paper presents a principled SVM-based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to state-of-the-art GMM and other SVM-based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM-based systems and suggests interesting future research directions.

[5] J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A generative model for rhythms.
In NIPS Workshop on Brain, Music and Cognition, 2007.
Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[6] H. Paugam-Moisy, R. Martinez, and S. Bengio.
A supervised learning approach based on STDP and polychronization in spiking neuron networks.
In European Symposium on Artificial Neural Networks, ESANN, 2007.
We propose a network model of spiking neurons, without preimposed topology and driven by STDP (Spike-Time-Dependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that has effect on the synaptic delays linking the network to the output neurons, with classification as a goal task. The network processing and the resulting performance are completely explainable by the concept of polychronization, proposed by Izhikevich. The model emphasizes the computational capabilities of this concept.

[7] N. Poh and S. Bengio.
Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2007.
Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, composition and the choice of samples. We propose to reduce the dependency of the performance on these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test class-conditional (client and impostor) score distributions are unknown and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them on the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance of an unseen database with two times more users with an average coverage of 95% (over all the 24 systems). Coverage is the proportion of the unseen EPC covered by the derived confidence interval.

[8] N. Poh, A. Martin, and S. Bengio.
Performance generalization in biometric authentication using joint user-specific and sample bootstraps.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(3):492-498, 2007.
Biometric authentication performance is often depicted by a decision error trade-off (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a two-step bootstrap procedure to take into account the three mentioned sources of variability. This is an extension of Bolle's bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75% of DET coverage, an unseen DET curve with 8 times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and hence more useful.
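The idea of a two-step bootstrap can be sketched as follows: first resample users with replacement, then resample each selected user's samples with replacement, and repeat to obtain a distribution of the performance estimate. The sketch below applies this to a single scalar error rate rather than to a full DET curve, and uses a simple percentile interval; both simplifications, along with the function name, the input layout, and the default parameters, are assumptions for illustration and not the procedure of the paper.

```python
import random


def two_step_bootstrap_interval(errors_by_user, n_rounds=1000, alpha=0.05, seed=0):
    """Percentile interval for an error rate under user- and sample-level resampling.

    errors_by_user maps each user to a non-empty list of 0/1 error indicators.
    """
    rng = random.Random(seed)
    users = list(errors_by_user)
    estimates = []
    for _ in range(n_rounds):
        pooled = []
        # Step 1: resample users with replacement.
        for user in rng.choices(users, k=len(users)):
            samples = errors_by_user[user]
            # Step 2: resample that user's samples with replacement.
            pooled.extend(rng.choices(samples, k=len(samples)))
        estimates.append(sum(pooled) / len(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * (n_rounds - 1))]
    upper = estimates[int((1 - alpha / 2) * (n_rounds - 1))]
    return lower, upper
```

Resampling at the user level is what lets the interval reflect variability in the population of users, not only in the samples collected from a fixed set of users.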

[9] S. Renals, S. Bengio, and J. G. Fiscus, editors.
Machine Learning for Multimodal Interaction: Third International Workshop, MLMI'2006.
volume 4299 of Lecture Notes in Computer Science. Springer-Verlag, 2007.
This book contains a selection of refereed papers presented at the 3rd Workshop on Machine Learning for Multimodal Interaction (MLMI 2006), held in Bethesda MD, USA during May 1-4, 2006. The workshop was organized and sponsored jointly by the US National Institute for Standards and Technology (NIST), three projects supported by the European Commission (Information Society Technologies priority of the sixth Framework Programme) - the AMI and CHIL Integrated Projects, and the PASCAL Network of Excellence - and the Swiss National Science Foundation national research collaboration, IM2. In addition to the main workshop, MLMI 2006 was co-located with the 4th NIST Meeting Recognition Workshop. This workshop was centered on the Rich Transcription 2006 Spring Meeting Recognition (RT-06) evaluation of speech technologies within the meeting domain. Building on the success of previous evaluations in this domain, the RT-06 evaluation continued evaluation tasks in the areas of speech-to-text, who-spoke-when, and speech activity detection. The conference program featured invited talks, full papers (subject to careful peer review, by at least three reviewers), and posters (accepted on the basis of abstracts) covering a wide range of areas related to machine learning applied to multimodal interaction - and more specifically to multimodal meeting processing, as addressed by the various sponsoring projects. These areas included human-human communication modeling, speech and visual processing, multimodal processing, fusion and fission, human-computer interaction, and the modeling of discourse and dialog, with an emphasis on the application of machine learning. Out of the submitted full papers, about 50% were accepted for publication in the present volume, after authors had been invited to take review comments and conference feedback into account.
The workshop featured invited talks from Roderick Murray-Smith (University of Glasgow), Tsuhan Chen (Carnegie Mellon University) and David McNeill (University of Chicago), and a special session on projects in the area of multimodal interaction including presentations on the VACE, CHIL and AMI projects.

[10] S. Sonnenburg, M. L. Braun, C. Soon Ong, S. Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson.
The need for open source software in machine learning.
Journal of Machine Learning Research, JMLR, 8:2443-2466, 2007.
Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not utilized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

[11] D. Zhang and S. Bengio.
Exploring contextual information in a layered framework for group action recognition.
In IEEE International Conference on Multimedia & Expo, ICME, 2007.
Contextual information is important for sequence modeling. Hidden Markov models (HMMs) and extensions, which have been widely used for sequence modeling, make simplifying, often unrealistic assumptions on the conditional independence of observations given the class labels, and thus cannot accommodate overlapping features or long-term contextual information. In this paper, we introduce a principled layered framework with three implementation methods that take into account contextual information (as available in the whole or part of the sequence). The first two methods are based on state alpha and gamma posteriors (as usually referred to in the HMM formalism). The third method is based on conditional random fields (CRFs), a conditional model that relaxes the independence assumption on the observations required by HMMs for computational tractability. We illustrate our methods with the application of recognizing group actions in meetings. Experiments and comparison with a standard HMM baseline show the validity of the proposed approach.

2006

[1] F. Cardinaux, C. Sanderson, and S. Bengio.
User authentication via adapted statistical models of face images.
IEEE Transactions on Signal Processing, 54(1):361-373, 2006.
It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition. Recently, a simpler statistical model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM and HMM based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there are only a few training images available. Considerably more precise models can be obtained through the use of Maximum a Posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo-2D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best trade-off in terms of authentication time, robustness and discrimination performance is achieved by the extended GMM approach.

[2] O. Glickman, I. Dagan, M. Keller, S. Bengio, and W. Daelemans.
Investigating lexical substitution scoring for subtitle generation.
In Tenth Conference on Computational Natural Language Learning, CONLL, 2006.
This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state-of-the-art scoring methods that predict the validity of a given substitution. The paper evaluates two context-independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two WordNet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.

[3] D. Grangier and S. Bengio.
A neural network to retrieve images from text queries.
In Proceedings of the 16th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 4132. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work presents a neural network for the retrieval of images from text queries. The proposed network is composed of two main modules: the first one extracts a global picture representation from local block descriptors while the second one aims at solving the retrieval problem from the extracted representation. Both modules are trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.

[4] D. Grangier, F. Monay, and S. Bengio.
A discriminative approach for the retrieval of images from text queries.
In European Conference on Machine Learning, ECML, Lecture Notes in Computer Science, volume LNCS 4212. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.
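
The ranking criterion described in this abstract (relevant pictures should score above non-relevant ones for a given text query) can be illustrated with a pairwise hinge loss. This is a minimal sketch under stated assumptions, not the paper's actual loss or its Passive-Aggressive update: the function name, the margin value and the toy scores are all hypothetical.

```python
def pairwise_ranking_loss(scores, relevant, margin=1.0):
    """Sum of hinge penalties over (relevant, non-relevant) picture pairs.

    scores   -- model score of each picture for one text query
    relevant -- parallel list of booleans (True = relevant picture)
    margin   -- hypothetical margin, not taken from the paper
    """
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    # A pair contributes nothing once the relevant picture outscores
    # the non-relevant one by at least the margin.
    return sum(max(0.0, margin - (p - n)) for p in pos for n in neg)


labels = [True, True, False, False]
well_ranked = pairwise_ranking_loss([3.0, 2.5, 0.5, 0.0], labels)
badly_ranked = pairwise_ranking_loss([3.0, 0.2, 0.5, 0.0], labels)
print(well_ranked, badly_ranked)
```

The loss is zero when every relevant picture is ranked above every non-relevant one by the margin, and it grows as relevant pictures slip down the ranking.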

[5] D. Grangier, F. Monay, and S. Bengio.
Learning to retrieve images from text queries with a discriminative model.
In International Workshop on Adaptive Multimedia Retrieval, AMR, 2006.
.ps.gz | .pdf | .djvu | abstract]
This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the state-of-the-art (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).

[6] J. Keshet, S. Shalev-Shwartz, S. Bengio, Y. Singer, and D. Chazan.
Discriminative kernel-based phoneme sequence recognition.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
.ps.gz | .pdf | .djvu | abstract]
We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMM-based approaches, our method uses a discriminative kernel-based training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance along with a proposed phoneme sequence to a vector-space endowed with an inner-product that is realized by a Mercer kernel. Building on large margin techniques for predicting whole sequences, we are able to devise a learning algorithm which distills to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and further describe an efficient implementation of it. We present encouraging initial experimental results on the TIMIT corpus and compare the proposed method to an HMM-based approach.
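
The Levenshtein distance that this abstract names as the training target is the standard edit distance: the minimum number of insertions, deletions and substitutions needed to turn one sequence into the other. A minimal sketch follows; the phoneme symbols in the example are made up for illustration.

```python
def levenshtein(predicted, reference):
    """Edit distance between two sequences (lists of phonemes or strings)."""
    # prev[j] holds the distance between the prefix of `predicted`
    # processed so far and the first j symbols of `reference`.
    prev = list(range(len(reference) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            cost = 0 if p == r else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# One substitution (ae -> ah) plus one insertion (s).
print(levenshtein(["k", "ae", "t"], ["k", "ah", "t", "s"]))
```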

[7] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Posterior based keyword spotting with a priori thresholds.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we propose a new posterior based scoring approach for keyword and non-keyword (garbage) elements. The estimation of these scores is based on the HMM state posterior probability definition, taking into account long contextual information and the prior knowledge (e.g. keyword model topology). The state posteriors are then integrated into keyword and garbage posteriors for every frame. These posteriors are used to make a decision on detection of the keyword at each frame. The frame level decisions are then accumulated (in this case, by counting) to make a global decision on having the keyword in the utterance. In this way, the contribution of possible outliers is minimized, as opposed to the conventional Viterbi decoding approach which accumulates likelihoods. Experiments on keywords from the Conversational Telephone Speech (CTS) and Numbers'95 databases are reported. Results show that the new scoring approach leads to a better trade-off between true and false alarms compared to the Viterbi decoding approach, while also providing the possibility to precalculate keyword specific spotting thresholds related to the length of the keywords.
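
The accumulation-by-counting step described in this abstract can be sketched as a vote over frames. The per-frame rule used here (keyword posterior greater than garbage posterior) and the `min_frames` threshold are assumptions made for illustration; the abstract only states that frame-level decisions are counted and that thresholds relate to keyword length.

```python
def keyword_detected(keyword_post, garbage_post, min_frames):
    """Global decision obtained by counting frame-level decisions.

    keyword_post, garbage_post -- per-frame posteriors (parallel lists)
    min_frames -- hypothetical count threshold for declaring a detection
    """
    # Each frame casts one vote, so a single extreme posterior cannot
    # dominate the outcome the way it can when likelihoods are accumulated.
    votes = sum(1 for k, g in zip(keyword_post, garbage_post) if k > g)
    return votes >= min_frames


kw = [0.9, 0.8, 0.2, 0.7]
gb = [0.1, 0.2, 0.8, 0.3]
print(keyword_detected(kw, gb, min_frames=3))
```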

[8] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Using more informative posterior probabilities for speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we present initial investigations towards boosting posterior probability based speech recognition systems by estimating more informative posteriors taking into account acoustic context (e.g., the whole utterance), as well as possible prior information (such as phonetic and lexical knowledge). These posteriors are estimated based on the HMM state posterior probability definition (typically used in standard HMM training). This approach provides a new, principled, theoretical framework for hierarchical estimation/use of more informative posteriors integrating appropriate context and prior knowledge. In the present work, we used the resulting posteriors as local scores for decoding. On the OGI numbers database, this resulted in a significant performance improvement, compared to using MLP estimated posteriors for decoding (hybrid HMM/ANN approach), for clean and especially for noisy speech. The system is also shown to be much less sensitive to tuning factors (such as phone deletion penalty, language model scaling) compared to the standard HMM/ANN and HMM/GMM systems, so in practice it does not need to be tuned to achieve the best possible performance.

[9] M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, and J. Richiardi.
Writer identification for smart meeting room systems.
In H. Bunke and A. L. Spitz, editors, Document Analysis Systems VII: 7th International Workshop, DAS, Lecture Notes in Computer Science, volume LNCS 3872, pages 186-195. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we present a text independent on-line writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets used in signature verification systems before. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the stroke-based feature set outperforms the point-based feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.

[10] J. Mariéthoz and S. Bengio.
A max kernel for text-independent speaker verification systems.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we present a principled SVM based speaker verification system. A general approach is developed that enables the use of any kernel at the frame level. An extension of this approach using the Max operator is then proposed. The new system is then compared to state-of-the-art GMM and other SVM based systems found in the literature on the Polyvar database. It is found that, most of the time, the new system outperforms the other systems by a statistically significant margin.

[11] J.-F. Paiement, D. Eck, and S. Bengio.
Probabilistic melodic harmonization.
In L. Lamontagne and M. Marchand, editors, Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, volume LNCS 4013, pages 218-229. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural component in chord notation. We show empirically that root note progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root note progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.

[12] N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the lack of real multimodal biometric databases as well as an important privacy issue: the fact that too many biometric modalities of the same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question how such a chimeric database can be efficiently used. For instance, the following two questions arise: i) is the performance measured on a chimeric database a good predictor of that measured on a real-user database? and ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database? Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 × 1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases neither degrades nor improves the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[13] N. Poh and S. Bengio.
Database, protocol and tools for evaluating score-level fusion algorithms in biometric authentication.
Pattern Recognition, 39(2):223-233, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more effort into understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.

[14] N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess user-dependent variabilities.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]

[15] A. Pozdnoukhov and S. Bengio.
Graph-based invariant manifolds for invariant pattern recognition with kernel methods.
In International Conference on Pattern Recognition, ICPR, 2006.
.ps.gz | .pdf | .djvu | abstract]
We present here an approach for applying the technique of modeling data transformation manifolds for invariant learning with kernel methods. The approach is based on building a kernel function on the graph modeling the invariant manifold. It provides a way for taking into account nearly arbitrary transformations of the input samples. The approach is verified experimentally on the task of optical character recognition, providing state-of-the-art performance on harder problem settings.

[16] A. Pozdnoukhov and S. Bengio.
Invariances in kernel methods: From samples to objects.
Pattern Recognition Letters, 27(10):1087-1097, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a general method for incorporating prior knowledge into kernel methods such as support vector machines. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. A number of implementation techniques of this method, based on hard geometrical objects and soft objects based on distributions, are considered. Tangent vectors are extensively used for object construction. Empirical results on one artificial dataset and two real datasets of electro-encephalogram signals and face images demonstrate the usefulness of the proposed method. The method could establish a foundation for information retrieval and person identification systems.

[17] A. Pozdnoukhov and S. Bengio.
Semi-supervised kernel methods for regression estimation.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
The paper presents a semi-supervised kernel method for regression estimation in the presence of unlabeled patterns. The method exploits a recently proposed data-dependent kernel which is constructed in order to represent the inner geometry of the data. This kernel is implemented into Kernel Regression methods (SVR, KRR). Experimental results aim to highlight the properties of the method and its advantages as compared to fully supervised approaches. The influence of the parameters on the model properties was evaluated experimentally. One artificial and two real-world datasets were used to demonstrate the performance of the proposed algorithm.

[18] S. Renals and S. Bengio, editors.
Machine Learning for Multimodal Interaction: Second International Workshop, MLMI'2005.
volume 3869 of Lecture Notes in Computer Science. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the Second Workshop on Machine Learning for Multimodal Interaction (MLMI 2005), held in Edinburgh, Scotland, during 11-13 July 2005. The workshop was organized and sponsored jointly by two European integrated projects, three European Networks of Excellence and a Swiss national research network: AMI, CHIL, HUMAINE, PASCAL, SIMILAR, and IM2. In addition to the main workshop, MLMI 2005 hosted the NIST (US National Institute of Standards and Technology) Meeting Recognition Workshop. This workshop (the third such sponsored by NIST) was centered on the Rich Transcription 2005 Spring Meeting Recognition (RT-05) evaluation of speech technologies within the meeting domain. Building on the success of the RT-04 spring evaluation, the RT-05 evaluation continued the speech-to-text and speaker diarization evaluation tasks and added two new evaluation tasks: speech activity detection and source localization. Given the multiple links between the above projects and several related research areas, and the success of the first MLMI 2004 workshop, it was decided to organize once again a joint workshop bringing together researchers working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The areas covered included: Human-human communication modeling, Speech and visual processing, Multimodal processing, fusion and fission, Multimodal dialog modeling, Human-human interaction modeling, Multimodal data structuring and presentation, Multimedia indexing and retrieval, Meeting structure analysis, Meeting summarizing, Multimodal meeting annotation, and Machine learning applied to the above.

[19] Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Measuring the performance of face localization systems.
Image and Vision Computing, 24(8):882-893, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
The purpose of face localization is to determine the coordinates of a face in a given image. It is a fundamental research area in computer vision because it serves as a necessary first step in any face processing system, such as automatic face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performance depends heavily on the accuracy of the face localization process. The main purpose of this paper is to show that the error made during the localization process may have different impacts on the final application. We first show the influence of localization errors on the face verification task and then empirically demonstrate the problems of current localization performance measures when applied to this task. In order to properly evaluate the performance of a face localization algorithm, we then propose to embed the final application (here face verification) into the performance measuring process. Using two benchmark databases, BANCA and XM2VTS, we proceed by showing empirically that our proposed method to evaluate localization algorithms better matches the final verification performance.

[20] C. Sanderson, S. Bengio, and Y. Gao.
On transforming statistical models for non-frontal face verification.
Pattern Recognition, 39(2):288-302, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of Gaussians, the problem is tackled through extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCA-derived features) and a local feature system (based on DCT-derived features). Experiments on the FERET database suggest that for the holistic system, the LinReg based technique is better suited than the MLLR based techniques; for the local feature system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts based representation of the face, and, due to the classifier based on mixtures of Gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.

[21] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Modeling individual and group actions in meetings with layered HMMs.
IEEE Transactions on Multimedia, 8(3):509-520, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements such a concept in a principled manner, and that has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.

[22] D. Zhang, D. Gatica-Perez, D. Roy, and S. Bengio.
Modeling interactions from email communication.
In IEEE International Conference on Multimedia & Expo, ICME, 2006.
.ps.gz | .pdf | .djvu | abstract]
Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topic-based interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization.

2005

[1] S. Bengio and H. Bourlard, editors.
Machine Learning for Multimodal Interaction: First International Workshop, MLMI'2004.
volume 3361 of Lecture Notes in Computer Science. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the First Workshop on Machine Learning for Multimodal Interaction (MLMI'04), held in Martigny, Switzerland, from June 21-23, 2004. The workshop was organized and sponsored jointly by three European projects, AMI, PASCAL and M4, as well as a Swiss national research network, IM2. It brings together researchers from different communities working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction in meetings. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multi-disciplinary workshop. The conference program covered a wide range of areas related to machine learning applied to multimodal interaction, and more specifically to multimodal meeting processing. These areas included human-human communication modeling, speech and visual processing, multimodal processing, fusion and fission, multimodal dialog modeling, human-human interaction modeling, multimodal data structuring and presentation, multimedia indexing and retrieval, meeting structure analysis, meeting summarizing, multimodal meeting annotation, and machine learning applied to the above.

[2] S. Bengio and H. Bourlard.
Multi channel sequence processing.
In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning: First International Workshop, Lecture Notes in Artificial Intelligence, volume LNAI 3635, pages 22-36. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper summarizes some of the current research challenges arising from multi-channel sequence processing. Indeed, multiple real life applications involve simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches towards sequence modeling. However, several challenging research issues are still open, such as taking into account asynchrony and correlation between several feature streams, or handling the underlying growing complexity. In this framework, we discuss here two novel approaches, which recently started to be investigated with success in the context of large multimodal problems. These include the asynchronous HMM, providing a principled approach towards the processing of multiple feature streams, and the layered HMM approach, providing a good formalism for decomposing large and complex (multi-stream) problems into layered architectures. As briefly reported here, combination of these two approaches yielded successful results on several multi-channel tasks, ranging from audio-visual speech recognition to automatic meeting analysis.

[3] S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve.
In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the so-called break-even point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to adequately use a non-parametric statistical test in order to produce EPCs with confidence intervals or assess the statistically significant difference between two models under various settings.
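
The idea of an unbiased estimate at each operating point can be sketched as follows: a decision threshold is chosen on a development set, and the error is then reported on a separate test set. The abstract does not give the selection criterion, so the weighted cost `alpha * FAR + (1 - alpha) * FRR`, the half total error rate reported on the test set, and all names and toy scores below are assumptions borrowed from the usual verification setting.

```python
def far_frr(neg_scores, pos_scores, threshold):
    """False acceptance and false rejection rates at a given threshold."""
    far = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    frr = sum(s < threshold for s in pos_scores) / len(pos_scores)
    return far, frr


def epc(dev_neg, dev_pos, test_neg, test_pos, alphas):
    """Return (alpha, test half total error rate) pairs.

    For each alpha the threshold is tuned on the development scores only
    and then applied unchanged to the test scores.
    """
    candidates = sorted(set(dev_neg) | set(dev_pos))
    curve = []
    for alpha in alphas:
        def dev_cost(t):
            far, frr = far_frr(dev_neg, dev_pos, t)
            return alpha * far + (1 - alpha) * frr
        best = min(candidates, key=dev_cost)
        far, frr = far_frr(test_neg, test_pos, best)
        curve.append((alpha, (far + frr) / 2))
    return curve


curve = epc(dev_neg=[0.1, 0.2, 0.3], dev_pos=[0.7, 0.8, 0.9],
            test_neg=[0.15, 0.25, 0.75], test_pos=[0.6, 0.85, 0.95],
            alphas=[0.25, 0.5, 0.75])
print(curve)
```

Because the threshold never sees the test scores, each point on the curve reflects what a deployed system tuned for that trade-off would be expected to achieve.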

[4] C. Dimitrakakis and S. Bengio.
Boosting word error rates.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 501-504, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the well-known Numbers 95 database that indicate the importance of this temporal probability distribution.

[5] C. Dimitrakakis and S. Bengio.
Gradient-based estimates of return distributions.
In PASCAL Workshop on Principled Methods of Trading Exploration and Exploitation, 2005.
.ps.gz | .pdf | .djvu | abstract]
We present a general method for maintaining estimates of the distribution of parameters in arbitrary models. This is then applied to the estimation of probability distributions over actions in value-based reinforcement learning. While this approach is similar to other techniques that maintain a confidence measure for action-values, it nevertheless offers an insight into current techniques and hints at potential avenues of further research.

[6] C. Dimitrakakis and S. Bengio.
Online adaptive policies for ensemble classifiers.
Neurocomputing, 64:211-221, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper we attempt to train and combine the base classifiers using an adaptive policy. This policy is learnt through a Q-learning inspired technique. Its effectiveness for an essentially supervised task is demonstrated by experimental results on several UCI benchmark databases.

[7] D. Gatica-Perez, D. Zhang, and S. Bengio.
Extracting information from multimedia meeting collections.
In 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR, 2005.
.ps.gz | .pdf | .djvu | abstract]
Multimedia meeting collections, composed of unedited audio and video streams, handwritten notes, slides, and electronic documents that jointly constitute a raw record of complex human interaction processes in the workplace, have attracted interest due to the increasing feasibility of recording them in large quantities, to the opportunities for information access and retrieval applications derived from the automatic extraction of relevant meeting information, and to the challenges that the extraction of semantic information from real human activities entails. In this paper, we present a succinct overview of recent approaches in this field, largely influenced by our own experiences. We first review some of the existing and potential needs for users of multimedia meeting information systems. We then summarize recent work on various research areas addressing some of these requirements. In more detail, we describe our work on automatic analysis of human interaction patterns from audio-visual sensors, discussing open issues in this domain.

[8] D. Gatica-Perez, I. McCowan, D. Zhang, and S. Bengio.
Detecting group interest-level in meetings.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 489-492, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Finding relevant segments in meeting recordings is important for summarization, browsing, and retrieval purposes. In this paper, we define relevance as the interest-level that meeting participants manifest as a group during the course of their interaction (as perceived by an external observer), and investigate the automatic detection of segments of high-interest from audio-visual cues. This is motivated by the assumption that there is a relationship between segments of interest to participants, and those of interest to the end user, e.g. of a meeting browser. We first address the problem of human annotation of group interest-level. On a 50-meeting corpus, recorded in a room equipped with multiple cameras and microphones, we found that the annotations generated by multiple people exhibit a good degree of consistency, providing a stable ground-truth for automatic methods. For the automatic detection of high-interest segments, we investigate a methodology based on Hidden Markov Models (HMMs) and a number of audio and visual features. Single- and multi-stream approaches were studied. Using precision and recall as performance measures, the results suggest that (i) the automatic detection of group interest-level is promising, and (ii) while audio in general constitutes the predominant modality in meetings, the use of a multi-modal approach is beneficial.

[9] N. Gilardi and S. Bengio.
Machine learning for automatic environmental mapping: when and how?
In G. Dubois, editor, Automatic mapping algorithms for routine and emergency monitoring data. Report on the Spatial Interpolation Comparison (SIC2004) exercise, pages 123-138. Office for Official Publications of the European Communities, Luxembourg, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper discusses the opportunity of using Machine Learning techniques in an automatic environmental mapping context, as was the case for the SIC2004 exercise. First, the Machine Learning methodology is quickly described and compared to Geostatistics. From there, some clues about when to apply Machine Learning are proposed, and what outcomes can be expected from this choice. Finally, three well known regression algorithms: K-Nearest Neighbors, Multi Layer Perceptron and Support Vector Regression, are used on SIC2004 data in a Machine Learning context, and compared to Ordinary Kriging. This illustrates some potential drawbacks of SVR and MLP for applications such as SIC2004.

[10] Y. Grandvalet, J. Mariéthoz, and S. Bengio.
A probabilistic interpretation of SVMs with an application to unbalanced classification.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection makes it possible to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem when decisions result in unequal losses. Experiments on an unbalanced classification loss show improvements over state-of-the-art procedures.

[11] D. Grangier and S. Bengio.
Exploiting hyperlinks to learn a retrieval model.
In Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005.
.ps.gz | .pdf | .djvu | abstract]
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d in C according to their similarity with respect to q, sim(q,d). The identification of an effective function a,b -> sim(a,b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label; thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied over a non-hyperlinked retrieval corpus.

[12] D. Grangier and S. Bengio.
Inferring document similarity from hyperlinks.
In Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function a,b -> sim(a,b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.

[13] M. Keller and S. Bengio.
A neural network for text representation.
In Proceedings of the 15th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 3697, pages 667-672. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, non-probabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.

[14] M. Keller, S. Bengio, and S. Y. Wong.
Benchmarking non-parametric statistical tests.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, under various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical tests, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions.

[15] H. Ketabdar, H. Bourlard, and S. Bengio.
Hierarchical multi-stream posterior based speech recognition system.
In Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3869, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on the “state gamma posterior” definition (typically used in standard HMM training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to the state-of-the-art Tandem systems.

[16] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Developing and enhancing posterior based speech recognition systems.
In Proceedings of the 9th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Local state or phone posterior probabilities are often investigated as local scores (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) to improve speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving posterior estimates, using acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). In the present work, the enhanced posterior distribution is associated with the “gamma” distribution typically used in standard HMM training, and estimated from local likelihoods (GMM) or local posteriors (ANN). This approach results in a family of new HMM based systems, where only posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the phone and word levels, and integrating the appropriate context and prior knowledge in each level. In the present work, we used the resulting posteriors as local scores in a Viterbi decoder. On the OGI Numbers'95 database, this resulted in improved recognition performance, compared to a state-of-the-art hybrid HMM/ANN system.

[17] J. Mariéthoz and S. Bengio.
A unified framework for score normalization techniques applied to text independent speaker verification.
IEEE Signal Processing Letters, 12(7):532-535, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The purpose of this paper is to unify several of the state-of-the-art score normalization techniques applied to text-independent speaker verification systems. We propose a new framework for this purpose. The two well-known Z- and T-normalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these well-known score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework.

[18] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(3):305-317, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings. A statistical framework is proposed in which group actions result from the interactions of the individual participants. The group actions are modelled using different HMM-based approaches, where the observations are provided by a set of audio-visual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modelling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.

[19] J.-F. Paiement, D. Eck, and S. Bengio.
A probabilistic model for chord progressions.
In International Conference on Music Information Retrieval, ISMIR, 2005.
.ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[20] J.-F. Paiement, D. Eck, S. Bengio, and D. Barber.
A graphical model for chord progressions embedded in a psychoacoustic space.
In International Conference on Machine Learning, ICML, 2005.
.ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[21] N. Poh and S. Bengio.
Can chimeric persons be used in multimodal biometric authentication experiments?
In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, volume LNCS 3869. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams, is a very promising approach, both in experiments and to some extent in various real-life applications. A system that uses more than one behavioral or physiological characteristic to verify whether a person is who he/she claims to be is called a multimodal biometric authentication system. Due to the lack of large true multimodal biometric datasets, the biometric trait of a user from a database is often combined with a different biometric trait of yet another user, thus creating a so-called chimeric user. In the literature, this practice is justified based on the fact that the underlying biometric traits to be combined are assumed to be independent of each other given the user. To the best of our knowledge, there is no literature that approves or disapproves of such a practice. We study this topic from two aspects: 1) by clarifying the mentioned independence assumption and 2) by constructing a pool of chimeric users from a pool of true modality matched users (or simply “true users”) taken from a bimodal database, such that the performance variability due to chimeric users can be compared with that due to true users. The experimental results suggest that for a large proportion of the experiments, such a practice is indeed questionable.

[22] N. Poh and S. Bengio.
EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks.
In N. C. Oza, R. Polikar, and J. Kittler, editors, 6th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 3541, pages 74-85. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Biometric authentication is a process of verifying an identity claim using a person's behavioural and physiological characteristics. Due to the vulnerability of the system to environmental noise and variation caused by the user, fusion of several biometric-enabled systems is identified as a promising solution. In the literature, various fixed rules (e.g. min, max, median, mean) and trainable classifiers (e.g. linear combination of scores or weighted sum) are used to combine the scores of several base-systems. How exactly do the correlation and the imbalanced nature of base-system performance affect the fixed rules and trainable classifiers? We study these joint aspects using the commonly used error measurement in biometric authentication, namely Equal Error Rate (EER). Similar to several previous studies in the literature, the central assumption used here is that the class-dependent scores of a biometric system are approximately normally distributed. However, different from them, the novelty of this study is to make a direct link between the EER measure and the fusion schemes mentioned. Both synthetic and real experiments (with as many as 256 fusion experiments carried out on the XM2VTS benchmark score-level fusion data sets) verify our proposed theoretical modeling of EER of the two families of combination schemes. In particular, it is found that weighted sum can provide the best generalisation performance when its weights are estimated correctly. It also has the additional advantage that score normalisation prior to fusion is not needed, contrary to the rest of the fixed fusion rules.

[23] N. Poh and S. Bengio.
F-ratio client-dependent normalisation for biometric authentication tasks.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 721-724, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new client-dependent normalisation to improve biometric authentication systems. There exist many client-dependent score normalisation techniques applied to speaker authentication, such as Z-Norm, D-Norm and T-Norm. Such normalisation is intended to adjust the variation across different client models. We propose “F-ratio” normalisation, or F-Norm, applied to face and speaker authentication systems. This normalisation requires only that as few as two client-dependent accesses are available (the more the better). Different from previous normalisation techniques, F-Norm considers the client and impostor distributions simultaneously. We show that F-ratio is a natural choice because it is directly associated with Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that F-Norm actually “interpolates” between client-independent and client-dependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over Z-Norm, client-dependent threshold normalisation or no normalisation.

[24] N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks?
IEEE Transactions on Signal Processing, 53(11):4384-4396, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources such as subbands, streams (with different features) and multimodal data has been shown to be a very promising trend, both in experiments and to some extent in real-life biometric authentication applications. Despite considerable efforts in fusion, there is a lack of understanding of the roles and effects of correlation and variance (of both the client and impostor scores of base-classifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance Reduction-Equal Error Rate (VR-EER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we show that Equal Error Rate (EER) can be modeled as a function of F-ratio, which itself is a function of 1) correlation, 2) variance of base-experts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of base-experts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be misleading. Experimental results on the BANCA multimodal database confirm our findings using VR-EER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated base-experts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score) improves fusion. However, by linking the concept of ambiguity decomposition to the classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (class-dependent) variance. Moreover, by linking the concept of bias-variance-covariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.

[25] N. Poh and S. Bengio.
Improving fusion with margin-derived confidence in biometric authentication tasks.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1059-1068. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new confidence criterion to improve fusion via a linear combination of scores of several biometric authentication systems. This confidence is based on the margin of making a decision, which answers the question, “after observing the score of a given system, what is the confidence (or risk) associated with that given access?”. In the context of multimodal and intramodal fusion, such information proves valuable because the margin information can determine which of the systems should be given higher weights. Finally, we propose a linear discriminative framework to fuse the margin information with an existing global fusion function. The results of 32 fusion experiments carried out on the XM2VTS multimodal database show that fusion using margin (product of margin and expert opinion) is superior to fusion without the margin information (i.e., the original expert opinion). Furthermore, combining both sources of information increases fusion performance further.

[26] N. Poh and S. Bengio.
A novel approach to combining client-dependent and confidence information in multimodal biometrics.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1120-1129. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The issues of fusion with client-dependent and confidence information have been well studied separately in biometric authentication. In this study, we propose to take advantage of both sources of information in a discriminative framework. Initially, each source of information is processed on a per expert basis (plus on a per client basis for the first information and on a per example basis for the second information). Then, both sources of information are combined using a second-level classifier, across different experts. Although the formulation of such a two-step solution is not new, the novelty lies in the way the sources of prior knowledge are incorporated prior to fusion using the second-level classifier. Because these two sources of information are of very different nature, one often needs to devise special algorithms to combine both information sources. Our framework, which we call “Prior Knowledge Incorporation”, has the advantage of using the standard machine learning algorithms. Based on 10 × 32 = 320 intramodal and multimodal fusion experiments carried out on the publicly available XM2VTS score-level fusion benchmark database, it is found that the generalisation performance of combining both information sources improves over using either or none of them, thus achieving a new state-of-the-art performance on this database.

[27] N. Poh and S. Bengio.
A score-level fusion benchmark database for biometric authentication.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 474-483. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.

[28] V. Popovici, S. Bengio, and J.-P. Thiran.
Kernel matching pursuit for large datasets.
Pattern Recognition, 38(12):2385-2390, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Kernel Matching Pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of some basis functions selected from a kernel-induced dictionary. Here we propose a modification of the Kernel Matching Pursuit algorithm that aims at making the method practical for large datasets. Starting from an approximating algorithm, the Weak Greedy Algorithm, we introduce a stochastic method for reducing the search space at each iteration. Then we study the implications of using an approximate algorithm and we show how one can control the trade-off between the accuracy and the need for resources. Finally we present some experiments performed on a large dataset that support our approach and illustrate its applicability.

[29] A. Pozdnoukhov and S. Bengio.
Improving kernel classifiers for object categorization problems.
In International Conference on Machine Learning, ICML, Workshop on Learning with Partially Classified Training Data, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents an approach for improving the performance of kernel classifiers applied to object categorization problems. The approach is based on the use of distributions centered around each training point, which are exploited for inter-class invariant image representation with local invariant features. Furthermore, we propose an extensive use of unlabeled images for improving the SVM-based classifier.

[30] C. Sanderson, F. Cardinaux, and S. Bengio.
On accuracy/robustness/complexity trade-offs in face verification.
In IEEE International Conference on Information Technology and Applications, ICITA, pages 638-643, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In much of the literature devoted to face recognition, experiments are performed with controlled images (e.g. manual face localization, controlled lighting, background and pose). However, a practical recognition system has to be robust to more challenging conditions. In this paper we first evaluate, on the relatively difficult BANCA database, the discrimination accuracy, robustness and complexity of Gaussian Mixture Model (GMM), 1D- and pseudo-2D Hidden Markov Model (HMM) based systems, using both manual and automatic face localization. We also propose to extend the GMM approach through the use of local features with embedded positional information, increasing accuracy without sacrificing its low complexity. Experiments show that good accuracy on manually located faces is not necessarily indicative of good accuracy on automatically located faces (which are imperfectly located). The deciding factor is shown to be the degree of constraints placed on spatial relations between face parts. Methods which utilize rigid constraints have poor robustness compared to methods which have relaxed constraints. Furthermore, we show that while the pseudo-2D HMM approach has the best overall accuracy, classification time on current hardware makes it impractical. The best trade-off in terms of complexity, robustness and discrimination accuracy is achieved by the extended GMM approach.

[31] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised adapted HMMs for unusual event detection.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of temporal unusual event detection. Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised model-based approaches. We propose a semi-supervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner. The proposed framework has an iterative structure, which adapts a new unusual event model at each iteration. We show that such a framework can address problems due to the scarcity of training data and the difficulty in pre-defining unusual events. Experiments on audio, visual, and audio-visual data streams illustrate its effectiveness, compared with both supervised and unsupervised baseline methods.

[32] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised meeting event recognition with adapted HMMs.
In IEEE International Conference on Multimedia and Expo, ICME, pages 611-618, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
This paper investigates the use of unlabeled data to help labeled data for audio-visual event recognition in meetings. To deal with situations in which it is difficult to collect enough labeled data to capture event characteristics, but collecting a large amount of unlabeled data is easy, we present a semi-supervised framework using HMM adaptation techniques. Instead of directly training one model for each event, we first train a well-estimated general event model for all events using both labeled and unlabeled data, and then adapt the general model to each specific event model using its own labeled data. We illustrate the proposed approach with a set of eight audio-visual events defined in meetings. Experiments and comparison with the fully-supervised baseline method show the validity of the proposed semi-supervised approach.

[33] D. Zhang, D. Gatica-Perez, S. Bengio, and D. Roy.
Learning influence among interacting Markov chains.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | abstract]
We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. The individual level models the actions of each player, and the group level models the actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model.

2004

[1] S. Bengio.
Multimodal speech processing using asynchronous hidden Markov models.
Information Fusion, 5(2):81-89, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[2] S. Bengio and J. Mariéthoz.
The expected performance curve: a new assessment measure for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
ROC and DET curves are often used in the field of person authentication to assess the quality of a model or even to compare several models. We argue in this paper that this measure can be misleading as it compares performance measures that cannot be reached simultaneously by all systems. We propose instead new curves, called Expected Performance Curves (EPC). These curves enable the comparison between several systems according to a criterion, decided by the application, which is used to set thresholds according to a separate validation set. Free software is available to compute these curves. A real case study is used throughout the paper to illustrate it. Finally, note that while this study was done on an authentication problem, it also applies to most 2-class classification tasks.
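The idea of setting thresholds on a separate validation set can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the authors' released software. The criterion used here (a weighted sum alpha * FAR + (1 - alpha) * FRR), the choice of candidate thresholds (the validation scores themselves), the acceptance rule (score >= threshold) and the reported test measure (half total error rate) are all assumptions, and the function names are hypothetical.

```python
def far_frr(impostor_scores, client_scores, threshold):
    """False acceptance and false rejection rates at a given threshold.
    Assumption: an access is accepted when its score is >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in client_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(client_scores))


def expected_performance_curve(dev_imp, dev_cli, test_imp, test_cli, alphas):
    """For each alpha, pick the threshold minimizing the weighted criterion
    on the validation (dev) scores, then report the HTER that this
    threshold yields on the test scores."""
    candidates = sorted(set(dev_imp) | set(dev_cli))
    curve = []
    for alpha in alphas:
        def dev_cost(t):
            far, frr = far_frr(dev_imp, dev_cli, t)
            return alpha * far + (1.0 - alpha) * frr
        threshold = min(candidates, key=dev_cost)
        far, frr = far_frr(test_imp, test_cli, threshold)
        curve.append((alpha, 0.5 * (far + frr)))
    return curve
```

Because the threshold is fixed before the test scores are seen, every point on the resulting curve corresponds to an operating point a system could actually have been deployed with.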

[3] S. Bengio and J. Mariéthoz.
A statistical significance test for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but since most are aggregates of two measures (such as the false acceptance rate and the false rejection rate), simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significance of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they often yield overly optimistic results (resulting in false statements about statistical significance).
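To make the aggregate nature of the HTER concrete, the following sketch computes it as the average of the false acceptance rate (FAR) and the false rejection rate (FRR), and attaches an approximate confidence interval. The interval is an illustrative assumption, not necessarily the test derived in the paper: it treats FAR and FRR as independent binomial proportions and uses a normal approximation. The function names and the default z value are hypothetical.

```python
import math


def hter(far, frr):
    """Half total error rate: the average of FAR and FRR."""
    return 0.5 * (far + frr)


def hter_confidence_interval(far, frr, n_impostor, n_client, z=1.96):
    """Approximate interval around the HTER.
    Assumption: FAR and FRR are independent binomial proportions estimated
    from n_impostor and n_client accesses; z=1.96 gives roughly 95%."""
    variance = (far * (1.0 - far) / (4.0 * n_impostor)
                + frr * (1.0 - frr) / (4.0 * n_client))
    half_width = z * math.sqrt(variance)
    center = hter(far, frr)
    return center - half_width, center + half_width
```

The factor of 4 in each variance term comes from the 1/2 weight that each rate carries in the HTER; the two rates are estimated from different numbers of accesses, which is why a test designed for a single proportion does not apply directly.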

[4] H. Bourlard, S. Bengio, M. Magimai Doss, Q. Zhu, B. Mesot, and N. Morgan.
Towards using hierarchical posteriors for flexible automatic speech recognition systems.
In Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) Rich Transcription (RT'04) Workshop, 2004.
.ps.gz | .pdf | .djvu | abstract]
Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task.

[5] F. Cardinaux, C. Sanderson, and S. Bengio.
Face verification using adapted generative models.
In International Conference on Automatic Face and Gesture Recognition, FG, pages 825-830, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has been shown previously that systems based on local features and relatively complex generative models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition (here we mean both identification and verification). Recently a simpler generative model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In this paper we first propose to increase the performance of the GMM approach (without sacrificing its simplicity) through the use of local features with embedded positional information; we show that the performance obtained is comparable to 1D HMMs. Secondly, we evaluate different training techniques for both GMM and HMM based systems. We show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there are only a few training images available; we propose to tackle this problem through the use of Maximum a Posteriori (MAP) training, where the lack of data problem can be effectively circumvented; we show that models estimated with MAP are significantly more robust and are able to generalize to adverse conditions present in the BANCA database.

[6] S. Chiappa and S. Bengio.
HMM and IOHMM modeling of EEG rhythms for asynchronous BCI systems.
In European Symposium on Artificial Neural Networks, ESANN, 2004.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
We compare the use of two Markovian models, HMMs and IOHMMs, to discriminate between three mental tasks for brain computer interface systems using an asynchronous protocol. We show that IOHMMs outperform HMMs but that, probably due to the lack of any prior information on the state dynamics, no practical advantage in the use of these models over their static counterparts is obtained.

[7] R. Collobert and S. Bengio.
A gentle Hessian for efficient gradient descent.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 517-520, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Several second-order optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n^2) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a block-diagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of block-diagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.

[8] R. Collobert and S. Bengio.
Links between perceptrons, MLPs and SVMs.
In International Conference on Machine Learning, ICML, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performances in generalization and speed than the other models.

[9] F. de Wet, K. Weber, L. Boves, B. Cranen, S. Bengio, and H. Bourlard.
Evaluation of formant-like features for automatic speech recognition.
Journal of the Acoustical Society of America (JASA), 116(3):1781-1792, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates possibilities to find a low-dimensional, formant-related physical representation of speech signals, which is suitable for automatic speech recognition. This aim is motivated by the fact that formants are known to be discriminant features for speech recognition. Combinations of automatically extracted formant-like features and state-of-the-art, noise-robust features have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formant-like features, i.e. robust formants and HMM2 features, and to compare these features to hand-labeled formants as well as to mel-frequency cepstral coefficients in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in [Hillenbrand et al., J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data as well as in (simulated) adverse conditions. In combination with standard automatic speech recognition methods, the classification performance of the robust formant and HMM2 features compares very well to the performance of the hand-labeled formants.

[10] C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 621-624, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.

[11] C. Dimitrakakis and S. Bengio.
Online policy adaptation for ensemble classifiers.
In European Symposium on Artificial Neural Networks, ESANN, 2004.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.

[12] M. Magimai Doss, S. Bengio, and H. Bourlard.
Joint decoding for phoneme-grapheme continuous speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 177-180, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Standard ASR systems typically use phonemes as the subword units. Preliminary studies have shown that the performance of an ASR system could be improved by using graphemes as additional subword units. In this paper, we investigate such a system where the word models are defined in terms of two different subword units, i.e., phonemes and graphemes. During training, models for both subword units are trained, and then during recognition either both or just one subword unit is used. We have studied this system for a continuous speech recognition task in American English. Our studies show that grapheme information used along with phoneme information improves the performance of ASR.

[13] M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation.
In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
In Automatic Text Processing tasks, documents are usually represented in the bag-of-words space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with a particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.

[14] I. McCowan, D. Gatica-Perez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions.
In Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 56-75. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[15] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, F. Cardinaux, S. Marcel, S. Bengio, C. Sanderson, N. Poh, Y. Rodriguez, J. Czyz, L. Vandendorpe, C. McCool, S. Lowther, S. Sridharan, V. Chandran, R. Paredes, E. Vidal, L. Bai, L. Shen, Y. Wang, C. Yueh-Hsuan, L. Hsien-Chang, H. Yi-Ping, A. Heinrichs, M. Muller, A. Tewes, C. von der Malsburg, R. Wurtz, Z. Wang, F. Xue, Y. Ma, Q. Yang, C. Fang, X. Ding, S. Lucey, R. Goss, and H. Schneiderman.
Face authentication test on the BANCA database.
In International Conference on Pattern Recognition, ICPR, volume 4, pages 523-532, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper details the results of a Face Authentication Test (FAT2004) held in conjunction with the 17th International Conference on Pattern Recognition. The contest was held on the publicly available BANCA database according to a defined protocol. The competition also had a sequestered part in which institutions had to submit their algorithms for independent testing. 13 different verification algorithms from 10 institutions submitted results. Also, a standard set of face recognition software packages from the Internet was used to provide a baseline performance measure.

[16] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, S. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, N. Poh, Y. Rodriguez, K. Kryszczuk, J. Czyz, L. Vandendorpe, J. Ng, H. Cheung, and B. Tang.
Face authentication competition on the BANCA database.
In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 8-15. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper details the results of a face verification competition held in conjunction with the First International Conference on Biometric Authentication. The contest was held on the publicly available BANCA database according to a defined protocol. Six different verification algorithms from 4 academic and commercial institutions submitted results. Also, a standard set of face recognition software from the internet was used to provide a baseline performance measure.

[17] N. Poh and S. Bengio.
Noise-robust multi-stream fusion for text-independent speaker authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-stream approaches have proven to be very successful in speech recognition tasks and to a certain extent in speaker authentication tasks. In this study we propose a noise-robust multi-stream text-independent speaker authentication system. This system has two steps: first train the stream experts under clean conditions and then train the combination mechanism to merge the scores of the stream experts under both clean and noisy conditions. The idea here is to take advantage of the rather predictable reliability and diversity of streams under different conditions. Hence, noise-robustness is mainly due to the combination mechanism. This two-step approach offers several practical advantages: the stream experts can be trained in parallel (e.g., by using several machines); heterogeneous types of features can be used and the resultant system can be robust to different noise types (wide bands or narrow bands) as compared to sub-streams. An important finding is that a trade-off is often necessary between the overall good performance under all conditions (clean and noisy) and good performance under clean conditions. To reconcile this trade-off, we propose to give more emphasis or prior to clean conditions, thus, resulting in a combination mechanism that does not deteriorate under clean conditions (as compared to the best stream) yet is robust to noisy conditions.

[18] N. Poh and S. Bengio.
Towards predicting optimal subsets of base classifiers in biometric authentication tasks.
In S. Bengio and H. Bourlard, editors, Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 159-172. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams, is a very promising approach, both in experiments and to some extent in various real-life applications. However, combining too many systems (base-experts) will also increase both hardware and computation costs. One way to select a subset of optimal base-experts out of N is to carry out the experiments explicitly. There are 2^N-1 possible combinations. In this paper, we propose an analytical solution to this task when a weighted sum fusion mechanism is used. The proposed approach is at least valid in the domain of person authentication. It has a complexity that is additive between the number of examples and the number of possible combinations while the conventional approach, using brute-force experimenting, is multiplicative between these two terms. Hence, our approach will scale better with large fusion problems. Experiments on the BANCA multi-modal database verified our approach. While we will consider here fusion in the context of identity verification via biometrics, or simply biometric authentication, it can also have an important impact in meetings because this a priori information can assist in retrieving highlights in meeting analysis as in “who said what”. Furthermore, automatic meeting analysis also requires many systems working together and involves possibly many audio-visual media streams. Development in fusion of identity verification will provide insights into how fusion in meetings can be done. The ability to predict fusion performance is another important step towards understanding the fusion problem.
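The number of candidate combinations mentioned in the abstract is the number of non-empty subsets of N base-experts, which is 2 to the power N, minus 1. A minimal sketch of the brute-force enumeration follows; the function name and the example expert labels are hypothetical, and the sketch only enumerates subsets (it does not implement the analytical solution proposed in the paper).

```python
from itertools import combinations


def nonempty_subsets(experts):
    """Yield every non-empty subset of the given base-experts,
    from single experts up to the full set."""
    for size in range(1, len(experts) + 1):
        for subset in combinations(experts, size):
            yield subset


# Hypothetical labels for three base-experts.
subsets = list(nonempty_subsets(["face", "speech_mfcc", "speech_lfcc"]))
print(len(subsets))  # 2**3 - 1 subsets
```

A brute-force search would run one fusion experiment over all examples for each yielded subset, which is where the multiplicative cost described above comes from.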

[19] N. Poh and S. Bengio.
Why do multi-stream, multi-band and multi-modal approaches work on biometric user authentication tasks?
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 893-896, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-band, multi-stream and multi-modal approaches have proven to be very successful both in experiments and in real-life applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of a theoretical study to justify why and how they work, when one combines the streams at the feature or classifier score levels. In this paper, we attempt to shed light on the latter subject. While there exists literature discussing this aspect, a study on the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, Multi-Layer-Perceptrons and Support Vector Machines always performs better than the average performance of the underlying experts. Furthermore, in practice, most combined experts using the methods mentioned above perform better than the best underlying expert.
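The link between correlation and variance reduction under the mean operator can be checked numerically with a small sketch. It relies only on the standard identity that the variance of the mean of N scores equals the sum of all entries of their covariance matrix divided by N squared. The equal-variance, equal-correlation covariance matrix used here is an illustrative assumption, and the function names are hypothetical; the sketch does not reproduce the paper's analysis of Equal Error Rate.

```python
def equicorrelated_cov(n, sigma2, rho):
    """Covariance matrix of n expert scores sharing the same variance
    sigma2 and the same pairwise correlation rho (illustrative assumption)."""
    return [[sigma2 if i == j else rho * sigma2 for j in range(n)]
            for i in range(n)]


def variance_of_mean(cov):
    """Variance of the averaged score: sum of all covariance entries / n^2."""
    n = len(cov)
    return sum(sum(row) for row in cov) / (n * n)


def average_variance(cov):
    """Average of the individual expert variances (the diagonal)."""
    n = len(cov)
    return sum(cov[i][i] for i in range(n)) / n


cov = equicorrelated_cov(3, 1.0, 0.5)
print(variance_of_mean(cov), average_variance(cov))
```

In this equicorrelated case the variance of the mean reduces to sigma2 * (1 + (n - 1) * rho) / n, so it matches the average variance only when rho is 1 and shrinks as the experts become less correlated.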

[20] N. Poh, C. Sanderson, and S. Bengio.
Spectral subband centroids as complementary features for speaker authentication.
In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 631-639. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Most conventional features used in speaker authentication are based on estimation of spectral envelopes in one way or another, e.g., Mel-scale Filterbank Cepstrum Coefficients (MFCCs), Linear-scale Filterbank Cepstrum Coefficients (LFCCs) and Relative Spectral Perceptual Linear Prediction (RASTA-PLP). In this study, Spectral Subband Centroids (SSCs) are examined. These features are the centroid frequency in each subband. They have properties similar to formant frequencies but are limited to a given subband. Empirical experiments carried out on the NIST2001 database using SSCs, MFCCs, LFCCs and their combinations by concatenation suggest that SSCs are somewhat more robust compared to conventional MFCC and LFCC features as well as being partially complementary.

[21] A. Pozdnoukhov and S. Bengio.
Tangent vector kernels for invariant image classification with SVMs.
In International Conference on Pattern Recognition, ICPR, volume 3, pages 486-489, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents an application of the general sample-to-object approach to the problem of invariant image classification. The approach results in defining new SVM kernels based on tangent vectors that take into account prior information on known invariances. Real data of face images are used for experiments. The presented approach integrates virtual sample and tangent distance methods. We observe a significant increase in performance with respect to standard approaches. The experiments also illustrate (as expected) that prior knowledge becomes more important as the amount of training data decreases.

[22] Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Estimating the quality of face localization for face verification.
In IEEE International Conference on Image Processing, ICIP, pages 581-584, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Face localization is the process of finding the exact position of a face in a given image. This can be useful in several applications such as face tracking or person authentication. The purpose of this paper is to show that the error made during the localization process may have different impacts depending on the final application. Hence in order to evaluate the performance of a face localization algorithm, we propose to embed the final application (here face verification) into the performance measuring process. Moreover, in this paper, we estimate this embedding using either a multilayer perceptron or a K nearest neighbor algorithm in order to speed up the evaluation process. We show on the BANCA database that our proposed measure best matches the final verification results when comparing several localization algorithms, on various performance measures currently used in face localization.

[23] C. Sanderson and S. Bengio.
Extrapolating single view face models for multi-view recognition.
In International Conference on Intelligent Sensors, Sensor Networks and Information Processing, ISSNIP, pages 581-586, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Performance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when there is only one training image available. We address this problem by extending each statistical frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors.

[24] C. Sanderson and S. Bengio.
Statistical transformations of frontal models for non-frontal face verification.
In IEEE International Conference on Image Processing, ICIP, pages 585-588, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In the framework of a face verification system using local features and a Gaussian Mixture Model based classifier, we address the problem of non-frontal face verification (when only a single (frontal) training image is available) by extending each client's frontal face model with artificially synthesized models for non-frontal views. Furthermore, we propose the Maximum Likelihood Shift (MLS) synthesis technique and compare its performance against a Maximum Likelihood Linear Regression (MLLR) based technique (originally developed for adapting speech recognition systems) and the recently proposed "difference between two Universal Background Models" (UBMdiff) technique. All techniques rely on prior information and learn how a generic face model for the frontal view is related to generic models at non-frontal views. Experiments on the FERET database suggest that the proposed MLS technique is more suitable than MLLR (due to a lower number of free parameters) and UBMdiff (due to lack of heuristics). The results further suggest that extending frontal models considerably reduces errors.

[25] A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of unconstrained handwritten texts using HMMs and statistical language models.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):709-720, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, error rate is reduced by ~50% for single writer data and by ~25% for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.

[26] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud.
Modeling individual and group actions in meetings: a two-layer HMM framework.
In IEEE Workshop on Event Mining at the Conference on Computer Vision and Pattern Recognition, CVPR, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements this concept in a principled manner, and that has advantages over previous work. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.

[27] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud.
Multimodal group action clustering in meetings.
In ACM Multimedia Workshop on Video Surveillance and Sensor Networks, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of clustering multimodal group actions in meetings using a two-layer HMM framework. Meetings are structured as sequences of group actions. Our approach aims at creating one cluster for each group action, where the number of group actions and the action boundaries are unknown a priori. In our framework, the first layer models typical actions of individuals in meetings using supervised HMM learning and low-level audio-visual features. A number of options that explicitly model certain aspects of the data (e.g., asynchrony) were considered. The second layer models the group actions using unsupervised HMM learning. The two layers are linked by a set of probability-based features produced by the individual action layer as input to the group action layer. The methodology was assessed on a set of multimodal turn-taking group actions, using a public five-hour meeting corpus. The results show that the use of multiple modalities and the layered framework are advantageous, compared to various baseline methods.

2003

[1] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran.
The BANCA database and evaluation protocol.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625-638. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we describe the acquisition and content of a new large, realistic and challenging multi-modal database intended for training and testing multi-modal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios (controlled, degraded and adverse) over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.

[2] M. Barnard, J.-M. Odobez, and S. Bengio.
Multi-modal audio-visual event recognition for football analysis.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 469-478, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The recognition of events within multi-modal data is a challenging problem. In this paper we focus on the recognition of events by using both audio and video data. We investigate the use of data fusion techniques in order to recognise these sequences within the framework of Hidden Markov Models (HMM) used to model audio and video data sequences. Specifically we look at the recognition of play and break sequences in football and the segmentation of football games based on these two events. Recognising relatively simple semantic events such as this is an important step towards full automatic indexing of such video material. These experiments were done using approximately 3 hours of data from two games of the Euro96 competition. We propose that modelling the audio and video streams separately for each sequence and fusing the decisions from each stream should yield an accurate and robust method of segmenting multi-modal data.

[3] S. Bengio.
An asynchronous hidden Markov model for audio-visual speech recognition.
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15, pages 1237-1244. MIT Press, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions.

[4] S. Bengio.
Multimodal authentication using asynchronous HMMs.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770-777. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audio-visual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other state-of-the-art techniques.

[5] H. Bourlard, S. Bengio, and K. Weber.
Towards robust and adaptive speech recognition models.
In M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld, editors, Mathematical Foundations of Speech and Language Processing, Institute for Mathematics and its Applications (IMA) Series, Volume 138, pages 169-189. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches, which somewhat deviate from the usual ASR approaches but which have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multi-style” training. More specifically, we will motivate and briefly describe new approaches based on multi-stream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[6] R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures.
International Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349-365, 2003.
.ps.gz | .pdf | .djvu | weblink | abstract]
A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.
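As a rough illustration of the alternation described in this abstract (train an expert on each region, then repartition the data with a gater), here is a minimal toy sketch. It is not the authors' implementation: the experts are constant predictors instead of SVMs, the gater is a hypothetical nearest-centroid rule over 1-D inputs, and all function names are invented for this example.

```python
from statistics import mean


def train_experts(ys, partition, experts):
    """Train each toy expert (a constant) on the targets in its own region."""
    for k in range(len(experts)):
        subset = [y for y, p in zip(ys, partition) if p == k]
        if subset:  # keep the previous expert if its region is empty
            experts[k] = mean(subset)


def hard_mixture(xs, ys, initial_partition, n_experts=2, n_iters=5):
    """Alternate between expert training and gater-based repartitioning.

    Assumption for illustration only: constant experts and a
    nearest-centroid gater stand in for the models used in the paper.
    """
    partition = list(initial_partition)
    experts = [0.0] * n_experts
    centroids = [0.0] * n_experts
    for _ in range(n_iters):
        # 1. Each expert sees only the examples currently assigned to it.
        train_experts(ys, partition, experts)
        # 2. For each example, find the expert with the lowest squared error.
        best = [min(range(n_experts), key=lambda k: (experts[k] - y) ** 2)
                for y in ys]
        # 3. Fit the gater: one input-space centroid per expert.
        for k in range(n_experts):
            members = [x for x, b in zip(xs, best) if b == k]
            if members:
                centroids[k] = mean(members)
        # 4. Repartition the training data with the gater.
        new_partition = [
            min(range(n_experts), key=lambda k: abs(centroids[k] - x))
            for x in xs
        ]
        if new_partition == partition:
            break  # the partition is stable, so the experts already match it
        partition = new_partition
    else:
        # Iteration budget exhausted: refit experts on the final partition.
        train_experts(ys, partition, experts)
    return experts, centroids, partition


def predict(x, experts, centroids):
    """Route x to a single expert through the gater (hard assignment)."""
    k = min(range(len(experts)), key=lambda k: abs(centroids[k] - x))
    return experts[k]
```

Because every expert is trained only on its own region, step 1 could be run in parallel across experts, which is the source of the training-time reduction the abstract refers to.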

[7] J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audio-visual person identity verification.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752-760. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system combines the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited, as in a smart card. Our experimental results show that multimodal fusion makes it possible to significantly reduce the user template size while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.

[8] M. Magimai Doss, T. A. Stephenson, H. Bourlard, and S. Bengio.
Phoneme-grapheme based speech recognition system.
In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU, pages 94-98, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
State-of-the-art ASR systems typically use phonemes as the subword units. In this paper, we investigate a system where the word models are defined in terms of two different subword units, i.e., phonemes and graphemes. We train models for both subword units, and then perform decoding using either both or just one subword unit. We have studied this system for American English, where there is a weak correspondence between graphemes and phonemes. The results from our studies show that there is good potential in using graphemes as auxiliary subword units.

[9] D. Gatica-Perez, I. McCowan, M. Barnard, S. Bengio, and H. Bourlard.
On automatic annotation of meeting databases.
In IEEE International Conference on Image Processing, ICIP, volume 3, pages 629-632, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we discuss meetings as an application domain for multimedia content analysis. Meeting databases are a rich data source suitable for a variety of audio, visual and multi-modal tasks, including speech recognition, people and action recognition, and information retrieval. We specifically focus on the task of semantic annotation of audio-visual (AV) events, where annotation consists of assigning labels (event names) to the data. In order to develop an automatic annotation system in a principled manner, it is essential to have a well-defined task, a standard corpus and an objective performance measure. In this work we address each of these issues to automatically annotate events based on participant interactions.

[10] N. Gilardi and S. Bengio.
Comparison of four machine learning algorithms for spatial data analysis.
In G. Dubois, J. Malczewski, and M. DeCort, editors, Mapping radioactivity in the environment - Spatial Interpolation Comparison 97, pages 222-237. Office for Official Publications of the European Communities, Luxembourg, 2003.
.ps.gz | .pdf | .djvu | abstract]
This chapter proposes a clear methodology on how to use machine learning algorithms for spatial data analysis in order to avoid any bias and ultimately obtain a fair estimate of their performance on new data. Four different machine learning algorithms are presented, namely multilayer perceptrons (MLP), mixture of experts (ME), support vector regression (SVR) and a local version of the latter (local SVR). Evaluation criteria adapted to geostatistical problems are also presented in order to adequately compare different models on the same dataset. Finally, an experimental comparison is given on the SIC97 dataset as well as an analysis of the results.

[11] Q. Le and S. Bengio.
Client dependent GMM-SVM models for speaker verification.
In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443-451. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability to handle variable-size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A cross-validation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.

[12] I. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D. Moore, P. Wellner, and H. Bourlard.
Modeling human interaction in meetings.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 4, pages 748-751, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings by modeling the joint behaviour of participants. Many meeting actions, such as presentations, discussions and consensus, are characterised by similar or complementary behaviour across participants. Recognising these meaningful actions is an important step towards the goal of providing effective browsing and summarisation of processed meetings. In this work, a corpus of meetings was collected in a room equipped with a number of microphones and cameras. The corpus was labeled in terms of a predefined set of meeting actions characterised by global behaviour. In experiments, audio and visual features for each participant are extracted from the raw data and the interaction of participants is modeled using HMM-based approaches. Initial results on the corpus demonstrate the ability of the system to recognise the set of meeting actions.

[13] I. McCowan, D. Gatica-Perez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions.
In Ambient Intelligence, Lecture Notes in Computer Science, volume LNCS 2875, pages 235-251, Eindhoven, 2003. Springer-Verlag.
.ps.gz | .pdf | .djvu | weblink | abstract]
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[14] K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar, F. Deravi, and N. Mavity.
Face verification competition on the XM2VTS database.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 964-974. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink ]
[15] N. Poh and S. Bengio.
Non-linear variance reduction techniques in biometric authentication.
In IEEE Multimodal User Authentication Workshop, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple synthetic samples, different extractors (features) and biometric modalities. The scores are combined using the average operator, Multi-Layer Perceptron and Support Vector Machines. It is found empirically that VR via modalities is the best technique, followed by VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[16] N. Poh, S. Marcel, and S. Bengio.
Improving face authentication using virtual samples.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 3, pages 233-236, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we present a simple yet effective way to improve a face verification system by generating multiple virtual samples from the unique image corresponding to an access request. These images are generated using simple geometric transformations. This method is often used during training to improve the accuracy of a neural network model by making it robust against minor translation, scale and orientation change. The main contribution of this paper is to introduce such a method during testing. By generating N images from one single image and propagating them to a trained network model, one obtains N scores. By merging these scores using a simple mean operator, we show that the variance of merged scores is decreased by a factor between 1 and N. An experiment is carried out on the XM2VTS database which achieves new state-of-the-art performance.
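The "factor between 1 and N" can be illustrated with a short sketch under a simplifying assumption that the abstract does not state: all N scores share the same variance and the same pairwise correlation rho in [0, 1]. In that case the variance of the mean is sigma^2 (1 + (N - 1) rho) / N. The function names below are hypothetical and chosen only for this example.

```python
def merge_scores(scores):
    """Merge the N scores of the virtual samples with a simple mean operator."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)


def variance_reduction_factor(n_samples, correlation):
    """Ratio Var(single score) / Var(mean of n_samples scores).

    Assumption (not from the abstract): equal variances and a common
    pairwise correlation coefficient between 0 and 1.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation is assumed to lie in [0, 1]")
    return n_samples / (1.0 + (n_samples - 1) * correlation)
```

Under this assumption, uncorrelated scores (rho = 0) give the full factor N, perfectly correlated scores (rho = 1) give a factor of 1, and intermediate correlations fall in between.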

[17] C. Sanderson and S. Bengio.
Augmenting frontal face models for non-frontal verification.
In IEEE Multimodal User Authentication Workshop, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this work we propose to address the problem of non-frontal face verification when only a frontal training image is available (e.g. a passport photograph) by augmenting a client's frontal face model with artificially synthesized models for non-frontal views. In the framework of a Gaussian Mixture Model (GMM) based classifier, two techniques are proposed for the synthesis: UBMdiff and LinReg. Both techniques rely on a priori information and learn how face models for the frontal view are related to face models at a non-frontal view. The synthesis and augmentation approach is evaluated by applying it to two face verification systems: Principal Component Analysis (PCA) based and DCTmod2 (Sanderson et al., 2003) based; the two systems are a representation of holistic and non-holistic approaches, respectively. Results from experiments on the FERET database suggest that in almost all cases, frontal model augmentation has beneficial effects for both systems; they also suggest that the LinReg technique (which is based on multivariate regression of classifier parameters) is more suited to the PCA based system and that the UBMdiff technique (which is based on differences between two general face models) is more suited to the DCTmod2 based system. The results also support the view that the standard DCTmod2/GMM system (trained on frontal faces) is less affected by out-of-plane rotations than the corresponding PCA/GMM system; moreover, the DCTmod2/GMM system using augmented models is, in almost all cases, more robust than the corresponding PCA/GMM system.

[18] C. Sanderson and S. Bengio.
Robust features for frontal authentication in difficult image conditions.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 495-504. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper we extend the recently proposed DCT-mod2 feature extraction technique (which utilizes polynomial coefficients derived from 2D DCT coefficients obtained from horizontally & vertically neighbouring blocks) via the use of various windows and diagonally neighbouring blocks. We also propose enhanced PCA, where traditional PCA feature extraction is combined with DCT-mod2. Results using test images corrupted by a linear and a non-linear illumination change, white Gaussian noise and compression artefacts, show that use of diagonally neighbouring blocks and windowing is detrimental to robustness against illumination changes while being useful for increasing robustness against white noise and compression artefacts. We also show that the enhanced PCA technique retains all the positive aspects of traditional PCA (that is, robustness against white noise and compression artefacts) while also being robust to illumination changes; moreover, enhanced PCA outperforms PCA with histogram equalisation pre-processing.

[19] C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F. BenZeghiba, F. Cardinaux, and S. Marcel.
Speech & face based biometric authentication at IDIAP.
In International Conference on Multimedia and Expo, ICME, volume 3, pages 1-4, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We present an overview of recent research at IDIAP on speech & face based biometric authentication. This paper covers user-customised passwords, adaptation techniques, confidence measures (for use in fusion of audio & visual scores), face verification in difficult image conditions, as well as other related research issues. We also overview the open source Torch library, which has aided in the implementation of the above mentioned techniques.

[20] A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of large vocabulary cursive handwritten text.
In International Conference on Document Analysis and Recognition, ICDAR, pages 1101-1105, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a system for the offline recognition of cursive handwritten lines of text. The system is based on continuous density HMMs and Statistical Language Models. The system recognizes data produced by a single writer. No a priori knowledge is used about the content of the text to be recognized. Changes in the experimental setup with respect to the recognition of single words are highlighted. The results show a recognition rate of ~85% with a lexicon containing 50,000 words. The experiments were performed over a publicly available database.

[21] K. Weber, S. Ikbal, S. Bengio, and H. Bourlard.
Robust speech recognition and feature extraction using HMM2.
Computer Speech and Language, 17(2-3):195-211, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents the theoretical basis and preliminary experimental results of a new HMM model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state specific, HMMs working in the acoustic feature space. Thus, while the primary HMM is performing the usual time warping and integration, the secondary HMMs are responsible for extracting/modeling the possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as a more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it will be shown that HMM2 can be used to extract noise robust features, supposed to correspond to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database.

2002

[1] S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification.
Information Fusion, 3(4):267-276, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures yielded a statistically significant improvement in performance, while other measures produced reliable confidence levels over the fusion decisions.

[2] H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, editors.
Proceedings of the Twelfth IEEE Workshop on Neural Networks for Signal Processing (NNSP). IEEE Press, 2002.
[3] H. Bourlard and S. Bengio.
Hidden Markov models and other finite state automata for sequence processing.
In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, Second Edition. The MIT Press, 2002.
.ps.gz | .pdf | .djvu | idiap-RR ]
[4] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
Neural Computation, 14(5):1105-1114, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

[5] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633-640. MIT Press, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest), as well as a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest.

[6] R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures.
In S. Lee and A. Verri, editors, International Workshop on Pattern Recognition with Support Vector Machines, SVM, Lecture Notes in Computer Science, volume LNCS 2388, pages 8-23. Springer-Verlag, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.

[7] N. Gilardi, S. Bengio, and M. Kanevski.
Conditional Gaussian mixture models for environmental risk mapping.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 777-786, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper proposes the use of Gaussian Mixture Models to estimate conditional probability density functions in an environmental risk mapping context. A conditional Gaussian Mixture Model has been compared to the geostatistical method of Sequential Gaussian Simulations and performs well in reconstructing local PDFs. The data sets used for this comparison are parts of the digital elevation model of Switzerland.

[8] S. Marcel and S. Bengio.
Improving face verification using skin color information.
In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 2, pages 11-15. IEEE Computer Society Press, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. State-of-the-art methods often use the gray-scale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance.

[9] S. Marcel, C. Marcel, and S. Bengio.
A state-of-the-art neural network for robust face verification.
In COST275 Workshop on the advent of Biometrics on the Internet, 2002.
.ps.gz | .pdf | .djvu | abstract]
The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. State-of-the-art methods often use the gray-scale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance and that the proposed model achieves robust state-of-the-art results.

[10] J. Mariéthoz and S. Bengio.
A comparative study of adaptation methods for speaker verification.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Real-life speaker verification systems are often implemented using client model adaptation methods, since the amount of data available for each client is often too low to consider plain Maximum Likelihood methods. While the Bayesian Maximum A Posteriori (MAP) adaptation method is commonly used in speaker verification, other methods have proven to be successful in related domains such as speech recognition. This paper proposes an experimental comparison between three well-known adaptation methods, namely MAP, Maximum Likelihood Linear Regression, and finally EigenVoices. All three methods are compared to the more classical Maximum Likelihood method, and results are given for a subset of the 1999 NIST Speaker Recognition Evaluation database.

[11] N. Poh, S. Bengio, and J. Korczak.
A multi-sample multi-source model for biometric authentication.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 375-384, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this study, two techniques that can improve the authentication process are examined: (i) multiple samples and (ii) multiple biometric sources. We propose the fusion of multiple samples obtained from multiple biometric sources at the score level. By using the average operator, both the theoretical and empirical results show that integrating as many samples and as many biometric sources as possible can improve the overall reliability of the system. This strategy is called the multi-sample multi-source approach. It was tested on a real-life database using neural networks trained in a one-versus-all configuration.

[12] A. Vinciarelli and S. Bengio.
Offline cursive word recognition using continuous density hidden Markov models trained with PCA or ICA features.
In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 3, pages 81-84. IEEE Computer Society Press, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This work presents an Offline Cursive Word Recognition System dealing with single writer samples. The system was a continuous density hidden Markov model trained using either the raw data, or data transformed using Principal Component Analysis or Independent Component Analysis. Both techniques significantly improved the recognition rate of the system. Preprocessing, normalization and feature extraction are described in detail as well as the training technique adopted. Several experiments were performed using a publicly available database. The accuracy obtained is the highest presented in the literature over the same data.

[13] A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based off-line cursive script recognition.
In Proceedings of the 8th International Conference on Frontiers in Handwriting Recognition, pages 287-291, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work presents the application of HMM adaptation techniques to the problem of Off-Line Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when fewer than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.

[14] A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based off-line cursive script recognition.
Pattern Recognition Letters, 23(8):905-916, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This work presents the application of HMM adaptation techniques to the problem of Off-Line Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when fewer than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.

[15] K. Weber, S. Bengio, and H. Bourlard.
Increasing speech recognition noise robustness with HMM2.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 1, pages 929-932, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
The purpose of this paper is to investigate the behavior of HMM2 models for the recognition of noisy speech. It has previously been shown that HMM2 is able to model dynamically important structural information inherent in the speech signal, often corresponding to formant positions/tracks. As formant regions are known to be robust in adverse conditions, HMM2 seems particularly promising for improving speech recognition robustness. Here, we review different variants of the HMM2 approach with respect to their application to noise-robust automatic speech recognition. It is shown that HMM2 has the potential to tackle the problem of mismatch between training and testing conditions, and that a multi-stream combination of (already noise-robust) cepstral features and formant-like features (extracted by HMM2) improves the noise robustness of a state-of-the-art automatic speech recognition system.

[16] K. Weber, F. de Wet, B. Cranen, L. Boves, S. Bengio, and H. Bourlard.
Evaluation of formant-like features for ASR.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002.
.ps.gz | .pdf | .djvu | abstract]
This paper investigates possibilities to automatically find a low-dimensional, formant-related physical representation of the speech signal, which is suitable for automatic speech recognition (ASR). This aim is motivated by the fact that formants have been shown to be discriminant features for ASR. Combinations of automatically extracted formant-like features and 'conventional', noise-robust, state-of-the-art features (such as MFCCs including spectral subtraction and cepstral mean subtraction) have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formant-like features, and to compare these features to hand-labeled formant tracks as well as to standard MFCCs in terms of their performance on a vowel classification task.

2001

[1] S. Bengio and J. Mariéthoz.
Learning the decision function for speaker verification.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 1, pages 425-428, 2001.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper explores the possibility of replacing the usual thresholding decision rule of log likelihood ratios used in speaker verification systems by more complex and discriminant decision functions based for instance on Linear Regression models or Support Vector Machines. Current speaker verification systems, based on generative models such as HMMs or GMMs, can indeed easily be adapted to use such decision functions. Experiments on both text dependent and text independent tasks always yielded performance improvements, sometimes significant ones.

[2] H. Bourlard, S. Bengio, and K. Weber.
New approaches towards robust and adaptive speech recognition.
In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, NIPS 13, pages 751-757. MIT Press, 2001.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we discuss some new research directions in automatic speech recognition (ASR) which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[3] R. Collobert and S. Bengio.
SVMTorch: Support vector machines for large-scale regression problems.
Journal of Machine Learning Research, JMLR, 1:143-160, 2001.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l squared memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for large-scale regression problems from Flake and Lawrence (2000), yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

[4] J.-L. DesGranges, P. Agin, and S. Bengio.
The use of predictive models of breeding bird assemblages for assessing and monitoring forest bird diversity.
In A. Franc, O. Laroussinie, and T. Karjalainen, editors, Criteria and Indicators for Sustainable Forest Management at the Forest Management Unit Level, volume 38, pages 181-200. European Forest Institute Proceedings, 2001.
[5] K. Weber, S. Bengio, and H. Bourlard.
HMM2- extraction of formant features and their use for robust ASR.
In Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 2001.
.ps.gz | .pdf | .djvu | abstract]
As recently introduced, an HMM2 can be considered as a particular case of an HMM mixture in which the HMM emission probabilities (usually estimated through Gaussian mixtures or an artificial neural network) are modeled by state-dependent, feature-based HMM (referred to as frequency HMM). A general EM training algorithm for such a structure has already been developed. Although there are numerous motivations for using such a structure, and many possible ways to exploit it, this paper will mainly focus on one particular instantiation of HMM2 in which the frequency HMM will be used to extract formant structure information, which will then be used as additional acoustic features in a standard Automatic Speech Recognition (ASR) system. While the fact that this architecture is able to automatically extract meaningful formant information is interesting by itself, empirical results will also show the robustness of these features to noise, and their potential to enhance state-of-the-art noise-robust HMM-based ASR.

[6] K. Weber, S. Bengio, and H. Bourlard.
Speech recognition using advanced HMM2 features.
In Proceedings of the Automatic Speech Recognition and Understanding Workshop, ASRU, pages 65-68, 2001.
.ps.gz | .pdf | .djvu | weblink | abstract]
HMM2 is a particular hidden Markov model where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) state-dependent frequency-based HMMs. As shown previously, a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach towards using a full HMM2 as feature extractor, working in the spectral domain, and extracting robust formant-like features for a standard ASR system. HMM2 performs a nonlinear, state-dependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To further improve the HMM2 system, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where word error rates of 15%, using only a 4-dimensional feature vector (3 formant-like parameters and one time index), were obtained.

2000

[1] S. Bengio and Y. Bengio.
Taking on the curse of dimensionality in joint distributions using neural networks.
IEEE Transactions on Neural Networks, special issue on data mining and knowledge discovery, 11(3):550-557, 2000.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables (thus reducing significantly the number of parameters). Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[2] Y. Bengio and S. Bengio.
Modeling high-dimensional discrete data with multi-layer neural networks.
In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, NIPS 12, pages 400-406. MIT Press, 2000.
.ps.gz | .pdf | .djvu | weblink | abstract]
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[3] N. Gilardi and S. Bengio.
Local machine learning models for spatial data analysis.
Journal of Geographic Information and Decision Analysis, 4(1):11-28, 2000.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper, we compare different machine learning algorithms applied to non stationary spatial data analysis. We show that models taking into account local variability of the data are better than models which are trained globally on the whole dataset. Two global models (Support Vector Regression and Multilayer Perceptrons) and two local models (a local version of Support Vector Regression and Mixture of Experts) were compared over the Spatial Interpolation Comparison 97 (SIC97) dataset, and the results are presented and compared to previous results obtained on the same dataset.

[4] T. A. Stephenson, H. Bourlard, S. Bengio, and A. C. Morris.
Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, Beijing, China, October 2000.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Current technology for automatic speech recognition (ASR) uses hidden Markov models (HMMs) that recognize spoken speech using the acoustic signal. However, no use is made of the causes of the acoustic signal: the articulators. We present here a dynamic Bayesian network (DBN) model that utilizes an additional variable for representing the state of the articulators. A particular strength of the system is that, while it uses measured articulatory data during its training, it does not need to know these values during recognition. As Bayesian networks are not used often in the speech community, we give an introduction to them. After describing how they can be used in ASR, we present a system to do isolated word recognition using articulatory information. Recognition results are given, showing that a system with both acoustics and inferred articulatory positions performs better than a system with only acoustics.

[5] K. Weber, S. Bengio, and H. Bourlard.
HMM2- a novel approach to HMM emission probability estimation.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, Beijing, China, October 2000.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we discuss and investigate a new method to estimate local emission probabilities in the framework of hidden Markov models (HMM). Each feature vector is considered to be a sequence and is supposed to be modeled by yet another HMM. Therefore, we call this approach 'HMM2'. There is a variety of possible topologies of such HMM2 systems, e.g. incorporating trellis or ergodic HMM structures. Preliminary HMM2 speech recognition experiments on cepstral and spectral features yielded worse results than state-of-the-art systems. However, we believe that HMM2 systems have a lot of potential advantages and are therefore worth investigating further.

Before 2000

[1] S. Bengio.
Intégration des systèmes tutoriels traditionnels et des systèmes tutoriels intelligents.
Master's thesis, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 1989.
[2] S. Bengio.
Optimisation d'une règle d'apprentissage pour réseaux de neurones artificiels.
PhD thesis, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 1993.
.ps.gz | .pdf | .djvu ]
[3] S. Bengio and Y. Bengio.
An EM algorithm for asynchronous input/output hidden Markov models.
In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996.
.ps.gz | .pdf | .djvu | abstract]
In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend here this model to the case of asynchronous sequences, and show an Expectation-Maximization algorithm for training such models.

[4] S. Bengio, Y. Bengio, and J. Cloutier.
Use of genetic programming for the search of a new learning rule for neural networks.
In Proceedings of the First Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, volume 1, pages 324-327, 1994.
.ps.gz | .pdf | .djvu | weblink ]
[5] S. Bengio, Y. Bengio, and J. Cloutier.
On the search for new learning rules for ANNs.
Neural Processing Letters, 2(4):26-30, 1995.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper, we present a framework where a learning rule can be optimized within a parametric learning rule space. We define what we call parametric learning rules and present a theoretical study of their generalization properties when estimated from a set of learning tasks and tested over another set of tasks. We corroborate the results of this study with practical experiments.

[6] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
Aspects théoriques de l'optimisation d'une règle d'apprentissage.
In Actes de la conférence Neuro-Nîmes 1992, Nîmes, France, 1992.
.ps.gz | .pdf | .djvu ]
[7] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
On the optimization of a synaptic learning rule.
In Conference on Optimality in Biological and Artificial Networks, Dallas, USA, 1992.
.ps.gz | .pdf | .djvu ]
[8] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
Generalization of a parametric learning rule.
In S. Gielen and B. Kappen, editors, Proceedings of the International Conference on Artificial Neural Networks, ICANN'93, pages 502-502, Amsterdam, Netherlands, 1993. Springer-Verlag.
.ps.gz | .pdf | .djvu ]
[9] S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
On the optimization of a synaptic learning rule.
In D. S. Levine and W. R. Elsberry, editors, Optimality in Biological and Artificial Networks?, pages 265-287. Lawrence Erlbaum Associates, 1997.
.ps.gz | .pdf | .djvu ]
[10] S. Bengio, Y. Bengio, J. Robert, and G. Bélanger.
Stochastic learning of strategic equilibria for auctions.
Neural Computation, 11(5):1199-1209, 1999.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a new application of stochastic adaptive learning algorithms to the computation of strategic equilibria in auctions. The proposed approach addresses the problems of tracking a moving target and balancing exploration (of action space) versus exploitation (of better modeled regions of action space). Neural networks are used to represent a stochastic decision model for each bidder. Experiments confirm the correctness and usefulness of the approach.

[11] S. Bengio, G. Brassard, Y. Desmedt, C. Goutier, and J.-J. Quisquater.
Secure implementation of identification systems.
Journal of Cryptology, 4(3):175-183, 1991.
weblink | abstract]
In this paper we demonstrate that widely known identification systems, such as the public-file-based Feige-Fiat-Shamir scheme, can be insecure if proper care is not taken with their implementation. We suggest possible solutions. On the other hand, identity-based versions of the Feige-Fiat-Shamir scheme are conceptually more complicated than necessary.

[12] S. Bengio, F. Clerot, A. Gravey, and D. Collobert.
Dynamical resource reservation schemes in an ATM network using neural network-based traffic prediction.
In D. D. Kouvatsos, editor, Proceedings of the IFIP Fifth International Workshop on Performance Modelling and Evaluation of ATM Networks. Kluwer B. V., 1997.
.ps.gz | .pdf | .djvu | weblink | abstract]
Using real traffic data, we show that neural network-based prediction techniques can be used to predict the queuing behaviour of highly bursty traffic typical of LAN interconnection accurately enough to allow dynamical renegotiation of a DBR traffic contract at the edge of an ATM network. The performance of predictor-based in-service renegotiation is evaluated in terms of renegotiation errors and reserved bandwidth for the DBR traffic handling capability, and is shown to be very encouraging for the use of connectionist prediction techniques for the management of bursty traffic in ATM networks.

[13] S. Bengio, F. Fessant, and D. Collobert.
A connectionist system for medium-term horizon time series prediction.
In International Workshop on Applications of Neural Networks to Telecommunications, IWANNT, Stockholm, Sweden, 1995.
.ps.gz | .pdf | .djvu ]
[14] S. Bengio, F. Fessant, and D. Collobert.
Use of modular architectures for time series prediction.
Neural Processing Letters, 3(2):101-106, 1996.
.ps.gz | .pdf | .djvu | weblink ]
[15] S. Bengio, C. Frasson, and J. Gecsei.
Integrating traditional and intelligent computerized tutoring.
In Fourth International Symposium on Computer and Information Sciences, Cesme, Turkey, 1989.
[16] S. Bengio, C. Frasson, and J. Gecsei.
Utilisation de systèmes d'EAO dans des systèmes d'EIAO.
In Sixième symposium canadien sur la technologie pédagogique, Halifax, Canada, 1989.
[17] Y. Bengio and S. Bengio.
Training asynchronous input/output hidden Markov models.
In AAAI Spring Symposium on Computational Issues in Learning Models of Dynamical Systems, 1996.
.ps.gz | .pdf | .djvu ]
[18] Y. Bengio, S. Bengio, and J. Cloutier.
Learning a synaptic learning rule.
In Proceedings of the International Joint Conference on Neural Networks, IJCNN, volume 2, pages 969-974, Seattle, USA, 1991.
.ps.gz | .pdf | .djvu | weblink ]
[19] Y. Bengio, S. Bengio, J.-F. Isabelle, and Y. Singer.
Shared context probabilistic transducers.
In Advances in Neural Information Processing Systems, NIPS 10, 1998.
.ps.gz | .pdf | .djvu | weblink | abstract]
Recently, a model for supervised learning of probabilistic transducers represented by suffix trees was introduced. However, this algorithm tends to build very large trees, requiring very large amounts of computer memory. In this paper, we propose a new, more compact, transducer model in which one shares the parameters of distributions associated to contexts yielding similar conditional output distributions. We illustrate the advantages of the proposed algorithm with comparative experiments on inducing a noun phrase recognizer.

[20] Y. Bengio, S. Bengio, Y. Pouliot, and P. Agin.
A neural network to detect homologies in proteins.
In Advances in Neural Information Processing Systems, NIPS 2, San Mateo, CA, USA, 1990. Morgan Kaufmann.
.ps.gz | .pdf | .djvu | weblink ]
[21] Y. Desmedt, C. Goutier, and S. Bengio.
Special uses and abuses of the Fiat-Shamir passport protocol.
In Advances in Cryptology, Crypto, Lecture Notes in Computer Science, volume LNCS 293, pages 21-39, Santa Barbara, USA, 1988. Springer Verlag.
.ps.gz | .pdf | .djvu | weblink | abstract]
If the physical description of a person would be unique and adequately used and tested, then the security of the Fiat-Shamir scheme is not based on zero-knowledge. Otherwise some new frauds exist. The Feige-Fiat-Shamir scheme always suffers from these frauds. Using an extended notion of subliminal channels, several other undetectable abuses of the Fiat-Shamir protocol, which are not possible with ordinary passports, are discussed. This technique can be used by a terrorist sponsoring country to communicate 500 new words of secret information each time a tourist passport is verified. A non-trivial solution to avoid these subliminal channel problems is presented. The notion of relative zero-knowledge is introduced.

[22] F. Fessant, S. Bengio, and D. Collobert.
On the prediction of solar activity using different neural network models.
Annales Geophysicae, 14:20-26, 1996.
.ps.gz | .pdf | .djvu | weblink ]
[23] J.-Y. Potvin and S. Bengio.
The vehicle routing problem with time windows - part II: Genetic search.
INFORMS Journal on Computing, 8(2):165-172, 1996.
.ps.gz | .pdf | .djvu | weblink ]

Technical Reports

2013

[1] S. Bengio, J. Dean, D. Erhan, E. Ie, Q. Le, A. Rabinovich, J. Shlens, and Y. Singer.
Using web co-occurrence statistics for improving image categorization.
Technical Report 1312.5697, ArXiv, 2013.
.ps.gz | .pdf | .djvu | weblink | abstract]
Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural not to expect to see an elephant appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using co-occurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) co-occur in web documents, we obtain a good estimate of expected co-occurrences in visual data. We then cast the problem of combining textual co-occurrence statistics with the predictions of image-based classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Despite the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets.

2006

[1] D. Grangier and S. Bengio.
A discriminative approach for the retrieval of images from text queries.
Technical Report IDIAP-RR 06-15, IDIAP, 2006.
.ps.gz | .pdf | .djvu | abstract]
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative approach: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.

[2] M. Keller and S. Bengio.
A multitask learning approach to document representation using unlabeled data.
Technical Report IDIAP-RR 06-44, IDIAP, 2006.
.ps.gz | .pdf | .djvu | abstract]
Text categorization is intrinsically a supervised learning task, which aims at relating a given text document to one or more predefined categories. Unfortunately, labeling such databases of documents is a painful task. We present in this paper a method that takes advantage of huge amounts of unlabeled text documents available in digital format, to counterbalance the relatively smaller available amount of labeled text documents. A Siamese MLP is trained in a multi-task framework in order to solve two concurrent tasks: using the unlabeled data, we search for a mapping from the documents' bag-of-word representation to a new feature space emphasizing similarities and dissimilarities among documents; simultaneously, this mapping is constrained to also give good text categorization performance over the labeled dataset. Experimental results on Reuters RCV1 suggest that, as expected, performance over the labeled task increases as the amount of unlabeled data increases.

2005

[1] S. Bengio.
Joint training of multi-stream HMMs.
Technical Report IDIAP-RR 05-22, IDIAP, 2005.
.ps.gz | .pdf | .djvu | abstract]
This report describes a novel technique to efficiently and jointly train several streams of data describing the same sequence of events using a unified EM algorithm.

[2] D. Grangier and S. Bengio.
A discriminative decoder for the recognition of phoneme sequences.
Technical Report IDIAP-RR 05-67, IDIAP, 2005.
.ps.gz | .pdf | .djvu | abstract]
In this report, we propose a discriminative decoder for the recognition of phoneme sequences, i.e. the identification of the uttered phoneme sequence from a speech recording. This task is solved as a 3-step process: a phoneme classifier first classifies each acoustic frame, then temporal consistency features (TCF) are extracted from the phoneme classifier outputs, and finally a sequence decoder identifies the phoneme sequence according to the TCF.

[3] J. Mariéthoz and S. Bengio.
Can a professional imitator fool a GMM-based speaker verification system?
Technical Report IDIAP-RR 05-61, IDIAP, 2005.
.ps.gz | .pdf | .djvu | abstract]
This paper presents an attempt at assessing empirically how a state-of-the-art text-independent speaker verification system behaves when confronted with impostor attempts from a professional imitator who knows particularly well how to imitate the clients he tries to impersonate. Empirical evidence shows that, fortunately, current speaker verification systems are indeed robust to such attempts, even when humans are not able to discriminate between true and impostor accesses (a website with some examples is provided to convince the reader). Furthermore, we show that the knowledge of the lexical content of the access significantly helps the imitator, although fortunately not enough to fool the system. This study thus represents a first step in assessing a speaker verification system against true, informed impostors.

[4] A. Pozdnoukhov and S. Bengio.
A kernel classifier for distributions.
Technical Report IDIAP-RR 05-32, IDIAP, 2005.
.ps.gz | .pdf | .djvu | abstract]
This paper presents a new algorithm for classifying distributions. The algorithm combines the principle of margin maximization and a kernel trick, applied to distributions. Thus, it combines the discriminative power of support vector machines and the well-developed framework of generative models. It can be applied to a number of real-life tasks which include data represented as distributions. The algorithm can also be applied for introducing some prior knowledge on invariances into a discriminative model. We illustrate this approach in detail for the case of Gaussian distributions, using a toy problem. We also present experiments devoted to the real-life problem of invariant image classification.

[5] D. Zhang, D. Gatica-Perez, D. Roy, and S. Bengio.
Modeling interactions from email communication.
Technical Report IDIAP-RR 05-51, IDIAP, 2005.
.ps.gz | .pdf | .djvu | abstract]
Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topic-based interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization. We also present an email visualization and retrieval system which could not only search for relevant emails, but also for the relevant email users.

2004

[1] S. Chiappa and S. Bengio.
Sequence classification with input-output hidden Markov models.
Technical Report IDIAP-RR 04-13, IDIAP, 2004.
.ps.gz | .pdf | .djvu | abstract]
We present a training and testing method for Input-Output Hidden Markov Models that is particularly suited for classification of sequences in which class information accumulates over time. We discuss two such cases: the discrimination of mental tasks from sequences of EEG features, common in Brain Computer Interface research, and phoneme classification from sequences of acoustic features for speech recognition. The objective function is modified so that training focuses on the improvement of classification accuracy. For both tasks the algorithm performs significantly better than the alternative solution proposed in the literature, specifically designed for other types of sequences.

[2] C. Dimitrakakis and S. Bengio.
Estimates of parameter distributions for optimal action selection.
Technical Report IDIAP-RR 04-72, IDIAP, 2004.
.ps.gz | .pdf | .djvu | abstract]
We present a general method for maintaining estimates of the distribution of parameters in arbitrary models. This is then applied to the estimation of probability distributions over actions in value-based reinforcement learning. While this approach is similar to other techniques that maintain a confidence measure for action-values, it nevertheless offers a new insight into current techniques and reveals potential avenues of further research.

[3] M. Keller, J. Mariéthoz, and S. Bengio.
Significance tests for bizarre measures in 2-class classification tasks.
Technical Report IDIAP-RR 04-34, IDIAP, 2004.
.ps.gz | .pdf | .djvu | abstract]
Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as F1, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such bizarre measures. We furthermore assess the quality of these tests on a real-life large dataset.

[4] C. Sanderson and S. Bengio.
Statistical transformation techniques for face verification using faces rotated in depth.
Technical Report IDIAP-RR 04-04, IDIAP, 2004.
.ps.gz | .pdf | .djvu | abstract]
In the framework of a Bayesian classifier based on mixtures of gaussians, we address the problem of non-frontal face verification (when only a single (frontal) training image is available) by extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg based technique is more suited than the MLLR based techniques; for the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the DCTmod2 based system is less affected by out-of-plane rotations than the PCA based system; this can be attributed to the local feature representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between face parts, allowing for movement of facial areas.

2003

[1] R. Collobert and S. Bengio.
A new margin-based criterion for efficient gradient descent.
Technical Report IDIAP-RR 03-16, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
During the last few decades, several papers were published about second-order optimization methods for gradient descent based learning algorithms. Unfortunately, these methods usually have a cost in time close to O(n^3) per iteration, and O(n^2) in space, where n is the number of parameters to optimize, which is intractable with large optimization systems usually found in real-life problems. Moreover, these methods are usually not easy to implement. Many enhancements have also been proposed in order to overcome these problems, but most of them still cost O(n^2) in time per iteration. Instead of trying to solve a hard optimization problem using complex second-order tricks, we propose to modify the problem itself in order to optimize a simpler one, by simply changing the cost function used during training. Furthermore, we will argue that analyzing the Hessian resulting from the choice of various cost functions is very informative and could help in the design of new machine learning algorithms. For instance, we propose in this paper a version of the Support Vector Machines criterion applied to Multi Layer Perceptrons, which yields very good training and generalization performance in practice. Several empirical comparisons on two benchmark data sets are given to justify this approach.

[2] M. Keller and S. Bengio.
Textual data representation.
Technical Report IDIAP-RR 03-49, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
We address in this report the problem of formally representing textual data. First, this problem is placed in the context of automatic text processing. Then, the weaknesses of the basic document representation, i.e. the bag-of-words representation, are explained and some state-of-the-art methods claiming to overcome these weaknesses are reviewed. Moreover, we propose a novel graphical model, the Theme Topic Mixture Model, which also claims to do so, in addition to giving a probabilistic framework in which documents are considered.

[3] J. Mariéthoz and S. Bengio.
An alternative to silence removal for text-independent speaker verification.
Technical Report IDIAP-RR 03-51, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
State-of-the-art text independent speaker verification systems use silence/speech detectors to get rid of silence frames which are considered to be non discriminative. This paper explores a possible replacement to this silence/speech detector by considering each Gaussian of a GMM as modeling a specific speech class and by using discriminant models like SVMs and MLPs in order to fuse the corresponding class-specific scores to obtain a final decision. Experiments on the NIST 2000 database yielded statistically significantly better performance for the new model as compared to our best baseline system involving a silence/speech detector, without having to rely on uncertain hypotheses.

[4] I. McCowan, D. Gatica-Perez, and S. Bengio.
Meeting data collection specifications.
Communication Report IDIAP-COM 03-10, IDIAP, 2003.
.ps.gz | .pdf | .djvu ]
[5] N. Poh and S. Bengio.
Variance reduction techniques in biometric authentication.
Technical Report IDIAP-RR 03-17, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple real samples, synthetic samples, different extractors (features) and biometric modalities. It is found empirically that VR via modalities is the best technique, followed by VR via real samples, VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[6] A. Pozdnoukhov and S. Bengio.
From samples to objects in kernel methods.
Technical Report IDIAP-RR 03-29, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
This paper presents a general method for incorporating prior knowledge into kernel methods. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. Two implementation techniques of this method, based on analytical kernel jittering and the vicinal risk minimization principle, are considered. Empirical results on one artificial dataset and one real dataset based on EEG signals demonstrate the performance of the proposed method.

2002

[1] R. Collobert, S. Bengio, and J. Mariéthoz.
Torch: a modular machine learning software library.
Technical Report IDIAP-RR 02-46, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Many scientific communities have expressed a growing interest in machine learning algorithms recently, mainly due to the generally good results they provide, compared to traditional statistical or AI approaches. However, these machine learning algorithms are often complex to implement and to use properly and efficiently. We thus present in this paper a new machine learning software library in which most state-of-the-art algorithms have already been implemented and are available in a unified framework, in order for scientists to be able to use them, compare them, and even extend them. More interestingly, this library is freely available under a BSD license and can be retrieved on the web by everyone.

[2] Q. Le and S. Bengio.
Hybrid generative-discriminative models for speech and speaker recognition.
Technical Report IDIAP-RR 02-06, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Generative probability models such as Hidden Markov Models are usually used for modeling sequences of data because of their ability to handle variable size sequences and missing information. On the other hand, because of their discriminative properties, discriminative models like Support Vector Machines (SVMs) usually yield better performance in classification problems and can construct flexible decision boundaries. An ideal classifier should have all the power of these two complementary approaches. A series of recent papers has suggested some techniques for mixing generative models and discriminative models. In one of them a fixed size vector (the Fisher score) containing sufficient statistics of a sequence is computed for a previously trained HMM and can then be used as input to a discriminative model for classification. The purpose of this project is thus to study, experiment, enhance and adapt these new approaches of integrating discriminative models such as SVMs into generative models for sequence processing problems, such as speaker and speech recognition.

[3] F. Porée, J. Mariéthoz, S. Bengio, and F. Bimbot.
The BANCA database and experimental protocol for speaker verification.
Technical Report IDIAP-RR 02-13, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Identity verification has become a very important research topic recently, particularly using methods based on the face or the voice of the individuals. In the context of the BANCA European project, a novel multi-modal database was recently recorded, spanning 5 European languages, 2 modalities (face and voice), 2 microphones, 2 cameras and almost 300 individuals. As we believe that this database offers many advantages for this research community, this paper essentially presents the database and its associated experimental protocol, as well as a baseline state-of-the-art system using the voice data for a text-independent speaker verification task.

[4] A. Vinciarelli and S. Bengio.
Transforming the feature vectors to improve HMM based cursive word recognition systems.
Technical Report IDIAP-RR 02-32, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Although many Offline Cursive Word Recognition systems are based on HMMs, no attention was ever paid, to our knowledge, to the fact that the feature vectors are typically not in the most suitable form for modeling. They are most of the time correlated and embedded in a space of dimension higher than their Intrinsic Dimension. This leads to several problems and has a negative influence on the performance. By applying some transforms it is possible to solve, or at least to attenuate, such problems resulting in data easier to model and in systems with higher recognition rate. In this work, we used Principal Component Analysis (linear and nonlinear) and Independent Component Analysis. A reduction of the error rate by up to 30.3% (over single writer data) and 16.2% (over multiple writer samples) is shown to be achieved.

2001

[1] S. Bengio and J. Mariéthoz.
Comparison of client model adaptation schemes.
Technical Report IDIAP-RR 01-25, IDIAP, 2001.
.ps.gz | .pdf | .djvu ]
[2] S. Bengio, J. Mariéthoz, and S. Marcel.
Evaluation of biometric technology on XM2VTS.
Technical Report IDIAP-RR 01-21, IDIAP, 2001.
.ps.gz | .pdf | .djvu ]
[3] K. Weber, S. Bengio, and H. Bourlard.
A pragmatic view of the application of HMM2 for ASR.
Technical Report IDIAP-RR 01-23, IDIAP, 2001.
.ps.gz | .pdf | .djvu | abstract]
This report investigates the HMM2 approach recently introduced in the framework of automatic speech recognition. HMM2 can be seen as a mixture of HMMs, where a conventional primary HMM (processing a time series of speech data) is supported on a lower level by a secondary HMM, working along the frequency dimension of a temporal segment of speech. The application of HMM2 to the speech signal is motivated by numerous potential advantages. However, speech recognition results did not show the expected performance improvements. In this paper, the HMM2 approach is pragmatically analyzed and evaluated on speech data, revealing some problems and suggesting potential solutions.

2000

[1] S. Bengio, H. Bourlard, and K. Weber.
An EM algorithm for HMMs with emission distributions represented by HMMs.
Technical Report IDIAP-RR 00-11, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
A novel approach to represent emission distributions of Hidden Markov Models is presented in this paper. Whereas they are usually estimated with Gaussian mixtures or neural networks, we propose to estimate them with another HMM, but in feature space. This representation, referred to here as HMM2, could enable the model to more accurately represent feature correlations with fewer parameters than standard HMMs. A full derivation of an EM algorithm is given in order to globally train all the HMM2 parameters. Preliminary experiments on speech data show promising results.

[2] R. Collobert and S. Bengio.
On the convergence of SVMTorch, an algorithm for large-scale regression problems.
Technical Report IDIAP-RR 00-24, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
Recently, many researchers have proposed decomposition algorithms for SVM regression problems. In a previous paper, we also proposed such an algorithm, named SVMTorch. In this paper, we show that while there is actually no convergence proof for any other decomposition algorithm for SVM regression problems to our knowledge, such a proof does exist for SVMTorch for the particular case where no shrinking is used and the size of the working set is equal to 2, which is the size that gave the fastest results on most experiments we have done. This convergence proof is in fact mainly based on the convergence proof given by Keerthi and Gilbert for their SVM classification algorithm.

[3] R. Collobert and S. Bengio.
Support vector machines for large-scale regression problems.
Technical Report IDIAP-RR 00-17, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l^2 memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another SVM algorithm for large-scale regression problems from Flake and Lawrence, yielded significant time improvements.

Before 2000

[1] S. Bengio, G. Brassard, Y. Desmedt, C. Goutier, and J.-J. Quisquater.
Aspects and importance of secure implementations of identification systems.
Technical Report Manuscript M209, Philips Research Laboratory, Brussel, Belgium, 1987.
[2] S. Bengio and C. Frasson.
Utilisation d'EAO dans des systèmes d'EIAO.
Technical Report 651, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montreal (QC) Canada, 1988.
[3] Y. Bengio and S. Bengio.
Learning a synaptic learning rule.
Technical Report 751, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, Montreal (QC) Canada, 1990.
[4] A. Gravey, S. Bengio, D. Collobert, and F. Clerot.
Utilisation de techniques de prédiction neuromimétiques pour la négotiation dynamique des paramètres de contrat de trafic dans un réseau ATM.
Technical Report NT/LAA/EIA/132, France Télécom CNET, Lannion, France, 1996.
[5] J.-Y. Potvin and S. Bengio.
A genetic based heuristic for the vehicle routing problem with time windows.
Technical Report CRT-953, Centre de Recherche sur les Transports, Université de Montréal, 1994.

Organized by Topics

Speech

[1] S. Bengio.
An asynchronous hidden Markov model for audio-visual speech recognition.
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15, pages 1237-1244. MIT Press, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions.

[2] S. Bengio.
Multimodal speech processing using asynchronous hidden Markov models.
Information Fusion, 5(2):81-89, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[3] S. Bengio and Y. Bengio.
An EM algorithm for asynchronous input/output hidden Markov models.
In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996.
.ps.gz | .pdf | .djvu | abstract]
In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend here this model to the case of asynchronous sequences, and show an Expectation-Maximization algorithm for training such models.

[4] S. Bengio, H. Bourlard, and K. Weber.
An EM algorithm for HMMs with emission distributions represented by HMMs.
Technical Report IDIAP-RR 00-11, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
A novel approach to represent emission distributions of Hidden Markov Models is presented in this paper. Whereas they are usually estimated with Gaussian mixtures or neural networks, we propose to estimate them with another HMM, but in feature space. This representation, referred to here as HMM2, could enable the model to more accurately represent feature correlations with fewer parameters than standard HMMs. A full derivation of an EM algorithm is given in order to globally train all the HMM2 parameters. Preliminary experiments on speech data show promising results.

[5] Y. Bengio and S. Bengio.
Training asynchronous input/output hidden Markov models.
In AAAI Spring Symposium on Computational Issues in Learning Models of Dynamical Systems, 1996.
.ps.gz | .pdf | .djvu ]
[6] H. Bourlard and S. Bengio.
Hidden Markov models and other finite state automata for sequence processing.
In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, Second Edition. The MIT Press, 2002.
.ps.gz | .pdf | .djvu | idiap-RR ]
[7] H. Bourlard, S. Bengio, M. Magimai Doss, Q. Zhu, B. Mesot, and N. Morgan.
Towards using hierarchical posteriors for flexible automatic speech recognition systems.
In Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speech-to-text) Rich Transcription (RT'04) Workshop, 2004.
.ps.gz | .pdf | .djvu | abstract]
Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task.

[8] H. Bourlard, S. Bengio, and K. Weber.
New approaches towards robust and adaptive speech recognition.
In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, NIPS 13, pages 751-757. MIT Press, 2001.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we discuss some new research directions in automatic speech recognition (ASR), which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[9] H. Bourlard, S. Bengio, and K. Weber.
Towards robust and adaptive speech recognition models.
In M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld, editors, Mathematical Foundations of Speech and Language Processing, Institute for Mathematics and its Applications (IMA) Series, Volume 138, pages 169-189. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches, which somewhat deviate from the usual ASR approaches but which have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multi-style” training. More specifically, we will motivate and briefly describe new approaches based on multi-stream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[10] F. de Wet, K. Weber, L. Boves, B. Cranen, S. Bengio, and H. Bourlard.
Evaluation of formant-like features for automatic speech recognition.
Journal of the Acoustical Society of America (JASA), 116(3):1781-1792, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates possibilities to find a low-dimensional, formant-related physical representation of speech signals, which is suitable for automatic speech recognition. This aim is motivated by the fact that formants are known to be discriminant features for speech recognition. Combinations of automatically extracted formant-like features and state-of-the-art, noise-robust features have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formant-like features, i.e. robust formants and HMM2 features, and to compare these features to hand-labeled formants as well as to mel-frequency cepstral coefficients in terms of their performance on a vowel classification task. The speech data and hand-labeled formants that were used in this study are a subset of the American English vowels database presented in [Hillenbrand et al., J. Acoust. Soc. Am. 97, 3099-3111 (1995)]. Classification performance was measured on the original, clean data as well as in (simulated) adverse conditions. In combination with standard automatic speech recognition methods, the classification performance of the robust formant and HMM2 features compares very well to the performance of the hand-labeled formants.

[11] C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 5, pages 621-624, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.

[12] C. Dimitrakakis and S. Bengio.
Boosting word error rates.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 501-504, 2005.
We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the well-known Numbers 95 database that indicate the importance of this temporal probability distribution.

[13] M. Magimai Doss, S. Bengio, and H. Bourlard.
Joint decoding for phoneme-grapheme continuous speech recognition.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 1, pages 177-180, 2004.
Standard ASR systems typically use phonemes as the subword units. Preliminary studies have shown that the performance of the ASR system could be improved by using graphemes as additional subword units. In this paper, we investigate such a system where the word models are defined in terms of two different subword units, i.e., phonemes and graphemes. During training, models for both subword units are trained, and then during recognition either both or just one subword unit is used. We have studied this system for a continuous speech recognition task in American English. Our studies show that grapheme information used along with phoneme information improves the performance of ASR.

[14] M. Magimai Doss, T. A. Stephenson, H. Bourlard, and S. Bengio.
Phoneme-grapheme based speech recognition system.
In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU, pages 94-98, 2003.
State-of-the-art ASR systems typically use phonemes as the subword units. In this paper, we investigate a system where the word models are defined in terms of two different subword units, i.e., phonemes and graphemes. We train models for both subword units, and then perform decoding using either both or just one subword unit. We have studied this system for American English, where there is weak correspondence between graphemes and phonemes. The results from our studies show that there is good potential in using graphemes as auxiliary subword units.

[15] J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
Speech Communication, 51:317-329, 2009.
This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulting discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI-Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system.

[16] H. Ketabdar, H. Bourlard, and S. Bengio.
Hierarchical multi-stream posterior based speech recognition system.
In Machine Learning for Multimodal Interaction: Second International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3869, 2005.
In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on the “state gamma posterior” definition (typically used in standard HMM training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to the state-of-the-art Tandem systems.

[17] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Developing and enhancing posterior based speech recognition systems.
In Proceedings of the 9th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2005.
Local state or phone posterior probabilities are often investigated as local scores (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) to improve speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving posterior estimates, using acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). In the present work, the enhanced posterior distribution is associated with the “gamma” distribution typically used in standard HMM training, and estimated from local likelihoods (GMM) or local posteriors (ANN). This approach results in a family of new HMM based systems, where only posterior probabilities are used, while also providing a new, principled approach towards a hierarchical use/integration of these posteriors, from the frame level up to the phone and word levels, and integrating the appropriate context and prior knowledge in each level. In the present work, we used the resulting posteriors as local scores in a Viterbi decoder. On the OGI Numbers'95 database, this resulted in improved recognition performance, compared to a state-of-the-art hybrid HMM/ANN system.

[18] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Posterior based keyword spotting with a priori thresholds.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
In this paper, we propose a new posterior based scoring approach for keyword and non-keyword (garbage) elements. The estimation of these scores is based on the HMM state posterior probability definition, taking into account long contextual information and prior knowledge (e.g. keyword model topology). The state posteriors are then integrated into keyword and garbage posteriors for every frame. These posteriors are used to make a decision on detection of the keyword at each frame. The frame level decisions are then accumulated (in this case, by counting) to make a global decision on having the keyword in the utterance. In this way, the contribution of possible outliers is minimized, as opposed to the conventional Viterbi decoding approach which accumulates likelihoods. Experiments on keywords from the Conversational Telephone Speech (CTS) and Numbers'95 databases are reported. Results show that the new scoring approach leads to a better trade-off between true and false alarms compared to the Viterbi decoding approach, while also providing the possibility to precalculate keyword specific spotting thresholds related to the length of the keywords.

[19] H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Using more informative posterior probabilities for speech recognition.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2006.
In this paper, we present initial investigations towards boosting posterior probability based speech recognition systems by estimating more informative posteriors taking into account acoustic context (e.g., the whole utterance), as well as possible prior information (such as phonetic and lexical knowledge). These posteriors are estimated based on the HMM state posterior probability definition (typically used in standard HMM training). This approach provides a new, principled, theoretical framework for hierarchical estimation/use of more informative posteriors integrating appropriate context and prior knowledge. In the present work, we used the resulting posteriors as local scores for decoding. On the OGI numbers database, this resulted in significant performance improvement, compared to using MLP estimated posteriors for decoding (hybrid HMM/ANN approach), for clean and especially for noisy speech. The system is also shown to be much less sensitive to tuning factors (such as phone deletion penalty, language model scaling) compared to the standard HMM/ANN and HMM/GMM systems; thus, in practice, it does not need to be tuned to achieve the best possible performance.

[20] T. A. Stephenson, H. Bourlard, S. Bengio, and A. C. Morris.
Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, Beijing, China, October 2000.
Current technology for automatic speech recognition (ASR) uses hidden Markov models (HMMs) that recognize spoken speech using the acoustic signal. However, no use is made of the causes of the acoustic signal: the articulators. We present here a dynamic Bayesian network (DBN) model that utilizes an additional variable for representing the state of the articulators. A particular strength of the system is that, while it uses measured articulatory data during its training, it does not need to know these values during recognition. As Bayesian networks are not used often in the speech community, we give an introduction to them. After describing how they can be used in ASR, we present a system to do isolated word recognition using articulatory information. Recognition results are given, showing that a system with both acoustics and inferred articulatory positions performs better than a system with only acoustics.

[21] K. Weber, S. Bengio, and H. Bourlard.
HMM2 - a novel approach to HMM emission probability estimation.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, Beijing, China, October 2000.
In this paper, we discuss and investigate a new method to estimate local emission probabilities in the framework of hidden Markov models (HMM). Each feature vector is considered to be a sequence and is supposed to be modeled by yet another HMM. Therefore, we call this approach 'HMM2'. There is a variety of possible topologies of such HMM2 systems, e.g. incorporating trellis or ergodic HMM structures. Preliminary HMM2 speech recognition experiments on cepstral and spectral features yielded worse results than state-of-the-art systems. However, we believe that HMM2 systems have a lot of potential advantages and are therefore worth investigating further.

[22] K. Weber, S. Bengio, and H. Bourlard.
HMM2 - extraction of formant features and their use for robust ASR.
In Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 2001.
As recently introduced, an HMM2 can be considered as a particular case of an HMM mixture in which the HMM emission probabilities (usually estimated through Gaussian mixtures or an artificial neural network) are modeled by state-dependent, feature-based HMM (referred to as frequency HMM). A general EM training algorithm for such a structure has already been developed. Although there are numerous motivations for using such a structure, and many possible ways to exploit it, this paper will mainly focus on one particular instantiation of HMM2 in which the frequency HMM will be used to extract formant structure information, which will then be used as additional acoustic features in a standard Automatic Speech Recognition (ASR) system. While the fact that this architecture is able to automatically extract meaningful formant information is interesting by itself, empirical results will also show the robustness of these features to noise, and their potential to enhance state-of-the-art noise-robust HMM-based ASR.

[23] K. Weber, S. Bengio, and H. Bourlard.
A pragmatic view of the application of HMM2 for ASR.
Technical Report IDIAP-RR 01-23, IDIAP, 2001.
This report investigates the HMM2 approach recently introduced in the framework of automatic speech recognition. HMM2 can be seen as a mixture of HMMs, where a conventional primary HMM (processing a time series of speech data) is supported on a lower level by a secondary HMM, working along the frequency dimension of a temporal segment of speech. The application of HMM2 to the speech signal is motivated by numerous potential advantages. However, speech recognition results did not show the expected performance improvements. In this paper, the HMM2 approach is pragmatically analyzed and evaluated on speech data, revealing some problems and suggesting potential solutions.

[24] K. Weber, S. Bengio, and H. Bourlard.
Speech recognition using advanced HMM2 features.
In Proceedings of the Automatic Speech Recognition and Understanding Workshop, ASRU, pages 65-68, 2001.
HMM2 is a particular hidden Markov model where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) state-dependent frequency-based HMMs. As shown previously, a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach towards using a full HMM2 as feature extractor, working in the spectral domain, and extracting robust formant-like features for a standard ASR system. HMM2 performs a nonlinear, state-dependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To further improve the HMM2 system, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where word error rates of 15% were obtained using only a 4-dimensional feature vector (3 formant-like parameters and one time index).

[25] K. Weber, S. Bengio, and H. Bourlard.
Increasing speech recognition noise robustness with HMM2.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 1, pages 929-932, 2002.
The purpose of this paper is to investigate the behavior of HMM2 models for the recognition of noisy speech. It has previously been shown that HMM2 is able to model dynamically important structural information inherent in the speech signal, often corresponding to formant positions/tracks. As formant regions are known to be robust in adverse conditions, HMM2 seems particularly promising for improving speech recognition robustness. Here, we review different variants of the HMM2 approach with respect to their application to noise-robust automatic speech recognition. It is shown that HMM2 has the potential to tackle the problem of mismatch between training and testing conditions, and that a multi-stream combination of (already noise-robust) cepstral features and formant-like features (extracted by HMM2) improves the noise robustness of a state-of-the-art automatic speech recognition system.

[26] K. Weber, F. de Wet, B. Cranen, L. Boves, S. Bengio, and H. Bourlard.
Evaluation of formant-like features for ASR.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002.
This paper investigates possibilities to automatically find a low-dimensional, formant-related physical representation of the speech signal, which is suitable for automatic speech recognition (ASR). This aim is motivated by the fact that formants have been shown to be discriminant features for ASR. Combinations of automatically extracted formant-like features and 'conventional', noise-robust, state-of-the-art features (such as MFCCs including spectral subtraction and cepstral mean subtraction) have previously been shown to be more robust in adverse conditions than state-of-the-art features alone. However, it is not clear how these automatically extracted formant-like features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formant-like features, and to compare these features to hand-labeled formant tracks as well as to standard MFCCs in terms of their performance on a vowel classification task.

[27] K. Weber, S. Ikbal, S. Bengio, and H. Bourlard.
Robust speech recognition and feature extraction using HMM2.
Computer Speech and Language, 17(2-3):195-211, 2003.
This paper presents the theoretical basis and preliminary experimental results of a new HMM model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state specific, HMMs working in the acoustic feature space. Thus, while the primary HMM is performing the usual time warping and integration, the secondary HMMs are responsible for extracting/modeling the possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as a more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it will be shown that HMM2 can be used to extract noise robust features, supposed to correspond to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database.

Multimodal

[1] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran.
The BANCA database and evaluation protocol.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625-638. Springer-Verlag, 2003.
In this paper we describe the acquisition and content of a new large, realistic and challenging multi-modal database intended for training and testing multi-modal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios (controlled, degraded and adverse) over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.

[2] M. Barnard, J.-M. Odobez, and S. Bengio.
Multi-modal audio-visual event recognition for football analysis.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 469-478, 2003.
The recognition of events within multi-modal data is a challenging problem. In this paper we focus on the recognition of events by using both audio and video data. We investigate the use of data fusion techniques in order to recognise these sequences within the framework of Hidden Markov Models (HMM) used to model audio and video data sequences. Specifically we look at the recognition of play and break sequences in football and the segmentation of football games based on these two events. Recognising relatively simple semantic events such as this is an important step towards full automatic indexing of such video material. These experiments were done using approximately 3 hours of data from two games of the Euro96 competition. We propose that modelling the audio and video streams separately for each sequence and fusing the decisions from each stream should yield an accurate and robust method of segmenting multi-modal data.

[3] S. Bengio.
Multimodal authentication using asynchronous HMMs.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770-777. Springer-Verlag, 2003.
It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audio-visual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other state-of-the-art techniques.

[4] S. Bengio.
Multimodal speech processing using asynchronous hidden Markov models.
Information Fusion, 5(2):81-89, 2004.
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[5] S. Bengio and H. Bourlard, editors.
Machine Learning for Multimodal Interaction: First International Workshop, MLMI'2004.
volume 3361 of Lecture Notes in Computer Science. Springer-Verlag, 2005.
This book contains a selection of refereed papers presented at the First Workshop on Machine Learning for Multimodal Interaction (MLMI'04), held in Martigny, Switzerland, from June 21-23, 2004. The workshop was organized and sponsored jointly by three European projects, AMI, PASCAL and M4, as well as a Swiss national research network, IM2. It brings together researchers from different communities working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction in meetings. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multi-disciplinary workshop. The conference program covered a wide range of areas related to machine learning applied to multimodal interaction, and more specifically to multi-modal meeting processing. These areas included human-human communication modeling, speech and visual processing, multi-modal processing, fusion and fission, multi-modal dialog modeling, human-human interaction modeling, multi-modal data structuring and presentation, multimedia indexing and retrieval, meeting structure analysis, meeting summarizing, multimodal meeting annotation, and machine learning applied to the above.

[6] S. Bengio and H. Bourlard.
Multi channel sequence processing.
In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning: First International Workshop, Lecture Notes in Artificial Intelligence, volume LNAI 3635, pages 22-36. Springer-Verlag, 2005.
This paper summarizes some of the current research challenges arising from multi-channel sequence processing. Indeed, multiple real life applications involve simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches towards sequence modeling. However, several challenging research issues are still open, such as taking into account asynchrony and correlation between several feature streams, or handling the underlying growing complexity. In this framework, we discuss here two novel approaches, which recently started to be investigated with success in the context of large multimodal problems. These include the asynchronous HMM, providing a principled approach towards the processing of multiple feature streams, and the layered HMM approach, providing a good formalism for decomposing large and complex (multi-stream) problems into layered architectures. As briefly reported here, combination of these two approaches yielded successful results on several multi-channel tasks, ranging from audio-visual speech recognition to automatic meeting analysis.

[7] S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification.
Information Fusion, 3(4):267-276, 2002.
Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures yielded a statistically significant improvement in performance, while other measures produced reliable confidence levels over the fusion decisions.

[8] J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audio-visual person identity verification.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752-760. Springer-Verlag, 2003.
In this work, we present a multimodal identity verification system based on the fusion of the face image and the text-independent speech data of a person. The system combines the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited, as in a smart card. Our experimental results show that multimodal fusion allows the user template size to be reduced significantly while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.

[9] D. Gatica-Perez, I. McCowan, M. Barnard, S. Bengio, and H. Bourlard.
On automatic annotation of meeting databases.
In IEEE International Conference on Image Processing, ICIP, volume 3, pages 629-632, 2003.
In this paper, we discuss meetings as an application domain for multimedia content analysis. Meeting databases are a rich data source suitable for a variety of audio, visual and multi-modal tasks, including speech recognition, people and action recognition, and information retrieval. We specifically focus on the task of semantic annotation of audio-visual (AV) events, where annotation consists of assigning labels (event names) to the data. In order to develop an automatic annotation system in a principled manner, it is essential to have a well-defined task, a standard corpus and an objective performance measure. In this work we address each of these issues to automatically annotate events based on participant interactions.

[10] D. Gatica-Perez, D. Zhang, and S. Bengio.
Extracting information from multimedia meeting collections.
In 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR, 2005.
Multimedia meeting collections, composed of unedited audio and video streams, handwritten notes, slides, and electronic documents that jointly constitute a raw record of complex human interaction processes in the workplace, have attracted interest due to the increasing feasibility of recording them in large quantities, by the opportunities for information access and retrieval applications derived from the automatic extraction of relevant meeting information, and by the challenges that the extraction of semantic information from real human activities entails. In this paper, we present a succinct overview of recent approaches in this field, largely influenced by our own experiences. We first review some of the existing and potential needs for users of multimedia meeting information systems. We then summarize recent work on various research areas addressing some of these requirements. In more detail, we describe our work on automatic analysis of human interaction patterns from audio-visual sensors, discussing open issues in this domain.

[11] D. Gatica-Perez, I. McCowan, D. Zhang, and S. Bengio.
Detecting group interest-level in meetings.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 489-492, 2005.
Finding relevant segments in meeting recordings is important for summarization, browsing, and retrieval purposes. In this paper, we define relevance as the interest-level that meeting participants manifest as a group during the course of their interaction (as perceived by an external observer), and investigate the automatic detection of segments of high-interest from audio-visual cues. This is motivated by the assumption that there is a relationship between segments of interest to participants, and those of interest to the end user, e.g. of a meeting browser. We first address the problem of human annotation of group interest-level. On a 50-meeting corpus, recorded in a room equipped with multiple cameras and microphones, we found that the annotations generated by multiple people exhibit a good degree of consistency, providing a stable ground-truth for automatic methods. For the automatic detection of high-interest segments, we investigate a methodology based on Hidden Markov Models (HMMs) and a number of audio and visual features. Single- and multi-stream approaches were studied. Using precision and recall as performance measures, the results suggest that (i) the automatic detection of group interest-level is promising, and (ii) while audio in general constitutes the predominant modality in meetings, the use of a multi-modal approach is beneficial.

[12] I. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D. Moore, P. Wellner, and H. Bourlard.
Modeling human interaction in meetings.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 4, pages 748-751, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings by modeling the joint behaviour of participants. Many meeting actions, such as presentations, discussions and consensus, are characterised by similar or complementary behaviour across participants. Recognising these meaningful actions is an important step towards the goal of providing effective browsing and summarisation of processed meetings. In this work, a corpus of meetings was collected in a room equipped with a number of microphones and cameras. The corpus was labeled in terms of a predefined set of meeting actions characterised by global behaviour. In experiments, audio and visual features for each participant are extracted from the raw data and the interaction of participants is modeled using HMM-based approaches. Initial results on the corpus demonstrate the ability of the system to recognise the set of meeting actions.

[13] I. McCowan, D. Gatica-Perez, and S. Bengio.
Meeting data collection specifications.
Communication Report IDIAP-COM 03-10, IDIAP, 2003.
.ps.gz | .pdf | .djvu ]

[14] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(3):305-317, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings. A statistical framework is proposed in which group actions result from the interactions of the individual participants. The group actions are modelled using different HMM-based approaches, where the observations are provided by a set of audio-visual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modelling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.

[15] I. McCowan, D. Gatica-Perez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions.
In Ambient Intelligence, Lecture Notes in Computer Science, volume LNCS 2875, pages 235-251, Eindhoven, 2003. Springer-Verlag.
.ps.gz | .pdf | .djvu | weblink | abstract]
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[16] I. McCowan, D. Gatica-Perez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions.
In Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 56-75. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
People meet in order to interact - disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[17] N. Poh and S. Bengio.
Why do multi-stream, multi-band and multi-modal approaches work on biometric user authentication tasks?
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 5, pages 893-896, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-band, multi-stream and multi-modal approaches have proven to be very successful both in experiments and in real-life applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of theoretical study to justify why and how they work when one combines the streams at the feature or classifier score levels. In this paper, we attempt to shed light on the latter subject. While there exists literature discussing this aspect, the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, Multi-Layer Perceptrons or Support Vector Machines always performs better than the average performance of the underlying experts. Furthermore, in practice, most experts combined using the methods mentioned above perform better than the best underlying expert.

[18] N. Poh and S. Bengio.
F-ratio client-dependent normalisation for biometric authentication tasks.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 721-724, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new client-dependent normalisation to improve biometric authentication systems. There exist many client-dependent score normalisation techniques applied to speaker authentication, such as Z-Norm, D-Norm and T-Norm. Such normalisation is intended to adjust the variation across different client models. We propose “F-ratio” normalisation, or F-Norm, applied to face and speaker authentication systems. This normalisation requires only that as few as two client-dependent accesses are available (the more the better). Unlike previous normalisation techniques, F-Norm considers the client and impostor distributions simultaneously. We show that F-ratio is a natural choice because it is directly related to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that F-Norm actually “interpolates” between client-independent and client-dependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over Z-Norm, client-dependent threshold normalisation or no normalisation.

[19] N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks?
IEEE Transactions on Signal Processing, 53(11):4384-4396, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources such as subbands, streams (with different features) and multimodal data has been shown to be a very promising trend, both in experiments and to some extent in real-life biometric authentication applications. Despite considerable effort on fusion, there is a lack of understanding of the roles and effects of correlation and variance (of both the client and impostor scores of base-classifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance Reduction-Equal Error Rate (VR-EER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we show that Equal Error Rate (EER) can be modeled as a function of F-ratio, which itself is a function of 1) correlation, 2) variance of base-experts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of base-experts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be misleading. Experimental results on the BANCA multimodal database confirm our findings using VR-EER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated base-experts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score) improves fusion.
However, by linking the concept of ambiguity decomposition to the classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (class-dependent) variance. Moreover, by linking the concept of bias-variance-covariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.
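
The relationship sketched in this abstract can be illustrated numerically. The snippet below is a minimal sketch and not the authors' code. It assumes Gaussian client and impostor scores, takes the F-ratio to be the difference of the class means divided by the sum of the class standard deviations, and uses the EER that follows from that assumption, 1/2 - 1/2 erf(F-ratio / sqrt(2)). The fusion model (mean of n experts sharing one standard deviation and one pairwise correlation rho) and all function names are simplifying assumptions made for illustration.

```python
import math


def f_ratio(mu_client, mu_impostor, sigma_client, sigma_impostor):
    """Separation of two score distributions (assumed definition)."""
    return (mu_client - mu_impostor) / (sigma_client + sigma_impostor)


def eer_from_f_ratio(f):
    """EER for Gaussian client and impostor scores with the given F-ratio."""
    return 0.5 - 0.5 * math.erf(f / math.sqrt(2.0))


def fused_sigma(sigma, rho, n):
    """Std of the mean of n scores with common std `sigma` and common
    pairwise correlation `rho` (requires 1 + (n - 1) * rho > 0)."""
    return sigma * math.sqrt((1.0 + (n - 1) * rho) / n)


def fused_eer(mu_client, mu_impostor, sigma, rho, n):
    """EER after averaging n hypothetical experts; class means are unchanged
    by the mean operator, only the spread of the fused score changes."""
    s = fused_sigma(sigma, rho, n)
    return eer_from_f_ratio(f_ratio(mu_client, mu_impostor, s, s))


if __name__ == "__main__":
    # Two experts, client mean +1, impostor mean -1, unit std for both classes.
    for rho in (-0.2, 0.0, 0.5, 1.0):
        print(f"rho={rho:+.1f}  EER={fused_eer(1.0, -1.0, 1.0, rho, 2):.4f}")
```

Under these assumptions, the fused EER falls as rho decreases, and at rho = 1 it equals the single-expert EER, which is consistent with the qualitative claim that positive correlation hurts fusion.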

[20] N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the lack of real multimodal biometric databases as well as an important privacy issue: the fact that too many biometric modalities of the same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question how such a chimeric database can be efficiently used. For instance, the following two questions arise: i) is the performance measured on a chimeric database a good predictor of that measured on a real-user database? and ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database? Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 × 1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases neither degrades nor improves the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[21] N. Poh and S. Bengio.
Database, protocol and tools for evaluating score-level fusion algorithms in biometric authentication.
Pattern Recognition, 39(2):223-233, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there has been no coordinated effort to make a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more effort into understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.

[22] N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess user-dependent variabilities.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]

[23] S. Renals and S. Bengio, editors.
Machine Learning for Multimodal Interaction: Second International Workshop, MLMI'2005.
volume 3869 of Lecture Notes in Computer Science. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the Second Workshop on Machine Learning for Multimodal Interaction (MLMI 2005), held in Edinburgh, Scotland, during 11-13 July 2005. The workshop was organized and sponsored jointly by two European integrated projects, three European Networks of Excellence and a Swiss national research network: AMI, CHIL, HUMAINE, PASCAL, SIMILAR, and IM2. In addition to the main workshop, MLMI 2005 hosted the NIST (US National Institute of Standards and Technology) Meeting Recognition Workshop. This workshop (the third such sponsored by NIST) was centered on the Rich Transcription 2005 Spring Meeting Recognition (RT-05) evaluation of speech technologies within the meeting domain. Building on the success of the RT-04 spring evaluation, the RT-05 evaluation continued the speech-to-text and speaker diarization evaluation tasks and added two new evaluation tasks: speech activity detection and source localization. Given the multiple links between the above projects and several related research areas, and the success of the first MLMI 2004 workshop, it was decided to organize once again a joint workshop bringing together researchers working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The areas covered included: Human-human communication modeling, Speech and visual processing, Multimodal processing, fusion and fission, Multimodal dialog modeling, Human-human interaction modeling, Multimodal data structuring and presentation, Multimedia indexing and retrieval, Meeting structure analysis, Meeting summarizing, Multimodal meeting annotation, and Machine learning applied to the above.

[24] S. Renals, S. Bengio, and J. G. Fiscus, editors.
Machine Learning for Multimodal Interaction: Third International Workshop, MLMI'2006.
volume 4299 of Lecture Notes in Computer Science. Springer-Verlag, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the 3rd Workshop on Machine Learning for Multimodal Interaction (MLMI 2006), held in Bethesda MD, USA during May 1-4, 2006. The workshop was organized and sponsored jointly by the US National Institute for Standards and Technology (NIST), three projects supported by the European Commission (Information Society Technologies priority of the sixth Framework Programme) - the AMI and CHIL Integrated Projects, and the PASCAL Network of Excellence - and the Swiss National Science Foundation national research collaboration, IM2. In addition to the main workshop, MLMI 2006 was co-located with the 4th NIST Meeting Recognition Workshop. This workshop was centered on the Rich Transcription 2006 Spring Meeting Recognition (RT-06) evaluation of speech technologies within the meeting domain. Building on the success of previous evaluations in this domain, the RT-06 evaluation continued evaluation tasks in the areas of speech-to-text, who-spoke-when, and speech activity detection. The conference program featured invited talks, full papers (subject to careful peer review, by at least three reviewers), and posters (accepted on the basis of abstracts) covering a wide range of areas related to machine learning applied to multimodal interaction - and more specifically to multimodal meeting processing, as addressed by the various sponsoring projects. These areas included human-human communication modeling, speech and visual processing, multimodal processing, fusion and fission, human-computer interaction, and the modeling of discourse and dialog, with an emphasis on the application of machine learning. Out of the submitted full papers, about 50% were accepted for publication in the present volume, after authors had been invited to take review comments and conference feedback into account.
The workshop featured invited talks from Roderick Murray-Smith (University of Glasgow), Tsuhan Chen (Carnegie Mellon University) and David McNeill (University of Chicago), and a special session on projects in the area of multimodal interaction including presentations on the VACE, CHIL and AMI projects.

[25] D. Zhang and S. Bengio.
Exploring contextual information in a layered framework for group action recognition.
In IEEE International Conference on Multimedia & Expo, ICME, 2007.
.ps.gz | .pdf | .djvu | abstract]
Contextual information is important for sequence modeling. Hidden Markov models (HMMs) and extensions, which have been widely used for sequence modeling, make simplifying, often unrealistic assumptions on the conditional independence of observations given the class labels, and thus cannot accommodate overlapping features or long-term contextual information. In this paper, we introduce a principled layered framework with three implementation methods that take into account contextual information (as available in the whole or part of the sequence). The first two methods are based on state alpha and gamma posteriors (as usually referred to in the HMM formalism). The third method is based on conditional random fields (CRFs), a conditional model that relaxes the independence assumption on the observations required by HMMs for computational tractability. We illustrate our methods with the application of recognizing group actions in meetings. Experiments and comparison with a standard HMM baseline showed the validity of the proposed approach.

[26] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised adapted HMMs for unusual event detection.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of temporal unusual event detection. Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised model-based approaches. We propose a semi-supervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner. The proposed framework has an iterative structure, which adapts a new unusual event model at each iteration. We show that such a framework can address problems due to the scarcity of training data and the difficulty in pre-defining unusual events. Experiments on audio, visual, and audio-visual data streams illustrate its effectiveness, compared with both supervised and unsupervised baseline methods.

[27] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised meeting event recognition with adapted HMMs.
In IEEE International Conference on Multimedia & Expo, ICME, pages 611-618, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
This paper investigates the use of unlabeled data to help labeled data for audio-visual event recognition in meetings. To deal with situations in which it is difficult to collect enough labeled data to capture event characteristics, but collecting a large amount of unlabeled data is easy, we present a semi-supervised framework using HMM adaptation techniques. Instead of directly training one model for each event, we first train a well-estimated general event model for all events using both labeled and unlabeled data, and then adapt the general model to each specific event model using its own labeled data. We illustrate the proposed approach with a set of eight audio-visual events defined in meetings. Experiments and comparison with the fully-supervised baseline method show the validity of the proposed semi-supervised approach.

[28] D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Modeling individual and group actions in meetings with layered HMMs.
IEEE Transactions on Multimedia, 8(3):509-520, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements such a concept in a principled manner, and that has advantages over previous work. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.

[29] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud.
Modeling individual and group actions in meetings: a two-layer HMM framework.
In IEEE Workshop on Event Mining at the Conference on Computer Vision and Pattern Recognition, CVPR, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements such a concept in a principled manner, and that has advantages over previous work. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.

[30] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud.
Multimodal group action clustering in meetings.
In ACM Multimedia Workshop on Video Surveillance and Sensor Networks, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of clustering multimodal group actions in meetings using a two-layer HMM framework. Meetings are structured as sequences of group actions. Our approach aims at creating one cluster for each group action, where the number of group actions and the action boundaries are unknown a priori. In our framework, the first layer models typical actions of individuals in meetings using supervised HMM learning and low-level audio-visual features. A number of options that explicitly model certain aspects of the data (e.g., asynchrony) were considered. The second layer models the group actions using unsupervised HMM learning. The two layers are linked by a set of probability-based features produced by the individual action layer as input to the group action layer. The methodology was assessed on a set of multimodal turn-taking group actions, using a public five-hour meeting corpus. The results show that the use of multiple modalities and the layered framework are advantageous, compared to various baseline methods.

[31] D. Zhang, D. Gatica-Perez, D. Roy, and S. Bengio.
Modeling interactions from email communication.
In IEEE International Conference on Multimedia & Expo, ICME, 2006.
.ps.gz | .pdf | .djvu | abstract]
Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topic-based interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization.

Biometric Authentication

[1] E. Bailly-Baillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz, J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.-P. Thiran.
The BANCA database and evaluation protocol.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625-638. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we describe the acquisition and content of a new large, realistic and challenging multi-modal database intended for training and testing multi-modal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios (controlled, degraded and adverse) over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.

[2] S. Bengio.
Multimodal authentication using asynchronous HMMs.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770-777. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audio-visual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other state-of-the-art techniques.

[3] S. Bengio.
Multimodal speech processing using asynchronous hidden Markov models.
Information Fusion, 5(2):81-89, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[4] S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification.
Information Fusion, 3(4):267-276, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures improved performance statistically significantly, while other measures produced reliable confidence levels over the fusion decisions.

[5] S. Bengio and J. Mariéthoz.
Comparison of client model adaptation schemes.
Technical Report IDIAP-RR 01-25, IDIAP, 2001.
.ps.gz | .pdf | .djvu ]

[6] S. Bengio and J. Mariéthoz.
Learning the decision function for speaker verification.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 1, pages 425-428, 2001.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper explores the possibility of replacing the usual thresholding decision rule of log likelihood ratios used in speaker verification systems with more complex and discriminant decision functions based, for instance, on Linear Regression models or Support Vector Machines. Current speaker verification systems, based on generative models such as HMMs or GMMs, can indeed easily be adapted to use such decision functions. Experiments on both text dependent and text independent tasks always yielded performance improvements, sometimes significant ones.

[7] S. Bengio and J. Mariéthoz.
The expected performance curve: a new assessment measure for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
ROC and DET curves are often used in the field of person authentication to assess the quality of a model or even to compare several models. We argue in this paper that these curves can be misleading as they compare performance measures that cannot be reached simultaneously by all systems. We propose instead new curves, called Expected Performance Curves (EPC). These curves enable the comparison between several systems according to a criterion, decided by the application, which is used to set thresholds according to a separate validation set. Free software is available to compute these curves. A real case study is used throughout the paper to illustrate it. Finally, note that while this study was done on an authentication problem, it also applies to most 2-class classification tasks.
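The procedure described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' software: the function names are hypothetical, the criterion is assumed to be the weighted sum alpha*FAR + (1-alpha)*FRR, and the reported test measure is assumed to be the half total error rate (HTER).

```python
def far_frr(impostor, client, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in client) / len(client)
    return far, frr


def best_threshold(impostor, client, alpha):
    """Pick the threshold minimising alpha*FAR + (1-alpha)*FRR (assumed criterion)."""
    candidates = sorted(set(impostor) | set(client))
    candidates.append(candidates[-1] + 1.0)  # a threshold that rejects everything

    def cost(t):
        far, frr = far_frr(impostor, client, t)
        return alpha * far + (1 - alpha) * frr

    return min(candidates, key=cost)


def expected_performance_curve(val_imp, val_cli, test_imp, test_cli, alphas):
    """For each alpha, set the threshold on validation data, report HTER on test data."""
    curve = []
    for alpha in alphas:
        t = best_threshold(val_imp, val_cli, alpha)
        far, frr = far_frr(test_imp, test_cli, t)
        curve.append((alpha, (far + frr) / 2))
    return curve
```

The key design point is that the threshold is never tuned on the test scores, so each point of the curve is an a priori estimate of performance for the chosen trade-off.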

[8] S. Bengio and J. Mariéthoz.
A statistical significance test for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but most being aggregates of two measures (such as the false acceptance rate and the false rejection rate), simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significance of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they often yield overly optimistic results (resulting in false statements about statistical significance).
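As a rough illustration of why an aggregate measure needs its own interval, the sketch below computes an HTER and a normal-approximation confidence interval. It assumes that FAR and FRR are independent binomial proportions estimated on n_impostor and n_client accesses respectively; the function names are hypothetical and the paper's exact test may differ.

```python
import math


def hter(far, frr):
    """Half total error rate: the average of FAR and FRR."""
    return (far + frr) / 2


def hter_confidence_interval(far, frr, n_impostor, n_client, z=1.96):
    """Normal-approximation interval around HTER (assumes independent proportions)."""
    # Variance of the average of two independent binomial proportions.
    variance = (far * (1 - far) / (4 * n_impostor)
                + frr * (1 - frr) / (4 * n_client))
    half_width = z * math.sqrt(variance)
    center = hter(far, frr)
    return center - half_width, center + half_width
```

Note that the interval width depends on both sample sizes separately, which is why treating HTER as a single proportion over all accesses would understate the uncertainty when client accesses are scarce.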

[9] S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve.
In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the so-called break-even point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to adequately use a non-parametric statistical test in order to produce EPCs with confidence intervals or assess the statistical significance of the difference between two models under various settings.

[10] S. Bengio, J. Mariéthoz, and S. Marcel.
Evaluation of biometric technology on XM2VTS.
Technical Report IDIAP-RR 01-21, IDIAP, 2001.
.ps.gz | .pdf | .djvu ]

[11] F. Cardinaux, C. Sanderson, and S. Bengio.
Face verification using adapted generative models.
In International Conference on Automatic Face and Gesture Recognition, FG, pages 825-830, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has been shown previously that systems based on local features and relatively complex generative models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition (here we mean both identification and verification). Recently a simpler generative model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In this paper we first propose to increase the performance of the GMM approach (without sacrificing its simplicity) through the use of local features with embedded positional information; we show that the performance obtained is comparable to 1D HMMs. Secondly, we evaluate different training techniques for both GMM and HMM based systems. We show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there are only a few training images available; we propose to tackle this problem through the use of Maximum a Posteriori (MAP) training, where the lack of data problem can be effectively circumvented; we show that models estimated with MAP are significantly more robust and are able to generalize to adverse conditions present in the BANCA database.

[12] F. Cardinaux, C. Sanderson, and S. Bengio.
User authentication via adapted statistical models of face images.
IEEE Transactions on Signal Processing, 54(1):361-373, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition. Recently, a simpler statistical model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM and HMM based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there are only a few training images available. Considerably more precise models can be obtained through the use of Maximum a Posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo-2D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best trade-off in terms of authentication time, robustness and discrimination performance is achieved by the extended GMM approach.

[13] J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audio-visual person identity verification.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752-760. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system combines the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited, as in a smart card. Our experimental results show that multimodal fusion makes it possible to significantly reduce the user template size while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.

[14] M. Keller, J. Mariéthoz, and S. Bengio.
Significance tests for bizarre measures in 2-class classification tasks.
Technical Report IDIAP-RR 04-34, IDIAP, 2004.
.ps.gz | .pdf | .djvu | abstract]
Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several well-known measures such as F1, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop non-parametric solutions for such bizarre measures. We furthermore assess the quality of these tests on a real-life large dataset.

[15] Q. Le and S. Bengio.
Hybrid generative-discriminative models for speech and speaker recognition.
Technical Report IDIAP-RR 02-06, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Generative probability models such as Hidden Markov Models are usually used for modeling sequences of data because of their ability to handle variable size sequences and missing information. On the other hand, because of their discriminative properties, discriminative models like Support Vector Machines (SVMs) usually yield better performance in classification problems and can construct flexible decision boundaries. An ideal classifier should have all the power of these two complementary approaches. A series of recent papers has suggested some techniques for mixing generative models and discriminative models. In one of them a fixed size vector (the Fisher score) containing sufficient statistics of a sequence is computed for a previously trained HMM and can then be used as input to a discriminative model for classification. The purpose of this project is thus to study, experiment, enhance and adapt these new approaches of integrating discriminative models such as SVM into generative models for sequence processing problems, such as speaker and speech recognition.

[16] Q. Le and S. Bengio.
Client dependent GMM-SVM models for speaker verification.
In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443-451. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability to handle variable size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A cross-validation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.

[17] M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, and J. Richiardi.
Writer identification for smart meeting room systems.
In H. Bunke and A. L. Spitz, editors, Document Analysis Systems VII: 7th International Workshop, DAS, Lecture Notes in Computer Science, volume LNCS 3872, pages 186-195. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we present a text independent on-line writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets used in signature verification systems before. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the stroke-based feature set outperforms the point-based feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.

[18] S. Marcel and S. Bengio.
Improving face verification using skin color information.
In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 2, pages 11-15. IEEE Computer Society Press, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. State-of-the-art methods often use the gray-scale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance.

[19] S. Marcel, C. Marcel, and S. Bengio.
A state-of-the-art neural network for robust face verification.
In COST275 Workshop on the advent of Biometrics on the Internet, 2002.
.ps.gz | .pdf | .djvu | abstract]
The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. State-of-the-art methods often use the gray-scale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance and that the proposed model achieves robust state-of-the-art results.

[20] J. Mariéthoz and S. Bengio.
A comparative study of adaptation methods for speaker verification.
In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Real-life speaker verification systems are often implemented using client model adaptation methods, since the amount of data available for each client is often too low to consider plain Maximum Likelihood methods. While the Bayesian Maximum A Posteriori (MAP) adaptation method is commonly used in speaker verification, other methods have proven to be successful in related domains such as speech recognition. This paper proposes an experimental comparison between three well-known adaptation methods, namely MAP, Maximum Likelihood Linear Regression, and finally EigenVoices. All three methods are compared to the more classical Maximum Likelihood method, and results are given for a subset of the 1999 NIST Speaker Recognition Evaluation database.

[21] J. Mariéthoz and S. Bengio.
An alternative to silence removal for text-independent speaker verification.
Technical Report IDIAP-RR 03-51, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
State-of-the-art text independent speaker verification systems use silence/speech detectors to get rid of silence frames which are considered to be non discriminative. This paper explores a possible replacement for this silence/speech detector by considering each Gaussian of a GMM as modeling a specific speech class and by using discriminant models like SVMs and MLPs in order to fuse the corresponding class-specific scores to obtain a final decision. Experiments on the NIST 2000 database yielded statistically significantly better performance for the new model as compared to our best baseline system involving a silence/speech detector, without having to rely on uncertain hypotheses.

[22] J. Mariéthoz and S. Bengio.
A unified framework for score normalization techniques applied to text independent speaker verification.
IEEE Signal Processing Letters, 12(7):532-535, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The purpose of this paper is to unify several of the state-of-the-art score normalization techniques applied to text-independent speaker verification systems. We propose a new framework for this purpose. The two well-known Z- and T-normalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these well-known score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework.
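For readers unfamiliar with the two normalizations named in this abstract, the sketch below shows their commonly used forms. It is not the unified framework of the paper; the function names are hypothetical, and the only difference between the two functions is where the normalizing statistics come from.

```python
from statistics import mean, stdev


def z_norm(score, impostor_scores_on_client_model):
    """Z-norm: statistics come from impostor utterances scored against the client model."""
    mu = mean(impostor_scores_on_client_model)
    sigma = stdev(impostor_scores_on_client_model)
    return (score - mu) / sigma


def t_norm(score, test_utterance_scores_on_cohort_models):
    """T-norm: statistics come from the test utterance scored against cohort models."""
    mu = mean(test_utterance_scores_on_cohort_models)
    sigma = stdev(test_utterance_scores_on_cohort_models)
    return (score - mu) / sigma
```

Both functions estimate an impostor score distribution and standardize the raw score with it, which is consistent with the abstract's reading of them as different ways to estimate score distributions.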

[23] J. Mariéthoz and S. Bengio.
A max kernel for text-independent speaker verification systems.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]
In this paper, we present a principled SVM based speaker verification system. A general approach is developed that enables the use of any kernel at the frame level. An extension of this approach using the Max operator is then proposed. The new system is then compared to state-of-the-art GMM and other SVM based systems found in the literature on the Polyvar database. It is found that the new system outperforms, most of the time, the other systems, statistically significantly.

[24] J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to text-independent speaker verification systems.
Pattern Recognition, 40:2315-2324, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to state-of-the-art GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.

[25] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, F. Cardinaux, S. Marcel, S. Bengio, C. Sanderson, N. Poh, Y. Rodriguez, J. Czyz, L. Vandendorpe, C. McCool, S. Lowther, S. Sridharan, V. Chandran, R. Paredes, E. Vidal, L. Bai, L. Shen, Y. Wang, C. Yueh-Hsuan, L. Hsien-Chang, H. Yi-Ping, A. Heinrichs, M. Muller, A. Tewes, C. von der Malsburg, R. Wurtz, Z. Wang, F. Xue, Y. Ma, Q. Yang, C. Fang, X. Ding, S. Lucey, R. Goss, and H. Schneiderman.
Face authentication test on the BANCA database.
In International Conference on Pattern Recognition, ICPR, volume 4, pages 523-532, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper details the results of a Face Authentication Test (FAT2004) held in conjunction with the 17th International Conference on Pattern Recognition. The contest was held on the publicly available BANCA database according to a defined protocol. The competition also had a sequestered part in which institutions had to submit their algorithms for independent testing. 13 different verification algorithms from 10 institutions submitted results. Also, a standard set of face recognition software packages from the Internet was used to provide a baseline performance measure.

[26] K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, S. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, N. Poh, Y. Rodriguez, K. Kryszczuk, J. Czyz, L. Vandendorpe, J. Ng, H. Cheung, and B. Tang.
Face authentication competition on the BANCA database.
In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 8-15. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper details the results of a face verification competition held in conjunction with the First International Conference on Biometric Authentication. The contest was held on the publicly available BANCA database according to a defined protocol. Six different verification algorithms from 4 academic and commercial institutions submitted results. Also, a standard set of face recognition software from the internet was used to provide a baseline performance measure.

[27] K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio, F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou, W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar, F. Deravi, and N. Mavity.
Face verification competition on the XM2VTS database.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 964-974. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink ]

[28] N. Poh and S. Bengio.
Non-linear variance reduction techniques in biometric authentication.
In IEEE Multimodal User Authentication Workshop, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple synthetic samples, different extractors (features) and biometric modalities. The scores are combined using the average operator, Multi-Layer Perceptron and Support Vector Machines. It is found empirically that VR via modalities is the best technique, followed by VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[29] N. Poh and S. Bengio.
Variance reduction techniques in biometric authentication.
Technical Report IDIAP-RR 03-17, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple real samples, synthetic samples, different extractors (features) and biometric modalities. It is found empirically that VR via modalities is the best technique, followed by VR via real samples, VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[30] N. Poh and S. Bengio.
Noise-robust multi-stream fusion for text-independent speaker authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-stream approaches have proven to be very successful in speech recognition tasks and to a certain extent in speaker authentication tasks. In this study we propose a noise-robust multi-stream text-independent speaker authentication system. This system has two steps: first train the stream experts under clean conditions and then train the combination mechanism to merge the scores of the stream experts under both clean and noisy conditions. The idea here is to take advantage of the rather predictable reliability and diversity of streams under different conditions. Hence, noise-robustness is mainly due to the combination mechanism. This two-step approach offers several practical advantages: the stream experts can be trained in parallel (e.g., by using several machines); heterogeneous types of features can be used and the resultant system can be robust to different noise types (wide bands or narrow bands) as compared to sub-streams. An important finding is that a trade-off is often necessary between the overall good performance under all conditions (clean and noisy) and good performance under clean conditions. To reconcile this trade-off, we propose to give more emphasis or prior to clean conditions, thus, resulting in a combination mechanism that does not deteriorate under clean conditions (as compared to the best stream) yet is robust to noisy conditions.

[31] N. Poh and S. Bengio.
Towards predicting optimal subsets of base classifiers in biometric authentication tasks.
In S. Bengio and H. Bourlard, editors, Machine Learning for Multimodal Interactions: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 159-172. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams, is a very promising approach, both in experiments and to some extent in various real-life applications. However, combining too many systems (base-experts) will also increase both hardware and computation costs. One way to select a subset of optimal base-experts out of N is to carry out the experiments explicitly. There are 2^N-1 possible combinations. In this paper, we propose an analytical solution to this task when a weighted sum fusion mechanism is used. The proposed approach is at least valid in the domain of person authentication. It has a complexity that is additive between the number of examples and the number of possible combinations while the conventional approach, using brute-force experimenting, is multiplicative between these two terms. Hence, our approach will scale better with large fusion problems. Experiments on the BANCA multi-modal database verified our approach. While we will consider here fusion in the context of identity verification via biometrics, or simply biometric authentication, it can also have an important impact in meetings because this a priori information can assist in retrieving highlights in meeting analysis as in “who said what”. Furthermore, automatic meeting analysis also requires many systems working together and involves possibly many audio-visual media streams. Development in fusion of identity verification will provide insights into how fusion in meetings can be done. The ability to predict fusion performance is another important step towards understanding the fusion problem.
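To make the size of the brute-force search concrete, the short sketch below enumerates every non-empty subset of N base-experts, of which there are 2^N - 1. The helper name and the expert labels are hypothetical, and the sketch only illustrates the search space that the abstract contrasts against, not the analytical solution proposed in the paper.

```python
from itertools import combinations


def nonempty_subsets(experts):
    """Yield every non-empty subset of the given base-experts, smallest first."""
    for k in range(1, len(experts) + 1):
        yield from combinations(experts, k)


# Hypothetical example with N = 3 base-experts.
subsets = list(nonempty_subsets(["face", "speech", "lips"]))
print(len(subsets))  # 2**3 - 1 subsets
```

A brute-force approach would run one fusion experiment over all examples for each of these subsets, which is where the multiplicative cost mentioned in the abstract comes from.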

[32] N. Poh and S. Bengio.
Why do multi-stream, multi-band and multi-modal approaches work on biometric user authentication tasks?
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 5, pages 893-896, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Multi-band, multi-stream and multi-modal approaches have proven to be very successful both in experiments and in real-life applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of a theoretical study to justify why and how they work, when one combines the streams at the feature or classifier score levels. In this paper, we attempt to shed light on the latter subject. While there exists literature discussing this aspect, a study on the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, Multi-Layer Perceptrons or Support Vector Machines always performs better than the average performance of the underlying experts. Furthermore, in practice, most combined experts using the methods mentioned above perform better than the best underlying expert.
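The link between correlation and variance reduction under the mean operator can be illustrated with a standard identity: if N expert scores share a common variance sigma^2 and a common pairwise correlation rho, the variance of their mean is sigma^2 * (1/N + (N-1)*rho/N). The sketch below evaluates it; the equal-variance and equal-correlation assumptions and the function name are simplifications made here, not the paper's full analysis.

```python
def variance_of_mean(sigma2, n_experts, rho):
    """Variance of the mean of n equally correlated scores with common variance sigma2."""
    return sigma2 * (1.0 / n_experts + (n_experts - 1) * rho / n_experts)


# Uncorrelated experts give the full 1/N reduction; fully correlated experts give none.
for rho in (0.0, 0.5, 1.0):
    print(rho, variance_of_mean(1.0, 4, rho))
```

Under these assumptions the combined variance never exceeds that of a single expert, and the gain shrinks as the experts become more correlated.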

[33] N. Poh and S. Bengio.
Can chimeric persons be used in multimodal biometric authentication experiments?
In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, volume LNCS 3869. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams, is a very promising approach, both in experiments and to some extent in various real-life applications. A system that uses more than one behavioral or physiological characteristic to verify whether a person is who he/she claims to be is called a multimodal biometric authentication system. Due to the lack of large true multimodal biometric datasets, the biometric trait of a user from a database is often combined with another different biometric trait of yet another user, thus creating a so-called chimeric user. In the literature, this practice is justified based on the fact that the underlying biometric traits to be combined are assumed to be independent of each other given the user. To the best of our knowledge, there is no literature that approves or disapproves such practice. We study this topic from two aspects: 1) by clarifying the mentioned independence assumption and 2) by constructing a pool of chimeric users from a pool of true modality matched users (or simply “true users”) taken from a bimodal database, such that the performance variability due to chimeric users can be compared with that due to true users. The experimental results suggest that for a large proportion of the experiments, such practice is indeed questionable.

[34] N. Poh and S. Bengio.
EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks.
In N. C. Oza, R. Polikar, and J. Kittler, editors, 6th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 3541, pages 74-85. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Biometric authentication is a process of verifying an identity claim using a person's behavioural and physiological characteristics. Due to the vulnerability of the system to environmental noise and variation caused by the user, fusion of several biometric-enabled systems is identified as a promising solution. In the literature, various fixed rules (e.g. min, max, median, mean) and trainable classifiers (e.g. linear combination of scores or weighted sum) are used to combine the scores of several base-systems. How exactly do correlation and imbalance nature of base-system performance affect the fixed rules and trainable classifiers? We study these joint aspects using the commonly used error measurement in biometric authentication, namely Equal Error Rate (EER). Similar to several previous studies in the literature, the central assumption used here is that the class-dependent scores of a biometric system are approximately normally distributed. However, different from them, the novelty of this study is to make a direct link between the EER measure and the fusion schemes mentioned. Both synthetic and real experiments (with as many as 256 fusion experiments carried out on the XM2VTS benchmark score-level fusion data sets) verify our proposed theoretical modeling of EER of the two families of combination scheme. In particular, it is found that weighted sum can provide the best generalisation performance when its weights are estimated correctly. It also has the additional advantage that score normalisation prior to fusion is not needed, contrary to the rest of fixed fusion rules.

[35] N. Poh and S. Bengio.
F-ratio client-dependent normalisation for biometric authentication tasks.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 721-724, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new client-dependent normalisation to improve biometric authentication systems. There exist many client-dependent score normalisation techniques applied to speaker authentication, such as Z-Norm, D-Norm and T-Norm. Such normalisation is intended to adjust the variation across different client models. We propose “F-ratio” normalisation, or F-Norm, applied to face and speaker authentication systems. This normalisation requires only that as few as two client-dependent accesses are available (the more the better). Different from previous normalisation techniques, F-Norm considers the client and impostor distributions simultaneously. We show that F-ratio is a natural choice because it is directly associated to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that F-Norm actually “interpolates” between client-independent and client-dependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over Z-Norm, client-dependent threshold normalisation or no normalisation.

[36] N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks?
IEEE Transactions on Signal Processing, 53(11):4384-4396, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources such as subbands, streams (with different features) and multimodal data has been shown to be a very promising trend, both in experiments and to some extent in real-life biometric authentication applications. Despite considerable efforts in fusion, there is a lack of understanding on the roles and effects of correlation and variance (of both the client and impostor scores of base-classifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance Reduction-Equal Error Rate (VR-EER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we showed that Equal Error Rate (EER) can be modeled as a function of F-ratio, which itself is a function of 1) correlation, 2) variance of base-experts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of base-experts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be misleading. Experimental results on the BANCA multimodal database confirm our findings using VR-EER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated base-experts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score) improves fusion.
However, by linking the concept of ambiguity decomposition to classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (class-dependent) variance. Moreover, by linking the concept of bias-variance-covariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.

[37] N. Poh and S. Bengio.
Improving fusion with margin-derived confidence in biometric authentication tasks.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1059-1068. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new confidence criterion to improve fusion via a linear combination of scores of several biometric authentication systems. This confidence is based on the margin of making a decision, which answers the question, “after observing the score of a given system, what is the confidence (or risk) associated to that given access?”. In the context of multimodal and intramodal fusion, such information proves valuable because the margin information can determine which of the systems should be given higher weights. Finally, we propose a linear discriminative framework to fuse the margin information with an existing global fusion function. The results of 32 fusion experiments carried out on the XM2VTS multimodal database show that fusion using margin (product of margin and expert opinion) is superior over fusion without the margin information (i.e., the original expert opinion). Furthermore, combining both sources of information increases fusion performance further.

[38] N. Poh and S. Bengio.
A novel approach to combining client-dependent and confidence information in multimodal biometrics.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1120-1129. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The issues of fusion with client-dependent and confidence information have been well studied separately in biometric authentication. In this study, we propose to take advantage of both sources of information in a discriminative framework. Initially, each source of information is processed on a per expert basis (plus on a per client basis for the first information and on a per example basis for the second information). Then, both sources of information are combined using a second-level classifier, across different experts. Although the formulation of such two-step solution is not new, the novelty lies in the way the sources of prior knowledge are incorporated prior to fusion using the second-level classifier. Because these two sources of information are of very different nature, one often needs to devise special algorithms to combine both information sources. Our framework that we call “Prior Knowledge Incorporation” has the advantage of using the standard machine learning algorithms. Based on 10 times 32=320 intramodal and multimodal fusion experiments carried out on the publicly available XM2VTS score-level fusion benchmark database, it is found that the generalisation performance of combining both information sources improves over using either or none of them, thus achieving a new state-of-the-art performance on this database.

[39] N. Poh and S. Bengio.
A score-level fusion benchmark database for biometric authentication.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 474-483. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.

[40] N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue - the fact that too many biometric modalities of the same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question how such a chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a real-user database? and ii) Can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database? Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 × 1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases neither degrades nor improves the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[41] N. Poh and S. Bengio.
Database, protocol and tools for evaluating score-level fusion algorithms in biometric authentication.
Pattern Recognition, 39(2):223-233, 2006.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.

[42] N. Poh and S. Bengio.
Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2007.
.ps.gz | .pdf | .djvu | abstract]
Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, composition and the choice of samples. We propose to reduce the performance dependency of these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test class-conditional (client and impostor) score distributions are unknown and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them on the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance of an unseen database with two times more users with an average coverage of 95% (over all the 24 systems). A coverage is the proportion of the unseen EPC covered by the derived confidence interval.

[43] N. Poh, S. Bengio, and J. Korczak.
A multi-sample multi-source model for biometric authentication.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 375-384, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this study, two techniques that can improve the authentication process are examined: (i) multiple samples and (ii) multiple biometric sources. We propose the fusion of multiple samples obtained from multiple biometric sources at the score level. By using the average operator, both the theoretical and empirical results show that integrating as many samples and as many biometric sources as possible can improve the overall reliability of the system. This strategy is called multi-sample multi-source approach. This strategy was tested on a real-life database using neural networks trained in one-versus-all configuration.

[44] N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess user-dependent variabilities.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
.ps.gz | .pdf | .djvu | abstract]

[45] N. Poh, S. Marcel, and S. Bengio.
Improving face authentication using virtual samples.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 3, pages 233-236, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we present a simple yet effective way to improve a face verification system by generating multiple virtual samples from the unique image corresponding to an access request. These images are generated using simple geometric transformations. This method is often used during training to improve accuracy of a neural network model by making it robust against minor translation, scale and orientation change. The main contribution of this paper is to introduce such method during testing. By generating N images from one single image and propagating them to a trained network model, one obtains N scores. By merging these scores using a simple mean operator, we show that the variance of merged scores is decreased by a factor between 1 and N. An experiment is carried out on the XM2VTS database which achieves new state-of-the-art performances.

[46] N. Poh, A. Martin, and S. Bengio.
Performance generalization in biometric authentication using joint user-specific and sample bootstraps.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(3):492-498, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
Biometric authentication performance is often depicted by a decision error trade-off (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a two-step bootstrap procedure to take into account the three mentioned sources of variability. This is an extension to Bolle's bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75% of DET coverage, an unseen DET curve with 8 times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and hence more useful.

[47] N. Poh, C. Sanderson, and S. Bengio.
Spectral subband centroids as complementary features for speaker authentication.
In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 631-639. Springer-Verlag, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Most conventional features used in speaker authentication are based on estimation of spectral envelopes in one way or another, e.g., Mel-scale Filterbank Cepstrum Coefficients (MFCCs), Linear-scale Filterbank Cepstrum Coefficients (LFCCs) and Relative Spectral Perceptual Linear Prediction (RASTA-PLP). In this study, Spectral Subband Centroids (SSCs) are examined. These features are the centroid frequency in each subband. They have properties similar to formant frequencies but are limited to a given subband. Empirical experiments carried out on the NIST2001 database using SSCs, MFCCs, LFCCs and their combinations by concatenation suggest that SSCs are somewhat more robust compared to conventional MFCC and LFCC features as well as being partially complementary.

[48] F. Porée, J. Mariéthoz, S. Bengio, and F. Bimbot.
The BANCA database and experimental protocol for speaker verification.
Technical Report IDIAP-RR 02-13, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Identity verification has become a very important research topic recently, particularly using methods based on the face or the voice of the individuals. In the context of the BANCA European project, a novel multi-modal database was recently recorded, spanning 5 European languages, 2 modalities (face and voice), 2 microphones, 2 cameras and almost 300 individuals. As we believe that this database offers many advantages for this research community, this paper essentially presents the database and its associated experimental protocol, as well as a baseline state-of-the-art system using the voice data for a text-independent speaker verification task.

[49] Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Estimating the quality of face localization for face verification.
In IEEE International Conference on Image Processing, ICIP, pages 581-584, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Face localization is the process of finding the exact position of a face in a given image. This can be useful in several applications such as face tracking or person authentication. The purpose of this paper is to show that the error made during the localization process may have different impacts depending on the final application. Hence in order to evaluate the performance of a face localization algorithm, we propose to embed the final application (here face verification) into the performance measuring process. Moreover, in this paper, we estimate this embedding using either a multilayer perceptron or a K nearest neighbor algorithm in order to speed up the evaluation process. We show on the BANCA database that our proposed measure best matches the final verification results when comparing several localization algorithms, on various performance measures currently used in face localization.

[50] Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Measuring the performance of face localization systems.
Image and Vision Computing, 24(8):882-893, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
The purpose of face localization is to determine the coordinates of a face in a given image. It is a fundamental research area in computer vision because it serves as a necessary first step in any face processing system, such as automatic face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performances depend widely on the accuracy of the face localization process. The purpose of this paper is mainly to show that the error made during the localization process may have different impacts on the final application. We first show the influence of localization errors on the face verification task and then empirically demonstrate the problems of current localization performance measures when applied to this task. In order to properly evaluate the performance of a face localization algorithm, we then propose to embed the final application (here face verification) into the performance measuring process. Using two benchmark databases, BANCA and XM2VTS, we proceed by showing empirically that our proposed method to evaluate localization algorithms better matches the final verification performance.

[51] C. Sanderson and S. Bengio.
Augmenting frontal face models for non-frontal verification.
In IEEE Multimodal User Authentication Workshop, 2003.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this work we propose to address the problem of non-frontal face verification when only a frontal training image is available (e.g. a passport photograph) by augmenting a client's frontal face model with artificially synthesized models for non-frontal views. In the framework of a Gaussian Mixture Model (GMM) based classifier, two techniques are proposed for the synthesis: UBMdiff and LinReg. Both techniques rely on a priori information and learn how face models for the frontal view are related to face models at a non-frontal view. The synthesis and augmentation approach is evaluated by applying it to two face verification systems: Principal Component Analysis (PCA) based and DCTmod2 (Sanderson et al., 2003) based; the two systems are a representation of holistic and non-holistic approaches, respectively. Results from experiments on the FERET database suggest that in almost all cases, frontal model augmentation has beneficial effects for both systems; they also suggest that the LinReg technique (which is based on multivariate regression of classifier parameters) is more suited to the PCA based system and that the UBMdiff technique (which is based on differences between two general face models) is more suited to the DCTmod2 based system. The results also support the view that the standard DCTmod2/GMM system (trained on frontal faces) is less affected by out-of-plane rotations than the corresponding PCA/GMM system; moreover, the DCTmod2/GMM system using augmented models is, in almost all cases, more robust than the corresponding PCA/GMM system.

[52] C. Sanderson and S. Bengio.
Robust features for frontal authentication in difficult image conditions.
In 4th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 495-504. Springer-Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper we extend the recently proposed DCT-mod2 feature extraction technique (which utilizes polynomial coefficients derived from 2D DCT coefficients obtained from horizontally & vertically neighbouring blocks) via the use of various windows and diagonally neighbouring blocks. We also propose enhanced PCA, where traditional PCA feature extraction is combined with DCT-mod2. Results using test images corrupted by a linear and a non-linear illumination change, white Gaussian noise and compression artefacts, show that use of diagonally neighbouring blocks and windowing is detrimental to robustness against illumination changes while being useful for increasing robustness against white noise and compression artefacts. We also show that the enhanced PCA technique retains all the positive aspects of traditional PCA (that is, robustness against white noise and compression artefacts) while also being robust to illumination changes; moreover, enhanced PCA outperforms PCA with histogram equalisation pre-processing.

[53] C. Sanderson and S. Bengio.
Extrapolating single view face models for multi-view recognition.
In International Conference on Intelligent Sensors, Sensor Networks and Information Processing, ISSNIP, pages 581-586, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Performance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when there is only one training image available. We address this problem by extending each statistical frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors.

[54] C. Sanderson and S. Bengio.
Statistical transformation techniques for face verification using faces rotated in depth.
Technical Report IDIAP-RR 04-04, IDIAP, 2004.
.ps.gz | .pdf | .djvu | abstract]
In the framework of a Bayesian classifier based on mixtures of Gaussians, we address the problem of non-frontal face verification (when only a single (frontal) training image is available) by extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg based technique is more suited than the MLLR based techniques; for the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the DCTmod2 based system is less affected by out-of-plane rotations than the PCA based system; this can be attributed to the local feature representation of the face, and, due to the classifier based on mixtures of Gaussians, the lack of constraints on spatial relations between face parts, allowing for movement of facial areas.

[55] C. Sanderson and S. Bengio.
Statistical transformations of frontal models for non-frontal face verification.
In IEEE International Conference on Image Processing, ICIP, pages 585-588, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In the framework of a face verification system using local features and a Gaussian Mixture Model based classifier, we address the problem of non-frontal face verification (when only a single (frontal) training image is available) by extending each client's frontal face model with artificially synthesized models for non-frontal views. Furthermore, we propose the Maximum Likelihood Shift (MLS) synthesis technique and compare its performance against a Maximum Likelihood Linear Regression (MLLR) based technique (originally developed for adapting speech recognition systems) and the recently proposed "difference between two Universal Background Models" (UBMdiff) technique. All techniques rely on prior information and learn how a generic face model for the frontal view is related to generic models at non-frontal views. Experiments on the FERET database suggest that the proposed MLS technique is more suitable than MLLR (due to a lower number of free parameters) and UBMdiff (due to lack of heuristics). The results further suggest that extending frontal models considerably reduces errors.

[56] C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F. BenZeghiba, F. Cardinaux, and S. Marcel.
Speech & face based biometric authentication at IDIAP.
In International Conference on Multimedia and Expo, ICME, volume 3, pages 1-4, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We present an overview of recent research at IDIAP on speech & face based biometric authentication. This paper covers user-customised passwords, adaptation techniques, confidence measures (for use in fusion of audio & visual scores), face verification in difficult image conditions, as well as other related research issues. We also overview the open source Torch library, which has aided in the implementation of the above mentioned techniques.

[57] C. Sanderson, S. Bengio, and Y. Gao.
On transforming statistical models for non-frontal face verification.
Pattern Recognition, 39(2):288-302, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of gaussians, the problem is tackled through extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCA-derived features) and a local feature system (based on DCT-derived features). Experiments on the FERET database suggest that for the holistic system, the LinReg based technique is more suited than the MLLR based techniques; for the local feature system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts based representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.

[58] C. Sanderson, F. Cardinaux, and S. Bengio.
On accuracy/robustness/complexity trade-offs in face verification.
In IEEE International Conference on Information Technology and Applications, ICITA, pages 638-643, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In much of the literature devoted to face recognition, experiments are performed with controlled images (e.g. manual face localization, controlled lighting, background and pose). However, a practical recognition system has to be robust to more challenging conditions. In this paper we first evaluate, on the relatively difficult BANCA database, the discrimination accuracy, robustness and complexity of Gaussian Mixture Model (GMM), 1D- and pseudo-2D Hidden Markov Model (HMM) based systems, using both manual and automatic face localization. We also propose to extend the GMM approach through the use of local features with embedded positional information, increasing accuracy without sacrificing its low complexity. Experiments show that good accuracy on manually located faces is not necessarily indicative of good accuracy on automatically located faces (which are imperfectly located). The deciding factor is shown to be the degree of constraints placed on spatial relations between face parts. Methods which utilize rigid constraints have poor robustness compared to methods which have relaxed constraints. Furthermore, we show that while the pseudo-2D HMM approach has the best overall accuracy, classification time on current hardware makes it impractical. The best trade-off in terms of complexity, robustness and discrimination accuracy is achieved by the extended GMM approach.

Large Scale

[1] S. Bengio.
Large scale visual semantic extraction.
In Frontiers of Engineering - Reports on Leading-Edge Engineering from the 2011 Symposium, 2012.
weblink | abstract]
Image annotation is the task of providing textual semantics to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a so-called "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts.
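
As an illustration of the nearest-annotation lookup described in this abstract, here is a minimal sketch. It assumes a linear image projection and Euclidean distance in the shared space; the abstract fixes neither choice, and the names `project`, `nearest_annotations`, `image_map` and the toy embeddings are hypothetical.

```python
import math


def project(matrix, vector):
    """Linearly map a feature vector into the embedding space (assumed form)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nearest_annotations(image_features, image_map, annotation_embeddings, k=2):
    """Return the k annotation names closest to the projected image."""
    z = project(image_map, image_features)
    ranked = sorted(
        annotation_embeddings,
        key=lambda name: math.dist(z, annotation_embeddings[name]),
    )
    return ranked[:k]


# Toy data: 3-dimensional image features mapped into a 2-dimensional space.
image_map = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
annotations = {"dolphin": [0.9, 0.1], "ocean": [0.5, 0.5], "car": [-1.0, 0.0]}
top = nearest_annotations([1.0, 0.0, 5.0], image_map, annotations, k=2)
```

In a real system both the projection and the annotation embeddings would be learned from data; here they are fixed by hand only to show the lookup.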

[2] S. Bengio and Y. Bengio.
Taking on the curse of dimensionality in joint distributions using neural networks.
IEEE Transactions on Neural Networks, special issue on data mining and knowledge discovery, 11(3):550-557, 2000.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables (thus reducing significantly the number of parameters). Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.
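
The factorization this abstract refers to is the chain rule of probability. As a sketch (the variable names $Z_1, \dots, Z_n$ are chosen here for illustration and the ordering of variables is an assumption):

```latex
P(Z_1, \dots, Z_n) \;=\; \prod_{i=1}^{n} P\left(Z_i \mid Z_1, \dots, Z_{i-1}\right)
```

For comparison, an unrestricted table over $n$ binary variables has $2^n - 1$ free parameters, which is the exponential growth the abstract contrasts with the at-most-quadratic growth of the proposed architecture.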

[3] Y. Bengio and S. Bengio.
Modeling high-dimensional discrete data with multi-layer neural networks.
In S.A. Solla, T.K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, NIPS 12, pages 400-406. MIT Press, 2000.
.ps.gz | .pdf | .djvu | weblink | abstract]
The curse of dimensionality is severe when modeling high-dimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling high-dimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of variables, using a multi-layer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[4] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Large-scale content-based audio retrieval from text queries.
In ACM International Conference on Multimedia Information Retrieval, MIR, 2008.
.ps.gz | .pdf | .djvu | abstract]
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound sample based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed user-labeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.

[5] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large-scale online learning of image similarity through ranking: Extended abstract.
In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address real-world large-scale AI problems, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that query-independent similarity could be accurately learned even for large-scale datasets that could not be handled before.

[6] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It lies at the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.

[7] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking.
Journal of Machine Learning Research, JMLR, 11:1109-1135, 2010.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.
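
The abstract names the ingredients (a bilinear similarity, a large margin hinge loss, passive-aggressive updates) without giving formulas. The sketch below assumes the usual passive-aggressive form on (query, positive, negative) triplets with a margin of 1 and an aggressiveness cap `C`. It uses dense Python lists for readability, whereas the paper works with sparse representations; the function names are hypothetical.

```python
def similarity(W, a, b):
    """Bilinear similarity a^T W b for a matrix W given as a list of rows."""
    return sum(a[i] * W[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def pa_update(W, query, pos, neg, C=0.1):
    """One passive-aggressive step on a triplet; returns the hinge loss before it.

    W is modified in place so that `pos` scores higher than `neg` for `query`.
    """
    loss = max(0.0, 1.0 - similarity(W, query, pos) + similarity(W, query, neg))
    if loss == 0.0:
        return loss  # margin already satisfied: stay passive
    diff = [p - n for p, n in zip(pos, neg)]
    # Update direction V = query * diff^T; its squared Frobenius norm factorizes.
    norm_sq = sum(q * q for q in query) * sum(d * d for d in diff)
    if norm_sq == 0.0:
        return loss
    tau = min(C, loss / norm_sq)  # step size capped by the assumed parameter C
    for i, q in enumerate(query):
        for j, d in enumerate(diff):
            W[i][j] += tau * q * d
    return loss


# Toy example starting from the identity matrix.
W = [[1.0, 0.0], [0.0, 1.0]]
first_loss = pa_update(W, [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], C=10.0)
second_loss = pa_update(W, [1.0, 0.0], [0.0, 1.0], [1.0, 0.0], C=10.0)
```

Each update touches only the rows and columns where the query and the difference vector are non-zero, which is why this kind of step stays cheap on sparse features.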

[8] R. Collobert and S. Bengio.
On the convergence of SVMTorch, an algorithm for large-scale regression problems.
Technical Report IDIAP-RR 00-24, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
Recently, many researchers have proposed decomposition algorithms for SVM regression problems. In a previous paper, we also proposed such an algorithm, named SVMTorch. In this paper, we show that while there is actually no convergence proof for any other decomposition algorithm for SVM regression problems to our knowledge, such a proof does exist for SVMTorch for the particular case where no shrinking is used and the size of the working set is equal to 2, which is the size that gave the fastest results on most experiments we have done. This convergence proof is in fact mainly based on the convergence proof given by Keerthi and Gilbert for their SVM classification algorithm.

[9] R. Collobert and S. Bengio.
Support vector machines for large-scale regression problems.
Technical Report IDIAP-RR 00-17, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l^2 memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another SVM algorithm for large-scale regression problems from Flake and Lawrence, yielded significant time improvements.

[10] R. Collobert and S. Bengio.
SVMTorch: Support vector machines for large-scale regression problems.
Journal of Machine Learning Research, JMLR, 1:143-160, 2001.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l square memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for large-scale regression problems from Flake and Lawrence (2000) yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

[11] R. Collobert and S. Bengio.
A new margin-based criterion for efficient gradient descent.
Technical Report IDIAP-RR 03-16, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
During the last few decades, several papers were published about second-order optimization methods for gradient descent based learning algorithms. Unfortunately, these methods usually have a cost in time close to O(n^3) per iteration, and O(n^2) in space, where n is the number of parameters to optimize, which is intractable with large optimization systems usually found in real-life problems. Moreover, these methods are usually not easy to implement. Many enhancements have also been proposed in order to overcome these problems, but most of them still cost O(n^2) in time per iteration. Instead of trying to solve a hard optimization problem using complex second-order tricks, we propose to modify the problem itself in order to optimize a simpler one, by simply changing the cost function used during training. Furthermore, we will argue that analyzing the Hessian resulting from the choice of various cost functions is very informative and could help in the design of new machine learning algorithms. For instance, we propose in this paper a version of the Support Vector Machines criterion applied to Multi Layer Perceptrons, which yields very good training and generalization performance in practice. Several empirical comparisons on two benchmark data sets are given to justify this approach.

[12] R. Collobert and S. Bengio.
A gentle hessian for efficient gradient descent.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 5, pages 517-520, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
Several second-order optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n^2) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a block-diagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of block-diagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.

[13] R. Collobert and S. Bengio.
Links between perceptrons, MLPs and SVMs.
In International Conference on Machine Learning, ICML, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We propose to study links between three important classification algorithms: Perceptrons, Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performances in generalization and speed than the other models.

[14] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
Neural Computation, 14(5):1105-1114, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

[15] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633-640. MIT Press, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest), as well as a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest.

[16] R. Collobert, S. Bengio, and J. Mariéthoz.
Torch: a modular machine learning software library.
Technical Report IDIAP-RR 02-46, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Many scientific communities have expressed a growing interest in machine learning algorithms recently, mainly due to the generally good results they provide, compared to traditional statistical or AI approaches. However, these machine learning algorithms are often complex to implement and to use properly and efficiently. We thus present in this paper a new machine learning software library in which most state-of-the-art algorithms have already been implemented and are available in a unified framework, in order for scientists to be able to use them, compare them, and even extend them. More interestingly, this library is freely available under a BSD license and can be retrieved on the web by everyone.

[17] R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures.
In S. Lee and A. Verri, editors, International Workshop on Pattern Recognition with Support Vector Machines, SVM, Lecture Notes in Computer Science, volume LNCS 2388, pages 8-23. Springer-Verlag, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.

[18] R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures.
International Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349-365, 2003.
.ps.gz | .pdf | .djvu | weblink | abstract]
A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative log-likelihood.

[19] D. Grangier and S. Bengio.
A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):1371-1384, 2008.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

[20] M. Keller, S. Bengio, and S. Y. Wong.
Benchmarking non-parametric statistical tests.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, under various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical tests, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that, provided the evaluation sets are big enough, non-parametric tests are relatively reliable in all conditions.

[21] M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparse-code representations.
In ICML 2009 Workshop on Sparse Methods for Music Audio, 2009.
.ps.gz | .pdf | .djvu | abstract]
The task of ranking sounds from text queries is a good test application for machine-hearing techniques, and particularly for comparison and evaluation of alternative sound representations in a large-scale setting. We have adapted a machine-vision system, “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns, using a ranking-based cost function, a linear mapping from a very large sparse feature space to a large query-term space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. In addition to two main auditory-image models, we also include and compare a family of more conventional Mel-Frequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. The two auditory models tested use the adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precision-at-top-k performance measures, the best results are about 72% top-1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

[22] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD, 2010.
Best Paper Award in Machine Learning [ .ps.gz | .pdf | .djvu | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[23] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
Machine Learning Journal, 81(1):21-35, 2010.
.ps.gz | .pdf | .djvu | weblink | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[24] J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation.
In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011.
.ps.gz | .pdf | .djvu | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

Handwriting

[1] A. Vinciarelli and S. Bengio.
Offline cursive word recognition using continuous density hidden Markov models trained with PCA or ICA features.
In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 3, pages 81-84. IEEE Computer Society Press, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This work presents an Offline Cursive Word Recognition System dealing with single writer samples. The system is a continuous density hidden Markov model trained using either the raw data, or data transformed using Principal Component Analysis or Independent Component Analysis. Both techniques significantly improved the recognition rate of the system. Preprocessing, normalization and feature extraction are described in detail as well as the training technique adopted. Several experiments were performed using a publicly available database. The accuracy obtained is the highest presented in the literature over the same data.

[2] A. Vinciarelli and S. Bengio.
Transforming the feature vectors to improve HMM based cursive word recognition systems.
Technical Report IDIAP-RR 02-32, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Although many Offline Cursive Word Recognition systems are based on HMMs, no attention was ever paid, to our knowledge, to the fact that the feature vectors are typically not in the most suitable form for modeling. They are most of the time correlated and embedded in a space of dimension higher than their Intrinsic Dimension. This leads to several problems and has a negative influence on the performance. By applying some transforms it is possible to solve, or at least to attenuate, such problems, resulting in data that are easier to model and in systems with a higher recognition rate. In this work, we used Principal Component Analysis (linear and nonlinear) and Independent Component Analysis. A reduction of the error rate by up to 30.3% (over single writer data) and 16.2% (over multiple writer samples) is shown to be achieved.

[3] A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based off-line cursive script recognition.
In Proceedings of the 8th International Conference on Frontiers in Handwriting Recognition, pages 287-291, 2002.
.ps.gz | .pdf | .djvu | weblink | abstract]
This work presents the application of HMM adaptation techniques to the problem of Off-Line Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when fewer than 30 word samples are used during adaptation, while a system trained using only the data of the single writer needs at least 200 words in order to achieve the same performance as the adapted models.

[4] A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based off-line cursive script recognition.
Pattern Recognition Letters, 23(8):905-916, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This work presents the application of HMM adaptation techniques to the problem of Off-Line Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when fewer than 30 word samples are used during adaptation, while a system trained using only the data of the single writer needs at least 200 words in order to achieve the same performance as the adapted models.

[5] A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of large vocabulary cursive handwritten text.
In International Conference on Document Analysis and Recognition, ICDAR, pages 1101-1105, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a system for the offline recognition of cursive handwritten lines of text. The system is based on continuous density HMMs and Statistical Language Models. The system recognizes data produced by a single writer. No a priori knowledge is used about the content of the text to be recognized. Changes in the experimental setup with respect to the recognition of single words are highlighted. The results show a recognition rate of ~85% with a lexicon containing 50,000 words. The experiments were performed over a publicly available database.

[6] A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of unconstrained handwritten texts using HMMs and statistical language models.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):709-720, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, error rate is reduced by ~50% for single writer data and by ~25% for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.

Kernel

[1] R. Collobert and S. Bengio.
On the convergence of SVMTorch, an algorithm for large-scale regression problems.
Technical Report IDIAP-RR 00-24, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
Recently, many researchers have proposed decomposition algorithms for SVM regression problems. In a previous paper, we also proposed such an algorithm, named SVMTorch. In this paper, we show that while, to our knowledge, there is currently no convergence proof for any other decomposition algorithm for SVM regression problems, such a proof does exist for SVMTorch for the particular case where no shrinking is used and the size of the working set is equal to 2, which is the size that gave the fastest results on most experiments we have done. This convergence proof is in fact mainly based on the convergence proof given by Keerthi and Gilbert for their SVM classification algorithm.

[2] R. Collobert and S. Bengio.
Support vector machines for large-scale regression problems.
Technical Report IDIAP-RR 00-17, IDIAP, Martigny, Switzerland, 2000.
.ps.gz | .pdf | .djvu | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l^2 memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another SVM algorithm for large-scale regression problems from Flake and Lawrence, yielded significant time improvements.

[3] R. Collobert and S. Bengio.
SVMTorch: Support vector machines for large-scale regression problems.
Journal of Machine Learning Research, JMLR, 1:143-160, 2001.
.ps.gz | .pdf | .djvu | weblink | abstract]
Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l squared memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVM-Light proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve large-scale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for large-scale regression problems from Flake and Lawrence (2000), yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

[4] R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems.
Neural Computation, 14(5):1105-1114, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundred thousand examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

[5] Y. Grandvalet, J. Mariéthoz, and S. Bengio.
A probabilistic interpretation of SVMs with an application to unbalanced classification.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection makes it possible to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem when decisions result in unequal losses. Experiments on an unbalanced classification loss show improvements over state-of-the-art procedures.

[6] D. Grangier and S. Bengio.
A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):1371-1384, 2008.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

[7] J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
Speech Communication, 51:317-329, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulting discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI-Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system.

[8] Q. Le and S. Bengio.
Hybrid generative-discriminative models for speech and speaker recognition.
Technical Report IDIAP-RR 02-06, IDIAP, 2002.
.ps.gz | .pdf | .djvu | abstract]
Generative probability models such as Hidden Markov Models are usually used for modeling sequences of data because of their ability to handle variable size sequences and missing information. On the other hand, because of their discriminative properties, discriminative models like Support Vector Machines (SVMs) usually yield better performance in classification problems and can construct flexible decision boundaries. An ideal classifier should have all the power of these two complementary approaches. A series of recent papers has suggested some techniques for mixing generative models and discriminative models. In one of them a fixed size vector (the Fisher score) containing sufficient statistics of a sequence is computed for a previously trained HMM and can then be used as input to a discriminative model for classification. The purpose of this project is thus to study, experiment, enhance and adapt these new approaches of integrating discriminative models such as SVM into generative models for sequence processing problems, such as speaker and speech recognition.

[9] Q. Le and S. Bengio.
Client dependent GMM-SVM models for speaker verification.
In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443-451. Springer Verlag, 2003.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability to handle variable size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A cross-validation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.

[10] J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to text-independent speaker verification systems.
Pattern Recognition, 40:2315-2324, 2007.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to state-of-the-art GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. Most of the time, the new system outperforms the other systems by a statistically significant margin. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.

[11] V. Popovici, S. Bengio, and J.-P. Thiran.
Kernel matching pursuit for large datasets.
Pattern Recognition, 38(12):2385-2390, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Kernel Matching Pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of some basis functions selected from a kernel-induced dictionary. Here we propose a modification of the Kernel Matching Pursuit algorithm that aims at making the method practical for large datasets. Starting from an approximating algorithm, the Weak Greedy Algorithm, we introduce a stochastic method for reducing the search space at each iteration. Then we study the implications of using an approximate algorithm and we show how one can control the trade-off between the accuracy and the need for resources. Finally we present some experiments performed on a large dataset that support our approach and illustrate its applicability.

[12] A. Pozdnoukhov and S. Bengio.
From samples to objects in kernel methods.
Technical Report IDIAP-RR 03-29, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
This paper presents a general method for incorporating prior knowledge into kernel methods. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. Two implementation techniques of this method, based on analytical kernel jittering and the vicinal risk minimization principle, are considered. Empirical results on one artificial dataset and one real dataset based on EEG signals demonstrate the performance of the proposed method.

[13] A. Pozdnoukhov and S. Bengio.
Tangent vector kernels for invariant image classification with SVMs.
In International Conference on Pattern Recognition, ICPR, volume 3, pages 486-489, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper presents an application of the general sample-to-object approach to the problem of invariant image classification. The approach results in defining new SVM kernels based on tangent vectors that take into account prior information on known invariances. Real data of face images are used for experiments. The presented approach integrates virtual sample and tangent distance methods. We observe a significant increase in performance with respect to standard approaches. The experiments also illustrate (as expected) that prior knowledge becomes more important as the amount of training data decreases.

[14] A. Pozdnoukhov and S. Bengio.
Improving kernel classifiers for object categorization problems.
In International Conference on Machine Learning, ICML, Workshop on Learning with Partially Classified Training Data, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents an approach for improving the performance of kernel classifiers applied to object categorization problems. The approach is based on the use of distributions centered around each training point, which are exploited for inter-class invariant image representation with local invariant features. Furthermore, we propose an extensive use of unlabeled images for improving the SVM-based classifier.

[15] A. Pozdnoukhov and S. Bengio.
Invariances in kernel methods: From samples to objects.
Pattern Recognition Letters, 27(10):1087-1097, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a general method for incorporating prior knowledge into kernel methods such as support vector machines. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. A number of implementation techniques of this method, based on hard geometrical objects and soft objects based on distributions, are considered. Tangent vectors are extensively used for object construction. Empirical results on one artificial dataset and two real datasets of electro-encephalogram signals and face images demonstrate the usefulness of the proposed method. The method could establish a foundation for information retrieval and person identification systems.

[16] A. Pozdnoukhov and S. Bengio.
Semi-supervised kernel methods for regression estimation.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2006.
.ps.gz | .pdf | .djvu | abstract]
The paper presents a semi-supervised kernel method for regression estimation in the presence of unlabeled patterns. The method exploits a recently proposed data-dependent kernel which is constructed in order to represent the inner geometry of the data. This kernel is implemented into Kernel Regression methods (SVR, KRR). Experimental results aim to highlight the properties of the method and its advantages as compared to fully supervised approaches. The influence of the parameters on the model properties was evaluated experimentally. One artificial and two real-world datasets were used to demonstrate the performance of the proposed algorithm.

Ranking

[1] S. Bengio.
Large scale visual semantic extraction.
In Frontiers of Engineering - Reports on Leading-Edge Engineering from the 2011 Symposium, 2012.
weblink | abstract]
Image annotation is the task of providing textual semantics to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a so-called "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visio-semantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semantic-only trees, such as WordNet, which do not take into account the visual appearance of concepts.

[2] S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve.
In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the so-called break-even point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to adequately use a non-parametric statistical test in order to produce EPCs with confidence intervals or assess the statistically significant difference between two models under various settings.

[3] S. Bengio, F. Pereira, Y. Singer, and D. Strelow.
Group sparse coding.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
.ps.gz | .pdf | .djvu | abstract]
Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.

[4] G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Large-scale content-based audio retrieval from text queries.
In ACM International Conference on Multimedia Information Retrieval, MIR, 2008.
.ps.gz | .pdf | .djvu | abstract]
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather than sound-sample-based queries, (2) searches by audio content rather than via textual metadata, and (3) can scale to a very large number of audio documents and a very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches: Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed user-labeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.

[5] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large-scale online learning of image similarity through ranking: Extended abstract.
In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address real-world large-scale AI problems, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that query-independent similarity could be accurately learned even for large-scale datasets that could not be handled before.

[6] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning.
In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands at the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of non-zero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than state-of-the-art methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The non-metric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.

[7] G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking.
Journal of Machine Learning Research, JMLR, 11:1109-1135, 2010.
.ps.gz | .pdf | .djvu | abstract]
Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passive-aggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing state-of-the-art methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.

[8] O. Glickman, I. Dagan, M. Keller, S. Bengio, and W. Daelemans.
Investigating lexical substitution scoring for subtitle generation.
In Tenth Conference on Computational Natural Language Learning, CONLL, 2006.
.ps.gz | .pdf | .djvu | abstract]
This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state-of-the-art scoring methods that predict the validity of a given substitution. The paper evaluates two context-independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two WordNet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.

[9] D. Grangier and S. Bengio.
Exploiting hyperlinks to learn a retrieval model.
In Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005.
.ps.gz | .pdf | .djvu | abstract]
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d in C according to their similarity with respect to q, sim(q, d). The identification of an effective function (a, b) -> sim(a, b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label; thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied over a non-hyperlinked retrieval corpus.

[10] D. Grangier and S. Bengio.
Inferring document similarity from hyperlinks.
In Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function (a, b) -> sim(a, b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.

[11] D. Grangier and S. Bengio.
A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):1371-1384, 2008.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.

[12] M. Keller and S. Bengio.
Textual data representation.
Technical Report IDIAP-RR 03-49, IDIAP, 2003.
.ps.gz | .pdf | .djvu | abstract]
We address in this report the problem of formally representing textual data. First, this problem is placed in the context of automatic text processing. Then, the weaknesses of the basic document representation, i.e. the bag-of-words representation, are explained and some state-of-the-art methods claiming to overcome these weaknesses are reviewed. Moreover, we propose a novel graphical model, the Theme Topic Mixture Model, which also claims to do so, in addition to providing a probabilistic framework in which documents are considered.

[13] M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation.
In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
In Automatic Text Processing tasks, documents are usually represented in the bag-of-word space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.

[14] M. Keller and S. Bengio.
A neural network for text representation.
In Proceedings of the 15th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 3697, pages 667-672. Springer-Verlag, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, non-probabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.

[15] R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik.
Sound retrieval and ranking using sparse auditory representations.
Neural Computation, 22(9):2390-2416, 2010.
.ps.gz | .pdf | .djvu | weblink | abstract]
To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a sound-ranking framework to quantitatively evaluate such representations in a large scale task. We have adapted a machine-vision method, the “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large query-term space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. We tested auditory models that use adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional Mel-Frequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top-1 was 73% and the average precision was 35%, reflecting an 18% improvement over the best competing MFCC frontend.

[16] M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparse-code representations.
In ICML 2009 Workshop on Sparse Methods for Music Audio, 2009.
.ps.gz | .pdf | .djvu | abstract]
The task of ranking sounds from text queries is a good test application for machine-hearing techniques, and particularly for comparison and evaluation of alternative sound representations in a large-scale setting. We have adapted a machine-vision system, “passive-aggressive model for image retrieval” (PAMIR), which efficiently learns, using a ranking-based cost function, a linear mapping from a very large sparse feature space to a large query-term space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from high-dimensional auditory images. In addition to two main auditory-image models, we also include and compare a family of more conventional Mel-Frequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vector-quantized MFCCs. The two auditory models tested use the adaptive pole-zero filter cascade (PZFC) auditory filterbank and sparse-code feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precision-at-top-k performance measures, the best results are about 72% top-1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

[17] J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint word-image embeddings.
In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD, 2010.
Best Paper Award in Machine Learning [ .ps.gz | .pdf | .djvu | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[18] J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation.
In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011.
.ps.gz | .pdf | .djvu | abstract]
Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

Geostatistics

[1] N. Gilardi and S. Bengio.
Local machine learning models for spatial data analysis.
Journal of Geographic Information and Decision Analysis, 4(1):11-28, 2000.
.ps.gz | .pdf | .djvu | weblink | abstract]
In this paper, we compare different machine learning algorithms applied to non stationary spatial data analysis. We show that models taking into account local variability of the data are better than models which are trained globally on the whole dataset. Two global models (Support Vector Regression and Multilayer Perceptrons) and two local models (a local version of Support Vector Regression and Mixture of Experts) were compared over the Spatial Interpolation Comparison 97 (SIC97) dataset, and the results are presented and compared to previous results obtained on the same dataset.

[2] N. Gilardi and S. Bengio.
Comparison of four machine learning algorithms for spatial data analysis.
In G. Dubois, J. Malczewski, and M. DeCort, editors, Mapping radioactivity in the environment - Spatial Interpolation Comparison 97, pages 222-237. Office for Official Publications of the European Communities, Luxembourg, 2003.
.ps.gz | .pdf | .djvu | abstract]
This chapter proposes a clear methodology on how to use machine learning algorithms for spatial data analysis in order to avoid any bias and eventually obtain fair estimation of their performance on new data. Four different machine learning algorithms are presented, namely multilayer perceptrons (MLP), mixture of experts (ME), support vector regression (SVR) and a local version of the latter (local SVR). Evaluation criteria adapted to geostatistical problems are also presented in order to compare adequately different models on the same dataset. Finally, an experimental comparison is given on the SIC97 dataset as well as an analysis of the results.

[3] N. Gilardi and S. Bengio.
Machine learning for automatic environmental mapping: when and how?
In G. Dubois, editor, Automatic mapping algorithms for routine and emergency monitoring data. Report on the Spatial Interpolation Comparison (SIC2004) exercise, pages 123-138. Office for Official Publications of the European Communities, Luxembourg, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
This paper discusses the opportunity of using Machine Learning techniques in an automatic environmental mapping context, as was the case for the SIC2004 exercise. First, the Machine Learning methodology is quickly described and compared to Geostatistics. From there, some clues about when to apply Machine Learning are proposed, and what outcomes can be expected from this choice. Finally, three well known regression algorithms: K-Nearest Neighbors, Multi Layer Perceptron and Support Vector Regression, are used on SIC2004 data in a Machine Learning context, and compared to Ordinary Kriging. This illustrates some potential drawbacks of SVR and MLP for applications such as SIC2004.

[4] N. Gilardi, S. Bengio, and M. Kanevski.
Conditional Gaussian mixture models for environmental risk mapping.
In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 777-786, 2002.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper proposes the use of Gaussian Mixture Models to estimate conditional probability density functions in an environmental risk mapping context. A conditional Gaussian Mixture Model has been compared to the geostatistical method of Sequential Gaussian Simulations and shows good performance in reconstructing local PDFs. The data sets used for this comparison are parts of the digital elevation model of Switzerland.

Time Series

[1] S. Bengio, F. Fessant, and D. Collobert.
A connectionist system for medium-term horizon time series prediction.
In International Workshop on Applications of Neural Networks to Telecommunications, IWANNT, Stockholm, Sweden, 1995.
.ps.gz | .pdf | .djvu ]

[2] S. Bengio, F. Fessant, and D. Collobert.
Use of modular architectures for time series prediction.
Neural Processing Letters, 3(2):101-106, 1996.
.ps.gz | .pdf | .djvu | weblink ]

[3] F. Fessant, S. Bengio, and D. Collobert.
On the prediction of solar activity using different neural network models.
Annales Geophysicae, 14:20-26, 1996.
.ps.gz | .pdf | .djvu | weblink ]

[4] A. Gravey, S. Bengio, D. Collobert, and F. Clerot.
Utilisation de techniques de prédiction neuromimétiques pour la négotiation dynamique des paramètres de contrat de trafic dans un réseau ATM.
Technical Report NT/LAA/EIA/132, France Télécom CNET, Lannion, France, 1996.

Ensembles

[1] C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, volume 5, pages 621-624, 2004.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.

[2] C. Dimitrakakis and S. Bengio.
Online policy adaptation for ensemble classifiers.
In European Symposium on Artificial Neural Networks, ESANN, 2004.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.

[3] C. Dimitrakakis and S. Bengio.
Boosting word error rates.
In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP, pages 501-504, 2005.
.ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the well-known Numbers 95 database that indicate the importance of this temporal probability distribution.

[4] C. Dimitrakakis and S. Bengio.
Online adaptive policies for ensemble classifiers.
Neurocomputing, 64:211-221, 2005.
.ps.gz | .pdf | .djvu | weblink | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper we attempt to train and combine the base classifiers using an adaptive policy. This policy is learnt through a Q-learning inspired technique. Its effectiveness for an essentially supervised task is demonstrated by experimental results on several UCI benchmark databases.

Graphical Models

[1] S. Chiappa and S. Bengio.
HMM and IOHMM modeling of EEG rhythms for asynchronous BCI systems.
In European Symposium on Artificial Neural Networks, ESANN, 2004.
.ps.gz | .pdf | .djvu | idiap-RR | abstract]
We compare the use of two Markovian models, HMMs and IOHMMs, to discriminate between three mental tasks for brain computer interface systems using an asynchronous protocol. We show that IOHMMs outperform HMMs but that, probably due to the lack of any prior information on the state dynamics, no practical advantage in the use of these models over their static counterparts is obtained.

[2] M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation.
In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004.
.ps.gz | .pdf | .djvu | weblink | abstract]
In Automatic Text Processing tasks, documents are usually represented in the bag-of-word space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.

[3] J.-F. Paiement, S. Bengio, and D. Eck.
Probabilistic models for melodic prediction.
Artificial Intelligence Journal, 173(14):1266-1274, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.

[4] J.-F. Paiement, D. Eck, and S. Bengio.
A probabilistic model for chord progressions.
In International Conference on Music Information Retrieval, ISMIR, 2005.
.ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[5] J.-F. Paiement, D. Eck, and S. Bengio.
Probabilistic melodic harmonization.
In L. Lamontagne and M. Marchand, editors, Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, volume LNCS 4013, pages 218-229. Springer-Verlag, 2006.
.ps.gz | .pdf | .djvu | weblink | abstract]
We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural component in chord notation. We show empirically that root note progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root note progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.

[6] J.-F. Paiement, D. Eck, S. Bengio, and D. Barber.
A graphical model for chord progressions embedded in a psychoacoustic space.
In International Conference on Machine Learning, ICML, 2005.
.ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[7] J.-F. Paiement, Y. Grandvalet, and S. Bengio.
Predictive models for music.
Connection Science, 21(2 & 3):253-272, 2009.
.ps.gz | .pdf | .djvu | weblink | abstract]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce predictive models for melodies. We decompose melodic modeling into two subtasks. We first propose a rhythm model based on the distributions of distances between subsequences. Then, we define a generative model for melodies given chords and rhythms based on modeling sequences of Narmour features. The rhythm model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. Using a similar evaluation procedure, the proposed melodic model consistently outperforms an Input/Output Hidden Markov Model. Furthermore, these models are able to generate realistic melodies given appropriate musical contexts.

[8] J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A generative model for rhythms.
In NIPS Workshop on Brain, Music and Cognition, 2007.
.ps.gz | .pdf | .djvu | abstract]
Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[9] J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A distance model for rhythms.
In International Conference on Machine Learning, ICML, 2008.
.ps.gz | .pdf | .djvu | abstract]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[10] D. Zhang, D. Gatica-Perez, S. Bengio, and D. Roy.
Learning influence among interacting Markov chains.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
.ps.gz | .pdf | .djvu | abstract]
We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. The individual level models the actions of each player, and the group level models the actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model.

Deep Learning

[1] S. Bengio, L. Deng, H. Larochelle, H. Lee, and R. Salakhutdinov.
Guest editors' introduction: Special section on learning deep architectures.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35:1795-1797, 2013.
[.ps.gz | .pdf | .djvu | weblink]

[2] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pre-training help deep learning?
Journal of Machine Learning Research, JMLR, 11:625-660, 2010.
[.ps.gz | .pdf | .djvu | weblink | abstract]
Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of auto-encoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pre-training phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pre-training work? Answering this question is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pre-training. The results suggest that unsupervised pre-training guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pre-training.

[3] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent.
The difficulty of training deep architectures and the effect of unsupervised pre-training.
In D. van Dyk and M. Welling, editors, Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 5 of JMLR Workshop and Conference Proceedings, pages 153-160, 2009.
[.ps.gz | .pdf | .djvu | weblink | abstract]
Whereas theoretical work suggests that deep architectures might be more efficient at representing highly-varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pre-training in terms of optimization and its role as a regularizer. We empirically show the influence of pre-training with respect to architecture depth, model capacity, and number of training examples.