By Year  2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 before 2000 
By Topic  Large Scale Learning  Kernel Machines  Ensembles  Ranking  Biometric Authentication  Multimodal 
Handwritten Recognition  Time Series  Geostatistics  Speech  Graphical Models  Deep Learning 
[1] 
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio.
Generating sentences from a continuous space. In SIGNLL Conference on Computational Natural Language Learning, CONLL, 2016. [ .ps.gz  .pdf  .djvu  abstract] The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnnbased variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and highlevel syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and wellformed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.

[2] 
J. Chen, R. Monga, S. Bengio, and R. Jozefowicz.
Revisiting distributed synchronous SGD. In Workshop Track of the International Conference on Learning Representations, ICLR, 2016. [ .ps.gz  .pdf  .djvu  weblink ] 
[3] 
M. Cissé, M. AlShedivat, and S. Bengio.
ADIOS: Architectures deep in output space. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016. [ .ps.gz  .pdf  .djvu  abstract] Multilabel classification is a generalization of binary classification where the task consists in predicting sets of labels. With the availability of ever larger datasets, the multilabel setting has become a natural one in many applications, and the interest in solving multilabel problems has grown significantly. As expected, deep learning approaches are now yielding stateoftheart performance for this class of problems. Unfortunately, they usually do not take into account the often unknown but nevertheless rich relationships between labels. In this paper, we propose to make use of this underlying structure by learning to partition the labels into a Markov Blanket Chain and then applying a novel deep architecture that exploits the partition. Experiments on several popular and large multilabel datasets demonstrate that our approach not only yields significant improvements, but also helps to overcome tradeoffs specific to the multilabel classification setting.

[4] 
L. Dinh, J. SohlDickstein, and S. Bengio.
Density estimation using real nvp. ArXiv, 1605.08803, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Unsupervised learning of probabilistic models is a central yet challenging problem in machine learning. Specifically, designing models with tractable learning, sampling, inference and evaluation is crucial in solving this task. We extend the space of such models using realvalued nonvolume preserving (real NVP) transformations, a set of powerful invertible and learnable transformations, resulting in an unsupervised learning algorithm with exact loglikelihood computation, exact sampling, exact inference of latent variables, and an interpretable latent space. We demonstrate its ability to model natural images on four datasets through sampling, loglikelihood evaluation and latent variable manipulations.

[5] 
G. Heigold, I. Moreno, S. Bengio, and N. Shazeer.
Endtoend textdependent speaker verification. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we present a datadriven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domainspecific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for textdependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easytomaintain systems with a small footprint.

[6] 
N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio.
A neural transducer. In Advances In Neural Information Processing Systems, NIPS, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Sequencetosequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequencetosequence models, the Neural Transducer computes the nextstep distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.

[7] 
A. Kurakin, I. Goodfellow, and S. Bengio.
Adversarial examples in the physical world. ArXiv, 1607.02533, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work have assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those which are using signals from cameras and other sensors as an input. This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from cellphone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera.

[8] 
J. Lee, S. Kim, G. Lebanon, Y. Singer, and S. Bengio.
LLORMA: Local lowrank matrix approximation. Journal of Machine Learning Research, JMLR, 17:124, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is lowrank. In this paper, we propose, analyze, and experiment with two procedures, one parallel and the other global, for constructing local matrix approximations. The two approaches approximate the observed matrix as a weighted sum of lowrank matrices. These matrices are limited to a local region of the observed matrix. We analyze the accuracy of the proposed local lowrank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.

[9] 
M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster, Y. Wu, and
D. Schuurmans.
Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, NIPS, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] A key problem in structured output prediction is direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient approach to incorporate task reward into a maximum likelihood framework. We establish a connection between the loglikelihood and regularized expected reward objectives, showing that at a zero temperature, they are approximately equivalent in the vicinity of the optimal solution. We show that optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated (temperature adjusted) rewards. Based on this observation, we optimize conditional logprobability of edited outputs that are sampled proportionally to their scaled exponentiated reward. We apply this framework to optimize edit distance in the output label space. Experiments on speech recognition and machine translation for neural sequence to sequence models show notable improvements over a maximum likelihood baseline by using edit distance augmented maximum likelihood

[10] 
O. Vinyals, S. Bengio, and M. Kudlur.
Order matters: Sequence to sequence for sets. In International Conference on Learning Representations, ICLR, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Sequences have become first class citizens in supervised learning thanks to the resurgence of recurrent neural networks. Many complex tasks that require mapping from or to a sequence of observations can now be formulated with the sequencetosequence (seq2seq) framework which employs the chain rule to efficiently represent the joint probability of sequences. In many cases, however, variable sized inputs and/or outputs might not be naturally expressed as sequences. For instance, it is not clear how to input a set of numbers into a model where the task is to sort them; similarly, we do not know how to organize outputs when they correspond to random variables and the task is to model their unknown joint probability. In this paper, we first show using various examples that the order in which we organize input and/or output data matters significantly when learning an underlying model. We then discuss an extension of the seq2seq framework that goes beyond sequences and handles input sets in a principled way. In addition, we propose a loss which, by searching over possible orders during training, deals with the lack of structure of output sets. We show empirical evidence of our claims regarding ordering, and on the modifications to the seq2seq framework on benchmark language modeling and parsing tasks, as well as two artificial tasks  sorting numbers and estimating the joint probability of unknown graphical models.

[1] 
S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer.
Scheduled sampling for sequence prediction with recurrent neural networks. In Advances In Neural Information Processing Systems, NIPS, 2015. [ .ps.gz  .pdf  .djvu  weblink  abstract] Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists of maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, the unknown previous token is then replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence. We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. Experiments on several sequence prediction tasks show that this approach yields significant improvements. Moreover, it was used successfully in our winning entry to the MSCOCO image captioning challenge, 2015.

[2] 
V. Ramanathan, J. Deng, C. Li, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio,
C. Rosenberg, and F.F. Li.
Learning semantic relationships for better action retrieval in images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015. [ .ps.gz  .pdf  .djvu  abstract] Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationship between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We would like our method to reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationship between actions and uses them for training better action recognition models. Our model incorporates linguistic, visual and logical consistency based cues to effectively identify theses relationships. We train and test our model on a new largescale image dataset of human actions under two settings with 27K and 2K actions. We show a significant improvement in mean AP compared to different baseline methods including the stateoftheart HEXgraph approach from Deng et al.

[3] 
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan.
Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015. [ .ps.gz  .pdf  .djvu  weblink  abstract] Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current stateoftheart BLEU1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU4 of 27.7, which is the current stateoftheart.

[1] 
S. Bengio and G. Heigold.
Word embeddings for speech recognition. In Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech, 2014. [ .ps.gz  .pdf  .djvu  abstract] Speech recognition systems have used the concept of states as a way to decompose words into subword units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still allow to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.

[2] 
J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and
H. Adam.
Largescale object classification using label relation graphs. In Proceedings of the European Conference on Computer Vision, ECCV, 2014. [ .ps.gz  .pdf  .djvu  abstract] In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a largescale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.

[3] 
M. R. Gupta, S. Bengio, and J. Weston.
Training highly multiclass classifiers. Journal of Machine Learning Research, JMLR, 15:14611492, 2014. [ .ps.gz  .pdf  .djvu  weblink  abstract] Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the moreconfusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to onevsall linear SVMs and Wsabie.

[4] 
J. Lee, S. Bengio, S. Kim, G. Lebanon, and Y. Singer.
Local collaborative ranking. In International World Wide Web Conference, WWW, 2014. [ .ps.gz  .pdf  .djvu  abstract] Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Collaborative filtering approaches to recommendation systems typically assume that the rating matrix (e.g., movie ratings by viewers) is lowrank. In this paper, we examine an alternative approach in which the rating matrix is locally lowrank. Concretely, we assume that the rating matrix is lowrank within certain neighborhoods of the metric space defined by (user, item) pairs. We combine a recent approach for local lowrank approximation based on the Frobenius norm with a general empirical risk minimization for ranking losses. Our experiments indicate that the combination of a mixture of local lowrank matrices each of which was trained to minimize a ranking loss outperforms many of the currently used stateoftheart recommendation systems. Moreover, our method is easy to parallelize, making it a viable approach for large scale realworld rankbased recommendation systems.

[5] 
M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S.
Corrado, and J. Dean.
Zeroshot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, ICLR, 2014. [ .ps.gz  .pdf  .djvu  abstract] Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional classification framing of image understanding, particularly in terms of the promise for zeroshot learning  the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing image classifier and a semantic word embedding model, which contains the class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of and indeed outperforms state of the art methods on the ImageNet zeroshot learning task.

[1] 
S. Bengio, J. Dean, D. Erhan, E. Ie, Q. Le, A. Rabinovich, J. Shlens, and
Y. Singer.
Using web cooccurrence statistics for improving image categorization. ArXiv, 1312.5697, 2013. [ .ps.gz  .pdf  .djvu  weblink  abstract] Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural to expect not to see an elephant to appear in the middle of an ocean. We consider a simple approach to encapsulate such common sense knowledge using cooccurrence statistics from web documents. By merely counting the number of times nouns (such as elephants, sharks, oceans, etc.) cooccur in web documents, we obtain a good estimate of expected cooccurrences in visual data. We then cast the problem of combining textual cooccurrence statistics with the predictions of imagebased classifiers as an optimization problem. The resulting optimization problem serves as a surrogate for our inference procedure. Albeit the simplicity of the resulting optimization problem, it is effective in improving both recognition and localization accuracy. Concretely, we observe significant improvements in recognition and localization rates for both ImageNet Detection 2012 and Sun 2012 datasets.

[2] 
S. Bengio, L. Deng, H. Larochelle, H. Lee, and R. Salakhutdinov.
Guest editors' introduction: Special section on learning deep architectures. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35:17951797, 2013. [ .ps.gz  .pdf  .djvu  weblink ] 
[3] 
H. Elmlund, D. Elmlund, and S. Bengio.
PRIME: Probabilistic initial 3d model generation for singleparticle cryoelectron microscopy. Structure, 21:12991306, 2013. [ .ps.gz  .pdf  .djvu  weblink  abstract] Lowdose electron microscopy of cryopreserved individual biomolecules (singleparticle cryoEM) is a powerful tool for obtaining information about the structure and dynamics of large macromolecular assemblies. Acquiring images with low dose reduces radiation damage, preserves atomic structural details, but results in low signaltonoise ratio of the individual images. The projection directions of the twodimensional images are random and unknown. The grand challenge is to achieve the precise threedimensional (3D) alignment of many (tens of thousands to millions) noisy projection images, which may then be combined to obtain a faithful 3D map. An accurate initial 3D model is critical for obtaining the precise 3D alignment required for highresolution (<10 Å) map reconstruction. We report a method (PRIME) that, in a single step and without prior structural knowledge, can generate an accurate initial 3D map directly from the noisy images.

[4] 
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and
T. Mikolov.
DeViSE: A deep visualsemantic embedding model. In Advances In Neural Information Processing Systems, NIPS, 2013. [ .ps.gz  .pdf  .djvu  abstract] Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources  such as text data  both to train visual models and to constrain their predictions. In this paper we present a new deep visualsemantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches stateoftheart performance on the 1000class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zeroshot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.

[5] 
M. Stevens, S. Bengio, and Y. Singer.
Efficient learning of sparse ranking functions. In B. Scholkopf, Z. Luo, and V. Vovk, editors, Empirical Inference. Springer, 2013. [ weblink  abstract] Algorithms for learning to rank can be inefficient when they employ risk functions that use structural information. We describe and analyze a learning algorithm that efficiently learns a ranking function using a domination loss. This loss is designed for problems in which we need to rank a small number of positive examples over a vast number of negative examples. In that context, we propose an efficient coordinate descent approach that scales linearly with the number of examples. We then present an extension that incorporates regularization thus extending Vapnik¿s notion of regularized empirical risk minimization to ranking learning. We also discuss an extension to the case of multivalues feedback. Experiments performed on several benchmark datasets and large scale Google internal dataset demonstrate the effectiveness of learning algorithm in constructing compact models while retaining the empirical performance accuracy.

[1] 
S. Bengio.
Large scale visual semantic extraction. In Frontiers of Engineering  Reports on LeadingEdge Engineering from the 2011 Symposium, 2012. [ weblink  abstract] Image annotation is the task of providing textual semantic to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a socalled "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visiosemantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semanticonly trees, such as WordNet, which do not take into account the visual appearance of concepts.

[1] 
C. Dimitrakakis and S. Bengio.
Phoneme and sentencelevel ensembles for speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2011, 2011. [ .ps.gz  .pdf  .djvu  weblink  abstract] We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phonemelevel bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.

[2] 
J. Weston, S. Bengio, and P. Hamel.
Multitasking with joint semantic spaces for largescale music annotation and retrieval. Journal of New Music Research, 40:337348, 2011. [ .ps.gz  .pdf  .djvu  weblink  abstract] Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single lowdimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multitask learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.

[3] 
J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011. [ .ps.gz  .pdf  .djvu  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

[1] 
S. Bengio.
Statistical machine learning for HCI. In J.P. Thiran, F. Marqués, and H. Bourlard, editors, Multimodal Signal Processing: Theory and Applications for HumanComputer Interaction, pages 723. Academic Press, 2010. [ weblink  abstract] This chapter introduces the main concepts of statistical machine learning, as they are pivot in most algorithms tailored for multimodal signal processing. In particular, the chapter will cover a general introduction to machine learning and how it is used in classification, regression and density estimation. Following this introduction, two particularly well known models will be presented, together with their associated learning algorithm: support vector machines, which are wellknown for classification tasks, and hidden Markov models, which are tailored for sequence processing tasks such as speech recognition.

[2] 
S. Bengio, J. Weston, and D. Grangier.
Label embedding trees for large multiclass tasks. In Advances in Neural Information Processing Systems, NIPS, 2010. [ .ps.gz  .pdf  .djvu  abstract] Multiclass classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than nonembedding approaches and has superior accuracy to Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including OnevsRest while being orders of magnitude faster.

[3] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, JMLR, 11:11091135, 2010. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passiveaggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing stateoftheart methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.

[4] 
D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pretraining help deep learning? Journal of Machine Learning Research, JMLR, 11:625660, 2010. [ .ps.gz  .pdf  .djvu  weblink  abstract] Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of autoencoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pretraining phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pretraining work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pretraining with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pretraining. The results suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pretraining.

[5] 
R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik.
Sound retrieval and ranking using sparse auditory representations. Neural Computation, 22(9):23902416, 2010. [ .ps.gz  .pdf  .djvu  weblink  abstract] To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a soundranking framework to quantitatively evaluate such representations in a large scale task. We have adapted a machinevision method, the “passiveaggressive model for image retrieval” (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large queryterm space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from highdimensional auditory images. We tested auditory models that use adaptive polezero filter cascade (PZFC) auditory filterbank and sparsecode feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional MelFrequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vectorquantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top1 was 73% and the average precision was 35%, reflecting a 18% improvement over the best competing MFCC frontend.

[6] 
J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint wordimage embeddings. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECMLPKDD, 2010. Best Paper Award in Machine Learning [ .ps.gz  .pdf  .djvu  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[7] 
J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint wordimage embeddings. Machine Learning Journal, 81(1):2135, 2010. [ .ps.gz  .pdf  .djvu  weblink  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[1] 
S. Bengio and J. Keshet.
Introduction. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 310. Wiley, 2009. [ weblink  abstract] One of the most natural communication tools used by humans is their voice. It is hence natural that a lot of research has been devoted to analyzing and understanding human uttered speech for various applications. The most obvious one is automatic speech recognition, where the goal is to transcribe a recorded speech utterance into its corresponding sequence of words. Other applications include speaker recognition, where the goal is to determine either the claimed identity of the speaker (verification) or who is speaking (identification), and speaker segmentation or diarization, where the goal is to segment an acoustic sequence in terms of the underlying speakers (such as during a dialog). Although an enormous amount of research has been devoted to speech processing, there appears to be some form of local optimum in terms of the fundamental tools used to approach these problems. The aim of this book is to introduce the speech researcher community to radically different approaches based on more recent kernel based machine learning methods. In this introduction, we first briefly review the predominant speech processing approach, based on hidden Markov models, as well as its known problems; we then introduce the most well known kernel based approach, the Support Vector Machine (SVM), and finally outline the various contributions of this book.

[2] 
S. Bengio, F. Pereira, Y. Singer, and D. Strelow.
Group sparse coding. In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009. [ .ps.gz  .pdf  .djvu  abstract] Bagofwords document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearestneighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixednorm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.

[3] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Largescale online learning of image similarity through ranking: Extended abstract. In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address realworld largescale AI problem, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passiveaggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing stateoftheart methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants, provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that queryindependent similarity could be accurately learned even for largescale datasets that could not be handled before.

[4] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning. In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of nonzero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than stateoftheart methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.

[5] 
D. Erhan, P.A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent.
The difficulty of training deep architectures and the effect of unsupervised pretraining. In D. van Dyk and M. Wellings, editors, Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 5 of JMLR Workshop and Conference Procedings, pages 153160, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] Whereas theoretical work suggests that deep architectures might be more efficient at representing highlyvarying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pretraining. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pretraining. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pretraining in terms of optimization and its role as a regularizer. We empirically show the influence of pretraining with respect to architecture depth, model capacity, and number of training examples.

[6] 
D. Grangier, J. Keshet, and S. Bengio.
Discriminative keyword spotting. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 175194. Wiley, 2009. [ weblink  abstract] This chapter introduces a discriminative method for detecting and spotting keywords in spoken utterances. Given a word represented as a sequence of phonemes and a spoken utterance, the keyword spotter predicts the best time span of the phoneme sequence in the spoken utterance along with a confidence. If the prediction confidence is above certain level the keyword is declared to be spoken in the utterance within the predicted time span, otherwise the keyword is declared as not spoken. The problem of keyword spotting training is formulated as a discriminative task where the model parameters are chosen so the utterance in which the keyword is spoken would have higher confidence than any other spoken utterance in which the keyword is not spoken. It is shown theoretically and empirically that the proposed training method resulted with a high area under the Receiver Operating Characteristic (ROC) curve, the most common measure to evaluate keyword spotters. We present an iterative algorithm to train the keyword spotter efficiently. The proposed approach contrasts with standard spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a loss directly related to the spotting performance. Several experiments performed on TIMIT and WSJ corpora show the advantage of our approach over HMMbased alternatives.

[7] 
J. Keshet and S. Bengio, editors.
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. Wiley, 2009. [ weblink  abstract] This is the first book dedicated to uniting research related to speech and speaker recognition based on the recent advances in large margin and kernel methods. The first part of the book presents theoretical and practical foundations of large margin and kernel methods, from Support Vector Machines to large margin methods for structured learning. The second part of the book is dedicated to acoustic modeling of continuous speech recognizers, where the grounds for practical large margin sequence learning are set. The third part introduces large margin methods for discriminative language modeling. The last part of the book is dedicated to the application of keyword spotting, speaker verification and spectral clustering. The book is an important reference to researchers and practitioners in the field of modern speech and speaker recognition. The purpose of the book is twofold: first, to set the theoretical foundation of large margin and kernel methods relevant to the speech recognition domain; second, to propose a practical guide on implementation of these methods to the speech recognition domain. The reader is presumed to have basic knowledge of large margin and kernel methods and of basic algorithms in speech and speaker recognition.

[8] 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting. Speech Communication, 51:317329, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vectorspace, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulted discriminative system outperforms the conventional contextindependent HMMbased system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGIStories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional contextindependent HMMbased system.

[9] 
J. Mariéthoz, S. Bengio, and Y. Grandvalet.
Kernelbased textindependent speaker verification. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 195220. Wiley, 2009. [ weblink  abstract] The goal of a person authentication system is to certify/attest the claimed identity of a user. When this authentication is based on the voice of the user, without respect to what the user exactly said, the system is called a textindependent speaker verification system. Speaker verification systems are increasingly often used to secure personal information, particularly for mobile phone based applications. Furthermore, textindependent versions of speaker verification systems are most used for their simplicity, as they do not require complex speech recognition modules. The most common approach to this task is based on Gaussian Mixture Models (GMMs) (Reynolds et al. 2000), which do not take into account any temporal information. GMMs have been intensively used thanks to their good performance, especially with the use of the Maximum a posteriori (MAP) (Gauvain and Lee 1994) adaptation algorithm. This approach is based on the density estimation of an impostor data distribution, followed by its adaptation to a specific client data set. Note that the estimation of these densities is not the final goal of speaker verification systems, which is rather to discriminate the client and impostor classes; hence discriminative approaches might appear good candidates for this task as well. As a matter of fact, Support Vector Machine (SVM) based systems have been the subject of several recent publications in the speaker verification community, in which they obtain Ma performance similar to or even better than GMMs on several textindependent speaker verification tasks. In order to use SVMs or any other discriminant approaches for speaker verification, several modifications of the classical techniques need to be performed. The purpose of this chapter is to present an overview of discriminant approaches that have been used successfully for the task of textindependent speaker verification, to analyze their differences from and their similarities to each other and to classical generative approaches based on GMMs. An opensource version of the C++ source code used to performed all experiments described in this chapter can be found at http://speaker.abracadoudou.com.

[10] 
J.F. Paiement, S. Bengio, and D. Eck.
Probabilistic models for melodic prediction. Artificial Intelligence Journal, 173(14):12661274, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.

[11] 
J.F. Paiement, Y. Grandvalet, and S. Bengio.
Predictive models for music. Connection Science, 21(2 & 3):253272, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] Modeling longterm dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce predictive models for melodies. We decompose melodic modeling into two subtasks. We first propose a rhythm model based on the distributions of distances between subsequences. Then, we define a generative model for melodies given chords and rhythms based on modeling sequences of Narmour features. The rhythm model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. Using a similar evaluation procedure, the proposed melodic model consistently outperforms an Input/Output Hidden Markov Model. Furthermore, these models are able to generate realistic melodies given appropriate musical contexts.

[12] 
M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparsecode representations. In ICML 2009 Workshop on Sparse Method for Music Audio, 2009. [ .ps.gz  .pdf  .djvu  abstract] The task of ranking sounds from text queries is a good test application for machinehearing techniques, and particularly for comparison and evaluation of alternative sound representations in a largescale setting. We have adapted a machinevision system, “passiveaggressive model for image retrieval” (PAMIR), which efficiently learns, using a rankingbased cost function, a linear mapping from a very large sparse feature space to a large queryterm space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from highdimensional auditory images. In addition to two main auditoryimage models, we also include and compare a family of more conventional MelFrequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vectorquantized MFCCs. The two auditory models tested use the adaptive polezero filter cascade (PZFC) auditory filterbank and sparsecode feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precisionattopk performance measures, the best results are about 72% top1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

[1] 
G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Largescale contentbased audio retrieval from text queries. In ACM International Conference on Multimedia Information Retrieval, MIR, 2008. [ .ps.gz  .pdf  .djvu  abstract] In contentbased audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This contentbased approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses freeform text queries rather sound sample based queries, (2) searches by audio content rather than via textual meta data, and (3) can scale to very large number of audio documents and very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passiveaggressive model for image retrieval (PAMIR), and compare it to two stateoftheart approaches; Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large realworld datasets: a collection of short sound effects, and a noisier and larger collection of usercontributed userlabeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.

[2] 
D. Grangier and S. Bengio.
A discriminative kernelbased model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):13711384, 2008. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernelbased classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over stateoftheart alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multipleword queries.

[3] 
J.F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A distance model for rhythms. In International Conference on Machine Learning, ICML, 2008. [ .ps.gz  .pdf  .djvu  abstract] Modeling longterm dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[4] 
H. PaugamMoisy, R. Martinez, and S. Bengio.
Delay learning and polychronization for reservoir computing. Neurocomputing, 71(79):11431158, 2008. [ .ps.gz  .pdf  .djvu  weblink  abstract] We propose a multitimescale learning rule for spiking neuron networks, in the line of the recently emerging field of reservoir computing. The reservoir is a network model of spiking neurons, with random topology and driven by STDP (SpikeTimeDependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that affects the synaptic delays linking the network to the readout neurons, with classification as a goal task. The network processing and the resulting performance can be explained by the concept of polychronization, proposed by Izhikevich (2006, Neural Computation, 18:2), on physiological grounds. The model emphasizes that polychronization can be used as a tool for exploiting the computational power of synaptic delays and for monitoring the topology and activity of a spiking neuron network.

[1] 
S. Bengio and J. Mariéthoz.
Biometric person authentication is a multiple classifier problem. In M. Haindl, J. Kittler, and F. Roli, editors, 7th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 4472. SpringerVerlag, 2007. [ .ps.gz  .pdf  .djvu  abstract] Several papers have already shown the interest of using multiple classifiers in order to enhance the performance of biometric person authentication systems. In this paper, we would like to argue that the core task of Biometric Person Authentication is actually a multiple classifier problem as such: indeed, in order to reach stateoftheart performance, we argue that all current systems , in one way or another, try to solve several tasks simultaneously and that without such joint training (or sharing), they would not succeed as well. We explain hereafter this perspective, and according to it, we propose some ways to take advantage of it, ranging from more parameter sharing to similarity learning.

[2] 
D. Grangier and S. Bengio.
Learning the interframe distance for discriminative templatebased keyword detection. In Proceedings of the 10th European Conference on Speech Communication and Technology, EurospeechInterspeech, 2007. [ .ps.gz  .pdf  .djvu  abstract] This paper proposes a discriminative approach to templatebased keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.

[3] 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting. In ISCA Research Workshop on Non Linear Speech Processing, NOLISP, 2007. [ .ps.gz  .pdf  .djvu  abstract] This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on nonlinearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vectorspace which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMMbased approach.

[4] 
J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to textindependent speaker verification systems. Pattern Recognition, 40:23152324, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper present a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to stateoftheart GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.

[5] 
J.F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A generative model for rhythms. In NIPS Workshop on Brain, Music and Cognition, 2007. [ .ps.gz  .pdf  .djvu  abstract] Modeling music involves capturing longterm dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[6] 
H. PaugamMoisy, R. Martinez, and S. Bengio.
A supervised learning approach based on STDP and polychronization in spiking neuron networks. In European Symposium on Artificial Neural Networks, ESANN, 2007. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] We propose a network model of spiking neurons, without preimposed topology and driven by STDP (SpikeTimeDependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that has effect on the synaptic delays linking the network to the output neurons, with classification as a goal task. The network processing and the resulting performance are completely explainable by the concept of polychronization, proposed by Izhikevich [?]. The model emphasizes the computational capabilities of this concept.

[7] 
N. Poh and S. Bengio.
Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2007. [ .ps.gz  .pdf  .djvu  abstract] Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, composition and the choice of samples. We propose to reduce the performance dependency of these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test classconditional (client and impostor) score distributions are unknown and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them on the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance of an unseen database with two times more users with an average coverage of 95% (over all the 24 systems). A coverage is the proportion of the unseen EPC covered by the derived confidence interval.

[8] 
N. Poh, A. Martin, and S. Bengio.
Performance generalization in biometric authentication using joint userspecific and sample bootstraps. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(3):492498, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] Biometric authentication performance is often depicted by a decision error tradeoff (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a twostep bootstrap procedure to take into account of the three mentioned sources of variability. This is an extension to the Bolle 's bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75% of DET coverage, an unseen DET curve with 8 times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and hence more useful.

[9] 
S. Renals, S. Bengio, and J. G. Fiscus, editors.
Machine Learning for Multimodal Interaction: Third International Workshop, MLMI'2006. volume 4299 of Lecture Notes in Computer Science. SpringerVerlag, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] This book contains a selection of refereed papers presented at the 3rd Workshop on Machine Learning for Multimodal Interaction (MLMI 2006), held in Bethesda MD, USA during May 14, 2006. The workshop was organized and sponsored jointly by the US National Institute for Standards and Technology (NIST), three projects supported by the European Commission (Information Society Technologies priority of the sixth Framework Programme)  the AMI and CHIL Integrated Projects, and the PASCAL Network of Excellence  and the Swiss National Science Foundation national research collaboration, IM2. In addition to the main workshop, MLMI 2006 was colocated with the 4th NIST Meeting Recognition Workshop. This workshop was centered on the Rich Transcription 2006 Spring Meeting Recognition (RT06) evaluation of speech technologies within the meeting domain. Building on the success of previous evaluations in this domain, the RT06 evaluation continued evaluation tasks in the areas of speechtotext, whospokewhen, and speech activity detection. The conference program featured invited talks, full papers (subject to careful peer review, by at least three reviewers), and posters (accepted on the basis of abstracts) covering a wide range of areas related to machine learning applied to multimodal interaction  and more specifically to multimodal meeting processing, as addressed by the various sponsoring projects. These areas included humanhuman communication modeling, speech and visual processing, multimodal processing, fusion and fission, humancomputer interaction, and the modeling of discourse and dialog, with an emphasis on the application of machine learning. Out of the submitted full papers, about 50% were accepted for publication in the present volume, after authors had been invited to take review comments and conference feedback into account. The workshop featured invited talks from Roderick MurraySmith (University of Glasgow), Tsuhan Chen (Carnegie Mellon University) and David McNeill (University of Chicago), and a special session on projects in the area of multimodal interaction including presentations on the VACE, CHIL and AMI projects.

[10] 
S. Sonnenburg, M. L. Braun, C. Soon Ong, S Bengio, L. Bottou, G. Holmes,
Y. LeCun, K.R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf,
A. Smola, P. Vincent, J. Weston, and R. Williamson.
The need for open source software in machine learning. Journal of Machine Learning Research, JMLR, 8:24432466, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] Open source tools have recently reached a level of maturity which makes them suitable for building largescale realworld systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not utilized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

[11] 
D. Zhang and S. Bengio.
Exploring contextual information in a layered framework for group action recognition. In IEEE International Conference on Multimedia & Expo, ICME, 2007. [ .ps.gz  .pdf  .djvu  abstract] Contextual information is important for sequence modeling. Hidden Markov models (HMMs) and extensions, which have been widely used for sequence modeling, make simplifying, often unrealistic assumptions on the conditional independence of observations given the class labels, thus cannot accommodate overlapping features or longterm contextual information. In this paper, we introduce a principled layered framework with three implementation methods that take into account contextual information (as available in the whole or part of the sequence). The first two methods are based on state alpha and gamma posteriors (as usually referred to in the HMM formalism). The third method is based on conditional random fields (CRFs), a conditional model that relaxes the independent assumption on the observations required by HMMs for computational tractability. We illustrate our methods with the application of recognizing group actions in meetings. Experiments and comparison with standard HMM baseline showed the validity of the proposed approach.

[1] 
F. Cardinaux, C. Sanderson, and S. Bengio.
User authentication via adapted statistical models of face images. IEEE Transactions on Signal Processing, 54(1):361373, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely 1D Hidden Markov Models (HMMs) and pseudo2D HMMs, are suitable for face recognition. Recently, a simpler statistical model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM and HMM based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available. Considerably more precise models can be obtained through the use of Maximum a Posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo2D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best tradeoff in terms of authentication time, robustness and discrimination performance is achieved by the extended GMM approach.

[2] 
O. Glickman, I. Dagan, M. Keller, S. Bengio, and W. Daelemans.
Investigating lexical substitution scoring for subtitle generation. In Tenth Conference on Computational Natural Language Learning, CONLL, 2006. [ .ps.gz  .pdf  .djvu  abstract] This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state of the art scoring methods that predict the validity of a given substitution. The paper evaluates two context independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two Wordnet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.

[3] 
D. Grangier and S. Bengio.
A neural network to retrieve images from text queries. In Proceedings of the 16th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 4132. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This work presents a neural network for the retrieval of images from text queries. The proposed network is composed of two main modules: the first one extracts a global picture representation from local block descriptors while the second one aims at solving the retrieval problem from the extracted representation. Both modules are trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.

[4] 
D. Grangier, F. Monay, and S. Bengio.
A discriminative approach for the retrieval of images from text queries. In European Conference on Machine Learning, ECML, Lecture Notes in Computer Science, volume LNCS 4212. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the nonrelevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed PassiveAggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current stateoftheart approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.

[5] 
D. Grangier, F. Monay, and S. Bengio.
Learning to retrieve images from text queries with a discriminative model. In International Workshop on Adaptive Multimedia Retrieval, AMR, 2006. [ .ps.gz  .pdf  .djvu  abstract] This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the stateoftheart (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).

[6] 
J. Keshet, S. ShalevShwartz, S. Bengio, Y. Singer, and D. Chazan.
Discriminative kernelbased phoneme sequence recognition. In Proceedings of the International Conference on Spoken Language Processing, InterspeechICSLP, 2006. [ .ps.gz  .pdf  .djvu  abstract] We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMMbased approaches, our method uses a discriminative kernelbased training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance along with a proposed phoneme sequence to a vectorspace endowed with an innerproduct that is realized by a Mercer kernel. Building on large margin techniques for predicting whole sequences, we are able to devise a learning algorithm which distills to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and further describe an efficient implementation of it. We present initial encouraging experimental results with the TIMIT and compare the proposed method to an HMMbased approach.

[7] 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Posterior based keyword spotting with a priori thresholds. In Proceedings of the International Conference on Spoken Language Processing, InterspeechICSLP, 2006. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we propose a new posterior based scoring approach for keyword and non keyword (garbage) elements. The estimation of these scores is based on HMM state posterior probability definition, taking into account long contextual information and the prior knowledge (e.g. keyword model topology). The state posteriors are then integrated into keyword and garbage posteriors for every frame. These posteriors are used to make a decision on detection of the keyword at each frame. The frame level decisions are then accumulated (in this case, by counting) to make a global decision on having the keyword in the utterance. In this way, the contribution of possible outliers are minimized, as opposed to the conventional Viterbi decoding approach which accumulates likelihoods. Experiments on keywords from the Conversational Telephone Speech (CTS) and Numbers'95 databases are reported. Results show that the new scoring approach leads to better trade off between true and false alarms compared to the Viterbi decoding approach, while also providing the possibility to precalculate keyword specific spotting thresholds related to the length of the keywords.

[8] 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Using more informative posterior probabilities for speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we present initial investigations towards boosting posterior probability based speech recognition systems by estimating more informative posteriors taking into account acoustic context (e.g., the whole utterance), as well as possible prior information (such as phonetic and lexical knowledge). These posteriors are estimated based on HMM state posterior probability definition (typically used in standard HMMs training). This approach provides a new, principled, theoretical framework for hierarchical estimation/use of more informative posteriors integrating appropriate context and prior knowledge. In the present work, we used the resulting posteriors as local scores for decoding. On the OGI numbers database, this resulted in significant performance improvement, compared to using MLP estimated posteriors for decoding (hybrid HMM/ANN approach) for clean and more specially for noisy speech. The system is also shown to be much less sensitive to tuning factors (such as phone deletion penalty, language model scaling) compared to the standard HMM/ANN and HMM/GMM systems, thus practically it does not need to be tuned to achieve the best possible performance.

[9] 
M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, and J. Richiardi.
Writer identification for smart meeting room systems. In H. Bunke and A. L. Spitz, editors, Document Analysis Systems VII: 7th International Workshop, DAS, Lecture Notes in Computer Science, volume LNCS 3872, pages 186195. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we present a text independent online writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets used in signature verification systems before. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the strokebased feature set outperforms the pointbased feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.

[10] 
J. Mariéthoz and S. Bengio.
A max kernel for textindependent speaker verification systems. In Second Workshop on Multimodal User Authentication, MMUA, 2006. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we present a principled SVM based speaker verification system. A general approach is developed that enables the use of any kernel at the frame level. An extension of his approach using the Max operator is then proposed. The new system is then compared to stateoftheart GMM and other SVM based systems found in the literature on the Polyvar database. It is found that the new system outperforms, most of the time, the other systems, statistically significantly.

[11] 
J.F. Paiement, D. Eck, and S. Bengio.
Probabilistic melodic harmonization. In L. Lamontagne and M. Marchand, editors, Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, volume LNCS 4013, pages 218229. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural components in chord notation. We show empirically that root notes progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root notes progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.

[12] 
N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue  the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a realuser database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a realuser database?. Based on a considerable amount of empirical biometric person authentication experiments (21 realuser data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a realuser database with respect to using only a realuser database. Considering the possibly expensive cost involved in collecting the realuser multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[13] 
N. Poh and S. Bengio.
Database, protocol and tools for evaluating scorelevel fusion algorithms in biometric authentication. Pattern Recognition, 39(2):223233, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinate d effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some stateoftheart tools to evaluate the fusion performance.

[14] 
N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess userdependent variabilities. In Second Workshop on Multimodal User Authentication, MMUA, 2006. [ .ps.gz  .pdf  .djvu  abstract] Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue  the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a realuser database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a realuser database?. Based on a considerable amount of empirical biometric person authentication experiments (21 realuser data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [?] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a realuser database with respect to using only a realuser database. Considering the possibly expensive cost involved in collecting the realuser multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[15] 
A. Pozdnoukhov and S. Bengio.
Graphbased invariant manifolds for invariant pattern recognition with kernel methods. In International Conference on Pattern Recognition, ICPR, 2006. [ .ps.gz  .pdf  .djvu  abstract] We present here an approach for applying the technique of modeling data transformation manifolds for invariant learning with kernel methods. The approach is based on building a kernel function on the graph modeling the invariant manifold. It provides a way for taking into account nearly arbitrary transformations of the input samples. The approach is verified experimentally on the task of optical character recognition, providing stateoftheart performance on harder problem settings.

[16] 
A. Pozdnoukhov and S. Bengio.
Invariances in kernel methods: From samples to objects. Pattern Recognition Letters, 27(10):10871097, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper presents a general method for incorporating prior knowledge into kernel methods such as support vector machines. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. A number of implementation techniques of this method, based on hard geometrical objects and soft objects based on distributions are considered. Tangent vectors are extensively used for object construction. Empirical results on one artificial dataset and two real datasets of electroencephalogram signals and face images demonstrate the usefulness of the proposed method. The method could establish a foundation for an information retrieval and person identification systems.

[17] 
A. Pozdnoukhov and S. Bengio.
Semisupervised kernel methods for regression estimation. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] The paper presents a semisupervised kernel method for regression estimation in the presence of unlabeled patterns. The method exploits a recently proposed datadependent kernel which is constructed in order to represent the inner geometry of the data. This kernel is implemented into Kernel Regression methods (SVR, KRR). Experimental results aim to highlight the properties of the method and its advantages as compared to fully supervised approaches. The influence of the parameters on the model properties was evaluated experimentally. One artificial and two realworld datasets were used to demonstrate the performance of the proposed algorithm.

[18] 
S. Renals and S. Bengio, editors.
Machine Learning for Multimodal Interaction: Second International Workshop, MLMI'2005. volume 3869 of Lecture Notes in Computer Science. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This book contains a selection of refereed papers presented at the Second Workshop on Machine Learning for Multimodal Interaction (MLMI 2005), held in Edinburgh, Scotland, during 1113 July 2005. The workshop was organized and sponsored jointly by two European integrated projects, three European Networks of Excellence and a Swiss national research network: AMI, CHIL, HUMAINE, PASCAL, SIMILAR, and IM2. In addition to the main workshop, MLMI 2005 hosted the NIST (US National Institute of Standards and Technology) Meeting Recognition Workshop. This workshop (the third such sponsored by NIST) was centered on the Rich Transcription 2005 Spring Meeting Recognition (RT05) evaluation of speech technologies within the meeting domain. Building on the success of the RT04 spring evaluation, the RT05 evaluation continued the speechtotext and speaker diarization evaluation tasks and added two new evaluation tasks: speech activity detection and source localization. Given the multiple links between the above projects and several related research areas, and the success of the first MLMI 2004 workshop, it was decided to organize once again a joint workshop bringing together researchers working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The areas covered included: Humanhuman communication modeling, Speech and visual processing, Multimodal processing, fusion and fission, Multimodal dialog modeling, Humanhuman interaction modeling, Multimodal data structuring and presentation, Multimedia indexing and retrieval, Meeting structure analysis, Meeting summarizing, Multimodal meeting annotation, and Machine learning applied to the above.

[19] 
Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Measuring the performance of face localization systems. Image and Vision Computing, 24(8):882893, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] The purpose of Face localization is to determine the coordinates of a face in a given image. It is a fundamental research area in computer vision because it serves, as a necessary first step in any face processing system, such as automatic face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performances depend widely on the accuracy of the face localization process. The purpose of this paper is to mainly show that the error made during the localization process may have different impacts on the final application. We first show the influence of localization errors on the face verification task and then empirically demonstrate the problems of current localization performance measures when applied to this task. In order to properly evaluate the performance of a face localization algorithm, we then propose to embed the final application (here face verification) into the performance measuring process. Using two benchmark databases, BANCA and XM2VTS, we proceed by showing empirically that our proposed method to evaluate localization algorithms better matches the final verification performance.

[20] 
C. Sanderson, S. Bengio, and Y. Gao.
On transforming statistical models for nonfrontal face verification. Pattern Recognition, 39(2):288302, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of gaussians, the problem is tackled through extending each frontal face model with artificially synthesized models for nonfrontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for nonfrontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCAderived features) and a local feature system (based on DCTderived features). Experiments on the FERET database suggest that for the holistic system, the LinReg based technique is more suited than the MLLR based techniques; for the local feature system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts based representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.

[21] 
D. Zhang, D. GaticaPerez, S. Bengio, and I. McCowan.
Modeling individual and group actions in meetings with layered HMMs. IEEE Transactions on Multimedia, 8(3):509520, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently groupbased (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a twolayer process, one that models basic individual activities from lowlevel audiovisual features, and another one that models the interactions. We propose a twolayer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on lowdimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public fivehour meeting corpus. Experiments and comparison with a singlelayer HMM baseline system show its validity.

[22] 
D. Zhang, D. GaticaPerez, D. Roy, and S. Bengio.
Modeling interactions from email communication. In IEEE International Conference on Multimedia & Expo, ICME, 2006. [ .ps.gz  .pdf  .djvu  abstract] Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topicbased interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization.

[1] 
S. Bengio and H. Bourlard, editors.
Machine Learning for Multimodal Interaction: First International Workshop, MLMI'2004. volume 3361 of Lecture Notes in Computer Science. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] This book contains a selection of refereed papers presented at the First Workshop on Machine Learning for Multimodal Interaction (MLMI'04), held in Martigny, Switzerland, from June 2123, 2004. The workshop was organized and sponsored jointly by three European projects, AMI, PASCAL and M4, as well as a Swiss national research network, IM2. It brings together researchers from different communities working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction in meetings. The motivation for creating such forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The conference program covered a wide range of areas related to machine learning applied to multimodal interaction  and more specifically to multimodal meeting processing. These areas included humanhuman communication modeling, speech and visual processing, multimodal processing, fusion and fission, multimodal dialog modeling, humanhuman interaction modeling, multimodal data structuring and presentation, multimedia indexing and retrieval, meeting structure analysis, meeting summarizing, multimodal meeting annotation, and machine learning applied to the above.

[2] 
S. Bengio and H. Bourlard.
Multi channel sequence processing. In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning: First International Workshop, Lecture Notes in Artificial Intelligence, volume LNAI 3635, pages 2236. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper summarizes some of the current research challenges arising from multichannel sequence processing. Indeed, multiple real life applications involve simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches towards sequence modeling. However, several challenging research issues are still open, such as taking into account asynchrony and correlation between several feature streams, or handling the underlying growing complexity. In this framework, we discuss here two novel approaches, which recently started to be investigated with success in the context of large multimodal problems. These include the asynchronous HMM, providing a principled approach towards the processing of multiple feature streams, and the layered HMM approach, providing a good formalism for decomposing large and complex (multistream) problems into layered architectures. As briefly reported here, combination of these two approaches yielded successful results on several multichannel tasks, ranging from audiovisual speech recognition to automatic meeting analysis.

[3] 
S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve. In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the socalled breakeven point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to use adequately a nonparametric statistical test in order to produce EPCs with confidence intervals or assess the statistical significant difference between two models under various settings.

[4] 
C. Dimitrakakis and S. Bengio.
Boosting word error rates. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 501504, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the wellknown Numbers 95 database that indicate the importance of this temporal probability distribution.

[5] 
C. Dimitrakakis and S. Bengio.
Gradientbased estimates of return distributions. In PASCAL Workshop on Principled Methods of Trading Exploration and Exploitation, 2005. [ .ps.gz  .pdf  .djvu  abstract] We present a general method for maintaining estimates of the distribution of parameters in arbitrary models. This is then applied to the estimation of probability distributions over actions in valuebased reinforcement learning. While this approach is similar to other techniques that maintain a confidence measure for actionvalues, it nevertheless offers an insight into current techniques and hints at potential avenues of further research.

[6] 
C. Dimitrakakis and S. Bengio.
Online adaptive policies for ensemble classifiers. Neurocomputing, 64:211221, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper we attempt to train and combine the base classifiers using an adaptive policy. This policy is learnt through a Qlearning inspired technique. Its effectiveness for an essentially supervised task is demonstrated by experimental results on several UCI benchmark databases.

[7] 
D. GaticaPerez, D. Zhang, and S. Bengio.
Extracting information from multimedia meeting collections. In 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR, 2005. [ .ps.gz  .pdf  .djvu  abstract] Multimedia meeting collections, composed of unedited audio and video streams, handwritten notes, slides, and electronic documents that jointly constitute a raw record of complex human interaction processes in the workplace, have attracted interest due to the increasing feasibility of recording them in large quantities, by the opportunities for information access and retrieval applications derived from the automatic extraction of relevant meeting information, and by the challenges that the extraction of semantic information from real human activities entails. In this paper, we present a succint overview of recent approaches in this field, largely influenced by our own experiences. We first review some of the existing and potential needs for users of multimedia meeting information systems. We then summarize recent work on various research areas addressing some of these requirements. In more detail, we describe our work on automatic analysis of human interaction patterns from audiovisual sensors, discussing open issues in this domain.

[8] 
D. GaticaPerez, I. McCowan D. Zhang, and S. Bengio.
Detecting group interestlevel in meetings. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 489492, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Finding relevant segments in meeting recordings is important for summarization, browsing, and retrieval purposes. In this paper, we define relevance as the interestlevel that meeting participants manifest as a group during the course of their interaction (as perceived by an external observer), and investigate the automatic detection of segments of highinterest from audiovisual cues. This is motivated by the assumption that there is a relationship between segments of interest to participants, and those of interest to the end user, e.g. of a meeting browser. We first address the problem of human annotation of group interestlevel. On a 50meeting corpus, recorded in a room equipped with multiple cameras and microphones, we found that the annotations generated by multiple people exhibit a good degree of consistency, providing a stable groundtruth for automatic methods. For the automatic detection of highinterest segments, we investigate a methodology based on Hidden Markov Models (HMMs) and a number of audio and visual features. Single and multistream approaches were studied. Using precision and recall as performance measures, the results suggest that (i) the automatic detection of group interestlevel is promising, and (ii) while audio in general constitutes the predominant modality in meetings, the use of a multimodal approach is beneficial.

[9] 
N. Gilardi and S. Bengio.
Machine learning for automatic environmental mapping: when and how? In G. Dubois, editor, Automatic mapping algorithms for routine and emergency monitoring data. Report on the Spatial Interpolation Comparison (SIC2004) exercise, pages 123138. Office for Official Publications of the European Communities, Luxembourg, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper discusses the opportunity of using Machine Learning techniques in an automatic environmental mapping context, as was the case for the SIC2004 exercise. First, the Machine Learning methodology is quickly described and compared to Geostatistics. From there, some clues about when to apply Machine Learning are proposed, and what outcomes can be expected from this choice. Finally, three well known regression algorithms: KNearest Neighbors, Multi Layer Perceptron and Support Vector Regression, are used on SIC2004 data in a Machine Learning context, and compared to Ordinary Kriging. This illustrates some potential drawbacks of SVR and MLP for applications such as SIC2004.

[10] 
Y. Grandvalet, J. Mariéthoz, and S. Bengio.
A probabilistic interpretation of SVMs with an application to unbalanced classification. In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this paper, we show that the hinge loss can be interpreted as the negloglikelihood of a semiparametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semiparametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is intervalvalued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem when decisions result in unequal losses. Experiments on an unbalanced classification loss show improvements over stateoftheart procedures.

[11] 
D. Grangier and S. Bengio.
Exploiting hyperlinks to learn a retrieval model. In Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005. [ .ps.gz  .pdf  .djvu  abstract] Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d inC according to their similarity with respect to q, sim (q,d). The identification of an effective function a,b >sim(a,b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the stateoftheart Okapi approach, when applied over a nonhyperlinked retrieval corpus.

[12] 
D. Grangier and S. Bengio.
Inferring document similarity from hyperlinks. In Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function a,b >sim(a,b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.

[13] 
M. Keller and S. Bengio.
A neural network for text representation. In Proceedings of the 15th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 3697, pages 667672. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, nonprobabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.

[14] 
M. Keller, S. Bengio, and S. Y. Wong.
Benchmarking nonparametric statistical tests. In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Although nonparametric tests have already been proposed for that purpose, statistical significance tests for nonstandard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets nonparametric tests are relatively reliable in all conditions.

[15] 
H. Ketabdar, H. Bourlard, and S. Bengio.
Hierarchical multistream posterior based speech recognition system. In Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3869, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on “state gamma posterior” definition (typically used in standard HMMs training) extended to the case of multistream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multistream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speechtotext (CTS) task, this resulted in significant performance improvement, compared to the stateoftheart Tandem systems.

[16] 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Developing and enhancing posterior based speech recognition systems. In Proceedings of the 9th European Conference on Speech Communication and Technology, EurospeechInterspeech, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Local state or phone posterior probabilities are often investigated as local scores (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) to improve speech recogni tion systems. In this paper, we present initial results towards boosting these approaches by improving posterior estimat es, using acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). In the present work, the enhanced posterior distribution is associated with the “gamma” distribution typically used in standard HMMs training, and estimated from local likelihoods (GMM) or local posteriors (ANN). This approach results in a family of new HMM based systems, where only posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the phone and word levels, and integrating the appropriate context and prior knowledge in each level. In the present work, we used the resulting posteriors as local scores in a Viter bi decoder. On the OGI Numbers'95 database, this resulted in improved recognition performance, compared to a stateoftheart hybrid HMM/ANN system.

[17] 
J. Mariéthoz and S. Bengio.
A unified framework for score normalization techniques applied to text independent speaker verification. IEEE Signal Processing Letters, 12(7):532535, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The purpose of this paper is to unify several of the stateoftheart score normalization techniques applied to textindependent speaker verification systems. We propose a new framework for this purpose. The two wellknown Z and Tnormalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these wellknown score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework.

[18] 
I. McCowan, D. GaticaPerez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(3):305317, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper investigates the recognition of group actions in meetings. A statistical framework is proposed in which group actions result from the interactions of the individual participants. The group actions are modelled using different HMMbased approaches, where the observations are provided by a set of audiovisual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modelling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audiobased events, motivating a multimodal approach to meeting analysis.

[19] 
J.F. Paiement, D. Eck, and S. Bengio.
A probabilistic model for chord progressions. In International Conference on Music Information Retrieval, ISMIR, 2005. [ .ps.gz  .pdf  .djvu  abstract] Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional outofsample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[20] 
J.F. Paiement, D. Eck, S. Bengio, and D. Barber.
A graphical model for chord progressions embedded in a psychoacoustic space. In International Conference on Machine Learning, ICML, 2005. [ .ps.gz  .pdf  .djvu  abstract] Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional outofsample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[21] 
N. Poh and S. Bengio.
Can chimeric persons be used in multimodal biometric authentication experiments? In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, volume LNCS 3869. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extents in various reallife applications. A system that uses more than one behavioral and physiological characteristics to verify whether a person is who he/she claims to be is called a multimodal biometric authentication system. Due to lack of large true multimodal biometric datasets, the biometric trait of a user from a database is often combined with another different biometric trait of yet another user, thus creating a socalled a chimeric user. In the literature, this practice is justified based on the fact that the underlying biometric traits to be combined are assumed to be independent of each other given the user. To the best of our knowledge, there is no literature that approves or disapproves such practice. We study this topic from two aspects: 1) by clarifying the mentioned independence assumption and 2) by constructing a pool of chimeric users from a pool of true modality matched users (or simply “true users”) taken from a bimodal database, such that the performance variability due to chimeric user can be compared with that due to true users. The experimental results suggest that for a large proportion of the experiments, such practice is indeed questionable.

[22] 
N. Poh and S. Bengio.
EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks. In N. C. Oza, R. Polikar, and J. Kittler, editors, 6th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 3541, pages 7485. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Biometric authentication is a process of verifying an identity claim using a person's behavioural and physiological characteristics. Due to the vulnerability of the system to environmental noise and variation caused by the user, fusion of several biometricenabled systems is identified as a promising solution. In the literature, various fixed rules (e.g. min, max, median, mean) and trainable classifiers (e.g. linear combination of scores or weighted sum) are used to combine the scores of several basesystems. How exactly do correlation and imbalance nature of basesystem performance affect the fixed rules and trainable classifiers? We study these joint aspects using the commonly used error measurement in biometric authentication, namely Equal Error Rate (EER). Similar to several previous studies in the literature, the central assumption used here is that the classdependent scores of a biometric system are approximately normally distributed. However, different from them, the novelty of this study is to make a direct link between the EER measure and the fusion schemes mentioned. Both synthetic and real experiments (with as many as 256 fusion experiments carried out on the XM2VTS benchmark scorelevel fusion data sets) verify our proposed theoretical modeling of EER of the two families of combination scheme. In particular, it is found that weighted sum can provide the best generalisation performance when its weights are estimated correctly. It also has the additional advantage that score normalisation prior to fusion is not needed, contrary to the rest of fixed fusion rules.

[23] 
N. Poh and S. Bengio.
Fratio clientdependent normalisation for biometric authentication tasks. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 721724, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates a new clientdependent normalisation to improve biometric authentication systems. There exists many clientdependent score normalisation techniques applied to speaker authentication, such as ZNorm, DNorm and TNorm. Such normalisation is intended to adjust the variation across different client models. We propose “Fratio” normalisation, or FNorm, applied to face and speaker authentication systems. This normalisation requires only that as few as two clientdependent accesses are available (the more the better). Different from previous normalisation techniques, FNorm considers the client and impostor distributions simultaneously. We show that Fratio is a natural choice because it is directly associated to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that FNorm actually “interpolates” between clientindependent and clientdependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over ZNorm, clientdependent threshold normalisation or no normalisation.

[24] 
N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks? IEEE Transactions on Signal Processing, 53(11):43844396, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources such as subbands, streams (with different features) and multi modal data has been shown to be a very promising trend, both in experiments and to some extents in reallife biometric authentication applications. Despite considerable efforts in fusions, there is a lack of understanding on the roles and effects of correlation and variance (of both the client and impostor scores of baseclassifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance ReductionEqual Error Rate (VREER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we showed that Equal Error Rate (EER) can be modeled as a function of Fratio, which itself is a function of 1) correlation, 2) variance of baseexperts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of baseexperts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be missleading. Experimental results on the BANCA multimodal database confirm our findings using VREER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated baseexperts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score), improves fusion. However, by linking the concept of ambiguity decomposition to classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (classdependent) variance. Moreover, by linking the concept of biasvariancecovariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.

[25] 
N. Poh and S. Bengio.
Improving fusion with marginderived confidence in biometric authentication tasks. In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 10591068. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates a new confidence criterion to improve fusion via a linear combination of scores of several biometric authentication systems. This confidence is based on the margin of making a decision, which answers the question, “after observing the score of a given system, what is the confidence (or risk) associated to that given access?”. In the context of multimodal and intramodal fusion, such information proves valuable because the margin information can determine which of the systems should be given higher weights. Finally, we propose a linear discriminative framework to fuse the margin information with an existing global fusion function. The results of 32 fusion experiments carried out on the XM2VTS multimodal database show that fusion using margin (product of margin and expert opinion) is superior over fusion without the margin information (i.e., the original expert opinion). Furthermore, combining both sources of information increases fusion performance further.

[26] 
N. Poh and S. Bengio.
A novel approach to combining clientdependent and confidence information in multimodal biometrics. In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 11201129. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The issues of fusion with clientdependent and confidence information have been well studied separately in biometric authentication. In this study, we propose to take advantage of both sources of information in a discriminative framework. Initially, each source of information is processed on a per expert basis (plus on a per client basis for the first information and on a per example basis for the second information). Then, both sources of information are combined using a secondlevel classifier, across different experts. Although the formulation of such twostep solution is not new, the novelty lies in the way the sources of prior knowledge are incorporated prior to fusion using the secondlevel classifier. Because these two sources of information are of very different nature, one often needs to devise special algorithms to combine both information sources. Our framework that we call “Prior Knowledge Incorporation” has the advantage of using the standard machine learning algorithms. Based on 10 times 32=320 intramodal and multimodal fusion experiments carried out on the publicly available XM2VTS scorelevel fusion benchmark database, it is found that the generalisation performance of combining both information sources improves over using either or none of them, thus achieving a new stateoftheart performance on this database.

[27] 
N. Poh and S. Bengio.
A scorelevel fusion benchmark database for biometric authentication. In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 474483. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some stateoftheart tools to evaluate the fusion performance.

[28] 
V. Popovici, S. Bengio, and J.P. Thiran.
Kernel matching pursuit for large datasets. Pattern Recognition, 38(12):23852390, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] Kernel Matching Pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of some basis functions selected from a kernelinduced dictionary. Here we propose a modification of the Kernel Matching Pursuit algorithm that aim s at making the method practical for large datasets. Starting from an approximating algorithm, the Weak Greedy Algorithm, we introduce a stochastic method for reducing the search space at each iteration. Then we study the implications of using an approximate algorithm and we show how one can control the tradeoff between the accuracy and the need for resources. Finally we present some experiments performed on a large dataset that support our approach and illustrate its applicability.

[29] 
A. Pozdnoukhov and S. Bengio.
Improving kernel classifiers for object categorization problems. In International Conference on Machine Learning, ICML, Workshop on Learning with Partially Classified Training Data, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper presents an approach for improving the performance of kernel classifiers applied to object categorization problems. The approach is based on the use of distributions centered around each training points, which are exploited for interclass invariant image representation with local invariant features. Furthermore, we propose an extensive use of unlabeled images for improving the SVMbased classifier.

[30] 
C. Sanderson, F. Cardinaux, and S. Bengio.
On accuracy/robustness/complexity tradeoffs in face verification. In IEEE International Conference on Information Technology and Applications, ICITA, pages 638643, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In much of the literature devoted to face recognition, experiments are performed with controlled images (e.g. manual face localization, controlled lighting, background and pose). However, a practical recognition system has to be robust to more challenging conditions. In this paper we first evaluate, on the relatively difficult BANCA database, the discrimination accuracy, robustness and complexity of Gaussian Mixture Model (GMM), 1D and pseudo2D Hidden Markov Model (HMM) based systems, using both manual and automatic face localization. We also propose to extend the GMM approach through the use of local features with embedded positional information, increasing accuracy without sacrificing its low complexity. Experiments show that good accuracy on manually located faces is not necessarily indicative of good accuracy on automatically located faces (which are imperfectly located). The deciding factor is shown to be the degree of constraints placed on spatial relations between face parts. Methods which utilize rigid constraints have poor robustness compared to methods which have relaxed constraints. Furthermore, we show that while the pseudo2D HMM approach has the best overall accuracy, classification time on current hardware makes it impractical. The best tradeoff in terms of complexity, robustness and discrimination accuracy is achieved by the extended GMM approach.

[31] 
D. Zhang, D. GaticaPerez, S. Bengio, and I. McCowan.
Semisupervised adapted HMMs for unusual event detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of temporal unusual event detection. Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised modelbased approaches. We propose a semisupervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner. The proposed framework has an iterative structure, which adapts a new unusual event model at each iteration. We show that such a framework can address problems due to the scarcity of training data and the difficulty in predefining unusual events. Experiments on audio, visual, and audiovisual data streams illustrate its effectiveness, compared with both supervised and unsupervised baseline methods.

[32] 
D. Zhang, D. GaticaPerez, S. Bengio, and I. McCowan.
Semisupervised meeting event recognition with adapted HMMs. In IEEE International Conference on Multimedia Expo, ICME, pages 611618, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] This paper investigates the use of unlabeled data to help labeled data for audiovisual event recognition in meetings. To deal with situations in which it is difficult to collect enough labeled data to capture event characteristics, but collecting a large amount of unlabeled data is easy, we present a semisupervised framework using HMM adaptation techniques. Instead of directly training one model for each event, we first train a wellestimated general event model for all events using both labeled and unlabeled data, and then adapt the general model to each specific event model using its own labeled data. We illustrate the proposed approach with a set of eight audiovisual events defined in meetings. Experiments and comparison with the fullysupervised baseline method show the validity of the proposed semisupervised approach.

[33] 
D. Zhang, D. GaticaPerez, S. Bengio, and D. Roy.
Learning influence among interacting markov chains. In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005. [ .ps.gz  .pdf  .djvu  abstract] We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a twolevel structure: individuallevel and grouplevel. Individual level models actions of each player, and the grouplevel models actions of the team as a whole. Experiments on synthetic multiplayer games and a multiparty meeting corpus show the effectiveness of the proposed model.

[1] 
S. Bengio.
Multimodal speech processing using asynchronous hidden markov models. Information Fusion, 5(2):8189, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An ExpectationMaximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audiovisual speech processing tasks, namely speech recognition and textdependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[2] 
S. Bengio and J. Mariéthoz.
The expected performance curve: a new assessment measure for person authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] ROC and DET curves are often used in the field of person authentication to assess the quality of a model or even to compare several models. We argue in this paper that this measure can be misleading as it compares performance measures that cannot be reached simultaneously by all systems. We propose instead new curves, called Expected Performance Curves (EPC). These curves enable the comparison between several systems according to a criterion, decided by the application, which is used to set thresholds according to a separate validation set. A free sofware is available to compute these curves. A real case study is used throughout the paper to illustrate it. Finally, note that while this study was done on an authentication problem, it also applies to most 2class classification tasks.

[3] 
S. Bengio and J. Mariéthoz.
A statistical significance test for person authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but most being aggregates of two measures (such as the false acceptance rate and the false rejection rate), simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significantness of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they yield often too optimistic results (resulting in false statements about statistical significantness).

[4] 
H. Bourlard, S. Bengio, M. Magimai Doss, Q. Zhu, B. Mesot, and N. Morgan.
Towards using hierarchical posteriors for flexible automatic speech recognition systems. In Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speechtotext) Rich Transcription (RT'04) Workshop, 2004. [ .ps.gz  .pdf  .djvu  abstract] Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speechtotext (CTS) task.

[5] 
F. Cardinaux, C. Sanderson, and S. Bengio.
Face verification using adapted generative models. In International Conference on Automatic Face and Gesture Recognition, FG, pages 825830, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has been shown previously that systems based on local features and relatively complex generative models, namely 1D Hidden Markov Models (HMMs) and pseudo2D HMMs, are suitable for face recognition (here we mean both identification and verification). Recently a simpler generative model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In this paper we first propose to increase the performance of the GMM approach (without sacrificing its simplicity) through the use of local features with embedded positional information; we show that the performance obtained is comparable to 1D HMMs. Secondly, we evaluate different training techniques for both GMM and HMM based systems. We show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available; we propose to tackle this problem through the use of Maximum a Posteriori (MAP) training, where the lack of data problem can be effectively circumvented; we show that models estimated with MAP are significantly more robust and are able to generalize to adverse conditions present in the BANCA database.

[6] 
S. Chiappa and S. Bengio.
HMM and IOHMM modeling of EEG rhythms for asynchronous BCI systems. In European Symposium on Artificial Neural Networks, ESANN, 2004. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] We compare the use of two Markovian models, HMMs and IOHMMs, to discriminate between three mental tasks for brain computer interface systems using an asynchronous protocol. We show that IOHMMs outperform HMMs but that, probably due to the lack of any prior information on the state dynamics, no practical advantage in the use of these models over their static counterparts is obtained.

[7] 
R. Collobert and S. Bengio.
A gentle hessian for efficient gradient descent. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 517520, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Several secondorder optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n^{2}) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a blockdiagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of blockdiagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.

[8] 
R. Collobert and S. Bengio.
Links between perceptrons, MLPs and SVMs. In International Conference on Machine Learning, ICML, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We propose to study links between three important classification algorithms: Perceptrons, MultiLayer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performances in generalization and speed than the other models.

[9] 
F. de Wet, K. Weber, L. Boves, B. Cranen, S. Bengio, and H. Bourlard.
Evaluation of formantlike features for automatic speech recognition. Journal of the Acoustical Society of America (JASA), 116(3):17811792, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates possibilities to find a lowdimensional, formantrelated physical representation of speech signals, which is suitable for automatic speech recognition. This aim is motivated by the fact that formants are known to be discriminant features for speech recognition. Combinations of automatically extracted formantlike features and stateoftheart, noiserobust features have previously been shown to be more robust in adverse conditions than stateoftheart features alone. However, it is not clear how these automatically extracted formantlike features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formantlike features, i.e. robust formants and HMM2 features, and to compare these features to handlabeled formants as well as to melfrequency cepstral coefficients in terms of their performance on a vowel classification task. The speech data and handlabeled formants that were used in this study are a subset of the American English vowels database presented in [Hillenbrand et al., J. Acoust. Soc. Am. 97, 30993111 (1995)]. Classification performance was measured on the original, clean data as well as in (simulated) adverse conditions. In combination with standard automatic speech recognition methods, the classification performance of the robust formant and HMM2 features compare very well to the performance of the handlabeled formants.

[10] 
C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 621624, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.

[11] 
C. Dimitrakakis and S. Bengio.
Online policy adaptation for ensemble classifiers. In European Symposium on Artificial Neural Networks, ESANN, 2004. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.

[12] 
M. Magimai Doss, S. Bengio, and H. Bourlard.
Joint decoding for phonemegrapheme continuous speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 177180, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Standard ASR systems typically use phoneme as the subword units. Preliminary studies have shown that the performance of the ASR system could be improved by using grapheme as additional subword units. In this paper, we investigate such a system where the word models are defined in terms of two different subword units, i.e., phoneme and grapheme. During training, models for both the subword units are trained, and then during recognition either both or just one subword unit is used. We have studied this system for a continuous speech recognition task in American English language. Our studies show that grapheme information used along with phoneme information improves the performance of ASR.

[13] 
M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation. In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Automatic Text Processing tasks, documents are usually represented in the bagofword space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.

[14] 
I. McCowan, D. GaticaPerez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions. In Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 5675. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] People meet in order to interact  disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[15] 
K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, F. Cardinaux,
S. Marcel, S. Bengio, C. Sanderson, N. Poh, Y. Rodriguez, J. Czyz,
L. Vandendorpe, C. McCool, S. Lowther, S. Sridharan, V. Chandran, R. Paredes,
E. Vidal, L. Bai, L. Shen, Y. Wang, C. YuehHsuan, L. HsienChang,
H. YiPing, A. Heinrichs, M. Muller, A. Tewes, C. von der Malsburg, R. Wurtz,
Z. Wang, F. Xue, Y. Ma, Q. Yang, C. Fang, X. Ding, S. Lucey, R. Goss, and
H. Schneiderman.
Face authentication test on the BANCA database. In International Conference on Pattern Recognition, ICPR, volume 4, pages 523532, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper details the results of a Face Authentication Test (FAT2004) held in conjunction with the 17th International Conference on Pattern Recognition. The contest was held on the publicly available BANCA database according to a defined protocol. The competition also had a sequestered part in which institutions had to submit their algorithms for independent testing. 13 different verification algorithms from 10 institutions submitted results. Also, a standard set of face recognition software packages from the Internet were used to provide a baseline performance measure.

[16] 
K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, S. Marcel, S. Bengio,
F. Cardinaux, C. Sanderson, N. Poh, Y. Rodriguez, K. Kryszczuk, J. Czyz,
L. Vandendorpe, J. Ng, H. Cheung, and B. Tang.
Face authentication competition on the BANCA database. In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 815. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper details the results of a face verification competition held in conjunction with the First International Conference on Biometric Authe ntication. The contest was held on the publically available BANCA database according to a defined protocol. Six different verification algorithms from 4 academic and commercial institutions submitted results. Also, a standard set of face recognition software from the internet was used to provide a baseline performance measure.

[17] 
N. Poh and S. Bengio.
Noiserobust multistream fusion for textindependent speaker authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multistream approaches have proven to be very successful in speech recognition tasks and to a certain extent in speaker authentication tasks. In this study we propose a noiserobust multistream textindependent speaker authentication system. This system has two steps: first train the stream experts under clean conditions and then train the combination mechanism to merge the scores of the stream experts under both clean and noisy conditions. The idea here is to take advantage of the rather predictable reliability and diversity of streams under different conditions. Hence, noiserobustness is mainly due to the combination mechanism. This twostep approach offers several practical advantages: the stream experts can be trained in parallel (e.g., by using several machines); heterogeneous types of features can be used and the resultant system can be robust to different noise types (wide bands or narrow bands) as compared to substreams. An important finding is that a tradeoff is often necessary between the overall good performance under all conditions (clean and noisy) and good performance under clean conditions. To reconcile this tradeoff, we propose to give more emphasis or prior to clean conditions, thus, resulting in a combination mechanism that does not deteriorate under clean conditions (as compared to the best stream) yet is robust to noisy conditions.

[18] 
N. Poh and S. Bengio.
Towards predicting optimal subsets of base classifiers in biometric authentication tasks. In S. Bengio and H. Bourlard, editors, Machine Learning for Multimodal Interactions: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 159172. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extend in various reallife applications. However, combining too many systems (baseexperts) will also increase both hardware and computation costs. One way to selecting a subset of optimal baseexperts out of N is to carry out the experiments explicitly. There are 2^{N}1 possible combinations. In this paper, we propose an analytical solution to this task when weighted sum fusion mechanism is used. The proposed approach is at least valid in the domain of person authentication. It has a complexity that is additive between the number of examples and the number of possible combinations while the conventional approach, using bruteforce experimenting, is multiplicative between these two terms. Hence, our approach will scale better with large fusion problems. Experiments on the BANCA multimodal database verified our approach. While we will consider here fusion in the context of identity verification via biometrics, or simply biometric authentication, it can also have an important impact in meetings because this a priori information can assist in retrieving highlights in meeting analysis as in “who said what”. Furthermore, automatic meeting analysis also requires many systems working together and involves possibly many audiovisual media streams. Development in fusion of identity verification will provide insights into how fusion in meetings can be done. The ability to predict fusion performance is another important step towards understanding the fusion problem.

[19] 
N. Poh and S. Bengio.
Why do multistream, multiband and multimodal approaches work on biometric user authentication tasks? In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 893896, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multiband, multistream and multimodal approaches have proven to be very successful both in experiments and in reallife applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of a theoretical study to justify why and how they work, when one combines the streams at the feature or classifier score levels. In this paper, we attempt to cast a light onto the latter subject. While there exists literature discussing this aspect, a study on the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, MultiLayerPerceptrons and Support Vector Machines always perform better than the average performance of the underlying experts. Furthermore, in practice, most combined experts using the methods mentioned above perform better than the best underlying expert.

[20] 
N. Poh, C. Sanderson, and S. Bengio.
Spectral subband centroids as complementary features for speaker authentication. In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 631639. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Most conventional features used in speaker authentication are based on estimation of spectral envelopes in one way or another, e.g., Melscale Filterbank Cepstrum Coefficients (MFCCs), Linearscale Filterbank Cepstrum Coefficients (LFCCs) and Relative Spectral Perceptual Linear Prediction (RASTAPLP). In this study, Spectral Subband Centroids (SSCs) are examined. These features are the centroid frequency in each subband. They have properties similar to formant frequencies but are limited to a given subband.Empirical experiments carried out on the NIST2001 database using SSCs, MFCCs, LFCCs and their combinations by concatenation suggest that SSCs are somewhat more robust compared to conventional MFCC and LFCC features as well as being partially complementary.

[21] 
A. Pozdnoukhov and S. Bengio.
Tangent vector kernels for invariant image classification with SVMs. In International Conference on Pattern Recognition, ICPR, volume 3, pages 486489, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents an application of the general sampletoobject approach to the problem of invariant image classification. The approach results in defining new SVM kernels based on tangent vectors that take into account prior information on known invariances. Real data of face images are used for experiments. The presented approach integrates virtual sample and tangent distance methods. We observe a significant increase in performance with respect to standard approaches. The experiments also illustrate (as expected) that prior knowledge becomes more important as the amount of training data decreases.

[22] 
Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Estimating the quality of face localization for face verification. In IEEE International Conference on Image Processing, ICIP, pages 581584, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Face localization is the process of finding the exact position of a face in a given image. This can be useful in several applications such as face tracking or person authentication. The purpose of this paper is to show that the error made during the localization process may have different impacts depending on the final application. Hence in order to evaluate the performance of a face localization algorithm, we propose to embed the final application (here face verification) into the performance measuring process. Moreover, in this paper, we estimate this embedding using either a multilayer perceptron or a K nearest neighbor algorithm in order to speedup the evaluation process. We show on the BANCA database that our proposed measure best matches the final verification results when comparing several localization algorithms, on various performance measures currently used in face localization.

[23] 
C. Sanderson and S. Bengio.
Extrapolating single view face models for multiview recognition. In International Conference on Intelligente Sensors, Sensor Networks and Information Processings, ISSNIP, pages 581586, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Performance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when there is only one training image available. We address this problem by extending each statistical frontal face model with artificially synthesized models for nonfrontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for nonfrontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors.

[24] 
C. Sanderson and S. Bengio.
Statistical transformations of frontal models for nonfrontal face verification. In IEEE International Conference on Image Processing, ICIP, pages 585588, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In the framework of a face verification system using local features and a Gaussian Mixture Model based classifier, we address the problem of nonfrontal face verification (when only a single (frontal) training image is available) by extending each client's frontal face model with artificially synthesized models for nonfrontal views. Furthermore, we propose the Maximum Likelihood Shift (MLS) synthesis technique and compare its performance against a Maximum Likelihood Linear Regression (MLLR) based technique (originally developed for adapting speech recognition systems) and the recently proposed "difference between two Universal Background Models" (UBMdiff) technique. All techniques rely on prior information and learn how a generic face model for the frontal view is related to generic models at nonfrontal views. Experiments on the FERET database suggest that that the proposed MLS technique is more suitable than MLLR (due to a lower number of free parameters) and UBMdiff (due to lack of heuristics). The results further suggest that extending frontal models considerably reduces errors.

[25] 
A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):709720, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, error rate is reduced by ~50% for single writer data and by ~25% for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.

[26] 
D. Zhang, D. GaticaPerez, S. Bengio, I. McCowan, and G. Lathoud.
Modeling individual and group actions in meetings: a twolayer hmm framework. In IEEE Workshop on Event Mining at the Conference on Computer Vision and Pattern Recognition, CVPR, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently groupbased (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a twolayer process, one that models basic individual activities from lowlevel audiovisual features, and another one that models the interactions. We propose a twolayer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that between has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on lowdimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public fivehour meeting corpus. Experiments and comparison with a singlelayer HMM baseline system show its validity.

[27] 
D. Zhang, D. GaticaPerez, S. Bengio, I. McCowan, and G. Lathoud.
Multimodal group action clustering in meetings. In ACM Multimedia Workshop on Video Surveillance and Sensor Networks, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of clustering multimodal group actions in meetings using a twolayer HMM framework. Meetings are structured as sequences of group actions. Our approach aims at creating one cluster for each group action, where the number of group actions and the action boundaries are unknown a priori. In our framework, the first layer models typical actions of individuals in meetings using supervised HMM learning and lowlevel audiovisual features. A number of options that explicitly model certain aspects of the data (e.g., asynchrony) were considered. The second layer models the group actions using unsupervised HMM learning. The two layers are linked by a set of probabilitybased features produced by the individual action layer as input to the group action layer. The methodology was assessed on a set of multimodal turntaking group actions, using a public fivehour meeting corpus. The results show that the use of multiple modalities and the layered framework are advantageous, compared to various baseline methods.

[1] 
E. BaillyBaillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz,
J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.P. Thiran.
The BANCA database and evaluation protocol. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625638. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we describe the acquistion and content of a new large, realistic and challenging multimodal database intended for training and testing multimodal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios, controlled, degraded and adverse over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.

[2] 
M. Barnard, J.M. Odobez, and S. Bengio.
Multimodal audiovisual event recognition for football analysis. In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 469478, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The recognition of events within multimodal data is a challenging problem. In this paper we focus on the recognition of events by using both audio and video data. We investigate the use of data fusion techniques in order to recognise these sequences within the framework of Hidden Markov Models (HMM) used to model audio and video data sequences. Specifically we look at the recognition of play and break sequences in football and the segmentation of football games based on these two events. Recognising relatively simple semantic events such as this is an important step towards full automatic indexing of such video material. These experiments were done using approximately 3 hours of data from two games of the Euro96 competition. We propose that modelling the audio and video streams separately for each sequence and fusing the decisions from each stream should yield an accurate and robust method of segmenting multimodal data.

[3] 
S. Bengio.
An asynchronous hidden markov model for audiovisual speech recognition. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15, pages 12371244. MIT Press, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audiovisual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions.

[4] 
S. Bengio.
Multimodal authentication using asynchronous HMMs. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770777. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audiovisual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other stateoftheart techniques.

[5] 
H. Bourlard, S. Bengio, and K. Weber.
Towards robust and adaptive speech recognition models. In M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld, editors, Mathematical Foundations of Speech and Language Processing, Institute for Mathematics and its Applications (IMA) Series, Volume 138, pages 169189. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches, which somewhat deviate from the usual ASR approaches but which have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multistyle” training. More specifically, we will motivate and briefly describe new approaches based on multistream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multistream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[6] 
R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures. International Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349365, 2003. [ .ps.gz  .pdf  .djvu  weblink  abstract] A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative loglikelihood.

[7] 
J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audiovisual person identity verification. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752760. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system conciliates the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited like in a smart card. Our experimental results show that the multimodal fusion allows to reduce significantly the user template size while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.

[8] 
M. Magimai Doss, T. A. Stephenson, H. Bourlard, and S. Bengio.
Phonemegrapheme based speech recognition system. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU, pages 9498, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Stateoftheart ASR systems typically use phoneme as the subword units. In this paper, we investigate a system where the word models are defined interms of two different subword units, i.e., phonemes and graphemes. We train models for both the subword units, and then perform decoding using either both or just one subword unit. We have studied this system for American English language where there is weak correspondence between the grapheme and phoneme. The results from our studies show that there is good potential in using grapheme as auxiliary subword units.

[9] 
D. GaticaPerez, I. McCowan, M. Barnard, S. Bengio, and H. Bourlard.
On automatic annotation of meeting databases. In IEEE International Conference on Image Processing, ICIP, volume 3, pages 629632, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we discuss meetings as an application domain for multimedia content analysis. Meeting databases are a rich data source suitable for a variety of audio, visual and multimodal tasks, including speech recognition, people and action recognition, and information retrieval. We specifically focus on the task of semantic annotation of audiovisual (AV) events, where annotation consists of assigning labels (event names) to the data. In order to develop an automatic annotation system in a principled manner, it is essential to have a welldefined task, a standard corpus and an objective performance measure. In this work we address each of these issues to automatically annotate events based on participant interactions.

[10] 
N. Gilardi and S. Bengio.
Comparison of four machine learning algorithms for spatial data analysis. In G. Dubois, J. Malczewski, and M. DeCort, editors, Mapping radioactivity in the environment  Spatial Interpolation Comparison 97, pages 222237. Office for Official Publications of the European Communities, Luxembourg, 2003. [ .ps.gz  .pdf  .djvu  abstract] This chapter proposes a clear methodology on how to use machine learning algorithms for spatial data analysis in order to avoid any bias and eventually obtain fair estimation of their performance on new data. Four different machine learning algorithms are presented, namely multilayer perceptrons (MLP), mixture of experts (ME), support vector regression (SVR) and a local version of the latter (local SVR). Evaluation criteria adapted to geostatistical problems are also presented in order to compare adequately different models on the same dataset. Finally, an experimental comparison is given on the SIC97 dataset as well as an analysis of the results.

[11] 
Q. Le and S. Bengio.
Client dependent GMMSVM models for speaker verification. In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443451. Springer Verlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability in handling variable size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A crossvalidation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.

[12] 
I. McCowan, S. Bengio, D. GaticaPerez, G. Lathoud, F. Monay, D. Moore,
P. Wellner, and H. Bourlard.
Modeling human interaction in meetings. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 4, pages 748751, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper investigates the recognition of group actions in meetings by modeling the joint behaviour of participants. Many meeting actions, such as presentations, discussions and consensus, are characterised by similar or complementary behaviour across participants. Recognising these meaningful actions is an important step towards the goal of providing effective browsing and summarisation of processed meetings. In this work, a corpus of meetings was collected in a room equipped with a number of microphones and cameras. The corpus was labeled in terms of a predefined set of meeting actions characterised by global behaviour. In experiments, audio and visual features for each participant are extracted from the raw data and the interaction of participants is modeled using HMMbased approaches. Initial results on the corpus demonstrate the ability of the system to recognise the set of meeting actions.

[13] 
I. McCowan, D. GaticaPerez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions. In Ambient Intelligence, Lecture Notes in Computer Science, volume LNCS 2875, pages 235251, Eindhoven, 2003. SpringerVerlag. [ .ps.gz  .pdf  .djvu  weblink  abstract] People meet in order to interact  disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[14] 
K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio,
F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou,
W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar,
F. Deravi, and N. Mavity.
Face verification competition on the XM2VTS database. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 964974. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink ] 
[15] 
N. Poh and S. Bengio.
Nonlinear variance reduction techniques in biometric authentication. In IEEE Multimodal User Authentication Workshop, 2003. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple synthetic samples, different extractors (features) and biometric modalities. The scores are combined using the average operator, MultiLayer Perceptron and Support Vector Machines. It is found empirically that VR via modalities is the best technique, followed by VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[16] 
N. Poh, S. Marcel, and S. Bengio.
Improving face authentication using virtual samples. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 3, pages 233236, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we present a simple yet effective way to improve a face verification system by generating multiple virtual samples from the unique image corresponding to an access request. These images are generated using simple geometric transformations. This method is often used during training to improve accuracy of a neural network model by making it robust against minor translation, scale and orientation change. The main contribution of this paper is to introduce such method during testing. By generating N images from one single image and propagating them to a trained network model, one obtains N scores. By merging these scores using a simple mean operator, we show that the variance of merged scores is decreased by a factor between 1 and N. An experiment is carried out on the XM2VTS database which achieves new stateoftheart performances.

[17] 
C. Sanderson and S. Bengio.
Augmenting frontal face models for nonfrontal verification. In IEEE Multimodal User Authentication Workshop, 2003. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this work we propose to address the problem of nonfrontal face verification when only a frontal training image is available (e.g. a passport photograph) by augmenting a client's frontal face model with artificially synthesized models for nonfrontal views. In the framework of a Gaussian Mixture Model (GMM) based classifier, two techniques are proposed for the synthesis: UBMdiff and LinReg. Both techniques rely on a priori information and learn how face models for the frontal view are related to face models at a nonfrontal view. The synthesis and augmentation approach is evaluated by applying it to two face verification systems: Principal Component Analysis (PCA) based and DCTmod2 (Sanderson et al, 2003) based; the two systems are a representation of holistic and nonholistic approaches, respectively. Results from experiments on the FERET database suggest that in almost all cases, frontal model augmentation has beneficial effects for both systems; they also suggest that the LinReg technique (which is based on multivariate regression of classifier parameters) is more suited to the PCA based system and that the UBMdiff technique (which is based on differences between two general face models) is more suited to the DCTmod2 based system. The results also support the view that the standard DCTmod2/GMM system (trained on frontal faces) is less affected by outofplane rotations than the corresponding PCA/GMM system;moreover, the DCTmod2/GMM system using augmented models is, in almost all cases, more robust than the corresponding PCA/GMM system.

[18] 
C. Sanderson and S. Bengio.
Robust features for frontal authentication in difficult image conditions. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 495504. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper we extend the recently proposed DCTmod2 feature extraction technique (which utilizes polynomial coefficients derived from 2D DCT coefficients obtained from horizontally & vertically neighbouring blocks) via the use of various windows and diagonally neighbouring blocks. We also propose enhanced PCA, where traditional PCA feature extraction is combined with DCTmod2. Results using test images corrupted by a linear and a nonlinear illumination change, white Gaussian noise and compression artefacts, show that use of diagonally neighbouring blocks and windowing is detrimental to robustness against illumination changes while being useful for increasing robustness against white noise and compression artefacts. We also show that the enhanced PCA technique retains all the positive aspects of traditional PCA (that is, robustness against white noise and compression artefacts) while also being robust to illumination changes; moreover, enhanced PCA outperforms PCA with histogram equalisation preprocessing.

[19] 
C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F.
BenZeghiba, F. Cardinaux, and S. Marcel.
Speech & face based biometric authentication at IDIAP. In International Conference on Multimedia and Expo, ICME, volume 3, pages 14, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We present an overview of recent research at IDIAP on speech & face based biometric authentication. This paper covers usercustomised passwords, adaptation techniques, confidence measures (for use in fusion of audio & visual scores), face verification in difficult image conditions, as well as other related research issues. We also overview the open source Torch library, which has aided in the implementation of the above mentioned techniques.

[20] 
A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of large vocabulary cursive handwritten text. In International Conference on Document Analysis and Recognition, ICDAR, pages 11011105, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents a system for the offline recognition of cursive handwritten lines of text. The system is based on continuous density HMMs and Statistical Language Models. The system recognizes data produced by a single writer. No apriori knowledge is used about the content of the text to be recognized. Changes in the experimental setup with respect to the recognition of single words are highlighted. The results show a recognition rate of ~85% with a lexicon containing 50'000 words. The experiments were performed over a publicly available database.

[21] 
K. Weber, S. Ikbal, S. Bengio, and H. Bourlard.
Robust speech recognition and feature extraction using HMM2. Computer, Speech and Language, 17(23):195211, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents the theoretical basis and preliminary experimental results of a new HMM model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state specific, HMMs working in the acoustic feature space. Thus, while the primary HMM is performing the usual time warping and integration, the secondary HMMs are responsible for extracting/modeling the possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as a more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it will be shown that HMM2 can be used to extract noise robust features, supposed to correspond to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database.

[1] 
S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification. Information Fusion, 3(4):267276, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures did improve statistically significantly the performance, while other measures produced reliable confidence levels over the fusion decisions.

[2] 
H. Bourlard, T. Adali, S. Bengio, J. Larsen, and S. Douglas, editors.
Proceedings of the Twelfth IEEE Workshop on Neural Networks for Signal Processing (NNSP). IEEE Press, 2002. 
[3] 
H. Bourlard and S. Bengio.
Hidden markov models and other finite state automata for sequence processing. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, Second Edition. The MIT Press, 2002. [ .ps.gz  .pdf  .djvu  idiapRR ] 
[4] 
R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):11051114, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Support Vector Machines (SVMs) are currently the stateoftheart models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve reallife problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

[5] 
R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633640. MIT Press, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] Support Vector Machines (SVMs) are currently the stateoftheart models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve reallife problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest.

[6] 
R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures. In S. Lee and A. Verri, editors, International Workshop on Pattern Recognition with Support Vector Machines, SVM, Lecture Notes in Computer Science, volume LNCS 2388, pages 823. SpringerVerlag, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative loglikelihood.

[7] 
N. Gilardi, S. Bengio, and M. Kanevski.
Conditional gaussian mixture models for environmental risk mapping. In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 777786, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper proposes the use of Gaussian Mixture Models to estimate conditional probability density functions in an environmental risk mapping context. A conditional Gaussian Mixture Model has been compared to the geostatistical method of Sequential Gaussian Simulations and shows good performances in reconstructing local PDF. The data sets used for this comparison are parts of the digital elevation model of Switzerland.

[8] 
S. Marcel and S. Bengio.
Improving face verification using skin color information. In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 2, pages 1115. IEEE Computer Society Press, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. Stateoftheart methods often use the grayscale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance.

[9] 
S. Marcel, C. Marcel, and S. Bengio.
A stateoftheart neural network for robust face verification. In COST275 Workshop on the advent of Biometrics on the Internet, 2002. [ .ps.gz  .pdf  .djvu  abstract] The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. Stateoftheart methods often use the grayscale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance and that the proposed model achieves robust stateoftheart results.

[10] 
J. Mariéthoz and S. Bengio.
A comparative study of adaptation methods for speaker verification. In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Reallife speaker verification systems are often implemented using client model adaptation methods, since the amount of data available for each client is often too low to consider plain Maximum Likelihood methods. While the Bayesian Maximum A Posteriori (MAP) adaptation method is commonly used in speaker verification, other methods have proven to be successful in related domains such as speech recognition. This paper proposes an experimental comparison between three wellknown adaptation methods, namely MAP, Maximum Likelihood Linear Regression, and finally EigenVoices. All three methods are compared to the more classical Maximum Likelihood method, and results are given for a subset of the 1999 NIST Speaker Recognition Evaluation database.

[11] 
N. Poh, S. Bengio, and J. Korczak.
A multisample multisource model for biometric authentication. In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 375384, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this study, two techniques that can improve the authentication process are examined: (i) multiple samples and (ii) multiple biometric sources. We propose the fusion of multiple samples obtained from multiple biometric sources at the score level. By using the average operator, both the theoretical and empirical results show that integrating as many samples and as many biometric sources as possible can improve the overall reliability of the system. This strategy is called multisample multisource approach. This strategy was tested on a reallife database using neural networks trained in oneversusall configuration.

[12] 
A. Vinciarelli and S. Bengio.
Offline cursive word recognition using continuous density hidden markov models trained with PCA or ICA features. In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 3, pages 8184. IEEE Computer Society Press, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This work presents an Offline Cursive Word Recognition System dealing with single writer samples. The system was a continuous density hiddden Markov model trained using either the raw data, or data transformed using Principal Component Analysis or Independent Component Analysis. Both techniques significantly improved the recognition rate of the system. Preprocessing, normalization and feature extraction are described in detail as well as the training technique adopted. Several experiments were performed using a publicly available database. The accuracy obtained is the highest presented in the literature over the same data.

[13] 
A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based offline cursive script recognition. In Proceedings of the 8th International Conference on Frontiers in Handwriting Recognition, pages 287291, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] This work presents the application of HMM adaptation techniques to the problem of OffLine Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when less than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.

[14] 
A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based offline cursive script recognition. Pattern Recognition Letters, 23(8):905916, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This work presents the application of HMM adaptation techniques to the problem of OffLine Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when less than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.

[15] 
K. Weber, S. Bengio, and H. Bourlard.
Increasing speech recognition noise robustness with HMM2. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 929932, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] The purpose of this paper is to investigate the behavior of HMM2 models for the recognition of noisy speech. It has previously been shown that HMM2 is able to model dynamically important structural information inherent in the speech signal, often corresponding to formant positions/tracks. As formant regions are known to be robust in adverse conditions, HMM2 seems particularly promising for improving speech recognition robustness. Here, we review different variants of the HMM2 approach with respect to their application to noiserobust automatic speech recognition. It is shown that HMM2 has the potential to tackle the problem of mismatch between training and testing conditions, and that a multistream combination of (already noiserobust) cepstral features and formantlike features (extracted by HMM2) improves the noise robustness of a stateoftheart automatic speech recognition system.

[16] 
K. Weber, F. de Wet, B. Cranen, L. Boves, S. Bengio, and H. Bourlard.
Evaluation of formantlike features for ASR. In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002. [ .ps.gz  .pdf  .djvu  abstract] This paper investigates possibilities to automatically find a lowdimensional, formantrelated physical representation of the speech signal, which is suitable for automatic speech recognition (ASR). This aim is motivated by the fact that formants have been shown to be discriminant features for ASR. Combinations of automatically extracted formantlike features and `conventional', noiserobust, stateoftheart features (such as MFCCs including spectral subtraction and cepstral mean subtraction) have previously been shown to be more robust in adverse conditions than stateoftheart features alone. However, it is not clear how these automatically extracted formantlike features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formantlike features, and to compare these features to handlabeled formant tracks as well as to standard MFCCs in terms of their performance on a vowel classification task

[1] 
S. Bengio and J. Mariéthoz.
Learning the decision function for speaker verification. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 425428, 2001. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper explores the possibility to replace the usual thresholding decision rule of log likelihood ratios used in speaker verification systems by more complex and discriminant decision functions based for instance on Linear Regression models or Support Vector Machines. Current speaker verification systems, based on generative models such as HMMs or GMMs, can indeed easily be adapted to use such decision functions. Experiments on both text dependent and text independent tasks always yielded performance improvements and sometimes significantly.

[2] 
H. Bourlard, S. Bengio, and K. Weber.
New approaches towards robust and adaptive speech recognition. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, NIPS 13, pages 751757. MIT Press, 2001. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multistream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multistream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[3] 
R. Collobert and S. Bengio.
SVMTorch: Support vector machines for largescale regression problems. Journal of Machine Learning Research, JMLR, 1:143160, 2001. [ .ps.gz  .pdf  .djvu  weblink  abstract] Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l square memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVMLight proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve largescale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for largescale regression problems from Flake and Lawrence (2000) yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

[4] 
J.L. DesGranges, P. Agin, and S. Bengio.
The use of predictive models of breeding bird assemblages for assessing and monitoring forest bird diversity. In A. Franc, O. Laroussinie, and T. Karjalainen, editors, Criteria and Indicators for Sustainable Forest Management at the Forest Management Unit Level, volume 38, pages 181200. European Forest Institute Proceedings, 2001. 
[5] 
K. Weber, S. Bengio, and H. Bourlard.
HMM2 extraction of formant features and their use for robust ASR. In Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 2001. [ .ps.gz  .pdf  .djvu  abstract] As recently introduced, an HMM2 can be considered as a particular case of an HMM mixture in which the HMM emission probabilities (usually estimated through Gaussian mixtures or an artificial neural network) are modeled by statedependent, featurebased HMM (referred to as frequency HMM). A general EM training algorithm for such a structure has already been developed. Although there are numerous motivations for using such a structure, and many possible ways to exploit it, this paper will mainly focus on one particular instantiation of HMM2 in which the frequency HMM will be used to extract formant structure information, which will then be used as additional acoustic features in a standard Automatic Speech Recognition (ASR) system. While the fact that this architecture is able to automatically extract meaningful formant information is interesting by itself, empirical results will also show the robustness of these features to noise, and their potential to enhance stateoftheart noiserobust HMMbased ASR.

[6] 
K. Weber, S. Bengio, and H. Bourlard.
Speech recognition using advanced HMM2 features. In Proceedings of the Automatic Speech Recognition and Understanding Workshop, ASRU, pages 6568, 2001. [ .ps.gz  .pdf  .djvu  weblink  abstract] HMM2 is a particular hidden Markov model where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) statedependent frequencybased HMMs. As shown previously, a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach towards using a full HMM2 as feature extractor, working in the spectral domain, and extracting robust formantlike features for standard ASR system. HMM2 performs a nonlinear, statedependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To further improve the HMM2 system, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where word error rates of 15%, using only a 4dimensional feature vector (3 formantlike parameters and one time index) were obtained.

[1] 
S. Bengio and Y. Bengio.
Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transaction on Neural Networks, special issue on data mining and knowledge discovery, 11(3):550557, 2000. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The curse of dimensionality is severe when modeling highdimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling highdimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables (thus reducing significantly the number of parameters). Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[2] 
Y. Bengio and S. Bengio.
Modeling highdimensional discrete data with multilayer neural networks. In S.A. Solla, T.K. Leen, and K.R. Müller, editors, Advances in Neural Information Processing Systems, NIPS 12, pages 400406. MIT Press, 2000. [ .ps.gz  .pdf  .djvu  weblink  abstract] The curse of dimensionality is severe when modeling highdimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling highdimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[3] 
N. Gilardi and S. Bengio.
Local machine learning models for spatial data analysis. Journal of Geographic Information and Decision Analysis, 4(1):1128, 2000. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper, we compare different machine learning algorithms applied to non stationary spatial data analysis. We show that models taking into account local variability of the data are better than models which are trained globally on the whole dataset. Two global models (Support Vector Regression and Multilayer Perceptrons) and two local models (a local version of Support Vector Regression and Mixture of Experts) were compared over the Spatial Interpolation Comparison 97 (SIC97) dataset, and the results are presented and compared to previous results obtained on the same dataset.

[4] 
T. A. Stephenson, H. Bourlard, S. Bengio, and A. C. Morris.
Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables. In Proceedings of the International Conference on Speech and Language Processing, ICSLP, Beijing, China, October 2000. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Current technology for automatic speech recognition (ASR) uses hidden Markov models (HMMs) that recognize spoken speech using the acoustic signal. However, no use is made of the causes of the acoustic signal: the articulators. We present here a dynamic Bayesian network (DBN) model that utilizes an additional variable for representing the state of the articulators. A particular strength of the system is that, while it uses measured articulatory data during its training, it does not need to know these values during recognition. As Bayesian networks are not used often in the speech community, we give an introduction to them. After describing how they can be used in ASR, we present a system to do isolated word recognition using articulatory information. Recognition results are given, showing that a system with both acoustics and inferred articulatory positions performs better than a system with only acoustics.

[5] 
K. Weber, S. Bengio, and H. Bourlard.
HMM2 a novel approach to HMM emission probability estimation. In Proceedings of the International Conference on Speech and Language Processing, ICSLP, Beijing, China, October 2000. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we discuss and investigate a new method to estimate local emission probabilities in the framework of hidden Markov models (HMM). Each feature vector is considered to be a sequence and is supposed to be modeled by yet another HMM. Therefore, we call this approach `HMM2'. There is a variety of possible topologies of such HMM2 systems, e.g. incorporating trellis or ergodic HMM structures. Preliminary HMM2 speech recognition experiments on cepstral and spectral features yielded worse results than stateoftheart systems. However, we believe that HMM2 systems have a lot of potential advantages and are therefore worth investigating further.

[1] 
S. Bengio.
Intégration des systèmes tutoriels traditionnels et des systèmes tutoriels intelligents. Master's thesis, Département d'Informatique et de Recherche Opérationnelle, Université de Montréal, 1989. 
[2] 
S. Bengio.
Optimisation d'une règle d'apprentissage pour réseaux de neurones artificiels. PhD thesis, Département d'Informatique et Recherche Opérationnelle. Université de Montréal, 1993. [ .ps.gz  .pdf  .djvu ] 
[3] 
S. Bengio and Y. Bengio.
An EM algorithm for asynchronous input/output hidden markov models. In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996. [ .ps.gz  .pdf  .djvu  abstract] In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend here this model to the case of asynchronous sequences, and show an ExpectationMaximization algorithm for training such models.

[4] 
S. Bengio, Y. Bengio, and J. Cloutier.
Use of genetic programming for the search of a new learning rule for neural networks. In Proceedings of the First Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, volume 1, pages 324327, 1994. [ .ps.gz  .pdf  .djvu  weblink ] 
[5] 
S. Bengio, Y. Bengio, and J. Cloutier.
On the search for new learning rules for ANNs. Neural Processing Letters, 2(4):2630, 1995. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper, we present a framework where a learning rule can be optimized within a parametric learning rule space. We define what we callparametric learning rules and present a theoretical study of theirgeneralization properties when estimated from a set of learning tasks and tested over another set of tasks. We corroborate the results of this study with practical experiments.

[6] 
S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
Aspects théoriques de l'optimisation d'une règle d'apprentissage. In Actes de la conférence NeuroNîmes 1992, Nîmes, France, 1992. [ .ps.gz  .pdf  .djvu ] 
[7] 
S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
On the optimization of a synaptic learning rule. In Conference on Optimality in Biological and Artificial Networks, Dallas, USA, 1992. [ .ps.gz  .pdf  .djvu ] 
[8] 
S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
Generalization of a parametric learning rule. In S. Gielen and B. Kappen, editors, Proceedings of the International Conference on Artificial Neural Networks, ICANN'93, pages 502502, Amsterdam, Nederlands, 1993. SpringerVerlag. [ .ps.gz  .pdf  .djvu ] 
[9] 
S. Bengio, Y. Bengio, J. Cloutier, and J. Gecsei.
On the optimization of a synaptic learning rule. In D. S. Levine and W. R. Elsberry, editors, Optimality in Biological and Artificial Networks?, pages 265287. Lawrence Erlbaum Associates, 1997. [ .ps.gz  .pdf  .djvu ] 
[10] 
S. Bengio, Y. Bengio, J. Robert, and G. Bélanger.
Stochastic learning of strategic equilibria for auctions. Neural Computation, 11(5):11991209, 1999. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper presents a new application of stochastic adaptive learning algorithms to the computation of strategic equilibria in auctions. The proposed approach addresses the problems of tracking a moving target and balancing exploration (of action space) versus exploitation (of better modeled regions of action space). Neural networks are used to represent a stochastic decision model for each bidder. Experiments confirm the correctness and usefulness of the approach.

[11] 
S. Bengio, G. Brassard, Y. Desmedt, C. Goutier, and J.J. Quisquater.
Secure implementation of identification systems. Journal of Cryptology, 4(3):175183, 1991. [ weblink  abstract] In this paper we demonstrate that widely known identification systems, such as the publicfilebased FeigeFiatShamir scheme, can be insecure if proper care is not taken with their implementation. We suggest possible solutions. On the other hand, identitybased versions of the FeigeFiatShamir scheme are conceptually more complicated than necessary.

[12] 
S. Bengio, F. Clerot, A. Gravey, and D. Collobert.
Dynamical resource reservation schemes in an ATM network using neural networkbased traffic prediction. In D. D. Kouvatsos, editor, Proceedings of the IFIP Fifth International Workshop on Performance Modelling and Evaluation of ATM Networks. Kluwer B. V., 1997. [ .ps.gz  .pdf  .djvu  weblink  abstract] Using real traffic data, we show that neural networkbased prediction techniques can be used to predict the queuing behaviour of highly bursty traffics typical of LAN interconnection in a way accurate enough so as to allow dynamical renegotiation of a DBR traffic contract at the edge of an ATM network. The performances of predictorbased in service renegotiation are evaluated in terms of renegotiation errors and reserved bandwidth for the the DBR traffic handling capability and are shown to be very encouraging for the use of connectionist prediction techniques for the management of bursty traffics in ATM networks.

[13] 
S. Bengio, F. Fessant, and D. Collobert.
A connectionist system for mediumterm horizon time series prediction. In International Workshop on Applications of Neural Networks to Telecommunications, IWANNT, Stockholm, Sweden, 1995. [ .ps.gz  .pdf  .djvu ] 
[14] 
S. Bengio, F. Fessant, and D. Collobert.
Use of modular architectures for time series prediction. Neural Processing Letters, 3(2):101106, 1996. [ .ps.gz  .pdf  .djvu  weblink ] 
[15] 
S. Bengio, C. Frasson, and J. Gecsei.
Integrating traditional and intelligent computerized tutoring. In Fourth International Symposium on Computer and Information Sciences, Cesme, Turkey, 1989. 
[16] 
S. Bengio, C. Frasson, and J. Gecsei.
Utilisation de systèmes d'EAO dans des systèmes d'EIAO. In Sixième symposium canadien sur la technologie pédagogique, Halifax, Canada, 1989. 
[17] 
Y. Bengio and S. Bengio.
Training asynchronous input/output hidden markov models. In AAAI Spring Symposium on Computational Issues in Learning Models of Dynamical Systems, 1996. [ .ps.gz  .pdf  .djvu ] 
[18] 
Y. Bengio, S. Bengio, and J. Cloutier.
Learning a synaptic learning rule. In Proceedings of the International Joint Conference on Neural Networks, IJCNN, volume 2, pages 969974, Seattle, USA, 1991. [ .ps.gz  .pdf  .djvu  weblink ] 
[19] 
Y. Bengio, S. Bengio, J.F. Isabelle, and Y. Singer.
Shared context probabilistic transducers. In Advances in Neural Information Processing Systems, NIPS 10, 1998. [ .ps.gz  .pdf  .djvu  weblink  abstract] Recently, a model for supervised learning of probabilistic transducers represented by suffix trees was introduced. However, this algorithm tends to build very large trees, requiring very large amounts of computer memory. In this paper, we propose a new, more compact, transducer model in which one shares the parameters of distributions associated to contexts yielding similar conditional output distributions. We illustrate the advantages of the proposed algorithm with comparative experiments on inducing a noun phrase recognizer.

[20] 
Y. Bengio, S. Bengio, Y. Pouliot, and P. Agin.
A neural network to detect homologies in proteins. In Advances in Neural Information Processing Systems, NIPS 2, San Mateo, CA, USA, 1990. Morgan Kaufmann. [ .ps.gz  .pdf  .djvu  weblink ] 
[21] 
Y. Desmedt, C. Goutier, and S. Bengio.
Special uses and abuses of the FiatShamir passport protocol. In Advances in Cryptology, Crypto, Lecture Notes in Computer Science, volume LNCS 293, pages 2139, Santa Barbara, USA, 1988. Springer Verlag. [ .ps.gz  .pdf  .djvu  weblink  abstract] If the physical description of a person would be unique and adequately used and tested, then the security of the fiatshamir scheme is not based on zeroknowledge. otherwise some new frauds exist. the feigefiatshamir scheme always suffers from these frauds. using an extended notion of subliminal channels, several other undetectable abuses of the fiatshamir protocol, which are not possible with ordinary passports, are discussed. this technique can be used by a terrorist sponsoring country to communicate 500 new words of secret information each time a tourist passport is verified. a nontrivial solution to avoid these subliminal channel problems is presented. the notion of relative zeroknowledge is introduced.

[22] 
F. Fessant, S. Bengio, and D. Collobert.
On the prediction of solar activity using different neural network models. Annales Geophysicae, 14:2026, 1996. [ .ps.gz  .pdf  .djvu  weblink ] 
[23] 
J.Y. Potvin and S. Bengio.
The vehicle routing problem with time windows  part II: Genetic search. INFORMS Journal on Computing, 8(2):165172, 1996. [ .ps.gz  .pdf  .djvu  weblink ] 
[1] 
S. Bengio.
An asynchronous hidden markov model for audiovisual speech recognition. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, NIPS 15, pages 12371244. MIT Press, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audiovisual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions.

[2] 
S. Bengio.
Multimodal speech processing using asynchronous hidden markov models. Information Fusion, 5(2):8189, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An ExpectationMaximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audiovisual speech processing tasks, namely speech recognition and textdependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[3] 
S. Bengio.
Statistical machine learning for HCI. In J.P. Thiran, F. Marqués, and H. Bourlard, editors, Multimodal Signal Processing: Theory and Applications for HumanComputer Interaction, pages 723. Academic Press, 2010. [ weblink  abstract] This chapter introduces the main concepts of statistical machine learning, as they are pivot in most algorithms tailored for multimodal signal processing. In particular, the chapter will cover a general introduction to machine learning and how it is used in classification, regression and density estimation. Following this introduction, two particularly well known models will be presented, together with their associated learning algorithm: support vector machines, which are wellknown for classification tasks, and hidden Markov models, which are tailored for sequence processing tasks such as speech recognition.

[4] 
S. Bengio and Y. Bengio.
An EM algorithm for asynchronous input/output hidden markov models. In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996. [ .ps.gz  .pdf  .djvu  abstract] In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend here this model to the case of asynchronous sequences, and show an ExpectationMaximization algorithm for training such models.

[5] 
S. Bengio, H. Bourlard, and K. Weber.
An EM algorithm for HMMs with emission distributions represented by HMMs. Technical Report IDIAPRR 0011, IDIAP, Martigny, Switzerland, 2000. [ .ps.gz  .pdf  .djvu  abstract] A novel approach to represent emission distributions of Hidden Markov Models is presented in this paper. Whereas they are usually estimated with Gaussian mixtures or neural networks, we propose to estimate them with another HMM, but in feature space. This representation, referred here as HMM^{2}, could enable the model to more accurately represent feature correlations with fewer parameters than standard HMMs. A full derivation of an EM algorithm is given in order to globally train all the HMM^{2} parameters. Preliminary experiments on speech data show promising results.

[6] 
S. Bengio and J. Keshet.
Introduction. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 310. Wiley, 2009. [ weblink  abstract] One of the most natural communication tools used by humans is their voice. It is hence natural that a lot of research has been devoted to analyzing and understanding human uttered speech for various applications. The most obvious one is automatic speech recognition, where the goal is to transcribe a recorded speech utterance into its corresponding sequence of words. Other applications include speaker recognition, where the goal is to determine either the claimed identity of the speaker (verification) or who is speaking (identification), and speaker segmentation or diarization, where the goal is to segment an acoustic sequence in terms of the underlying speakers (such as during a dialog). Although an enormous amount of research has been devoted to speech processing, there appears to be some form of local optimum in terms of the fundamental tools used to approach these problems. The aim of this book is to introduce the speech researcher community to radically different approaches based on more recent kernel based machine learning methods. In this introduction, we first briefly review the predominant speech processing approach, based on hidden Markov models, as well as its known problems; we then introduce the most well known kernel based approach, the Support Vector Machine (SVM), and finally outline the various contributions of this book.

[7] 
Y. Bengio and S. Bengio.
Training asynchronous input/output hidden markov models. In AAAI Spring Symposium on Computational Issues in Learning Models of Dynamical Systems, 1996. [ .ps.gz  .pdf  .djvu ] 
[8] 
H. Bourlard and S. Bengio.
Hidden markov models and other finite state automata for sequence processing. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, Second Edition. The MIT Press, 2002. [ .ps.gz  .pdf  .djvu  idiapRR ] 
[9] 
H. Bourlard, S. Bengio, M. Magimai Doss, Q. Zhu, B. Mesot, and N. Morgan.
Towards using hierarchical posteriors for flexible automatic speech recognition systems. In Proceedings of the DARPA EARS (Effective, Affordable, Reusable, Speechtotext) Rich Transcription (RT'04) Workshop, 2004. [ .ps.gz  .pdf  .djvu  abstract] Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posterior estimates, using all possible acoustic information (as available in the whole utterance), as well as possible prior information (such as topological constraints). Furthermore, this approach results in a family of new HMM based systems, where only (local and global) posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the sentence level. Initial results on several speech (as well as other multimodal) tasks resulted in significant improvements. In this paper, we present recognition results on Numbers'95 and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speechtotext (CTS) task.

[10] 
H. Bourlard, S. Bengio, and K. Weber.
New approaches towards robust and adaptive speech recognition. In T.K. Leen, T.G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, NIPS 13, pages 751757. MIT Press, 2001. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multistream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multistream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[11] 
H. Bourlard, S. Bengio, and K. Weber.
Towards robust and adaptive speech recognition models. In M. Johnson, S. Khudanpur, M. Ostendorf, and R. Rosenfeld, editors, Mathematical Foundations of Speech and Language Processing, Institute for Mathematics and its Applications (IMA) Series, Volume 138, pages 169189. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this paper, we discuss a family of new Automatic Speech Recognition (ASR) approaches, which somewhat deviate from the usual ASR approaches but which have recently been shown to be more robust to nonstationary noise, without requiring specific adaptation or “multistyle” training. More specifically, we will motivate and briefly describe new approaches based on multistream and subband ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) streams representing the speech signal are processed by different (independent) “experts”, each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multistream ASR, we will finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state specific feature based HMMs responsible for merging the stream information and modeling their possible correlation.

[12] 
F. de Wet, K. Weber, L. Boves, B. Cranen, S. Bengio, and H. Bourlard.
Evaluation of formantlike features for automatic speech recognition. Journal of the Acoustical Society of America (JASA), 116(3):17811792, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates possibilities to find a lowdimensional, formantrelated physical representation of speech signals, which is suitable for automatic speech recognition. This aim is motivated by the fact that formants are known to be discriminant features for speech recognition. Combinations of automatically extracted formantlike features and stateoftheart, noiserobust features have previously been shown to be more robust in adverse conditions than stateoftheart features alone. However, it is not clear how these automatically extracted formantlike features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formantlike features, i.e. robust formants and HMM2 features, and to compare these features to handlabeled formants as well as to melfrequency cepstral coefficients in terms of their performance on a vowel classification task. The speech data and handlabeled formants that were used in this study are a subset of the American English vowels database presented in [Hillenbrand et al., J. Acoust. Soc. Am. 97, 30993111 (1995)]. Classification performance was measured on the original, clean data as well as in (simulated) adverse conditions. In combination with standard automatic speech recognition methods, the classification performance of the robust formant and HMM2 features compare very well to the performance of the handlabeled formants.

[13] 
C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 621624, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.

[14] 
C. Dimitrakakis and S. Bengio.
Boosting word error rates. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 501504, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the wellknown Numbers 95 database that indicate the importance of this temporal probability distribution.

[15] 
C. Dimitrakakis and S. Bengio.
Phoneme and sentencelevel ensembles for speech recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2011, 2011. [ .ps.gz  .pdf  .djvu  weblink  abstract] We address the question of whether and how boosting and bagging can be used for speech recognition. In order to do this, we compare two different boosting schemes, one at the phoneme level and one at the utterance level, with a phonemelevel bagging scheme. We control for many parameters and other choices, such as the state inference scheme used. In an unbiased experiment, we clearly show that the gain of boosting methods compared to a single hidden Markov model is in all cases only marginal, while bagging significantly outperforms all other methods. We thus conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.

[16] 
M. Magimai Doss, S. Bengio, and H. Bourlard.
Joint decoding for phonemegrapheme continuous speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 177180, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Standard ASR systems typically use phoneme as the subword units. Preliminary studies have shown that the performance of the ASR system could be improved by using grapheme as additional subword units. In this paper, we investigate such a system where the word models are defined in terms of two different subword units, i.e., phoneme and grapheme. During training, models for both the subword units are trained, and then during recognition either both or just one subword unit is used. We have studied this system for a continuous speech recognition task in American English language. Our studies show that grapheme information used along with phoneme information improves the performance of ASR.

[17] 
M. Magimai Doss, T. A. Stephenson, H. Bourlard, and S. Bengio.
Phonemegrapheme based speech recognition system. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU, pages 9498, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Stateoftheart ASR systems typically use phoneme as the subword units. In this paper, we investigate a system where the word models are defined interms of two different subword units, i.e., phonemes and graphemes. We train models for both the subword units, and then perform decoding using either both or just one subword unit. We have studied this system for American English language where there is weak correspondence between the grapheme and phoneme. The results from our studies show that there is good potential in using grapheme as auxiliary subword units.

[18] 
D. Grangier and S. Bengio.
Learning the interframe distance for discriminative templatebased keyword detection. In Proceedings of the 10th European Conference on Speech Communication and Technology, EurospeechInterspeech, 2007. [ .ps.gz  .pdf  .djvu  abstract] This paper proposes a discriminative approach to templatebased keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.

[19] 
D. Grangier, J. Keshet, and S. Bengio.
Discriminative keyword spotting. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 175194. Wiley, 2009. [ weblink  abstract] This chapter introduces a discriminative method for detecting and spotting keywords in spoken utterances. Given a word represented as a sequence of phonemes and a spoken utterance, the keyword spotter predicts the best time span of the phoneme sequence in the spoken utterance along with a confidence. If the prediction confidence is above certain level the keyword is declared to be spoken in the utterance within the predicted time span, otherwise the keyword is declared as not spoken. The problem of keyword spotting training is formulated as a discriminative task where the model parameters are chosen so the utterance in which the keyword is spoken would have higher confidence than any other spoken utterance in which the keyword is not spoken. It is shown theoretically and empirically that the proposed training method resulted with a high area under the Receiver Operating Characteristic (ROC) curve, the most common measure to evaluate keyword spotters. We present an iterative algorithm to train the keyword spotter efficiently. The proposed approach contrasts with standard spotting strategies based on Hidden Markov Models (HMMs), for which the training procedure does not maximize a loss directly related to the spotting performance. Several experiments performed on TIMIT and WSJ corpora show the advantage of our approach over HMMbased alternatives.

[20] 
G. Heigold, I. Moreno, S. Bengio, and N. Shazeer.
Endtoend textdependent speaker verification. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we present a datadriven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domainspecific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for textdependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easytomaintain systems with a small footprint.

[21] 
J. Keshet and S. Bengio, editors.
Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods. Wiley, 2009. [ weblink  abstract] This is the first book dedicated to uniting research related to speech and speaker recognition based on the recent advances in large margin and kernel methods. The first part of the book presents theoretical and practical foundations of large margin and kernel methods, from Support Vector Machines to large margin methods for structured learning. The second part of the book is dedicated to acoustic modeling of continuous speech recognizers, where the grounds for practical large margin sequence learning are set. The third part introduces large margin methods for discriminative language modeling. The last part of the book is dedicated to the application of keyword spotting, speaker verification and spectral clustering. The book is an important reference to researchers and practitioners in the field of modern speech and speaker recognition. The purpose of the book is twofold: first, to set the theoretical foundation of large margin and kernel methods relevant to the speech recognition domain; second, to propose a practical guide on implementation of these methods to the speech recognition domain. The reader is presumed to have basic knowledge of large margin and kernel methods and of basic algorithms in speech and speaker recognition.

[22] 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting. In ISCA Research Workshop on Non Linear Speech Processing, NOLISP, 2007. [ .ps.gz  .pdf  .djvu  abstract] This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on nonlinearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vectorspace which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMMbased approach.

[23] 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting. Speech Communication, 51:317329, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vectorspace, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulted discriminative system outperforms the conventional contextindependent HMMbased system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGIStories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional contextindependent HMMbased system.

[24] 
J. Keshet, S. ShalevShwartz, S. Bengio, Y. Singer, and D. Chazan.
Discriminative kernelbased phoneme sequence recognition. In Proceedings of the International Conference on Spoken Language Processing, InterspeechICSLP, 2006. [ .ps.gz  .pdf  .djvu  abstract] We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMMbased approaches, our method uses a discriminative kernelbased training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance along with a proposed phoneme sequence to a vectorspace endowed with an innerproduct that is realized by a Mercer kernel. Building on large margin techniques for predicting whole sequences, we are able to devise a learning algorithm which distills to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and further describe an efficient implementation of it. We present initial encouraging experimental results with the TIMIT and compare the proposed method to an HMMbased approach.

[25] 
H. Ketabdar, H. Bourlard, and S. Bengio.
Hierarchical multistream posterior based speech recognition system. In Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3869, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on “state gamma posterior” definition (typically used in standard HMMs training) extended to the case of multistream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multistream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speechtotext (CTS) task, this resulted in significant performance improvement, compared to the stateoftheart Tandem systems.

[26] 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Developing and enhancing posterior based speech recognition systems. In Proceedings of the 9th European Conference on Speech Communication and Technology, EurospeechInterspeech, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Local state or phone posterior probabilities are often investigated as local scores (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) to improve speech recogni tion systems. In this paper, we present initial results towards boosting these approaches by improving posterior estimat es, using acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). In the present work, the enhanced posterior distribution is associated with the “gamma” distribution typically used in standard HMMs training, and estimated from local likelihoods (GMM) or local posteriors (ANN). This approach results in a family of new HMM based systems, where only posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the phone and word levels, and integrating the appropriate context and prior knowledge in each level. In the present work, we used the resulting posteriors as local scores in a Viter bi decoder. On the OGI Numbers'95 database, this resulted in improved recognition performance, compared to a stateoftheart hybrid HMM/ANN system.

[27] 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Posterior based keyword spotting with a priori thresholds. In Proceedings of the International Conference on Spoken Language Processing, InterspeechICSLP, 2006. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we propose a new posterior based scoring approach for keyword and non keyword (garbage) elements. The estimation of these scores is based on HMM state posterior probability definition, taking into account long contextual information and the prior knowledge (e.g. keyword model topology). The state posteriors are then integrated into keyword and garbage posteriors for every frame. These posteriors are used to make a decision on detection of the keyword at each frame. The frame level decisions are then accumulated (in this case, by counting) to make a global decision on having the keyword in the utterance. In this way, the contribution of possible outliers are minimized, as opposed to the conventional Viterbi decoding approach which accumulates likelihoods. Experiments on keywords from the Conversational Telephone Speech (CTS) and Numbers'95 databases are reported. Results show that the new scoring approach leads to better trade off between true and false alarms compared to the Viterbi decoding approach, while also providing the possibility to precalculate keyword specific spotting thresholds related to the length of the keywords.

[28] 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Using more informative posterior probabilities for speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we present initial investigations towards boosting posterior probability based speech recognition systems by estimating more informative posteriors taking into account acoustic context (e.g., the whole utterance), as well as possible prior information (such as phonetic and lexical knowledge). These posteriors are estimated based on HMM state posterior probability definition (typically used in standard HMMs training). This approach provides a new, principled, theoretical framework for hierarchical estimation/use of more informative posteriors integrating appropriate context and prior knowledge. In the present work, we used the resulting posteriors as local scores for decoding. On the OGI numbers database, this resulted in significant performance improvement, compared to using MLP estimated posteriors for decoding (hybrid HMM/ANN approach) for clean and more specially for noisy speech. The system is also shown to be much less sensitive to tuning factors (such as phone deletion penalty, language model scaling) compared to the standard HMM/ANN and HMM/GMM systems, thus practically it does not need to be tuned to achieve the best possible performance.

[29] 
J. Mariéthoz, S. Bengio, and Y. Grandvalet.
Kernelbased textindependent speaker verification. In J. Keshet and S. Bengio, editors, Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods, pages 195220. Wiley, 2009. [ weblink  abstract] The goal of a person authentication system is to certify/attest the claimed identity of a user. When this authentication is based on the voice of the user, without respect to what the user exactly said, the system is called a textindependent speaker verification system. Speaker verification systems are increasingly often used to secure personal information, particularly for mobile phone based applications. Furthermore, textindependent versions of speaker verification systems are most used for their simplicity, as they do not require complex speech recognition modules. The most common approach to this task is based on Gaussian Mixture Models (GMMs) (Reynolds et al. 2000), which do not take into account any temporal information. GMMs have been intensively used thanks to their good performance, especially with the use of the Maximum a posteriori (MAP) (Gauvain and Lee 1994) adaptation algorithm. This approach is based on the density estimation of an impostor data distribution, followed by its adaptation to a specific client data set. Note that the estimation of these densities is not the final goal of speaker verification systems, which is rather to discriminate the client and impostor classes; hence discriminative approaches might appear good candidates for this task as well. As a matter of fact, Support Vector Machine (SVM) based systems have been the subject of several recent publications in the speaker verification community, in which they obtain Ma performance similar to or even better than GMMs on several textindependent speaker verification tasks. In order to use SVMs or any other discriminant approaches for speaker verification, several modifications of the classical techniques need to be performed. The purpose of this chapter is to present an overview of discriminant approaches that have been used successfully for the task of textindependent speaker verification, to analyze their differences from and their similarities to each other and to classical generative approaches based on GMMs. An opensource version of the C++ source code used to performed all experiments described in this chapter can be found at http://speaker.abracadoudou.com.

[30] 
T. A. Stephenson, H. Bourlard, S. Bengio, and A. C. Morris.
Automatic speech recognition using dynamic Bayesian networks with both acoustic and articulatory variables. In Proceedings of the International Conference on Speech and Language Processing, ICSLP, Beijing, China, October 2000. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Current technology for automatic speech recognition (ASR) uses hidden Markov models (HMMs) that recognize spoken speech using the acoustic signal. However, no use is made of the causes of the acoustic signal: the articulators. We present here a dynamic Bayesian network (DBN) model that utilizes an additional variable for representing the state of the articulators. A particular strength of the system is that, while it uses measured articulatory data during its training, it does not need to know these values during recognition. As Bayesian networks are not used often in the speech community, we give an introduction to them. After describing how they can be used in ASR, we present a system to do isolated word recognition using articulatory information. Recognition results are given, showing that a system with both acoustics and inferred articulatory positions performs better than a system with only acoustics.

[31] 
K. Weber, S. Bengio, and H. Bourlard.
HMM2 a novel approach to HMM emission probability estimation. In Proceedings of the International Conference on Speech and Language Processing, ICSLP, Beijing, China, October 2000. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we discuss and investigate a new method to estimate local emission probabilities in the framework of hidden Markov models (HMM). Each feature vector is considered to be a sequence and is supposed to be modeled by yet another HMM. Therefore, we call this approach `HMM2'. There is a variety of possible topologies of such HMM2 systems, e.g. incorporating trellis or ergodic HMM structures. Preliminary HMM2 speech recognition experiments on cepstral and spectral features yielded worse results than stateoftheart systems. However, we believe that HMM2 systems have a lot of potential advantages and are therefore worth investigating further.

[32] 
K. Weber, S. Bengio, and H. Bourlard.
HMM2 extraction of formant features and their use for robust ASR. In Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH, 2001. [ .ps.gz  .pdf  .djvu  abstract] As recently introduced, an HMM2 can be considered as a particular case of an HMM mixture in which the HMM emission probabilities (usually estimated through Gaussian mixtures or an artificial neural network) are modeled by statedependent, featurebased HMM (referred to as frequency HMM). A general EM training algorithm for such a structure has already been developed. Although there are numerous motivations for using such a structure, and many possible ways to exploit it, this paper will mainly focus on one particular instantiation of HMM2 in which the frequency HMM will be used to extract formant structure information, which will then be used as additional acoustic features in a standard Automatic Speech Recognition (ASR) system. While the fact that this architecture is able to automatically extract meaningful formant information is interesting by itself, empirical results will also show the robustness of these features to noise, and their potential to enhance stateoftheart noiserobust HMMbased ASR.

[33] 
K. Weber, S. Bengio, and H. Bourlard.
A pragmatic view of the application of HMM2 for ASR. Technical Report IDIAPRR 0123, IDIAP, 2001. [ .ps.gz  .pdf  .djvu  abstract] This report investigates the HMM2 approach recently introduced in the framework of automatic speech recognition. HMM2 can be seen as a mixture of HMMs, where a conventional primary HMM (processing a time series of speech data) is supported on a lower level by a secondary HMM, working along the frequency dimension of a temporal segment of speech. The application of HMM2 to the speech signal is motivated by numerous potential advantages. However, speech recognition results did not show the expected performance improvements. In this paper, the HMM2 approach is pragmatically analyzed and evaluated on speech data, revealing some problems and suggesting potential solutions.

[34] 
K. Weber, S. Bengio, and H. Bourlard.
Speech recognition using advanced HMM2 features. In Proceedings of the Automatic Speech Recognition and Understanding Workshop, ASRU, pages 6568, 2001. [ .ps.gz  .pdf  .djvu  weblink  abstract] HMM2 is a particular hidden Markov model where state emission probabilities of the temporal (primary) HMM are modeled through (secondary) statedependent frequencybased HMMs. As shown previously, a secondary HMM can also be used to extract robust ASR features. Here, we further investigate this novel approach towards using a full HMM2 as feature extractor, working in the spectral domain, and extracting robust formantlike features for standard ASR system. HMM2 performs a nonlinear, statedependent frequency warping, and it is shown that the resulting frequency segmentation actually contains particularly discriminant features. To further improve the HMM2 system, we complement the initial spectral energy vectors with frequency information. Finally, adding temporal information to the HMM2 feature vector yields further improvements. These conclusions are experimentally validated on the Numbers95 database, where word error rates of 15%, using only a 4dimensional feature vector (3 formantlike parameters and one time index) were obtained.

[35] 
K. Weber, S. Bengio, and H. Bourlard.
Increasing speech recognition noise robustness with HMM2. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 929932, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] The purpose of this paper is to investigate the behavior of HMM2 models for the recognition of noisy speech. It has previously been shown that HMM2 is able to model dynamically important structural information inherent in the speech signal, often corresponding to formant positions/tracks. As formant regions are known to be robust in adverse conditions, HMM2 seems particularly promising for improving speech recognition robustness. Here, we review different variants of the HMM2 approach with respect to their application to noiserobust automatic speech recognition. It is shown that HMM2 has the potential to tackle the problem of mismatch between training and testing conditions, and that a multistream combination of (already noiserobust) cepstral features and formantlike features (extracted by HMM2) improves the noise robustness of a stateoftheart automatic speech recognition system.

[36] 
K. Weber, F. de Wet, B. Cranen, L. Boves, S. Bengio, and H. Bourlard.
Evaluation of formantlike features for ASR. In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002. [ .ps.gz  .pdf  .djvu  abstract] This paper investigates possibilities to automatically find a lowdimensional, formantrelated physical representation of the speech signal, which is suitable for automatic speech recognition (ASR). This aim is motivated by the fact that formants have been shown to be discriminant features for ASR. Combinations of automatically extracted formantlike features and `conventional', noiserobust, stateoftheart features (such as MFCCs including spectral subtraction and cepstral mean subtraction) have previously been shown to be more robust in adverse conditions than stateoftheart features alone. However, it is not clear how these automatically extracted formantlike features behave in comparison with true formants. The purpose of this paper is to investigate two methods to automatically extract formantlike features, and to compare these features to handlabeled formant tracks as well as to standard MFCCs in terms of their performance on a vowel classification task

[37] 
K. Weber, S. Ikbal, S. Bengio, and H. Bourlard.
Robust speech recognition and feature extraction using HMM2. Computer, Speech and Language, 17(23):195211, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents the theoretical basis and preliminary experimental results of a new HMM model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state specific, HMMs working in the acoustic feature space. Thus, while the primary HMM is performing the usual time warping and integration, the secondary HMMs are responsible for extracting/modeling the possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as a more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it will be shown that HMM2 can be used to extract noise robust features, supposed to correspond to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database.

[1] 
E. BaillyBaillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz,
J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.P. Thiran.
The BANCA database and evaluation protocol. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625638. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we describe the acquistion and content of a new large, realistic and challenging multimodal database intended for training and testing multimodal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios, controlled, degraded and adverse over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.

[2] 
M. Barnard, J.M. Odobez, and S. Bengio.
Multimodal audiovisual event recognition for football analysis. In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 469478, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The recognition of events within multimodal data is a challenging problem. In this paper we focus on the recognition of events by using both audio and video data. We investigate the use of data fusion techniques in order to recognise these sequences within the framework of Hidden Markov Models (HMM) used to model audio and video data sequences. Specifically we look at the recognition of play and break sequences in football and the segmentation of football games based on these two events. Recognising relatively simple semantic events such as this is an important step towards full automatic indexing of such video material. These experiments were done using approximately 3 hours of data from two games of the Euro96 competition. We propose that modelling the audio and video streams separately for each sequence and fusing the decisions from each stream should yield an accurate and robust method of segmenting multimodal data.

[3] 
S. Bengio.
Multimodal authentication using asynchronous HMMs. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770777. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audiovisual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other stateoftheart techniques.

[4] 
S. Bengio.
Multimodal speech processing using asynchronous hidden markov models. Information Fusion, 5(2):8189, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An ExpectationMaximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audiovisual speech processing tasks, namely speech recognition and textdependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[5] 
S. Bengio.
Statistical machine learning for HCI. In J.P. Thiran, F. Marqués, and H. Bourlard, editors, Multimodal Signal Processing: Theory and Applications for HumanComputer Interaction, pages 723. Academic Press, 2010. [ weblink  abstract] This chapter introduces the main concepts of statistical machine learning, as they are pivot in most algorithms tailored for multimodal signal processing. In particular, the chapter will cover a general introduction to machine learning and how it is used in classification, regression and density estimation. Following this introduction, two particularly well known models will be presented, together with their associated learning algorithm: support vector machines, which are wellknown for classification tasks, and hidden Markov models, which are tailored for sequence processing tasks such as speech recognition.

[6] 
S. Bengio and H. Bourlard, editors.
Machine Learning for Multimodal Interaction: First International Workshop, MLMI'2004. volume 3361 of Lecture Notes in Computer Science. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] This book contains a selection of refereed papers presented at the First Workshop on Machine Learning for Multimodal Interaction (MLMI'04), held in Martigny, Switzerland, from June 2123, 2004. The workshop was organized and sponsored jointly by three European projects, AMI, PASCAL and M4, as well as a Swiss national research network, IM2. It brings together researchers from different communities working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction in meetings. The motivation for creating such forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The conference program covered a wide range of areas related to machine learning applied to multimodal interaction  and more specifically to multimodal meeting processing. These areas included humanhuman communication modeling, speech and visual processing, multimodal processing, fusion and fission, multimodal dialog modeling, humanhuman interaction modeling, multimodal data structuring and presentation, multimedia indexing and retrieval, meeting structure analysis, meeting summarizing, multimodal meeting annotation, and machine learning applied to the above.

[7] 
S. Bengio and H. Bourlard.
Multi channel sequence processing. In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning: First International Workshop, Lecture Notes in Artificial Intelligence, volume LNAI 3635, pages 2236. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper summarizes some of the current research challenges arising from multichannel sequence processing. Indeed, multiple real life applications involve simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches towards sequence modeling. However, several challenging research issues are still open, such as taking into account asynchrony and correlation between several feature streams, or handling the underlying growing complexity. In this framework, we discuss here two novel approaches, which recently started to be investigated with success in the context of large multimodal problems. These include the asynchronous HMM, providing a principled approach towards the processing of multiple feature streams, and the layered HMM approach, providing a good formalism for decomposing large and complex (multistream) problems into layered architectures. As briefly reported here, combination of these two approaches yielded successful results on several multichannel tasks, ranging from audiovisual speech recognition to automatic meeting analysis.

[8] 
S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification. Information Fusion, 3(4):267276, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures did improve statistically significantly the performance, while other measures produced reliable confidence levels over the fusion decisions.

[9] 
J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audiovisual person identity verification. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752760. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system conciliates the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited like in a smart card. Our experimental results show that the multimodal fusion allows to reduce significantly the user template size while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.

[10] 
D. GaticaPerez, I. McCowan, M. Barnard, S. Bengio, and H. Bourlard.
On automatic annotation of meeting databases. In IEEE International Conference on Image Processing, ICIP, volume 3, pages 629632, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we discuss meetings as an application domain for multimedia content analysis. Meeting databases are a rich data source suitable for a variety of audio, visual and multimodal tasks, including speech recognition, people and action recognition, and information retrieval. We specifically focus on the task of semantic annotation of audiovisual (AV) events, where annotation consists of assigning labels (event names) to the data. In order to develop an automatic annotation system in a principled manner, it is essential to have a welldefined task, a standard corpus and an objective performance measure. In this work we address each of these issues to automatically annotate events based on participant interactions.

[11] 
D. GaticaPerez, D. Zhang, and S. Bengio.
Extracting information from multimedia meeting collections. In 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR, 2005. [ .ps.gz  .pdf  .djvu  abstract] Multimedia meeting collections, composed of unedited audio and video streams, handwritten notes, slides, and electronic documents that jointly constitute a raw record of complex human interaction processes in the workplace, have attracted interest due to the increasing feasibility of recording them in large quantities, by the opportunities for information access and retrieval applications derived from the automatic extraction of relevant meeting information, and by the challenges that the extraction of semantic information from real human activities entails. In this paper, we present a succint overview of recent approaches in this field, largely influenced by our own experiences. We first review some of the existing and potential needs for users of multimedia meeting information systems. We then summarize recent work on various research areas addressing some of these requirements. In more detail, we describe our work on automatic analysis of human interaction patterns from audiovisual sensors, discussing open issues in this domain.

[12] 
D. GaticaPerez, I. McCowan D. Zhang, and S. Bengio.
Detecting group interestlevel in meetings. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 489492, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Finding relevant segments in meeting recordings is important for summarization, browsing, and retrieval purposes. In this paper, we define relevance as the interestlevel that meeting participants manifest as a group during the course of their interaction (as perceived by an external observer), and investigate the automatic detection of segments of highinterest from audiovisual cues. This is motivated by the assumption that there is a relationship between segments of interest to participants, and those of interest to the end user, e.g. of a meeting browser. We first address the problem of human annotation of group interestlevel. On a 50meeting corpus, recorded in a room equipped with multiple cameras and microphones, we found that the annotations generated by multiple people exhibit a good degree of consistency, providing a stable groundtruth for automatic methods. For the automatic detection of highinterest segments, we investigate a methodology based on Hidden Markov Models (HMMs) and a number of audio and visual features. Single and multistream approaches were studied. Using precision and recall as performance measures, the results suggest that (i) the automatic detection of group interestlevel is promising, and (ii) while audio in general constitutes the predominant modality in meetings, the use of a multimodal approach is beneficial.

[13] 
I. McCowan, S. Bengio, D. GaticaPerez, G. Lathoud, F. Monay, D. Moore,
P. Wellner, and H. Bourlard.
Modeling human interaction in meetings. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 4, pages 748751, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper investigates the recognition of group actions in meetings by modeling the joint behaviour of participants. Many meeting actions, such as presentations, discussions and consensus, are characterised by similar or complementary behaviour across participants. Recognising these meaningful actions is an important step towards the goal of providing effective browsing and summarisation of processed meetings. In this work, a corpus of meetings was collected in a room equipped with a number of microphones and cameras. The corpus was labeled in terms of a predefined set of meeting actions characterised by global behaviour. In experiments, audio and visual features for each participant are extracted from the raw data and the interaction of participants is modeled using HMMbased approaches. Initial results on the corpus demonstrate the ability of the system to recognise the set of meeting actions.

[14] 
I. McCowan, D. GaticaPerez, and S. Bengio.
Meeting data collection specifications. Communication Report IDIAPCOM 0310, IDIAP, 2003. [ .ps.gz  .pdf  .djvu ] 
[15] 
I. McCowan, D. GaticaPerez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(3):305317, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper investigates the recognition of group actions in meetings. A statistical framework is proposed in which group actions result from the interactions of the individual participants. The group actions are modelled using different HMMbased approaches, where the observations are provided by a set of audiovisual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modelling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audiobased events, motivating a multimodal approach to meeting analysis.

[16] 
I. McCowan, D. GaticaPerez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions. In Ambient Intelligence, Lecture Notes in Computer Science, volume LNCS 2875, pages 235251, Eindhoven, 2003. SpringerVerlag. [ .ps.gz  .pdf  .djvu  weblink  abstract] People meet in order to interact  disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[17] 
I. McCowan, D. GaticaPerez, S. Bengio, D. Moore, and H. Bourlard.
Towards computer understanding of human interactions. In Machine Learning for Multimodal Interaction: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 5675. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] People meet in order to interact  disseminating information, making decisions, and creating new ideas. Automatic analysis of meetings is therefore important from two points of view: extracting the information they contain, and understanding human interaction processes. Based on this view, this article presents an approach in which relevant information content of a meeting is identified from a variety of audio and visual sensor inputs and statistical models of interacting people. We present a framework for computer observation and understanding of interacting people, and discuss particular tasks within this framework, issues in the meeting context, and particular algorithms that we have adopted. We also comment on current developments and the future challenges in automatic meeting analysis.

[18] 
N. Poh and S. Bengio.
Why do multistream, multiband and multimodal approaches work on biometric user authentication tasks? In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 893896, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multiband, multistream and multimodal approaches have proven to be very successful both in experiments and in reallife applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of a theoretical study to justify why and how they work, when one combines the streams at the feature or classifier score levels. In this paper, we attempt to cast a light onto the latter subject. While there exists literature discussing this aspect, a study on the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, MultiLayerPerceptrons and Support Vector Machines always perform better than the average performance of the underlying experts. Furthermore, in practice, most combined experts using the methods mentioned above perform better than the best underlying expert.

[19] 
N. Poh and S. Bengio.
Fratio clientdependent normalisation for biometric authentication tasks. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 721724, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates a new clientdependent normalisation to improve biometric authentication systems. There exists many clientdependent score normalisation techniques applied to speaker authentication, such as ZNorm, DNorm and TNorm. Such normalisation is intended to adjust the variation across different client models. We propose “Fratio” normalisation, or FNorm, applied to face and speaker authentication systems. This normalisation requires only that as few as two clientdependent accesses are available (the more the better). Different from previous normalisation techniques, FNorm considers the client and impostor distributions simultaneously. We show that Fratio is a natural choice because it is directly associated to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that FNorm actually “interpolates” between clientindependent and clientdependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over ZNorm, clientdependent threshold normalisation or no normalisation.

[20] 
N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks? IEEE Transactions on Signal Processing, 53(11):43844396, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources such as subbands, streams (with different features) and multi modal data has been shown to be a very promising trend, both in experiments and to some extents in reallife biometric authentication applications. Despite considerable efforts in fusions, there is a lack of understanding on the roles and effects of correlation and variance (of both the client and impostor scores of baseclassifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance ReductionEqual Error Rate (VREER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we showed that Equal Error Rate (EER) can be modeled as a function of Fratio, which itself is a function of 1) correlation, 2) variance of baseexperts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of baseexperts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be missleading. Experimental results on the BANCA multimodal database confirm our findings using VREER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated baseexperts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score), improves fusion. However, by linking the concept of ambiguity decomposition to classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (classdependent) variance. Moreover, by linking the concept of biasvariancecovariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.

[21] 
N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue  the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a realuser database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a realuser database?. Based on a considerable amount of empirical biometric person authentication experiments (21 realuser data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a realuser database with respect to using only a realuser database. Considering the possibly expensive cost involved in collecting the realuser multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[22] 
N. Poh and S. Bengio.
Database, protocol and tools for evaluating scorelevel fusion algorithms in biometric authentication. Pattern Recognition, 39(2):223233, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinate d effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some stateoftheart tools to evaluate the fusion performance.

[23] 
N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess userdependent variabilities. In Second Workshop on Multimodal User Authentication, MMUA, 2006. [ .ps.gz  .pdf  .djvu  abstract] Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue  the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a realuser database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a realuser database?. Based on a considerable amount of empirical biometric person authentication experiments (21 realuser data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [?] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a realuser database with respect to using only a realuser database. Considering the possibly expensive cost involved in collecting the realuser multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[24] 
S. Renals and S. Bengio, editors.
Machine Learning for Multimodal Interaction: Second International Workshop, MLMI'2005. volume 3869 of Lecture Notes in Computer Science. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This book contains a selection of refereed papers presented at the Second Workshop on Machine Learning for Multimodal Interaction (MLMI 2005), held in Edinburgh, Scotland, during 1113 July 2005. The workshop was organized and sponsored jointly by two European integrated projects, three European Networks of Excellence and a Swiss national research network: AMI, CHIL, HUMAINE, PASCAL, SIMILAR, and IM2. In addition to the main workshop, MLMI 2005 hosted the NIST (US National Institute of Standards and Technology) Meeting Recognition Workshop. This workshop (the third such sponsored by NIST) was centered on the Rich Transcription 2005 Spring Meeting Recognition (RT05) evaluation of speech technologies within the meeting domain. Building on the success of the RT04 spring evaluation, the RT05 evaluation continued the speechtotext and speaker diarization evaluation tasks and added two new evaluation tasks: speech activity detection and source localization. Given the multiple links between the above projects and several related research areas, and the success of the first MLMI 2004 workshop, it was decided to organize once again a joint workshop bringing together researchers working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The areas covered included: Humanhuman communication modeling, Speech and visual processing, Multimodal processing, fusion and fission, Multimodal dialog modeling, Humanhuman interaction modeling, Multimodal data structuring and presentation, Multimedia indexing and retrieval, Meeting structure analysis, Meeting summarizing, Multimodal meeting annotation, and Machine learning applied to the above.

[25] 
S. Renals, S. Bengio, and J. G. Fiscus, editors.
Machine Learning for Multimodal Interaction: Third International Workshop, MLMI'2006. volume 4299 of Lecture Notes in Computer Science. SpringerVerlag, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] This book contains a selection of refereed papers presented at the 3rd Workshop on Machine Learning for Multimodal Interaction (MLMI 2006), held in Bethesda MD, USA during May 14, 2006. The workshop was organized and sponsored jointly by the US National Institute for Standards and Technology (NIST), three projects supported by the European Commission (Information Society Technologies priority of the sixth Framework Programme)  the AMI and CHIL Integrated Projects, and the PASCAL Network of Excellence  and the Swiss National Science Foundation national research collaboration, IM2. In addition to the main workshop, MLMI 2006 was colocated with the 4th NIST Meeting Recognition Workshop. This workshop was centered on the Rich Transcription 2006 Spring Meeting Recognition (RT06) evaluation of speech technologies within the meeting domain. Building on the success of previous evaluations in this domain, the RT06 evaluation continued evaluation tasks in the areas of speechtotext, whospokewhen, and speech activity detection. The conference program featured invited talks, full papers (subject to careful peer review, by at least three reviewers), and posters (accepted on the basis of abstracts) covering a wide range of areas related to machine learning applied to multimodal interaction  and more specifically to multimodal meeting processing, as addressed by the various sponsoring projects. These areas included humanhuman communication modeling, speech and visual processing, multimodal processing, fusion and fission, humancomputer interaction, and the modeling of discourse and dialog, with an emphasis on the application of machine learning. Out of the submitted full papers, about 50% were accepted for publication in the present volume, after authors had been invited to take review comments and conference feedback into account. The workshop featured invited talks from Roderick MurraySmith (University of Glasgow), Tsuhan Chen (Carnegie Mellon University) and David McNeill (University of Chicago), and a special session on projects in the area of multimodal interaction including presentations on the VACE, CHIL and AMI projects.

[26] 
D. Zhang and S. Bengio.
Exploring contextual information in a layered framework for group action recognition. In IEEE International Conference on Multimedia & Expo, ICME, 2007. [ .ps.gz  .pdf  .djvu  abstract] Contextual information is important for sequence modeling. Hidden Markov models (HMMs) and extensions, which have been widely used for sequence modeling, make simplifying, often unrealistic assumptions on the conditional independence of observations given the class labels, thus cannot accommodate overlapping features or longterm contextual information. In this paper, we introduce a principled layered framework with three implementation methods that take into account contextual information (as available in the whole or part of the sequence). The first two methods are based on state alpha and gamma posteriors (as usually referred to in the HMM formalism). The third method is based on conditional random fields (CRFs), a conditional model that relaxes the independent assumption on the observations required by HMMs for computational tractability. We illustrate our methods with the application of recognizing group actions in meetings. Experiments and comparison with standard HMM baseline showed the validity of the proposed approach.

[27] 
D. Zhang, D. GaticaPerez, S. Bengio, and I. McCowan.
Semisupervised adapted HMMs for unusual event detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of temporal unusual event detection. Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised modelbased approaches. We propose a semisupervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner. The proposed framework has an iterative structure, which adapts a new unusual event model at each iteration. We show that such a framework can address problems due to the scarcity of training data and the difficulty in predefining unusual events. Experiments on audio, visual, and audiovisual data streams illustrate its effectiveness, compared with both supervised and unsupervised baseline methods.

[28] 
D. Zhang, D. GaticaPerez, S. Bengio, and I. McCowan.
Semisupervised meeting event recognition with adapted HMMs. In IEEE International Conference on Multimedia Expo, ICME, pages 611618, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] This paper investigates the use of unlabeled data to help labeled data for audiovisual event recognition in meetings. To deal with situations in which it is difficult to collect enough labeled data to capture event characteristics, but collecting a large amount of unlabeled data is easy, we present a semisupervised framework using HMM adaptation techniques. Instead of directly training one model for each event, we first train a wellestimated general event model for all events using both labeled and unlabeled data, and then adapt the general model to each specific event model using its own labeled data. We illustrate the proposed approach with a set of eight audiovisual events defined in meetings. Experiments and comparison with the fullysupervised baseline method show the validity of the proposed semisupervised approach.

[29] 
D. Zhang, D. GaticaPerez, S. Bengio, and I. McCowan.
Modeling individual and group actions in meetings with layered HMMs. IEEE Transactions on Multimedia, 8(3):509520, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently groupbased (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a twolayer process, one that models basic individual activities from lowlevel audiovisual features, and another one that models the interactions. We propose a twolayer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on lowdimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public fivehour meeting corpus. Experiments and comparison with a singlelayer HMM baseline system show its validity.

[30] 
D. Zhang, D. GaticaPerez, S. Bengio, I. McCowan, and G. Lathoud.
Modeling individual and group actions in meetings: a twolayer hmm framework. In IEEE Workshop on Event Mining at the Conference on Computer Vision and Pattern Recognition, CVPR, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently groupbased (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a twolayer process, one that models basic individual activities from lowlevel audiovisual features, and another one that models the interactions. We propose a twolayer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that between has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on lowdimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public fivehour meeting corpus. Experiments and comparison with a singlelayer HMM baseline system show its validity.

[31] 
D. Zhang, D. GaticaPerez, S. Bengio, I. McCowan, and G. Lathoud.
Multimodal group action clustering in meetings. In ACM Multimedia Workshop on Video Surveillance and Sensor Networks, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We address the problem of clustering multimodal group actions in meetings using a twolayer HMM framework. Meetings are structured as sequences of group actions. Our approach aims at creating one cluster for each group action, where the number of group actions and the action boundaries are unknown a priori. In our framework, the first layer models typical actions of individuals in meetings using supervised HMM learning and lowlevel audiovisual features. A number of options that explicitly model certain aspects of the data (e.g., asynchrony) were considered. The second layer models the group actions using unsupervised HMM learning. The two layers are linked by a set of probabilitybased features produced by the individual action layer as input to the group action layer. The methodology was assessed on a set of multimodal turntaking group actions, using a public fivehour meeting corpus. The results show that the use of multiple modalities and the layered framework are advantageous, compared to various baseline methods.

[32] 
D. Zhang, D. GaticaPerez, D. Roy, and S. Bengio.
Modeling interactions from email communication. In IEEE International Conference on Multimedia & Expo, ICME, 2006. [ .ps.gz  .pdf  .djvu  abstract] Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topicbased interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization.

[1] 
E. BaillyBaillière, S. Bengio, F. Bimbot, M. Hamouz, J. Kittler, J. Mariéthoz,
J. Matas, K. Messer, V. Popovici, F. Porée, B. Ruiz, and J.P. Thiran.
The BANCA database and evaluation protocol. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 625638. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we describe the acquistion and content of a new large, realistic and challenging multimodal database intended for training and testing multimodal verification systems. The BANCA database was captured in four European languages in two modalities (face and voice). For recording, both high and low quality microphones and cameras were used. The subjects were recorded in three different scenarios, controlled, degraded and adverse over a period of three months. In total 208 people were captured, half men and half women. In this paper we also describe a protocol for evaluating verification algorithms on the database. The database will be made available to the research community through http://banca.ee.surrey.ac.uk.

[2] 
S. Bengio.
Multimodal authentication using asynchronous HMMs. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 770777. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has often been shown that using multiple modalities to authenticate the identity of a person is more robust than using only one. Various combination techniques exist and are often performed at the level of the output scores of each modality system. In this paper, we present a novel HMM architecture able to model the joint probability distribution of pairs of asynchronous sequences (such as speech and video streams) describing the same event. We show how this model can be used for audiovisual person authentication. Results on the M2VTS database show robust performances of the system under various audio noise conditions, when compared to other stateoftheart techniques.

[3] 
S. Bengio.
Multimodal speech processing using asynchronous hidden markov models. Information Fusion, 5(2):8189, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An ExpectationMaximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audiovisual speech processing tasks, namely speech recognition and textdependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.

[4] 
S. Bengio, C. Marcel, S. Marcel, and J. Mariéthoz.
Confidence measures for multimodal identity verification. Information Fusion, 3(4):267276, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multimodal fusion for identity verification has already shown great improvement compared to unimodal algorithms. In this paper, we propose to integrate confidence measures during the fusion process. We present a comparison of three different methods to generate such confidence information from unimodal identity verification systems. These methods can be used either to enhance the performance of a multimodal fusion algorithm or to obtain a confidence level on the decisions taken by the system. All the algorithms are compared on the same benchmark database, namely XM2VTS, containing both speech and face information. Results show that some confidence measures did improve statistically significantly the performance, while other measures produced reliable confidence levels over the fusion decisions.

[5] 
S. Bengio and J. Mariéthoz.
Comparison of client model adaptation schemes. Technical Report IDIAPRR 0125, IDIAP, 2001. [ .ps.gz  .pdf  .djvu ] 
[6] 
S. Bengio and J. Mariéthoz.
Learning the decision function for speaker verification. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 1, pages 425428, 2001. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper explores the possibility to replace the usual thresholding decision rule of log likelihood ratios used in speaker verification systems by more complex and discriminant decision functions based for instance on Linear Regression models or Support Vector Machines. Current speaker verification systems, based on generative models such as HMMs or GMMs, can indeed easily be adapted to use such decision functions. Experiments on both text dependent and text independent tasks always yielded performance improvements and sometimes significantly.

[7] 
S. Bengio and J. Mariéthoz.
The expected performance curve: a new assessment measure for person authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] ROC and DET curves are often used in the field of person authentication to assess the quality of a model or even to compare several models. We argue in this paper that this measure can be misleading as it compares performance measures that cannot be reached simultaneously by all systems. We propose instead new curves, called Expected Performance Curves (EPC). These curves enable the comparison between several systems according to a criterion, decided by the application, which is used to set thresholds according to a separate validation set. A free sofware is available to compute these curves. A real case study is used throughout the paper to illustrate it. Finally, note that while this study was done on an authentication problem, it also applies to most 2class classification tasks.

[8] 
S. Bengio and J. Mariéthoz.
A statistical significance test for person authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but most being aggregates of two measures (such as the false acceptance rate and the false rejection rate), simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significantness of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they yield often too optimistic results (resulting in false statements about statistical significantness).

[9] 
S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve. In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the socalled breakeven point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to use adequately a nonparametric statistical test in order to produce EPCs with confidence intervals or assess the statistical significant difference between two models under various settings.

[10] 
S. Bengio, J. Mariéthoz, and S. Marcel.
Evaluation of biometric technology on XM2VTS. Technical Report IDIAPRR 0121, IDIAP, 2001. [ .ps.gz  .pdf  .djvu ] 
[11] 
F. Cardinaux, C. Sanderson, and S. Bengio.
Face verification using adapted generative models. In International Conference on Automatic Face and Gesture Recognition, FG, pages 825830, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has been shown previously that systems based on local features and relatively complex generative models, namely 1D Hidden Markov Models (HMMs) and pseudo2D HMMs, are suitable for face recognition (here we mean both identification and verification). Recently a simpler generative model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In this paper we first propose to increase the performance of the GMM approach (without sacrificing its simplicity) through the use of local features with embedded positional information; we show that the performance obtained is comparable to 1D HMMs. Secondly, we evaluate different training techniques for both GMM and HMM based systems. We show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available; we propose to tackle this problem through the use of Maximum a Posteriori (MAP) training, where the lack of data problem can be effectively circumvented; we show that models estimated with MAP are significantly more robust and are able to generalize to adverse conditions present in the BANCA database.

[12] 
F. Cardinaux, C. Sanderson, and S. Bengio.
User authentication via adapted statistical models of face images. IEEE Transactions on Signal Processing, 54(1):361373, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely 1D Hidden Markov Models (HMMs) and pseudo2D HMMs, are suitable for face recognition. Recently, a simpler statistical model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM and HMM based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available. Considerably more precise models can be obtained through the use of Maximum a Posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo2D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best tradeoff in terms of authentication time, robustness and discrimination performance is achieved by the extended GMM approach.

[13] 
J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe.
Scalability analysis of audiovisual person identity verification. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 752760. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this work, we present a multimodal identity verification system based on the fusion of the face image and the text independent speech data of a person. The system conciliates the monomodal face and speaker verification algorithms by fusing their respective scores. In order to assess the authentication system at different scales, the performance is evaluated at various sizes of the face and speech user template. The user template size is a key parameter when the storage space is limited like in a smart card. Our experimental results show that the multimodal fusion allows to reduce significantly the user template size while keeping a satisfactory level of performance. Experiments are performed on the newly recorded multimodal database BANCA.

[14] 
G. Heigold, I. Moreno, S. Bengio, and N. Shazeer.
Endtoend textdependent speaker verification. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we present a datadriven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domainspecific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for textdependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easytomaintain systems with a small footprint.

[15] 
M. Keller, J. Mariéthoz, and S. Bengio.
Significance tests for bizarre measures in 2class classification tasks. Technical Report IDIAPRR 0434, IDIAP, 2004. [ .ps.gz  .pdf  .djvu  abstract] Statistical significance tests are often used in machine learning to compare the performance of two learning algorithms or two models. However, in most cases, one of the underlying assumptions behind these tests is that the error measure used to assess the performance of one model/algorithm is computed as the sum of errors obtained on each example of the test set. This is however not the case for several wellknown measures such as F_{1}, used in text categorization, or DCF, used in person authentication. We propose here a practical methodology to either adapt the existing tests or develop nonparametric solutions for such bizarre measures. We furthermore assess the quality of these tests on a reallife large dataset.

[16] 
Q. Le and S. Bengio.
Hybrid generativediscriminative models for speech and speaker recognition. Technical Report IDIAPRR 0206, IDIAP, 2002. [ .ps.gz  .pdf  .djvu  abstract] Generative probability models such as Hidden Markov Models are usually used for modeling sequences of data because of their ability to handle variable size sequences and missing information. On the other hand, because of their discriminative properties, discriminative models like Support Vector Machines (SVMs) usually yield better performance in classification problem and can construct flexible decision boundaries. An ideal classifier should have all the power of these two complementary approaches. A series of recent papers has suggested some techniques for mixing generative models and discriminative models. In one of them a fixed size vector (the Fisher score) containing sufficient statistics of a sequence is computed for a previously trained HMM and can then be used as input to a discriminative model for classification. The purpose of this project is thus to study, experiment, enhance and adapt these new approaches of integrating discriminative models such as SVM into generative models for sequence processing problems, such as speaker and speech recognition.

[17] 
Q. Le and S. Bengio.
Client dependent GMMSVM models for speaker verification. In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443451. Springer Verlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability in handling variable size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A crossvalidation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.

[18] 
M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, and J. Richiardi.
Writer identification for smart meeting room systems. In H. Bunke and A. L. Spitz, editors, Document Analysis Systems VII: 7th International Workshop, DAS, Lecture Notes in Computer Science, volume LNCS 3872, pages 186195. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper we present a text independent online writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets used in signature verification systems before. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the strokebased feature set outperforms the pointbased feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.

[19] 
S. Marcel and S. Bengio.
Improving face verification using skin color information. In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 2, pages 1115. IEEE Computer Society Press, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. Stateoftheart methods often use the grayscale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance.

[20] 
S. Marcel, C. Marcel, and S. Bengio.
A stateoftheart neural network for robust face verification. In COST275 Workshop on the advent of Biometrics on the Internet, 2002. [ .ps.gz  .pdf  .djvu  abstract] The performance of face verification systems has steadily improved over the last few years, mainly focusing on models rather than on feature processing. Stateoftheart methods often use the grayscale face image as input. In this paper, we propose to use an additional feature to the face image: the skin color. The new feature set is tested on a benchmark database, namely XM2VTS, using a simple discriminant artificial neural network. Results show that the skin color information improves the performance and that the proposed model achieves robust stateoftheart results.

[21] 
J. Mariéthoz and S. Bengio.
A comparative study of adaptation methods for speaker verification. In Proceedings of the International Conference on Spoken Language Processing, ICSLP, 2002. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Reallife speaker verification systems are often implemented using client model adaptation methods, since the amount of data available for each client is often too low to consider plain Maximum Likelihood methods. While the Bayesian Maximum A Posteriori (MAP) adaptation method is commonly used in speaker verification, other methods have proven to be successful in related domains such as speech recognition. This paper proposes an experimental comparison between three wellknown adaptation methods, namely MAP, Maximum Likelihood Linear Regression, and finally EigenVoices. All three methods are compared to the more classical Maximum Likelihood method, and results are given for a subset of the 1999 NIST Speaker Recognition Evaluation database.

[22] 
J. Mariéthoz and S. Bengio.
An alternative to silence removal for textindependent speaker verification. Technical Report IDIAPRR 0351, IDIAP, 2003. [ .ps.gz  .pdf  .djvu  abstract] Stateoftheart text independent speaker verification systems use silence/speech detectors to get rid of silence frames which are considered to be non discriminative. This paper explores a possible replacement to this silence/speech detector by considering each Gaussian of a GMM as modeling a specific speech class and by using discriminant models like SVMs and MLPs in order to fuse the corresponding classspecific scores to obtain a final decision. Experiments on the NIST 2000 database yielded statistically significantly better performance for the new model as compared to our best baseline system involving a silence/speech detector, without having to rely on uncertain hypotheses.

[23] 
J. Mariéthoz and S. Bengio.
A unified framework for score normalization techniques applied to text independent speaker verification. IEEE Signal Processing Letters, 12(7):532535, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The purpose of this paper is to unify several of the stateoftheart score normalization techniques applied to textindependent speaker verification systems. We propose a new framework for this purpose. The two wellknown Z and Tnormalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these wellknown score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework.

[24] 
J. Mariéthoz and S. Bengio.
A max kernel for textindependent speaker verification systems. In Second Workshop on Multimodal User Authentication, MMUA, 2006. [ .ps.gz  .pdf  .djvu  abstract] In this paper, we present a principled SVM based speaker verification system. A general approach is developed that enables the use of any kernel at the frame level. An extension of his approach using the Max operator is then proposed. The new system is then compared to stateoftheart GMM and other SVM based systems found in the literature on the Polyvar database. It is found that the new system outperforms, most of the time, the other systems, statistically significantly.

[25] 
J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to textindependent speaker verification systems. Pattern Recognition, 40:23152324, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper present a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to stateoftheart GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.

[26] 
K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, F. Cardinaux,
S. Marcel, S. Bengio, C. Sanderson, N. Poh, Y. Rodriguez, J. Czyz,
L. Vandendorpe, C. McCool, S. Lowther, S. Sridharan, V. Chandran, R. Paredes,
E. Vidal, L. Bai, L. Shen, Y. Wang, C. YuehHsuan, L. HsienChang,
H. YiPing, A. Heinrichs, M. Muller, A. Tewes, C. von der Malsburg, R. Wurtz,
Z. Wang, F. Xue, Y. Ma, Q. Yang, C. Fang, X. Ding, S. Lucey, R. Goss, and
H. Schneiderman.
Face authentication test on the BANCA database. In International Conference on Pattern Recognition, ICPR, volume 4, pages 523532, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper details the results of a Face Authentication Test (FAT2004) held in conjunction with the 17th International Conference on Pattern Recognition. The contest was held on the publicly available BANCA database according to a defined protocol. The competition also had a sequestered part in which institutions had to submit their algorithms for independent testing. 13 different verification algorithms from 10 institutions submitted results. Also, a standard set of face recognition software packages from the Internet were used to provide a baseline performance measure.

[27] 
K. Messer, J. Kittler, M. Sadeghi, M. Hamouz, A. Kostin, S. Marcel, S. Bengio,
F. Cardinaux, C. Sanderson, N. Poh, Y. Rodriguez, K. Kryszczuk, J. Czyz,
L. Vandendorpe, J. Ng, H. Cheung, and B. Tang.
Face authentication competition on the BANCA database. In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 815. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper details the results of a face verification competition held in conjunction with the First International Conference on Biometric Authe ntication. The contest was held on the publically available BANCA database according to a defined protocol. Six different verification algorithms from 4 academic and commercial institutions submitted results. Also, a standard set of face recognition software from the internet was used to provide a baseline performance measure.

[28] 
K. Messer, J. Kittler, M. Sadeghi, S. Marcel, C. Marcel, S. Bengio,
F. Cardinaux, C. Sanderson, J. Czyz, L. Vandendorpe, S. Srisuk, M. Petrou,
W. Kurutach, A. Kadyrov, R. Paredes, B. Kepenekci, F. B. Tek, G. B. Akar,
F. Deravi, and N. Mavity.
Face verification competition on the XM2VTS database. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 964974. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink ] 
[29] 
N. Poh and S. Bengio.
Nonlinear variance reduction techniques in biometric authentication. In IEEE Multimodal User Authentication Workshop, 2003. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple synthetic samples, different extractors (features) and biometric modalities. The scores are combined using the average operator, MultiLayer Perceptron and Support Vector Machines. It is found empirically that VR via modalities is the best technique, followed by VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[30] 
N. Poh and S. Bengio.
Variance reduction techniques in biometric authentication. Technical Report IDIAPRR 0317, IDIAP, 2003. [ .ps.gz  .pdf  .djvu  abstract] In this paper, several approaches that can be used to improve biometric authentication applications are proposed. The idea is inspired by the ensemble approach, i.e., the use of several classifiers to solve a problem. Compared to using only one classifier, the ensemble of classifiers has the advantage of reducing the overall variance of the system. Instead of using multiple classifiers, we propose here to examine other possible means of variance reduction (VR), namely through the use of multiple real samples, synthetic samples, different extractors (features) and biometric modalities. It is found empirically that VR via modalities is the best technique, followed by VR via real samples, VR via extractors, VR via classifiers and VR via synthetic samples. This order of effectiveness is due to the corresponding degree of independence of the combined objects (in decreasing order). The theoretical and empirical findings show that the combined experts via VR techniques always perform better than the average of their participating experts. Furthermore, in practice, most combined experts perform better than any of their participating experts.

[31] 
N. Poh and S. Bengio.
Noiserobust multistream fusion for textindependent speaker authentication. In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multistream approaches have proven to be very successful in speech recognition tasks and to a certain extent in speaker authentication tasks. In this study we propose a noiserobust multistream textindependent speaker authentication system. This system has two steps: first train the stream experts under clean conditions and then train the combination mechanism to merge the scores of the stream experts under both clean and noisy conditions. The idea here is to take advantage of the rather predictable reliability and diversity of streams under different conditions. Hence, noiserobustness is mainly due to the combination mechanism. This twostep approach offers several practical advantages: the stream experts can be trained in parallel (e.g., by using several machines); heterogeneous types of features can be used and the resultant system can be robust to different noise types (wide bands or narrow bands) as compared to substreams. An important finding is that a tradeoff is often necessary between the overall good performance under all conditions (clean and noisy) and good performance under clean conditions. To reconcile this tradeoff, we propose to give more emphasis or prior to clean conditions, thus, resulting in a combination mechanism that does not deteriorate under clean conditions (as compared to the best stream) yet is robust to noisy conditions.

[32] 
N. Poh and S. Bengio.
Towards predicting optimal subsets of base classifiers in biometric authentication tasks. In S. Bengio and H. Bourlard, editors, Machine Learning for Multimodal Interactions: First International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3361, pages 159172. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extend in various reallife applications. However, combining too many systems (baseexperts) will also increase both hardware and computation costs. One way to selecting a subset of optimal baseexperts out of N is to carry out the experiments explicitly. There are 2^{N}1 possible combinations. In this paper, we propose an analytical solution to this task when weighted sum fusion mechanism is used. The proposed approach is at least valid in the domain of person authentication. It has a complexity that is additive between the number of examples and the number of possible combinations while the conventional approach, using bruteforce experimenting, is multiplicative between these two terms. Hence, our approach will scale better with large fusion problems. Experiments on the BANCA multimodal database verified our approach. While we will consider here fusion in the context of identity verification via biometrics, or simply biometric authentication, it can also have an important impact in meetings because this a priori information can assist in retrieving highlights in meeting analysis as in “who said what”. Furthermore, automatic meeting analysis also requires many systems working together and involves possibly many audiovisual media streams. Development in fusion of identity verification will provide insights into how fusion in meetings can be done. The ability to predict fusion performance is another important step towards understanding the fusion problem.

[33] 
N. Poh and S. Bengio.
Why do multistream, multiband and multimodal approaches work on biometric user authentication tasks? In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 893896, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Multiband, multistream and multimodal approaches have proven to be very successful both in experiments and in reallife applications, among which speech recognition and biometric authentication are of particular interest here. However, there is a lack of a theoretical study to justify why and how they work, when one combines the streams at the feature or classifier score levels. In this paper, we attempt to cast a light onto the latter subject. While there exists literature discussing this aspect, a study on the relationship between correlation, variance reduction and Equal Error Rate (often used in biometric authentication) has not been treated theoretically as done here, using the mean operator. Our findings suggest that combining several experts using the mean operator, MultiLayerPerceptrons and Support Vector Machines always perform better than the average performance of the underlying experts. Furthermore, in practice, most combined experts using the methods mentioned above perform better than the best underlying expert.

[34] 
N. Poh and S. Bengio.
Can chimeric persons be used in multimodal biometric authentication experiments? In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, volume LNCS 3869. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extents in various reallife applications. A system that uses more than one behavioral and physiological characteristics to verify whether a person is who he/she claims to be is called a multimodal biometric authentication system. Due to lack of large true multimodal biometric datasets, the biometric trait of a user from a database is often combined with another different biometric trait of yet another user, thus creating a socalled a chimeric user. In the literature, this practice is justified based on the fact that the underlying biometric traits to be combined are assumed to be independent of each other given the user. To the best of our knowledge, there is no literature that approves or disapproves such practice. We study this topic from two aspects: 1) by clarifying the mentioned independence assumption and 2) by constructing a pool of chimeric users from a pool of true modality matched users (or simply “true users”) taken from a bimodal database, such that the performance variability due to chimeric user can be compared with that due to true users. The experimental results suggest that for a large proportion of the experiments, such practice is indeed questionable.

[35] 
N. Poh and S. Bengio.
EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks. In N. C. Oza, R. Polikar, and J. Kittler, editors, 6th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 3541, pages 7485. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Biometric authentication is a process of verifying an identity claim using a person's behavioural and physiological characteristics. Due to the vulnerability of the system to environmental noise and variation caused by the user, fusion of several biometricenabled systems is identified as a promising solution. In the literature, various fixed rules (e.g. min, max, median, mean) and trainable classifiers (e.g. linear combination of scores or weighted sum) are used to combine the scores of several basesystems. How exactly do correlation and imbalance nature of basesystem performance affect the fixed rules and trainable classifiers? We study these joint aspects using the commonly used error measurement in biometric authentication, namely Equal Error Rate (EER). Similar to several previous studies in the literature, the central assumption used here is that the classdependent scores of a biometric system are approximately normally distributed. However, different from them, the novelty of this study is to make a direct link between the EER measure and the fusion schemes mentioned. Both synthetic and real experiments (with as many as 256 fusion experiments carried out on the XM2VTS benchmark scorelevel fusion data sets) verify our proposed theoretical modeling of EER of the two families of combination scheme. In particular, it is found that weighted sum can provide the best generalisation performance when its weights are estimated correctly. It also has the additional advantage that score normalisation prior to fusion is not needed, contrary to the rest of fixed fusion rules.

[36] 
N. Poh and S. Bengio.
Fratio clientdependent normalisation for biometric authentication tasks. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 721724, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates a new clientdependent normalisation to improve biometric authentication systems. There exists many clientdependent score normalisation techniques applied to speaker authentication, such as ZNorm, DNorm and TNorm. Such normalisation is intended to adjust the variation across different client models. We propose “Fratio” normalisation, or FNorm, applied to face and speaker authentication systems. This normalisation requires only that as few as two clientdependent accesses are available (the more the better). Different from previous normalisation techniques, FNorm considers the client and impostor distributions simultaneously. We show that Fratio is a natural choice because it is directly associated to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that FNorm actually “interpolates” between clientindependent and clientdependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over ZNorm, clientdependent threshold normalisation or no normalisation.

[37] 
N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks? IEEE Transactions on Signal Processing, 53(11):43844396, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Combining multiple information sources such as subbands, streams (with different features) and multi modal data has been shown to be a very promising trend, both in experiments and to some extents in reallife biometric authentication applications. Despite considerable efforts in fusions, there is a lack of understanding on the roles and effects of correlation and variance (of both the client and impostor scores of baseclassifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance ReductionEqual Error Rate (VREER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we showed that Equal Error Rate (EER) can be modeled as a function of Fratio, which itself is a function of 1) correlation, 2) variance of baseexperts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of baseexperts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be missleading. Experimental results on the BANCA multimodal database confirm our findings using VREER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated baseexperts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation “hurts” fusion while negative correlation (greater “diversity”, which measures the spread of prediction score with respect to the fused score), improves fusion. However, by linking the concept of ambiguity decomposition to classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (classdependent) variance. Moreover, by linking the concept of biasvariancecovariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.

[38] 
N. Poh and S. Bengio.
Improving fusion with marginderived confidence in biometric authentication tasks. In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 10591068. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This study investigates a new confidence criterion to improve fusion via a linear combination of scores of several biometric authentication systems. This confidence is based on the margin of making a decision, which answers the question, “after observing the score of a given system, what is the confidence (or risk) associated to that given access?”. In the context of multimodal and intramodal fusion, such information proves valuable because the margin information can determine which of the systems should be given higher weights. Finally, we propose a linear discriminative framework to fuse the margin information with an existing global fusion function. The results of 32 fusion experiments carried out on the XM2VTS multimodal database show that fusion using margin (product of margin and expert opinion) is superior over fusion without the margin information (i.e., the original expert opinion). Furthermore, combining both sources of information increases fusion performance further.

[39] 
N. Poh and S. Bengio.
A novel approach to combining clientdependent and confidence information in multimodal biometrics. In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 11201129. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The issues of fusion with clientdependent and confidence information have been well studied separately in biometric authentication. In this study, we propose to take advantage of both sources of information in a discriminative framework. Initially, each source of information is processed on a per expert basis (plus on a per client basis for the first information and on a per example basis for the second information). Then, both sources of information are combined using a secondlevel classifier, across different experts. Although the formulation of such twostep solution is not new, the novelty lies in the way the sources of prior knowledge are incorporated prior to fusion using the secondlevel classifier. Because these two sources of information are of very different nature, one often needs to devise special algorithms to combine both information sources. Our framework that we call “Prior Knowledge Incorporation” has the advantage of using the standard machine learning algorithms. Based on 10 times 32=320 intramodal and multimodal fusion experiments carried out on the publicly available XM2VTS scorelevel fusion benchmark database, it is found that the generalisation performance of combining both information sources improves over using either or none of them, thus achieving a new stateoftheart performance on this database.

[40] 
N. Poh and S. Bengio.
A scorelevel fusion benchmark database for biometric authentication. In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 474483. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some stateoftheart tools to evaluate the fusion performance.

[41] 
N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue  the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a realuser database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a realuser database?. Based on a considerable amount of empirical biometric person authentication experiments (21 realuser data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a realuser database with respect to using only a realuser database. Considering the possibly expensive cost involved in collecting the realuser multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[42] 
N. Poh and S. Bengio.
Database, protocol and tools for evaluating scorelevel fusion algorithms in biometric authentication. Pattern Recognition, 39(2):223233, 2006. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinate d effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some stateoftheart tools to evaluate the fusion performance.

[43] 
N. Poh and S. Bengio.
Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2007. [ .ps.gz  .pdf  .djvu  abstract] Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, composition and the choice of samples. We propose to reduce the performance dependency of these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test classconditional (client and impostor) score distributions are unknown and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them on the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance of an unseen database with two times more users with an average coverage of 95% (over all the 24 systems). A coverage is the proportion of the unseen EPC covered by the derived confidence interval.

[44] 
N. Poh, S. Bengio, and J. Korczak.
A multisample multisource model for biometric authentication. In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 375384, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this study, two techniques that can improve the authentication process are examined: (i) multiple samples and (ii) multiple biometric sources. We propose the fusion of multiple samples obtained from multiple biometric sources at the score level. By using the average operator, both the theoretical and empirical results show that integrating as many samples and as many biometric sources as possible can improve the overall reliability of the system. This strategy is called multisample multisource approach. This strategy was tested on a reallife database using neural networks trained in oneversusall configuration.

[45] 
N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess userdependent variabilities. In Second Workshop on Multimodal User Authentication, MMUA, 2006. [ .ps.gz  .pdf  .djvu  abstract] Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue  the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a realuser database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a realuser database?. Based on a considerable amount of empirical biometric person authentication experiments (21 realuser data sets and up to 21 ×1000 chimeric data sets and two fusion operators), our previous study [?] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a realuser database with respect to using only a realuser database. Considering the possibly expensive cost involved in collecting the realuser multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.

[46] 
N. Poh, S. Marcel, and S. Bengio.
Improving face authentication using virtual samples. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 3, pages 233236, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper, we present a simple yet effective way to improve a face verification system by generating multiple virtual samples from the unique image corresponding to an access request. These images are generated using simple geometric transformations. This method is often used during training to improve accuracy of a neural network model by making it robust against minor translation, scale and orientation change. The main contribution of this paper is to introduce such method during testing. By generating N images from one single image and propagating them to a trained network model, one obtains N scores. By merging these scores using a simple mean operator, we show that the variance of merged scores is decreased by a factor between 1 and N. An experiment is carried out on the XM2VTS database which achieves new stateoftheart performances.

[47] 
N. Poh, A. Martin, and S. Bengio.
Performance generalization in biometric authentication using joint userspecific and sample bootstraps. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(3):492498, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] Biometric authentication performance is often depicted by a decision error tradeoff (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a twostep bootstrap procedure to take into account of the three mentioned sources of variability. This is an extension to the Bolle 's bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75% of DET coverage, an unseen DET curve with 8 times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and hence more useful.

[48] 
N. Poh, C. Sanderson, and S. Bengio.
Spectral subband centroids as complementary features for speaker authentication. In International Conference on Biometric Authentication, ICBA, Lecture Notes in Computer Science, volume LNCS 3072, pages 631639. SpringerVerlag, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Most conventional features used in speaker authentication are based on estimation of spectral envelopes in one way or another, e.g., Melscale Filterbank Cepstrum Coefficients (MFCCs), Linearscale Filterbank Cepstrum Coefficients (LFCCs) and Relative Spectral Perceptual Linear Prediction (RASTAPLP). In this study, Spectral Subband Centroids (SSCs) are examined. These features are the centroid frequency in each subband. They have properties similar to formant frequencies but are limited to a given subband.Empirical experiments carried out on the NIST2001 database using SSCs, MFCCs, LFCCs and their combinations by concatenation suggest that SSCs are somewhat more robust compared to conventional MFCC and LFCC features as well as being partially complementary.

[49] 
F. Porée, J. Mariéthoz, S. Bengio, and F. Bimbot.
The BANCA database and experimental protocol for speaker verification. Technical Report IDIAPRR 0213, IDIAP, 2002. [ .ps.gz  .pdf  .djvu  abstract] Identity verification has become a very important research topic recently, particularly using methods based on the face or the voice of the individuals. In the context of the BANCA european project, a novel multimodal database was recently recorded, spanning 5 european languages, 2 modalities (face and voice), 2 microphones, 2 cameras and almost 300 individuals. As we believe that this database offers many advantages for this research community, this paper essentially presents the database and its associated experimental protocol, as well as a baseline stateoftheart system using the voice data for a textindependent speaker verification task.

[50] 
Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Estimating the quality of face localization for face verification. In IEEE International Conference on Image Processing, ICIP, pages 581584, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Face localization is the process of finding the exact position of a face in a given image. This can be useful in several applications such as face tracking or person authentication. The purpose of this paper is to show that the error made during the localization process may have different impacts depending on the final application. Hence in order to evaluate the performance of a face localization algorithm, we propose to embed the final application (here face verification) into the performance measuring process. Moreover, in this paper, we estimate this embedding using either a multilayer perceptron or a K nearest neighbor algorithm in order to speedup the evaluation process. We show on the BANCA database that our proposed measure best matches the final verification results when comparing several localization algorithms, on various performance measures currently used in face localization.

[51] 
Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Measuring the performance of face localization systems. Image and Vision Computing, 24(8):882893, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] The purpose of Face localization is to determine the coordinates of a face in a given image. It is a fundamental research area in computer vision because it serves, as a necessary first step in any face processing system, such as automatic face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performances depend widely on the accuracy of the face localization process. The purpose of this paper is to mainly show that the error made during the localization process may have different impacts on the final application. We first show the influence of localization errors on the face verification task and then empirically demonstrate the problems of current localization performance measures when applied to this task. In order to properly evaluate the performance of a face localization algorithm, we then propose to embed the final application (here face verification) into the performance measuring process. Using two benchmark databases, BANCA and XM2VTS, we proceed by showing empirically that our proposed method to evaluate localization algorithms better matches the final verification performance.

[52] 
C. Sanderson and S. Bengio.
Augmenting frontal face models for nonfrontal verification. In IEEE Multimodal User Authentication Workshop, 2003. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this work we propose to address the problem of nonfrontal face verification when only a frontal training image is available (e.g. a passport photograph) by augmenting a client's frontal face model with artificially synthesized models for nonfrontal views. In the framework of a Gaussian Mixture Model (GMM) based classifier, two techniques are proposed for the synthesis: UBMdiff and LinReg. Both techniques rely on a priori information and learn how face models for the frontal view are related to face models at a nonfrontal view. The synthesis and augmentation approach is evaluated by applying it to two face verification systems: Principal Component Analysis (PCA) based and DCTmod2 (Sanderson et al, 2003) based; the two systems are a representation of holistic and nonholistic approaches, respectively. Results from experiments on the FERET database suggest that in almost all cases, frontal model augmentation has beneficial effects for both systems; they also suggest that the LinReg technique (which is based on multivariate regression of classifier parameters) is more suited to the PCA based system and that the UBMdiff technique (which is based on differences between two general face models) is more suited to the DCTmod2 based system. The results also support the view that the standard DCTmod2/GMM system (trained on frontal faces) is less affected by outofplane rotations than the corresponding PCA/GMM system;moreover, the DCTmod2/GMM system using augmented models is, in almost all cases, more robust than the corresponding PCA/GMM system.

[53] 
C. Sanderson and S. Bengio.
Robust features for frontal authentication in difficult image conditions. In 4th International Conference on Audio and VideoBased Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 2688, pages 495504. SpringerVerlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In this paper we extend the recently proposed DCTmod2 feature extraction technique (which utilizes polynomial coefficients derived from 2D DCT coefficients obtained from horizontally & vertically neighbouring blocks) via the use of various windows and diagonally neighbouring blocks. We also propose enhanced PCA, where traditional PCA feature extraction is combined with DCTmod2. Results using test images corrupted by a linear and a nonlinear illumination change, white Gaussian noise and compression artefacts, show that use of diagonally neighbouring blocks and windowing is detrimental to robustness against illumination changes while being useful for increasing robustness against white noise and compression artefacts. We also show that the enhanced PCA technique retains all the positive aspects of traditional PCA (that is, robustness against white noise and compression artefacts) while also being robust to illumination changes; moreover, enhanced PCA outperforms PCA with histogram equalisation preprocessing.

[54] 
C. Sanderson and S. Bengio.
Extrapolating single view face models for multiview recognition. In International Conference on Intelligente Sensors, Sensor Networks and Information Processings, ISSNIP, pages 581586, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Performance of face recognition systems can be adversely affected by mismatches between training and test poses, especially when there is only one training image available. We address this problem by extending each statistical frontal face model with artificially synthesized models for nonfrontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques utilize prior information on how face models for the frontal view are related to face models for nonfrontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg technique (which is based on a common relation between two sets of points) is more suited than the MLLR based techniques (which in effect are "single point to single point" transforms). For the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR (due to a lower number of free parameters). The results further show that extending frontal models considerably reduces errors.

[55] 
C. Sanderson and S. Bengio.
Statistical transformation techniques for face verification using faces rotated in depth. Technical Report IDIAPRR 0404, IDIAP, 2004. [ .ps.gz  .pdf  .djvu  abstract] In the framework of a Bayesian classifier based on mixtures of gaussians, we address the problem of nonfrontal face verification (when only a single (frontal) training image is available) by extending each frontal face model with artificially synthesized models for nonfrontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for nonfrontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: PCA based (holistic features) and DCTmod2 based (local features). Experiments on the FERET database suggest that for the PCA based system, the LinReg based technique is more suited than the MLLR based techniques; for the DCTmod2 based system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the DCTmod2 based system is less affected by outofplane rotations than the PCA based system; this can be attributed to the local feature representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between face parts, allowing for movement of facial areas.

[56] 
C. Sanderson and S. Bengio.
Statistical transformations of frontal models for nonfrontal face verification. In IEEE International Conference on Image Processing, ICIP, pages 585588, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In the framework of a face verification system using local features and a Gaussian Mixture Model based classifier, we address the problem of nonfrontal face verification (when only a single (frontal) training image is available) by extending each client's frontal face model with artificially synthesized models for nonfrontal views. Furthermore, we propose the Maximum Likelihood Shift (MLS) synthesis technique and compare its performance against a Maximum Likelihood Linear Regression (MLLR) based technique (originally developed for adapting speech recognition systems) and the recently proposed "difference between two Universal Background Models" (UBMdiff) technique. All techniques rely on prior information and learn how a generic face model for the frontal view is related to generic models at nonfrontal views. Experiments on the FERET database suggest that that the proposed MLS technique is more suitable than MLLR (due to a lower number of free parameters) and UBMdiff (due to lack of heuristics). The results further suggest that extending frontal models considerably reduces errors.

[57] 
C. Sanderson, S. Bengio, H. Bourlard, J. Mariéthoz, R. Collobert, M.F.
BenZeghiba, F. Cardinaux, and S. Marcel.
Speech & face based biometric authentication at IDIAP. In International Conference on Multimedia and Expo, ICME, volume 3, pages 14, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We present an overview of recent research at IDIAP on speech & face based biometric authentication. This paper covers usercustomised passwords, adaptation techniques, confidence measures (for use in fusion of audio & visual scores), face verification in difficult image conditions, as well as other related research issues. We also overview the open source Torch library, which has aided in the implementation of the above mentioned techniques.

[58] 
C. Sanderson, S. Bengio, and Y. Gao.
On transforming statistical models for nonfrontal face verification. Pattern Recognition, 39(2):288302, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of gaussians, the problem is tackled through extending each frontal face model with artificially synthesized models for nonfrontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multivariate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for nonfrontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCAderived features) and a local feature system (based on DCTderived features). Experiments on the FERET database suggest that for the holistic system, the LinReg based technique is more suited than the MLLR based techniques; for the local feature system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts based representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.

[59] 
C. Sanderson, F. Cardinaux, and S. Bengio.
On accuracy/robustness/complexity tradeoffs in face verification. In IEEE International Conference on Information Technology and Applications, ICITA, pages 638643, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In much of the literature devoted to face recognition, experiments are performed with controlled images (e.g. manual face localization, controlled lighting, background and pose). However, a practical recognition system has to be robust to more challenging conditions. In this paper we first evaluate, on the relatively difficult BANCA database, the discrimination accuracy, robustness and complexity of Gaussian Mixture Model (GMM), 1D and pseudo2D Hidden Markov Model (HMM) based systems, using both manual and automatic face localization. We also propose to extend the GMM approach through the use of local features with embedded positional information, increasing accuracy without sacrificing its low complexity. Experiments show that good accuracy on manually located faces is not necessarily indicative of good accuracy on automatically located faces (which are imperfectly located). The deciding factor is shown to be the degree of constraints placed on spatial relations between face parts. Methods which utilize rigid constraints have poor robustness compared to methods which have relaxed constraints. Furthermore, we show that while the pseudo2D HMM approach has the best overall accuracy, classification time on current hardware makes it impractical. The best tradeoff in terms of complexity, robustness and discrimination accuracy is achieved by the extended GMM approach.

[1] 
S. Bengio.
Large scale visual semantic extraction. In Frontiers of Engineering  Reports on LeadingEdge Engineering from the 2011 Symposium, 2012. [ weblink  abstract] Image annotation is the task of providing textual semantic to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a socalled "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visiosemantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semanticonly trees, such as WordNet, which do not take into account the visual appearance of concepts.

[2] 
S. Bengio and Y. Bengio.
Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transaction on Neural Networks, special issue on data mining and knowledge discovery, 11(3):550557, 2000. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] The curse of dimensionality is severe when modeling highdimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling highdimensional data that requires resources (parameters and computations) that grow at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables (thus reducing significantly the number of parameters). Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[3] 
S. Bengio, J. Weston, and D. Grangier.
Label embedding trees for large multiclass tasks. In Advances in Neural Information Processing Systems, NIPS, 2010. [ .ps.gz  .pdf  .djvu  abstract] Multiclass classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than nonembedding approaches and has superior accuracy to Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including OnevsRest while being orders of magnitude faster.

[4] 
Y. Bengio and S. Bengio.
Modeling highdimensional discrete data with multilayer neural networks. In S.A. Solla, T.K. Leen, and K.R. Müller, editors, Advances in Neural Information Processing Systems, NIPS 12, pages 400406. MIT Press, 2000. [ .ps.gz  .pdf  .djvu  weblink  abstract] The curse of dimensionality is severe when modeling highdimensional discrete data: the number of possible combinations of the variables explodes exponentially. In this paper we propose a new architecture for modeling highdimensional data that requires resources (parameters and computations) that grow only at most as the square of the number of variables, using a multilayer neural network to represent the joint distribution of the variables as the product of conditional distributions. The neural network can be interpreted as a graphical model without hidden random variables, but in which the conditional distributions are tied through the hidden units. The connectivity of the neural network can be pruned by using dependency tests between the variables. Experiments on modeling the distribution of several discrete data sets show statistically significant improvements over other methods such as naive Bayes and comparable Bayesian networks, and show that significant improvements can be obtained by pruning the network.

[5] 
G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Largescale contentbased audio retrieval from text queries. In ACM International Conference on Multimedia Information Retrieval, MIR, 2008. [ .ps.gz  .pdf  .djvu  abstract] In contentbased audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This contentbased approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses freeform text queries rather sound sample based queries, (2) searches by audio content rather than via textual meta data, and (3) can scale to very large number of audio documents and very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passiveaggressive model for image retrieval (PAMIR), and compare it to two stateoftheart approaches; Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large realworld datasets: a collection of short sound effects, and a noisier and larger collection of usercontributed userlabeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.

[6] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Largescale online learning of image similarity through ranking: Extended abstract. In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address realworld largescale AI problem, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passiveaggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing stateoftheart methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants, provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that queryindependent similarity could be accurately learned even for largescale datasets that could not be handled before.

[7] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning. In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of nonzero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than stateoftheart methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.

[8] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, JMLR, 11:11091135, 2010. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passiveaggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing stateoftheart methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.

[9] 
R. Collobert and S. Bengio.
On the convergence of SVMTorch, an algorithm for largescale regression problems. Technical Report IDIAPRR 0024, IDIAP, Martigny, Switzerland, 2000. [ .ps.gz  .pdf  .djvu  abstract] Recently, many researchers have proposed decomposition algorithms for SVM regression problems. In a previous paper, we also proposed such an algorithm, named SVMTorch. In this paper, we show that while there is actually no convergence proof for any other decomposition algorithm for SVM regression problems to our knowledge, such a proof does exist for SVMTorch for the particular case where no shrinking is used and the size of the working set is equal to 2, which is the size that gave the fastest results on most experiments we have done. This convergence proof is in fact mainly based on the convergence proof given by Keerthi and Gilbert for their SVM classification algorithm.

[10] 
R. Collobert and S. Bengio.
Support vector machines for largescale regression problems. Technical Report IDIAPRR 0017, IDIAP, Martigny, Switzerland, 2000. [ .ps.gz  .pdf  .djvu  abstract] Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l^{2} memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorchphSVMTorch is available at http://www.idiap.ch/learning/SVMTorch.html., which is similar to SVMLight proposed by Joachims for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve largescale regression problems (more than 20000 examples). Comparisons with Nodelib, another SVM algorithm for largescale regression problems from Flake and Lawrence yielded significant time improvements.

[11] 
R. Collobert and S. Bengio.
SVMTorch: Support vector machines for largescale regression problems. Journal of Machine Learning Research, JMLR, 1:143160, 2001. [ .ps.gz  .pdf  .djvu  weblink  abstract] Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l square memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVMLight proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve largescale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for largescale regression problems from Flake and Lawrence (2000) yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

[12] 
R. Collobert and S. Bengio.
A new marginbased criterion for efficient gradient descent. Technical Report IDIAPRR 0316, IDIAP, 2003. [ .ps.gz  .pdf  .djvu  abstract] During the last few decades, several papers were published about secondorder optimization methods for gradient descent based learning algorithms. Unfortunately, these methods usually have a cost in time close to O(n^{3}) per iteration, and O(n^{2}) in space, where n is the number of parameters to optimize, which is intractable with large optimization systems usually found in reallife problems. Moreover, these methods are usually not easy to implement. Many enhancements have also been proposed in order to overcome these problems, but most of them still cost O(n^{2}) in time per iteration. Instead of trying to solve a hard optimization problem using complex secondorder tricks, we propose to modify the problem itself in order to optimize a simpler one, by simply changing the cost function used during training. Furthermore, we will argue that analyzing the Hessian resulting from the choice of various cost functions is very informative and could help in the design of new machine learning algorithms. For instance, we propose in this paper a version of the Support Vector Machines criterion applied to Multi Layer Perceptrons, which yields very good training and generalization performance in practice. Several empirical comparisons on two benchmark data sets are given to justify this approach.

[13] 
R. Collobert and S. Bengio.
A gentle hessian for efficient gradient descent. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 517520, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Several secondorder optimization methods for gradient descent algorithms have been proposed over the years, but they usually need to compute the inverse of the Hessian of the cost function (or an approximation of this inverse) during training. In most cases, this leads to an O(n^{2}) cost in time and space per iteration, where n is the number of parameters, which is prohibitive for large n. We propose instead a study of the Hessian before training. Based on a second order analysis, we show that a blockdiagonal Hessian yields an easier optimization problem than a full Hessian. We also show that the condition of blockdiagonality in common machine learning models can be achieved by simply selecting an appropriate training criterion. Finally, we propose a version of the SVM criterion applied to MLPs, which verifies the aspects highlighted in this second order analysis, but also yields very good generalization performance in practice, taking advantage of the margin effect. Several empirical comparisons on two benchmark datasets are given to illustrate this approach.

[14] 
R. Collobert and S. Bengio.
Links between perceptrons, MLPs and SVMs. In International Conference on Machine Learning, ICML, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We propose to study links between three important classification algorithms: Perceptrons, MultiLayer Perceptrons (MLPs) and Support Vector Machines (SVMs). We first study ways to control the capacity of Perceptrons (mainly regularization parameters and early stopping), using the margin idea introduced with SVMs. After showing that under simple conditions a Perceptron is equivalent to an SVM, we show it can be computationally expensive in time to train an SVM (and thus a Perceptron) with stochastic gradient descent, mainly because of the margin maximization term in the cost function. We then show that if we remove this margin maximization term, the learning rate or the use of early stopping can still control the margin. These ideas are extended afterward to the case of MLPs. Moreover, under some assumptions it also appears that MLPs are a kind of mixture of SVMs, maximizing the margin in the hidden layer space. Finally, we present a very simple MLP based on the previous findings, which yields better performances in generalization and speed than the other models.

[15] 
R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):11051114, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Support Vector Machines (SVMs) are currently the stateoftheart models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve reallife problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

[16] 
R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, NIPS 14, pages 633640. MIT Press, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] Support Vector Machines (SVMs) are currently the stateoftheart models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve reallife problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database, yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed on Forest.

[17] 
R. Collobert, S. Bengio, and J. Mariéthoz.
Torch: a modular machine learning software library. Technical Report IDIAPRR 0246, IDIAP, 2002. [ .ps.gz  .pdf  .djvu  abstract] Many scientific communities have expressed a growing interest in machine learning algorithms recently, mainly due to the generally good results they provide, compared to traditional statistical or AI approaches. However, these machine learning algorithms are often complex to implement and to use properly and efficiently. We thus present in this paper a new machine learning software library in which most stateoftheart algorithms have already been implemented and are available in a unified framework, in order for scientists to be able to use them, compare them, and even extend them. More interestingly, this library is freely available under a BSD license and can be retrieved on the web by everyone.

[18] 
R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures. In S. Lee and A. Verri, editors, International Workshop on Pattern Recognition with Support Vector Machines, SVM, Lecture Notes in Computer Science, volume LNCS 2388, pages 823. SpringerVerlag, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative loglikelihood.

[19] 
R. Collobert, Y. Bengio, and S. Bengio.
Scaling large learning problems with hard parallel mixtures. International Journal on Pattern Recognition and Artificial Intelligence (IJPRAI), 17(3):349365, 2003. [ .ps.gz  .pdf  .djvu  weblink  abstract] A challenge for statistical learning is to deal with large data sets, e.g. in data mining. The training time of ordinary Support Vector Machines is at least quadratic, which raises a serious research challenge if we want to deal with data sets of millions of examples. We propose a “hard parallelizable mixture” methodology which yields significantly reduced training time through modularization and parallelization: the training data is iteratively partitioned by a “gater” model in such a way that it becomes easy to learn an “expert” model separately in each region of the partition. A probabilistic extension and the use of a set of generative models allows representing the gater so that all pieces of the model are locally trained. For SVMs, time complexity appears empirically to locally grow linearly with the number of examples, while generalization performance can be enhanced. For the probabilistic version of the algorithm, the iterative algorithm provably goes down in a cost function that is an upper bound on the negative loglikelihood.

[20] 
J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and
H. Adam.
Largescale object classification using label relation graphs. In Proceedings of the European Conference on Computer Vision, ECCV, 2014. [ .ps.gz  .pdf  .djvu  abstract] In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a largescale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.

[21] 
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and
T. Mikolov.
DeViSE: A deep visualsemantic embedding model. In Advances In Neural Information Processing Systems, NIPS, 2013. [ .ps.gz  .pdf  .djvu  abstract] Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources  such as text data  both to train visual models and to constrain their predictions. In this paper we present a new deep visualsemantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches stateoftheart performance on the 1000class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zeroshot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.

[22] 
D. Grangier and S. Bengio.
A discriminative kernelbased model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):13711384, 2008. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernelbased classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over stateoftheart alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multipleword queries.

[23] 
D. Grangier, F. Monay, and S. Bengio.
Learning to retrieve images from text queries with a discriminative model. In International Workshop on Adaptive Multimedia Retrieval, AMR, 2006. [ .ps.gz  .pdf  .djvu  abstract] This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the stateoftheart (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).

[24] 
M. R. Gupta, S. Bengio, and J. Weston.
Training highly multiclass classifiers. Journal of Machine Learning Research, JMLR, 15:14611492, 2014. [ .ps.gz  .pdf  .djvu  weblink  abstract] Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the moreconfusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to onevsall linear SVMs and Wsabie.

[25] 
M. Keller, S. Bengio, and S. Y. Wong.
Benchmarking nonparametric statistical tests. In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Although nonparametric tests have already been proposed for that purpose, statistical significance tests for nonstandard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole “population”, we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets nonparametric tests are relatively reliable in all conditions.

[26] 
M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S.
Corrado, and J. Dean.
Zeroshot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, ICLR, 2014. [ .ps.gz  .pdf  .djvu  abstract] Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional classification framing of image understanding, particularly in terms of the promise for zeroshot learning  the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing image classifier and a semantic word embedding model, which contains the class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of and indeed outperforms state of the art methods on the ImageNet zeroshot learning task.

[27] 
M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparsecode representations. In ICML 2009 Workshop on Sparse Method for Music Audio, 2009. [ .ps.gz  .pdf  .djvu  abstract] The task of ranking sounds from text queries is a good test application for machinehearing techniques, and particularly for comparison and evaluation of alternative sound representations in a largescale setting. We have adapted a machinevision system, “passiveaggressive model for image retrieval” (PAMIR), which efficiently learns, using a rankingbased cost function, a linear mapping from a very large sparse feature space to a large queryterm space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from highdimensional auditory images. In addition to two main auditoryimage models, we also include and compare a family of more conventional MelFrequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vectorquantized MFCCs. The two auditory models tested use the adaptive polezero filter cascade (PZFC) auditory filterbank and sparsecode feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precisionattopk performance measures, the best results are about 72% top1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

[28] 
M. Stevens, S. Bengio, and Y. Singer.
Efficient learning of sparse ranking functions. In B. Scholkopf, Z. Luo, and V. Vovk, editors, Empirical Inference. Springer, 2013. [ weblink  abstract] Algorithms for learning to rank can be inefficient when they employ risk functions that use structural information. We describe and analyze a learning algorithm that efficiently learns a ranking function using a domination loss. This loss is designed for problems in which we need to rank a small number of positive examples over a vast number of negative examples. In that context, we propose an efficient coordinate descent approach that scales linearly with the number of examples. We then present an extension that incorporates regularization thus extending Vapnik¿s notion of regularized empirical risk minimization to ranking learning. We also discuss an extension to the case of multivalues feedback. Experiments performed on several benchmark datasets and large scale Google internal dataset demonstrate the effectiveness of learning algorithm in constructing compact models while retaining the empirical performance accuracy.

[29] 
J. Weston, S. Bengio, and P. Hamel.
Multitasking with joint semantic spaces for largescale music annotation and retrieval. Journal of New Music Research, 40:337348, 2011. [ .ps.gz  .pdf  .djvu  weblink  abstract] Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single lowdimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multitask learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.

[30] 
J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint wordimage embeddings. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECMLPKDD, 2010. Best Paper Award in Machine Learning [ .ps.gz  .pdf  .djvu  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[31] 
J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint wordimage embeddings. Machine Learning Journal, 81(1):2135, 2010. [ .ps.gz  .pdf  .djvu  weblink  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[32] 
J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011. [ .ps.gz  .pdf  .djvu  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

[1] 
A. Vinciarelli and S. Bengio.
Offline cursive word recognition using continuous density hidden markov models trained with PCA or ICA features. In Proceedings of the 16th International Conference on Pattern Recognition, ICPR, volume 3, pages 8184. IEEE Computer Society Press, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This work presents an Offline Cursive Word Recognition System dealing with single writer samples. The system was a continuous density hiddden Markov model trained using either the raw data, or data transformed using Principal Component Analysis or Independent Component Analysis. Both techniques significantly improved the recognition rate of the system. Preprocessing, normalization and feature extraction are described in detail as well as the training technique adopted. Several experiments were performed using a publicly available database. The accuracy obtained is the highest presented in the literature over the same data.

[2] 
A. Vinciarelli and S. Bengio.
Transforming the feature vectors to improve HMM based cursive word recognition systems. Technical Report IDIAPRR 0232, IDIAP, 2002. [ .ps.gz  .pdf  .djvu  abstract] Although many Offline Cursive Word Recognition systems are based on HMMs, no attention was ever paid, to our knowledge, to the fact that the feature vectors are typically not in the most suitable form for modeling. They are most of the time correlated and embedded in a space of dimension higher than their Intrinsic Dimension. This leads to several problems and has a negative influence on the performance. By applying some transforms it is possible to solve, or at least to attenuate, such problems resulting in data easier to model and in systems with higher recognition rate. In this work, we used Principal Component Analysis (linear and nonlinear) and Independent Component Analysis. A reduction of the error rate by up to 30.3% (over single writer data) and 16.2% (over multiple writer samples) is shown to be achieved.

[3] 
A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based offline cursive script recognition. In Proceedings of the 8th International Conference on Frontiers in Handwriting Recognition, pages 287291, 2002. [ .ps.gz  .pdf  .djvu  weblink  abstract] This work presents the application of HMM adaptation techniques to the problem of OffLine Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when less than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.

[4] 
A. Vinciarelli and S. Bengio.
Writer adaptation techniques in HMM based offline cursive script recognition. Pattern Recognition Letters, 23(8):905916, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This work presents the application of HMM adaptation techniques to the problem of OffLine Cursive Script Recognition. Instead of training a new model for each writer, one first creates a unique model with a mixed database and then adapts it for each different writer using his own small dataset. Experiments on a publicly available benchmark database show that an adapted system has an accuracy higher than 80% even when less than 30 word samples are used during adaptation, while a system trained using the data of the single writer only needs at least 200 words in order to achieve the same performance as the adapted models.

[5] 
A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of large vocabulary cursive handwritten text. In International Conference on Document Analysis and Recognition, ICDAR, pages 11011105, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents a system for the offline recognition of cursive handwritten lines of text. The system is based on continuous density HMMs and Statistical Language Models. The system recognizes data produced by a single writer. No apriori knowledge is used about the content of the text to be recognized. Changes in the experimental setup with respect to the recognition of single words are highlighted. The results show a recognition rate of ~85% with a lexicon containing 50'000 words. The experiments were performed over a publicly available database.

[6] 
A. Vinciarelli, S. Bengio, and H. Bunke.
Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(6):709720, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents a system for the offline recognition of large vocabulary unconstrained handwritten texts. The only assumption made about the data is that it is written in English. This allows the application of Statistical Language Models in order to improve the performance of our system. Several experiments have been performed using both single and multiple writer data. Lexica of variable size (from 10,000 to 50,000 words) have been used. The use of language models is shown to improve the accuracy of the system (when the lexicon contains 50,000 words, error rate is reduced by ~50% for single writer data and by ~25% for multiple writer data). Our approach is described in detail and compared with other methods presented in the literature to deal with the same problem. An experimental setup to correctly deal with unconstrained text recognition is proposed.

[1] 
R. Collobert and S. Bengio.
On the convergence of SVMTorch, an algorithm for largescale regression problems. Technical Report IDIAPRR 0024, IDIAP, Martigny, Switzerland, 2000. [ .ps.gz  .pdf  .djvu  abstract] Recently, many researchers have proposed decomposition algorithms for SVM regression problems. In a previous paper, we also proposed such an algorithm, named SVMTorch. In this paper, we show that while there is actually no convergence proof for any other decomposition algorithm for SVM regression problems to our knowledge, such a proof does exist for SVMTorch for the particular case where no shrinking is used and the size of the working set is equal to 2, which is the size that gave the fastest results on most experiments we have done. This convergence proof is in fact mainly based on the convergence proof given by Keerthi and Gilbert for their SVM classification algorithm.

[2] 
R. Collobert and S. Bengio.
Support vector machines for largescale regression problems. Technical Report IDIAPRR 0017, IDIAP, Martigny, Switzerland, 2000. [ .ps.gz  .pdf  .djvu  abstract] Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l^{2} memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorchphSVMTorch is available at http://www.idiap.ch/learning/SVMTorch.html., which is similar to SVMLight proposed by Joachims for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve largescale regression problems (more than 20000 examples). Comparisons with Nodelib, another SVM algorithm for largescale regression problems from Flake and Lawrence yielded significant time improvements.

[3] 
R. Collobert and S. Bengio.
SVMTorch: Support vector machines for largescale regression problems. Journal of Machine Learning Research, JMLR, 1:143160, 2001. [ .ps.gz  .pdf  .djvu  weblink  abstract] Support Vector Machines (SVMs) for regression problems are trained by solving a quadratic optimization problem which needs on the order of l square memory and time resources to solve, where l is the number of training examples. In this paper, we propose a decomposition algorithm, SVMTorch (available at http://www.idiap.ch/learning/SVMTorch.html), which is similar to SVMLight proposed by Joachims (1999) for classification problems, but adapted to regression problems. With this algorithm, one can now efficiently solve largescale regression problems (more than 20000 examples). Comparisons with Nodelib, another publicly available SVM algorithm for largescale regression problems from Flake and Lawrence (2000) yielded significant time improvements. Finally, based on a recent paper from Lin (2000), we show that a convergence proof exists for our algorithm.

[4] 
R. Collobert, S. Bengio, and Y. Bengio.
A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):11051114, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Support Vector Machines (SVMs) are currently the stateoftheart models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve reallife problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples). In addition, and that is a surprise, a significant improvement in generalization was observed.

[5] 
Y. Grandvalet, J. Mariéthoz, and S. Bengio.
A probabilistic interpretation of SVMs with an application to unbalanced classification. In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] In this paper, we show that the hinge loss can be interpreted as the negloglikelihood of a semiparametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semiparametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is intervalvalued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem when decisions result in unequal losses. Experiments on an unbalanced classification loss show improvements over stateoftheart procedures.

[6] 
D. Grangier and S. Bengio.
A discriminative kernelbased model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):13711384, 2008. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernelbased classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over stateoftheart alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multipleword queries.

[7] 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting. In ISCA Research Workshop on Non Linear Speech Processing, NOLISP, 2007. [ .ps.gz  .pdf  .djvu  abstract] This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on nonlinearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vectorspace which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMMbased approach.

[8] 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting. Speech Communication, 51:317329, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vectorspace, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulted discriminative system outperforms the conventional contextindependent HMMbased system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGIStories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional contextindependent HMMbased system.

[9] 
J. Keshet, S. ShalevShwartz, S. Bengio, Y. Singer, and D. Chazan.
Discriminative kernelbased phoneme sequence recognition. In Proceedings of the International Conference on Spoken Language Processing, InterspeechICSLP, 2006. [ .ps.gz  .pdf  .djvu  abstract] We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMMbased approaches, our method uses a discriminative kernelbased training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance along with a proposed phoneme sequence to a vectorspace endowed with an innerproduct that is realized by a Mercer kernel. Building on large margin techniques for predicting whole sequences, we are able to devise a learning algorithm which distills to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and further describe an efficient implementation of it. We present initial encouraging experimental results with the TIMIT and compare the proposed method to an HMMbased approach.

[10] 
Q. Le and S. Bengio.
Hybrid generativediscriminative models for speech and speaker recognition. Technical Report IDIAPRR 0206, IDIAP, 2002. [ .ps.gz  .pdf  .djvu  abstract] Generative probability models such as Hidden Markov Models are usually used for modeling sequences of data because of their ability to handle variable size sequences and missing information. On the other hand, because of their discriminative properties, discriminative models like Support Vector Machines (SVMs) usually yield better performance in classification problem and can construct flexible decision boundaries. An ideal classifier should have all the power of these two complementary approaches. A series of recent papers has suggested some techniques for mixing generative models and discriminative models. In one of them a fixed size vector (the Fisher score) containing sufficient statistics of a sequence is computed for a previously trained HMM and can then be used as input to a discriminative model for classification. The purpose of this project is thus to study, experiment, enhance and adapt these new approaches of integrating discriminative models such as SVM into generative models for sequence processing problems, such as speaker and speech recognition.

[11] 
Q. Le and S. Bengio.
Client dependent GMMSVM models for speaker verification. In International Conference on Artificial Neural Networks, ICANN/ICONIP, Lecture Notes in Computer Science, volume LNCS 2714, pages 443451. Springer Verlag, 2003. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Generative Gaussian Mixture Models (GMMs) are known to be the dominant approach for modeling speech sequences in text independent speaker verification applications because of their scalability, good performance and their ability in handling variable size sequences. On the other hand, because of their discriminative properties, models like Support Vector Machines (SVMs) usually yield better performance in static classification problems and can construct flexible decision boundaries. In this paper, we try to combine these two complementary models by using Support Vector Machines to postprocess scores obtained by the GMMs. A crossvalidation method is also used in the baseline system to increase the number of client scores in the training phase, which enhances the results of the SVM models. Experiments carried out on the XM2VTS and PolyVar databases confirm the interest of this hybrid approach.

[12] 
J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to textindependent speaker verification systems. Pattern Recognition, 40:23152324, 2007. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper present a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to stateoftheart GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.

[13] 
V. Popovici, S. Bengio, and J.P. Thiran.
Kernel matching pursuit for large datasets. Pattern Recognition, 38(12):23852390, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] Kernel Matching Pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of some basis functions selected from a kernelinduced dictionary. Here we propose a modification of the Kernel Matching Pursuit algorithm that aim s at making the method practical for large datasets. Starting from an approximating algorithm, the Weak Greedy Algorithm, we introduce a stochastic method for reducing the search space at each iteration. Then we study the implications of using an approximate algorithm and we show how one can control the tradeoff between the accuracy and the need for resources. Finally we present some experiments performed on a large dataset that support our approach and illustrate its applicability.

[14] 
A. Pozdnoukhov and S. Bengio.
From samples to objects in kernel methods. Technical Report IDIAPRR 0329, IDIAP, 2003. [ .ps.gz  .pdf  .djvu  abstract] This paper presents a general method for incorporating prior knowledge into kernel methods. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. Two implementation techniques of this method, based on analytical kernel jittering and the vicinal risk minimization principle, are considered. Empirical results on one artificial dataset and one real dataset based on EEG signals demonstrate the performance of the proposed method.

[15] 
A. Pozdnoukhov and S. Bengio.
Tangent vector kernels for invariant image classification with SVMs. In International Conference on Pattern Recognition, ICPR, volume 3, pages 486489, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper presents an application of the general sampletoobject approach to the problem of invariant image classification. The approach results in defining new SVM kernels based on tangent vectors that take into account prior information on known invariances. Real data of face images are used for experiments. The presented approach integrates virtual sample and tangent distance methods. We observe a significant increase in performance with respect to standard approaches. The experiments also illustrate (as expected) that prior knowledge becomes more important as the amount of training data decreases.

[16] 
A. Pozdnoukhov and S. Bengio.
Improving kernel classifiers for object categorization problems. In International Conference on Machine Learning, ICML, Workshop on Learning with Partially Classified Training Data, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper presents an approach for improving the performance of kernel classifiers applied to object categorization problems. The approach is based on the use of distributions centered around each training points, which are exploited for interclass invariant image representation with local invariant features. Furthermore, we propose an extensive use of unlabeled images for improving the SVMbased classifier.

[17] 
A. Pozdnoukhov and S. Bengio.
Graphbased invariant manifolds for invariant pattern recognition with kernel methods. In International Conference on Pattern Recognition, ICPR, 2006. [ .ps.gz  .pdf  .djvu  abstract] We present here an approach for applying the technique of modeling data transformation manifolds for invariant learning with kernel methods. The approach is based on building a kernel function on the graph modeling the invariant manifold. It provides a way for taking into account nearly arbitrary transformations of the input samples. The approach is verified experimentally on the task of optical character recognition, providing stateoftheart performance on harder problem settings.

[18] 
A. Pozdnoukhov and S. Bengio.
Invariances in kernel methods: From samples to objects. Pattern Recognition Letters, 27(10):10871097, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper presents a general method for incorporating prior knowledge into kernel methods such as support vector machines. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. A number of implementation techniques of this method, based on hard geometrical objects and soft objects based on distributions are considered. Tangent vectors are extensively used for object construction. Empirical results on one artificial dataset and two real datasets of electroencephalogram signals and face images demonstrate the usefulness of the proposed method. The method could establish a foundation for an information retrieval and person identification systems.

[19] 
A. Pozdnoukhov and S. Bengio.
Semisupervised kernel methods for regression estimation. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006. [ .ps.gz  .pdf  .djvu  abstract] The paper presents a semisupervised kernel method for regression estimation in the presence of unlabeled patterns. The method exploits a recently proposed datadependent kernel which is constructed in order to represent the inner geometry of the data. This kernel is implemented into Kernel Regression methods (SVR, KRR). Experimental results aim to highlight the properties of the method and its advantages as compared to fully supervised approaches. The influence of the parameters on the model properties was evaluated experimentally. One artificial and two realworld datasets were used to demonstrate the performance of the proposed algorithm.

[1] 
S. Bengio.
Large scale visual semantic extraction. In Frontiers of Engineering  Reports on LeadingEdge Engineering from the 2011 Symposium, 2012. [ weblink  abstract] Image annotation is the task of providing textual semantic to new images, by ranking a large set of possible annotations according to how they correspond to a given image. In the large scale setting, there could be millions of images to process and hundreds of thousands of potential distinct annotations. In order to achieve such a task we propose to build a socalled "embedding space", into which both images and annotations can be automatically projected. In such a space, one can then find the nearest annotations to a given image, or annotations similar to a given annotation. One can even build a visiosemantic tree from these annotations, that corresponds to how concepts (annotations) are similar to each other with respect to their visual characteristics. Such a tree will be different from semanticonly trees, such as WordNet, which do not take into account the visual appearance of concepts.

[2] 
S. Bengio and G. Heigold.
Word embeddings for speech recognition. In Proceedings of the 15th Conference of the International Speech Communication Association, Interspeech, 2014. [ .ps.gz  .pdf  .djvu  abstract] Speech recognition systems have used the concept of states as a way to decompose words into subword units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how embeddings can still allow to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.

[3] 
S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve. In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the socalled breakeven point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to use adequately a nonparametric statistical test in order to produce EPCs with confidence intervals or assess the statistical significant difference between two models under various settings.

[4] 
S. Bengio, F. Pereira, Y. Singer, and D. Strelow.
Group sparse coding. In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009. [ .ps.gz  .pdf  .djvu  abstract] Bagofwords document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearestneighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixednorm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification.

[5] 
S. Bengio, J. Weston, and D. Grangier.
Label embedding trees for large multiclass tasks. In Advances in Neural Information Processing Systems, NIPS, 2010. [ .ps.gz  .pdf  .djvu  abstract] Multiclass classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a treestructure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low dimensional space that is faster than nonembedding approaches and has superior accuracy to Finally we combine the two ideas resulting in the label embedding tree that outperforms alternative methods including OnevsRest while being orders of magnitude faster.

[6] 
G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Largescale contentbased audio retrieval from text queries. In ACM International Conference on Multimedia Information Retrieval, MIR, 2008. [ .ps.gz  .pdf  .djvu  abstract] In contentbased audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This contentbased approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses freeform text queries rather sound sample based queries, (2) searches by audio content rather than via textual meta data, and (3) can scale to very large number of audio documents and very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passiveaggressive model for image retrieval (PAMIR), and compare it to two stateoftheart approaches; Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large realworld datasets: a collection of short sound effects, and a noisier and larger collection of usercontributed userlabeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.

[7] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Largescale online learning of image similarity through ranking: Extended abstract. In 4th Iberian Conference on Pattern Recognition and Image Analysis IbPRIA, 2009. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. Pairwise similarity plays a crucial role in classification algorithms like nearest neighbors, and is practically important for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are both visually similar and semantically related to a given object. Unfortunately, current approaches for learning semantic similarity are limited to small scale datasets, because their complexity grows quadratically with the sample size, and because they impose costly positivity constraints on the learned similarity functions. To address realworld largescale AI problem, like learning similarity over all images on the web, we need to develop new algorithms that scale to many samples, many classes, and many features. The current abstract presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passiveaggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing stateoftheart methods, while being an order of magnitude faster. Comparing OASIS with different symmetric variants, provides unexpected insights into the effect of symmetry on the quality of the similarity. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within two days on a single CPU. Human evaluations showed that 35% of the ten top images ranked by OASIS were semantically relevant to a query image. This suggests that queryindependent similarity could be accurately learned even for largescale datasets that could not be handled before.

[8] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
An online algorithm for large scale image similarity learning. In Advances in Neural Information Processing Systems, NIPS. MIT Press, 2009. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is a fundamental problem in machine learning. It stands in the core of classification methods like kernel machines, and is particularly useful for applications like searching for images that are similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, current approaches for learning similarity do not scale to large datasets, especially when imposing metric constraints on the learned similarity. We describe OASIS, a method for learning pairwise similarity that is fast and scales linearly with the number of objects and the number of nonzero features. Scalability is achieved through online learning of a bilinear model over sparse representations using a large margin criterion and an efficient hinge loss cost. OASIS is accurate at a wide range of scales: on a standard benchmark with thousands of images, it is more precise than stateoftheart methods, and faster by orders of magnitude. On 2.7 million images collected from the web, OASIS can be trained within 3 days on a single CPU. The nonmetric similarities learned by OASIS can be transformed into metric similarities, achieving higher precisions than similarities that are learned as metrics in the first place. This suggests an approach for learning a metric from data that is larger by orders of magnitude than was handled before.

[9] 
G. Chechik, V. Sharma, U. Shalit, and S. Bengio.
Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, JMLR, 11:11091135, 2010. [ .ps.gz  .pdf  .djvu  abstract] Learning a measure of similarity between pairs of objects is an important generic problem in machine learning. It is particularly useful in large scale applications like searching for an image that is similar to a given image or finding videos that are relevant to a given video. In these tasks, users look for objects that are not only visually similar but also semantically related to a given object. Unfortunately, the approaches that exist today for learning such semantic similarity do not scale to large datasets. This is both because typically their CPU and storage requirements grow quadratically with the sample size, and because many methods impose complex positivity constraints on the space of learned similarity functions. The current paper presents OASIS, an Online Algorithm for Scalable Image Similarity learning that learns a bilinear similarity measure over sparse representations. OASIS is an online dual approach using the passiveaggressive family of learning algorithms with a large margin criterion and an efficient hinge loss cost. Our experiments show that OASIS is both fast and accurate at a wide range of scales: for a dataset with thousands of images, it achieves better results than existing stateoftheart methods, while being an order of magnitude faster. For large, web scale, datasets, OASIS can be trained on more than two million images from 150K text queries within 3 days on a single CPU. On this large scale dataset, human evaluations showed that 35% of the ten nearest neighbors of a given test image, as found by OASIS, were semantically relevant to that image. This suggests that query independent similarity could be accurately learned even for large scale datasets that could not be handled before.

[10] 
A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and
T. Mikolov.
DeViSE: A deep visualsemantic embedding model. In Advances In Neural Information Processing Systems, NIPS, 2013. [ .ps.gz  .pdf  .djvu  abstract] Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources  such as text data  both to train visual models and to constrain their predictions. In this paper we present a new deep visualsemantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches stateoftheart performance on the 1000class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zeroshot predictions achieving hit rates of up to 18% across thousands of novel labels never seen by the visual model.

[11] 
O. Glickman, I. Dagan, M. Keller, S. Bengio, and W. Daelemans.
Investigating lexical substitution scoring for subtitle generation. In Tenth Conference on Computational Natural Language Learning, CONLL, 2006. [ .ps.gz  .pdf  .djvu  abstract] This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state of the art scoring methods that predict the validity of a given substitution. The paper evaluates two context independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two Wordnet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.

[12] 
D. Grangier and S. Bengio.
Exploiting hyperlinks to learn a retrieval model. In Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005. [ .ps.gz  .pdf  .djvu  abstract] Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d inC according to their similarity with respect to q, sim (q,d). The identification of an effective function a,b >sim(a,b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the stateoftheart Okapi approach, when applied over a nonhyperlinked retrieval corpus.

[13] 
D. Grangier and S. Bengio.
Inferring document similarity from hyperlinks. In Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function a,b >sim(a,b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.

[14] 
D. Grangier and S. Bengio.
A neural network to retrieve images from text queries. In Proceedings of the 16th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 4132. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This work presents a neural network for the retrieval of images from text queries. The proposed network is composed of two main modules: the first one extracts a global picture representation from local block descriptors while the second one aims at solving the retrieval problem from the extracted representation. Both modules are trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.

[15] 
D. Grangier and S. Bengio.
Learning the interframe distance for discriminative templatebased keyword detection. In Proceedings of the 10th European Conference on Speech Communication and Technology, EurospeechInterspeech, 2007. [ .ps.gz  .pdf  .djvu  abstract] This paper proposes a discriminative approach to templatebased keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.

[16] 
D. Grangier and S. Bengio.
A discriminative kernelbased model to rank images from text queries. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):13711384, 2008. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernelbased classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over stateoftheart alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multipleword queries.

[17] 
D. Grangier, F. Monay, and S. Bengio.
A discriminative approach for the retrieval of images from text queries. In European Conference on Machine Learning, ECML, Lecture Notes in Computer Science, volume LNCS 4212. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the nonrelevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed PassiveAggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current stateoftheart approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.

[18] 
D. Grangier, F. Monay, and S. Bengio.
Learning to retrieve images from text queries with a discriminative model. In International Workshop on Adaptive Multimedia Retrieval, AMR, 2006. [ .ps.gz  .pdf  .djvu  abstract] This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the stateoftheart (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).

[19] 
M. R. Gupta, S. Bengio, and J. Weston.
Training highly multiclass classifiers. Journal of Machine Learning Research, JMLR, 15:14611492, 2014. [ .ps.gz  .pdf  .djvu  weblink  abstract] Classification problems with thousands or more classes often have a large variance in the confusability between classes, and we show that the moreconfusable classes add more noise to the empirical loss that is minimized during training. We propose an online solution that reduces the effect of highly confusable classes in training the classifier parameters, and focuses the training on pairs of classes that are easier to differentiate at any given time in the training. We also show that the adagrad method, recently proposed for automatically decreasing step sizes for convex stochastic gradient descent optimization, can also be profitably applied to the nonconvex optimization stochastic gradient descent training of a joint supervised dimensionality reduction and linear classifier. Experiments on ImageNet benchmark datasets and proprietary image recognition problems with 15,000 to 97,000 classes show substantial gains in classification accuracy compared to onevsall linear SVMs and Wsabie.

[20] 
M. Keller and S. Bengio.
Textual data representation. Technical Report IDIAPRR 0349, IDIAP, 2003. [ .ps.gz  .pdf  .djvu  abstract] We address in this report the problem of representing formally textual data. First, this problem is replaced in the context of automatic text processing. Then, the weaknesses of the basic document representation, i.e. the bagofwords representation, are explained and some stateoftheart methods claiming to overcome these weaknesses are reviewed. Moreover we propose a novel graphical model, the Theme Topic Mixture Model, which also claims to do so, in addition of giving a probabilistic framework in which documents are considered.

[21] 
M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation. In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Automatic Text Processing tasks, documents are usually represented in the bagofword space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.

[22] 
M. Keller and S. Bengio.
A neural network for text representation. In Proceedings of the 15th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 3697, pages 667672. SpringerVerlag, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, nonprobabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.

[23] 
J. Lee, S. Bengio, S. Kim, G. Lebanon, and Y. Singer.
Local collaborative ranking. In International World Wide Web Conference, WWW, 2014. [ .ps.gz  .pdf  .djvu  abstract] Personalized recommendation systems are used in a wide variety of applications such as electronic commerce, social networks, web search, and more. Collaborative filtering approaches to recommendation systems typically assume that the rating matrix (e.g., movie ratings by viewers) is lowrank. In this paper, we examine an alternative approach in which the rating matrix is locally lowrank. Concretely, we assume that the rating matrix is lowrank within certain neighborhoods of the metric space defined by (user, item) pairs. We combine a recent approach for local lowrank approximation based on the Frobenius norm with a general empirical risk minimization for ranking losses. Our experiments indicate that the combination of a mixture of local lowrank matrices each of which was trained to minimize a ranking loss outperforms many of the currently used stateoftheart recommendation systems. Moreover, our method is easy to parallelize, making it a viable approach for large scale realworld rankbased recommendation systems.

[24] 
J. Lee, S. Kim, G. Lebanon, Y. Singer, and S. Bengio.
LLORMA: Local lowrank matrix approximation. Journal of Machine Learning Research, JMLR, 17:124, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Matrix approximation is a common tool in recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is lowrank. In this paper, we propose, analyze, and experiment with two procedures, one parallel and the other global, for constructing local matrix approximations. The two approaches approximate the observed matrix as a weighted sum of lowrank matrices. These matrices are limited to a local region of the observed matrix. We analyze the accuracy of the proposed local lowrank modeling. Our experiments show improvements in prediction accuracy over classical approaches for recommendation tasks.

[25] 
R. F. Lyon, M. Rehn, S. Bengio, T. C. Walters, and G. Chechik.
Sound retrieval and ranking using sparse auditory representations. Neural Computation, 22(9):23902416, 2010. [ .ps.gz  .pdf  .djvu  weblink  abstract] To create systems that understand the sounds that humans are exposed to in everyday life, we need to represent sounds with features that can discriminate among many different sound classes. Here, we use a soundranking framework to quantitatively evaluate such representations in a large scale task. We have adapted a machinevision method, the “passiveaggressive model for image retrieval” (PAMIR), which efficiently learns a linear mapping from a very large sparse feature space to a large queryterm space. Using this approach we compare different auditory front ends and different ways of extracting sparse features from highdimensional auditory images. We tested auditory models that use adaptive polezero filter cascade (PZFC) auditory filterbank and sparsecode feature extraction from stabilized auditory images via multiple vector quantizers. In addition to auditory image models, we also compare a family of more conventional MelFrequency Cepstral Coefficient (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vectorquantized MFCCs. Ranking thousands of sound files with a query vocabulary of thousands of words, the best precision at top1 was 73% and the average precision was 35%, reflecting a 18% improvement over the best competing MFCC frontend.

[26] 
M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S.
Corrado, and J. Dean.
Zeroshot learning by convex combination of semantic embeddings. In International Conference on Learning Representations, ICLR, 2014. [ .ps.gz  .pdf  .djvu  abstract] Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional classification framing of image understanding, particularly in terms of the promise for zeroshot learning  the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing image classifier and a semantic word embedding model, which contains the class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of and indeed outperforms state of the art methods on the ImageNet zeroshot learning task.

[27] 
M. Rehn, R. F. Lyon, S. Bengio, T. C. Walters, and G. Chechik.
Sound ranking using auditory sparsecode representations. In ICML 2009 Workshop on Sparse Method for Music Audio, 2009. [ .ps.gz  .pdf  .djvu  abstract] The task of ranking sounds from text queries is a good test application for machinehearing techniques, and particularly for comparison and evaluation of alternative sound representations in a largescale setting. We have adapted a machinevision system, “passiveaggressive model for image retrieval” (PAMIR), which efficiently learns, using a rankingbased cost function, a linear mapping from a very large sparse feature space to a large queryterm space. Using this system allows us to focus on comparison of different auditory front ends and different ways of extracting sparse features from highdimensional auditory images. In addition to two main auditoryimage models, we also include and compare a family of more conventional MelFrequency Cepstral Coefficients (MFCC) front ends. The experimental results show a significant advantage for the auditory models over vectorquantized MFCCs. The two auditory models tested use the adaptive polezero filter cascade (PZFC) auditory filterbank and sparsecode feature extraction from stabilized auditory images via multiple vector quantizers. The models differ in their implementation of the strobed temporal integration used to generate the stabilized image. Using ranking precisionattopk performance measures, the best results are about 72% top1 precision and 35% average precision, using a test corpus of thousands of sound files and a query vocabulary of hundreds of words.

[28] 
M. Stevens, S. Bengio, and Y. Singer.
Efficient learning of sparse ranking functions. In B. Scholkopf, Z. Luo, and V. Vovk, editors, Empirical Inference. Springer, 2013. [ weblink  abstract] Algorithms for learning to rank can be inefficient when they employ risk functions that use structural information. We describe and analyze a learning algorithm that efficiently learns a ranking function using a domination loss. This loss is designed for problems in which we need to rank a small number of positive examples over a vast number of negative examples. In that context, we propose an efficient coordinate descent approach that scales linearly with the number of examples. We then present an extension that incorporates regularization thus extending Vapnik¿s notion of regularized empirical risk minimization to ranking learning. We also discuss an extension to the case of multivalues feedback. Experiments performed on several benchmark datasets and large scale Google internal dataset demonstrate the effectiveness of learning algorithm in constructing compact models while retaining the empirical performance accuracy.

[29] 
J. Weston, S. Bengio, and P. Hamel.
Multitasking with joint semantic spaces for largescale music annotation and retrieval. Journal of New Music Research, 40:337348, 2011. [ .ps.gz  .pdf  .djvu  weblink  abstract] Music prediction tasks range from predicting tags given a song or clip of audio, predicting the name of the artist, or predicting related songs given a song, clip, artist name or tag. That is, we are interested in every semantic relationship between the different musical concepts in our database. In realistically sized databases, the number of songs is measured in the hundreds of thousands or more, and the number of artists in the tens of thousands or more, providing a considerable challenge to standard machine learning techniques. In this work, we propose a method that scales to such datasets which attempts to capture the semantic similarities between the database items by modeling audio, artist names, and tags in a single lowdimensional semantic embedding space. This choice of space is learnt by optimizing the set of prediction tasks of interest jointly using multitask learning. Our single model learnt by training on the joint objective function is shown experimentally to have improved accuracy over training on each task alone. Our method also outperforms the baseline methods tried and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where the semantic space captures well the similarities of interest.

[30] 
J. Weston, S. Bengio, and N. Usunier.
Large scale image annotation: Learning to rank with joint wordimage embeddings. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECMLPKDD, 2010. Best Paper Award in Machine Learning [ .ps.gz  .pdf  .djvu  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method both outperforms several baseline methods and, in comparison to them, is faster and consumes less memory. We also demonstrate how our method learns an interpretable model, where annotations with alternate spellings or even languages are close in the embedding space. Hence, even when our model does not predict the exact annotation given by a human labeler, it often predicts similar annotations, a fact that we try to quantify by measuring the newly introduced “sibling” precision metric, where our method also obtains excellent results.

[31] 
J. Weston, S. Bengio, and N. Usunier.
Wsabie: Scaling up to large vocabulary image annotation. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, 2011. [ .ps.gz  .pdf  .djvu  abstract] Image annotation datasets are becoming larger and larger, with tens of millions of images and tens of thousands of possible annotations. We propose a strongly performing method that scales to such datasets by simultaneously learning to optimize precision at the top of the ranked list of annotations for a given image and learning a lowdimensional joint embedding space for both images and annotations. Our method, called Wsabie, both outperforms several baseline methods and is faster and consumes less memory.

[1] 
N. Gilardi and S. Bengio.
Local machine learning models for spatial data analysis. Journal of Geographic Information and Decision Analysis, 4(1):1128, 2000. [ .ps.gz  .pdf  .djvu  weblink  abstract] In this paper, we compare different machine learning algorithms applied to non stationary spatial data analysis. We show that models taking into account local variability of the data are better than models which are trained globally on the whole dataset. Two global models (Support Vector Regression and Multilayer Perceptrons) and two local models (a local version of Support Vector Regression and Mixture of Experts) were compared over the Spatial Interpolation Comparison 97 (SIC97) dataset, and the results are presented and compared to previous results obtained on the same dataset.

[2] 
N. Gilardi and S. Bengio.
Comparison of four machine learning algorithms for spatial data analysis. In G. Dubois, J. Malczewski, and M. DeCort, editors, Mapping radioactivity in the environment  Spatial Interpolation Comparison 97, pages 222237. Office for Official Publications of the European Communities, Luxembourg, 2003. [ .ps.gz  .pdf  .djvu  abstract] This chapter proposes a clear methodology on how to use machine learning algorithms for spatial data analysis in order to avoid any bias and eventually obtain fair estimation of their performance on new data. Four different machine learning algorithms are presented, namely multilayer perceptrons (MLP), mixture of experts (ME), support vector regression (SVR) and a local version of the latter (local SVR). Evaluation criteria adapted to geostatistical problems are also presented in order to compare adequately different models on the same dataset. Finally, an experimental comparison is given on the SIC97 dataset as well as an analysis of the results.

[3] 
N. Gilardi and S. Bengio.
Machine learning for automatic environmental mapping: when and how? In G. Dubois, editor, Automatic mapping algorithms for routine and emergency monitoring data. Report on the Spatial Interpolation Comparison (SIC2004) exercise, pages 123138. Office for Official Publications of the European Communities, Luxembourg, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] This paper discusses the opportunity of using Machine Learning techniques in an automatic environmental mapping context, as was the case for the SIC2004 exercise. First, the Machine Learning methodology is quickly described and compared to Geostatistics. From there, some clues about when to apply Machine Learning are proposed, and what outcomes can be expected from this choice. Finally, three well known regression algorithms: KNearest Neighbors, Multi Layer Perceptron and Support Vector Regression, are used on SIC2004 data in a Machine Learning context, and compared to Ordinary Kriging. This illustrates some potential drawbacks of SVR and MLP for applications such as SIC2004.

[4] 
N. Gilardi, S. Bengio, and M. Kanevski.
Conditional gaussian mixture models for environmental risk mapping. In IEEE Workshop on Neural Networks for Signal Processing, NNSP, pages 777786, 2002. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] This paper proposes the use of Gaussian Mixture Models to estimate conditional probability density functions in an environmental risk mapping context. A conditional Gaussian Mixture Model has been compared to the geostatistical method of Sequential Gaussian Simulations and shows good performances in reconstructing local PDF. The data sets used for this comparison are parts of the digital elevation model of Switzerland.

[1] 
S. Bengio, F. Fessant, and D. Collobert.
A connectionist system for mediumterm horizon time series prediction. In International Workshop on Applications of Neural Networks to Telecommunications, IWANNT, Stockholm, Sweden, 1995. [ .ps.gz  .pdf  .djvu ] 
[2] 
S. Bengio, F. Fessant, and D. Collobert.
Use of modular architectures for time series prediction. Neural Processing Letters, 3(2):101106, 1996. [ .ps.gz  .pdf  .djvu  weblink ] 
[3] 
F. Fessant, S. Bengio, and D. Collobert.
On the prediction of solar activity using different neural network models. Annales Geophysicae, 14:2026, 1996. [ .ps.gz  .pdf  .djvu  weblink ] 
[4] 
A. Gravey, S. Bengio, D. Collobert, and F. Clerot.
Utilisation de techniques de prédiction neuromimétiques pour la négotiation dynamique des paramètres de contrat de trafic dans un réseau ATM. Technical Report NT/LAA/EIA/132, France Télécom CNET, Lannion, France, 1996. 
[1] 
C. Dimitrakakis and S. Bengio.
Boosting HMMs with an application to speech recognition. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, volume 5, pages 621624, 2004. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] Boosting is a general method for training an ensemble of classifiers with a view to improving performance relative to that of a single classifier. While the original AdaBoost algorithm has been defined for classification tasks, the current work examines its applicability to sequence learning problems. In particular, different methods for training HMMs on sequences and for combining their output are investigated in the context of automatic speech recognition.

[2] 
C. Dimitrakakis and S. Bengio.
Online policy adaptation for ensemble classifiers. In European Symposium on Artificial Neural Networks, ESANN, 2004. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper, the idea of using an adaptive policy for training and combining the base classifiers is put forward. The effectiveness of this approach for online learning is demonstrated by experimental results on several UCI benchmark databases.

[3] 
C. Dimitrakakis and S. Bengio.
Boosting word error rates. In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 501504, 2005. [ .ps.gz  .pdf  .djvu  weblink  idiapRR  abstract] We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the wellknown Numbers 95 database that indicate the importance of this temporal probability distribution.

[4] 
C. Dimitrakakis and S. Bengio.
Online adaptive policies for ensemble classifiers. Neurocomputing, 64:211221, 2005. [ .ps.gz  .pdf  .djvu  weblink  abstract] Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper we attempt to train and combine the base classifiers using an adaptive policy. This policy is learnt through a Qlearning inspired technique. Its effectiveness for an essentially supervised task is demonstrated by experimental results on several UCI benchmark databases.

[1] 
S. Chiappa and S. Bengio.
HMM and IOHMM modeling of EEG rhythms for asynchronous BCI systems. In European Symposium on Artificial Neural Networks, ESANN, 2004. [ .ps.gz  .pdf  .djvu  idiapRR  abstract] We compare the use of two Markovian models, HMMs and IOHMMs, to discriminate between three mental tasks for brain computer interface systems using an asynchronous protocol. We show that IOHMMs outperform HMMs but that, probably due to the lack of any prior information on the state dynamics, no practical advantage in the use of these models over their static counterparts is obtained.

[2] 
M. Keller and S. Bengio.
Theme topic mixture model: A graphical model for document representation. In PASCAL Workshop on Learning Methods for Text Understanding and Mining, 2004. [ .ps.gz  .pdf  .djvu  weblink  abstract] Automatic Text Processing tasks, documents are usually represented in the bagofword space. However, this representation does not take into account the possible relations between words. We propose here a review of a family of document density estimation models for representing documents. Inside this family we derive another possible model: the Theme Topic Mixture Model (TTMM). This model assumes two types of relations among textual data. Topics link words to each other and Themes gather documents with particular distribution over the topics. An experiment reports the performance of the different models in this family over a common task.

[3] 
J.F. Paiement, S. Bengio, and D. Eck.
Probabilistic models for melodic prediction. Artificial Intelligence Journal, 173(14):12661274, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] Chord progressions are the building blocks from which tonal music is constructed. The choice of a particular representation for chords has a strong impact on statistical modeling of the dependence between chord symbols and the actual sequences of notes in polyphonic music. Melodic prediction is used in this paper as a benchmark task to evaluate the quality of four chord representations using two probabilistic model architectures derived from Input/Output Hidden Markov Models (IOHMMs). Likelihoods and conditional and unconditional prediction error rates are used as complementary measures of the quality of each of the proposed chord representations. We observe empirically that different chord representations are optimal depending on the chosen evaluation metric. Also, representing chords only by their roots appears to be a good compromise in most of the reported experiments.

[4] 
J.F. Paiement, D. Eck, and S. Bengio.
A probabilistic model for chord progressions. In International Conference on Music Information Retrieval, ISMIR, 2005. [ .ps.gz  .pdf  .djvu  abstract] Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional outofsample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[5] 
J.F. Paiement, D. Eck, and S. Bengio.
Probabilistic melodic harmonization. In L. Lamontagne and M. Marchand, editors, Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, volume LNCS 4013, pages 218229. SpringerVerlag, 2006. [ .ps.gz  .pdf  .djvu  weblink  abstract] We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural components in chord notation. We show empirically that root notes progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root notes progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.

[6] 
J.F. Paiement, D. Eck, S. Bengio, and D. Barber.
A graphical model for chord progressions embedded in a psychoacoustic space. In International Conference on Machine Learning, ICML, 2005. [ .ps.gz  .pdf  .djvu  abstract] Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional outofsample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.

[7] 
J.F. Paiement, Y. Grandvalet, and S. Bengio.
Predictive models for music. Connection Science, 21(2 & 3):253272, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] Modeling longterm dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce predictive models for melodies. We decompose melodic modeling into two subtasks. We first propose a rhythm model based on the distributions of distances between subsequences. Then, we define a generative model for melodies given chords and rhythms based on modeling sequences of Narmour features. The rhythm model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases. Using a similar evaluation procedure, the proposed melodic model consistently outperforms an Input/Output Hidden Markov Model. Furthermore, these models are able to generate realistic melodies given appropriate musical contexts.

[8] 
J.F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A generative model for rhythms. In NIPS Workshop on Brain, Music and Cognition, 2007. [ .ps.gz  .pdf  .djvu  abstract] Modeling music involves capturing longterm dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[9] 
J.F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A distance model for rhythms. In International Conference on Machine Learning, ICML, 2008. [ .ps.gz  .pdf  .djvu  abstract] Modeling longterm dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.

[10] 
D. Zhang, D. GaticaPerez, S. Bengio, and D. Roy.
Learning influence among interacting markov chains. In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005. [ .ps.gz  .pdf  .djvu  abstract] We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a twolevel structure: individuallevel and grouplevel. Individual level models actions of each player, and the grouplevel models actions of the team as a whole. Experiments on synthetic multiplayer games and a multiparty meeting corpus show the effectiveness of the proposed model.

[1] 
S. Bengio, L. Deng, H. Larochelle, H. Lee, and R. Salakhutdinov.
Guest editors' introduction: Special section on learning deep architectures. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35:17951797, 2013. [ .ps.gz  .pdf  .djvu  weblink ] 
[2] 
S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio.
Generating sentences from a continuous space. In SIGNLL Conference on Computational Natural Language Learning, CONLL, 2016. [ .ps.gz  .pdf  .djvu  abstract] The standard recurrent neural network language model (rnnlm) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an rnnbased variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and highlevel syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and wellformed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.

[3] 
J. Chen, R. Monga, S. Bengio, and R. Jozefowicz.
Revisiting distributed synchronous SGD. In Workshop Track of the International Conference on Learning Representations, ICLR, 2016. [ .ps.gz  .pdf  .djvu  weblink ] 
[4] 
M. Cissé, M. AlShedivat, and S. Bengio.
ADIOS: Architectures deep in output space. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016. [ .ps.gz  .pdf  .djvu  abstract] Multilabel classification is a generalization of binary classification where the task consists in predicting sets of labels. With the availability of ever larger datasets, the multilabel setting has become a natural one in many applications, and the interest in solving multilabel problems has grown significantly. As expected, deep learning approaches are now yielding stateoftheart performance for this class of problems. Unfortunately, they usually do not take into account the often unknown but nevertheless rich relationships between labels. In this paper, we propose to make use of this underlying structure by learning to partition the labels into a Markov Blanket Chain and then applying a novel deep architecture that exploits the partition. Experiments on several popular and large multilabel datasets demonstrate that our approach not only yields significant improvements, but also helps to overcome tradeoffs specific to the multilabel classification setting.

[5] 
D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, and S. Bengio.
Why does unsupervised pretraining help deep learning? Journal of Machine Learning Research, JMLR, 11:625660, 2010. [ .ps.gz  .pdf  .djvu  weblink  abstract] Much recent research has been devoted to learning algorithms for deep architectures such as Deep Belief Networks and stacks of autoencoder variants, with impressive results obtained in several areas, mostly on vision and language data sets. The best results obtained on supervised learning tasks involve an unsupervised learning component, usually in an unsupervised pretraining phase. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. The main question investigated here is the following: how does unsupervised pretraining work? Answering this questions is important if learning in deep architectures is to be further improved. We propose several explanatory hypotheses and test them through extensive simulations. We empirically show the influence of pretraining with respect to architecture depth, model capacity, and number of training examples. The experiments confirm and clarify the advantage of unsupervised pretraining. The results suggest that unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training data set; the evidence from these results supports a regularization explanation for the effect of pretraining.

[6] 
D. Erhan, P.A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent.
The difficulty of training deep architectures and the effect of unsupervised pretraining. In D. van Dyk and M. Wellings, editors, Proceedings of The Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 5 of JMLR Workshop and Conference Procedings, pages 153160, 2009. [ .ps.gz  .pdf  .djvu  weblink  abstract] Whereas theoretical work suggests that deep architectures might be more efficient at representing highlyvarying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pretraining. Even though these new algorithms have enabled training deep models, many questions remain as to the nature of this difficult learning problem. Answering these questions is important if learning in deep architectures is to be further improved. We attempt to shed some light on these questions through extensive simulations. The experiments confirm and clarify the advantage of unsupervised pretraining. They demonstrate the robustness of the training procedure with respect to the random initialization, the positive effect of pretraining in terms of optimization and its role as a regularizer. We empirically show the influence of pretraining with respect to architecture depth, model capacity, and number of training examples.

[7] 
N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio.
A neural transducer. In Advances In Neural Information Processing Systems, NIPS, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Sequencetosequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequencetosequence models, the Neural Transducer computes the nextstep distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.

[8] 
M. Norouzi, S. Bengio, Z. Chen, N. Jaitly, M. Schuster, Y. Wu, and
D. Schuurmans.
Reward augmented maximum likelihood for neural structured prediction. In Advances In Neural Information Processing Systems, NIPS, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] A key problem in structured output prediction is direct optimization of the task reward function that matters for test evaluation. This paper presents a simple and computationally efficient approach to incorporate task reward into a maximum likelihood framework. We establish a connection between the loglikelihood and regularized expected reward objectives, showing that at a zero temperature, they are approximately equivalent in the vicinity of the optimal solution. We show that optimal regularized expected reward is achieved when the conditional distribution of the outputs given the inputs is proportional to their exponentiated (temperature adjusted) rewards. Based on this observation, we optimize conditional logprobability of edited outputs that are sampled proportionally to their scaled exponentiated reward. We apply this framework to optimize edit distance in the output label space. Experiments on speech recognition and machine translation for neural sequence to sequence models show notable improvements over a maximum likelihood baseline by using edit distance augmented maximum likelihood

[9] 
V. Ramanathan, J. Deng, C. Li, W. Han, Z. Li, K. Gu, Y. Song, S. Bengio,
C. Rosenberg, and F.F. Li.
Learning semantic relationships for better action retrieval in images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015. [ .ps.gz  .pdf  .djvu  abstract] Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationship between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We would like our method to reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationship between actions and uses them for training better action recognition models. Our model incorporates linguistic, visual and logical consistency based cues to effectively identify theses relationships. We train and test our model on a new largescale image dataset of human actions under two settings with 27K and 2K actions. We show a significant improvement in mean AP compared to different baseline methods including the stateoftheart HEXgraph approach from Deng et al.

[10] 
O. Vinyals, S. Bengio, and M. Kudlur.
Order matters: Sequence to sequence for sets. In International Conference on Learning Representations, ICLR, 2016. [ .ps.gz  .pdf  .djvu  weblink  abstract] Sequences have become first class citizens in supervised learning thanks to the resurgence of recurrent neural networks. Many complex tasks that require mapping from or to a sequence of observations can now be formulated with the sequencetosequence (seq2seq) framework which employs the chain rule to efficiently represent the joint probability of sequences. In many cases, however, variable sized inputs and/or outputs might not be naturally expressed as sequences. For instance, it is not clear how to input a set of numbers into a model where the task is to sort them; similarly, we do not know how to organize outputs when they correspond to random variables and the task is to model their unknown joint probability. In this paper, we first show using various examples that the order in which we organize input and/or output data matters significantly when learning an underlying model. We then discuss an extension of the seq2seq framework that goes beyond sequences and handles input sets in a principled way. In addition, we propose a loss which, by searching over possible orders during training, deals with the lack of structure of output sets. We show empirical evidence of our claims regarding ordering, and on the modifications to the seq2seq framework on benchmark language modeling and parsing tasks, as well as two artificial tasks  sorting numbers and estimating the joint probability of unknown graphical models.

[11] 
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan.
Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015. [ .ps.gz  .pdf  .djvu  weblink  abstract] Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current stateoftheart BLEU1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU4 of 27.7, which is the current stateoftheart.
