Samy Bengio - Publications

Some of the files below are copyrighted. They are provided for your convenience, yet you may download them only if you are entitled to do so by your arrangements with the various publishers.

By Year Referred Publications: 2008 2007 2006 2005 2004 2003 2002 2001 2000 before 2000
Technical Reports: 2008 2007 2006 2005 2004 2003 2002 2001 2000 before 2000

By Topic Large Scale Learning Kernel Machines Ensembles Indexing Biometric Authentication Multimodal
Handwritten Recognition Time Series Geostatistics Speech Graphical Models

Referred Publications

2008

 
G. Chechik, E. Ie, M. Rehn, S. Bengio, and D. Lyon.
Large-scale content-based audio retrieval from text queries.
In ACM International Conference on Multimedia Information Retrieval, MIR, 2008.
[ .ps.gz | .pdf | .djvu | abstract]
In content-based audio retrieval, the goal is to find sound recordings (audio documents) based on their acoustic features. This content-based approach differs from retrieval approaches that index media files using metadata such as file names and user tags. In this paper, we propose a machine learning approach for retrieving sounds that is novel in that it (1) uses free-form text queries rather sound sample based queries, (2) searches by audio content rather than via textual meta data, and (3) can scale to very large number of audio documents and very rich query vocabulary. We handle generic sounds, including a wide variety of sound effects, animal vocalizations and natural scenes. We test a scalable approach based on a passive-aggressive model for image retrieval (PAMIR), and compare it to two state-of-the-art approaches; Gaussian mixture models (GMM) and support vector machines (SVM). We test our approach on two large real-world datasets: a collection of short sound effects, and a noisier and larger collection of user-contributed user-labeled recordings (25K files, 2000 terms vocabulary). We find that all three methods achieved very good retrieval performance. For instance, a positive document is retrieved in the first position of the ranking more than half the time, and on average there are more than 4 positive documents in the first 10 retrieved, for both datasets. PAMIR completed both training and retrieval of all data in less than 6 hours for both datasets, on a single machine. It was one to three orders of magnitude faster than the competing approaches. This approach should therefore scale to much larger datasets in the future.
 
D. Grangier and S. Bengio.
A discriminative kernel-based model to rank images from text queries.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 30(8):1371-1384, 2008.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper introduces a discriminative model for the retrieval of images from text queries. Our approach formalizes the retrieval task as a ranking problem, and introduces a learning procedure optimizing a criterion related to the ranking performance. The proposed model hence addresses the retrieval problem directly and does not rely on an intermediate image annotation task, which contrasts with previous research. Moreover, our learning procedure builds upon recent work on the online learning of kernel-based classifiers. This yields an efficient, scalable algorithm, which can benefit from recent kernels developed for image comparison. The experiments performed over stock photography data show the advantage of our discriminative ranking approach over state-of-the-art alternatives (e.g. our model yields 26.3% average precision over the Corel dataset, which should be compared to 22.0%, for the best alternative model evaluated). Further analysis of the results shows that our model is especially advantageous over difficult queries such as queries with few relevant pictures or multiple-word queries.
 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
Speech Communication, to be published, 2008.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulted discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI-Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system.
 
J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A distance model for rhythms.
In International Conference on Machine Learning, (ICML), 2008.
[ .ps.gz | .pdf | .djvu | abstract]
Modeling long-term dependencies in time series has proved very difficult to achieve with traditional machine learning methods. This problem occurs when considering music data. In this paper, we introduce a model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
 
H. Paugam-Moisy, R. Martinez, and S. Bengio.
Delay learning and polychronization for reservoir computing.
Neurocomputing, 71(7-9):1143-1158, 2008.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
We propose a multi-timescale learning rule for spiking neuron networks, in the line of the recently emerging field of reservoir computing. The reservoir is a network model of spiking neurons, with random topology and driven by STDP (Spike-Time-Dependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that affects the synaptic delays linking the network to the readout neurons, with classification as a goal task. The network processing and the resulting performance can be explained by the concept of polychronization, proposed by Izhikevich (2006, Neural Computation, 18:2), on physiological grounds. The model emphasizes that polychronization can be used as a tool for exploiting the computational power of synaptic delays and for monitoring the topology and activity of a spiking neuron network.

2007

 
S. Bengio and J. Mariéthoz.
Biometric person authentication is a multiple classifier problem.
In M. Haindl, J. Kittler, and F. Roli, editors, 7th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 4472. Springer-Verlag, 2007.
[ .ps.gz | .pdf | .djvu | abstract]
Several papers have already shown the interest of using multiple classifiers in order to enhance the performance of biometric person authentication systems. In this paper, we would like to argue that the core task of Biometric Person Authentication is actually a multiple classifier problem as such: indeed, in order to reach state-of-the-art performance, we argue that all current systems , in one way or another, try to solve several tasks simultaneously and that without such joint training (or sharing), they would not succeed as well. We explain hereafter this perspective, and according to it, we propose some ways to take advantage of it, ranging from more parameter sharing to similarity learning.
 
D. Grangier and S. Bengio.
Learning the inter-frame distance for discriminative template-based keyword detection.
In Proceedings of the 10th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2007.
[ .ps.gz | .pdf | .djvu | abstract]
This paper proposes a discriminative approach to template-based keyword detection. We introduce a method to learn the distance used to compare acoustic frames, a crucial element for template matching approaches. The proposed algorithm estimates the distance from data, with the objective to produce a detector maximizing the Area Under the receiver operating Curve (AUC), i.e. the standard evaluation measure for the keyword detection problem. The experiments performed over a large corpus, SpeechDatII, suggest that our model is effective compared to an HMM system, e.g. the proposed approach reaches 93.8% of averaged AUC compared to 87.9% for the HMM.
 
J. Keshet, D. Grangier, and S. Bengio.
Discriminative keyword spotting.
In ISCA Research Workshop on Non Linear Speech Processing, NOLISP, 2007.
[ .ps.gz | .pdf | .djvu | abstract]
This paper proposes a new approach for keyword spotting, which is not based on HMMs. The proposed method employs a new discriminative learning procedure, in which the learning phase aims at maximizing the area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on non-linearly mapping the input acoustic representation of the speech utterance along with the target keyword into an abstract vector space. Building on techniques used for large margin methods for predicting whole sequences, our keyword spotter distills to a classifier in the abstract vector-space which separates speech utterances in which the keyword was uttered from speech utterances in which the keyword was not uttered. We describe a simple iterative algorithm for learning the keyword spotter and discuss its formal properties. Experiments with the TIMIT corpus show that our method outperforms the conventional HMM-based approach.
 
J. Mariéthoz and S. Bengio.
A kernel trick for sequences applied to text-independent speaker verification systems.
Pattern Recognition, 40:2315-2324, 2007.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper present a principled SVM based speaker verification system. We propose a new framework and a new sequence kernel that can make use of any Mercer kernel at the frame level. An extension of the sequence kernel based on the Max operator is also proposed. The new system is compared to state-of-the-art GMM and other SVM based systems found in the literature on the Banca and Polyvar databases. The new system outperforms, most of the time, the other systems, statistically significantly. Finally, the new proposed framework clarifies previous SVM based systems and suggests interesting future research directions.
 
J.-F. Paiement, Y. Grandvalet, S. Bengio, and D. Eck.
A generative model for rhythms.
In NIPS Workshop on Brain, Music and Cognition, 2007.
[ .ps.gz | .pdf | .djvu | abstract]
Modeling music involves capturing long-term dependencies in time series, which has proved very difficult to achieve with traditional statistical methods. The same problem occurs when only considering rhythms. In this paper, we introduce a generative model for rhythms based on the distributions of distances between subsequences. A specific implementation of the model when considering Hamming distances over a simple rhythm representation is described. The proposed model consistently outperforms a standard Hidden Markov Model in terms of conditional prediction accuracy on two different music databases.
 
H. Paugam-Moisy, R. Martinez, and S. Bengio.
A supervised learning approach based on STDP and polychronization in spiking neuron networks.
In European Symposium on Artificial Neural Networks, ESANN, 2007.
[ .ps.gz | .pdf | .djvu | idiap-RR | abstract]
We propose a network model of spiking neurons, without preimposed topology and driven by STDP (Spike-Time-Dependent Plasticity), a temporal Hebbian unsupervised learning mode, biologically observed. The model is further driven by a supervised learning algorithm, based on a margin criterion, that has effect on the synaptic delays linking the network to the output neurons, with classification as a goal task. The network processing and the resulting performance are completely explainable by the concept of polychronization, proposed by Izhikevich [?]. The model emphasizes the computational capabilities of this concept.
 
N. Poh and S. Bengio.
Estimating the confidence interval of expected performance curve in biometric authentication using joint bootstrap.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2007.
[ .ps.gz | .pdf | .djvu | abstract]
Evaluating biometric authentication performance is a complex task because the performance depends on the user set size, composition and the choice of samples. We propose to reduce the performance dependency of these three factors by deriving appropriate confidence intervals. In this study, we focus on deriving a confidence region based on the recently proposed Expected Performance Curve (EPC). An EPC is different from the conventional DET or ROC curve because an EPC assumes that the test class-conditional (client and impostor) score distributions are unknown and this includes the choice of the decision threshold for various operating points. Instead, an EPC selects thresholds based on the training set and applies them on the test set. The proposed technique is useful, for example, to quote realistic upper and lower bounds of the decision cost function used in the NIST annual speaker evaluation. Our findings, based on the 24 systems submitted to the NIST2005 evaluation, show that the confidence region obtained from our proposed algorithm can correctly predict the performance of an unseen database with two times more users with an average coverage of 95% (over all the 24 systems). A coverage is the proportion of the unseen EPC covered by the derived confidence interval.
 
N. Poh, A. Martin, and S. Bengio.
Performance generalization in biometric authentication using joint user-specific and sample bootstraps.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(3):492-498, 2007.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
Biometric authentication performance is often depicted by a decision error trade-off (DET) curve. We show that this curve is dependent on the choice of samples available, the demographic composition and the number of users specific to a database. We propose a two-step bootstrap procedure to take into account of the three mentioned sources of variability. This is an extension to the Bolle 's bootstrap subset technique. Preliminary experiments on the NIST2005 and XM2VTS benchmark databases are encouraging, e.g., the average result across all 24 systems evaluated on NIST2005 indicates that one can predict, with more than 75% of DET coverage, an unseen DET curve with 8 times more users. Furthermore, our finding suggests that with more data available, the confidence intervals become smaller and hence more useful.
 
S. Renals, S. Bengio, and J. G. Fiscus, editors.
Machine Learning for Multimodal Interaction: Third International Workshop, MLMI'2006.
volume 4299 of Lecture Notes in Computer Science. Springer-Verlag, 2007.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the 3rd Workshop on Machine Learning for Multimodal Interaction (MLMI 2006), held in Bethesda MD, USA during May 1­4, 2006. The workshop was organized and sponsored jointly by the US National Institute for Standards and Technology (NIST), three projects supported by the European Commission (Information Society Technologies priority of the sixth Framework Programme) - the AMI and CHIL Integrated Projects, and the PASCAL Network of Excellence - and the Swiss National Science Foundation national research collaboration, IM2. In addition to the main workshop, MLMI 2006 was co-located with the 4th NIST Meeting Recognition Workshop. This workshop was centered on the Rich Transcription 2006 Spring Meeting Recognition (RT-06) evaluation of speech technologies within the meeting domain. Building on the success of previous evaluations in this domain, the RT-06 evaluation continued evaluation tasks in the areas of speech-to-text, who-spoke-when, and speech activity detection. The conference program featured invited talks, full papers (subject to careful peer review, by at least three reviewers), and posters (accepted on the basis of abstracts) covering a wide range of areas related to machine learning applied to multimodal interaction - and more specifically to multimodal meeting processing, as addressed by the various sponsoring projects. These areas included human­human communication modeling, speech and visual processing, multimodal processing, fusion and fission, human­computer interaction, and the modeling of discourse and dialog, with an emphasis on the application of machine learning. Out of the submitted full papers, about 50% were accepted for publication in the present volume, after authors had been invited to take review comments and conference feedback into account. The workshop featured invited talks from Roderick Murray-Smith (University of Glasgow), Tsuhan Chen (Carnegie Mellon University) and David McNeill (University of Chicago), and a special session on projects in the area of multimodal interaction including presentations on the VACE, CHIL and AMI projects.
 
S. Sonnenburg, M. L. Braun, C. Soon Ong, S Bengio, L. Bottou, G. Holmes, Y. LeCun, K.-R. Müller, F. Pereira, C. E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R. Williamson.
The need for open source software in machine learning.
Journal of Machine Learning Research, 8:2443-2466, 2007.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not utilized, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community.

2006

 
F. Cardinaux, C. Sanderson, and S. Bengio.
User authentication via adapted statistical models of face images.
IEEE Transactions on Signal Processing, 54(1):361-373, 2006.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
It has been previously demonstrated that systems based on local features and relatively complex statistical models, namely 1D Hidden Markov Models (HMMs) and pseudo-2D HMMs, are suitable for face recognition. Recently, a simpler statistical model, namely the Gaussian Mixture Model (GMM), was also shown to perform well. In much of the literature devoted to these models, the experiments were performed with controlled images (manual face localization, controlled lighting, background, pose, etc). However, a practical recognition system has to be robust to more challenging conditions. In this article we evaluate, on the relatively difficult BANCA database, the performance, robustness and complexity of GMM and HMM based approaches, using both manual and automatic face localization. We extend the GMM approach through the use of local features with embedded positional information, increasing performance without sacrificing its low complexity. Furthermore, we show that the traditionally used Maximum Likelihood (ML) training approach has problems estimating robust model parameters when there is only a few training images available. Considerably more precise models can be obtained through the use of Maximum a Posteriori (MAP) training. We also show that face recognition techniques which obtain good performance on manually located faces do not necessarily obtain good performance on automatically located faces, indicating that recognition techniques must be designed from the ground up to handle imperfect localization. Finally, we show that while the pseudo-2D HMM approach has the best overall performance, authentication time on current hardware makes it impractical. The best trade-off in terms of authentication time, robustness and discrimination performance is achieved by the extended GMM approach.
 
O. Glickman, I. Dagan, M. Keller, S. Bengio, and W. Daelemans.
Investigating lexical substitution scoring for subtitle generation.
In Tenth Conference on Computational Natural Language Learning, CONLL, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
This paper investigates an isolated setting of the lexical substitution task of replacing words with their synonyms. In particular, we examine this problem in the setting of subtitle generation and evaluate state of the art scoring methods that predict the validity of a given substitution. The paper evaluates two context independent models and two contextual models. The major findings suggest that distributional similarity provides a useful complementary estimate for the likelihood that two Wordnet synonyms are indeed substitutable, while proper modeling of contextual constraints is still a challenging task for future research.
 
D. Grangier and S. Bengio.
A neural network to retrieve images from text queries.
In Proceedings of the 16th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 4132. Springer-Verlag, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This work presents a neural network for the retrieval of images from text queries. The proposed network is composed of two main modules: the first one extracts a global picture representation from local block descriptors while the second one aims at solving the retrieval problem from the extracted representation. Both modules are trained jointly to minimize a loss related to the retrieval performance. This approach is shown to be advantageous when compared to previous models relying on unsupervised feature extraction: average precision over Corel queries reaches 26.2% for our model, which should be compared to 21.6% for PAMIR, the best alternative.
 
D. Grangier, F. Monay, and S. Bengio.
A discriminative approach for the retrieval of images from text queries.
In European Conference on Machine Learning, ECML, Lecture Notes in Computer Science, volume LNCS 4212. Springer-Verlag, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This work proposes a new approach to the retrieval of images from text queries. Contrasting with previous work, this method relies on a discriminative model: the parameters are selected in order to minimize a loss related to the ranking performance of the model, i.e. its ability to rank the relevant pictures above the non-relevant ones when given a text query. In order to minimize this loss, we introduce an adaptation of the recently proposed Passive-Aggressive algorithm. The generalization performance of this approach is then compared with alternative models over the Corel dataset. These experiments show that our method outperforms the current state-of-the-art approaches, e.g. the average precision over Corel test data is 21.6% for our model versus 16.7% for the best alternative, Probabilistic Latent Semantic Analysis.
 
D. Grangier, F. Monay, and S. Bengio.
Learning to retrieve images from text queries with a discriminative model.
In International Workshop on Adaptive Multimedia Retrieval, AMR, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
This work presents a discriminative model for the retrieval of pictures from text queries. The core idea of this approach is to minimize a loss directly related to the retrieval performance of the model. For that purpose, we rely on a ranking loss which has recently been successfully applied to text retrieval problems. The experiments performed over the Corel dataset show that our approach compares favorably with generative models that constitute the state-of-the-art (e.g. our model reaches 21.6% mean average precision with Blob and SIFT features, compared to 16.7% for PLSA, the best alternative).
 
J. Keshet, S. Shalev-Shwartz, S. Bengio, Y. Singer, and D. Chazan.
Discriminative kernel-based phoneme sequence recognition.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
We describe a new method for phoneme sequence recognition given a speech utterance, which is not based on the HMM. In contrast to HMM-based approaches, our method uses a discriminative kernel-based training procedure in which the learning process is tailored to the goal of minimizing the Levenshtein distance between the predicted phoneme sequence and the correct sequence. The phoneme sequence predictor is devised by mapping the speech utterance along with a proposed phoneme sequence to a vector-space endowed with an inner-product that is realized by a Mercer kernel. Building on large margin techniques for predicting whole sequences, we are able to devise a learning algorithm which distills to separating the correct phoneme sequence from all other sequences. We describe an iterative algorithm for learning the phoneme sequence recognizer and further describe an efficient implementation of it. We present initial encouraging experimental results with the TIMIT and compare the proposed method to an HMM-based approach.
 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Posterior based keyword spotting with a priori thresholds.
In Proceedings of the International Conference on Spoken Language Processing, Interspeech-ICSLP, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
In this paper, we propose a new posterior based scoring approach for keyword and non keyword (garbage) elements. The estimation of these scores is based on HMM state posterior probability definition, taking into account long contextual information and the prior knowledge (e.g. keyword model topology). The state posteriors are then integrated into keyword and garbage posteriors for every frame. These posteriors are used to make a decision on detection of the keyword at each frame. The frame level decisions are then accumulated (in this case, by counting) to make a global decision on having the keyword in the utterance. In this way, the contribution of possible outliers are minimized, as opposed to the conventional Viterbi decoding approach which accumulates likelihoods. Experiments on keywords from the Conversational Telephone Speech (CTS) and Numbers'95 databases are reported. Results show that the new scoring approach leads to better trade off between true and false alarms compared to the Viterbi decoding approach, while also providing the possibility to precalculate keyword specific spotting thresholds related to the length of the keywords.
 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Using more informative posterior probabilities for speech recognition.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
In this paper, we present initial investigations towards boosting posterior probability based speech recognition systems by estimating more informative posteriors taking into account acoustic context (e.g., the whole utterance), as well as possible prior information (such as phonetic and lexical knowledge). These posteriors are estimated based on HMM state posterior probability definition (typically used in standard HMMs training). This approach provides a new, principled, theoretical framework for hierarchical estimation/use of more informative posteriors integrating appropriate context and prior knowledge. In the present work, we used the resulting posteriors as local scores for decoding. On the OGI numbers database, this resulted in significant performance improvement, compared to using MLP estimated posteriors for decoding (hybrid HMM/ANN approach) for clean and more specially for noisy speech. The system is also shown to be much less sensitive to tuning factors (such as phone deletion penalty, language model scaling) compared to the standard HMM/ANN and HMM/GMM systems, thus practically it does not need to be tuned to achieve the best possible performance.
 
M. Liwicki, A. Schlapbach, H. Bunke, S. Bengio, J. Mariéthoz, and J. Richiardi.
Writer identification for smart meeting room systems.
In H. Bunke and A. L. Spitz, editors, Document Analysis Systems VII: 7th International Workshop, DAS, Lecture Notes in Computer Science, volume LNCS 3872, pages 186-195. Springer-Verlag, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
In this paper we present a text independent on-line writer identification system based on Gaussian Mixture Models (GMMs). This system has been developed in the context of research on Smart Meeting Rooms. The GMMs in our system are trained using two sets of features extracted from a text line. The first feature set is similar to feature sets used in signature verification systems before. It consists of information gathered for each recorded point of the handwriting, while the second feature set contains features extracted from each stroke. While both feature sets perform very favorably, the stroke-based feature set outperforms the point-based feature set in our experiments. We achieve a writer identification rate of 100% for writer sets with up to 100 writers. Increasing the number of writers to 200, the identification rate decreases to 94.75%.
 
J. Mariéthoz and S. Bengio.
A max kernel for text-independent speaker verification systems.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
In this paper, we present a principled SVM based speaker verification system. A general approach is developed that enables the use of any kernel at the frame level. An extension of his approach using the Max operator is then proposed. The new system is then compared to state-of-the-art GMM and other SVM based systems found in the literature on the Polyvar database. It is found that the new system outperforms, most of the time, the other systems, statistically significantly.
 
J.-F. Paiement, D. Eck, and S. Bengio.
Probabilistic melodic harmonization.
In L. Lamontagne and M. Marchand, editors, Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI, Lecture Notes in Computer Science, volume LNCS 4013, pages 218-229. Springer-Verlag, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
We propose a representation for musical chords that allows us to include domain knowledge in probabilistic models. We then introduce a graphical model for harmonization of melodies that considers every structural components in chord notation. We show empirically that root notes progressions exhibit global dependencies that can be better captured with a tree structure related to the meter than with a simple dynamical HMM that concentrates on local dependencies. However, a local model seems to be sufficient for generating proper harmonizations when root notes progressions are provided. The trained probabilistic models can be sampled to generate very interesting chord progressions given other polyphonic music components such as melody or root note progressions.
 
N. Poh and S. Bengio.
Chimeric users to construct fusion classifiers in biometric authentication tasks: An investigation.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue - the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a real-user database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database?. Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 × 1000 chimeric data sets and two fusion operators), our previous study [Poh and Bengio, MLMI'05] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.
 
N. Poh and S. Bengio.
Database, protocol and tools for evaluating score-level fusion algorithms in biometric authentication.
Pattern Recognition, 39(2):223-233, 2006.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinate d effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.
 
N. Poh, S. Bengio, and A. Ross.
Revisiting Doddington's zoo: A systematic method to assess user-dependent variabilities.
In Second Workshop on Multimodal User Authentication, MMUA, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
Chimeric users have recently been proposed in the field of biometric person authentication as a way to overcome the problem of lack of real multimodal biometric databases as well as an important privacy issue - the fact that too many biometric modalities of a same person stored in a single location can present a higher risk of identity theft. While the privacy problem is indeed solved using chimeric users, it is still an open question of how such chimeric database can be efficiently used. For instance, the following two questions arise: i) Is the performance measured on a chimeric database a good predictor of that measured on a real-user database?, and, ii) can a chimeric database be exploited to improve the generalization performance of a fusion operator on a real-user database?. Based on a considerable amount of empirical biometric person authentication experiments (21 real-user data sets and up to 21 × 1000 chimeric data sets and two fusion operators), our previous study [?] answers no to the first question. The current study aims to answer the second question. Having tested on four classifiers and as many as 3380 face and speech bimodal fusion tasks (over 4 different protocols) on the BANCA database and four different fusion operators, this study shows that generating multiple chimeric databases does not degrade nor improve the performance of a fusion operator when tested on a real-user database with respect to using only a real-user database. Considering the possibly expensive cost involved in collecting the real-user multimodal data, our proposed approach is thus useful to construct a trainable fusion classifier while at the same time being able to overcome the problem of small size training data.
 
A. Pozdnoukhov and S. Bengio.
Graph-based invariant manifolds for invariant pattern recognition with kernel methods.
In International Conference on Pattern Recognition, ICPR, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
We present here an approach for applying the technique of modeling data transformation manifolds for invariant learning with kernel methods. The approach is based on building a kernel function on the graph modeling the invariant manifold. It provides a way for taking into account nearly arbitrary transformations of the input samples. The approach is verified experimentally on the task of optical character recognition, providing state-of-the-art performance on harder problem settings.
 
A. Pozdnoukhov and S. Bengio.
Invariances in kernel methods: From samples to objects.
Pattern Recognition Letters, 27(10):1087-1097, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents a general method for incorporating prior knowledge into kernel methods such as support vector machines. It applies when the prior knowledge can be formalized by the description of an object around each sample of the training set, assuming that all points in the given object share the same desired class. A number of implementation techniques of this method, based on hard geometrical objects and soft objects based on distributions are considered. Tangent vectors are extensively used for object construction. Empirical results on one artificial dataset and two real datasets of electro-encephalogram signals and face images demonstrate the usefulness of the proposed method. The method could establish a foundation for an information retrieval and person identification systems.
 
A. Pozdnoukhov and S. Bengio.
Semi-supervised kernel methods for regression estimation.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
The paper presents a semi-supervised kernel method for regression estimation in the presence of unlabeled patterns. The method exploits a recently proposed data-dependent kernel which is constructed in order to represent the inner geometry of the data. This kernel is implemented into Kernel Regression methods (SVR, KRR). Experimental results aim to highlight the properties of the method and its advantages as compared to fully supervised approaches. The influence of the parameters on the model properties was evaluated experimentally. One artificial and two real-world datasets were used to demonstrate the performance of the proposed algorithm.
 
S. Renals and S. Bengio, editors.
Machine Learning for Multimodal Interaction: Second International Workshop, MLMI'2005.
volume 3869 of Lecture Notes in Computer Science. Springer-Verlag, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the Second Workshop on Machine Learning for Multimodal Interaction (MLMI 2005), held in Edinburgh, Scotland, during 11-13 July 2005. The workshop was organized and sponsored jointly by two European integrated projects, three European Networks of Excellence and a Swiss national research network: AMI, CHIL, HUMAINE, PASCAL, SIMILAR, and IM2. In addition to the main workshop, MLMI 2005 hosted the NIST (US National Institute of Standards and Technology) Meeting Recognition Workshop. This workshop (the third such sponsored by NIST) was centered on the Rich Transcription 2005 Spring Meeting Recognition (RT-05) evaluation of speech technologies within the meeting domain. Building on the success of the RT-04 spring evaluation, the RT-05 evaluation continued the speech-to-text and speaker diarization evaluation tasks and added two new evaluation tasks: speech activity detection and source localization. Given the multiple links between the above projects and several related research areas, and the success of the first MLMI 2004 workshop, it was decided to organize once again a joint workshop bringing together researchers working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction. The motivation for creating such a forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multidisciplinary workshop. The areas covered included: Human-human communication modeling, Speech and visual processing, Multimodal processing, fusion and fission, Multimodal dialog modeling, Human-human interaction modeling, Multimodal data structuring and presentation, Multimedia indexing and retrieval, Meeting structure analysis, Meeting summarizing, Multimodal meeting annotation, and Machine learning applied to the above.
 
Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariéthoz.
Measuring the performance of face localization systems.
Image and Vision Computing, 24(8):882-893, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
The purpose of Face localization is to determine the coordinates of a face in a given image. It is a fundamental research area in computer vision because it serves, as a necessary first step in any face processing system, such as automatic face recognition, face tracking or expression analysis. Most of these techniques assume, in general, that the face region has been perfectly localized. Therefore, their performances depend widely on the accuracy of the face localization process. The purpose of this paper is to mainly show that the error made during the localization process may have different impacts on the final application. We first show the influence of localization errors on the face verification task and then empirically demonstrate the problems of current localization performance measures when applied to this task. In order to properly evaluate the performance of a face localization algorithm, we then propose to embed the final application (here face verification) into the performance measuring process. Using two benchmark databases, BANCA and XM2VTS, we proceed by showing empirically that our proposed method to evaluate localization algorithms better matches the final verification performance.
 
C. Sanderson, S. Bengio, and Y. Gao.
On transforming statistical models for non-frontal face verification.
Pattern Recognition, 39(2):288-302, 2006.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
We address the pose mismatch problem which can occur in face verification systems that have only a single (frontal) face image available for training. In the framework of a Bayesian classifier based on mixtures of gaussians, the problem is tackled through extending each frontal face model with artificially synthesized models for non-frontal views. The synthesis methods are based on several implementations of Maximum Likelihood Linear Regression (MLLR), as well as standard multi-variate linear regression (LinReg). All synthesis techniques rely on prior information and learn how face models for the frontal view are related to face models for non-frontal views. The synthesis and extension approach is evaluated by applying it to two face verification systems: a holistic system (based on PCA-derived features) and a local feature system (based on DCT-derived features). Experiments on the FERET database suggest that for the holistic system, the LinReg based technique is more suited than the MLLR based techniques; for the local feature system, the results show that synthesis via a new MLLR implementation obtains better performance than synthesis based on traditional MLLR. The results further suggest that extending frontal models considerably reduces errors. It is also shown that the local feature system is less affected by view changes than the holistic system; this can be attributed to the parts based representation of the face, and, due to the classifier based on mixtures of gaussians, the lack of constraints on spatial relations between the face parts, allowing for deformations and movements of face areas.
 
D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Modeling individual and group actions in meetings with layered HMMs.
IEEE Transactions on Multimedia, 8(3):509-520, 2006.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of recognizing sequences of human interaction patterns in meetings, with the goal of structuring them in semantic terms. The investigated patterns are inherently group-based (defined by the individual activities of meeting participants, and their interplay), and multimodal (as captured by cameras and microphones). By defining a proper set of individual actions, group actions can be modeled as a two-layer process, one that models basic individual activities from low-level audio-visual features, and another one that models the interactions. We propose a two-layer Hidden Markov Model (HMM) framework that implements such concept in a principled manner, and that has advantages over previous works. First, by decomposing the problem hierarchically, learning is performed on low-dimensional observation spaces, which results in simpler models. Second, our framework is easier to interpret, as both individual and group actions have a clear meaning, and thus easier to improve. Third, different HMM models can be used in each layer, to better reflect the nature of each subproblem. Our framework is general and extensible, and we illustrate it with a set of eight group actions, using a public five-hour meeting corpus. Experiments and comparison with a single-layer HMM baseline system show its validity.
 
D. Zhang, D. Gatica-Perez, D. Roy, and S. Bengio.
Modeling interactions from email communication.
In IEEE International Conference on Multimedia & Expo, ICME, 2006.
[ .ps.gz | .pdf | .djvu | abstract]
Email plays an important role as a medium for the spread of information, ideas, and influence among its users. We present a framework to learn topic-based interactions between pairs of email users, i.e., the extent to which the email topic dynamics of one user are likely to be affected by the others. The proposed framework is built on the influence model and the probabilistic latent semantic analysis (PLSA) language model. This paper makes two contributions. First, we model interactions between email users using the semantic content of email body, instead of email header. Second, our framework models not only email topic dynamics of individual email users, but also the interactions within a group of individuals. Experiments on the Enron email corpus show some interesting results that are potentially useful to discover the hierarchy of the Enron organization.

2005

 
S. Bengio and H. Bourlard, editors.
Machine Learning for Multimodal Interaction: First International Workshop, MLMI'2004.
volume 3361 of Lecture Notes in Computer Science. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This book contains a selection of refereed papers presented at the First Workshop on Machine Learning for Multimodal Interaction (MLMI'04), held in Martigny, Switzerland, from June 21-23, 2004. The workshop was organized and sponsored jointly by three European projects, AMI, PASCAL and M4, as well as a Swiss national research network, IM2. It brings together researchers from different communities working around the common theme of advanced machine learning algorithms for processing and structuring multimodal human interaction in meetings. The motivation for creating such forum, which could be perceived as a number of papers from different research disciplines, evolved from an actual need that arose from these projects and the strong motivation of their partners for such a multi-disciplinary workshop. The conference program covered a wide range of areas related to machine learning applied to multimodal interaction - and more specifically to multi-modal meeting processing. These areas included human-human communication modeling, speech and visual processing, multi-modal processing, fusion and fission, multi-modal dialog modeling, human-human interaction modeling, multi-modal data structuring and presentation, multimedia indexing and retrieval, meeting structure analysis, meeting summarizing, multimodal meeting annotation, and machine learning applied to the above.
 
S. Bengio and H. Bourlard.
Multi channel sequence processing.
In J. Winkler, M. Niranjan, and N. Lawrence, editors, Deterministic and Statistical Methods in Machine Learning: First International Workshop, Lecture Notes in Artificial Intelligence, volume LNAI 3635, pages 22-36. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper summarizes some of the current research challenges arising from multi-channel sequence processing. Indeed, multiple real life applications involve simultaneous recording and analysis of multiple information sources, which may be asynchronous, have different frame rates, exhibit different stationarity properties, and carry complementary (or correlated) information. Some of these problems can already be tackled by one of the many statistical approaches towards sequence modeling. However, several challenging research issues are still open, such as taking into account asynchrony and correlation between several feature streams, or handling the underlying growing complexity. In this framework, we discuss here two novel approaches, which recently started to be investigated with success in the context of large multimodal problems. These include the asynchronous HMM, providing a principled approach towards the processing of multiple feature streams, and the layered HMM approach, providing a good formalism for decomposing large and complex (multi-stream) problems into layered architectures. As briefly reported here, combination of these two approaches yielded successful results on several multi-channel tasks, ranging from audio-visual speech recognition to automatic meeting analysis.
 
S. Bengio, J. Mariéthoz, and M. Keller.
The expected performance curve.
In International Conference on Machine Learning, ICML, Workshop on ROC Analysis in Machine Learning, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In several research domains concerned with classification tasks, curves like ROC are often used to assess the quality of a particular model or to compare two or more models with respect to various operating points. Researchers also often publish some statistics coming from the ROC, such as the so-called break-even point or equal error rate. The purpose of this paper is to first argue that these measures can be misleading in a machine learning context and should be used with care. Instead, we propose to use the Expected Performance Curves (EPC) which provide unbiased estimates of performance at various operating points. Furthermore, we show how to use adequately a non-parametric statistical test in order to produce EPCs with confidence intervals or assess the statistical significant difference between two models under various settings.
 
C. Dimitrakakis and S. Bengio.
Boosting word error rates.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 501-504, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We apply boosting techniques to the problem of word error rate minimisation in speech recognition. This is achieved through a new definition of sample error for boosting and a training procedure for hidden Markov models. For this purpose we define a sample error for sentence examples related to the word error rate. Furthermore, for each sentence example we define a probability distribution in time that represents our belief that an error has been made at that particular frame. This is used to weigh the frames of each sentence in the boosting framework. We present preliminary results on the well-known Numbers 95 database that indicate the importance of this temporal probability distribution.
 
C. Dimitrakakis and S. Bengio.
Gradient-based estimates of return distributions.
In PASCAL Workshop on Principled Methods of Trading Exploration and Exploitation, 2005.
[ .ps.gz | .pdf | .djvu | abstract]
We present a general method for maintaining estimates of the distribution of parameters in arbitrary models. This is then applied to the estimation of probability distributions over actions in value-based reinforcement learning. While this approach is similar to other techniques that maintain a confidence measure for action-values, it nevertheless offers an insight into current techniques and hints at potential avenues of further research.
 
C. Dimitrakakis and S. Bengio.
Online adaptive policies for ensemble classifiers.
Neurocomputing, 64:211-221, 2005.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
Ensemble algorithms can improve the performance of a given learning algorithm through the combination of multiple base classifiers into an ensemble. In this paper we attempt to train and combine the base classifiers using an adaptive policy. This policy is learnt through a Q-learning inspired technique. Its effectiveness for an essentially supervised task is demonstrated by experimental results on several UCI benchmark databases.
 
D. Gatica-Perez, D. Zhang, and S. Bengio.
Extracting information from multimedia meeting collections.
In 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, MIR, 2005.
[ .ps.gz | .pdf | .djvu | abstract]
Multimedia meeting collections, composed of unedited audio and video streams, handwritten notes, slides, and electronic documents that jointly constitute a raw record of complex human interaction processes in the workplace, have attracted interest due to the increasing feasibility of recording them in large quantities, by the opportunities for information access and retrieval applications derived from the automatic extraction of relevant meeting information, and by the challenges that the extraction of semantic information from real human activities entails. In this paper, we present a succint overview of recent approaches in this field, largely influenced by our own experiences. We first review some of the existing and potential needs for users of multimedia meeting information systems. We then summarize recent work on various research areas addressing some of these requirements. In more detail, we describe our work on automatic analysis of human interaction patterns from audio-visual sensors, discussing open issues in this domain.
 
D. Gatica-Perez, I. McCowan D. Zhang, and S. Bengio.
Detecting group interest-level in meetings.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 489-492, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Finding relevant segments in meeting recordings is important for summarization, browsing, and retrieval purposes. In this paper, we define relevance as the interest-level that meeting participants manifest as a group during the course of their interaction (as perceived by an external observer), and investigate the automatic detection of segments of high-interest from audio-visual cues. This is motivated by the assumption that there is a relationship between segments of interest to participants, and those of interest to the end user, e.g. of a meeting browser. We first address the problem of human annotation of group interest-level. On a 50-meeting corpus, recorded in a room equipped with multiple cameras and microphones, we found that the annotations generated by multiple people exhibit a good degree of consistency, providing a stable ground-truth for automatic methods. For the automatic detection of high-interest segments, we investigate a methodology based on Hidden Markov Models (HMMs) and a number of audio and visual features. Single- and multi-stream approaches were studied. Using precision and recall as performance measures, the results suggest that (i) the automatic detection of group interest-level is promising, and (ii) while audio in general constitutes the predominant modality in meetings, the use of a multi-modal approach is beneficial.
 
N. Gilardi and S. Bengio.
Machine learning for automatic environmental mapping: when and how?
In G. Dubois, editor, Automatic mapping algorithms for routine and emergency monitoring data. Report on the Spatial Interpolation Comparison (SIC2004) exercise, pages 123-138. Office for Official Publications of the European Communities, Luxembourg, 2005.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper discusses the opportunity of using Machine Learning techniques in an automatic environmental mapping context, as was the case for the SIC2004 exercise. First, the Machine Learning methodology is quickly described and compared to Geostatistics. From there, some clues about when to apply Machine Learning are proposed, and what outcomes can be expected from this choice. Finally, three well known regression algorithms: K-Nearest Neighbors, Multi Layer Perceptron and Support Vector Regression, are used on SIC2004 data in a Machine Learning context, and compared to Ordinary Kriging. This illustrates some potential drawbacks of SVR and MLP for applications such as SIC2004.
 
Y. Grandvalet, J. Mariéthoz, and S. Bengio.
A probabilistic interpretation of SVMs with an application to unbalanced classification.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
[ .ps.gz | .pdf | .djvu | idiap-RR | abstract]
In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem when decisions result in unequal losses. Experiments on an unbalanced classification loss show improvements over state-of-the-art procedures.
 
D. Grangier and S. Bengio.
Exploiting hyperlinks to learn a retrieval model.
In Proceedings of the NIPS 2005 Workshop on Learning to Rank, 2005.
[ .ps.gz | .pdf | .djvu | abstract]
Information Retrieval (IR) aims at solving a ranking problem: given a query q and a corpus C, the documents of C should be ranked such that the documents relevant to q appear above the others. This task is generally performed by ranking the documents d in C according to their similarity with respect to q, sim (q,d). The identification of an effective function a,b -> sim(a,b) could be performed using a large set of queries with their corresponding relevance assessments. However, such data are especially expensive to label, thus, as an alternative, we propose to rely on hyperlink data which convey analogous semantic relationships. We then empirically show that a measure sim inferred from hyperlinked documents can actually outperform the state-of-the-art Okapi approach, when applied over a non-hyperlinked retrieval corpus.
 
D. Grangier and S. Bengio.
Inferring document similarity from hyperlinks.
In Proceedings of the Conference on Information and Knowledge Management, CIKM, 2005.
[ .ps.gz | .pdf | .djvu | idiap-RR | abstract]
Assessing semantic similarity between text documents is a crucial aspect in Information Retrieval systems. In this work, we propose to use hyperlink information to derive a similarity measure that can then be applied to compare any text documents, with or without hyperlinks. As linked documents are generally semantically closer than unlinked documents, we use a training corpus with hyperlinks to infer a function a,b -> sim(a,b) that assigns a higher value to linked documents than to unlinked ones. Two sets of experiments on different corpora show that this function compares favorably with OKAPI matching on document retrieval tasks.
 
M. Keller and S. Bengio.
A neural network for text representation.
In Proceedings of the 15th International Conference on Artificial Neural Networks: Biological Inspirations, ICANN, Lecture Notes in Computer Science, volume LNCS 3697, pages 667-672. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Text categorization and retrieval tasks are often based on a good representation of textual data. Departing from the classical vector space model, several probabilistic models have been proposed recently, such as PLSA. In this paper, we propose the use of a neural network based, non-probabilistic, solution, which captures jointly a rich representation of words and documents. Experiments performed on two information retrieval tasks using the TDT2 database and the TREC-8 and 9 sets of queries yielded a better performance for the proposed neural network model, as compared to PLSA and the classical TFIDF representations.
 
M. Keller, S. Bengio, and S. Y. Wong.
Benchmarking non-parametric statistical tests.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
[ .ps.gz | .pdf | .djvu | idiap-RR | abstract]
Although non-parametric tests have already been proposed for that purpose, statistical significance tests for non-standard measures (different from the classification error) are less often used in the literature. This paper is an attempt at empirically verifying how these tests compare with more classical tests, on various conditions. More precisely, using a very large dataset to estimate the whole ``population'', we analyzed the behavior of several statistical test, varying the class unbalance, the compared models, the performance measure, and the sample size. The main result is that providing big enough evaluation sets non-parametric tests are relatively reliable in all conditions.
 
H. Ketabdar, H. Bourlard, and S. Bengio.
Hierarchical multi-stream posterior based speech recognition system.
In Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, Lecture Notes in Computer Science, volume LNCS 3869, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In this paper, we present initial results towards boosting posterior based speech recognition systems by estimating more informative posteriors using multiple streams of features and taking into account acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). These posteriors are estimated based on ``state gamma posterior'' definition (typically used in standard HMMs training) extended to the case of multi-stream HMMs. This approach provides a new, principled, theoretical framework for hierarchical estimation/use of posteriors, multi-stream feature combination, and integrating appropriate context and prior knowledge in posterior estimates. In the present work, we used the resulting gamma posteriors as features for a standard HMM/GMM layer. On the OGI Digits database and on a reduced vocabulary version (1000 words) of the DARPA Conversational Telephone Speech-to-text (CTS) task, this resulted in significant performance improvement, compared to the state-of-the-art Tandem systems.
 
H. Ketabdar, J. Vepa, S. Bengio, and H. Bourlard.
Developing and enhancing posterior based speech recognition systems.
In Proceedings of the 9th European Conference on Speech Communication and Technology, Eurospeech-Interspeech, 2005.
[ .ps.gz | .pdf | .djvu | idiap-RR | abstract]
Local state or phone posterior probabilities are often investigated as local scores (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., ``Tandem'') to improve speech recogni tion systems. In this paper, we present initial results towards boosting these approaches by improving posterior estimat es, using acoustic context (e.g., as available in the whole utterance), as well as possible prior information (such as topological constraints). In the present work, the enhanced posterior distribution is associated with the ``gamma'' distribution typically used in standard HMMs training, and estimated from local likelihoods (GMM) or local posteriors (ANN). This approach results in a family of new HMM based systems, where only posterior probabilities are used, while also providing a new, principled, approach towards a hierarchical use/integration of these posteriors, from the frame level up to the phone and word levels, and integrating the appropriate context and prior knowledge in each level. In the present work, we used the resulting posteriors as local scores in a Viter bi decoder. On the OGI Numbers'95 database, this resulted in improved recognition performance, compared to a state-of-the-art hybrid HMM/ANN system.
 
J. Mariéthoz and S. Bengio.
A unified framework for score normalization techniques applied to text independent speaker verification.
IEEE Signal Processing Letters, 12(7):532-535, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The purpose of this paper is to unify several of the state-of-the-art score normalization techniques applied to text-independent speaker verification systems. We propose a new framework for this purpose. The two well-known Z- and T-normalization techniques can be easily interpreted in this framework as different ways to estimate score distributions. This is useful as it helps to understand the various assumptions behind these well-known score normalization techniques, and opens the door for yet more complex solutions. Finally, some experiments on the Switchboard database are performed in order to illustrate the validity of the new proposed framework.
 
I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang.
Automatic analysis of multimodal group actions in meetings.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 27(3):305-317, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This paper investigates the recognition of group actions in meetings. A statistical framework is proposed in which group actions result from the interactions of the individual participants. The group actions are modelled using different HMM-based approaches, where the observations are provided by a set of audio-visual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modelling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.
 
J.-F. Paiement, D. Eck, and S. Bengio.
A probabilistic model for chord progressions.
In International Conference on Music Information Retrieval, ISMIR, 2005.
[ .ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Estimated probabilities of chord substitutions are derived from this representation and are used to introduce smoothing in graphical models observing chord progressions. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm is used for inference. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
 
J.-F. Paiement, D. Eck, S. Bengio, and D. Barber.
A graphical model for chord progressions embedded in a psychoacoustic space.
In International Conference on Machine Learning, ICML, 2005.
[ .ps.gz | .pdf | .djvu | abstract]
Chord progressions are the building blocks from which tonal music is constructed. Inferring chord progressions is thus an essential step towards modeling long term dependencies in music. In this paper, a distributed representation for chords is designed such that Euclidean distances roughly correspond to psychoacoustic dissimilarities. Parameters in the graphical models are learnt with the EM algorithm and the classical Junction Tree algorithm. Various model architectures are compared in terms of conditional out-of-sample likelihood. Both perceptual and statistical evidence show that binary trees related to meter are well suited to capture chord dependencies.
 
N. Poh and S. Bengio.
Can chimeric persons be used in multimodal biometric authentication experiments?
In S. Renals and S. Bengio, editors, Machine Learning for Multimodal Interactions: Second International Workshop, MLMI, volume LNCS 3869. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources, typically from several data streams is a very promising approach, both in experiments and to some extents in various real-life applications. A system that uses more than one behavioral and physiological characteristics to verify whether a person is who he/she claims to be is called a multimodal biometric authentication system. Due to lack of large true multimodal biometric datasets, the biometric trait of a user from a database is often combined with another different biometric trait of yet another user, thus creating a so-called a chimeric user. In the literature, this practice is justified based on the fact that the underlying biometric traits to be combined are assumed to be independent of each other given the user. To the best of our knowledge, there is no literature that approves or disapproves such practice. We study this topic from two aspects: 1) by clarifying the mentioned independence assumption and 2) by constructing a pool of chimeric users from a pool of true modality matched users (or simply ``true users'') taken from a bimodal database, such that the performance variability due to chimeric user can be compared with that due to true users. The experimental results suggest that for a large proportion of the experiments, such practice is indeed questionable.
 
N. Poh and S. Bengio.
EER of fixed and trainable fusion classifiers: A theoretical study with application to biometric authentication tasks.
In N. C. Oza, R. Polikar, and J. Kittler, editors, 6th International Workshop on Multiple Classifier Systems, MCS, Lecture Notes in Computer Science, volume LNCS 3541, pages 74-85. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Biometric authentication is a process of verifying an identity claim using a person's behavioural and physiological characteristics. Due to the vulnerability of the system to environmental noise and variation caused by the user, fusion of several biometric-enabled systems is identified as a promising solution. In the literature, various fixed rules (e.g. min, max, median, mean) and trainable classifiers (e.g. linear combination of scores or weighted sum) are used to combine the scores of several base-systems. How exactly do correlation and imbalance nature of base-system performance affect the fixed rules and trainable classifiers? We study these joint aspects using the commonly used error measurement in biometric authentication, namely Equal Error Rate (EER). Similar to several previous studies in the literature, the central assumption used here is that the class-dependent scores of a biometric system are approximately normally distributed. However, different from them, the novelty of this study is to make a direct link between the EER measure and the fusion schemes mentioned. Both synthetic and real experiments (with as many as 256 fusion experiments carried out on the XM2VTS benchmark score-level fusion data sets) verify our proposed theoretical modeling of EER of the two families of combination scheme. In particular, it is found that weighted sum can provide the best generalisation performance when its weights are estimated correctly. It also has the additional advantage that score normalisation prior to fusion is not needed, contrary to the rest of fixed fusion rules.
 
N. Poh and S. Bengio.
F-ratio client-dependent normalisation for biometric authentication tasks.
In IEEE International Conference on Acoustic, Speech, and Signal Processing, ICASSP, pages 721-724, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new client-dependent normalisation to improve biometric authentication systems. There exists many client-de-pendent score normalisation techniques applied to speaker authentication, such as Z-Norm, D-Norm and T-Norm. Such normalisation is intended to adjust the variation across different client models. We propose ``F-ratio'' normalisation, or F-Norm, applied to face and speaker authentication systems. This normalisation requires only that as few as two client-dependent accesses are available (the more the better). Different from previous normalisation techniques, F-Norm considers the client and impostor distributions simultaneously. We show that F-ratio is a natural choice because it is directly associated to Equal Error Rate. It has the effect of centering the client and impostor distributions such that a global threshold can be easily found. Another difference is that F-Norm actually ``interpolates'' between client-independent and client-dependent information by introducing a mixture parameter. This parameter can be optimised to maximise the class dispersion (the degree of separability between client and impostor distributions) while the aforementioned normalisation techniques cannot. The results of 13 unimodal experiments carried out on the XM2VTS multimodal database show that such normalisation is advantageous over Z-Norm, client-dependent threshold normalisation or no normalisation.
 
N. Poh and S. Bengio.
How do correlation and variance of base classifiers affect fusion in biometric authentication tasks?
IEEE Transactions on Signal Processing, 53(11):4384-4396, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Combining multiple information sources such as subbands, streams (with different features) and multi modal data has been shown to be a very promising trend, both in experiments and to some extents in real-life biometric authentication applications. Despite considerable efforts in fusions, there is a lack of understanding on the roles and effects of correlation and variance (of both the client and impostor scores of base-classifiers/experts). Often, scores are assumed to be independent. In this paper, we explicitly consider this factor using a theoretical model, called Variance Reduction-Equal Error Rate (VR-EER) analysis. Assuming that client and impostor scores are approximately Gaussian distributed, we showed that Equal Error Rate (EER) can be modeled as a function of F-ratio, which itself is a function of 1) correlation, 2) variance of base-experts and 3) difference of client and impostor means. To achieve lower EER, smaller correlation and average variance of base-experts, and larger mean difference are desirable. Furthermore, analysing any of these factors independently, e.g. focusing on correlation alone, could be miss-leading. Experimental results on the BANCA multimodal database confirm our findings using VR-EER analysis. We analysed four commonly encountered scenarios in biometric authentication which include fusing correlated/uncorrelated base-experts of similar/different performances. The analysis explains and shows that fusing systems of different performances is not always beneficial. One of the most important findings is that positive correlation ``hurts'' fusion while negative correlation (greater ``diversity'', which measures the spread of prediction score with respect to the fused score), improves fusion. However, by linking the concept of ambiguity decomposition to classification problem, it is found that diversity is not sufficient to be an evaluation criterion (to compare several fusion systems), unless measures are taken to normalise the (class-dependent) variance. Moreover, by linking the concept of bias-variance-covariance decomposition to classification using EER, it is found that if the inherent mismatch (between training and test sessions) can be learned from the data, such mismatch can be incorporated into the fusion system as a part of training parameters.
 
N. Poh and S. Bengio.
Improving fusion with margin-derived confidence in biometric authentication tasks.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1059-1068. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
This study investigates a new confidence criterion to improve fusion via a linear combination of scores of several biometric authentication systems. This confidence is based on the margin of making a decision, which answers the question, ``after observing the score of a given system, what is the confidence (or risk) associated to that given access?''. In the context of multimodal and intramodal fusion, such information proves valuable because the margin information can determine which of the systems should be given higher weights. Finally, we propose a linear discriminative framework to fuse the margin information with an existing global fusion function. The results of 32 fusion experiments carried out on the XM2VTS multimodal database show that fusion using margin (product of margin and expert opinion) is superior over fusion without the margin information (i.e., the original expert opinion). Furthermore, combining both sources of information increases fusion performance further.
 
N. Poh and S. Bengio.
A novel approach to combining client-dependent and confidence information in multimodal biometrics.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 1120-1129. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
The issues of fusion with client-dependent and confidence information have been well studied separately in biometric authentication. In this study, we propose to take advantage of both sources of information in a discriminative framework. Initially, each source of information is processed on a per expert basis (plus on a per client basis for the first information and on a per example basis for the second information). Then, both sources of information are combined using a second-level classifier, across different experts. Although the formulation of such two-step solution is not new, the novelty lies in the way the sources of prior knowledge are incorporated prior to fusion using the second-level classifier. Because these two sources of information are of very different nature, one often needs to devise special algorithms to combine both information sources. Our framework that we call ``Prior Knowledge Incorporation'' has the advantage of using the standard machine learning algorithms. Based on 10 times 32=320 intramodal and multimodal fusion experiments carried out on the publicly available XM2VTS score-level fusion benchmark database, it is found that the generalisation performance of combining both information sources improves over using either or none of them, thus achieving a new state-of-the-art performance on this database.
 
N. Poh and S. Bengio.
A score-level fusion benchmark database for biometric authentication.
In T. Kanade, A. Jain, and N. K. Ratha, editors, 5th International Conference on Audio- and Video-Based Biometric Person Authentication, AVBPA, Lecture Notes in Computer Science, volume LNCS 3546, pages 474-483. Springer-Verlag, 2005.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
Fusing the scores of several biometric systems is a very promising approach to improve the overall system's accuracy. Despite many works in the literature, it is surprising that there is no coordinated effort in making a benchmark database available. It should be noted that fusion in this context consists not only of multimodal fusion, but also intramodal fusion, i.e., fusing systems using the same biometric modality but different features, or same features but using different classifiers. Building baseline systems from scratch often prevents researchers from putting more efforts in understanding the fusion problem. This paper describes a database of scores taken from experiments carried out on the XM2VTS face and speaker verification database. It then proposes several fusion protocols and provides some state-of-the-art tools to evaluate the fusion performance.
 
V. Popovici, S. Bengio, and J.-P. Thiran.
Kernel matching pursuit for large datasets.
Pattern Recognition, 38(12):2385-2390, 2005.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
Kernel Matching Pursuit is a greedy algorithm for building an approximation of a discriminant function as a linear combination of some basis functions selected from a kernel-induced dictionary. Here we propose a modification of the Kernel Matching Pursuit algorithm that aim s at making the method practical for large datasets. Starting from an approximating algorithm, the Weak Greedy Algorithm, we introduce a stochastic method for reducing the search space at each iteration. Then we study the implications of using an approximate algorithm and we show how one can control the trade-off between the accuracy and the need for resources. Finally we present some experiments performed on a large dataset that support our approach and illustrate its applicability.
 
A. Pozdnoukhov and S. Bengio.
Improving kernel classifiers for object categorization problems.
In International Conference on Machine Learning, ICML, Workshop on Learning with Partially Classified Training Data, 2005.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper presents an approach for improving the performance of kernel classifiers applied to object categorization problems. The approach is based on the use of distributions centered around each training points, which are exploited for inter-class invariant image representation with local invariant features. Furthermore, we propose an extensive use of unlabeled images for improving the SVM-based classifier.
 
C. Sanderson, F. Cardinaux, and S. Bengio.
On accuracy/robustness/complexity trade-offs in face verification.
In IEEE International Conference on Information Technology and Applications, ICITA, pages 638-643, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
In much of the literature devoted to face recognition, experiments are performed with controlled images (e.g. manual face localization, controlled lighting, background and pose). However, a practical recognition system has to be robust to more challenging conditions. In this paper we first evaluate, on the relatively difficult BANCA database, the discrimination accuracy, robustness and complexity of Gaussian Mixture Model (GMM), 1D- and pseudo-2D Hidden Markov Model (HMM) based systems, using both manual and automatic face localization. We also propose to extend the GMM approach through the use of local features with embedded positional information, increasing accuracy without sacrificing its low complexity. Experiments show that good accuracy on manually located faces is not necessarily indicative of good accuracy on automatically located faces (which are imperfectly located). The deciding factor is shown to be the degree of constraints placed on spatial relations between face parts. Methods which utilize rigid constraints have poor robustness compared to methods which have relaxed constraints. Furthermore, we show that while the pseudo-2D HMM approach has the best overall accuracy, classification time on current hardware makes it impractical. The best trade-off in terms of complexity, robustness and discrimination accuracy is achieved by the extended GMM approach.
 
D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised adapted HMMs for unusual event detection.
In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2005.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
We address the problem of temporal unusual event detection. Unusual events are characterized by a number of features (rarity, unexpectedness, and relevance) that limit the application of traditional supervised model-based approaches. We propose a semi-supervised adapted Hidden Markov Model (HMM) framework, in which usual event models are first learned from a large amount of (commonly available) training data, while unusual event models are learned by Bayesian adaptation in an unsupervised manner. The proposed framework has an iterative structure, which adapts a new unusual event model at each iteration. We show that such a framework can address problems due to the scarcity of training data and the difficulty in pre-defining unusual events. Experiments on audio, visual, and audio-visual data streams illustrate its effectiveness, compared with both supervised and unsupervised baseline methods.
 
D. Zhang, D. Gatica-Perez, S. Bengio, and I. McCowan.
Semi-supervised meeting event recognition with adapted HMMs.
In IEEE International Conference on Multimedia Expo, ICME, pages 611-618, 2005.
[ .ps.gz | .pdf | .djvu | idiap-RR | abstract]
This paper investigates the use of unlabeled data to help labeled data for audio-visual event recognition in meetings. To deal with situations in which it is difficult to collect enough labeled data to capture event characteristics, but collecting a large amount of unlabeled data is easy, we present a semi-supervised framework using HMM adaptation techniques. Instead of directly training one model for each event, we first train a well-estimated general event model for all events using both labeled and unlabeled data, and then adapt the general model to each specific event model using its own labeled data. We illustrate the proposed approach with a set of eight audio-visual events defined in meetings. Experiments and comparison with the fully-supervised baseline method show the validity of the proposed semi-supervised approach.
 
D. Zhang, D. Gatica-Perez, S. Bengio, and D. Roy.
Learning influence among interacting markov chains.
In Advances in Neural Information Processing Systems, NIPS 18. MIT Press, 2005.
[ .ps.gz | .pdf | .djvu | abstract]
We present a model that learns the influence of interacting Markov chains within a team. The proposed model is a dynamic Bayesian network (DBN) with a two-level structure: individual-level and group-level. Individual level models actions of each player, and the group-level models actions of the team as a whole. Experiments on synthetic multi-player games and a multi-party meeting corpus show the effectiveness of the proposed model.

2004

 
S. Bengio.
Multimodal speech processing using asynchronous hidden markov models.
Information Fusion, 5(2):81-89, 2004.
[ .ps.gz | .pdf | .djvu | weblink | abstract]
This paper advocates that for some multimodal tasks involving more than one stream of data representing the same sequence of events, it might sometimes be a good idea to be able to desynchronize the streams in order to maximize their joint likelihood. We thus present a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same sequence of events. An Expectation-Maximization algorithm to train the model is presented, as well as a Viterbi decoding algorithm, which can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model was tested on two audio-visual speech processing tasks, namely speech recognition and text-dependent speaker verification, both using the M2VTS database. Robust performances under various noise conditions were obtained in both cases.
 
S. Bengio and J. Mariéthoz.
The expected performance curve: a new assessment measure for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
ROC and DET curves are often used in the field of person authentication to assess the quality of a model or even to compare several models. We argue in this paper that this measure can be misleading as it compares performance measures that cannot be reached simultaneously by all systems. We propose instead new curves, called Expected Performance Curves (EPC). These curves enable the comparison between several systems according to a criterion, decided by the application, which is used to set thresholds according to a separate validation set. A free sofware is available to compute these curves. A real case study is used throughout the paper to illustrate it. Finally, note that while this study was done on an authentication problem, it also applies to most 2-class classification tasks.
 
S. Bengio and J. Mariéthoz.
A statistical significance test for person authentication.
In Proceedings of Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
[ .ps.gz | .pdf | .djvu | weblink | idiap-RR | abstract]
Assessing whether two models are statistically significantly different from each other is a very important step in research, although it has unfortunately not received enough attention in the field of person authentication. Several performance measures are often used to compare models, such as half total error rates (HTERs) and equal error rates (EERs), but most being aggregates of two measures (such as the false acceptance rate and the false rejection rate), simple statistical tests cannot be used as is. We show in this paper how to adapt one of these tests in order to compute a confidence interval around one HTER measure or to assess the statistical significantness of the difference between two HTER measures. We also compare our technique with other solutions that are sometimes used in the literature and show why they yield often too optimistic results (resulting in false statements about statistical significantness).
&n