ADASEQ - AdaBoost and Other Ensemble Methods for Sequence Processing Problems

Samy Bengio

One of the main objectives of machine learning research is to develop algorithms that learn predictive relationships from data. This is a difficult task, since inferring a function from data is in fact an "ill-posed" problem: many functions can often "fit" a given finite data set, but only some of them will behave adequately on new data drawn from the same distribution. Moreover, the function that best fits the given training data set may not behave as expected on new data. This issue is at the heart of statistical learning theory, which has been developed in recent years. Many approaches have been proposed recently to select the best function and to evaluate its expected performance on new data.

One approach to this problem is to select not just one function but many different functions, and to combine their outputs to produce a new solution. Many machine learning algorithms are now based on this technique; they are called ensemble methods. For instance, Bagging creates many functions, each trained on a bootstrap of the data set (a new data set of the same size, created by sampling independently, with replacement, from the original data set). The output of Bagging is then a simple average of the outputs of these functions. This apparently simple method has been shown to significantly improve performance on many tasks. More interestingly, AdaBoost also creates many functions, but trains each of them with more emphasis on the examples of the data set that the previously trained functions handled worst. A particular combination method is then applied, which gives surprisingly good results on new data.
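The two ensemble methods described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes binary labels in {-1, +1}, one-dimensional inputs, and a threshold classifier ("stump") as the base function. All names (train_stump, bagging, adaboost, toy_xs, toy_ys) are hypothetical, and the AdaBoost part follows the standard discrete formulation with exponential re-weighting and a weighted vote.

```python
import math
import random


def train_stump(xs, ys, weights):
    """Fit a 1-D threshold classifier that minimises the weighted error.

    Returns (threshold, sign); the prediction is sign if x > threshold,
    and -sign otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x > threshold else -sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best[1], best[2]


def stump_predict(stump, x):
    threshold, sign = stump
    return sign if x > threshold else -sign


def bagging(xs, ys, n_models, rng):
    """Train each stump on a bootstrap: n draws with replacement."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_xs = [xs[i] for i in idx]
        boot_ys = [ys[i] for i in idx]
        models.append(train_stump(boot_xs, boot_ys, [1.0 / n] * n))
    return models


def bagging_predict(models, x):
    """Simple average of the individual outputs, thresholded at zero."""
    avg = sum(stump_predict(m, x) for m in models) / len(models)
    return 1 if avg >= 0 else -1


def adaboost(xs, ys, n_rounds):
    """Discrete AdaBoost: re-weight examples after every round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(xs, ys, weights)
        preds = [stump_predict(stump, x) for x in xs]
        err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this stump
        ensemble.append((alpha, stump))
        # Misclassified examples (y * p = -1) gain weight, the others lose it.
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D problem (hypothetical data): the positive class is an interval,
# so no single threshold classifier can separate it.
toy_xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
toy_ys = [-1, -1, -1, 1, 1, 1, -1, -1, -1]

bagged = bagging(toy_xs, toy_ys, n_models=11, rng=random.Random(0))
boosted = adaboost(toy_xs, toy_ys, n_rounds=3)
print([bagging_predict(bagged, x) for x in toy_xs])
print([adaboost_predict(boosted, x) for x in toy_xs])
```

On this toy set, a single stump misclassifies at least three of the nine points, whereas three rounds of boosting are enough to classify every training point correctly. Both functions handle only fixed-size classification; neither addresses the sequence outputs that are the subject of this project.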

Most of these ensemble methods have been developed for classification (selecting a class among a fixed set of classes) or regression problems (predicting a real-valued vector given another real-valued vector). On the other hand, some machine learning problems have solutions expressed as a sequence of output values. One such problem is automatic speech recognition, where the output is a sequence of words. This problem, like most sequence processing problems, is usually handled using hidden Markov models (HMMs), which are statistical models specifically designed for sequence processing and have given state-of-the-art performance on many sequence problems. Unfortunately, there are currently few ensemble algorithms specifically designed for HMMs or sequence problems.

The purpose of the present project is thus to study, propose, develop and compare new ensemble methods tailored for sequence processing problems. As current ensemble methods usually bring good generalization performance on classification and regression problems, they are expected to bring good performance on sequence processing problems as well. One of the main problems addressed in this project will be the search for methods that efficiently combine sequences of different lengths and different confidence levels. Another research area will be to determine how the different models should be trained in order to give different yet complementary results. Finally, a theoretical analysis of these new ensemble techniques will be carried out.

Keywords: learning algorithms, ensemble methods, AdaBoost, Bagging, sequence processing, hidden Markov models, speech processing, DNA sequence analysis.

Samy Bengio 2001-11-14