Samy Bengio
One of the main objectives of machine learning research is to develop
algorithms that learn predictive relationships from data. This is a
difficult task since inferring a function from data is in fact an
``ill-posed'' problem: many functions can often ``fit'' a given finite
data set, but only some of them will behave adequately on new
data drawn from the same distribution. Moreover, it can happen that
the function that best fits the given training data will not behave
as expected on new data. This issue is deeply related to statistical
learning theory, which has been developed over recent years.
Many approaches have recently been proposed to select the best function
and to evaluate its expected performance on new data.
One approach to this problem is to select not just one function
but many different functions and to combine their outputs in order to
produce a new solution. Nowadays, many machine learning algorithms are based
on this technique; they are called ensemble methods.
For instance, Bagging creates many functions,
each trained on a bootstrap replicate of the data set
(a new data set of the same size, created by sampling with replacement from
the original data set). The output of Bagging is then
a simple average of the outputs of these functions.
This apparently simple method has been shown to significantly improve
performance on many tasks.
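The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the trivial base learner `fit_mean` and the function names are purely illustrative.

```python
import random
import statistics

def fit_mean(data):
    """Trivial base learner: ignore the input and predict the mean target."""
    ys = [y for _, y in data]
    return lambda x: statistics.mean(ys)

def bag(data, n_models=10, base_learner=fit_mean, seed=0):
    """Bagging: train each model on a bootstrap replicate of the data set
    (same size, sampled with replacement), then average their outputs."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(data) for _ in range(len(data))]
        models.append(base_learner(replicate))
    # The ensemble's prediction is a simple average of the models' outputs.
    return lambda x: statistics.mean(m(x) for m in models)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
predict = bag(data)
```

Any base learner with the same interface (a function from a data set to a predictor) could be plugged in place of `fit_mean`; the averaging step is what reduces the variance of the combined solution.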
More interestingly, AdaBoost
also creates many functions, but each of them is
trained while paying more attention to the examples of the data set on which
the previously trained functions performed worst. A
particular weighted combination
is then applied, which gives surprisingly good results on new data.
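The re-weighting idea can be made concrete with a small sketch of discrete AdaBoost for binary classification, using decision stumps as the base learner. All names here (`train_stump`, `adaboost`) and the toy setup are illustrative assumptions, not the project's own code.

```python
import math

def train_stump(xs, ys, weights):
    """Base learner: pick the threshold/polarity stump with lowest weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            preds = [pol if x >= thr else -pol for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    err, thr, pol = best
    return (lambda x, t=thr, p=pol: p if x >= t else -p), err

def adaboost(xs, ys, n_rounds=5):
    """Each round increases the weight of the examples the current
    function misclassifies, so the next function focuses on them."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        h, err = train_stump(xs, ys, weights)
        err = max(err, 1e-10)  # guard against a perfect stump (division by zero)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Up-weight misclassified examples, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * y * h(x)) for w, x, y in zip(weights, xs, ys)]
        z = sum(weights)
        weights = [w / z for w in weights]
    # The final prediction is a weighted vote of all the trained functions.
    def predict(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return predict
```

The weighted vote, with each function's weight `alpha` derived from its training error, is the ``particular combination method'' alluded to above.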
Most of these ensemble methods have been developed for classification
problems (select a class among a fixed set of classes) or regression problems
(predict a real-valued vector given another real-valued vector). On the
other hand, some machine learning problems have their solution expressed
as a sequence of output values. One such problem
is automatic speech recognition, where the output is
a sequence of words.
This problem, as well as most other sequence processing problems,
is usually handled using hidden Markov models (HMMs),
statistical models specifically
designed for sequence processing that have given state-of-the-art
performance on many sequence tasks.
Unfortunately, there are currently
few ensemble algorithms specifically designed for HMMs or
sequence problems.
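To make the HMM setting concrete, the following sketch computes the likelihood of an observation sequence with the standard forward algorithm, summing over all hidden state paths. The two-state toy model (state names, probabilities) is entirely hypothetical, chosen only for illustration.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability of an observation sequence
    under an HMM, summing over all hidden state paths."""
    # Initialization: start probability times emission of the first symbol.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recursion: propagate probability mass through the transition matrix.
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[r] * trans_p[r][s] for r in states)
                 for s in states}
    # Termination: sum over the final hidden state.
    return sum(alpha.values())

# Hypothetical two-state model over observations "x" and "y".
states = ("A", "B")
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
likelihood = forward(["x", "y"], states, start_p, trans_p, emit_p)  # 0.209
```

An ensemble of HMMs would have to combine such sequence-level scores, or the decoded sequences themselves, which is precisely what makes the sequence case harder than averaging fixed-size outputs.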
The purpose of the present project is thus to study, propose, develop
and compare new ensemble methods tailored to sequence processing problems.
As current ensemble methods have usually brought good generalization
performance on classification and regression problems, they are expected
to bring similar benefits to sequence processing problems.
One of the main problems that will be addressed in the framework of this
project will
be the search for methods that efficiently combine sequences
of different lengths and with different confidence levels.
Another research direction will be to
determine how the different models should be trained so that they give
different yet complementary results. Finally, a theoretical
analysis of these new ensemble techniques will also be carried out.
Keywords: learning algorithms, ensemble methods, AdaBoost, Bagging,
sequence processing, hidden Markov models, speech processing, DNA sequence
analysis.
Samy Bengio, 2001-11-14
