7:26 PM 12/12/2013

When will overfitting occur?
Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
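
A minimal numeric sketch of this (synthetic data, numpy only; the degrees and sample size are arbitrary choices for illustration): a degree-9 polynomial with as many parameters as observations drives the training error to essentially zero but tracks the noise, while a degree-1 fit stays close to the true relation on new points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underlying relation: y = 2x plus noise, observed at only 10 points.
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.2, size=x.shape)

# Degree-9 polynomial: as many parameters as observations, so it can
# interpolate the noise and the training error goes to ~0.
overfit = np.polyfit(x, y, deg=9)
# Degree-1 polynomial: few parameters, captures the underlying trend.
simple = np.polyfit(x, y, deg=1)

x_new = np.linspace(0, 1, 100)
print("train RMSE, degree 9:", np.sqrt(np.mean((np.polyval(overfit, x) - y) ** 2)))
print("train RMSE, degree 1:", np.sqrt(np.mean((np.polyval(simple, x) - y) ** 2)))
print("gap to true relation on new points, degree 9:",
      np.sqrt(np.mean((np.polyval(overfit, x_new) - 2 * x_new) ** 2)))
print("gap to true relation on new points, degree 1:",
      np.sqrt(np.mean((np.polyval(simple, x_new) - 2 * x_new) ** 2)))
```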

This happens when the model being trained is too complex and it "tries" to memorize the training data instead of learning the underlying relations among the data. The reason is that the criterion used for training the model is different from the criterion used to evaluate it: the model is trained on the seen data, but the aim is to predict the unseen data.

Any information contained in the training data can be decomposed into two parts: (1) information relevant to the future, which reflects the underlying relation of the data, and (2) information from the noise, which is irrelevant.

---------------
The rest of the content is from (Domingos, 2000), Bayesian Averaging of Classifiers and the Overfitting Problem.
------------------

The uniform class noise model corresponds to our succinct Dawid-Skene model.

Buntine's perspective (Buntine, 1990) on classification models is that, implicitly or explicitly, every classification model divides the instance space into different regions and labels all instances in each region with the same label. For example, a classification tree (Quinlan, 1993).
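
As a small illustration of this view (a sketch using scikit-learn's DecisionTreeClassifier on synthetic data; not taken from the paper): the tree's leaves are the regions, and every instance falling into the same leaf receives the same predicted label.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two-dimensional instance space; the label depends only on the first feature.
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Each leaf is one region of the instance space; apply() returns the leaf id,
# and all instances mapped to the same leaf receive the same predicted label.
leaf_ids = tree.apply(X)
preds = tree.predict(X)
for leaf in np.unique(leaf_ids):
    print(f"leaf {leaf}: predicted label(s) {np.unique(preds[leaf_ids == leaf])}")
```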

Bayesian model averaging is not robust to overfitting; on the contrary, it is very sensitive to it. Previous opinions held that it is the optimal approach.

Bagging (Breiman, 1996) is an effective tool to reduce the errors of classification models. What it does is bootstrap the data multiple times, train a model on each bootstrapped dataset, and then average the model predictions. Bagging can be viewed as a form of importance sampling (Bernardo & Smith, 1994).
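
A minimal sketch of bagging under these assumptions (decision trees as the base model, binary labels, numpy/scikit-learn; the function and parameter names are mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_predict(X_train, y_train, X_test, n_boot=25, seed=0):
    """Bootstrap the data n_boot times, train one tree per replicate,
    and combine the predictions by majority vote."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        model = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(model.predict(X_test))
    votes = np.stack(votes)                        # shape: (n_boot, n_test)
    # Majority vote over the bootstrap models (binary 0/1 labels assumed).
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

For a real experiment, sklearn.ensemble.BaggingClassifier implements this loop directly.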

The reason why BMA is not robust to overfitting is that the weight of each model's prediction, i.e., its posterior, is very sensitive to sample variation. If the sample changes a little, the posterior can change exponentially, so it is not robust at all.
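
A small numerical sketch of this point (made-up log-likelihoods): under a uniform prior, each model's weight is proportional to its likelihood, a product over examples, so a per-example difference of a few hundredths is amplified exponentially with the sample size, and perturbing a handful of examples can flip which model gets essentially all the weight.

```python
import numpy as np

def bma_weights(log_lik_per_example):
    """Posterior weights of models under a uniform prior:
    w_k proportional to exp(sum_i log p_k(x_i))."""
    totals = log_lik_per_example.sum(axis=1)
    w = np.exp(totals - totals.max())   # subtract max for numerical stability
    return w / w.sum()

# Two models whose per-example log-likelihoods differ by only 0.05.
n = 200
log_lik = np.vstack([np.full(n, -1.00),    # model A
                     np.full(n, -1.05)])   # model B
print(bma_weights(log_lik))                # model A takes essentially all the weight

# Perturb a handful of examples in favour of model B and the weights flip.
log_lik[1, :15] = -0.20
print(bma_weights(log_lik))
```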


-------------------------------------------------
Stacking. Dzeroski ML 2004.

Stacking makes me wonder whether maintaining a distribution over a set of workers in crowdsourcing is a good idea or not. This should be judged by comparing it with the worker selection problem.

Is weighted majority voting always better than randomly sampled voting? I.e., sample the answer from each worker with a certain probability, and then predict based on the sampled workers. We can conduct an experiment to see how it works.
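
A possible simulation of that experiment (the accuracies, sizes, and the 3-worker subsample are all made-up parameters): simulate binary tasks, workers with heterogeneous accuracies, and compare log-odds-weighted majority voting against an unweighted vote over a random subset of workers.

```python
import numpy as np

rng = np.random.default_rng(0)

n_items, n_workers = 2000, 9
accuracy = rng.uniform(0.55, 0.9, size=n_workers)    # assumed worker accuracies
truth = rng.integers(0, 2, size=n_items)

# Worker answers: correct with probability accuracy[j], otherwise flipped.
correct = rng.random((n_items, n_workers)) < accuracy
answers = np.where(correct, truth[:, None], 1 - truth[:, None])

# Weighted majority vote with log-odds weights from the (assumed known) accuracies.
w = np.log(accuracy / (1 - accuracy))
wmv = ((answers * 2 - 1) @ w > 0).astype(int)

# Random-sampled voting: pick 3 workers per item, unweighted majority.
sampled = np.array([rng.choice(n_workers, size=3, replace=False) for _ in range(n_items)])
sub = np.take_along_axis(answers, sampled, axis=1)
rsv = (sub.mean(axis=1) > 0.5).astype(int)

print("weighted majority accuracy:", (wmv == truth).mean())
print("random-sampled voting accuracy:", (rsv == truth).mean())
```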

Crazy idea: could we combine SVM with Lasso? We know that Lasso does feature selection -- it regularizes the feature space; what about regularizing the feature probability? Or call it a density regularizer? Maintaining a distribution over the feature dimensions should alleviate the feature dependence. Group lasso takes multiple features into account, but what about doing something like Michael Mahoney's framework --- sample features, randomly drop coefficients to zero, and then predict. Maybe by combining this with the probability density, it will perform even better.

For linear regression y=X\beta, we can have three regimes (a sketch follows this list):
1) Randomly sample a subset of rows of X, train a corresponding \beta on each subset, then stack or average the models (voting, majority voting, random voting, etc.). This is like bagging --- bootstrap the data, get multiple models, then average them into the final prediction.
2) Randomly sample columns of X, learn a partial \beta on each subset, and then think about how to aggregate these betas.
3) Learn a full \beta, but at prediction time randomly drop some coefficients of \beta to zero (possibly most of them, to produce sparsity), predict, and then average.
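
A minimal sketch of the three regimes (plain least squares via numpy; the model counts, sampling fractions, and the dropout-style rescaling in regime 3 are my own choices for illustration):

```python
import numpy as np

def ols(X, y):
    """Least-squares fit of y = X beta."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def regime1_row_sampling(X, y, X_test, n_models=20, frac=0.7, seed=0):
    """(1) Sample subsets of rows, fit a beta on each, average the predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
        preds.append(X_test @ ols(X[idx], y[idx]))
    return np.mean(preds, axis=0)

def regime2_column_sampling(X, y, X_test, n_models=20, n_cols=5, seed=0):
    """(2) Sample subsets of columns, fit a partial beta on each;
    one simple aggregation is to average the resulting predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        cols = rng.choice(X.shape[1], size=n_cols, replace=False)
        preds.append(X_test[:, cols] @ ols(X[:, cols], y))
    return np.mean(preds, axis=0)

def regime3_coefficient_dropout(X, y, X_test, n_models=20, keep=0.3, seed=0):
    """(3) Fit one full beta, then at prediction time randomly zero out most
    coefficients (sparsity), predict, and average over the random masks."""
    rng = np.random.default_rng(seed)
    beta = ols(X, y)
    preds = []
    for _ in range(n_models):
        mask = rng.random(len(beta)) < keep
        preds.append(X_test @ (beta * mask) / keep)   # rescale by keep, as in dropout
    return np.mean(preds, axis=0)
```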

The idea of using soft voting can be directly related to crowdsourcing with confidence scores.

Could we design a mechanism along these lines for computer vision or clustering tasks where there is a lot of ambiguity?

Could we move these ideas to Lasso-based learning? Treat each classifier as a feature, and select only a few of them.
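
A sketch of that idea (using an L1-penalized logistic regression as the classification analogue of the Lasso; the base models and data are placeholders): stack the base classifiers' predicted probabilities as meta-features and let the L1 penalty keep only a few of them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lasso_stack(base_models, X_meta, y_meta, C=0.1):
    """Stacking with an L1 penalty: each column of the meta-features is one
    base classifier's prediction; the L1 penalty keeps only a few of them."""
    # Meta-features: one column per base classifier (probability of class 1).
    Z = np.column_stack([m.predict_proba(X_meta)[:, 1] for m in base_models])
    meta = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    meta.fit(Z, y_meta)
    selected = np.flatnonzero(meta.coef_[0])
    return meta, selected   # selected = indices of the classifiers that survive
```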
