Nathan Intrator
and
Inna Stainvas
Tel-Aviv University
www.math.tau.ac.il/~nin
Model selection is essential in experts with many parameters and requires an efficient measure of model complexity. While regularization is very useful, it has two fundamental drawbacks: (i) it requires an independent test set for evaluating the regularization parameter, which is often replaced by a computationally expensive leave-out estimation, and (ii) for every regularization value, a new model has to be estimated. The second problem implies that the models are very different, so model interpretation is not unique. A unique interpretation is highly desirable and is achieved when a nested sequence of models can be created, for example in regression or in a tree-structured model like CART (Breiman et al., 1984).
A sequence of nested models is computationally more efficient (there is no need to re-estimate the model when the model complexity penalty value is changed) and leads to a natural hierarchy of models. In regression, a sequence of models is achieved by creating a nested sequence of data representations or simply by eliminating different data coordinates. This data complexity regularization is not always directly correlated with the desired model complexity regularization. While measuring model complexity by some statistic on the distribution of the model parameters (e.g. variance or description length; Hinton and van Camp, 1993) may be very useful, it still suffers from the two drawbacks mentioned above.
We introduce a general approach for regularizing models by constructing a nested sequence of models in which model complexity is measured by the dimensionality of the model parameters. (The sequence may be only approximately nested when further optimization refinements are used.) The likelihood ratio statistic for two models in the nested sequence asymptotically has a chi-squared distribution with degrees of freedom equal to the number of additional dimensions (Silvey, 1975). This leads to a simple choice of optimal models using the distribution properties. Wavelet theory, and in particular wavelet packets and best basis methods (Coifman and Wickerhauser, 1992), are used in the construction of the nested sequence of models and simplify the parameter estimation in the case of a feed-forward network model.
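The chi-squared property above can be illustrated with a minimal sketch of a likelihood ratio test over a nested sequence. This is not the procedure of the paper: the function names `chi2_sf` and `select_model`, the forward stepping rule, the significance level `alpha`, and the log-likelihood values in the usage note are all assumptions made for illustration. The sketch assumes that each model's parameter dimension and maximized log-likelihood are already available.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term = 1.0
        total = 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term for df = 1, plus one correction term per extra pair of dimensions.
    total = math.erfc(math.sqrt(half))
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (i - 0.5) / math.gamma(i + 0.5)
    return total


def select_model(dims, loglik, alpha=0.05):
    """Hypothetical selection rule: walk the nested sequence from smallest to largest
    and move to a larger model only when the likelihood ratio statistic against the
    currently chosen model is significant at level alpha."""
    chosen = 0
    for k in range(1, len(dims)):
        stat = 2.0 * (loglik[k] - loglik[chosen])  # likelihood ratio statistic
        extra = dims[k] - dims[chosen]             # number of additional dimensions
        if chi2_sf(stat, extra) < alpha:
            chosen = k
    return chosen
```

For example, with made-up values `dims = [2, 4, 8, 16]` and `loglik = [-120.0, -105.0, -103.5, -101.0]`, the call `select_model(dims, loglik)` returns index 1: moving from 2 to 4 dimensions is highly significant, while the further gains are not. Because the sequence is nested, no model is re-estimated when `alpha` changes; only the comparison is repeated.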
Comparison of this method with feed-forward architectures (linear and non-linear) on several data sets suggests that it can achieve superior results on both small and large data sets.
References