Brown

Estimating Model Complexity
via Minimal Description Length

Nathan Intrator

Tel-Aviv University and Brown University

www.math.tau.ac.il/~nin



Model complexity definition

Given a functional relationship between d and y defined by the joint probability distribution P(d, y) the optimal solution c is given by c(d) = Eyy(d).

We seek ways to estimate the goodness of a family of models Fw(d).


How about this error surface:

  • While the surface is very smooth, it is impossible to get to the true minima.

  • Suggests that models that penalize on smoothness may be misleading.

  • Breiman (1992) has shown that even in simple problems and simple nonlinear models, the degree of generalization is a strong dependent on the stability of the parameters.

    
    
    
    
    Standard approach:

    • Optimize concurrently the likelihood or mean squared error together with a complexity penalty.

    • Some penalties: norm of the weight vector, smoothness, number of terminating leaves (in CART), variance weights, cross validation... etc.

    • Spend most computational time on optimizing the parameter solution via sophisticated Gradient descent methods or even global-minimum seeking methods.

    Alternative approach:

    • Support vectors: Create a representation that leads to a very simple optimization algorithm.

    
    
    
    

    Affecting the error surface:
    Hybrid Unsupervised/Supervised Architecture

    Combined classification/reconstruction architecture

    
    

    Remarks on optimal code

    • In 1961 Horace Barlow suggested the bottleneck approach for finding optimal neuronal code and redundancy reduction.

    • This was supported later by Atick and Redlich, 1992 for visual pathway preprocessing.

    • Barlow later suggested that neurons should carry out an additional task.

      • While neurons should transmit transmit the probability of occurrence of the features they learn to detect;

      • They should also relay the prior probability of the feature they detect.

    • This follows from his seminal work on minimal entropy codes and unsupervised learning Barlow, 1989.
    Mathematical details
    
    
    Addressing the computational complexity

    • Ensemble of experts can give a rough estimation of the posterior distribution.

    • Start with a narrow Gaussian distribution in weight space

    • Start with a mixture of Gaussian's weight distribution (Hinton and Nowlan, 1992) or with a distribution that is set by other means such as PCA etc.
    
    
    
    
    
    
    
    
    
    Model selection properties of the MDL approach

    • Prefers models with a wide basin of attractions that lead to admissible local minima. Reduced sensitivity to the model parameters (flat minima).

    • It prefers non-identifiable models, namely models with many local minima.

    • While smoothness is a key issue, it rejects models with a Golf-course type of smoothness where the local minimum is difficult to find.

    • Results in an ensemble of experts that is more robust to weight changes, and to partial damage to the ensemble.

    • The large ensemble can give some approximation to the posterior probability of weights.
    
    
    
    
    
    
    Further properties of the MDL approach

    • A consequence of the smoothness and wide basin of attraction is that a simple learning rule should suffice.

    • Most of the computation time is spent on finding a good architecture and not on finding a good solution.

    • Finding a good architecture, can be done in parallel making a good use of the multitude of neurons.
    
    
    
    
    
    
    
    
    
    Estimating the decoding complexity of a given representation (code)

    • Discussed so far the complexity of model (learning)

    • Following the above formulation, estimating code complexity is a natural extension;

    • Given the information that is required to be extracted from the code, one can build a scheme that attempt to extract that information and estimate the complexity of that scheme as described above.
    
    
    
    
    
    
    
    
    
    
    Related work