Hi, in another thread John mentioned the connection between two different formulations of Occam's Razor in modelling. It's one of those things where I can sort of see that there ought to be a connection, but I'm not completely clear on the details.
One way of describing the MDL principle is as a trade-off between machinery and parameters. For example, suppose you've got a set of data and you want to find the model with the best generalization ability. You can do this by choosing some class of models, say Gaussian distributions centred at various points. Then for a collection of $N$ Gaussians you can specify a point by specifying which Gaussian you're using and then giving some finite-length binary description of the parameter value. Since Gaussians are mostly concentrated near the middle, you need fewer bits to specify a point near the centre of a Gaussian, while for points that are much further from the centre you need more bits. So there's a trade-off between using a lot of bits to describe a point relative to a Gaussian that's far away, or actually putting a Gaussian close to your point (which requires bits to specify its centre) and then using far fewer bits to describe it that way. Generally it requires slightly more bits to specify part of the machinery than a parameter (since the machinery is generally complex), but not hugely so. The rough idea behind MDL is that the "machinery" can get used for all points while a parameter only describes one individual point. So bits allocated to "machinery" are more effective if they'll be used for multiple points, but if you've got just one point that's not too far "outlying" then it's more effective just to spend quite a few bits on its parameter.
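To make the bit-counting concrete, here's a rough numerical sketch of that trade-off (my own toy construction, not anything from John's thread): one-dimensional data, unit-variance Gaussians, a fixed number of bits per centre as the "machinery" cost, and ideal code lengths $-\log_2 p$ for the points.

```python
# Toy two-part MDL calculation: bits for the machinery (Gaussian centres)
# plus bits for the data coded relative to that machinery.
# Assumptions: 1-D data, unit-variance Gaussians, centres cost a fixed
# number of bits each, and each point is coded by naming its nearest
# centre and then paying the ideal code length -log2 p(x) under it.
import numpy as np

def description_length(data, centres, centre_bits=16):
    """Total bits = machinery bits (one centre_bits charge per Gaussian)
    + data bits (index of the chosen Gaussian plus -log2 of its density)."""
    centres = np.asarray(centres, dtype=float)
    machinery = centre_bits * len(centres)
    index_bits = np.log2(len(centres)) if len(centres) > 1 else 0.0
    data_bits = 0.0
    for x in data:
        # pick the centre giving the shortest code for this point
        log_p = -0.5 * (x - centres) ** 2 - 0.5 * np.log(2 * np.pi)
        data_bits += index_bits + (-np.max(log_p)) / np.log(2)
    return machinery + data_bits

np.random.seed(0)
# Data clustered around 0, plus one far-away point.
data = np.concatenate([np.random.normal(0.0, 1.0, 50), [12.0]])

# Option A: one Gaussian, pay a lot of parameter bits for the far point.
# Option B: add a second Gaussian near it, pay extra machinery bits instead.
print("one centre :", description_length(data, [0.0]))
print("two centres:", description_length(data, [0.0, 12.0]))
```

With one far point the extra centre roughly pays for itself; with the same point but only a handful of bits saved it wouldn't, which is the trade-off described above.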
This embodies the often-stated form of Occam's razor in the sense that the continuous analogue of "not multiplying entities needlessly" is "not adding entities which don't pay for themselves in terms of the reduction in parameter-specification complexity". It's a bit less clear to me how fitting the maximum entropy distribution connects with this: presumably a distribution with any more "features" than necessary is adding new entities, and the Gibbs distribution that maximises the entropy is the most featureless distribution fitting the data that you can get?
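For what it's worth, the sense in which the Gibbs distribution is "most featureless" can be made precise by the usual Lagrange-multiplier calculation (a textbook derivation, not anything specific from John's thread): the only structure in the maximum entropy distribution comes from the features you explicitly constrain.

```latex
% Maximise entropy subject to normalisation and to matching the observed
% averages F_i of whatever features f_i you decide to include:
\[
  \max_p \; -\int p(x)\,\log p(x)\,dx
  \quad\text{subject to}\quad
  \int p(x)\,dx = 1, \qquad
  \int f_i(x)\,p(x)\,dx = F_i .
\]
% Stationarity of the Lagrangian,
% -\log p(x) - 1 - \lambda_0 - \sum_i \lambda_i f_i(x) = 0,
% gives the Gibbs (exponential-family) form:
\[
  p(x) \;=\; \frac{1}{Z(\lambda)}\,\exp\!\Big(-\sum_i \lambda_i f_i(x)\Big),
  \qquad
  Z(\lambda) \;=\; \int \exp\!\Big(-\sum_i \lambda_i f_i(x)\Big)\,dx .
\]
```

On that reading each constrained feature $f_i$ brings along exactly one new parameter $\lambda_i$, so "don't add features that don't pay for themselves" does look like the same razor, though I'd be glad of a sharper statement of the connection.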