
Anyone already looked at computing covariance matrices of large datasets accurately?

Quick random question: I don't expect anyone to have come across a solution to this problem, but just in case they have...

Numerical floating point work is full of pitfalls where you can lose almost all accuracy; one of them is summing/sum-squaring/etc. over a large set of small values (so that each individual value is likely to be small relative to the aggregate statistics). I'm thinking about how to compute a covariance matrix for a large set of items subject to the following requirements:

  1. Anything other than a "look at each data item once" approach is unlikely to be viable (i.e., not a two-pass algorithm).

  2. There will be covariances for several different data streams being calculated interleaved (i.e., it's not "here's a stream for one dataset, get its covariance, then move on to the next data stream").

However, I don't care about the covariance until everything has finished: it doesn't have to be available in finished form while the data is being processed.

This Wikipedia page contains an algorithm, but it maintains a finished covariance matrix at all times. I'm just wondering if there's anything that gives higher accuracy in exchange for some final post-processing at the end.
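To make the kind of thing I'm after a bit more concrete, here's a rough sketch (hypothetical Python, names made up, not tested on my actual data) of accumulating "shifted" raw sums in a single pass and only forming the covariance in a final post-processing step:

~~~~
import numpy as np

class ShiftedCovAccumulator:
    """One-pass accumulation; the covariance itself is only formed at the end.

    Each sample is shifted by the first sample seen (a rough stand-in for
    the mean), which keeps the accumulated sums small relative to the data
    and limits cancellation in the final subtraction.
    """

    def __init__(self, dim):
        self.n = 0
        self.shift = None                     # set to the first sample seen
        self.s = np.zeros(dim)                # sum of (x - shift)
        self.ss = np.zeros((dim, dim))        # sum of outer(x - shift, x - shift)

    def add(self, x):
        x = np.asarray(x, dtype=float)
        if self.shift is None:
            self.shift = x.copy()
        d = x - self.shift
        self.n += 1
        self.s += d
        self.ss += np.outer(d, d)

    def covariance(self):
        # the only "post-processing" step: never needed while streaming
        return (self.ss - np.outer(self.s, self.s) / self.n) / (self.n - 1)
~~~~

Kahan-compensated additions could presumably be bolted onto the two running sums if that on its own isn't enough.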

Comments

  • 1.

    I've done things that sound similar to this, but it is not exactly clear to me what you are trying to do.

    If you elaborate a bit, e.g. maybe provide an example, I might have some ideas (maybe).

  • 2.

    Here's the bigger context:

    I'm considering figuring out an approximation to a complex process in terms of a discretised probabilistic transition over a relatively large space (say $256^4=2^{32}$ states). (In equations I won't bother using boldface for vectors, but assume all states are vectors.) The "exact" transformation is given by

    $$p_t(x_i)=\sum_{x_j\in \Omega} p_{t-1}(x_j) p(x_i | x_j)$$ (where for current purposes let's say we only care about one timestep). By simple direct simulation I can calculate a good approximation to the scalar value $p(x_i | x_j)$ for a particular choice of $i$ and $j$. But there's no way that the full $2^{32} \times 2^{32}$ state space can be fully evaluated. Maybe simulating $2^{32}$ elements is possible in reasonable time, which is a large number but still a tiny fraction of the whole state space.

    So one common approximation would be to assume that the transition function is reasonably "well-behaved" and to approximate it in terms of a sum of Normals (i.e. Gaussians)

    $$p(x_i|x_j) \sim \sum_k w_k N(\mu_k,\sigma_k)$$ There's no deep reason behind this; it's just a choice that is capable of representing any distribution in the limit of infinitely many Normals (I think; maybe there are some constraints to rule out really pathological cases), and it's easy to see how to estimate this just from keeping track of _sufficient statistics_: decide which Normal a given sample comes from (ignore this complication for now) and compute the sample mean and sample covariance over all the calculated $\{p(x_i|x_j)\}$ that were decided to belong to that Normal. _In principle_ this is a well-understood, simple thing, but in practice it requires computing the covariance of a large dataset (let's say somewhere in the range of $2^{16}$ to $2^{20}$ sample values), and that's where floating point artifacts can build up (as terms are added to the sums) into large _relative accuracy_ errors as the number of samples increases. (To be clear: I'm not worried about being super-accurate; I'm worried about ending up with virtually no accuracy at all if I do the naive obvious thing: you can google floating point summation algorithms to see the issues that arise even in simple cases.)
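    Just to be explicit about where I think the problem bites, the naive bookkeeping I have in mind looks roughly like this (hypothetical Python; the "which Normal does this sample belong to" step is glossed over exactly as above):

    ~~~~
    import numpy as np

    # naive per-component sufficient statistics:
    # counts, sums of x, and sums of outer(x, x)
    acc = {}    # component index -> [n, sum of x, sum of outer(x, x)]

    def add_sample(k, x):
        x = np.asarray(x, dtype=float)
        n, sx, sxx = acc.setdefault(k, [0, np.zeros(x.size), np.zeros((x.size, x.size))])
        acc[k] = [n + 1, sx + x, sxx + np.outer(x, x)]

    def naive_mean_cov(k):
        n, sx, sxx = acc[k]
        mean = sx / n
        # this final subtraction is exactly where catastrophic cancellation
        # hits once the samples are large compared with their spread
        return mean, sxx / n - np.outer(mean, mean)
    ~~~~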

    Given the amount of effort that's going into the simulation stage, I'm trying to follow "best practice for floating point algorithms" to avoid blowing all that work on poor estimates of the normal distributions.

    Any insights or ideas gratefully accepted.

  • 3.

    Hi David,

    Sorry for the slow response. Work is crazy.

    So given a single $i$ and $j$, how do you calculate $p(x_i|x_j)$? I still don't quite understand.

    The first thing I'd try is to find ways to reduce the size of the space. I have never come across a situation that would truly require $2^{32}$ distinct states much less estimating a $2^{32}\times 2^{32}$ matrix. There simply does not exist enough data on the planet for applications I've seen to calibrate such a thing.

    What kind of application is this for? Have you considered a hierarchical approach? For example, start by calibrating a $10\times 10$ matrix of driving factors. Then for each driving factor, maybe a $10\times 10$ matrix of driving subfactors. Continue to as many levels as you need.

    One thing I have done is develop running computations of variances and covariances. It might help to note that covariances can be derived from variances, i.e.

    $cov_t(x_i,x_j) = \frac{1}{2} \left[ var_t(x_i + x_j) - var_t(x_i) - var_t(x_j)\right].$

    The variances can be expressed in terms of expectations

    $var_t(x_i) = k\left[E_t(x_i^2) - E_t(x_i)^2\right].$

    Finally, the expectation at time $t$ can be expressed in terms of values known at time $t-1$

    $E_t(x_i) = w_t x_{i,t} + E_{t-1}(x_i)$

    where $\sum_t w_t = 1$.

    Putting all this together, the covariance at time $t$ can be determined via functions known at time $t-1$ adjusted by new incoming data at time $t$.
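    Roughly, in code (just a sketch of the formulas above for a single pair of variables, taking the scale factor $k=1$ and feeding each observation in with its weight):

    ~~~~
    class RunningMoments:
        """Running weighted moments for one pair (x, y); the weights are
        assumed to sum to 1 over the whole stream (e.g. w_t = 1/T)."""

        def __init__(self):
            self.e_x = self.e_x2 = 0.0    # E[x],     E[x^2]
            self.e_y = self.e_y2 = 0.0    # E[y],     E[y^2]
            self.e_s = self.e_s2 = 0.0    # E[x + y], E[(x + y)^2]

        def add(self, x, y, w):
            # E_t = w_t * x_t + E_{t-1}
            self.e_x += w * x
            self.e_x2 += w * x * x
            self.e_y += w * y
            self.e_y2 += w * y * y
            s = x + y
            self.e_s += w * s
            self.e_s2 += w * s * s

        @staticmethod
        def _var(e, e2):
            return e2 - e * e             # var = E[x^2] - E[x]^2, with k = 1

        def covariance(self):
            # cov(x, y) = (var(x + y) - var(x) - var(y)) / 2
            return 0.5 * (self._var(self.e_s, self.e_s2)
                          - self._var(self.e_x, self.e_x2)
                          - self._var(self.e_y, self.e_y2))
    ~~~~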

    I'm not sure if this is helpful, but I'd be curious to hear your thoughts.

  • 4.

    Hi Eric, I can sympathise with the overwork situation... :-)

    The overall thing I'm trying to do is related to the discussion of emulators in [this thread](http://forum.azimuthproject.org/discussion/892/2/a-simple-online-climate-model/#Item_11).

    Suppose we're looking to increase our understanding of the kinds of behaviour of a model of a population of rabbits and wolves. Then you might have a "natural" state space of size $2^{16} \times 2^{16}=2^{32}$ even if you restrict the maximum individual population size to $2^{16}$. So with even a very simple transition from "previous population levels" $(r_t,w_t)$ to "new population levels" $(r_{t+1},w_{t+1})$

    a := max(r_t * w_t + N(0, w_t), 2^14)
    b := min((r_t - w_t) * N(w_t, w_t^2), r_t)
    r_{t+1} := a / b
    w_{t+1} := if b < 2 then 0 else b
    

    (made-up model for illustrative purposes) it's theoretically possible to simply compute $p((r_{t+1},w_{t+1})|(r_t,w_t))$ for each possibility. (At the moment I'm considering a pure simulation, nothing involving any kind of validation with actual real data.) In practice there are far too many to compute directly, so the idea is to figure out a "reduced approximation" that's more tractable. This is similar to Nathan's "emulators", where an expensive Monte Carlo simulation is evaluated at various points in parameter space and one then tries to approximate the value at points where the Monte Carlo simulation wasn't run. While I gather Nathan's simulations involve real measurements, there's no reason they have to. In a sense, I'm trying to build an "emulator" for the transition probability matrix because the whole thing is too big. The big difference is that I don't care about the individual elements in the transition matrix; rather, I want to analyse the higher-level approximation/emulator to figure out aggregate behaviour.

    Looking at the expression you can see the 2-D state to 2-D state probabilities are going to give a surface which is "piecewise continuous" (e.g., you'll get a transition in behaviour when the $max$ or $min$ operators "trigger"), and it's reasonable to expect that each of the "piecewise" sections of the surface is quite well-behaved (probably even analytic), so that an individual piece should be capable of being approximated quite well. However, just looking at the recipe it's not at all obvious how to determine these boundaries directly. Running a simulation you can figure out which "piecewise segment" a given transition $(r_{t+1},w_{t+1},r_t,w_t)$ has fallen into, then add it to the sufficient statistics for that segment. Then after simulating a lot of samples, you can hopefully create a reasonably good Gaussian mixture model from those sufficient statistics. However, this happens by computing the sufficient statistics from a large set of individual points added one at a time, and hence is apparently quite vulnerable to floating point problems giving you a completely meaningless answer if care isn't taken. (See e.g. [an example where "the obvious thing" gives a negative variance due to catastrophic cancellation](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Example).) (Obviously these effects are much less of a problem when dealing with "measured data" since, as you pointed out, it's unlikely/expensive to get enough data that these effects occur.)

    My vague thoughts at the moment are that if you also divide the state space into cubical cells and compute the sufficient statistics within each cell _relative to the cell centre_, that keeps the absolute values of the numbers small so that catastrophic cancellation doesn't occur. Then you can merge the Gaussians later in a straightforward way without the floating point problems. Hopefully...
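    To make that slightly more concrete, the kind of thing I have in mind is roughly this (hypothetical Python, untested; the cell size is an arbitrary made-up value, and the merge step is just the standard pairwise combination of counts, means and scatter matrices):

    ~~~~
    import numpy as np

    CELL = 2.0 ** 10   # made-up cell edge length

    def cell_centre(x):
        """Centre of the cubical cell containing x."""
        return (np.floor(x / CELL) + 0.5) * CELL

    class CellStats:
        """Mean and scatter matrix for one cell, accumulated relative to
        the cell centre so the working numbers stay small."""

        def __init__(self, centre):
            self.centre = np.asarray(centre, dtype=float)
            self.n = 0
            self.mean = np.zeros(self.centre.size)        # mean of (x - centre)
            self.m2 = np.zeros((self.centre.size,) * 2)   # scatter of (x - centre)

        def add(self, x):
            d = np.asarray(x, dtype=float) - self.centre
            self.n += 1
            delta = d - self.mean
            self.mean += delta / self.n
            self.m2 += np.outer(delta, d - self.mean)

    def merge(a, b):
        """Pairwise combination of (n, mean, scatter); the scatter matrix
        is shift-invariant, so the per-cell shifts drop out here."""
        n_a, mean_a, m2_a = a
        n_b, mean_b, m2_b = b
        n = n_a + n_b
        delta = mean_b - mean_a
        mean = mean_a + delta * (n_b / n)
        m2 = m2_a + m2_b + np.outer(delta, delta) * (n_a * n_b / n)
        return n, mean, m2

    def finalise(cells):
        """Merge every cell and return the overall (mean, covariance)."""
        parts = [(c.n, c.centre + c.mean, c.m2) for c in cells]
        total = parts[0]
        for p in parts[1:]:
            total = merge(total, p)
        n, mean, m2 = total
        return mean, m2 / (n - 1)
    ~~~~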
