Hi David,

Sorry for the slow response. Work is crazy.

So given a single $i$ and $j$, how do you calculate $p(x_i|x_j)$? I still don't quite understand.

The first thing I'd try is to find ways to reduce the size of the space. I have never come across a situation that truly requires $2^{32}$ distinct states, much less estimating a $2^{32}\times 2^{32}$ matrix, which has $2^{64}$ entries. In the applications I've seen, there simply is not enough data on the planet to calibrate such a thing.
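
To put a rough number on that, here is a back-of-the-envelope calculation in Python. The 8 bytes per entry is my own assumption (one 64-bit float per matrix element); nothing in your description fixes the storage format.

```python
# Rough storage cost of a dense 2^32 x 2^32 matrix.
n_states = 2 ** 32
n_entries = n_states ** 2            # 2^64 matrix entries
bytes_per_entry = 8                  # assumption: one 64-bit float per entry
total_bytes = n_entries * bytes_per_entry
total_eib = total_bytes / 2 ** 60    # convert bytes to exbibytes

print(f"{n_entries:.3e} entries, {total_eib:.0f} EiB")
# prints: 1.845e+19 entries, 128 EiB
```

Even before asking how much data you'd need to estimate each entry, just storing the thing densely is out of reach.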

What kind of application is this for? Have you considered a hierarchical approach? For example, start by calibrating a $10\times 10$ matrix of driving factors. Then for each driving factor, maybe a $10\times 10$ matrix of driving subfactors. Continue to as many levels as you need.

One thing I have done is develop running computations of variances and covariances. It might help to note that covariances can be derived from variances, i.e.

$cov_t(x_i,x_j) = \frac{1}{2} \left[ var_t(x_i + x_j) - var_t(x_i) - var_t(x_j)\right].$

The variances can be expressed in terms of expectations, up to a normalization constant $k$ ($k=1$ gives the population variance):

$var_t(x_i) = k\left[E_t(x_i^2) - E_t(x_i)^2\right].$

Finally, the expectation at time $t$ can be expressed in terms of values known at time $t-1$

$E_t(x_i) = w_t x_{i,t} + E_{t-1}(x_i)$

where $\sum_t w_t = 1$.

Putting all this together, the covariance at time $t$ can be determined via functions known at time $t-1$ adjusted by new incoming data at time $t$.
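
Here is a minimal Python sketch of how the three formulas chain together for a single pair of series. The class name `RunningCovariance`, its methods, and the toy data are all made up for illustration, and I'm assuming equal weights $w_t = 1/T$ and $k = 1$; treat it as one possible reading of the recursion rather than a finished implementation.

```python
class RunningCovariance:
    """Running covariance of two series via cov = 1/2 [var(x+y) - var(x) - var(y)].

    Hypothetical helper for illustration only.
    """

    def __init__(self, k=1.0):
        self.k = k  # normalization constant; k = 1 is an assumption
        # Running weighted sums E_t(.) for x, y, s = x + y and their squares.
        self.e = {"x": 0.0, "y": 0.0, "s": 0.0,
                  "xx": 0.0, "yy": 0.0, "ss": 0.0}

    def update(self, w, x, y):
        """Fold in one observation: E_t(.) = w_t * value_t + E_{t-1}(.)."""
        s = x + y
        for name, value in (("x", x), ("y", y), ("s", s),
                            ("xx", x * x), ("yy", y * y), ("ss", s * s)):
            self.e[name] += w * value

    def _var(self, first, second):
        # var = k * [E(z^2) - E(z)^2]
        return self.k * (self.e[second] - self.e[first] ** 2)

    def cov(self):
        # cov(x, y) = 1/2 * [var(x + y) - var(x) - var(y)]
        return 0.5 * (self._var("s", "ss")
                      - self._var("x", "xx")
                      - self._var("y", "yy"))


# Toy data, invented for the example.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]

rc = RunningCovariance()
for x, y in zip(xs, ys):
    rc.update(1.0 / len(xs), x, y)  # equal weights that sum to 1

running_cov = rc.cov()
print(running_cov)
```

Only six running numbers are stored per pair, and each new observation costs a handful of multiply-adds. One caveat with this literal reading: the estimate is only properly normalized once the weights seen so far sum to one, so intermediate values should be interpreted with care.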

I'm not sure if this is helpful, but I'd be curious to hear your thoughts.