[Here's the comment I couldn't post this morning]

Yes, I'm trying to apply the stuff from the blog from tiny datasets
(where it takes a couple of minutes on my laptop) to a vastly bigger
dataset, namely the El Nino temperatures. In particular, if you take
the correlation between two (not necessarily adjacent) points, then with $N$
total points you get $N \times (N-1)$ ordered pairs. If you look at the
minimum and maximum normalized correlation between the first point (1) now, (2) 3 months ago and (3)
6 months ago and the second point (1) at the same time, (2) 3 months
preceding and (3) 6 months preceding, you get $(3\times 3-1)\times 2=16$
possibilities per pair.
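
For what it's worth, here's that counting spelled out in Python; the offsets are read straight from the description above, and which of the nine lag combinations gets dropped to give the $-1$ isn't something the code needs to know:

```python
from itertools import product

N = 24                                   # grid points
ordered_pairs = N * (N - 1)              # ordered pairs of distinct points: 552
offsets = [0, 3, 6]                      # months back: now, 3 months, 6 months
lag_combos = len(list(product(offsets, repeat=2))) - 1   # 3*3 - 1 = 8
features_per_pair = 2 * lag_combos       # min and max correlation -> 16
print(ordered_pairs, features_per_pair, ordered_pairs * features_per_pair)
# 552 16 8832
```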

So the input data -- the feature data, if you will -- for 24 points is
either a $552\times 16$ matrix $X$ or an $8832$-element vector $x$
(depending on whether you "concatenate" it or not). Suppose
that through discussion we can figure out some plausible "real number"
output $y$. Then my plan is to try to generate predictors of the form
(there's a small code sketch after the list)

1. $\hat{y}=a^T x+c$ if doing linear regression.
2. $\hat{y}=\sum_{i=1}^{P}a_i^T X b_i + c$ if doing bilinear regression.
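
To make the shapes concrete, here's a minimal sketch of both predictor forms in Python/numpy. The dimensions come from the paragraph above; the random arrays are just placeholders, not fitted parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# 552 ordered pairs x 16 lag/extremum combinations for one time step,
# flattened to an 8832-element vector for the linear case.
X = rng.standard_normal((552, 16))    # feature matrix
x = X.ravel()                         # concatenated feature vector

# 1. Linear predictor: y_hat = a^T x + c
a = rng.standard_normal(x.size)
c = 0.1
y_hat_linear = a @ x + c

# 2. Bilinear predictor with P rank-one terms: y_hat = sum_i a_i^T X b_i + c
P = 3                                 # number of bilinear vector pairs
A = rng.standard_normal((P, 552))     # the a_i, one per row
B = rng.standard_normal((P, 16))      # the b_i, one per row
y_hat_bilinear = sum(A[i] @ X @ B[i] for i in range(P)) + c
```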

However, with different kinds of sparsity prior, plus a variable number
of bilinear vectors $P$ (as far as I'm aware, no-one has yet shown that
they "nest" the way PCA vectors do), that's $6P$ models to learn
on what's quite a medium-size feature vector. (By "big data" standards that's not huge, but the people who do that kind of thing have big clusters to run on and are using loss functions with known properties that make efficient solution possible, neither of which is true for me.)

The things I'm looking at are (combinations of) models that have been
published quite extensively in the last two or three years. As such, they're known, but
not yet at the point where there's easily available software to
solve them. Part of my reason for focusing on various kinds of
variously sparsified regression is that it's an area where I
understand the model structure and how to sparsify it without doing
additional cross-validation. In addition, I'm hoping that I can
reproduce the "division into two sets" test strategy that Ludescher et
al used, so that the comparison is quite direct.
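
For the very simplest member of that family, plain L1-sparsified linear regression with a Ludescher-style two-set split, off-the-shelf code does exist (it's the structured and bilinear variants that need custom solvers). A sketch, with random placeholder data standing in for the real features and the yet-to-be-chosen $y$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Placeholder data: one 8832-element feature vector per month, plus a
# made-up target y (the real y is exactly what's under discussion above).
n_months = 400
features = rng.standard_normal((n_months, 8832))
y = rng.standard_normal(n_months)

# Ludescher-style two-set split: fit on the first half of the record and
# score on the second, instead of a random train/test/validation split.
half = n_months // 2
model = Lasso(alpha=0.1)   # alpha sets the L1 penalty, i.e. the sparsity
model.fit(features[:half], y[:half])
print("nonzero coefficients:", np.count_nonzero(model.coef_))
print("held-out R^2:", model.score(features[half:], y[half:]))
```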

One of the things that makes me a bit
hesitant to look at neural nets, decision forests, etc., at this point
is that I don't understand them well enough to sparsify them
effectively without essentially needing training, test and
validation sets. That would mean dividing the data
into 3 parts, making it more difficult to compare performance
directly. (Other people might very well understand how to use decision
forests, etc., for this without splitting the data three ways, but I don't.)

Yes, there are problems with taking a 7-month average as a proxy for 5
months where the "3-month average" is above some threshold. I'd be very
interested if anyone has ideas for a **non-binary** statistic that
could be used instead as a prediction variable. One
possibility would be the count of months within the next
5 for which the El Nino 3.4 index is above a threshold, although that still
piles a lot of different feature vectors onto the target value 0. (In case
it's not obvious, the reason I particularly care about this in the
context of (normal|bi-)linear regression is that the regression
objective tries equally hard to hit all of the "target
outputs" you give it, so if there's a heavy concentration onto one
value then it will be heavily biased towards a linear
function that produces that value for lots of inputs, which, as
I'm sure you can visualise, is quite "unnatural".)
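
For concreteness, here's what that count statistic would look like; the 0.5 threshold is just the conventional El Nino cutoff and may well not be the right choice here:

```python
import numpy as np

def months_above_threshold(nino34, threshold=0.5, horizon=5):
    """For each month t, count how many of the next `horizon` months have
    the Nino 3.4 index above `threshold` (0.5 C is the usual convention,
    but the right value for this purpose is an open question)."""
    n = len(nino34)
    counts = np.empty(n - horizon, dtype=int)
    for t in range(n - horizon):
        counts[t] = np.sum(nino34[t + 1 : t + 1 + horizon] > threshold)
    # An integer in 0..horizon: less lumpy than a binary flag, though it
    # still piles many feature vectors onto the value 0.
    return counts
```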