John wrote:

> But I don’t feel that I can say “write code that uses a random forest method to predict El Niños \$n\$ months in advance” and expect one of you to do that in a week.

Daniel wrote:

> This is not that hard.

Wow! I'm really happy that you feel that way, because then we might accomplish something by December 1st!

> The hard part is formulating the problem.

That's the part I can do. I'm not really an expert on climate science, so I can't promise to do a great job - but at least I have specific problems in mind, and I can start by describing some easy ones!

> deciding what signal(s) to predict

I want to predict the Niño 3.4 index, which is one number per month. The data is available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/nino3.4-anoms.txt). This is data from the Climate Prediction Center of the National Weather Service, covering January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM.
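Here's a minimal sketch of how one might read that file in Python. I'm assuming it is whitespace-separated text with a header row, and that the year and month columns are called YR and MON; only the ANOM column name is something I've stated above, so check the others against the actual file. The rows in `sample` are placeholders I made up to show the layout, not real index values.

```python
def parse_nino34(text):
    """Map (year, month) -> ANOM from whitespace-separated text with a header row.

    Assumes header names YR, MON and ANOM; YR and MON are guesses.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    yr, mon, anom = header.index("YR"), header.index("MON"), header.index("ANOM")
    return {(int(r[yr]), int(r[mon])): float(r[anom]) for r in rows}


# Placeholder rows in the assumed layout. These numbers are NOT real data.
sample = """\
YR MON TOTAL ClimAdjust ANOM
1950 1 0.00 0.00 0.50
1950 2 0.00 0.00 0.25
"""

nino34 = parse_nino34(sample)
print(nino34[(1950, 2)])
```

Keying the result by (year, month) makes it easy to look up the target value for any given month later on.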

> from what data

For starters, I'd like to predict it from the "average link strength", a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt).

This file has the average link strength, called \$S\$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.)
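To line these day numbers up with calendar months, we need to turn a day number into a date. Here's a sketch of two possible conventions. The first counts ordinary calendar days, leap days included. The second treats every year as 365 days and skips 29 February; I mention it because 730 = 2 × 365, which hints that the computation may have counted that way, but that is a guess on my part and should be checked against the code that produced the file. The function name and the `skip_leap_days` flag are just my own choices.

```python
from datetime import date, timedelta

DAY_ONE = date(1948, 1, 1)  # day 1, as stated above


def day_to_date(day, skip_leap_days=False):
    """Convert a day number (day 1 = 1 January 1948) to a calendar date."""
    if not skip_leap_days:
        # Ordinary calendar days, leap days included.
        return DAY_ONE + timedelta(days=day - 1)
    # Assumed alternative: every year has 365 days and 29 February is skipped.
    years, day_of_year = divmod(day - 1, 365)
    ref = date(1949, 1, 1) + timedelta(days=day_of_year)  # 1949 is not a leap year
    return date(1948 + years, ref.month, ref.day)


sample_days = list(range(730, 12040 + 1, 10))
print(len(sample_days))                        # 1132
print(day_to_date(730))                        # 1949-12-30
print(day_to_date(730, skip_leap_days=True))   # 1949-12-31
```

So the file should contain 1132 values of \$S\$, and the two conventions already disagree by a day at the very first sample.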

The first goal is to predict the Niño 3.4 index for any given month starting from the average link strengths during some roughly fixed-length interval of time ending around \$m\$ months before the given month.

The second goal is to measure how good the prediction is and see how it gets worse as we increase \$m\$.

(Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this:

We predict the Niño 3.4 index for a given month given the values of \$S\$ over the \$k\$ 10-day periods whose final one occurs at the end of the month \$m\$ months before the given month. For example, if \$m = 1\$, \$k = 3\$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January.

Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, that would be fine too.)
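The idea in the note above can be sketched like this. I'm assuming the day numbers count ordinary calendar days with day 1 = 1 January 1948, and I'm reading "ending at the end of the month" as "the last \$k\$ samples taken on or before the last day of that month". The function names are hypothetical, and the sample values of \$S\$ below are placeholders (each one is just set to its own day number so you can see which samples get picked).

```python
import calendar
from datetime import date

DAY_ONE = date(1948, 1, 1)  # day 1


def month_end_day(year, month, m):
    """Day number of the last day of the month m months before (year, month)."""
    y, mo = divmod(year * 12 + (month - 1) - m, 12)
    mo += 1
    last = calendar.monthrange(y, mo)[1]  # number of days in that month
    return (date(y, mo, last) - DAY_ONE).days + 1


def window(samples, year, month, m, k):
    """Last k values of S sampled on or before that month end.

    samples is a list of (day, S) pairs sorted by day.
    Returns None if fewer than k samples are available.
    """
    cutoff = month_end_day(year, month, m)
    past = [s for d, s in samples if d <= cutoff]
    return past[-k:] if len(past) >= k else None


# Placeholder S values: each sample's "S" is its own day number, NOT real data.
samples = [(d, float(d)) for d in range(730, 12040 + 1, 10)]

# Predicting February 1950 with m = 1 and k = 3 uses the samples up to 31 January.
print(month_end_day(1950, 2, 1))        # 762
print(window(samples, 1950, 2, 1, 3))   # [740.0, 750.0, 760.0]
```

Returning `None` when there isn't enough history lets the caller drop those months from the training set instead of padding them.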

> what loss function to use

I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good?

Roughly speaking, I'd like predictions that minimize the time average of

|predicted Niño 3.4 index - observed Niño 3.4 index|\${}^2\$

I bet you can give me ideas here. I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
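For concreteness, here's a sketch of the quantity I have in mind, which I believe is usually called the mean squared error. In the usual convention a loss gets bigger as the prediction gets worse, so this is something to minimize. I've also included the error of always predicting the average observed value, as one possible baseline to compare against; that comparison is my own suggestion. The numbers in the example are made up, not real index values.

```python
def mean_squared_error(predicted, observed):
    """Average of (predicted - observed)^2 over all months."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)


def constant_baseline_error(observed):
    """Error of always predicting the mean of the observed values."""
    mean = sum(observed) / len(observed)
    return mean_squared_error([mean] * len(observed), observed)


# Made-up numbers, NOT real Niño 3.4 values.
observed = [0.5, -0.5, 1.0, -1.0]
predicted = [0.0, 0.0, 1.0, -1.0]

print(mean_squared_error(predicted, observed))   # 0.125
print(constant_baseline_error(observed))         # 0.625
```

Computing this on months that were not used to fit the model, for each value of \$m\$, would show how the prediction gets worse as \$m\$ increases.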