Dara wrote:

> I do not mind burning the midnight oil and try some novel code or just do something simple, as you wish.

Great! I hope we can do something like this. A few weeks ago I said:

> But I don’t feel that I can say “write code that uses a **random forest method** to predict El Niños \$n\$ months in advance” and expect one of you to do that in a week.

Daniel wrote:

> This is not that hard.

So I thought it could be done in a week, but obviously it hasn't been done yet!

Daniel wrote:

> The hard part is formulating the problem.

and so I tried to start formulating it. I hoped that my first try would give rise to more questions that would help me formulate it more precisely. But Daniel Mahler didn't ask any more questions. So maybe you can? Maybe David Tweed can?

First:

> deciding what signal(s) to predict

I want to predict the **Niño 3.4 index**, which is one number per month. The data is available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/nino3.4-anoms.txt). This is data from the Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM.

> from what data

For starters, I'd like to predict it from the **average link strength**, a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt).

This file has the average link strength, called \$S\$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.)
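To make the day indexing concrete, here is a small Python sketch (the helper name `day_to_date` is mine, not something from the repository) that converts these 1-based day indices into calendar dates, taking day 1 to be January 1, 1948 as stated above:

```python
from datetime import date, timedelta

START = date(1948, 1, 1)  # "day 1" in the average-link-strength file

def day_to_date(day):
    """Convert a 1-based day index into a calendar date."""
    return START + timedelta(days=day - 1)

# the S series is sampled every 10 days, from day 730 through day 12040
sample_days = list(range(730, 12041, 10))
first, last = day_to_date(sample_days[0]), day_to_date(sample_days[-1])
```

Since 1948 is a leap year, day 730 works out to December 30, 1949.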

The first goal is to **predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around \$m\$ months before the given month**.

The second goal is to **measure how good the prediction is, and see how it gets worse as we increase \$m\$**.

(Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this:

We predict the Niño 3.4 index for a given month from the values of \$S\$ over \$k\$ consecutive 10-day periods, the last of which ends at the end of the month \$m\$ months before the given month. For example, if \$m = 1\$, \$k = 3\$ and the given month is February, we must predict the February Niño 3.4 index from the 3 average link strengths for 10-day periods, the last of which ends at the end of January.

Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, that would be fine.)
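Here's one way that alignment could be sketched in Python. The function name and the exact convention (take the last 10-day sample ending on or before the month's last day, then the \$k-1\$ samples before it) are my own guesses, not settled choices:

```python
from bisect import bisect_right
from datetime import date, timedelta

START = date(1948, 1, 1)  # "day 1" in the average-link-strength file

def month_end(year, month):
    """Last calendar day of the given month."""
    if month == 12:
        return date(year, 12, 31)
    return date(year, month + 1, 1) - timedelta(days=1)

def features_for_month(s_days, s_values, year, month, m=1, k=3):
    """Pick the k link strengths from consecutive 10-day periods, the last
    of which ends on or before the end of the month m months before
    (year, month).  s_days: sorted 1-based day indices of the S samples;
    s_values: the matching S values."""
    total = year * 12 + (month - 1) - m           # step back m months
    prev_year, prev_month0 = divmod(total, 12)
    cutoff = (month_end(prev_year, prev_month0 + 1) - START).days + 1
    i = bisect_right(s_days, cutoff)              # samples ending by the cutoff
    if i < k:
        raise ValueError("not enough history before the target month")
    return s_values[i - k:i]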

> what loss function to use

I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as "good"?

Roughly speaking, I'd like predictions that minimize the time average of

\$|\text{predicted Niño 3.4 index} - \text{observed Niño 3.4 index}|^2\$

But I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
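For what it's worth, a loss function gets big when the prediction is bad, and the time average of squared error above is exactly what machine learning people call the mean squared error: the standard loss for regression. A minimal sketch of the random-forest approach, assuming scikit-learn and using toy stand-in data (the real rows would be the aligned link strengths and ANOM values described above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Toy stand-ins: each row holds k = 3 link strengths, y is the Niño 3.4
# anomaly for the target month.  Replace with the real aligned pairs.
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.8]) + rng.normal(scale=0.1, size=200)

# Chronological split: fit on earlier months, evaluate on later ones,
# since a random train/test split would leak future information.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
```

Repeating this for several values of \$m\$ and plotting the test-set mean squared error against \$m\$ would address the second goal directly.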