Comment Source:WebHubTel,
Your analysis of what the software outputs is truly grand, and you are gifted.
I also need John's theoretical mind, because I do not believe these atmospheric and oceanic systems are any simpler than particle systems. I suspect they are even more complex.
Dara
Comment Source:Thanks, but not gifted at all; more "cursed" by persistent doggedness :)
Another Next Step to consider is the analysis of historical proxy records that have been calibrated to ENSO. One of the lead researchers on this is Kim Cobb from Georgia Tech, who several years ago analyzed old-growth coral from Palmyra Island in the equatorial Pacific. By correlating the isotopic oxygen content of the newer-growth coral against recent SST records, she discovered that the two can be calibrated, and thus the oxygen-isotope content makes a suitable ENSO proxy record going back in time.
<http://www.ncdc.noaa.gov/paleo/pubs/cobb2003/cobb2003.html>
What I tried to do with the Palmyra proxy results is see whether I could use my sloshing-derived equations to model the data with approximately the same parameters I have been using for my SOI fit. This is the result, which I finished a couple of weeks ago:
<http://contextearth.com/2014/06/25/proxy-confirmation-of-soim/>
The filtering is somewhat tricky, because there is both a seasonal signal and a longer-term PDO-like trend riding along with the data. But otherwise, I was surprised by how closely the records from hundreds of years ago match in form what is being measured now. It suggests how stationary the ENSO time series may be over many centuries.
All of the data is available from that first NOAA link in case someone wants to look at it.
Comment Source:I like to do a lot of these sorts of experiments with data, to gain experience and be ready for the real stuff coming down the pipe. So please suggest ideas.
Comment Source:Perhaps you might get upset at me for saying this, but I looked at your differential equations and I think they need more terms and cross-terms. They also need decaying exponentials.
Comment Source:I also think the differential equations are better suited if you break the time into two components, {year, month}.
In the .mml equations I gave from the SVR's Gaussian wavelet, you see {x1, x2} accordingly.
You do not need any mod-12 or discrete year; assume two continuous variables.
My hunch is that the equations produce much better models/solutions if time is broken into one periodic variable and one linear variable.
You could also add the leap year as a parameter by itself, 1 or 0, and maybe other parameters like the number of sunspots and so on: {x1, x2, x3, ...}, and I could compute you a closed-form equation from the SVR regressor just the same.
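Here is a minimal sketch of one way to read this {linear, periodic} time split as SVR inputs. The synthetic series, the RBF kernel, and the exact encodings are all assumptions made for illustration; the .mml equations mentioned above came from an SVR with a Gaussian wavelet kernel, which scikit-learn does not ship.

```python
# Sketch: time t (in months) split into a linear component x1, a
# periodic within-year component x2, and a leap-year flag x3, fed to
# an SVR. Synthetic data; all encodings here are illustrative choices.
import numpy as np
from sklearn.svm import SVR

t = np.arange(600)                          # months since January 1950
x1 = t / 12.0                               # linear: years elapsed
x2 = (t % 12) / 12.0                        # periodic: phase within the year
year = 1950 + t // 12
x3 = (((year % 4 == 0) & (year % 100 != 0)) | (year % 400 == 0)).astype(float)
X = np.column_stack([x1, x2, x3])

# Toy target: a seasonal cycle riding on a slow trend, plus noise.
rng = np.random.default_rng(0)
y = np.sin(2 * np.pi * x2) + 0.05 * x1 + 0.1 * rng.standard_normal(t.size)

model = SVR(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
print("in-sample R^2:", model.score(X, y))
```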
Comment Source:No, not upset at all. I did not include a first-order drag term to represent viscous dissipation, since one simplifying approximation is to assume an inviscid environment. The first-order terms would definitely add exponential damping to the oscillating waveforms.
As for cross-terms, I will see what the sloshing literature has on these.
I definitely agree that the monthly and multi-year behaviors coexist, but the question is how best to isolate these behaviors and whether superposition properties actually hold.
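For concreteness, here is the effect of such a drag term on a generic forced oscillator; the specific sloshing equation is not written out in this thread, so take this as a schematic form, not the actual model:

$\ddot{x} + 2\gamma\,\dot{x} + \omega_0^2\, x = F(t)$

The first-order term $2\gamma\,\dot{x}$ turns the free solutions $e^{\pm i\omega_0 t}$ into $e^{-\gamma t}\, e^{\pm i\omega t}$ with $\omega = \sqrt{\omega_0^2 - \gamma^2}$, which is exactly the decaying-exponential behavior mentioned above; the inviscid approximation corresponds to $\gamma = 0$.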
Comment Source:>I definitely agree that the monthly and multi-year behaviors coexist, but the question is how best to isolate these behaviors and whether superposition properties actually hold.
Maybe a system of differential equations, as opposed to one, each governing the behaviour of a sub-trend of the original data.
Comment Source:Pertaining to #46
An interesting suggestion from a commenter at my blog is that some of the sub-monthly periods identified by machine learning are due to the non-uniform sampling rate, as monthly data violates the Nyquist criterion because of changing month lengths. He further suggested that non-uniform sampling can routinely extract signal at higher resolutions than uniform sampling, for example in variable star analysis.
Here is a typical research article:
"Variable Stars: Which Nyquist Frequency?"
L. Eyer and P. Bartholdi <http://arxiv.org/pdf/astro-ph/9808176.pdf>
And there are implications for machine learning in this chapter:
"Data Mining and Machine-Learning in Time-Domain Discovery & Classification"
J.S. Bloom and J.W. Richards <http://arxiv.org/pdf/1104.3142.pdf>
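To make the sampling point concrete, here is a minimal sketch using the standard tool for irregularly sampled series, the Lomb-Scargle periodogram (my choice of illustration; the Eyer-Bartholdi paper treats the underlying theory). The signal and jittered "month lengths" are made up:

```python
# Sketch: a period shorter than twice the mean sampling interval can
# still be probed from irregular samples via Lomb-Scargle; the jitter
# in the sampling should help the true peak stand out from its aliases.
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
t = np.cumsum(rng.uniform(28.0, 31.0, size=600))   # jittered "monthly" sampling, in days
true_period = 45.0                                 # days: under 2 samples per cycle
y = np.sin(2 * np.pi * t / true_period) + 0.2 * rng.standard_normal(t.size)

periods = np.linspace(20.0, 400.0, 4000)           # scan well past the naive Nyquist limit
power = lombscargle(t, y - y.mean(), 2 * np.pi / periods)
print("best-fit period: %.1f days" % periods[np.argmax(power)])
```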
Comment Source:I have been worrying a lot about what to do next in the El Niño Project.
It would be great if we could carry out some interesting machine learning or neural networks approach to El Niño by December 1st, because shortly after that I'm supposed to [give a talk about this stuff at the Neural Information Processing Seminar](http://forum.azimuthproject.org/discussion/1364/networks-in-climate-science/#Item_1).
However, I feel very unconfident of my/our ability to do anything very interesting in this direction by December. I could probably do it if I did nothing else between now and then. However, I'll be teaching 2 courses, including a seminar on network theory, and I really want to explain - in the seminar, and on the blog - the great new research my grad students have done. I also need to apply for some grants before mid-November. I would also like to finish writing a book - which is almost done, but everything takes longer than expected.
If people here were organized and willing to work on this project together, it might still get done in time. But I don't feel that I can say "write code that uses a random forest method to predict El Niños $n$ months in advance" and expect one of you to do that in a week. It just doesn't seem to be working that way. I could say a lot about why not, and how we might become better organized, but I won't now.
For now, I probably need to reduce the ambitiousness of my short-term plans.
So, I think I will try to better understand the network formed by connecting the Ludescher _et al_ sites in the Pacific with edges weighted by (various measures of) correlation between temperature histories. Graham Jones has already written a lot of the software needed to do this! He has already created a lot of graphs and plots that I haven't blogged about yet! There is a lot more that could be done... but I really just need to give a 1-hour talk and not get laughed off the stage.
The first step, then, is for me to blog about the work he's done.
If anyone else wants to produce some product that can help me give a good talk by December 1st, please let me know! You can say things like "I want to do this", or "tell me what to do".
I do not believe that wavelet or differential equation based models of El Niño will fit into this talk. Needless to say, our longer-term goals for after December can continue to be extremely ambitious, and they can include such models. It's just this deadline that's giving me an ulcer.
Comment Source:>If anyone else wants to produce some product that can help me give a good talk by December 1st, please let me know! You can say things like “I want to do this”, or “tell me what to do”.
John, this is what I could do to help out for that talk:
Make you a fancy interactive CDF to explain the concepts for the talk, the assumption being that you have a good machine to run it on.
Parallelization ideas for the neural network computations; I might give you code snippets in OpenMP and CUDA. Without such parallelization, neural computation is of no practical use.
Iffy ideas: how to prepare volumetric data for neural computation.
Dara
Comment Source:In general I propose regularly published analyst reports on math algorithms, software, visualization, and parallelization, applied to actual satellite atmospheric and oceanic data.
Instead of publishing general academic papers, publish specific analyses of specific data.
This might not ring a bell for most, but I think it might for John, due to his interests in network theory.
Dara
Comment Source:> But I don’t feel that I can say “write code that uses a random forest method to predict El Niños n months in advance” and expect one of you to do that in a week.
This is not that hard. The hard part is formulating the problem: deciding what signal(s) to predict, from what data, and what loss function to use so that the results are meaningful.
It might be easy to predict some signals with a high $R^2$ by simply locking onto seasonal variation, which would be kind of trivial.
Once the problem specification is agreed on, the implementation should be fairly simple.
Consider the following [example](http://scikit-learn.org/dev/auto_examples/plot_multioutput_face_completion.html) from the scikit library documentation.
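In that spirit, here is a minimal sketch of what such an implementation looks like once the problem is pinned down: a random forest predicting a synthetic monthly signal a few steps ahead from lagged values. The data, lags, and split are all made up for illustration:

```python
# Sketch of the "implementation is simple" point: once target, features,
# and loss are fixed, the fit itself is a few lines of scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
signal = np.sin(np.arange(600) * 2 * np.pi / 12) + 0.3 * rng.standard_normal(600)

lag, horizon = 6, 3   # predict 3 steps ahead from the last 6 values
X = np.array([signal[i:i + lag] for i in range(len(signal) - lag - horizon)])
y = signal[lag + horizon:]

split = 450           # time-ordered split: never train on the future
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X[:split], y[:split])
print("test MSE:", mean_squared_error(y[split:], model.predict(X[split:])))
```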
Comment Source:John wrote:
> But I don’t feel that I can say “write code that uses a random forest method to predict El Niños $n$ months in advance” and expect one of you to do that in a week.
Daniel wrote:
> This is not that hard.
Wow! I'm really happy that you feel that way, because then we might accomplish something by December 1st!
> The hard part is formulating the problem.
That's the part I can do. I'm not really an expert on climate science, so I can't promise to do a great job - but at least I have specific problems in mind, and I can start by describing some easy ones!
> deciding what signal(s) to predict
I want to predict the Niño 3.4 index, which is one number per month. The data is available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/nino3.4-anoms.txt). This is data from the Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM.
> from what data
For starters, I'd like to predict it from the "average link strength", a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt).
This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.)
The first goal is to predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around $m$ months before the given month.
The second goal is to measure how good the prediction is and see how it gets worse as we increase $m$.
(Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this:
We predict the Niño 3.4 index for a given month given the values of $S$ over the $k$ 10-day periods whose final one occurs at the end of the month $m$ months before the given month. For example, if $m = 1$, $k = 3$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January.
Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, it could be fine.)
> what loss function to use
I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good?
Roughly speaking, I'd like predictions that minimize the time average of
$|\text{predicted Niño 3.4 index} - \text{observed Niño 3.4 index}|^2$
I bet you can give me ideas here. I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
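A minimal sketch of the alignment rule described above, to make it concrete. The file names come from the links above; the column layout of the link-strength file (day, S) and the YR/MON columns of the index file are my assumptions:

```python
# Sketch: pair each month's Niño 3.4 anomaly with the k values of S
# from the 10-day periods ending by the end of the month m months
# earlier. "Day 1 is the first of January 1948" per the post above.
import pandas as pd

k, m = 3, 1
base = pd.Timestamp("1948-01-01")

s = pd.read_csv("average-link-strength.txt", sep=r"\s+", names=["day", "S"])
s["date"] = base + pd.to_timedelta(s["day"] - 1, unit="D")

nino = pd.read_csv("nino3.4-anoms.txt", sep=r"\s+")  # assumed YR, MON, ANOM columns

rows = []
for _, r in nino.iterrows():
    month_start = pd.Timestamp(int(r["YR"]), int(r["MON"]), 1)
    # End of the month m months before the given month.
    cutoff = month_start - pd.DateOffset(months=m - 1) - pd.Timedelta(days=1)
    past = s.loc[s["date"] <= cutoff, "S"].tail(k)
    if len(past) == k:
        rows.append(list(past) + [float(r["ANOM"])])

cols = [f"S_lag{j}" for j in range(k, 0, -1)] + ["anom"]
data = pd.DataFrame(rows, columns=cols)
```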
Comment Source:John, I will try to get you some interim computations to visualize how the minimization of error is handled by ML algorithms, e.g. SVR.
The problem is that there are very simple interpolation functions that reduce the error to 0; they are very wavy, and some might call them over-fitting.
So each minimization algorithm has cooked up a constraint to avoid that over-fitting.
I will try to cook something up with real data in the next couple of days.
Dara
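A quick sketch of the point about wavy zero-error interpolants versus constrained fits; the data and the choice of ridge regularization are illustrative assumptions:

```python
# An unconstrained interpolant drives training error to ~0 but
# oscillates wildly; a regularized fit on the same polynomial features
# trades a little training error for a far tamer curve. Synthetic data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 15)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.size)

# Degree-14 polynomial through 15 points: exact fit, very wavy at the edges.
exact = np.polynomial.polynomial.polyfit(x, y, deg=14)

# Same features, but with an L2 penalty on the coefficients.
X = PolynomialFeatures(degree=14).fit_transform(x[:, None])
ridge = Ridge(alpha=1e-3).fit(X, y)

xf = np.linspace(-1, 1, 400)
wavy = np.polynomial.polynomial.polyval(xf, exact)
tame = ridge.predict(PolynomialFeatures(degree=14).fit_transform(xf[:, None]))
print("max |interpolant|:", np.abs(wavy).max(), " max |ridge fit|:", np.abs(tame).max())
```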
Comment Source:Dara wrote:
> John, I will try to get you some interim computations to visualize how the minimization of error is handled by ML algorithms, e.g. SVR.
I don't want you to do unnecessary work. To understand how minimization of error is handled, and how to avoid overfitting, all I really need is nice sentences in English, mixed with just a few equations. I often understand general concepts more easily than examples.
Comment Source:Dara wrote:
> John, this is what I could do to help out for that talk:
> Make you a fancy interactive CDF to explain the concepts for the talk, the assumption being that you have a good machine to run it on.
Thanks! But first we need to do the research for this talk. If you can help David Tweed do [the project he proposed](http://forum.azimuthproject.org/discussion/1445/possible-gpgpu-task-linearbilinear-regression-on-el-nino-dataset/?Focus=12127#Comment_12127), that would be great. I believe I can fill in details about what datasets to use (the surface air temperatures in a certain rectangle of points) and what quantities we want to predict (the El Niño 3.4 index). I'm hoping David Tweed or you will ask me for these details.
I would prefer to let him, and Daniel Mahler, and you, be the experts on the _methods_ of prediction. I would like to start by using some "standard" or "routine" methods of prediction, just to see what happens. I hope you guys can explain those methods to me, or point me to explanations.
Later, perhaps after December, we can try more creative and interesting things.
Comment Source:Thanks, Nad. But I don't _literally_ have an ulcer! That was just a metaphor.
I really want to go to this Neural Information Processing Seminar, since there are probably lots of smart people interested in "networks" there. I was invited to give a minicourse there last year (on information geometry), and I cancelled at the last minute because I was too busy. I don't want to cancel again. If I wanted to give a low-stress talk, I could talk about my mathematical ideas on network theory. But I think for this audience it will be better if I talk about something more "practical". I feel pretty sure I can give a mildly interesting talk on climate networks and El Niño prediction. I'll do this if I can't do something more exciting. But I've been wanting to push myself into new territory... and get some of the machine learning / computation people here involved. We'll see.
Comment Source:Hello John
It seems coordinating efforts might be a bit difficult here, so as an individual contributor I could help out and do whatever you need for the talks; pending permission from Graham, I could fiddle with his code or add some new stuff for you to show.
I do not mind burning the midnight oil and trying some novel code, or just doing something simple, as you wish. It might be that you would not like most of it; not an issue, we start from scratch again and again.
Again, my purpose here is to learn and work with learned men and women.
Dara
Comment Source:Dara wrote:
> I do not mind burning the midnight oil and trying some novel code, or just doing something simple, as you wish.
Great! I hope we can do something like this. A few weeks ago I said:
> But I don’t feel that I can say “write code that uses a **random forest method** to predict El Niños $n$ months in advance” and expect one of you to do that in a week.
Daniel wrote:
> This is not that hard.
So I thought it could be done in a week, but obviously it hasn't been done yet! <img src = "http://math.ucr.edu/home/baez/emoticons/tongue2.gif" alt = ""/>
Daniel wrote:
> The hard part is formulating the problem.
and so I tried to start formulating it. I hoped that my first try would give rise to more questions that would help me formulate it more precisely. But Daniel Mahler didn't ask any more questions. So maybe you can? Maybe David Tweed can?
First:
> deciding what signal(s) to predict
I want to predict the **Niño 3.4 index**, which is one number per month. The data is available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/nino3.4-anoms.txt). This is data from the Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM.
> from what data
For starters, I'd like to predict it from the **average link strength**, a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt).
This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.)
The first goal is to **predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around $m$ months before the given month**.
The second goal is to **measure how good the prediction is, and see how it gets worse as we increase $m$**.
(Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this:
We predict the Niño 3.4 index for a given month given the values of $S$ over the $k$ 10-day periods whose final one occurs at the end of the month $m$ months before the given month. For example, if $m = 1$, $k = 3$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January.
Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, it could be fine.)
> what loss function to use
I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good?
Roughly speaking, I'd like predictions that minimize the time average of
$|\text{predicted Niño 3.4 index} - \text{observed Niño 3.4 index}|^2$
But I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
Comment Source:Hi, just as a general note, I'm slowly incrementally adding code towards doing linear/bilinear regression [here](https://github.com/davidtweed/multicoreBilinearRegression), but it's very slow going (primarily because I'm only able to spend about 30 min per day on it while on the train, which isn't really long enough to fully engage my "concentration mode"). Still not at the point of being a working program. I still hope to get _something_ finished before I run out of time.
In terms of things to predict, the only real thought I've had so far is that I'm reluctant to try to predict a 3-month-average-based El Nino 3.4 index, purely because it's likely to be noisy, and hence errors aren't necessarily indicative of errors on the bigger problem. I'm inclined to try something like the (1+5+1)=7-month average El Nino 3.4 index for the period immediately after the observations used for prediction, but that's not really more than a rough guess.
A quick note about prediction errors: it may be worth considering things other than squared error as the error measure. As I drone on about in the blog post I'm working on, squared loss "wants to" reduce any big error to be smaller, even if that means inflating all the other errors up to that level. Sometimes absolute-value-of-errors, or other alternatives, can be more insightful about predictor performance.
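A quick numeric illustration of that loss-function point: squared loss lets one large miss dominate, while absolute loss weights all misses linearly. The error values here are made-up numbers:

```python
# Under squared loss the "uniformly mediocre" predictor looks much
# better; under absolute loss the "one big miss" predictor edges ahead.
import numpy as np

errors_a = np.array([0.1, 0.1, 0.1, 0.1, 3.0])   # one big miss
errors_b = np.array([0.7, 0.7, 0.7, 0.7, 0.7])   # uniformly mediocre

for name, e in [("one big miss", errors_a), ("uniformly off", errors_b)]:
    print(f"{name}: mean squared = {np.mean(e**2):.3f}, "
          f"mean absolute = {np.mean(np.abs(e)):.3f}")
```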
Comment Source:Hello John
I am trying to put together some actionable items plus a calendar of development to help out with your research; of course I have a vested interest in learning from you, so there is a selfish reason as well :)
As we do with Paul, drilling down into code and attempts to study a problem: we just need more of that, and your most valued input, because I could EASILY diverge into useless endeavours.
Comment Source:David Tweed wrote:
> Hi, just as a general note I'm slowly incrementally adding code towards doing linear/bilinear regression [here](https://github.com/davidtweed/multicoreBilinearRegression)...
Thanks, great! But I don't really know what you're doing. I guess it's something new or nonstandard, because I'm guessing there should be some software you can just take "off the shelf" that does _approximately_ what I'm asking for... right?
Are you trying to implement some of the ideas [discussed in your blog article](http://www.azimuthproject.org/azimuth/show/Blog+-+Exploring+regression+on+the+El+Ni%26ntilde%3Bo+data)?
I'm hoping someone will help me use some "off the shelf" software to make an initial attack on the El Niño prediction problem, just to get some sense of how hard it is. This could be someone less trained in machine learning than you but more friendly with computers than me.
> In terms of things to predict, the only real thoughts I’ve had so far is that I’m reluctant to try to predict a 3-month average based El Nino 3.4, purely because it’s likely to be noisy and hence errors aren’t necessarily indicative of errors on the bigger problem. I’m inclined to try something like the (1+5+1)=7 month average El Nino 3.4 index for the following period immediately after the observations used for prediction, but that’s not really more than a rough guess.
I guess you're saying this because an El Niño occurs when the 3-month running average of the Nino 3.4 index is over 0.5 °C for at least 5 months in a row, which involves 1+5+1 months of data?
Of course the 3-month running average being above some value for 5 months in a row is a bit different than the 7-month average being over that value for one month. The former is 5 inequalities while the latter is one.
To some extent we have a choice between predicting things that are easy to predict and predicting things people care about. People care about when there's an official El Niño. But they also care especially when it's a "strong" El Niño. It's hard to predict an El Niño more than 6 months in advance. But for that very reason, this is what people most want to do.
> A quick note about prediction errors: it may be worth considering things other than squared-error as predictors.
I have no strong ideology about this sort of thing, so it will be good if you provide us with one.
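Since the official definition quoted above is purely mechanical, here is a minimal sketch of it in code; the `anom` array, function name, and toy numbers are illustrative:

```python
# Sketch of the definition: an El Niño episode requires the 3-month
# running mean of the Niño 3.4 anomaly to stay above 0.5 °C for at
# least 5 consecutive (overlapping) 3-month periods.
import numpy as np

def el_nino_flags(anom, thresh=0.5, run=5):
    r3 = np.convolve(anom, np.ones(3) / 3, mode="valid")  # 3-month means
    hot = r3 > thresh
    # True where `run` consecutive 3-month means are all above threshold.
    return np.array([hot[i:i + run].all() for i in range(len(hot) - run + 1)])

anom = np.array([0.1, 0.4, 0.6, 0.9, 1.2, 1.0, 0.8, 0.6, 0.3, 0.0])
print(el_nino_flags(anom))
```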
Comment Source:Agree on the need for better measures of error.
Take the correlation coefficient, for instance. One would think that the (anti-)correlation between Tahiti and Darwin, which together make up the SOI, would be close to -1, considering how they were picked to best illustrate the standing-wave pattern: when one place is up in pressure and the other is down, that is perfect anti-correlation.
In fact, the anti-correlation is "only" -0.55, according to the time-series comparison below. One can see that these two time series are very likely anti-correlated, but the few regions that don't align drag the coefficient quickly away from -1. There must be other metrics that do a better job of identifying how well two curves match.

Comment Source:RMS error is standard in climate science and IMHO would be quite sufficient for these purposes.
If I wanted to get fancy, I'd try to construct a utility-based loss function that relates the Pacific SST field to, say, the probability of extreme temperature or precipitation events at some populated location, and see if that gives a much different predictive model. (I doubt it...)
Comment Source:[Here's the comment I couldn't post this morning]
Yes, I'm trying to apply the stuff from the blog, from tiny datasets (where it takes a couple of minutes on my laptop) to a vastly bigger dataset, namely the El Nino temperatures. In particular, if you take the correlation between two (not necessarily adjacent) points with $N$ total points, you get $N \times (N-1)$ ordered pairs. If you look at the minimum and maximum normalized correlation between the first point (1) now, (2) 3 months ago and (3) six months ago, and the second point (1) at the same time, (2) 3 months preceding and (3) 6 months preceding, you get $(3\times 3-1)\times 2=16$ possibilities.
So the input data -- the feature data if you will -- for 24 points is either a $552\times 16$ matrix $X$ or an $8832$-element vector $x$ (depending on whether you "concatenate" it or not). Suppose that through discussion we can figure out some plausible "real number" output $y$. Then my plan is to try to generate predictors of the form
1. $\hat{y}=a^T x+c$ if doing linear regression.
2. $\hat{y}=\sum_{i=1:P}a_i^T X b_i + c$ if doing bilinear regression.
However, with different kinds of sparsity prior plus a variable number of bilinear vectors $P$ (as far as I'm aware, no-one has yet shown that they "nest" in the same way PCA vectors do), that's $6P$ models to learn on what's quite a medium-size feature vector. (By "big data" standards that's not huge, but the people who do that kind of stuff have big clusters to run on and are using loss functions with known properties that make efficient solution possible, neither of which is true for me.)
These things I'm looking at are (combinations of) models that have been published quite extensively in the last two or three years. As such, they're known, but not at the point where there is easily available existing software to solve those models. Part of my reason for focusing on various kinds of variously sparsified regression is that in that area I understand the model structure and how to sparsify it without doing additional cross-validation. In addition, I'm hoping that I can reproduce the "division into two sets" test strategy that Ludescher et al used, so that it's a quite direct comparison.
One of the things that makes me a bit hesitant to look at neural nets, decision forests, etc., at this point is that I don't understand those well enough to sparsify them effectively without essentially needing to have a training, test and validation set, which means it'd be looking at a division of the data into 3 parts, so that it'd be more difficult to compare performance directly. (Other people might very well understand how to use decision forests, etc., for this without splitting the data, but I don't.)
Yes, there are problems with taking a 7-month average as a proxy for 5 months where the "3-month average" is above some threshold. I'm very interested if anyone has any ideas of a **non-binary** statistic that could be used instead as a prediction variable. I suppose one possibility is the count of months within the next 5 for which the El Nino 3.4 is above a threshold, although that still piles a lot of different feature vectors onto the value 0. (In case it's not obvious, the reason I particularly care about this in the context of (normal|bi-)linear regression is that the regression function assumes it should try equally hard to hit all of the "target outputs" you give it, so if there's a heavy concentration onto one value then it will be heavily biased towards creating a linear function which goes through that point for lots of outputs, which, as I'm sure you can visualise, is quite "unnatural".)
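For readers who want to see the shape of model 2 in code, here is a rough alternating-least-squares sketch of the rank-$P$ bilinear predictor $\hat{y}=\sum_{i=1:P}a_i^T X b_i + c$. This is my own illustrative paraphrase, not David's actual implementation, and it ignores the sparsity priors he mentions:

```python
# Alternating least squares for yhat_n = sum_p a_p^T X_n b_p + c.
# With B fixed the model is linear in A (and vice versa), so each
# half-step is an ordinary ridge regression. Illustrative only.
import numpy as np

def ridge(U, y, lam):
    """Least squares with an L2 penalty and an unpenalized intercept."""
    U1 = np.hstack([U, np.ones((U.shape[0], 1))])
    reg = lam * np.eye(U1.shape[1]); reg[-1, -1] = 0.0
    w = np.linalg.solve(U1.T @ U1 + reg, U1.T @ y)
    return w[:-1], float(w[-1])

def fit_bilinear(Xs, y, P=2, iters=30, lam=1e-2):
    """Xs: (n, r, s) stack of feature matrices; y: (n,) targets."""
    n, r, s = Xs.shape
    rng = np.random.default_rng(0)
    A = 0.1 * rng.standard_normal((P, r))
    B = 0.1 * rng.standard_normal((P, s))
    c = float(y.mean())
    for _ in range(iters):
        U = np.concatenate([Xs @ B[p] for p in range(P)], axis=1)  # (n, P*r)
        a, c = ridge(U, y, lam); A = a.reshape(P, r)
        V = np.concatenate([Xs.transpose(0, 2, 1) @ A[p] for p in range(P)], axis=1)  # (n, P*s)
        b, c = ridge(V, y, lam); B = b.reshape(P, s)
    return A, B, c

def predict(Xs, A, B, c):
    return np.einsum('pr,nrs,ps->n', A, Xs, B) + c

# Tiny smoke test on synthetic data drawn from the model itself.
rng = np.random.default_rng(1)
Xs = rng.standard_normal((300, 6, 4))
A0, B0 = rng.standard_normal((2, 6)), rng.standard_normal((2, 4))
y = np.einsum('pr,nrs,ps->n', A0, Xs, B0) + 0.01 * rng.standard_normal(300)
A, B, c = fit_bilinear(Xs, y)
print("RMS residual:", np.sqrt(np.mean((predict(Xs, A, B, c) - y) ** 2)))
```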
Comment Source:> I’m very interested if anyone has any ideas of a non-binary statistic that could be used instead as a prediction variable. I suppose one possibility that could be used is the count of months within the next 5 for which the El Nino 3.4 is above a threshold [...]
Why not the Nino 3.4 index itself? It's a continuous variable.
Comment Source:Thanks for your long comment, David... especially given that it was eaten by the ether the first time 'round!
> One of the things that makes me a bit hesitant to look at neural nets, decision forests, etc, at this point is that I don’t understand those well enough to sparsify them effectively without essentially needing to have a training, test and validation set which means it’d be looking at a division of the data into 3 parts, so that it’d be more difficult to compare performance directly.
I guess I see two goals:
1) Get something done by December 1st for the NIPS talk.
2) Do something really interesting.
with 1) as a kind of warmup for 2). It would be amazing if we could do something that simultaneously met goals 1) and 2), but I'm not counting on that.
For 1) I was imagining some "quick and dirty" ways of doing something _very much like_ what Ludescher _et al_ did, but slightly different, to begin to see how good their approach is. This would let me give a talk about climate networks, their paper, and a kind of critique or evaluation of their paper.
For 1), a first baby step would be to take any method like neural nets, random forests etc. and use it to predict the El Niño 3.4 index _starting from the average link strength computed by Ludescher et al_ (and available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt)). This was supposed to be easy, since it's predicting one time series from another; no sparsification needed (right?). It would not test the sanity of using "average link strength", just Ludescher _et al_'s particular way of using it.
Of course if you have limited time it makes sense for you to tackle a type 2) project while someone else (maybe even little old me) tries this simpler thing.
I’m very interested if anyone has any ideas of a non-binary statistic that could be used instead as a prediction variable. I suppose one possibility that could be used is the count of months within the next 5 for which the El Nino 3.4 is above a threshold […]
Nathan wrote:
Why not the Nino 3.4 index itself? It’s a continuous variable.
I'm sure David has thought of that, but earlier he wrote:
In terms of things to predict, the only real thoughts I’ve had so far is that I’m reluctant to try to predict a 3-month average based El Nino 3.4, purely because it’s likely to be noisy and hence errors aren’t necessarily indicative of errors on the bigger problem.
so presumably he'd consider the Nino 3.4 itself even more noisy and thus worse.
Personally I don't understand machine learning well enough to be sure that trying to predict something noisy is worse than trying to predict a smoothed-out substitute like what David suggested (the 7-month average of Nino 3.4). Obviously you can't predict it as well! But does that mean it's a bad thing to do? It's bad if the algorithm winds up putting a lot of work into predicting irrelevant wiggles. But with a suitable measure of what counts as success (sorry, I'm forgetting the jargon here), one might avoid that.
Anyway, I favor predicting either the Nino 3.4 index or, for some technical reason, a time-averaged version of that. Predicting the Nino 3.4 index has the big sociological advantage that this is something people already do.
Like David, I don't want to have predicting a binary quantity like "is there an El Niño?" as our main goal.
Comment Source:David wrote:
> I’m very interested if anyone has any ideas of a non-binary statistic that could be used instead as a prediction variable. I suppose one possibility that could be used is the count of months within the next 5 for which the El Nino 3.4 is above a threshold […]
Nathan wrote:
> Why not the Nino 3.4 index itself? It’s a continuous variable.
I'm sure David has thought of that, but earlier he wrote:
> In terms of things to predict, the only real thoughts I’ve had so far is that I’m reluctant to try to predict a 3-month average based El Nino 3.4, purely because it’s likely to be noisy and hence errors aren’t necessarily indicative of errors on the bigger problem.
so presumably he'd consider the Nino 3.4 itself even more noisy and thus worse.
Personally I don't understand machine learning well enough to be sure that trying to predict something noisy is worse than trying to predict a smoothed-out substitute like what David suggested (the 7-month average of Nino 3.4). Obviously you can't predict it as well! But does that mean it's a bad thing to do? It's bad if the algorithm winds up putting a lot of work into predicting irrelevant wiggles. But with a suitable measure of what counts as success (sorry, I'm forgetting the jargon here), one might avoid that.
Anyway, I favor predicting either the Nino 3.4 index or, for some technical reason, a time-averaged version of that. Predicting the Nino 3.4 index has the big sociological advantage that this is something people already do.
Like David, I don't want to have predicting a binary quantity like "is there an El Niño?" as our main goal.
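For concreteness, the smoothed-out substitute David suggested (a centered 7-month average) is a one-liner; the random series here is only a stand-in for the real monthly ANOM values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
months = pd.date_range("1950-01-01", periods=774, freq="MS")
nino34 = pd.Series(rng.normal(size=774), index=months)  # stand-in for ANOM

# the (1+5+1) = 7-month centered average proposed as a smoother target
smoothed = nino34.rolling(window=7, center=True).mean()
```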
Climatologists often predict a smoothed El Nino index rather than a raw index, but I share your feelings: of course it's harder to predict the raw index, but that's the thing you actually care about. Anyway, smoothed or not, the continuous index itself is the obvious thing to try to predict, not some quantized version thereof.
Comment Source:Climatologists often predict a smoothed El Nino index rather than a raw index, but I share your feelings: of course it's harder to predict the raw index, but that's the thing you actually care about. Anyway, smoothed or not, the continuous index itself is the obvious thing to try to predict, not some quantized version thereof.
the continuous index itself is the obvious thing to try to predict, not some quantized version thereof.
What do you mean by QUANTIZED VERSION? and CONTINUOUS INDEX, as in a continuous random variable made up from an interpolation function?
Comment Source:Nathan wrote:
>the continuous index itself is the obvious thing to try to predict, not some quantized version thereof.
What do you mean by QUANTIZED VERSION? and CONTINUOUS INDEX, as in a continuous random variable made up from an interpolation function?
You could take the signal (of any dimensions) and break it down to say s = Trend + Noise and devise a forecast function f:
f(Trend) ---> next Trend
f(Noise) ---> next Noise
Or you could use
f (Trend, Noise) ---> next (Trend, Noise)
In other words, use the Noise as an important part of the data, which it is, or use it in tuple form (Trend, Noise).
I feel the discussion treats it as Trend vs. Noise, as if we have to choose one or use the raw signal, but there are other configurations, as the above indicates.
If you lowered a microphone into Beijing and recorded the noises of the city, you would get Trend + Noise. Trend would give you important info about when people wake up and work and sleep (basically their mass schedule) and information about the days of the week and holidays; Noise gives you what machinery other than humans is active in the environment. So you could use both to model or forecast the next day's activities of the city.
Dara
Comment Source:Hello John
You could take the signal (of any dimensions) and break it down to say s = Trend + Noise and devise a forecast function f:
f(Trend) ---> next Trend
f(Noise) ---> next Noise
Or you could use
f (Trend, Noise) ---> next (Trend, Noise)
In other words, use the Noise as an important part of the data, which it is, or use it in tuple form (Trend, Noise).
I feel the discussion treats it as Trend vs. Noise, as if we have to choose one or use the raw signal, but there are other configurations, as the above indicates.
If you lowered a microphone into Beijing and recorded the noises of the city, you would get Trend + Noise. Trend would give you important info about when people wake up and work and sleep (basically their mass schedule) and information about the days of the week and holidays; Noise gives you what machinery other than humans is active in the environment. So you could use both to model or forecast the next day's activities of the city.
Dara
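A sketch of the s = Trend + Noise split Dara describes. The centered moving average used as the trend estimator is an assumption; any smoother could play that role.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.3, size=500))

trend = s.rolling(window=25, center=True, min_periods=1).mean()  # the Trend
noise = s - trend                                                # the Noise

# Option 1: forecast f(Trend) and f(Noise) separately, then recombine.
# Option 2: feed the tuple (Trend, Noise) to a single forecast function:
X = np.column_stack([trend, noise])   # rows are (Trend, Noise) tuples
```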
What do you mean by QUANTIZED VERSION? and CONTINUOUS INDEX, as in a continuous random variable made up from an interpolation function?
The continuous index is just a time series of a real variable. The quantized version is where you apply some thresholding function to the time series (e.g., 1 if the index is above some value for some time, 0 otherwise).
Anyway, the point is that it's probably most useful to be working with some simple aggregate function of the observational data (sea surface temperatures), not a discretized variable, or a correlation function, or something like that. It's the temperature itself that matters to people.
Comment Source:> What do you mean by QUANTIZED VERSION? and CONTINUOUS INDEX, as in a continuous random variable made up from an interpolation function?
The continuous index is just a time series of a real variable. The quantized version is where you apply some thresholding function to the time series (e.g., 1 if the index is above some value for some time, 0 otherwise).
Anyway, the point is that it's probably most useful to be working with some simple aggregate function of the observational data (sea surface temperatures), not a discretized variable, or a correlation function, or something like that. It's the temperature itself that matters to people.
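In code the distinction is just this (the 0.5 threshold is an arbitrary illustration):

```python
import numpy as np

index = np.array([-0.4, 0.7, 1.2, 0.3, -0.1])  # the continuous index (toy values)
quantized = (index > 0.5).astype(int)          # 1 if above the threshold, else 0
```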
Hi John, I can see where you're coming from. I guess two key points
are (1) I'm not really the best guy to try running pre-canned ML code
on problems and (2) I suffer from having been an academic working on
ML problems for a few years. As such, one of the natural questions I
always have is "if this computerized fitting procedure doesn't produce
an accurate predictor, what can I do in that case?" (It's not great
fun when a canned software package mysteriously produces mediocre results.) I understand regression and sparsity enough to at least be able to look at the results to explain why it failed, and maybe even see how to tweak things and have another go. My experience with a lot of other models isn't that great: if I were to run a two-level neural network on the problem and got back a poor model that hadn't convincingly detected any features in the data, I wouldn't know what to do about that (beyond trying a completely different model!) As such, as well as me being generally a bit flaky, I'm of the opinion that what I'm looking at is probably the most likely avenue for me to make a helpful contribution.
Maybe we have some other Azimuthans who are more familiar with the
properties of other ML toolkits who could help the project in terms of running
other models?
Comment Source:Hi John, I can see where you're coming from. I guess two key points
are (1) I'm not really the best guy to try running pre-canned ML code
on problems and (2) I suffer from having been an academic working on
ML problems for a few years. As such, one of the natural questions I
always have is "if this computerized fitting procedure doesn't produce
an accurate predictor, what can I do in that case?" (It's not great
fun when a canned software package mysteriously produces mediocre results.) I understand regression and sparsity enough to at least be able to look at the results to explain why it failed, and maybe even see how to tweak things and have another go. My experience with a lot of other models isn't that great: if I were to run a two-level neural network on the problem and got back a poor model that hadn't convincingly detected any features in the data, I wouldn't know what to do about that (beyond trying a completely different model!) As such, as well as me being generally a bit flaky, I'm of the opinion that what I'm looking at is probably the most likely avenue for me to make a helpful contribution.
Maybe we have some other Azimuthans who are more familiar with the
properties of other ML toolkits who could help the project in terms of running
other models?
what I’m looking at is probably the most likely avenue for me to make a helpful contribution.
Makes sense.
Personally I'm eager to run a two-level neural network on the problem and see how well it does, along with various other "off the shelf" prediction methods, merely to 1) learn about different methods, 2) see how well they do on different variants of this problem, 3) start studying different ways to define "how well they do". I don't mind getting mediocre results! I just want to learn stuff.
It looks like maybe Dara can help me with these things.
Comment Source:David wrote:
> what I’m looking at is probably the most likely avenue for me to make a helpful contribution.
Makes sense.
Personally I'm eager to run a two-level neural network on the problem and see how well it does, along with various other "off the shelf" prediction methods, merely to 1) learn about different methods, 2) see how well they do on different variants of this problem, 3) start studying different ways to define "how well they do". I don't mind getting mediocre results! I just want to learn stuff.
It looks like maybe Dara can help me with these things.
It looks like maybe Dara can help me with these things.
I'll try, John, to set this up for you in one or two ways and see which one is better
Dara
Comment Source:>It looks like maybe Dara can help me with these things.
I'll try, John, to set this up for you in one or two ways and see which one is better
Dara
The MEI has an interesting characteristic in that it has detail without additional noise. The noise suppression might arise as the various factors improve the statistics and remove the epistemic noise.
Comment Source:The Multivariate ENSO Index (MEI) is also a good candidate
<http://www.esrl.noaa.gov/psd/enso/mei/index.html>
The MEI has an interesting characteristic in that it has detail without additional noise. The noise suppression might arise as the various factors improve the statistics and remove the epistemic noise.
Dara, the data is right there on that page linked, if I understand it correctly
"Here we attempt to monitor ENSO by basing the Multivariate ENSO Index (MEI) on the six main observed variables over the tropical Pacific. These six variables are: sea-level pressure (P), zonal (U) and meridional (V) components of the surface wind, sea surface temperature (S), surface air temperature (A), and total cloudiness fraction of the sky (C). These observations have been collected and published in ICOADS for many years. The MEI is computed separately for each of twelve sliding bi-monthly seasons (Dec/Jan, Jan/Feb,..., Nov/Dec). After spatially filtering the individual fields into clusters (Wolter, 1987), the MEI is calculated as the first unrotated Principal Component (PC) of all six observed fields combined. This is accomplished by normalizing the total variance of each field first, and then performing the extraction of the first PC on the co-variance matrix of the combined fields (Wolter and Timlin, 1993). In order to keep the MEI comparable, all seasonal values are standardized with respect to each season and to the 1950-93 reference period. "
What may concern some (it does me a bit) is that they have already performed some analysis -- that of Principal Components -- and that this processing may in fact not be optimal. Who is to say that doing PCA is not obscuring some important feature?
And why do they not include the sea-level component, similar to that which I discovered in another thread. Is that because it is not important? Or that they have not considered it? It may be best to keep the data in as raw as possible a form, while the machine learning chews on it. As I am sure Dara would suggest, make no assumptions apart from that the data is of some quality.
Comment Source:Dara, the data is right there on that page linked, if I understand it correctly
"Here we attempt to monitor ENSO by basing the Multivariate ENSO Index (MEI) on the six main observed variables over the tropical Pacific. These six variables are: sea-level pressure (P), zonal (U) and meridional (V) components of the surface wind, sea surface temperature (S), surface air temperature (A), and total cloudiness fraction of the sky (C). These observations have been collected and published in ICOADS for many years. The MEI is computed separately for each of twelve sliding bi-monthly seasons (Dec/Jan, Jan/Feb,..., Nov/Dec). After spatially filtering the individual fields into clusters (Wolter, 1987), the MEI is calculated as the first unrotated Principal Component (PC) of all six observed fields combined. This is accomplished by normalizing the total variance of each field first, and then performing the extraction of the first PC on the co-variance matrix of the combined fields (Wolter and Timlin, 1993). In order to keep the MEI comparable, all seasonal values are standardized with respect to each season and to the 1950-93 reference period. "
What may concern some (it does me a bit) is that they have already performed some analysis -- that of Principal Components -- and that this processing may in fact not be optimal. Who is to say that doing PCA is not obscuring some important feature?
And why do they not include the sea-level component, similar to that which I discovered in [another thread](http://forum.azimuthproject.org/discussion/1480/tidal-records-and-enso/). Is that because it is not important? Or that they have not considered it? It may be best to keep the data in as raw as possible a form, while the machine learning chews on it. As I am sure Dara would suggest, make no assumptions apart from that the data is of some quality.
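For readers who want the quoted MEI recipe in concrete form, here is roughly that computation with scikit-learn: normalize each field, then take the first unrotated principal component of the combined fields. The six random arrays are stand-ins for the real P, U, V, S, A, C observations, and the spatial cluster filtering step is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
fields = [rng.normal(size=500) for _ in range(6)]                # stand-ins for P, U, V, S, A, C
X = np.column_stack([(f - f.mean()) / f.std() for f in fields])  # normalize each field

mei_like = PCA(n_components=1).fit_transform(X).ravel()          # first unrotated PC
```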
This is what I suggest we do: use the Django server and some scripts to make multivariate data from some other sources of data. So start with data X, Y, Z ... then make (X,Y,Z,...) as a collection of tuples. This is standard procedure in most data mining packages.
The script in Mathematica is trivial and easy to implement.
Then after computing the tuples, apply filters and transforms.
Dara
Comment Source:Hello Paul
Everything is cryptic in this field.
This is what I suggest we do: use the Django server and some scripts to make multivariate data from some other sources of data. So start with data X, Y, Z ... then make (X,Y,Z,...) as a collection of tuples. This is standard procedure in most data mining packages.
The script in Mathematica is trivial and easy to implement.
Then after computing the tuples, apply filters and transforms.
Dara
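The tuple-building step really is trivial; a numpy equivalent of the Mathematica script Dara has in mind might look like this.

```python
import numpy as np

X = np.arange(5)        # stand-ins for the series X, Y, Z, ...
Y = np.arange(5) ** 2
Z = np.arange(5) ** 3

tuples = np.column_stack([X, Y, Z])  # shape (5, 3): rows are (x, y, z) tuples
# ... then apply filters and transforms to `tuples`
```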
Is there an economic indicator that is (expected/believed to be) sensitive to climate change?
Presumably there should be since climate change is expected to have economic consequences.
I would guess that something like world crop production and agricultural commodity prices should be the most sensitive.
Predicting such an indicator from climate data could be a more compelling way of getting the attention of the general public.
Comment Source:Is there an economic indicator that is (expected/believed to be) sensitive to climate change?
Presumably there should be since climate change is expected to have economic consequences.
I would guess that something like world crop production and agricultural commodity prices should be the most sensitive.
Predicting such an indicator from climate data could be a more compelling way of getting the attention of the general public.
And why do they not include the sea-level component, similar to that which I discovered in another thread. Is that because it is not important? Or that they have not considered it?
I think it's for historical reasons. The MEI was first developed about 20 years ago, and they didn't have long-term, gridded, sea surface height satellite data products like TOPEX/Poseidon. It's probably worth considering, though, as an indicator of subsurface heat content.
Comment Source:> And why do they not include the sea-level component, similar to that which I discovered in another thread. Is that because it is not important? Or that they have not considered it?
I think it's for historical reasons. The MEI was first developed about 20 years ago, and they didn't have long-term, gridded, sea surface height satellite data products like TOPEX/Poseidon. It's probably worth considering, though, as an indicator of subsurface heat content.
Is there an economic indicator that is (expected/believed to be) sensitive to climate change?
I doubt there are any serious such indices on Wall Street; however, possibly the SOA (Society of Actuaries), from which I have heard of some research projects on the relationship.
Personally I'd like to make a water index.
An interesting example:
it can take more than 20,000 litres of water to produce 1kg of cotton
Shortage of fresh water for manufacturing might cause havoc in Asian economies.
Dara
Comment Source:Daniel
>Is there an economic indicator that is (expected/believed to be) sensitive to climate change?
I doubt there are any serious such indices on Wall Street; however, possibly the SOA (Society of Actuaries), from which I have heard of some research projects on the relationship.
Personally I'd like to make a water index.
An interesting example:
>it can take more than 20,000 litres of water to produce 1kg of cotton
[Cotton](http://wwf.panda.org/about_our_earth/about_freshwater/freshwater_problems/thirsty_crops/cotton/)
Shortage of fresh water for manufacturing might cause havoc in Asian economies.
Dara
I was thinking before joining this group to compute a daily index for water, it could be the coefficients of the SUPPORT VECTORS for an SVR algorithm performed on some precipitation density in certain regions.
Or the internal weight for a Neural Network global approximator (See Hornik paper I posted earlier).
So the index is not just a hodgepodge of numbers set by someone in some business/political establishment, rather the index has a meaningful concept and it is algorithmic.
Dara
Comment Source:Daniel
I was thinking before joining this group to compute a daily index for water, it could be the coefficients of the SUPPORT VECTORS for an SVR algorithm performed on some precipitation density in certain regions.
Or the internal weight for a Neural Network global approximator (See Hornik paper I posted earlier).
So the index is not just a hodgepodge of numbers set by someone in some business/political establishment, rather the index has a meaningful concept and it is algorithmic.
Dara
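A hedged sketch of this algorithmic-index idea using scikit-learn's SVR: fit a regressor to a precipitation series and read off the dual coefficients of the support vectors. The synthetic data, the RBF kernel, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 200).reshape(-1, 1)            # time axis
precip = np.sin(t).ravel() + rng.normal(0, 0.1, 200)  # stand-in precipitation density

svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(t, precip)
index_components = svr.dual_coef_.ravel()  # coefficients of the support vectors
```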
FWIW, a couple of years ago the UK had the Stern review which looked at economic effects. I've never read it so I don't know, but I doubt there are any simple models of how economic indicators would change with climate, though there might be some higher-level ideas one could build on.
Comment Source:FWIW, a couple of years ago the UK had [the Stern review](http://en.wikipedia.org/wiki/Stern_Review) which looked at economic effects. I've never read it so I don't know, but I doubt there are any simple models of how economic indicators would change with climate, though there might be some higher-level ideas one could build on.
" It’s probably worth considering, though, as an indicator of subsurface heat content."
If that is in regards to the sea-surface height, I think it would be much more likely an indicator of the Pacific Ocean sloshing imbalance. But then again due to thermal expansion, some fraction of the height change will certainly be due to the changing temperature of the ocean.
Other than that, it kind of makes sense, but all I am using is the tide gauges of Sydney harbor, which has had records kept for over 100 years, so one can't really attribute the omission to the late availability of satellite data.
The east-west sloshing of Pacific Ocean waters is routinely used to describe ENSO, yet curiously none of these indices seem to make use of the changes of surface height, which is the characteristic measure of sloshing.
Comment Source:Nathan Urban said:
" It’s probably worth considering, though, as an indicator of subsurface heat content."
If that is in regards to the sea-surface height, I think it would be much more likely an indicator of the Pacific Ocean sloshing imbalance. But then again due to thermal expansion, some fraction of the height change will certainly be due to the changing temperature of the ocean.
Other than that, it kind of makes sense, but all I am using is the tide gauges of Sydney harbor, which has had records kept for over 100 years, so one can't really attribute the omission to the late availability of satellite data.
The east-west sloshing of Pacific Ocean waters is routinely used to describe ENSO, yet curiously none of these indices seem to make use of the changes of surface height, which is the characteristic measure of sloshing.
1) Get something done by December 1st for the NIPS talk.
[...]
For 1), a first baby step would be to take any method like neural nets, random forests etc. and use it to predict the El Niño 3.4 index starting from the average link strength computed by Ludescher et al (and available here). This was supposed to be easy, since it’s predicting one time series from another; no sparsification needed (right?). It would not test the sanity of using “average link strength”, just Ludescher et al’s particular way of using it.
Ludescher et al's way of using the average link strength included using the NINO3.4 index itself as well. So you would be predicting one time series from two. But that should be easy, no sparsification or anything fancy required.
Comment Source:Replying to John at 82,
> 1) Get something done by December 1st for the NIPS talk.
[...]
> For 1), a first baby step would be to take any method like neural nets, random forests etc. and use it to predict the El Niño 3.4 index starting from the average link strength computed by Ludescher et al (and available here). This was supposed to be easy, since it’s predicting one time series from another; no sparsification needed (right?). It would not test the sanity of using “average link strength”, just Ludescher et al’s particular way of using it.
Ludescher et al's way of using the average link strength included using the NINO3.4 index itself as well. So you would be predicting one time series from two. But that should be easy, no sparsification or anything fancy required.
Perhaps you might get upset at me saying this, but I looked at your diff EQs and I think they need more terms and cross terms. Also they need to have decaying exponentials.
Perhaps you might get upset at me saying this, but I looked at your diff EQs and I think they need more terms and cross terms. Also they need to have decaying exponentials.
I also think that the diff equations are better suited if you break the time into two components of {year, month}.
In the .mml equations I gave from the SVR's Gaussian Wavelet you see {x1, x2} accordingly.
You do not need to make any mod 12 or discrete year, assume two continuous vars.
My hunch is that the equations produce much better models/solutions if the time is broken into one periodic component and one linear component.
You could also add the leap year as a param by itself 1 or 0, and maybe other params like # of sunspots and so on: {x1, x2, x3 ...} and I could compute you a closed form equation from SVR regressor just the same.
I also think that the diff equations are better suited if you break the time into two components of {year, month}. In the .mml equations I gave from the SVR's Gaussian Wavelet you see {x1, x2} accordingly. You do not need to make any mod 12 or discrete year, assume two continuous vars. My hunch is that the equations produce much better models/solutions if the time is broken into one periodic component and one linear component. You could also add the leap year as a param by itself 1 or 0, and maybe other params like # of sunspots and so on: {x1, x2, x3 ...} and I could compute you a closed form equation from SVR regressor just the same.
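Dara's {year, month} split, sketched in Python. Treating the month as a continuous fraction of the year is one reading of "assume two continuous vars".

```python
import numpy as np

t = np.arange(774) / 12.0 + 1950.0   # monthly time axis, in years
x1 = np.floor(t)                     # "year" component
x2 = 12.0 * (t - x1)                 # "month" component, continuous in [0, 12)
X = np.column_stack([x1, x2])        # feed (x1, x2) to the SVR / regressor
```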
No, not upset at all. I did not include a first-order drag term to represent viscous dissipation, since one simplifying approximation is to assume an inviscid environment. The first-order terms would definitely add exponential damping to the oscillating waveforms.
As far as cross-terms, I will see what the sloshing literature has for these.
I definitely agree that the monthly and multi-year behaviors coexist but the question is how best to isolate these behaviors and whether superposition properties actually hold.
No, not upset at all. I did not include a first-order drag term to represent viscous dissipation, since one simplifying approximation is to assume an inviscid environment. The first-order terms would definitely add exponential damping to the oscillating waveforms. As far as cross-terms, I will see what the sloshing literature has for these. I definitely agree that the monthly and multi-year behaviors coexist but the question is how best to isolate these behaviors and whether superposition properties actually hold.
Maybe a system of differential equations, as opposed to one, each governing the behaviour of a sub-Trend from the original data
>I definitely agree that the monthly and multi-year behaviors coexist but the question is how best to isolate these behaviors and whether superposition properties actually hold. Maybe a system of differential equations, as opposed to one, each governing the behaviour of a sub-Trend from the original data
Pertaining to #46
An interesting suggestion from a commenter at my blog is that some of the sub-monthly periods identified by machine learning are due to the non-uniform sampling rate, as monthly data violates the Nyquist criterion because of changing month lengths. He further suggested that non-uniform sampling can routinely extract signal at higher resolutions than uniform sampling, for example in variable star analysis. This is a typical research article "Variable Stars: which Nyquist Frequency ?" L. Eyer and P. Bartholdi http://arxiv.org/pdf/astro-ph/9808176.pdf
And there are implications for machine learning in this chapter "Data Mining and Machine-Learning in Time-Domain Discovery & Classification" J.S.Bloom and J.W. Richards http://arxiv.org/pdf/1104.3142.pdf
Pertaining to #46 An interesting suggestion from a commenter at my blog is that some of the sub-monthly periods identified by machine learning are due to the non-uniform sampling rate, as monthly data violates the Nyquist criterion because of changing month lengths. He further suggested that non-uniform sampling can routinely extract signal at higher resolutions than uniform sampling, for example in variable star analysis. This is a typical research article "Variable Stars: which Nyquist Frequency ?" L. Eyer and P. Bartholdi <http://arxiv.org/pdf/astro-ph/9808176.pdf> And there are implications for machine learning in this chapter "Data Mining and Machine-Learning in Time-Domain Discovery & Classification" J.S.Bloom and J.W. Richards <http://arxiv.org/pdf/1104.3142.pdf>
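One standard tool for irregularly sampled series is the Lomb-Scargle periodogram, which scipy provides; this synthetic example only shows the mechanics and does not reproduce the blog analysis.

```python
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 300))   # irregular sample times
y = np.sin(2 * np.pi * 0.2 * t) + rng.normal(0, 0.2, 300)

freqs = 2 * np.pi * np.linspace(0.01, 2.0, 1000)  # angular frequencies
power = lombscargle(t, y - y.mean(), freqs)       # no uniform grid required
```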
I have been worrying a lot about what to do next in the El Niño Project.
It would be great if we could carry out some interesting machine learning or neural networks approach to El Niño by December 1st, because shortly after that I'm supposed to give a talk about this stuff at the Neural Information Processing Seminar.
However, I feel very unconfident of my/our ability to do anything very interesting in this direction by December. I could probably do it if I did nothing else between now and then. However, I'll be teaching 2 courses, including a seminar on network theory, and I really want to explain - in the seminar, and on the blog - the great new research my grad students have done. I also need to apply for some grants before mid-November. I would also like to finish writing a book - which is almost done, but everything takes longer than expected.
If people here were organized and willing to work on this project together, it might still get done in time. But I don't feel that I can say "write code that uses a random forest method to predict El Niños $n$ months in advance" and expect one of you to do that in a week. It just doesn't seem to be working that way. I could say a lot about why not, and how we might become better organized, but I won't now.
For now, I probably need to reduce the ambitiousness of my short-term plans.
So, I think I will try to better understand the network formed by connecting the Ludescher et al sites in the Pacific with edges weighted by (various measures of) correlation between temperature histories. Graham Jones has already written a lot of the software needed to do this! He has already created a lot of graphs and plots that I haven't blogged about yet! There is a lot more that could be done... but I really just need to give a 1-hour talk and not get laughed off the stage.
The first step, then, is for me to blog about the work he's done.
If anyone else wants to produce some product that can help me give a good talk by December 1st, please let me know! You can say things like "I want to do this", or "tell me what to do".
I do not believe that wavelet or differential equation based models of El Niño will fit into this talk. Needless to say, our longer-term goals for after December can continue to be extremely ambitious, and they can include such models. It's just this deadline that's giving me an ulcer.
I have been worrying a lot about what to do next in the El Niño Project. It would be great if we could carry out some interesting machine learning or neural networks approach to El Niño by December 1st, because shortly after that I'm supposed to [give a talk about this stuff at the Neural Information Processing Seminar](http://forum.azimuthproject.org/discussion/1364/networks-in-climate-science/#Item_1). However, I feel very unconfident of my/our ability to do anything very interesting in this direction by December. I could probably do it if I did nothing else between now and then. However, I'll be teaching 2 courses, including a seminar on network theory, and I really want to explain - in the seminar, and on the blog - the great new research my grad students have done. I also need to apply for some grants before mid-November. I would also like to finish writing a book - which is almost done, but everything takes longer than expected. If people here were organized and willing to work on this project together, it might still get done in time. But I don't feel that I can say "write code that uses a random forest method to predict El Niños $n$ months in advance" and expect one of you to do that in a week. It just doesn't seem to be working that way. I could say a lot about why not, and how we might become better organized, but I won't now. For now, I probably need to reduce the ambitiousness of my short-term plans. So, I think I will try to better understand the network formed by connecting the Ludescher _et al_ sites in the Pacific with edges weighted by (various measures of) correlation between temperature histories. Graham Jones has already written a lot of the software needed to do this! He has already created a lot of graphs and plots that I haven't blogged about yet! There is a lot more that could be done... but I really just need to give a 1-hour talk and not get laughed off the stage. The first step, then, is for me to blog about the work he's done. If anyone else wants to produce some product that can help me give a good talk by December 1st, please let me know! You can say things like "I want to do this", or "tell me what to do". I do not believe that wavelet or differential equation based models of El Niño will fit into this talk. Needless to say, our longer-term goals for after December can continue to be extremely ambitious, and they can include such models. It's just this deadline that's giving me an ulcer.
John this is what I could do to help out for that talk:
Make you a fancy interactive CDF to explain the concepts for the talk, the assumption being you have a good machine to run it on.
Parallelization ideas for the Neural Network computations; I might give you code snippets in OpenMP and CUDA. Without such parallelization, usage of Neural computation is nil.
Iffy ideas: how to prepare volumetric data for neural computation
Dara
>If anyone else wants to produce some product that can help me give a good talk by December 1st, please let me know! You can say things like “I want to do this”, or “tell me what to do”. John this is what I could do to help out for that talk: Make you a fancy interactive CDF to explain the concepts for the talk, the assumption being you have a good machine to run it on. Parallelization ideas for the Neural Network computations; I might give you code snippets in OpenMP and CUDA. Without such parallelization, usage of Neural computation is nil. Iffy ideas: how to prepare volumetric data for neural computation Dara
In general I propose regularly published analyst reports of math, algorithms, software, visualization and parallelization on actual satellite atmospheric and oceanic data.
Instead of publishing general academic papers, publish specific analyses of specific data.
This might not ring a bell for John, but I think it might, due to his interests in network theory
Dara
In general I propose regularly published analyst reports of math, algorithms, software, visualization and parallelization on actual satellite atmospheric and oceanic data. Instead of publishing general academic papers, publish specific analyses of specific data. This might not ring a bell for John, but I think it might, due to his interests in network theory Dara
This is not that hard. The hard part is formulating the problem: deciding what signal(s) to predict, from what data, and what loss function to use so that they are meaningful. It might be easy to predict some signals with a high $R^2$ by simply locking onto seasonal variation, which would be kind of trivial. Once the problem specification is agreed on, the implementation should be fairly simple. Consider the following example from the scikit library documentation.
> But I don’t feel that I can say “write code that uses a random forest method to predict El Niños n months in advance” and expect one of you to do that in a week. This is not that hard. The hard part is formulating the problem: deciding what signal(s) to predict, from what data, and what loss function to use so that they are meaningful. It might be easy to predict some signals with a high $R^2$ by simply locking onto seasonal variation, which would be kind of trivial. Once the problem specification is agreed on, the implementation should be fairly simple. Consider the following [example](http://scikit-learn.org/dev/auto_examples/plot_multioutput_face_completion.html) from the scikit library documentation.
John wrote:
Daniel wrote:
Wow! I'm really happy that you feel that way, because then we might accomplish something by December 1st!
That's the part I can do. I'm not really an expert on climate science, so I can't promise to do a great job - but at least I have specific problems in mind, and I can start by describing some easy ones!
I want to predict the Niño 3.4 index, which is one number per month. The data is available here. This is data from Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM.
For starters, I'd like to predict it from the "average link strength", a quantity we've computed and put here.
This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see Part 4 of the El Niño Project series.)
The first goal is to predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around $m$ months before the given month.
The second goal is to measure how good the prediction is and see how it gets worse as we increase $m$.
(Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this:
We predict the Niño 3.4 index for a given month given the values of $S$ over the $k$ 10-day periods whose final one occurs at the end of the month $m$ months before the given month. For example, if $m = 1$, $k = 3$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January.
Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, it could be fine.)
I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good?
Roughly speaking, I'd like predictions that minimize the time average of
|(predicted El Niño 3.4 index - observed El Niño 3.4 index)|${}^2$
I bet you can give me ideas here. I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
John wrote: > But I don’t feel that I can say “write code that uses a random forest method to predict El Niños $n$ months in advance” and expect one of you to do that in a week. Daniel wrote: > This is not that hard. Wow! I'm really happy that you feel that way, because then we might accomplish something by December 1st! > The hard part is formulating the problem. That's the part I can do. I'm not really an expert on climate science, so I can't promise to do a great job - but at least I have specific problems in mind, and I can start by describing some easy ones! > deciding what signal(s) to predict I want to predict the Niño 3.4 index, which is one number per month. The data is available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/nino3.4-anoms.txt). This is data from Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM. > from what data For starters, I'd like to predict it from the "average link strength", a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt). This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.) The first goal is to predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around $m$ months before the given month. The second goal is to measure how good the prediction is and see how it gets worse as we increase $m$. (Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this: We predict the Niño 3.4 index for a given month given the values of $S$ over the $k$ 10-day periods whose final one occurs at the end of the month $m$ months before the given month. For example, if $m = 1$, $k = 3$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January. Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, it could be fine.) > what loss function to use I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good? Roughly speaking, I'd like predictions that minimize the time average of |(predicted El Niño 3.4 index - observed El Niño 3.4 index)|${}^2$ I bet you can give me ideas here. I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
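A hedged end-to-end sketch of this specification. The "three 10-day steps per month" alignment below is the crude simplification John flags, not a resolution of it; the synthetic arrays stand in for the two linked files; and the random forest is just one off-the-shelf choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def make_dataset(S, nino, m, k):
    """Features: the k link strengths in 10-day steps ending m months early."""
    X, y = [], []
    for month in range(len(nino)):
        end = 3 * (month - m)          # ~3 ten-day steps per month (crude!)
        if end - k >= 0 and end <= len(S):
            X.append(S[end - k:end])
            y.append(nino[month])
    return np.array(X), np.array(y)

rng = np.random.default_rng(0)
S = rng.normal(size=1132)     # stand-in for the 10-day average link strengths
nino = rng.normal(size=774)   # stand-in for the monthly ANOM column

for m in (1, 3, 6):            # second goal: watch the skill degrade with m
    X, y = make_dataset(S, nino, m=m, k=3)
    split = int(0.8 * len(y))  # train on the early years only
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[:split], y[:split])
    # the time-averaged squared loss John describes
    print(m, mean_squared_error(y[split:], model.predict(X[split:])))
```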
John, I'll try to get you some interim computations to visualize how the minimization of error is handled by ML algorithms, e.g. SVR.
The problem is that there are very simple interpolation functions that reduce the error to 0; they are very wavy and some might call them over-fitting.
So each minimization algorithm has a cooked-up constraint to avoid that over-fitting.
I'll try to cook something up with real data in the next couple of days
Dara
John, I'll try to get you some interim computations to visualize how the minimization of error is handled by ML algorithms, e.g. SVR. The problem is that there are very simple interpolation functions that reduce the error to 0; they are very wavy and some might call them over-fitting. So each minimization algorithm has a cooked-up constraint to avoid that over-fitting. I'll try to cook something up with real data in the next couple of days Dara
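Dara's point made concrete with scikit-learn's SVR: crank C up and epsilon down and you get a wavy near-interpolant with almost zero training error; the tamer settings act as the built-in constraint against over-fitting. All numbers are illustrative.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
t = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(2 * t).ravel() + rng.normal(0, 0.2, 40)

wavy = SVR(kernel="rbf", C=1e6, epsilon=1e-4).fit(t, y)  # near-interpolation
tame = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(t, y)   # constrained fit

print(wavy.score(t, y), tame.score(t, y))  # wavy "wins" in-sample; that's the trap
```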
Dara wrote:
I don't want you to do unnecessary work. To understand how minimization of error is handled, and how to avoid overfitting, all I really need is nice sentences in English, mixed with just a few equations. I often understand general concepts more easily than examples.
Dara wrote: > John I try to get you some interim computations to visualize how the minimization of error is handled by ML algorithms e.g. SVR. I don't want you to do unnecessary work. To understand how minimization of error is handled, and how to avoid overfitting, all I really need is nice sentences in English, mixed with just a few equations. I often understand general concepts more easily than examples.
Dara wrote:
Thanks! But first we need to do the research for this talk. If you can help David Tweed do the project he proposed, that would be great. I believe I can fill in details about what datasets to use (the surface air temperatures in a certain rectangle of points) and what quantities we want to predict (the El Niño 3.4 index). I'm hoping David Tweed or you will ask me for these details.
I would prefer to let him, and Daniel Mahler, and you, be the experts on the methods of prediction. I would like to start by using some "standard" or "routine" methods of prediction, just to see what happens. I hope you guys can explain those methods to me, or point me to explanations.
Later, perhaps after December, we can try more creative and interesting things.
Dara wrote: > John this is what I could do to help out for that talk: > Make you a fancy interactive CDF to explain the concepts for the talk, the assumption being you have a good machine to run it on. Thanks! But first we need to do the research for this talk. If you can help David Tweed do [the project he proposed](http://forum.azimuthproject.org/discussion/1445/possible-gpgpu-task-linearbilinear-regression-on-el-nino-dataset/?Focus=12127#Comment_12127), that would be great. I believe I can fill in details about what datasets to use (the surface air temperatures in a certain rectangle of points) and what quantities we want to predict (the El Niño 3.4 index). I'm hoping David Tweed or you will ask me for these details. I would prefer to let him, and Daniel Mahler, and you, be the experts on the _methods_ of prediction. I would like to start by using some "standard" or "routine" methods of prediction, just to see what happens. I hope you guys can explain those methods to me, or point me to explanations. Later, perhaps after December, we can try more creative and interesting things.
OK, I'll talk to David
OK, I'll talk to David
I think you should pay more attention to your health. Why not cancel the talk?
>It’s just this deadline that’s giving me an ulcer. I think you should pay more attention to your health. Why not cancel the talk?
Thanks, Nad. But I don't literally have an ulcer! That was just a metaphor.
I really want to go to this Neural Information Processing Seminar, since there are probably lots of smart people interested in "networks" there. I was invited to give a minicourse there last year (on information geometry), and I cancelled at the last minute because I was too busy. I don't want to cancel again. If I wanted to give a low-stress talk, I could talk about my mathematical ideas on network theory. But I think for this audience it will be better if I talk about something more "practical". I feel pretty sure I can give a mildly interesting talk on climate networks and El Niño prediction. I'll do this if I can't do something more exciting. But I've been wanting to push myself into new territory... and get some of the machine learning / computation people here involved. We'll see.
Thanks, Nad. But I don't _literally_ have an ulcer! That was just a metaphor. I really want to go to this Neural Information Processing Seminar, since there are probably lots of smart people interested in "networks" there. I was invited to give a minicourse there last year (on information geometry), and I cancelled at the last minute because I was too busy. I don't want to cancel again. If I wanted to give a low-stress talk, I could talk about my mathematical ideas on network theory. But I think for this audience it will be better if I talk about something more "practical". I feel pretty sure I can give a mildly interesting talk on climate networks and El Niño prediction. I'll do this if I can't do something more exciting. But I've been wanting to push myself into new territory... and get some of the machine learning / computation people here involved. We'll see.
Hello John
It seems coordinating efforts might be a bit difficult here, so as an individual contributor I could help out and do whatever you need for the talks; pending permission from Graham, I could fiddle with his code or add some new stuff for you to show.
I do not mind burning the midnight oil and trying some novel code, or just doing something simple, as you wish. It might be the case that you would not like most of it; not an issue, we start from scratch again and again.
Again my purpose here is to learn and work with learned men and women.
Dara
Hello John It seems coordinating efforts might be a bit difficult here, so as an individual contributor I could help out and do whatever you need for the talks; pending permission from Graham, I could fiddle with his code or add some new stuff for you to show. I do not mind burning the midnight oil and trying some novel code, or just doing something simple, as you wish. It might be the case that you would not like most of it; not an issue, we start from scratch again and again. Again my purpose here is to learn and work with learned men and women. Dara
Dara (or anyone), feel free to do what you want with my code.
Dara (or anyone), feel free to do what you want with my code.
Dara wrote:
Great! I hope we can do something like this. A few weeks ago I said:
Daniel wrote:
So I thought it could be done in a week, but obviously it hasn't been done yet!
Daniel wrote:
and so I tried to start formulating it. I hoped that my first try would give rise to more questions that would help me formulate it more precisely. But Daniel Mahler didn't ask any more questions. So maybe you can? Maybe David Tweed can?
First:
I want to predict the Niño 3.4 index, which is one number per month. The data is available here. This is data from Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM.
For starters, I'd like to predict it from the average link strength, a quantity we've computed and put here.
This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see Part 4 of the El Niño Project series.)
The first goal is to predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around $m$ months before the given month.
The second goal is to measure how good the prediction is, and see how it gets worse as we increase $m$.
(Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this:
We predict the Niño 3.4 index for a given month given the values of $S$ over the $k$ 10-day periods whose final one occurs at the end of the month $m$ months before the given month. For example, if $m = 1$, $k = 3$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January.
Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, it could be fine.)
I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good?
Roughly speaking, I'd like predictions that minimize the time average of
|(predicted El Niño 3.4 index - observed El Niño 3.4 index)|${}^2$
But I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
Dara wrote: > I do not mind burning the midnight oil and try some novel code or just do something simple, as you wish. Great! I hope we can do something like this. A few weeks ago I said: > But I don’t feel that I can say “write code that uses a **random forest method** to predict El Niños $n$ months in advance” and expect one of you to do that in a week. Daniel wrote: > This is not that hard. So I thought it could be done in a week, but obviously it hasn't been done yet! <img src = "http://math.ucr.edu/home/baez/emoticons/tongue2.gif" alt = ""/> Daniel wrote: > The hard part is formulating the problem. and so I tried to start formulating it. I hoped that my first try would give rise to more questions that would help me formulate it more precisely. But Daniel Mahler didn't ask any more questions. So maybe you can? Maybe David Tweed can? First: > deciding what signal(s) to predict I want to predict the **Niño 3.4 index**, which is one number per month. The data is available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/nino3.4-anoms.txt). This is data from Climate Prediction Center of the National Weather Service, from January 1950 to May 2014. The Niño 3.4 index is in the column called ANOM. > from what data For starters, I'd like to predict it from the **average link strength**, a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt). This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.) The first goal is to **predict the Niño 3.4 index on any given month starting from the average link strengths during some roughly fixed-length interval of time ending around $m$ months before the given month**. The second goal is to **measure how good the prediction is, and see how it gets worse as we increase $m$**. (Note: There is a certain technical annoyance here involving "months" versus "10-day intervals"! If every month contained exactly 30 days, life would be sweet. But no. Here's one idea on how to deal with this: We predict the Niño 3.4 index for a given month given the values of $S$ over the $k$ 10-day periods whose final one occurs at the end of the month $m$ months before the given month. For example, if $m = 1$, $k = 3$ and the given month is February, we must predict the February Niño 3.4 index given the 3 average link strengths at 10-day periods ending at the end of January. Ignore this crap if you're just trying to get the basic idea. If you can think of a less annoying way to handle this annoying issue, it could be fine.) > what loss function to use I'm not exactly sure what a "loss function" is: does it get big when our prediction counts as "bad"? Or does it get big when our prediction counts as good? Roughly speaking, I'd like predictions that minimize the time average of |(predicted El Niño 3.4 index - observed El Niño 3.4 index)|${}^2$ But I'd like to start by doing something that machine learning people consider "standard", not quirky and idiosyncratic.
Hi, just as a general note I'm slowly incrementally adding code towards doing linear/bilinear regression [here](https://github.com/davidtweed/multicoreBilinearRegression), but it's very slow going (primarily because I'm only able to spend about 30 min per day on it while on the train, which isn't really long enough to fully engage my "concentration mode"). It's still not at the point of being a working program, but I still hope to get _something_ finished before I run out of time.
In terms of things to predict, the only real thoughts I've had so far is that I'm reluctant to try to predict a 3-month-average-based El Nino 3.4, purely because it's likely to be noisy and hence errors aren't necessarily indicative of errors on the bigger problem. I'm inclined to try something like the (1+5+1)=7 month average El Nino 3.4 index for the following period immediately after the observations used for prediction, but that's not really more than a rough guess.
A quick note about prediction errors: it may be worth considering things other than squared error as the error measure. As I drone on about in the blog thing I'm working on, squared loss "wants to" reduce any big error to be smaller, even if that means inflating all the other errors up to that level. Sometimes an absolute-value-of-errors measure, or other alternatives, can be more insightful about classifier performance.
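A toy illustration of that point, with invented numbers: two error profiles with the same mean absolute error can get very different squared-error scores, because squared loss is dominated by the single biggest miss.

```python
import numpy as np

# Two hypothetical error profiles with the same mean absolute error:
errors_uniform = np.array([0.5, 0.5, 0.5, 0.5])   # uniformly mediocre
errors_one_big = np.array([0.1, 0.1, 0.1, 1.7])   # mostly good, one bad miss

for name, e in [("uniform", errors_uniform), ("one big miss", errors_one_big)]:
    print(name,
          "mean |error| =", np.mean(np.abs(e)),
          "mean error^2 =", np.mean(e ** 2))

# Squared loss scores the "one big miss" profile much worse (0.73 vs 0.25),
# so minimizing it will shrink the big error even at the cost of inflating
# the small ones; mean absolute error treats the two profiles identically.
```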
Hello John
I am trying to put together some actionable items + calendar of development in place to help out with your research, of course I have the vested interest of learning from you so there is a selfish reason as well :)
As we do with Paul, drilling down into code and attempts to study a problem: we just need more of that, plus your most valued input, because I could EASILY diverge into useless endeavours.
David Tweed wrote:

> Hi, just as a general note I'm slowly incrementally adding code towards doing linear/bilinear regression [here](https://github.com/davidtweed/multicoreBilinearRegression)...
Thanks, great! But I don't really know what you're doing. I guess it's something new or nonstandard, because I'm guessing there should be some software you can just take "off the shelf" that does _approximately_ what I'm asking for... right?
Are you trying to implement some of the ideas [discussed in your blog article](http://www.azimuthproject.org/azimuth/show/Blog+-+Exploring+regression+on+the+El+Ni%26ntilde%3Bo+data)?
I'm hoping someone will help me use some "off the shelf" software to make an initial attack on the El Niño prediction problem, just to get some sense of how hard it is. This could be someone less trained in machine learning than you but more friendly with computers than me.
> In terms of things to predict, the only real thoughts I've had so far is that I'm reluctant to try to predict a 3-month-average-based El Nino 3.4, purely because it's likely to be noisy and hence errors aren't necessarily indicative of errors on the bigger problem. I'm inclined to try something like the (1+5+1)=7 month average El Nino 3.4 index for the following period immediately after the observations used for prediction, but that's not really more than a rough guess.

I guess you're saying this because an El Niño occurs when the 3-month running average of the Nino 3.4 index is over 0.5 °C for at least 5 months in a row, which involves 1+5+1 months of data?
Of course the 3-month running average being above some value for 5 months in a row is a bit different than the 7-month average being over that value for one month. The former is 5 inequalities while the latter is one.
To some extent we have a choice between predicting things that are easy to predict and predicting things people care about. People care about when there's an official El Niño. But they also care especially when it's a "strong" El Niño. It's hard to predict an El Niño more than 6 months in advance. But for that very reason, this is what people most want to do.
> A quick note about prediction errors: it may be worth considering things other than squared error as the error measure.

I have no strong ideology about this sort of thing, so it will be good if you provide us with one.
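To make the two candidate targets concrete, here is a sketch of both tests under the definitions above (the function names are made up, and the input is assumed to be a NumPy array of monthly Niño 3.4 anomalies):

```python
import numpy as np

def official_el_nino(nino34, threshold=0.5):
    """True if the 3-month running average exceeds `threshold` for at
    least 5 consecutive months: five separate inequalities."""
    running3 = np.convolve(nino34, np.ones(3) / 3, mode="valid")
    above = running3 > threshold
    return any(above[i:i + 5].all() for i in range(len(above) - 4))

def seven_month_proxy(nino34, threshold=0.5):
    """True if some 7-month average exceeds `threshold`: one inequality."""
    running7 = np.convolve(nino34, np.ones(7) / 7, mode="valid")
    return bool((running7 > threshold).any())
```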
Network ate long comment. Trying to apply blog ideas. Long comment in about 14 hours.
Agree on the need for better measures of error.
Take the correlation coefficient, for instance. One would think that the (anti-)correlation between Tahiti and Darwin, which comprises the SOI, would probably be close to -1, considering how they were picked to best illustrate the standing wave pattern. When one place is up in pressure and the other place is down, that is perfect anti-correlation.
In fact, the anti-correlation is "only" -0.55, according to the time-series comparison below. One can sense that these two time-series are very likely anti-correlated, but the fact that there are a few regions that don't align drags the coefficient quickly away from -1. There must be other metrics that do a better job of identifying how well two curves match.
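As a sketch of the computation (the two series here are synthetic stand-ins for the Tahiti and Darwin pressure anomalies, built to be anti-correlated plus noise):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 20, 240)
darwin = np.sin(t) + 0.4 * rng.normal(size=t.size)
tahiti = -np.sin(t) + 0.4 * rng.normal(size=t.size)

# Pearson correlation coefficient: even for two clearly anti-correlated
# series, a few misaligned stretches drag this well away from -1.
r = np.corrcoef(tahiti, darwin)[0, 1]
print(r)
```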
RMS error is standard in climate science and IMHO would be quite sufficient for these purposes.
If I wanted to get fancy, I'd try to construct a utility-based loss function that relates the Pacific SST field to, say, the probability of extreme temperature or precipitation events at some populated location, and see if that gives a much different predictive model. (I doubt it...)
[Here's the comment I couldn't post this morning]
Yes, I'm trying to apply the stuff from the blog, scaling up from tiny datasets (where it takes a couple of minutes on my laptop) to a vastly bigger dataset, namely the El Nino temperatures. In particular, if you take the correlation between two (not necessarily adjacent) points out of $N$ total points you get $N \times (N-1)$ ordered pairs. If you look at the minimum and maximum normalized correlation between the first point (1) now, (2) 3 months ago and (3) six months ago, and the second (1) at the same time, (2) 3 months preceding and (3) 6 months preceding, you get $(3\times 3-1)\times 2=16$ possibilities.
So the input data -- the feature data if you will -- for 24 points is either a $552\times 16$ matrix $X$ or an $8832$-element vector $x$ (depending on whether you "concatenate" it or not). Suppose that through discussion we can figure out some plausible "real number" output $y$. Then my plan is to try to generate predictors of the form

1. $\hat{y}=a^T x+c$ if doing linear regression.

2. $\hat{y}=\sum_{i=1:P}a_i^T X b_i + c$ if doing bilinear regression.
However, with different kinds of sparsity priors plus a variable number of bilinear vectors $P$ (as far as I'm aware no-one has yet shown that they "nest" in the same way PCA vectors do), that's $6P$ models to learn on what's quite a medium-size feature vector. (By "big data" standards that's not huge, but the people who do that kind of stuff have big clusters to run on and are using loss functions with known properties that make efficient solution possible, neither of which is true for me.)
These things I'm looking at are (combinations of) models that have been published quite extensively in the last two or three years. As such, they're known, but not at the point where there is existing, easily available software to solve those models. Part of my reason for focusing on various kinds of variously sparsified regression is that in that area I understand the model structure and how to sparsify it without doing additional cross validation. In addition, I'm hoping that I can reproduce the "division into two sets" test strategy that Ludescher et al used, so that it's a quite direct comparison.
One of the things that makes me a bit hesitant to look at neural nets, decision forests, etc, at this point is that I don't understand those well enough to sparsify them effectively without essentially needing to have a training, test and validation set which means it'd be looking at a division of the data into 3 parts, so that it'd be more difficult to compare performance directly. (Other people might very well understand how to use decision forests, etc, for this without splitting the data but I don't.)
Yes, there are problems with taking a 7-month average as a proxy for 5 months where the "3-month average" is above some threshold. I'm very interested if anyone has any ideas of a **non-binary** statistic that could be used instead as a prediction variable. I suppose one possibility that could be used is the count of months within the next 5 for which the El Nino 3.4 is above a threshold, although that still piles a lot of different vectors of feature values at 0. (In case it's not obvious, the reason I particularly care about this in the context of (normal|bi-)linear regression is that the regression function assumes it should try equally hard to hit all of the "target outputs" you give it, so if there's a heavy concentration onto one value then it will be heavily biased towards creating a linear function which goes through that point for lots of outputs, which as I'm sure you can visualise is quite "unnatural".)
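A rough sketch of the shapes involved, on one reading of the setup above (all arrays are random stand-ins, and $P = 2$ bilinear components is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples, rows, cols = 100, 552, 16      # one 552 x 16 feature matrix per sample
X = rng.normal(size=(n_samples, rows, cols))
x = X.reshape(n_samples, -1)              # concatenated: 8832-element vector

# 1. Linear predictor: y_hat = a^T x + c
a, c = rng.normal(size=rows * cols), 0.0
y_lin = x @ a + c

# 2. Bilinear predictor: y_hat = sum_i a_i^T X b_i + c, with P components
P = 2
A = rng.normal(size=(P, rows))
B = rng.normal(size=(P, cols))
y_bil = sum(np.einsum("nrc,r,c->n", X, A[i], B[i]) for i in range(P)) + c
```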
> I'm very interested if anyone has any ideas of a non-binary statistic that could be used instead as a prediction variable. I suppose one possibility that could be used is the count of months within the next 5 for which the El Nino 3.4 is above a threshold [...]

Why not the Nino 3.4 index itself? It's a continuous variable.
Thanks for your long comment, David... especially given that it was eaten by the ether the first time 'round!

> One of the things that makes me a bit hesitant to look at neural nets, decision forests, etc, at this point is that I don't understand those well enough to sparsify them effectively without essentially needing to have a training, test and validation set which means it'd be looking at a division of the data into 3 parts, so that it'd be more difficult to compare performance directly.

I guess I see two goals:
1) Get something done by December 1st for the NIPS talk.
2) Do something really interesting.
with 1) as a kind of warmup for 2). It would be amazing if we could do something that simultaneously met goals 1) and 2), but I'm not counting on that.
For 1) I was imagining some "quick and dirty" ways of doing something very much like what Ludescher et al did, but slightly different, to begin to see how good their approach is. This would let me give a talk about climate networks, their paper, and a kind of critique or evaluation of their paper.
For 1), a first baby step would be to take any method like neural nets, random forests etc. and use it to predict the El Niño 3.4 index _starting from the average link strength computed by Ludescher et al_ (and available [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt)). This was supposed to be easy, since it's predicting one time series from another; no sparsification needed (right?). It would not test the sanity of using "average link strength", just Ludescher et al's particular way of using it.
Of course if you have limited time it makes sense for you to tackle a type 2) project while someone else (maybe even little old me) tries this simpler thing.
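That baby step might look roughly like this scikit-learn sketch. Synthetic stand-ins replace the two linked files, and the lag $m = 6$ months, the $k = 3$ trailing link-strength values, and the crude three-10-day-intervals-per-month alignment are all illustrative assumptions, not choices anyone has agreed on:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Stand-ins for the real series (see the files linked above):
# S: average link strength at 10-day intervals; nino34: monthly index.
S = rng.normal(size=1130)
nino34 = rng.normal(size=370)

# Predict month t from the k link-strength values ending m months earlier,
# treating a month as roughly three 10-day intervals.
k, m = 3, 6
features, targets = [], []
for t in range(m + k, len(nino34)):
    end = (t - m) * 3
    features.append(S[end - k:end])
    targets.append(nino34[t])
Xf, y = np.array(features), np.array(targets)

# Train on the earlier part, test on the later part (no shuffling,
# mimicking the "division into two sets" strategy).
X_train, X_test, y_train, y_test = train_test_split(Xf, y, shuffle=False)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("test MSE:", np.mean((model.predict(X_test) - y_test) ** 2))
```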
David wrote:

> I'm very interested if anyone has any ideas of a non-binary statistic that could be used instead as a prediction variable. I suppose one possibility that could be used is the count of months within the next 5 for which the El Nino 3.4 is above a threshold [...]

Nathan wrote:

> Why not the Nino 3.4 index itself? It's a continuous variable.

I'm sure David has thought of that, but earlier he wrote:

> In terms of things to predict, the only real thoughts I've had so far is that I'm reluctant to try to predict a 3-month-average-based El Nino 3.4, purely because it's likely to be noisy and hence errors aren't necessarily indicative of errors on the bigger problem.

so presumably he'd consider the Nino 3.4 itself even more noisy and thus worse.
Personally I don't understand machine learning well enough to be sure that trying to predict something noisy is worse than trying to predict a smoothed-out substitute like what David suggested (the 7-month average of Nino 3.4). Obviously you can't predict it as well! But does that mean it's a bad thing to do? It's bad if the algorithm winds up putting a lot of work into predicting irrelevant wiggles. But with a suitable measure of what counts as success (sorry, I'm forgetting the jargon here), one might avoid that.
Anyway, I favor predicting either the Nino 3.4 index or, for some technical reason, a time-averaged version of that. Predicting the Nino 3.4 index has the big sociological advantage that this is something people already do.
Like David, I don't want to have predicting a binary quantity like "is there an El Niño?" as our main goal.
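For what it's worth, the time-averaged target is easy to produce; a pandas sketch (the values here are made-up stand-ins for the ANOM column of the linked file):

```python
import pandas as pd

# Hypothetical monthly Nino 3.4 anomalies.
nino34 = pd.Series([0.0, 0.2, 0.6, 0.9, 1.1, 1.0, 0.7, 0.4, 0.1, -0.2, -0.4, -0.3])

# 7-month centered running mean: a smoother target that damps the
# month-to-month wiggles a predictor might otherwise chase.
smoothed = nino34.rolling(window=7, center=True).mean()
print(smoothed)
```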
Climatologists often predict a smoothed El Nino index rather than a raw index, but I have your same feelings: of course it's harder to predict the raw index, but that's the thing you actually care about. Anyway, smoothed or not, the continuous index itself is the obvious thing to try to predict, not some quantized version thereof.
Nathan wrote:

> the continuous index itself is the obvious thing to try to predict, not some quantized version thereof.

What do you mean by QUANTIZED VERSION? And CONTINUOUS INDEX, as in a continuous random variable made up from an interpolation function?
Hello John
You could take the signal (of any dimensions) and break it down to say s = Trend + Noise and devise a forecast function f:
f(Trend) ---> next Trend
f(Noise) ---> next Noise
Or you could use
f (Trend, Noise) ---> next (Trend, Noise)
In other words, use the Noise as an important part of the data, which it is, or use it in tuple form (Trend, Noise).

I feel the discussion frames it as Trend vs. Noise, as if we have to choose one or else use the raw signal, but there are other configurations, as the above indicates.
If you lowered a microphone into Beijing and recorded the noises of the city, you would get Trend + Noise. Trend would give you important info about when people wake up and work and sleep, basically their mass schedule, and information about the days of the week and holidays; Noise tells you what machinery other than humans is active in the environment. So you could use both to model or forecast the next day's activities of the city.
Dara
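A minimal sketch of the decomposition Dara describes, with a 12-month moving average standing in for the Trend (the window length is an arbitrary choice):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
s = pd.Series(np.sin(np.linspace(0, 12, 240)) + 0.3 * rng.normal(size=240))

trend = s.rolling(window=12, center=True).mean()   # Trend
noise = s - trend                                  # Noise = signal - Trend

# One could now fit separate forecast functions f(Trend) and f(Noise)
# and recombine them, or fit a joint f(Trend, Noise).
print(trend.dropna().head(), noise.dropna().head())
```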
> What do you mean by QUANTIZED VERSION? And CONTINUOUS INDEX, as in a continuous random variable made up from an interpolation function?

The continuous index is just a time series of a real variable. The quantized version is where you apply some thresholding function to the time series (e.g., 1 if the index is above some value for some time, 0 otherwise).
Anyway, the point is that it's probably most useful to be working with some simple aggregate function of the observational data (sea surface temperatures), not a discretized variable, or a correlation function, or something like that. It's the temperature itself that matters to people.
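In code, the distinction is just the following (the 0.5 threshold is taken from the El Niño definition discussed above):

```python
import numpy as np

index = np.array([0.2, 0.6, 0.7, 0.8, 0.6, 0.9, 0.4])  # continuous index

# Quantized version: 1 where the index is above 0.5, else 0.
quantized = (index > 0.5).astype(int)
print(quantized)  # [0 1 1 1 1 1 0]
```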
Hi John, I can see where you're coming from. I guess two key points are (1) I'm not really the best guy to try running pre-canned ML code on problems and (2) I suffer from having been an academic working on ML problems for a few years. As such, one of the natural questions I always have is "if this computerized fitting procedure doesn't produce an accurate predictor, what can I do in that case?" (It's now great fun when a canned software package mysteriously produces mediocre results.) I understand regression and sparsity enough to at least be able to look at the results to explain why it failed, and maybe even see how to tweak things and have another go. My experience with a lot of other models isn't that great: if I was to run a two-level neural network on the problem and got back a poor model that hadn't convincingly detected any features in the data, I wouldn't know what to do about that (beyond trying a completely different model!). As such, as well as me being generally a bit flaky, I'm of the opinion that what I'm looking at is probably the most likely avenue for me to make a helpful contribution.
Maybe we have some other Azimuthans who are more familiar with the properties of other ML toolkits who could help the project in terms of running other models?
David wrote:

> what I'm looking at is probably the most likely avenue for me to make a helpful contribution.

Makes sense.
Personally I'm eager to run a two-level neural network on the problem and see how well it does, along with various other "off the shelf" prediction methods, merely to 1) learn about different methods, 2) see how well they do on different variants of this problem, 3) start studying different ways to define "how well they do". I don't mind getting mediocre results! I just want to learn stuff.
It looks like maybe Dara can help me with these things.
> It looks like maybe Dara can help me with these things.

I'll try, John, to set this up for you in one or two ways and see which one is better.

Dara
The Multivariate ENSO Index (MEI) is also a good candidate http://www.esrl.noaa.gov/psd/enso/mei/index.html
The MEI has an interesting characteristic in that it has detail without additional noise. The noise suppression might arise as the various factors improve the statistics and remove the epistemic noise.
Paul, where is the multivariate data?
Dara, the data is right there on that page linked, if I understand it correctly:

> Here we attempt to monitor ENSO by basing the Multivariate ENSO Index (MEI) on the six main observed variables over the tropical Pacific. These six variables are: sea-level pressure (P), zonal (U) and meridional (V) components of the surface wind, sea surface temperature (S), surface air temperature (A), and total cloudiness fraction of the sky (C). These observations have been collected and published in ICOADS for many years. The MEI is computed separately for each of twelve sliding bi-monthly seasons (Dec/Jan, Jan/Feb, ..., Nov/Dec). After spatially filtering the individual fields into clusters (Wolter, 1987), the MEI is calculated as the first unrotated Principal Component (PC) of all six observed fields combined. This is accomplished by normalizing the total variance of each field first, and then performing the extraction of the first PC on the co-variance matrix of the combined fields (Wolter and Timlin, 1993). In order to keep the MEI comparable, all seasonal values are standardized with respect to each season and to the 1950-93 reference period.
What may concern some (it does me a bit) is that they have already performed some analysis -- that of Principal Components -- and this processing may in fact not be optimal. Who is to say that doing PCA is not obscuring some important feature?
And why do they not include the sea-level component, similar to that which I discovered in [another thread](http://forum.azimuthproject.org/discussion/1480/tidal-records-and-enso/)? Is that because it is not important? Or that they have not considered it? It may be best to keep the data in as raw a form as possible while the machine learning chews on it. As I am sure Dara would suggest, make no assumptions apart from that the data is of some quality.
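For what it's worth, the "first unrotated Principal Component of the combined, normalized fields" step reads roughly like this sketch (random stand-ins for the six MEI fields; dividing by the standard deviation is only an approximation to "normalizing the total variance of each field"):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Stand-ins for six observed fields, each (months x spatial clusters).
fields = [rng.normal(size=(120, 30)) for _ in range(6)]

# Normalize each field, then combine them side by side.
combined = np.hstack([f / f.std() for f in fields])

# An MEI-like index is the first principal component time series.
mei_like = PCA(n_components=1).fit_transform(combined).ravel()
print(mei_like.shape)  # one value per month
```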
Hello Paul
Everything is cryptic in this field.
This is what I suggest we do: use the Django server and some scripts to make multivariate data from some other sources of data. So start with data X, Y, Z ... then make (X,Y,Z,...) as a collection of tuples. This is standard procedure in most data mining packages.
The script in Mathematica is trivial and easy to implement.
Then after computing the tuples, apply filters and transforms.
Dara
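The tuple-building step is indeed trivial; in Python, for instance (the series X, Y, Z are hypothetical):

```python
# Hypothetical aligned series from different sources.
X = [1.0, 1.2, 0.9]
Y = [0.3, 0.4, 0.2]
Z = [5.1, 5.0, 5.3]

# Multivariate data as a collection of tuples, ready for the
# filters and transforms applied downstream.
tuples = list(zip(X, Y, Z))  # [(1.0, 0.3, 5.1), (1.2, 0.4, 5.0), ...]
```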
Is there an economic indicator that is (expected/believed to be) sensitive to climate change? Presumably there should be since climate change is expected to have economic consequences. I would guess that something like world crop production and agricultural commodity prices should be the most sensitive. Predicting such an indicator from climate data could be a more compelling way of getting the attention of the general public.
> And why do they not include the sea-level component, similar to that which I discovered in another thread? Is that because it is not important? Or that they have not considered it?

I think it's for historical reasons. The MEI was first developed about 20 years ago, and they didn't have long-term, gridded, sea surface height satellite data products like TOPEX/Poseidon. It's probably worth considering, though, as an indicator of subsurface heat content.
Daniel

> Is there an economic indicator that is (expected/believed to be) sensitive to climate change?

I doubt there are any serious such indices on Wall Street; however, possibly the SOA (Society of Actuaries) has something, as I have heard of some research projects there on the relationship.
Personally I'd like to make a water index.

An interesting example:

> it can take more than 20,000 litres of water to produce 1kg of cotton

[Cotton](http://wwf.panda.org/about_our_earth/about_freshwater/freshwater_problems/thirsty_crops/cotton/)

Shortage of fresh water for manufacturing might cause havoc in Asian economies.
Dara
Daniel
I was thinking, before joining this group, to compute a daily index for water; it could be the coefficients of the SUPPORT VECTORS for an SVR algorithm performed on some precipitation density in certain regions.

Or the internal weights for a Neural Network global approximator (see the Hornik paper I posted earlier).

So the index is not just a hodgepodge of numbers set by someone in some business/political establishment; rather the index has a meaningful concept behind it and is algorithmic.
Dara
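scikit-learn's SVR does expose the ingredients Dara mentions; a sketch with invented data (whether these coefficients make a good water index is exactly the open question):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))  # e.g. precipitation densities in 4 regions
y = X @ np.array([0.5, -0.2, 0.1, 0.0]) + 0.1 * rng.normal(size=200)

svr = SVR(kernel="rbf").fit(X, y)

# The "algorithmic index" ingredients: the support vectors and their
# dual coefficients, determined by the data rather than set by fiat.
print(svr.support_vectors_.shape, svr.dual_coef_.shape)
```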
FWIW, a couple of years ago the UK had [the Stern review](http://en.wikipedia.org/wiki/Stern_Review), which looked at economic effects. I've never read it so I don't know, but I doubt there are any simple models of how economic indicators would change with climate, though there might be some higher-level ideas one could build on.
Nathan Urban said:

> It's probably worth considering, though, as an indicator of subsurface heat content.
If that is in regard to the sea-surface height, I think it would much more likely be an indicator of the Pacific Ocean sloshing imbalance. But then again, due to thermal expansion, some fraction of the height change will certainly be due to the changing temperature of the ocean.
Other than that, it kind of makes sense, but all I am using is the tide gauges of Sydney harbor, which have had records kept for over 100 years, so one can't really attribute it to the sudden availability of satellite data.
The east-west sloshing of Pacific Ocean waters is routinely used to describe ENSO, yet curiously none of these indices seem to make use of the changes of surface height, which is the characteristic measure of sloshing.
Replying to John at 82,

> 1) Get something done by December 1st for the NIPS talk.

[...]

> For 1), a first baby step would be to take any method like neural nets, random forests etc. and use it to predict the El Niño 3.4 index starting from the average link strength computed by Ludescher et al (and available here). This was supposed to be easy, since it's predicting one time series from another; no sparsification needed (right?). It would not test the sanity of using "average link strength", just Ludescher et al's particular way of using it.

Ludescher et al's way of using the average link strength included using the NINO3.4 index itself as well. So you would be predicting one time series from two. But that should be easy, no sparsification or anything fancy required.
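Concretely, the feature vector for each month would then combine trailing values from both series; a sketch extending the earlier one (the lags are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
S = rng.normal(size=1130)      # stand-in: average link strength, 10-day steps
nino34 = rng.normal(size=370)  # stand-in: monthly NINO3.4 index

k, m = 3, 6
features, targets = [], []
for t in range(m + k, len(nino34)):
    end = (t - m) * 3
    # k link-strength values ending m months earlier, plus the k NINO3.4
    # values ending m months earlier: predicting one series from two.
    features.append(np.concatenate([S[end - k:end], nino34[t - m - k:t - m]]))
    targets.append(nino34[t])
Xf, y = np.array(features), np.array(targets)
print(Xf.shape)  # (n_samples, 2 * k)
```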