
Random forest: El Niño 3.4 (off-the-shelf)

Hello John

Here is the Random Forest forecast; it uses the predictor built into Mathematica v10:

Random Forest Forecast for El Nino 3.4

I added the standard ANOVA-like error analysis. I dislike this kind of antique error analysis, but remember this is off-the-shelf.

If you need more details or changes to the training of the algorithm, I will need to call tech support, which I do not mind doing.

Dara

Comments

  • 1.

    man page for Mathematica: [Predict](http://reference.wolfram.com/language/ref/Predict.html)

  • 2.
    edited November 2014

    Hi Dara,

    I haven't found a way of renaming a forum title s/radom forrest/random forest/ (people might think we're radioactive gumps). So someone will shoot me if I'm wrong, but I think you'll have to create a new page. I also don't know how to delete a dead page.

    hth

  • 3.

    OK what do you want me to call the title? I will make a new page.

    Dara

  • 4.

    Jim, sorry, it was supposed to be RANDOM not RADOM.... I hope it is better now.

    My NLP-->0 when I program

    D

  • 5.

    Ah thanks, so I can retitle pages. Good stuff. But you also need s/forrest/forest/.

  • 6.

    Ah... a failure again, let me fix that

    D

  • 7.

    I started to misspell forest after the Forrest Gump movie and from visiting Forrester Research analysts; my non-machine learning has a few bugs :)

  • 8.

    Great, Dara! But remember, I need lots of explanations in English to understand anything. So if you don't answer this question, I won't know what you've done and it will be useless to me.

    • What data are you using to predict the Niño 3.4 index at each time step?

    I don't mean to be rude, but I have to be clear: I need to know the answer to this, and I need to know it not by seeing code but by reading sentences in English (and maybe a few equations).

    The prediction looks suspiciously good, so I suspect you are using the Niño 3.4 index at time steps $t_1, \dots, t_n$ to predict its value at time step $t_{n+1}$. Is that correct?

    This would be interesting, but what I really want is to see how well you can predict Niño 3.4 at a given time given the average link strength at times $\ge M$ months earlier. I'd like to see this for $M = 3, 6, 9, 12$.

  • 9.
    edited November 2014

    The prediction looks suspiciously good, so I suspect you are using the Niño 3.4 index at time steps $t_1, \dots, t_n$ to predict its value at time step $t_{n+1}$. Is that correct?

    Yes. Actually the forecasts are not as good as you think, because you cannot predict the sign or whether the numbers are above a certain constant. For that matter, even single-digit accuracy (in percent) would not suffice; you need to be sub-percent accurate to get single-digit accuracy for the sign or for being greater or less than some constant.

    I'll tell you what I will do: since this is off-the-shelf, I will call tech support at Wolfram Research and get the English wording from them to assure you of the accuracy of my claims :)

    It will take a few phone calls, so please give me a day or so.

  • 10.

    This would be interesting, but what I really want is to see how well you can predict Niño 3.4 at a given time given the average link strength at times $\ge M$ months earlier.

    I really need to understand this statement better, i.e. how to form the new data vectors that reflect what you say in this description. I do not mind doing it, but I need to understand what the input looks like.

    Dara

  • 11.

    John, for the link strength it would be great if I had a description like a map $(x_1, x_2, x_3, \ldots, x_n) \mapsto (y_1, y_2, y_3, \ldots, y_m)$; then I could code around that function. Basically I would need descriptions of the input tuple list and the output tuple list.

    D

  • 12.

    I don’t mean to be rude,

    No, you are not, and it is my job to do all that you asked. However, this is the nature of off-the-shelf software; that is why we made our own libraries, to avoid dependencies on others.

    I sent the email to Wolfram tech support and cc'ed you.

    Dara

  • 13.
    edited November 2014

    If I am reading the code correctly, the model is learning the association between timestep number and the Nino3.4 value. The first code line in the document creates a list of $[i \mapsto Nino34_i]$ pairs and assigns it to the "data" variable. This is the training set for the random forest, and the random forest learns a smoothed/interpolated function that takes $i$ as input and returns the corresponding Nino34 index, based on the values of Nino34 at nearby indices.

    I think John wants the "data" variable to be the list of $[LinkStrength_{i} \mapsto Nino34_{i+6}]$ pairs. If you train the random forest on that and at prediction time give it a link strength value, it will return its prediction for Nino34 six months in the future. My experiments indicate that past values of the Nino34 index are a much better predictor of future Nino34 values than the link strength is, unless I messed up aligning the link strength with the rest of the data. I have found that a $[Nino34_{i-5} \ldots Nino34_{i} \mapsto Nino34_{i+6}]$ model performs comparably to some of the other models I shared earlier.

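    As a concrete illustration of that construction, here is a minimal Python sketch (not the Mathematica code from the PDF, and not Daniel's notebook); the placeholder random series, the 6-month lag, and the 400-month training cutoff are assumptions for illustration only:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    LAG = 6      # predict the Nino 3.4 anomaly this many months ahead
    SPLIT = 400  # train on the first 400 months, test on the rest

    # placeholder monthly series; in the real analysis these would be the
    # average link strength and the Nino 3.4 anomaly, aligned month by month
    rng = np.random.default_rng(0)
    link_monthly = rng.standard_normal(768)
    nino34_anom = rng.standard_normal(768)

    # build [LinkStrength_i -> Nino34_{i+LAG}] pairs
    X = link_monthly[:-LAG].reshape(-1, 1)
    y = nino34_anom[LAG:]

    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X[:SPLIT], y[:SPLIT])
    print("held-out R^2:", rf.score(X[SPLIT:], y[SPLIT:]))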
  • 14.

    Daniel, your text is garbled.

    You are correct on the first paragraph, but I think John wants that broken down even more.

    Please repost

  • 15.

    I saw it as soon as I posted. I re-edited the post. Is it OK now?

  • 16.

    Thanks Daniel, I will try your suggestions and see if I can recode. I will do Random Forest, SVR, and NN.

  • 17.
    edited November 2014

    Daniel wrote:

    I think John wants the "data" variable to be the list of $[LinkStrength_{i} \mapsto Nino34_{i+6}]$ pairs.

    Yes, if I understand this correctly this is what I want, assuming $i$ means the $i$th month. I would also like to be able to adjust the number "6" to a few other numbers: 3, 6, 9, and 12. I want to see how the prediction gets worse as these numbers increase.

    (To make life a bit more complicated, we have data for the link strength at ten day intervals, and the Nino34 at one month intervals! I don't think we can obtain Nino34 at ten day intervals. I think we could easily obtain the link strength at 30 day intervals or any fixed interval. Unfortunately, months are not all the same length! I guess we could obtain the link strength at approximately 365/12 day intervals, thus getting 12 link strengths per year. Would this be helpful?)

    If you train the random forest on that and at prediction time give it a link strength value it will return its prediction for Nino34 six months in the future.

    Yes, that's the kind of thing I'd like to see, and also with "six" replaced by "three, nine and twelve".

    My experiments indicate that past values of the Nino34 index are a much better predictor of future Nino34 values than the link strength is, unless I messed up aligning the link strength with the rest of the data.

    That's already very interesting, because Ludescher et al are claiming that link strengths are a useful predictor. Maybe they're not!

    I also am very interested in your work here:

    I have run an [ExtraTreesRegressor](http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html), a random forest variant, to predict the anomaly directly from the whole raw temperature map 6 months before. The results are [here](https://www.googledrive.com/host/0B4cyIPgV_Vxrb2wxUnFteXVwWHM). Out of the 778 months available I trained on the first 400 and tested on the remaining 378. The sources for this are also in the same directory.

    Again, this sort of calculation will help us test Ludescher's claim that the link strength is a useful concept. If you can do much better with the whole grid, their claim seems implausible! I would like to see you repeat this calculation with "6 months" replaced by 3, 9, or 12 months, to see how the performance degrades as we try to predict further into the future.

    I really want to say how immensely grateful I am to you. This kind of thing can transform my NIPS talk from being just a literature review to something much more interesting. And of course I'll describe this talk as a joint production of the whole Azimuth Project, and list everyone who contributed.

    And by the way: in case you're worried, my list of requests is short! At least for the NIPS talk, that is.

    I have found that using $[Nino34_{i-5} \ldots Nino34_{i} \mapsto Nino34_{i+6}]$ model performs comparably to some of the other models I shared earlier.

    That's fascinating.

    I am also very interested in how air pressure outperforms surface air temperature.... but not for my NIPS talk. I hope it's clear: for that talk I'm not trying to do good predictions of Nino3.4. I'm trying to critically examine the work of Ludescher et al, by comparing their method of predicting Nino3.4 to some other very comparable methods: using Nino3.4 to predict itself, or using the whole grid of air surface temperatures to predict Nino3.4.

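    A sketch of how that 3/6/9/12-month sweep could be set up in Python (not anyone's actual code from this thread; the placeholder series and the 400-month training cutoff are assumptions):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def lagged_score(predictor, target, lag, split=400):
        # train on the first `split` months, report held-out R^2 for
        # predicting target[i + lag] from predictor[i]
        X = predictor[:-lag].reshape(-1, 1)
        y = target[lag:]
        rf = RandomForestRegressor(n_estimators=200, random_state=0)
        rf.fit(X[:split], y[:split])
        return rf.score(X[split:], y[split:])

    rng = np.random.default_rng(0)
    link_monthly = rng.standard_normal(768)   # placeholder link strength
    nino34_anom = rng.standard_normal(768)    # placeholder Nino 3.4 anomaly

    for lag in (3, 6, 9, 12):
        print(lag, "months:", round(lagged_score(link_monthly, nino34_anom, lag), 3))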
  • 18.
    edited November 2014

    I’m trying to critically examine the work of Ludescher et al, by comparing their method of predicting Nino3.4 to some other very comparable methods: using Nino3.4 to predict itself, or using the whole grid of air surface temperatures to predict Nino3.4.

    I am rather strongly convinced that the link strength, at least as represented in the github file, is essentially unrelated to nino34. My main source of doubt is the possibility that there is an error in how I temporally align the signals. It would be great if somebody could take a long hard look at the code in my link analysis [notebook](https://www.googledrive.com/host/0B4cyIPgV_VxrX0lxSUxHU2VLN28/link-anom.html) (also [source notebook](https://www.googledrive.com/host/0B4cyIPgV_VxrX0lxSUxHU2VLN28/link-anom.ipynb) and [plain python script](https://www.googledrive.com/host/0B4cyIPgV_VxrX0lxSUxHU2VLN28/link-anom.py)). If this is correct then there is less than 0.1 correlation between the signals.

    Also there is no exceptional peak in the xcorr plot, which would mean there is no correct way to line up the signals. LS is as good a predictor of nino34 6 months from now as it is of nino34 12 months from now. In fact the highest peak, by a small margin, is for a 500+ month lag. The n34 signal has a very narrow autocorrelation length: it drops to 0 at a 10-month lag, but it still has a moderate correlation at 6 months, so simply predicting the current value of n34 for the next 6 months is probably better than anything you can do with link strength. Link strength, on the other hand, is very strongly autocorrelated: its autocorrelation just drops linearly over the full length of the signal, and it has an approximately $1/f^3$ power spectrum.

    This does not rule out a complex nonlinear, nonmonotonic relationship. However, given only a relatively small number of samples, many such relations could be contrived, and it would be hard to justify any such relationship inferred from this data even if one really existed. I understand Ludescher et al did use a nonlinear relationship in their prediction, but it appears to have been monotonic: high LS means high n34. This should be apparent in a simple scatterplot and xcorr plot.

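    For readers who want to reproduce the kind of correlation numbers quoted above without matplotlib's xcorr, a bare-bones lagged Pearson correlation suffices (placeholder data; the real inputs would be the aligned monthly link strength and Nino 3.4 anomaly):

    import numpy as np

    def lagged_corr(x, y, lag):
        # Pearson correlation between x[t] and y[t + lag]
        if lag > 0:
            x, y = x[:-lag], y[lag:]
        elif lag < 0:
            x, y = x[-lag:], y[:lag]
        return np.corrcoef(x, y)[0, 1]

    rng = np.random.default_rng(0)
    link_monthly = rng.standard_normal(768)   # placeholder link strength
    nino34_anom = rng.standard_normal(768)    # placeholder Nino 3.4 anomaly

    for lag in (0, 3, 6, 9, 12):
        print(lag, round(lagged_corr(link_monthly, nino34_anom, lag), 3))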
  • 19.

    This is great stuff Daniel. I have to remind myself that ENSO is still one of the big mysteries in climate variability, and there are good reasons why it doesn't yield its secrets easily.

    As I am reading it, the discussion is of whether ENSO behavior is equivalent to the gorilla in the room, and whether it will dominate other measures, thus causing apparent link correlations as the effect of an ENSO event propagates outward. Or whether the link correlations are precursors with a strong enough impact to be able to trigger a strong ENSO El Nino behavior.

    The former is causal in one direction, while the latter goes in the other direction. As Daniel notes, determining the phase alignment is crucial. If a clean few-month lead or lag is found, then the causality is more clear than if the phase is spread out over many months. In the latter case, it becomes difficult to tell if the result is from a past ENSO event or is a harbinger of a future one.

  • 20.
    edited November 2014

    Thanks, Daniel! I lean toward agreeing that Ludescher et al are exaggerating the usefulness of link strength, so my talk will be rather critical of their work.

    Can someone who knows Python read Daniel's code to see if the time issues he mentions are being dealt with correctly? I don't know Python. I guess a key part of the code is here:

    link = pd.read_csv("https://raw.githubusercontent.com/johncarlosbaez/el-nino/master/R/average-link-strength-1948-2013.txt", sep="\s+", header=None, names=["N", "S"])
    dates = dt.date(1948,01,01)+dt.timedelta(days=729)+10*(link["N"])*dt.timedelta(days=1)
    del link["N"]

    This is at least slightly wrong because, as I mentioned, the link strength data ignores leap years. So, the first link strength occurs 729 days after January 1st 1948, and each new link strength after that occurs 10 days after the previous one except when there's a February 29th between the two: then it's 11 days.

    From 1948 to 2013 this effect would cause the times to slowly drift out of alignment by a total of roughly 16 days. (I haven't calculated it very carefully; if anyone wants to, remember that 2000 was a leap year because it's not only a multiple of 100 but also of 400.)

    This effect seems fairly small, but it means we're about half a month off at the end.

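    The rough "16 days" figure is easy to check by counting the February 29ths between the nominal start of the link-strength record (729 days after 1948-01-01, i.e. 1949-12-30) and the end of 2013, assuming each skipped leap day contributes one day of drift; a small Python check:

    import datetime as dt
    from calendar import isleap

    start = dt.date(1948, 1, 1) + dt.timedelta(days=729)   # 1949-12-30
    end = dt.date(2013, 12, 31)

    # each Feb 29 in the span is one day of drift for a calendar that
    # pretends every year has 365 days
    leap_days = sum(
        1 for year in range(start.year, end.year + 1)
        if isleap(year) and start <= dt.date(year, 2, 29) <= end
    )
    print("leap days skipped:", leap_days)   # prints 16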
  • 21.

    Can someone who knows Python read Daniel’s code to see if the time issues he mentions are being dealt with correctly?

    I will look at it tonight with Jacob, my Python guru, if that is OK with Daniel.

    Dara

  • 22.

    Yes please!

    Thanks Daniel

  • 23.

    From 1948 to 2013 this effect would cause the times to slowly drift out of alignment by a total of roughly 16 days. (I haven’t calculated it very carefully; if anyone wants to, remember that 2000 was a leap year because it’s not only a multiple of 100 but also of 400.)

    I realize that. If we are working with 1-month intervals then half a month is the measurement error, so I think that is to be expected. My way actually gets out of phase by 10 days every 2 years before resyncing anyway. The way I do the alignment: since there are 36.5 ten-day periods in a year, I drop every 73rd link strength measurement to give me an even number of 30-day "months" per year on average. I group the remainder by 3 and average them. Given that the link strength signal has a long correlation period, I think small deviations are OK.

    The alternative would be to interpolate individual daily link strength values, group by true calendar months, and average. This would not be that hard with the dateutil library, but I doubt it would make much difference.

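    A minimal sketch of that alternative using pandas, assuming a (leap-year-aware) date is available for each 10-day link-strength sample; the evenly spaced placeholder index starting at 1949-12-30 stands in for the real dates:

    import numpy as np
    import pandas as pd

    # placeholder: 2337 link-strength samples on an (approximately) 10-day grid
    rng = np.random.default_rng(0)
    dates = pd.date_range("1949-12-30", periods=2337, freq="10D")
    strength = pd.Series(rng.standard_normal(len(dates)), index=dates)

    # upsample to daily values by linear interpolation, then average within
    # true calendar months
    daily = strength.resample("D").mean().interpolate("linear")
    link_monthly = daily.resample("MS").mean()
    print(link_monthly.head())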
  • 24.

    I lean toward agreeing that Ludescher et al are exaggerating the usefulness of link strength

    If my analyses are correct then they are saying something stronger than that. Are we sure we have the same actual values for link strength that they have?

  • 25.
    edited November 2014

    Dear John and Daniel

    With your permission I had my Python guru (Jacob Simmons) look at the code and the discussion thread here (sorry, I cannot format the source code):

    First off, it seems that Daniel missed that the data in the file is 1-indexed rather than 0-indexed, so he starts off with an extra 10 days.

    Also, as John points out, this will drift because of leap years. The date list in the naive code ends on 12/14/2013.

    Replace the line

    dates = dt.date(1948,01,01)+vectorize(dt.timedelta)(days=10*link["N"]+729)

    with this:

    from calendar import isleap

    dates = [dt.date(1948, 01, 01)+dt.timedelta(729)]
    for N in link['N'][1:]:
        if all((
            isleap(dates[-1].year),
            dates[-1].month == 2,
            dates[-1].day >= 19,
        )):
            delta = dt.timedelta(11)
        else:
            delta = dt.timedelta(10)
        dates.append(dates[-1]+delta)

    It's probably not the most elegant way to do it, but it does what it needs to. With this, the date list ends on 12/30/2013.

  • 26.

    Hello John

    This is what I got from Wolfram techsupport:

    Hi Dara,

    Thank you for your patience.

    I heard back from our developers and they have told me that Mathematica uses the basic form of the Classical Random Forest Algorithm as described on Wikipedia (http://en.wikipedia.org/wiki/Random_forest). Mathematica generates an ensemble of decision trees using the CART algorithm and then averages their results. You can also find information about the algorithm by evaluating PredictorInformation[randForest, "MethodDescription"] in your code.

    Let us know if this solves your problem. We will be glad to assist you in case you require further help.

    Thanks,
    Yash Gandhi
    Wolfram Technology Engineer, Wolfram Technical Support

  • 27.

    John, the little code that did the Random Forest is correct and the results are accurate, or the Wolfram R&D guys would have said something; they are all PhDs in math :)

  • 28.
    edited November 2014

    Daniel Mahler wrote:

    I am rather strongly convinced that the link strength, at least as represented in the github file, is essentially unrelated to nino34. My main source of doubt is the possibility that there is an error in how I temporally align the signals. It would be great if somebody could take a long hard look at the code in my link analysis

    In order to focus the code review, I took your plain python script and stripped it down to the bare essentials -- removing comments and plotting statements that didn't affect the final values that get computed.

    I will post this reduced code in the next message.

  • 29.

    from pylab import *
    import pandas as pd
    import pandas.tools.plotting as pt
    import datetime as dt

    nino34=pd.read_csv("http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/detrend.nino34.ascii.txt", sep="\s+")

    link = pd.read_csv("https://raw.githubusercontent.com/johncarlosbaez/el-nino/master/R/average-link-strength-1948-2013.txt", sep="\s+", header=None, names=["N", "S"])

    del link["N"]

    print "link.shape", link.shape

    linktmp = link[arange(len(link))%73 != 0]

    ll = len(linktmp)

    linkmnth = linktmp.groupby(arange(ll)//3).mean()
    linkmnth.reset_index()

    linkx=linkmnth
    df=nino34.iloc[:len(linkx)]
    df.index=range(df.shape[0])

    print "len(linktmp)", ll, "len(linkx)", len(linkx), "df.shape", df.shape

    df["link"]=linkmnth

    zz=xcorr(df.link, df.ANOM, maxlags=None)

    mx=argmin(zz[1]), argmax(zz[1])
    print "maxlag:none (argmin,argmax) =", mx, "val0,0", zz[0][mx[0]], "val1,0", zz[1][mx[0]], "val0,1", zz[0][mx[1]], "val1,1", zz[1][mx[1]]

    zz=xcorr(df.link, df.ANOM, maxlags=48)

    mx=argmin(zz[1]), argmax(zz[1])
    print "maxlag:48 (argmin,argmax) =", mx, "val0,0", zz[0][mx[0]], "val1,0", zz[1][mx[0]], "val0,1", zz[0][mx[1]], "val1,1", zz[1][mx[1]]

    print "result", zz[1][zz[0]==0]

  • 30.
    edited November 2014

    Here is the output that gets produced:

    link.shape (2337, 1)
    len(linktmp) 2304 len(linkx) 768 df.shape (768, 5)
    maxlag:none (argmin,argmax) = (186, 876) val0,0 -581 val1,0 -0.0683858849943 val0,1 109 val1,1 0.0648401993941
    maxlag:48 (argmin,argmax) = (5, 38) val0,0 -43 val1,0 -0.0119607762965 val0,1 -10 val1,1 0.057088514805
    result [ 0.02170817]

  • 31.
    edited November 2014

    So, is what you are asking that we validate that the result 0.02170817 is the correct value for the correlation?

    Did I reduce your program correctly? Does my output match against yours?

    I'm not very familiar with this data, nor with the Pandas library (though I've been wanting to learn about it, and here is a great application in action). So I might not be in a position to say whether this code is correct or not.

    Can you add a bit of commentary here explaining what is going on in the short program above? Also, if you could spell out a justification for the steps involved, that might help you to immediately see that it is correct, or to spot problems with it.

  • 32.

    I actually stripped out the line involving the dates, because that was not used in the computation of the final value. I was just mechanically reducing it -- did I follow the wrong path?

  • 33.
    edited November 2014

    The code I would really like people to review is the matching of the 10(11)-day-cycle link strength values with the monthly values of the nino34 anomaly. It is an approximation, but I wanted to check that a) the approximation is reasonable and b) the code correctly implements the approximation I intended rather than something else.

    Since there are more 10-day cycles than months, the main idea is to group the 10-day periods into "months" and take the average of the values that fall into the same month; since 10-day periods do not exactly line up with months, the grouping has to be somewhat approximate anyway.

    The logic is that there are 73 10-day periods in 2 non-leap years, i.e. 24 months. This gives 24 30-day "months" with a 10-day period left over. So I take the 24 "months", match those with calendar months, and drop the remaining 10-day period. This means that my "months" drift in and out of phase with calendar months, but by no more than half a month, i.e. half the resulting sampling period.

    There is also the issue of leap years: both the above approximation and John's algorithm pretend leap years do not exist. This means there could be a cumulative drift, but I think that both algorithms ignore leap years in a way that happens to cancel out, since John's algorithm actually produces exactly 73 periods every 2 years whether there is a leap year or not, and my 30-day "months" then become 31-day months if they include Feb 29. So all that happens is the drifting in and out of phase between the 30-day months and calendar months, which resyncs every 2 years. The code that does this is below.

    drop every 73rd period:

    linktmp=link[arange(len(link))%73 != 0]
    

    group into 3s and average:

    ll= len(linktmp)
    print ll
    linkmnth= linktmp.groupby(arange(ll)//3).mean()
    linkmnth.reset_index()
    
    linkx=linkmnth
    

    only keep data for the intersection of the data sets:

    df=nino34.iloc[:len(linkx)]
    df.index=range(df.shape[0])
    print ll, len(linkx), df.shape
    df["link"]=linkmnth
    

    The code implicitly assumes that the link strength data and the nino34 data start at the same time, the beginning of January 1950.

    For reference John's comments on the link strength data are

    The average link strength, a quantity we've computed and put [here](https://github.com/johncarlosbaez/el-nino/blob/master/R/average-link-strength.txt). This file has the average link strength, called $S$, at 10-day intervals starting from day 730 and going until day 12040, where day 1 is the first of January 1948. (For an explanation of how this was computed, see [Part 4](http://johncarlosbaez.wordpress.com/2014/07/08/el-nino-project-part-4/) of the El Niño Project series.)

    and

    Here’s another way of saying what I said. The first day in the file is 729 days after January 1st, 1948, and the last day is 12039 days after January 1st, 1948. Please don’t ask me to calculate those dates!

    and

    If the actual dates really matter to you, I should warn you of this: climate scientists pretend that the day February 29 on leap years does not exist. For them, every year has 365 days. So, you have to take that into account.

    and

    This is at least slightly wrong because, as I mentioned, the link strength data ignores leap years. So, the first link strength occurs 729 days after January 1st 1948, and each new link strength after that occurs 10 days after the previous one except when there's a February 29th between the two: then it's 11 days.

    From 1948 to 2013 this effect would cause the times to slowly drift out of alignment by a total of roughly 16 days. (I haven't calculated it very carefully; if anyone wants to, remember that 2000 was a leap year because it's not only a multiple of 100 but also of 400.)

    This effect seems fairly small, but it means we're about half a month off at the end.

    I am also very open to suggestions on how to better handle the sampling mismatch between the 2 signals.

  • 34.

    I just realized my code review request has hijacked Dara's Random Forest thread. Future discussion of my code should go into a new thread.

  • 35.

    Daniel,

    I haven't read all the above comments, but this may help on the dates alignment: at the bottom of this [Wiki page](http://www.azimuthproject.org/azimuth/show/Experiments+in+El+Ni%C3%B1o+analysis+and+prediction+), in the section "Second attempt at replicating Ludescher et al", the link strength I calculated and the NINO34 index are plotted together.

  • 36.

    The work of Ludescher leaves nagging suspicions that dipoles of the opposite sign can contribute significantly to the prediction algorithm. Consider that if El Nino events happen roughly every ~4 years, then the dipolar complement will peak at points halfway between these events, or at ~2 years. Some of Ludescher's warning events are also at 2 years; this means that the early warning could just as well be a La Nina dipole phantom. As an example, one such dipole could be a Darwin hotspot outside the NINO3.4 region. Add this to the built-in autocorrelation of an individual event and it gets tricky to extract causality.

    Also don't forget to review Steve Wenner's analysis (http://johncarlosbaez.wordpress.com/2014/07/23/el-nino-project-part-6/), which didn't substantiate Ludescher's approach either.

    In conclusion, Steve said "I wonder if there are better ways to analyze these two correlated time series." I think that Daniel Mahler is on to something with his approach, varying the cross-correlations over a sufficient time window, much longer than the 200-day lead that Ludescher limits it to. Who knows, there might be something else he comes across.

    It may be a little late in the game, but if I were to do this over, I would create a synthetic model of the ocean with a few strategic dipoles, such as at Darwin and one at Hawaii that Daniel discovered. Then see what the Ludescher algorithm yields.

  • 37.

    Here is the addition of the link data, the domain map link ---> anomalies:

    [Random Forest El Niño 3.4 Link-Anomalies](http://files.lossofgenerality.com/random_forrest_nino34.pdf)

  • 38.

    John

    Machine learning algorithms are severely non-linear, and in the way I use them to issue a forecast at each point in time, modifications to the domain of the input data have drastic and unpredictable consequences, even though those consequences are exactly measured in terms of error analysis.

    There is little theory, and there are few algebraic tools, for guessing the results.

  • 39.

    Here is the 6-month lag; scroll to the bottom:

    [Random Forest El Nino 3.4, Links 6-months lag](http://files.lossofgenerality.com/random_forrest_nino34.pdf)

    As I told you, the results for the 6-month lag are similar to the results for the 1-month lag; no new accuracy is obtained.

  • 40.
    edited December 2014

    Thanks, Dara. It seems like you're training the random forest on the whole data set. Daniel: what do you think about this? Isn't it considered wiser to train this sort of method on just a subset of the data, to get a fair estimate of its abilities?

  • 41.

    It seems like you’re training the random forest on the whole data set.

    Yes, that is correct. The logic is that I like to capture the forecast-ability of the entire data set. If you look at a subset, then which subset? The first half, the last half, every other element? No matter how you choose the subset you introduce a bias of some kind (bias might be the wrong terminology).

    Imagine a satellite photo of a region of the planet, where I want to build trees for classifying water vs. rocks vs. vegetation: I would use the entire image for training. It is a similar situation with your data.

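    One common middle ground for time series, which sidesteps the "which subset?" worry, is rolling-origin (expanding-window) evaluation: train on everything up to some month, test on the block that follows, and repeat. A minimal sketch with scikit-learn's TimeSeriesSplit on placeholder data (the series here are synthetic, not the actual link strength or Nino 3.4 data):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import TimeSeriesSplit

    rng = np.random.default_rng(0)
    X = rng.standard_normal((768, 1))   # placeholder predictor (e.g. link strength)
    y = rng.standard_normal(768)        # placeholder target (e.g. Nino 3.4 anomaly)

    # five expanding-window folds: each trains on an initial segment and
    # tests on the months that immediately follow it
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(X[train_idx], y[train_idx])
        print(len(train_idx), "train months -> held-out R^2:",
              round(rf.score(X[test_idx], y[test_idx]), 3))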
  • 42.

    It's kind of a philosophical question. Depends on whether you want an understanding of the data or whether you want to forecast.

    Take an example of the output of a crude power supply. Consider that all one has is one cycle of output.

    1. The forecaster thinks that it is fair to use only one half of that cycle, because then he can use that to forecast the other half of the cycle.
    2. The problem solver wants the whole cycle.

    Why is the problem solver in better shape?

    1. The forecaster looks at the half of a cycle and extrapolates it to a complete cycle. See the dotted line below.
    2. The problem solver looks at the continuation and discovers that it is a full-wave rectified signal. See the solid line below

    ![signal](http://www.geofex.com/effxfaq/d101_06.gif)

    In this case, the problem solver is right because the power supply happens to be a full-wave rectifier needed to create a DC supply voltage.

  • 43.

    Depends on whether you want an understanding of the data or whether you want to forecast

    Agreed 100%.

    Apropos of John's question and application, the best way is to denoise the signal, then sub-sample the data and look for structures or patterns in it. This might not work with the raw signal.

    And do what you were doing, Paul: remove certain frequencies which are known to exist, work on the remaining, less understood sub-signal, and then sub-sample and look for structure.

    I might conclude that there is actually no set way of doing this.

  • 44.

    Hello John

    As Paul noted, this particular forecast was optimized for accuracy of prediction, not for uncovering the data architecture.

  • 45.
    edited December 2014

    The performance of complex models can only really be evaluated on unseen data. This is particularly true of random forest models and largish neural nets, which are capable of approximating any function. This means that these models can "learn" random noise. This R code snippet illustrates the issue:

    library(randomForest)
    
    # create 2 independent random variables x, y
    d <- data.frame(x=rnorm(1000), y=rnorm(1000))
    
    plot(d[,c("x","y")])
    summary(lm(y~x, d))
    
    # train a random forest on the first 500 rows
    rf <- randomForest(y~x, d[1:500,])
    
    # make predictions
    d$ypred <- predict(rf, d)
    
    # as expected the model does no better than random guessing on unseen data
    plot(d[501:1000, c("ypred","y")])
    summary(lm(y~ypred, d[501:1000,]))
    
    # but it achieves R^2 > 0.7 on the training data
    plot(d[1:500, c("ypred","y")])
    summary(lm(y~ypred, d[1:500,]))
    
  • 46.

    This means that these models can “learn” random noise

    Yes, this is true; however, the accuracy is then low. I see that all the time with forecasts of noisy stock market data. So one can get single-digit accuracy (in percent), but that is not enough for many applications; in order to get sub-percent accuracy, noisy data often will not give a good enough forecast.

    So we denoise the data and do a sub-percent-accurate forecast, and then use it in conjunction with the actual data to approximate a forecast for the noisy data.

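    A toy illustration of the denoise-then-forecast idea (the moving-average smoother, the synthetic sine-plus-noise series, and the 6-month lag are all assumptions for illustration, not the actual workflow used here):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    t = np.arange(768)
    clean = np.sin(2 * np.pi * t / 48)                 # slow "ENSO-like" cycle
    noisy = clean + 0.5 * rng.standard_normal(len(t))  # add noise

    # crude denoising: centered 12-month moving average
    kernel = np.ones(12) / 12
    denoised = np.convolve(noisy, kernel, mode="same")

    # forecast the denoised series 6 months ahead from its current value
    LAG, SPLIT = 6, 400
    X, y = denoised[:-LAG].reshape(-1, 1), denoised[LAG:]
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[:SPLIT], y[:SPLIT])
    print("held-out R^2 on the denoised series:", round(rf.score(X[SPLIT:], y[SPLIT:]), 3))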