
Off the shelf machine learning for John

Predict

John, Mathematica 10.0.1 has most of the popular forecast algorithms packaged up. They are canned, with not much reconfiguration or parallelism possible, but for the sake of your off-the-shelf demonstrations they are pretty adequate.

I have not used them yet, but this week and next I might try to get them to work with some of the data we were talking about, so we know how good the results could be.

If they work OK, I will add them to the Django server for you to try on different data.

Dara

Comments

  • 1.
    edited October 2014

    This sounds great! I hope I explained some of the things I want to do in sufficient detail for you to try them. If not, please ask questions.

    But this looks so easy to run that I could run it myself. I should! So, please help me get started as soon as you easily can.

    Also, please remember: it's okay if these popular forecast algorithms work badly! I need to learn about them. I need to give a talk very soon. It would be interesting in this talk to explain how bad these algorithms are at predicting El Niños.

  • 2.
    edited October 2014

    Note that you have to be a little bit careful about drawing conclusions from machine learning experiments with negative outcomes. If your algorithm learns a model that has 98% accuracy, you can conclude that this technique can achieve 98% on the task. If your algorithm learns a model that achieves 10% worse performance than just constantly picking the most likely answer, you can't really conclude that the algorithm is hopeless on this problem. It might be that

    1. the optimization ended up in a local minimum
    2. algorithms generally find solutions over a limited solution space, and it may be the variables that you used don't allow the good solution in those terms, but some preprocessing/feature extraction might transform the data in such a way that it could give a very good algorithm.
    3. Maybe the training data set is just too small for the algorithm to see the effective prediction patterns from the noise. (Some of the popular pieces about Deep Learning have this kind of narrative in them.)

    Of course it's also very possible your algorithm produced a bad model that was the best it can possibly do, and you can certainly validly say "I tried very hard and the best the algorithm could come up with was bad, so the algorithm is likely to be useless for this task".
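A minimal sketch of the baseline comparison described above, assuming a yes/no classification task. The labels and the "model" predictions below are made up purely for illustration; they are not results from any real experiment.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def majority_baseline(train_labels, n):
    """Predict the most common training label n times."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n

# Hypothetical labels: 1 = El Nino month, 0 = otherwise (made-up data).
train = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
test = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
# Stand-in for the output of some learned model (also made up).
model_predictions = [0, 1, 1, 0, 0, 1, 0, 0, 0, 0]

baseline_acc = accuracy(majority_baseline(train, len(test)), test)
model_acc = accuracy(model_predictions, test)
print(f"baseline accuracy: {baseline_acc:.2f}, model accuracy: {model_acc:.2f}")
```

Here the stand-in model scores below the constant-guess baseline. As argued above, that shows this particular trained model is poor; it does not by itself show that the technique is hopeless.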

    (A kind of learned "academic's strategic cowardice" is why I'm a lot more interested in looking at models where, even if they don't work well as predictors, you can look at what they've produced and maybe get something out of analysing why they didn't work ;-) )

    Comment Source:Note that you have to be a little bit careful about drawing conclusions from machine learning experiments with negative outcomes. If your algorithm learns a model that has 98% accuracy, you can conclude that this technique can achieve 98% on the task. If your algorithm learns a model that achieves 10% worse performance than just constantly picking the most likely answer, you can't really conclude that the algorithm is hopeless on this problem. It might be that 1. the optimization ended up in a local minimum 2. algorithms generally find solutions over a limited solution space, and it may be the variables that you used don't allow the good solution in those terms, but some preprocessing/feature extraction might transform the data in such a way that it could give a very good algorithm. 3. Maybe the training data set is just too small for the algorithm to see the effective prediction patterns from the noise. (Some of the [popular pieces about Deep Learning](http://www.wired.com/2014/01/geoffrey-hinton-deep-learning) have this kind of narrative in them.) Of course it's also very possible your algorithm produced a bad model that was the best it can possibly do, and you can certainly validly say "I tried very hard and the best the algorithm could come up with was bad, so the algorithm is likely to be useless for this task". (A kind of learned "academic's strategic cowardice" is why I'm a lot more interested in looking at models where, even if they don't work well as predictors, you can look at what they've produced and maybe get something out of analysing why they didn't work ;-) )
  • 3.
    edited October 2014

    Apart from local minima and the uninterpretability of any individual neuron weight or set of neuron weights (said to be a reason for a decline in NN popularity), some interesting new problems with NNs have just been described in:

    Christian Szegedy et al., Intriguing properties of neural networks (2014)

    Comment Source:Apart from local minima and the uniterpretability of any individual or set of neuron weights (said to be a reason for a decline in NN popularity) some new interesting problems with NNs have just been described in: Christian Szegedy et al., [Intriguing properties of neural networks (2014)](http://cs.nyu.edu/~zaremba/docs/understanding.pdf)
  • 4.
    edited October 2014

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus, Intriguing properties of neural networks (2014)

    Every deep neural network has "blind spots" in the sense that there are inputs that are very close to correctly classified examples that are misclassified.

    Since the very start of neural network research it has been assumed that networks had the power to generalize. That is, if you train a network to recognize a cat using a particular set of cat photos the network will, as long as it has been trained properly, have the ability to recognize a cat photo it hasn't seen before.

    Within this assumption has been the even more "obvious" assumption that if the network correctly classifies the photo of a cat as a cat then it will correctly classify a slightly perturbed version of the same photo as a cat. To create the slightly perturbed version you would simply modify each pixel value, and as long as the amount was small, then the cat photo would look exactly the same to a human - and presumably to a neural network.

    However, this isn't true.

    What the researchers did was to invent an optimization algorithm that starts from a correctly classified example and tries to find a small perturbation in the pixel values that drives the output of the network to another classification. Of course, there is no guarantee that such a perturbed incorrect version of the image exists - and if the continuity assumption mentioned earlier applied the search would fail. However the search succeeds.

    For a range of different neural networks and data sets it has proved very possible to find such "adversarial examples" from correctly classified data. To quote the paper

    For all the networks we studied, for each sample, we always manage to generate very close, visually indistinguishable, adversarial examples that are misclassified by the original network.

    To be clear, the adversarial examples looked to a human like the original, but the network misclassified them. You can have two photos that look not only like a cat but the same cat, indeed the same photo, to a human, but the machine gets one right and the other wrong.
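The idea of searching for a small perturbation that flips the classification can be illustrated on a much simpler model. The sketch below uses a linear classifier as a stand-in, not a deep network and not the optimization method from the paper; for a linear boundary the smallest flipping perturbation has a closed form. All weights and inputs are invented for the example.

```python
def score(w, b, x):
    """Signed score of a linear classifier; its sign decides the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    return 1 if score(w, b, x) > 0 else 0

def minimal_flip(w, b, x, overshoot=1e-3):
    """Move x along w just far enough to cross the boundary w.x + b = 0."""
    s = score(w, b, x)
    norm_sq = sum(wi * wi for wi in w)
    step = (1 + overshoot) * s / norm_sq
    return [xi - step * wi for xi, wi in zip(x, w)]

# Hypothetical 1000-"pixel" input and weights (made up for illustration).
dim = 1000
w = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
b = 0.0
x = [0.5 + 0.01 * wi for wi in w]

x_adv = minimal_flip(w, b, x)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
print(classify(w, b, x), classify(w, b, x_adv), max_change)
```

In this toy case each coordinate moves by only about 0.01, yet the predicted class changes. The more dimensions there are, the smaller the per-coordinate change needed to shift the score by a fixed amount.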

    ...

    One possible explanation is that this is another manifestation of the curse of dimensionality. As the dimension of a space increases it is well known that the volume of a hypersphere becomes increasingly concentrated at its surface. (The volume that is not near the surface drops exponentially with increasing dimension.) Given that the decision boundaries of a deep neural network are in a very high dimensional space it seems reasonable that most correctly classified examples are going to be close to the decision boundary - hence the ability to find a misclassified example close to the correct one, you simply have to work out the direction to the closest boundary.

    If this is part of the explanation, then it is clear that even the human brain cannot avoid the effect and must somehow cope with it; otherwise cats would morph into dogs with an alarming regularity.

    The bottom line is that deep neural networks do not seem to be continuous with respect to the decisions they make and exhibit a new sort of instability. Rather than patch things up with adversarial training cases, research is needed to explore and eliminate the problem. Until this happens you cannot rely on a neural network in any safety critical system.
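The hypersphere claim in the curse-of-dimensionality explanation quoted above can be checked with a few lines of arithmetic. The volume of a ball of radius r in d dimensions scales as r^d, so the fraction of a unit ball lying within eps of its surface is 1 - (1 - eps)^d. The sketch below only evaluates that formula; the choice of eps = 0.01 is arbitrary.

```python
def shell_fraction(dim, eps):
    """Fraction of a unit ball's volume lying within eps of its surface."""
    return 1.0 - (1.0 - eps) ** dim

# Evaluate the thin outer shell (1% of the radius) in several dimensions.
fractions = {dim: shell_fraction(dim, 0.01) for dim in (2, 10, 100, 1000)}
for dim, frac in fractions.items():
    print(f"dimension {dim}: {frac:.4f} of the volume is in the outer 1% shell")
```

In 2 dimensions the shell holds about 2% of the volume; by 1000 dimensions it holds nearly all of it.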


    • Mikolov et al.,

      various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies. ...

    Comment Source:Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus, [Intriguing properties of neural networks (2014)](http://cs.nyu.edu/~zaremba/docs/understanding.pdf) > Every deep neural network has "blind spots" in the sense that there are inputs that are very close to correctly classified examples that are misclassified. > Since the very start of neural network research it has been assumed that networks had the power to generalize. That is, if you train a network to recognize a cat using a particular set of cat photos the network will, as long as it has been trained properly, have the ability to recognize a cat photo it hasn't seen before. > Within this assumption has been the even more "obvious" assumption that if the network correctly classifies the photo of a cat as a cat then it will correctly classify a slightly perturbed version of the same photo as a cat. To create the slightly perturbed version you would simply modify each pixel value, and as long as the amount was small, then the cat photo would look exactly the same to a human - and presumably to a neural network. However, this isn't true. What the researchers did was to invent an optimization algorithm that starts from a correctly classified example and tries to find a small perturbation in the pixel values that drives the output of the network to another classification. Of course, there is no guarantee that such a perturbed incorrect version of the image exists - and if the continuity assumption mentioned earlier applied the search would fail. However the search succeeds. For a range of different neural networks and data sets it has proved very possible to find such "adversarial examples" from correctly classified data. To quote the paper > For all the networks we studied, for each sample, we always manage to generate very close, visually indistinguishable, adversarial examples that are misclassified by the original network. 
> To be clear, the adversarial examples looked to a human like the original, but the network misclassified them. You can have two photos that look not only like a cat but the same cat, indeed the same photo, to a human, but the machine gets one right and the other wrong. ... > One possible explanation is that this is another manifestation of the curse of dimensionality. As the dimension of a space increases it is well known that the volume of a hypersphere becomes increasingly concentrated at its surface. (The volume that is not near the surface drops exponentially with increasing dimension.) Given that the decision boundaries of a deep neural network are in a very high dimensional space it seems reasonable that most correctly classified examples are going to be close to the decision boundary - hence the ability to find a misclassified example close to the correct one, you simply have to work out the direction to the closest boundary. > If this is part of the explanation, then it is clear that even the human brain cannot avoid the effect and must somehow cope with it; otherwise cats would morph into dogs with an alarming regularity. > The bottom line is that deep neural networks do not seem to be continuous with respect to the decisions they make and exhibit a new sort of instability. Rather than patch things up with adversarial training cases, research is needed to explore and eliminate the problem. Until this happens you cannot rely on a neural network in any safety critical system.. --- * Mikolov et al., various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies. ...
  • 5.
    edited October 2014

    various directions in the vector space representing the words are shown to give rise to a surprisingly rich semantic encoding of relations and analogies.

    I have used this in full measure and it is amazing. I even found a relationship between the words in companies' financial reports and their stock prices. It is amazing how much is encoded in a few words in terms of relationships.

    My plan, in another application, is to apply this vector space representation of words to various word problems in group theory and algebra. Mathematicians try to prove theorems or construct exact formulas for groups and algebras; my suggestion is to use machine learning to establish similarity relationships or clusters in large sets of words.

    Most of the ugly programming for that is already done by yours truly; I am just looking for nice applications in math.
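As a toy illustration of how directions in a word vector space can encode analogies and similarity, here is a sketch using hand-made 3-dimensional vectors. These vectors are invented for the example; real word embeddings are learned from text and have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Made-up vectors: axis 0 = "person", axis 1 = "female", axis 2 = "royal".
vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}

answer = analogy(vectors, "man", "king", "woman")
print("man is to king as woman is to", answer)
```

The same cosine similarity could serve as the distance for clustering large sets of words, which is the kind of use suggested above.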

    Dara

  • 6.

    Mathematica 10.0.1 machine learning off the shelf, just a sample to evaluate:

    Darwin Anomalies, Mathematica Machine Learning

    John, I got Random Forest and Nearest Neighbors working, but somehow the Neural Networks did not give a good prediction. I will contact tech support at Wolfram Research this week.

    Worst case scenario I will use my own homemade C code for NN, so I have control to adjust for far better training.

    I could change the data to multiple dimensions and small images if you would like to get fancy.

    Please see the error analysis GUI I added for Random Forest. I could do that for all the algorithms.

    Dara

    Comment Source:Mathematica 10.0.1 Machine Learning off-the-shelf, just a sample to evaluate : [Darwin Anomalies, Mathematica Machine Learning](http://files.lossofgenerality.com/mlofftheshelf.pdf) John I got Random Forest and Nearest Neighbors working, but somehow the Neural Networks did not give a good predication, I will contact tech support at Wolfram Research this week. Worst case scenario I will use my own homemade C code for NN, so I have control to adjust for far better training. I could change the data to multiple dimensions and small images if you like to get fancy. Please see the error analysis GUI I added for Random Forest, I could do that for all the algorithms Dara
  • 7.

    John

    If some algorithm works for your purposes, I will place the script on our servers and give you access to run it on any data, or parts of data, you like. All you have to do is adjust the parameters for the algorithm and choose different geographic data from your laptop, without needing to download code or data.

    The results will be emailed to you by the server.

    Dara

  • 8.

    Dara - this looks interesting!

    With the Darwin anomaly prediction, what are you trying to predict? And from what data?

    For example, are you trying to use 6 months of Darwin air pressure anomaly data to predict the next month's Darwin air pressure anomaly?

    (I pulled the number 6 out of a hat, just as an example.)

    Comment Source:Dara - this looks interesting! With the [Darwin anomaly prediction](http://files.lossofgenerality.com/mlofftheshelf.pdf), what are you trying to predict? And from what data? For example, are you trying to use 6 months of Darwin air pressure anomaly data to predict the next month's Darwin air pressure anomaly anomaly? (I pulled the number 6 out of a hat, just as an example.)
  • 9.
    edited October 2014

    David wrote:

    Note that you have to be a little bit careful about drawing conclusions from machine learning experiments with negative outcomes.

    Thanks! Yes, I should be careful in drawing conclusions and I hope to show my talk slides around and have them mercilessly critiqued before (not during) my talk on this. But of course I need to get started doing something... now!

  • 10.
    edited October 2014

    Hello John

    Data: Darwin Anomaly

    I just wanted to grab some time series to test and see how the new Mathematica machine learning works.

    So the ML algorithms take a list like this:

    1-->f(1)

    2-->f(2)

    ...

    n-->f(n)

    In this case I used the last 300 months.

    The ML algorithms give a function forecast() which approximates the above map of numbers. For the error, Variance(forecast(x)-f(x)) was computed for x from 1 to n.

    If you want to forecast the value k steps ahead, then compute

    forecast(n+k), where n is the last index in the list above.

    In your example, that means forecast(n+1) to forecast(n+6).

    The programmer has only a small amount of control over configuring the algorithm; it is mostly self-configured and easy to use (hah!), hence off-the-shelf.
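The setup described above (fit a map from the index x to f(x), measure the variance of the in-sample residuals, then evaluate forecast(n+k)) can be sketched as follows. This is not Mathematica's implementation: it assumes a plain hand-written nearest-neighbors average as the learner and a synthetic sine series in place of the Darwin data, and the names knn_forecast and variance are hypothetical helpers.

```python
import math

def knn_forecast(xs, ys, k=3):
    """Return forecast(x): mean of the k training values with index nearest x."""
    def forecast(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:k]
        return sum(ys[i] for i in nearest) / k
    return forecast

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Synthetic stand-in for 300 months of anomaly data (not the real series).
n = 300
xs = list(range(1, n + 1))
ys = [math.sin(2 * math.pi * x / 48) for x in xs]

forecast = knn_forecast(xs, ys)

# In-sample error: Variance(forecast(x) - f(x)) for x from 1 to n.
residuals = [forecast(x) - f for x, f in zip(xs, ys)]
in_sample_error = variance(residuals)

# Forecasts for the next six months: forecast(n+1) ... forecast(n+6).
next_six = [forecast(n + k) for k in range(1, 7)]
print(in_sample_error, next_six)
```

One thing this sketch makes visible: with the index as the only input, a nearest-neighbors learner returns the same value for every forecast(n+k), because the nearest training indices are always the last few months. A small in-sample error therefore says little about forecasting skill beyond the end of the data.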

    Dara

    Comment Source:Hello John Data [Darwin Anomaly](http://www.cgd.ucar.edu/cas/catalog/climind/darwin.anom.ascii) I just wanted to grab some time series to test and see how the new Mathematica machine learning works. So the ML algorithms take a list like this: 1-->f(1) 2-->f(2) ... n-->f(n) In this case I used the last 300 months. The ML algorithms give a function forecast() which approximates the above map of numbers. For error Variance(forecast(x)-f(x)) was computed for x between 1 to n. If you want to forecast the next kth value then compute forecast(n+k) n is the last value in the list above In your example forecast (n+1) to forecast(n+6) The programmer has small amount of control on configuring the algorithm, it is mostly self-configured and easy to use (hah!) , hence **off-the-shelf**. Dara
  • 11.

    John, in the above map you could replace the domain with anything from images to matrices:

    image --> f(image)

    f could be a count or measurement of something in a geographic area, such as a map of heat or pressure.

  • 12.

    I am planning to get the data from the Django server so you could apply the same algorithm to different regions of the planet: simply enter the longitude and latitude, or rubber-band a box around the area.

    D

  • 13.

    Hello John

    I got the Random Forest to predict the next month's value and it is good, but the next 6 were not up to par.

    So here is the vicious circle of fine tuning and re-training. I will use some training techniques to try to get something reasonable for the 6 forecasts.

    If not, I will use my custom code.

    Dara

  • 14.
    edited October 2014

    Dara - you didn't answer my last question:

    With the Darwin anomaly prediction, what are you trying to predict? And from what data?

    At least, you didn't answer it in a way I can understand. I wasn't asking how you did the prediction. I was asking what you were trying to predict, and from what data.

    Here's my new guess:

    What you are trying to predict: the Darwin air pressure anomaly on any given month $N$.

    From what data: the Darwin air pressure anomaly on the 300 previous months.
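One common way to set up "predict month N from the previous months" is to turn the series into training pairs of lagged windows and next values. The sketch below assumes that framing; the helper name make_lagged_pairs, the window length, and the anomaly values are all made up for illustration, and it does not describe how the linked Mathematica notebook was actually built.

```python
def make_lagged_pairs(series, window):
    """Turn a series into (previous `window` values, next value) pairs."""
    pairs = []
    for t in range(window, len(series)):
        pairs.append((series[t - window:t], series[t]))
    return pairs

# Made-up monthly anomaly values, purely illustrative.
series = [0.1, -0.3, 0.4, 0.9, 1.2, 0.7, -0.2, -0.8]

# window=3 keeps the example small; the guess above would use window=300.
pairs = make_lagged_pairs(series, window=3)
for features, target in pairs:
    print(features, "->", target)
```

Each pair can then be handed to any regression method as one training example, with the window as the input and the following month as the output.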

    Comment Source:Dara - you didn't answer my last question: > With the [Darwin anomaly prediction](http://files.lossofgenerality.com/mlofftheshelf.pdf), what are you trying to predict? And from what data? At least, you didn't answer it in a way I can understand. I wasn't asking _how_ you did the prediction. I was asking _what_ you were trying to predict, and from what data. Here's my new guess: *What you are trying to predict:* the Darwin air pressure anomaly on any given month $N$. *From what data:* the Darwin air pressure anomaly on the 300 previous months.
  • 15.

    Hello John

    Your guess was correct. I need to be more careful to document and explain things better here. I am taking more time away from programming to attend to communication.

    This off-the-shelf use of the neural network algorithm in Mathematica does not seem to be suitable for the sorts of data we are dealing with. The problem is that I cannot fine-tune the training sessions; the variables are hidden by the API.

    Next I will post the customized versions I did for you.

    BTW I used the Darwin anomaly data, but please choose whatever you like.

    Dara
