
Preamble to the analysis of the Pandas/Python data analysis framework

edited March 2016 in General

Hello Azimuth friends, I've been having a great time learning the Pandas framework, which is embedded in the Python language, and is key to the "scientific python ecosystem." I'm starting this thread, and another one, to share some of these ideas. I'm also hoping to generate some raw material here for a blog article.

Here is a classic reference book:

Here is a recommended primer from the Pandas website:

Here are the main components of the scientific python ecosystem. I am paraphrasing/quoting from McKinney:

  • NumPy. Short for numerical python, NumPy is the foundational package for scientific computing in Python. It provides a fast and efficient multi-dimensional array object; functions for performing element-wise computations with arrays or mathematical operations between arrays; tools for reading and writing array-based data sets to disk; linear algebra operations, Fourier transform, and random number generation; tools for integrating other languages with Python.

  • pandas. Pandas provides rich data structures and functions designed to make working with structured data fast, easy and expressive. The primary object in pandas is the DataFrame, a two-dimensional tabular, column-oriented structure with both row and column labels. Pandas combines the high performance array-computing features of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases.

And, I may add: it is seamlessly integrated with Python, a mature high-level language with mechanisms for abstraction, functional programming, and object orientation, plus extensive libraries for systems programming, web service interfaces, and so on.

For users of the R statistical computing language, the DataFrame name will be familiar, as it was named after the similar R data.frame object. They are not the same however, as the functionality provided by the R data frame is essentially a strict subset of that provided by the pandas DataFrame.

  • matplotlib. The most popular Python library for producing plots and other 2D visualizations. It is maintained by a large team of developers, and is well-suited for creating publication-quality plots.

  • IPython. IPython is the component in the toolset that ties everything together; it provides a robust and productive environment for interactive and exploratory computing.

  • SciPy. SciPy is a collection of packages addressing a number of different standard problem domains in scientific computing. It includes: scipy.integrate, with numerical integration routines and differential equation solvers; scipy.linalg, with linear algebra and matrix decomposition algorithms; scipy.optimize, with function optimizers and root finding algorithms; scipy.signal, with signal processing tools; scipy.sparse, for sparse matrices and sparse linear system solvers; scipy.stats, with standard continuous and discrete probability distributions, statistical tests, and descriptive statistics; scipy.weave, a tool for using inline C++ code to accelerate array computations.

Together NumPy and SciPy form a reasonably complete computational replacement for much of MATLAB along with some of its add-on toolboxes.

And, I may add: it is free!
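
To make the NumPy and pandas descriptions above a little more concrete, here is a minimal sketch. The station and month labels and the temperature values are made up for illustration; only the library calls are real.

```python
import numpy as np
import pandas as pd

# Element-wise arithmetic on a NumPy array, with no explicit loop.
temps_c = np.array([[10.0, 12.5], [20.0, 22.5], [30.0, 32.5]])
temps_f = temps_c * 9.0 / 5.0 + 32.0

# The same data as a DataFrame with row and column labels.
# "stn_a", "jan", etc. are hypothetical labels.
df = pd.DataFrame(
    temps_c,
    index=["stn_a", "stn_b", "stn_c"],
    columns=["jan", "feb"],
)

# Label-based selection of a single cell, and a per-column mean.
jan_b = df.loc["stn_b", "jan"]
col_means = df.mean()
```

The array handles the raw numerics, while the DataFrame adds the labels that make selections like `df.loc["stn_b", "jan"]` readable.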

Comments

  • 1.
    edited March 2016

    I will be starting a separate thread on the semantics and usage of the Pandas/Python data analysis framework.

    The core structures in Pandas are the Series and the DataFrame, which are enrichments, respectively, of the notions of sequences and relations. The entirety of the algebra of (finite) sequences, and of relational algebra, therefore applies to them. And these operations are implemented in the library. For example, the DataFrame object contains methods for performing joins with other DataFrames (relations).
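
As a rough illustration of the join methods mentioned above, here is a small sketch using `DataFrame.merge`. The two tables and their column names are invented for the example.

```python
import pandas as pd

# Two small "relations" sharing a key column (hypothetical data).
stations = pd.DataFrame(
    {"station_id": [1, 2, 3], "name": ["north", "south", "east"]}
)
readings = pd.DataFrame(
    {"station_id": [1, 1, 2, 4], "temp": [10.5, 11.0, 20.0, 5.0]}
)

# Inner join: only keys present in both tables survive.
inner = stations.merge(readings, on="station_id", how="inner")

# Left join: every station is kept; unmatched ones get NaN for temp.
left = stations.merge(readings, on="station_id", how="left")

# Selection followed by projection, as in relational algebra.
warm = inner[inner["temp"] > 10.7][["name", "temp"]]
```

Station 3 has no readings, so it drops out of the inner join but shows up in the left join with a missing temperature.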

    By the way, it's a lot nicer to do this kind of data programming right inside of a high-level language, as opposed to using a client-server model, with two entirely different frameworks on the client and the server. Of course, servers do have an important role to play, when there are large volumes of data to be managed, concurrent processes, etc. But lots of routine data analysis tasks do not call for such an industrialized approach. This advantage is clearly shared by many other data programming frameworks. But it doesn't hurt to be working inside of a general-purpose programming language that is powerful and elegant.

    Anyhow, back to the topic of the algebras. So all of the algebra of finite sequences and relations will apply, and can be implemented, on the pandas objects. But there's much more, because the Series and the DataFrame include substantial enrichments of these structures.
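
One example of such an enrichment, as I understand it, is that a Series carries an index of labels, and arithmetic aligns on those labels rather than on position. A minimal sketch with made-up values:

```python
import pandas as pd

# Two Series whose labels only partly overlap.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Addition aligns by label over the union of the two indexes;
# labels present on only one side yield NaN.
total = a + b

# The add method lets us treat a missing side as zero instead.
filled = a.add(b, fill_value=0.0)
```

A plain sequence would add position by position; here "z" is matched with "z" even though it sits at different positions in the two Series.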

    Whereas universal algebra is typically applied to systems with a relatively small number of operations, the algebra for Pandas objects will contain just two object types, with a large number of interesting and useful primitive operations. I am curious to see what the categorists might find here.

  • 2.
    edited March 2016

    You may detect some notes of enthusiasm in the ways that I speak of this framework.

    For various psychological reasons, I believe that discussions of programming languages, by people who actually use them, are always partisan -- no matter whether they are handled crudely or cordially. Part of it is that these languages are a mode of our own thinking, so criticisms of the languages can feel like criticisms of our thinking. Imagine if I could only think in English (which is true), and some guys say that it is a completely backwards language to think in, and that all problems are best formulated in some language that is completely alien to me. Sure I'd have a reaction!

    Elsewhere I have seen discussions between intelligent colleagues really deteriorate over these matters. In the end, it's as if they were debating the pros and cons of French versus Spanish, or Lions versus Tigers. I'll take the Mane over the Stripes, any day!

    There is another reason why it is hard for programmers to maintain objectivity about languages. The analogy to two scientists discussing and comparing lions and tigers is incomplete, because programming languages are actually used to get jobs done. If I needed a lion or a tiger as a bodyguard, then I might really think that one is better than the other, if, for example, it had a more menacing stare.

  • 3.
    edited March 2016

    But with this understanding, we can find ways to organize our conversations, so that everything gets a chance to be said, in a productive manner.

    Suppose that you thought that Python was really misguided, and that language X was a much more promising prospect.

    Then why not start a thread dedicated to X, where you could expound its features, and what you think is good about them?

    Then, if you wanted to dig deeper into the comparison between X and Python, then why not start a thread called Comparison of X versus Python?

  • 4.

    The only thing that I would specifically ask is that you would help me to keep the new thread, soon to be created, on topic.

  • 5.
    edited March 2016

    I use the PyData stack (i.e. NumPy, pandas, matplotlib & co) pretty much all the time now and I am very happy with it.

    Two other key components are Jupyter Notebooks (http://jupyter.org/) and Anaconda (https://www.continuum.io/downloads). Jupyter provides Mathematica-like notebooks, and Anaconda is a package management system that makes it easier to stay out of dependency hell.

    Jupyter Notebooks, originally called IPython Notebooks, are what I used to create stuff for John's NIPS talk on climate networks. There is a fair bit of enthusiasm around using Jupyter for improving the reproducibility and accessibility of scientific research.

    Other math/science/data oriented Python tools of note

    • scikit-learn - machine learning
    • scikit-image & PIL/Pillow - image processing
    • Blaze (http://blaze.pydata.org) - data transformation pipelines & simplified interactions with various data stores
    • Bokeh - interactive web visualizations
    • SymPy - symbolic algebra (also see Sage)
    • Numba - a very easy to use JIT compiler (just import it and put a @jit annotation on the functions you want compiled)

    and for dealing with genuinely big data there is PySpark (also something called Ibis that I have not tried yet)
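
For the Numba item above, a minimal sketch of what the `@jit` usage looks like. The function and data are made up for illustration. The `try`/`except` fallback is my own addition so that the snippet still runs (uncompiled) where Numba is not installed, and I pass `nopython=True` explicitly rather than using the bare decorator.

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fallback (an assumption for this sketch): a do-nothing decorator,
    # so the code runs as plain Python if Numba is missing.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


@jit(nopython=True)
def sum_of_squares(values):
    # An explicit loop: slow in plain Python, fast once compiled.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


data = np.arange(5, dtype=np.float64)
result = sum_of_squares(data)
```

The first call triggers compilation, so any speedup only shows on later calls or on large inputs.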

    I think there are languages/systems that do individual things better than Python, but using the complete system is hard to beat. I still use R for some things, mainly initial explorations, but increasingly I just live in Python - and of course Emacs ;).

  • 6.
    edited March 2016

    The key, as David points out, is that pandas DataFrames and NumPy arrays are what tie all these components together.

  • 7.
    edited March 2016

    Also:

    • Interfaces to SQL etc. through Python.

    • HDF5 support through Pandas. HDF5 is a file format, originally from NCSA (the National Center for Supercomputing Applications), for storing large volumes of data, in an efficient yet semantically flexible manner. Within the file there is essentially a complete file system, with symbolic links, etc., and leaf nodes that hold arrays of binary data.

    I haven't tried it, but the McKinney book makes it look like an HDF5 file can be represented through a DataFrame object. I can't really picture how this could work if the file were too big to fit into memory -- it would be great if it did, but also a major achievement to do so. This is getting into the functional area of KDB. Does anyone know what is the status of Pandas/NumPy development for large arrays that don't fit into memory?
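
On the SQL interface mentioned above, here is a small sketch using the standard-library `sqlite3` module with an in-memory database. The table name, column names and values are invented for the example.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical data, written out as a SQL table.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp": [4.0, 19.5, 27.0]}
)
df.to_sql("readings", conn, index=False)

# Run a query and get the result back as a DataFrame.
warm = pd.read_sql_query(
    "SELECT city, temp FROM readings WHERE temp > 10 ORDER BY temp",
    conn,
)
conn.close()
```

The query result is an ordinary DataFrame, so it can go straight into the rest of the pandas toolkit.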

  • 8.
    edited March 2016

    Hi Daniel, what platform do you run it on? I'm looking for a good distribution for the Mac.

    EDIT: I just realized that I have an old version of Enthought Canopy on my Mac, which includes Pandas and NumPy. This will do for now, but I'm still curious to hear recommendations for the scientific python ecosystem on the Mac. Thanks.

  • 9.

    Anaconda is available for Mac, Linux and Windows. I use it on Linux and other people at work use it on Macs. Anaconda makes it much easier to get compatible environments. I tried Canopy but Anaconda seems much better to me.

    One important thing is to install it separately from your system Python so that your system package manager is not fighting with Anaconda for control. I usually install it in my personal home directory, but I put it on $PATH ahead of system directories, so it becomes my default. Works like a charm that way.

  • 10.

    Regarding HDF5 and DataFrames, I think the standard pandas interface does load the file into memory, but it is possible to implement the DataFrame API in a way that fetches the data lazily from disk. This is done in the dask and bcolz packages.
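
To give a flavour of processing data without loading it all at once, here is a sketch using plain pandas. Note that it swaps in a CSV file and the `chunksize` option of `read_csv` as a stand-in: it is not the HDF5 interface, and it is not how dask or bcolz work internally. The file name and data are made up.

```python
import os
import tempfile

import pandas as pd

# Write a small CSV to a temporary directory (stand-in for a big file).
path = os.path.join(tempfile.mkdtemp(), "readings.csv")
pd.DataFrame({"value": range(1, 11)}).to_csv(path, index=False)

# Stream the file a few rows at a time, keeping only running totals
# in memory rather than the whole table.
total = 0
count = 0
for chunk in pd.read_csv(path, chunksize=3):
    total += chunk["value"].sum()
    count += len(chunk)

mean = total / count
```

This works for reductions that can be built up incrementally; operations that need the whole table at once, such as a global sort, are where libraries designed for lazy evaluation come in.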

  • 11.
    edited March 2016

    Daniel, thanks for your good input. Anaconda looks great!!

  • 12.

    Even apart from HDF5, managing large DataFrames that don't fit into memory is challenging and important. Add the prospect of writing to the DataFrame, and doing all of this efficiently becomes even more challenging. My understanding is that K/KDB/Q is masterful at efficiently managing large arrays that cannot fit into memory. Interestingly, I see there is now an open-source variant of K itself, called Kona. Not sure how significant that is, but in any case, I hope that Pandas will one day catch up in the area of high-performance disk-based array processing.

  • 13.

    David, take a look at Dask (and also Bcolz). They provide out-of-core DataFrames, i.e. DataFrames that need not fit into memory. Bcolz uses its own datastore, but Dask can use a number of different data stores, including HDF5 and the Bcolz store.

    I have been interested in the K family for a long time. Have not heard of Kona. Thanks.
