
Semantics and usage of the Pandas/Python data analysis framework

edited March 2016 in General

For the context and framing of this discussion, please see the prior thread:

Python data types

The Python language contains a whole range of standard types, including primitive value types (int, float, etc.), lists, tuples, dictionaries (i.e. finite mappings), functions and objects. For tutorials and reference information, see the official Python documentation.

ndarray (NumPy)

The Python library NumPy provides an n-dimensional array type, ndarray. All the elements of an ndarray must have the same type (dtype), which allows the data to be packed into a contiguous block of memory. This is an efficient representation, and it makes ndarray a good format for interfacing with libraries that are external to Python. NumPy provides operators that apply element-wise operations to entire arrays (vectorization). So, even though the Python interpreter has performance deficits in comparison with statically typed, compiled languages, by using vectorized operators on large data sets the critical inner loops are performed in the compiled NumPy library, rather than in the Python interpreter.
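
Here is a minimal sketch of what vectorization looks like in practice; the array values are arbitrary illustrations, and the exact output formatting varies by NumPy version:

    >>> import numpy as np
    >>> a = np.array([1.0, 2.0, 3.0])
    >>> b = np.array([10.0, 20.0, 30.0])
    >>> a + b                  # element-wise addition runs in compiled NumPy code
    array([ 11.,  22.,  33.])
    >>> 2 * a                  # a scalar is broadcast across the whole array
    array([ 2.,  4.,  6.])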

Series and DataFrame (Pandas)

These two data types (classes in the Pandas module) are built on top of the ndarray data type. They are enrichments of, respectively, the mathematical types Sequence and Relation. A Series is a sequence of values with associated labels, and a DataFrame is a two-dimensional, column-oriented structure with row and column labels.

Index (Pandas)

An Index is an object that provides the sequences of labels that are used in the Series and DataFrame objects. An Index may contain multiple levels of hierarchy within it.

This thread will consist of an exposition of the algebra of Series and DataFrames, along with examples of their use.

Comments

  • 1.
    edited March 2016

    Def: an Index of type $U$ consists of a sequence of labels of type $U$.

    If the type $U$ consists of $n$-tuples, then the index is hierarchical, with $n$ levels.
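
    For illustration (a small sketch with made-up labels), Pandas represents a flat index with the Index class and a hierarchical one with MultiIndex:

    >>> from pandas import Index, MultiIndex
    >>> flat = Index(['NY', 'CA', 'NJ'])          # one level of labels
    >>> flat.nlevels
    1
    >>> hier = MultiIndex.from_tuples([('NY', 2015), ('NY', 2016), ('CA', 2015)])
    >>> hier.nlevels                              # labels are 2-tuples, so 2 levels
    2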

  • 2.
    edited March 2016

    Def: a Series of type $(U,V)$ consists of an index of type $U$, which provides the sequence of axis labels, along with a matching sequence of values of type $V$.

    $U$ and $V$ are Python types, which include the primitive value types, string, tuple, and object.

    A Series also has an optional Name attribute.

    Caveat: the label values of type $U$ must be "hashable." That includes all of the types we would normally consider for axis labels, such as integers, strings, or tuples of strings.

  • 3.
    edited March 2016

    Examples:

    >>> from pandas import Series
    >>> Series(['Albany', 'Sacramento', 'Trenton'], ['NY', 'CA', 'NJ'], name='State Capitals')
    NY        Albany
    CA    Sacramento
    NJ       Trenton
    Name: State Capitals, dtype: object
    

    If the labels are not specified, they default to integer values starting from zero:

    >>> s1 = Series([1,2,3])
    >>> s2 = Series([10,20,30])
    
    >>> s1
    0    1
    1    2
    2    3
    dtype: int64
    
    >>> s2
    0    10
    1    20
    2    30
    dtype: int64
    
    >>> s1 + s2
    0    11
    1    22
    2    33
    dtype: int64
    
    >>> 3 * s2
    0    30
    1    60
    2    90
    dtype: int64 
    
  • 4.
    edited March 2016

    In general, the sequence of labels may contain duplicate values. This complicates the picture.

    For now, let's focus on the normal case, where the labels are unique.

    Def: a Series is functional if its labels are unique.

    A functional Series obviously gives a function from labels to values, which permits us to make this equivalent definition:

    Def: a functional Series is a function over a finite, totally ordered domain.
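
    As a small sketch with made-up data, the index exposes an is_unique attribute that distinguishes the two cases:

    >>> s = Series([1, 2, 3], ['a', 'b', 'a'])   # the label 'a' is duplicated
    >>> s.index.is_unique
    False
    >>> s['a']                                   # selection returns both matching values
    a    1
    a    3
    dtype: int64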

  • 5.
    edited March 2016

    When a component-wise operation is applied to two series, it is applied in the same way that, for example, two mathematical functions are added to form another function.

    That is to say, the labels are used to determine which values to add (or multiply, etc.) together. That means that the orderings of the labels in the two input series are not required to be the same. Each value is strongly tied to its label, and this connection is maintained through the calculations.

    The index of the resulting series will consist of the union of the labels of the two argument series. If the index of one argument series contains the label X, but the other does not, then the result index will still contain the label X, but its value will be set to NaN.

    Regarding the ordering of the labels in the result, I have seen the following, commonsensical behavior. If $s1$ and $s2$ have the same index, then $s1 + s2$ will have that index. But if there is any difference whatsoever, then the result index will contain the union of the labels, and will be sorted by the canonical ordering for the type of the labels.
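
    For example (a small illustration with made-up values), two series whose labels are the same but ordered differently are aligned by label, and since the input orderings differ, the result index comes back sorted:

    >>> s1 = Series([1, 2, 3], ['c', 'a', 'b'])
    >>> s2 = Series([10, 20, 30], ['a', 'b', 'c'])
    >>> s1 + s2
    a    12
    b    23
    c    31
    dtype: int64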

  • 6.
    edited March 2016

    Example:

    >>> s1 = Series([10,20,30], ['A','B','C'])
    >>> s2 = Series([4,3,2], ['D','C','B'])
    
    >>> s1
    A    10
    B    20
    C    30
    dtype: int64
    
    >>> s2
    D    4
    C    3
    B    2
    dtype: int64
    
    >>> s1 + s2
    A   NaN
    B    22
    C    33
    D   NaN
    dtype: float64
    

    Note that the dtype of the result was changed to a float type, in order to accommodate the NaN values. That reflects a limitation of NumPy, which for efficiency stores data in contiguous arrays of the underlying machine types, and the machine integer types have no representation for NaN. Supporting missing integer values would make NumPy more complex, for example by storing mask bits elsewhere.
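
    A quick way to see the underlying issue (a minimal sketch): the NaN marker is itself a floating-point value, so an integer array has nowhere to put it.

    >>> import numpy as np
    >>> np.nan                              # the missing-value marker is itself a float
    nan
    >>> np.array([1, 2, np.nan]).dtype      # mixing it with integers forces a float dtype
    dtype('float64')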

  • 7.
    edited March 2016

    Def: Let $R$ be an index representing the row labels, $C$ be an index representing the column labels, and let $V$ be a mapping from the column labels to Python types.

    Then a DataFrame over $(R,C,V)$ is given by $R$, $C$, $V$ and a mapping $D$ that associates to each label $c$ in $C$ a Series $D(c)$ that has index $R$ and value type $V(c)$.

    So a dataframe is a tabular structure, where types are assigned to the columns, and each column is a Series with a specified value type, and all of the columns share a common row index.

    The row and column indexes for a dataframe each have an optional Name attribute.
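
    To make the definition concrete, here is a small sketch with invented data: selecting a column label $c$ from a frame returns the Series $D(c)$, which carries the shared row index $R$, and the dtypes attribute plays the role of the type assignment $V$.

    >>> from pandas import DataFrame
    >>> df = DataFrame({'x': [1, 2], 'y': [0.5, 1.5]}, index=['r1', 'r2'])
    >>> df['y']            # the column Series D('y'), indexed by R = ['r1', 'r2']
    r1    0.5
    r2    1.5
    Name: y, dtype: float64
    >>> df.dtypes          # the mapping V from column labels to value types
    x      int64
    y    float64
    dtype: object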

  • 8.
    edited March 2016

    This structure is a recurrent theme in data processing languages, which can be seen in various forms in APL, J, K, KDB, and R.

    It formalizes the commonsense notion of the data structure that is presented, for example, by a spreadsheet with labeled columns, each of a definite type.

    In the remainder of this thread, we will be looking at some of the interesting and useful algebraic operations that can be defined over Series and DataFrames, and illustrating these operators by their implementation in Pandas/Python.

    We will also see that this range of operators can become more interesting and more useful, when the indexes for the data frames are hierarchical.

  • 9.
    edited March 2016

    Compared to general relations, data frames contain the following additional structure:

    • The rows, and the data in each column, are totally ordered.
    • The rows, and the values in each column, are labeled by indexes.
    • The row and column indexes can each have an internal hierarchical structure.

    Here is how these structures can be modeled within the standard relational framework:

    • Introduce an additional attribute to hold the row labels.
    • Add a total ordering on this attribute.
    • For the hierarchy, simply consider the fact that the row labels may consist of $n$-tuples.
    • Similarly, if it's not there already, introduce a total ordering on the attributes.
    • Introduce a labeling function for the columns, and consider the fact that the column labels may consist of $k$-tuples.

    Two ways of looking at the same thing, but, at least for practical purposes, it is useful to take the above definition of a DataFrame as a consolidated starting point, with a clear visualization as a table with labeled row and column axes.

  • 10.
    edited March 2016

    Example of DataFrame:

    >>> from pandas import DataFrame
    >>> column_data = {'State': ['California', 'New Jersey', 'New York'],
    ...                'Capital': ['Sacramento', 'Trenton', 'Albany'],
    ...                'Timezone': ['PST', 'EST', 'EST']}
    
    >>> frame = DataFrame(column_data, index = ['CA', 'NJ', 'NY'], 
                                       columns = ['State', 'Capital', 'Timezone'])
    
    >>> frame
             State     Capital Timezone
    CA  California  Sacramento      PST
    NJ  New Jersey     Trenton      EST
    NY    New York      Albany      EST
    
    [3 rows x 3 columns]
    

    Here the column data is first stored in a Python dictionary, and then passed to the constructor for the DataFrame. Because dictionaries are unordered, we needed to pass the "columns" argument to the constructor, to tell it what order to sequence the columns in.

    There are many constructors for DataFrame. Here we used one based on the column vectors. There is another one, for example, which takes a list of dictionaries as its argument, with each dictionary representing one row of the table; that constructor reflects the relational perspective on data frames, as sketched below.
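
    Here is a sketch of that row-oriented constructor, reusing the same rows (output formatting may vary by Pandas version):

    >>> rows = [{'State': 'New York',   'Capital': 'Albany',     'Timezone': 'EST'},
    ...         {'State': 'California', 'Capital': 'Sacramento', 'Timezone': 'PST'},
    ...         {'State': 'New Jersey', 'Capital': 'Trenton',    'Timezone': 'EST'}]
    >>> DataFrame(rows, index=['NY', 'CA', 'NJ'], columns=['State', 'Capital', 'Timezone'])
             State     Capital Timezone
    NY    New York      Albany      EST
    CA  California  Sacramento      PST
    NJ  New Jersey     Trenton      EST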
