For the context and framing of this discussion, please see the prior thread:

**Python data types**

The Python language contains a whole range of standard types, including primitive value types (int, float, etc), lists, tuples, dictionaries (i.e. finite mappings), functions and objects. For tutorials and reference information, see:

**ndarray (NumPy)**

The Python module NumPy provides an n-dimensional array type, ndarray. All the elements in an ndarray must have the same data type (dtype). This permits an efficient representation, typically packed into a contiguous block of memory, which makes it a good format for interfacing with libraries that are external to Python. NumPy provides operators that apply element-wise operations to entire arrays (vectorization). So, even though the Python interpreter is slower than statically typed compiled languages, vectorized operators on large data sets move the critical inner loops into the compiled NumPy library, out of the Python interpreter.
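As a minimal sketch of what vectorization means in practice, the following compares an explicit Python loop with the equivalent element-wise NumPy expression. The array contents and variable names are illustrative and not taken from the thread.

```python
import numpy as np

# A homogeneous array: every element shares one dtype.
a = np.arange(5, dtype=np.int64)      # 0, 1, 2, 3, 4
b = np.full(5, 10, dtype=np.int64)    # 10, 10, 10, 10, 10

# Element-wise loop executed by the Python interpreter.
looped = np.array([x + y for x, y in zip(a, b)])

# Vectorized form: the same loop runs inside compiled NumPy code.
vectorized = a + b

print(vectorized.tolist())   # [10, 11, 12, 13, 14]
print(vectorized.dtype)      # int64
```

Both forms give the same values; the difference only becomes visible in running time as the arrays grow.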

**Series and DataFrame (Pandas)**

These two data types (classes in the Pandas module) are built on top of the ndarray data type. They are enrichments of, respectively, the mathematical types Sequence and Relation. A Series is a sequence of values with associated labels, and a DataFrame is a two-dimensional, column-oriented structure with row and column labels.

**Index (Pandas)**

An Index is an object that provides the sequences of labels that are used in the Series and DataFrame objects. An Index may contain multiple levels of hierarchy within it.

This thread will consist of an exposition of the algebra of Series and DataFrames, along with examples of their use.

## Comments

Def: an **Index** of type $U$ consists of a sequence of labels of type $U$.

If the type $U$ consists of $n$-tuples, then the index is hierarchical, with $n$ levels.
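A small sketch of the two cases in this definition, assuming current Pandas, where a hierarchical index is represented by the `MultiIndex` class. The region and state labels are made up for illustration.

```python
import pandas as pd

# A flat index: a sequence of labels of type str.
flat = pd.Index(['NY', 'CA', 'NJ'])

# A hierarchical index: the labels are 2-tuples, giving two levels.
hier = pd.MultiIndex.from_tuples(
    [('East', 'NY'), ('West', 'CA'), ('East', 'NJ')],
    names=['region', 'state'],
)

print(flat.nlevels)   # 1
print(hier.nlevels)   # 2
print(list(hier))     # [('East', 'NY'), ('West', 'CA'), ('East', 'NJ')]
```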

Def: a **Series** of type $(U,V)$ consists of an index of type $U$, which provides the sequence of axis labels, along with a matching sequence of values of type $V$.

$U$ and $V$ are Python types, which include the primitive value types, string, tuple, and object.

A Series also has an optional Name attribute.

Caveat: the type $U$ must be "hashable." That includes all of the types that we might normally consider for axis labels, such as integers, strings, or tuples of strings.
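To make the hashability caveat concrete, here is a sketch in plain Python (no Pandas involved) showing which candidate labels pass the built-in `hash` function. The sample labels are illustrative.

```python
# Integers, strings, and tuples of strings are all hashable.
candidate_labels = [1, 'NY', ('East', 'NY')]
hashable = []
for label in candidate_labels:
    hash(label)              # succeeds without raising
    hashable.append(label)

# A list is mutable, so Python refuses to hash it.
try:
    hash(['East', 'NY'])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(len(hashable))        # 3
print(list_is_hashable)     # False
```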

Examples:

```python
>>> from pandas import Series
>>> Series(['Albany', 'Sacramento', 'Trenton'], ['NY', 'CA', 'NJ'], name='State Capitals')
NY        Albany
CA    Sacramento
NJ       Trenton
Name: State Capitals, dtype: object
```

If the labels are not specified, they default to integer values starting from zero:

```python
>>> s1 = Series([1,2,3])
>>> s2 = Series([10,20,30])
>>> s1
0    1
1    2
2    3
dtype: int64
>>> s2
0    10
1    20
2    30
dtype: int64
>>> s1 + s2
0    11
1    22
2    33
dtype: int64
>>> 3 * s2
0    30
1    60
2    90
dtype: int64
```

In general, the sequence of labels may contain duplicate values. This complicates the picture.

For now, let's focus on the normal case, where the labels are unique.

Def: a Series is **functional** if its labels are unique.

A functional Series obviously gives a function from labels to values, which permits us to make this equivalent definition:

Def: a **functional Series** is a function over a finite, totally ordered domain.
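The distinction can be checked directly in Pandas. The sketch below uses `Index.is_unique` as the test for being functional; the two small Series are invented for illustration.

```python
import pandas as pd

unique = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
dupes = pd.Series([10, 20, 30], index=['A', 'B', 'A'])

# A Series is functional exactly when its labels are unique.
print(unique.index.is_unique)   # True
print(dupes.index.is_unique)    # False

# A functional Series acts as a finite function from labels to values;
# converting it to a dict makes that mapping explicit.
as_function = unique.to_dict()
print(as_function['B'])         # 20

# With duplicate labels, a lookup returns several values instead of one.
print(len(dupes['A']))          # 2
```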

When a component-wise operation is applied to two series, it is applied in the same way that, for example, two mathematical functions are added to form another function.

That is to say, the _labels_ are used to determine which values to add (or multiply, etc.) together. That means that the orderings of the labels in the two input series are not required to be the same. Each value is strongly tied to its label, and this connection is maintained through the calculations.

The index of the resultant series will consist of the _union_ of the labels in the two argument series. If the index for one of the argument series contains the label X, but the other does not, then the result index will contain the label X, but its value will be set to NaN.

Regarding the ordering of the labels in the result, I have seen the following, commonsensical behavior. If $s1$ and $s2$ have the same index, then $s1 + s2$ will have that index. But if there is any difference whatsoever, then the result index will contain the union of the labels, and will be sorted by the canonical ordering for the type of the labels.
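A short sketch of label-based pairing, using two invented Series that carry the same labels in opposite orders. It checks values by label and deliberately makes no claim about the order of the result index.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([300, 200, 100], index=['C', 'B', 'A'])  # same labels, reversed

total = s1 + s2

# Values are paired by label, not by position.
print(total['A'])   # 101
print(total['C'])   # 303

# The result carries the union of the input labels.
print(set(total.index) == (set(s1.index) | set(s2.index)))   # True
```

Positional addition would have given 301 for the first element, so the result shows that the labels drive the pairing.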

Example:

```python
>>> s1 = Series([10,20,30], ['A','B','C'])
>>> s2 = Series([4,3,2], ['D','C','B'])
>>> s1
A    10
B    20
C    30
dtype: int64
>>> s2
D    4
C    3
B    2
dtype: int64
>>> s1 + s2
A   NaN
B    22
C    33
D   NaN
dtype: float64
```

Note that the type of the series was changed to a float type, in order to accommodate the NaN value. That reflects a limitation of NumPy, which for efficiency uses contiguous arrays of the underlying machine types; the machine integer types have no NaN value. To support missing integers, NumPy would have to be made more complex, by storing mask bits elsewhere.
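The dtype change can be inspected programmatically. This sketch repeats the example above with the `pd.` prefix and checks the dtypes before and after the addition, along with which labels ended up without a partner.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['A', 'B', 'C'])
s2 = pd.Series([4, 3, 2], index=['D', 'C', 'B'])

total = s1 + s2

# Both inputs hold integers; the result is promoted to float to hold NaN.
print(s1.dtype.kind, s2.dtype.kind)   # i i
print(total.dtype)                    # float64

# Labels present in only one input are the ones that received NaN.
missing = sorted(total.index[total.isna()])
print(missing)                        # ['A', 'D']
```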

Def: Let $R$ be an index representing the row labels, $C$ be an index representing the column labels, and let $V$ be a mapping from the column labels to Python types.

Then a **DataFrame** over $(R,C,V)$ is given by $R$, $C$, $V$ and a mapping $D$ that associates to each label $c$ in $C$ a Series $D(c)$ that has index $R$ and value type $V(c)$.

So a dataframe is a tabular structure, where types are assigned to the columns, and each column is a Series with a specified value type, and all of the columns share a common row index.

The row and column indexes for a dataframe each have an optional Name attribute.
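One way to read this definition against Pandas, as a sketch: `frame.index` plays the role of $R$, `frame.columns` of $C$, `frame.dtypes` of $V$, and column selection `frame[c]` of $D(c)$. The frame contents and the variable names `R`, `C`, `V` are chosen here to mirror the definition and are not from the thread.

```python
import pandas as pd

# Hypothetical data: two columns with different value types.
frame = pd.DataFrame(
    {'population': [100, 200, 300], 'name': ['a', 'b', 'c']},
    index=['r1', 'r2', 'r3'],
)

R = frame.index      # row labels
C = frame.columns    # column labels
V = frame.dtypes     # maps each column label to its value type

for c in C:
    column = frame[c]                  # D(c): the Series for column c
    assert column.index.equals(R)      # every column shares the row index R
    assert column.dtype == V[c]        # and has the value type V(c)

print(V['population'])   # int64
```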

This structure is a recurrent theme in data processing languages, which can be seen in various forms in APL, J, K, KDB, and R.

It formalizes the commonsense notion of the data structure that is presented, for example, by a spreadsheet with labeled columns, each of a definite type.

In the remainder of this thread, we will be looking at some of the interesting and useful algebraic operations that can be defined over Series and DataFrames, and illustrating these operators by their implementation in Pandas/Python.

We will also see that this range of operators can become more interesting and more useful when the indexes for the data frames are hierarchical.

Compared to general relations, data frames contain the following additional structure:

* The rows, and the data in each column, are totally ordered.
* The rows, and the values in each column, are labeled by indexes.
* The row and column indexes can each have an internal hierarchical structure.

Here is how these structures can be modeled within the standard relational framework:

* Introduce another attribute, that for the row labels.
* Add a total ordering on this attribute.
* For the hierarchy, simply consider the fact that the row labels may consist of $n$-tuples.
* Similarly, if it's not there already, introduce a total ordering on the attributes.
* Introduce a labeling function for the columns, and consider the fact that the column labels may consist of $k$-tuples.

These are two ways of looking at the same thing, but, at least for practical purposes, it is useful to take the above-given definition of a DataFrame as a consolidated starting point, with a clear visualization as a table with labeled row and column axes.
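The first step of the relational modeling above, introducing the row labels as an attribute, can be sketched with `reset_index`. The two-row frame and the index name `Code` are assumptions made for the illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {'Capital': ['Sacramento', 'Trenton']},
    index=pd.Index(['CA', 'NJ'], name='Code'),
)

# Introduce the row labels as an ordinary attribute of the relation.
relation = frame.reset_index()
print(list(relation.columns))   # ['Code', 'Capital']

# Each row is now a plain tuple of attribute values.
rows = list(relation.itertuples(index=False, name=None))
print(rows)   # [('CA', 'Sacramento'), ('NJ', 'Trenton')]
```

The list order of `rows` stands in for the total ordering on the row-label attribute.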

Example of DataFrame:

```python
>>> from pandas import DataFrame
>>> column_data = {'State': ['California', 'New Jersey', 'New York'],
...                'Capital': ['Sacramento', 'Trenton', 'Albany'],
...                'Timezone': ['PST', 'EST', 'EST']}
>>> # The row labels are listed in the same order as the column data.
>>> frame = DataFrame(column_data, index = ['CA', 'NJ', 'NY'], columns = ['State', 'Capital', 'Timezone'])
>>> frame
         State     Capital Timezone
CA  California  Sacramento      PST
NJ  New Jersey     Trenton      EST
NY    New York      Albany      EST

[3 rows x 3 columns]
```

Here the column data is first stored in a Python dictionary, and then passed to the constructor for the DataFrame. Because dictionaries were unordered in the Python versions current when this was written (insertion order is guaranteed only from Python 3.7 on), we needed to pass the "columns" argument to the constructor, to tell it what order to sequence the columns in.

There are many constructors for DataFrame. Here we used one that was based on the column vectors. There is another one, for example, which takes a list of dictionaries as its argument, with each dictionary representing one row of the table, as a tuple. So, that constructor reflects the relational perspective on data frames.
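The row-oriented constructor mentioned in the last paragraph might be used as follows. This is a sketch that rebuilds the state-capital table from a list of dictionaries, with the row labels assumed to be the usual two-letter state codes.

```python
import pandas as pd

# One dictionary per row, in the spirit of a relational tuple.
rows = [
    {'State': 'California', 'Capital': 'Sacramento', 'Timezone': 'PST'},
    {'State': 'New Jersey', 'Capital': 'Trenton', 'Timezone': 'EST'},
    {'State': 'New York', 'Capital': 'Albany', 'Timezone': 'EST'},
]

frame = pd.DataFrame(rows, index=['CA', 'NJ', 'NY'],
                     columns=['State', 'Capital', 'Timezone'])

print(frame.loc['NJ', 'Capital'])   # Trenton
print(frame.shape)                  # (3, 3)
```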