7 Pandas

The pandas package was written by Wes McKinney (who seems to be Hadley Wickham’s counterpart for Python) and builds on top of the numpy package to provide DataFrames.

7.1 Creation

To create a DataFrame manually:

import pandas as pd
mydict = {
  "country":    ['Brazil',   'Russia', 'India',     'China',   'South Africa'],
  "capital":    ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria'    ],
  "area":       [ 8.516,      17.10,    3.286,       9.597,     1.221        ],
  "population": [ 200.4,      143.5,    1252,        1357,      52.98        ]}
df = pd.DataFrame(mydict)
df.index      = ['BR',       'RU',     'IN',        'CH',      'SA'          ]
print(df)
##          country    capital    area  population
## BR        Brazil   Brasilia   8.516      200.40
## RU        Russia     Moscow  17.100      143.50
## IN         India  New Delhi   3.286     1252.00
## CH         China    Beijing   9.597     1357.00
## SA  South Africa   Pretoria   1.221       52.98

Alternatively, if you have data in a CSV file you can import it using:

df = pd.read_csv('<path>')                # If there are no column names
df = pd.read_csv('<path>', index_col = 0) # If there are column names

You can also create a DataFrame as a list of lists:

import pandas as pd
cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
signups = [7, 12, 3, 5]
visitors = [139, 237, 326, 456]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays]
zipped = list(zip(list_labels, list_cols))
data = dict(zipped)
users = pd.DataFrame(data)
print(users)
##      city  signups  visitors weekday
## 0  Austin        7       139     Sun
## 1  Dallas       12       237     Sun
## 2  Austin        3       326     Mon
## 3  Dallas        5       456     Mon

7.2 Subsetting

To select a column in pandas, use df['country']. This returns a “pandas Series” object, which is like a 1D element of a data frame. If you want to get a data frame back instead of a series, use df[['country']]. This is the opposite of R, where the [[]] is used to drill down further than []. The reason for this is that [] defines a list in Python, so you’re really calling df[<subset>] where <subset> is a list of length 1. In R, the [[]] subset operator is actually a function, hence the differing behaviours. This means that subsetting for multiple columns in Python doesn’t take too much writing: df[['country', 'capital']].

Interestingly, and perhaps confusingly, if you provide a slice (range) to df[] then you will be subsetting on rows, not columns! Using df[1:4] selects the 2nd through 4th rows of the data frame.

If you want to do anything more complex than this, it starts getting a bit weird, and you have to use the loc() method (which stands for “location”). To pull out a named row, you need to use the df.loc['RU'] syntax (which returns a Series object), or df.loc[['RU']] (which returns a DataFrame). You can also select multiple rows this way: df.loc[['RU', 'IN', 'CH']]. You can then extend this to subset on columns as well, which starts to look a bit more like R (and like the 2D numpy array): df.loc[['RU', 'IN', 'CH'], ['country', 'capital']].

If you want to use an index, then you can do the same thing with lists of integers passed to the iloc() method (which stands for index-location). It seems like you probably need to chain these functions together if you want to have a mix of subsetting approaches.

7.3 Reading in chunks

You can use the chunksize argument to read a CSV file in chunks, then iterate over the chunks in a loop. This can be useful when you’re dealing with large amounts of data that won’t fit in memory.

import pandas as pd
result = []
for chunk in pd.read_csv('data.csv', chunksize = 1000) :
  result.append(sun(chunk['x']))
total = sum(result)
print(total)

7.4 Working with DataFrames

7.4.1 Basics

You can print the head or tail of a DataFrame using .head() and .tail() respectively. You can inspect the shape of the DataFrame by accessing the .shape attribute and get the names of the columns using .columns.

# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('dob_job_application_filings_subset.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

The .info() method gives you a summary of the DataFrame.

print(df.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 12846 entries, 0 to 12845
    Data columns (total 13 columns):
    Job #                      12846 non-null int64
    Doc #                      12846 non-null int64
    Borough                    12846 non-null object
    Initial Cost               12846 non-null object
    Total Est. Fee             12846 non-null object
    Existing Zoning Sqft       12846 non-null int64
    Proposed Zoning Sqft       12846 non-null int64
    Enlargement SQ Footage     12846 non-null int64
    Street Frontage            12846 non-null int64
    ExistingNo. of Stories     12846 non-null int64
    Proposed No. of Stories    12846 non-null int64
    Existing Height            12846 non-null int64
    Proposed Height            12846 non-null int64
    dtypes: int64(10), object(3)
    memory usage: 1.3+ MB
    None

For more detailed statistics about numeric columns, you can also use the .describe() method.

7.4.2 Selecting Columns

If the column name does not contain any special characters or spaces, and is not the name of a Python function, then you can access columns as attributes. For example, this means that df.continent and df['continent'] are equivalent.

7.4.3 Counting Values

You can use the .value_counts() method on a column (AKA Series) to count and sort occurrences of each value in the column.

# Print the value counts for 'Site Fill'
print(df['State'].value_counts(dropna=False))
    Name: State, dtype: int64
    NOT APPLICABLE                              7806
    NaN                                         4205
    ON-SITE                                      519
    OFF-SITE                                     186
    USE UNDER 300 CU.YD                          130

7.4.4 Plotting Things

Pandas has tight integration with matplotlib which allows you to call plotting methods on columns directly.

import pandas as pd
import matplotlib.pyplot as plt

df.population.plot('hist')
plt.show()

df.boxplot(column = 'population', by = 'continent')
plt.show()

df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

7.4.5 Reshaping Data (Tidy Data)

The pd.melt() function is functionally similar to the tidyr::gather() function in R in that it takes a “wide” data representation and gathers the columns into a “narrow” representation.

From DataCamp:

There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted. You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter.

You can then use pd.melt() like this:

print(airquality.head())
airquality_melt = pd.melt(airquality, 
                          id_vars=['Month', 'Day'], 
                          var_name='measurement', 
                          value_name='reading')
print(airquality_melt.head())
       Ozone  Solar.R  Wind  Temp  Month  Day
    0   41.0    190.0   7.4    67      5    1
    1   36.0    118.0   8.0    72      5    2
    2   12.0    149.0  12.6    74      5    3
    3   18.0    313.0  11.5    62      5    4
    4    NaN      NaN  14.3    56      5    5
       Month  Day measurement  reading
    0      5    1       Ozone     41.0
    1      5    2       Ozone     36.0
    2      5    3       Ozone     12.0
    3      5    4       Ozone     18.0
    4      5    5       Ozone      NaN

To go the other way, i.e. to “spread” the table (like tidyr::spread() in R) you need to use the .pivot() or .pivot_table() method on the DataFrame object. The main difference between these two methods seems to be that .pivot() cannot deal with duplicate values, where .pivot_table() allows you to pass in an aggregation function to handle duplicates (via the aggfunc argument). .pivot_table() also allows you to pass multiple index columns.

print(airquality_melt.head())
airquality_pivot = airquality_melt.pivot(index=['Month', 'Day'], 
                                         columns='measurement', 
                                         values='reading')
print(airquality_pivot.head())
       Month  Day measurement  reading
    0      5    1       Ozone     41.0
    1      5    2       Ozone     36.0
    2      5    3       Ozone     12.0
    3      5    4       Ozone     18.0
    4      5    5       Ozone      NaN
    measurement  Ozone  Solar.R  Temp  Wind
    Month Day                              
    5     1       41.0    190.0  67.0   7.4
          2       36.0    118.0  72.0   8.0
          3       12.0    149.0  74.0  12.6
          4       18.0    313.0  62.0  11.5
          5        NaN      NaN  56.0  14.3

Note that this DataFrame looks a bit different, because it has a hierarchical index (AKA MultiIndex) which seems to be how pandas implements grouping. You can remove this by calling the .reset_index() method on the DataFrame.

print(airquality_pivot.index)
airquality_pivot_reset = airquality_pivot.reset_index()
print(airquality_pivot_reset.index)
print(airquality_pivot_reset.head())
    MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
               labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
               names=['Month', 'Day'])
    RangeIndex(start=0, stop=153, step=1)
    measurement  Month  Day  Ozone  Solar.R  Temp  Wind
    0                5    1   41.0    190.0  67.0   7.4
    1                5    2   36.0    118.0  72.0   8.0
    2                5    3   12.0    149.0  74.0  12.6
    3                5    4   18.0    313.0  62.0  11.5
    4                5    5    NaN      NaN  56.0  14.3

7.4.6 Chaining methods

This is a cool bit of code, not sure where it belongs but I’ll stick it here for now until I write it into proper section.

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')

# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

# Print the head of ebola_melt
print(ebola_melt.head())
             Date  Day  type_country  counts        str_split   type country
    0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
    1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
    2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
    3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
    4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea

7.4.7 Concatenation

You can join DataFrames together using the pd.concat() function, which seems like the equivalent of R’s rbind() and cbind() functions.

row_bind_df = pd.concat([df1, df2])
col_bind_df = pd.concat([df1, df2], axis=1)

Similar to the dplyr bind_rows() and bind_cols() functions, it looks like it helps you out by including the union of columns (rather than just the intersection) and uses NaN to fill in missing values.

7.4.8 Joining (Merging)

You can join DataFrames using the pd.merge() function.

merged_df = pd.merge(left=df1, right=df2, left_on='key1', right_on='key2')

7.4.9 Types

You can change the type of a column in a DataFrame using the .astype() method.

df1.col1 = df1.col1.astype('category')

There are also some more feature-rich conversion functions which allow you to specify how edge-cases are handled. One such function is pd.to_numeric().

df1['col1'] = pd.to_numeric(df1['col1'], errors='coerce')

7.4.10 Apply

You can use the .apply() method to a DataFrame to apply a function to cols (axis=0) or rows (axis=1). You can also apply to a Series directly.

Some examples:

# Normal function applied to a Series
def recode_col1(value) :
    # this function does nothing
    return value
    
df1['col1_recode'] = df1.col1.apply(recode_col1)
# Lambda function applied to a series (basically removes dollar signs)
d1['col1_re'] = df['col1'].apply(lambda x: re.findall('\d+\.\d+', x)[0])

7.4.11 Duplicates

Pandas DataFrames have a .drop_duplicates() method which you can apply to a DataFrame to remove duplicates.