7 Pandas
The pandas package was written by Wes McKinney (who seems to be Hadley Wickham’s counterpart for Python) and builds on top of the numpy package to provide DataFrames.
7.1 Creation
To create a DataFrame manually:
import pandas as pd
mydict = {
"country": ['Brazil', 'Russia', 'India', 'China', 'South Africa'],
"capital": ['Brasilia', 'Moscow', 'New Delhi', 'Beijing', 'Pretoria' ],
"area": [ 8.516, 17.10, 3.286, 9.597, 1.221 ],
"population": [ 200.4, 143.5, 1252, 1357, 52.98 ]}
df = pd.DataFrame(mydict)
df.index = ['BR', 'RU', 'IN', 'CH', 'SA' ]
print(df)
## country capital area population
## BR Brazil Brasilia 8.516 200.40
## RU Russia Moscow 17.100 143.50
## IN India New Delhi 3.286 1252.00
## CH China Beijing 9.597 1357.00
## SA South Africa Pretoria 1.221 52.98
Alternatively, if you have data in a CSV file you can import it using:
df = pd.read_csv('<path>') # If there are no column names
df = pd.read_csv('<path>', index_col = 0) # If there are column names
You can also create a DataFrame as a list of lists:
import pandas as pd
cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
signups = [7, 12, 3, 5]
visitors = [139, 237, 326, 456]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays]
zipped = list(zip(list_labels, list_cols))
data = dict(zipped)
users = pd.DataFrame(data)
print(users)
## city signups visitors weekday
## 0 Austin 7 139 Sun
## 1 Dallas 12 237 Sun
## 2 Austin 3 326 Mon
## 3 Dallas 5 456 Mon
7.2 Subsetting
To select a column in pandas, use df['country']
. This returns a “pandas Series” object, which is like a 1D element of a data frame. If you want to get a data frame back instead of a series, use df[['country']]
. This is the opposite of R, where the [[]]
is used to drill down further than []
. The reason for this is that []
defines a list in Python, so you’re really calling df[<subset>]
where <subset>
is a list of length 1. In R, the [[]]
subset operator is actually a function, hence the differing behaviours. This means that subsetting for multiple columns in Python doesn’t take too much writing: df[['country', 'capital']]
.
Interestingly, and perhaps confusingly, if you provide a slice (range) to df[]
then you will be subsetting on rows, not columns! Using df[1:4]
selects the 2nd through 4th rows of the data frame.
If you want to do anything more complex than this, it starts getting a bit weird, and you have to use the loc()
method (which stands for “location”). To pull out a named row, you need to use the df.loc['RU']
syntax (which returns a Series object), or df.loc[['RU']]
(which returns a DataFrame). You can also select multiple rows this way: df.loc[['RU', 'IN', 'CH']]
. You can then extend this to subset on columns as well, which starts to look a bit more like R (and like the 2D numpy array): df.loc[['RU', 'IN', 'CH'], ['country', 'capital']]
.
If you want to use an index, then you can do the same thing with lists of integers passed to the iloc()
method (which stands for index-location). It seems like you probably need to chain these functions together if you want to have a mix of subsetting approaches.
7.3 Reading in chunks
You can use the chunksize
argument to read a CSV file in chunks, then iterate over the chunks in a loop. This can be useful when you’re dealing with large amounts of data that won’t fit in memory.
import pandas as pd
result = []
for chunk in pd.read_csv('data.csv', chunksize = 1000) :
result.append(sun(chunk['x']))
total = sum(result)
print(total)
7.4 Working with DataFrames
7.4.1 Basics
You can print the head or tail of a DataFrame using .head()
and .tail()
respectively. You can inspect the shape of the DataFrame by accessing the .shape
attribute and get the names of the columns using .columns
.
# Import pandas
import pandas as pd
# Read the file into a DataFrame: df
df = pd.read_csv('dob_job_application_filings_subset.csv')
# Print the head of df
print(df.head())
# Print the tail of df
print(df.tail())
# Print the shape of df
print(df.shape)
# Print the columns of df
print(df.columns)
The .info()
method gives you a summary of the DataFrame.
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12846 entries, 0 to 12845
Data columns (total 13 columns):
Job # 12846 non-null int64
Doc # 12846 non-null int64
Borough 12846 non-null object
Initial Cost 12846 non-null object
Total Est. Fee 12846 non-null object
Existing Zoning Sqft 12846 non-null int64
Proposed Zoning Sqft 12846 non-null int64
Enlargement SQ Footage 12846 non-null int64
Street Frontage 12846 non-null int64
ExistingNo. of Stories 12846 non-null int64
Proposed No. of Stories 12846 non-null int64
Existing Height 12846 non-null int64
Proposed Height 12846 non-null int64
dtypes: int64(10), object(3)
memory usage: 1.3+ MB
None
For more detailed statistics about numeric columns, you can also use the .describe()
method.
7.4.2 Selecting Columns
If the column name does not contain any special characters or spaces, and is not the name of a Python function, then you can access columns as attributes. For example, this means that df.continent
and df['continent']
are equivalent.
7.4.3 Counting Values
You can use the .value_counts()
method on a column (AKA Series) to count and sort occurrences of each value in the column.
# Print the value counts for 'Site Fill'
print(df['State'].value_counts(dropna=False))
Name: State, dtype: int64
NOT APPLICABLE 7806
NaN 4205
ON-SITE 519
OFF-SITE 186
USE UNDER 300 CU.YD 130
7.4.4 Plotting Things
Pandas has tight integration with matplotlib which allows you to call plotting methods on columns directly.
import pandas as pd
import matplotlib.pyplot as plt
df.population.plot('hist')
plt.show()
df.boxplot(column = 'population', by = 'continent')
plt.show()
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
7.4.5 Reshaping Data (Tidy Data)
The pd.melt()
function is functionally similar to the tidyr::gather()
function in R in that it takes a “wide” data representation and gathers the columns into a “narrow” representation.
From DataCamp:
There are two parameters you should be aware of:
id_vars
andvalue_vars
. Theid_vars
represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while thevalue_vars
represent the columns you do wish to melt into rows. By default, if novalue_vars
are provided, all columns not set in theid_vars
will be melted. This could save a bit of typing, depending on the number of columns that need to be melted. You can rename the variable column by specifying an argument to thevar_name
parameter, and the value column by specifying an argument to thevalue_name
parameter.
You can then use pd.melt()
like this:
print(airquality.head())
airquality_melt = pd.melt(airquality,
id_vars=['Month', 'Day'],
var_name='measurement',
value_name='reading')
print(airquality_melt.head())
Ozone Solar.R Wind Temp Month Day
0 41.0 190.0 7.4 67 5 1
1 36.0 118.0 8.0 72 5 2
2 12.0 149.0 12.6 74 5 3
3 18.0 313.0 11.5 62 5 4
4 NaN NaN 14.3 56 5 5
Month Day measurement reading
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
To go the other way, i.e. to “spread” the table (like tidyr::spread()
in R) you need to use the .pivot()
or .pivot_table()
method on the DataFrame object. The main difference between these two methods seems to be that .pivot()
cannot deal with duplicate values, where .pivot_table()
allows you to pass in an aggregation function to handle duplicates (via the aggfunc
argument). .pivot_table()
also allows you to pass multiple index columns.
print(airquality_melt.head())
airquality_pivot = airquality_melt.pivot(index=['Month', 'Day'],
columns='measurement',
values='reading')
print(airquality_pivot.head())
Month Day measurement reading
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
measurement Ozone Solar.R Temp Wind
Month Day
5 1 41.0 190.0 67.0 7.4
2 36.0 118.0 72.0 8.0
3 12.0 149.0 74.0 12.6
4 18.0 313.0 62.0 11.5
5 NaN NaN 56.0 14.3
Note that this DataFrame looks a bit different, because it has a hierarchical index (AKA MultiIndex) which seems to be how pandas implements grouping. You can remove this by calling the .reset_index()
method on the DataFrame.
print(airquality_pivot.index)
airquality_pivot_reset = airquality_pivot.reset_index()
print(airquality_pivot_reset.index)
print(airquality_pivot_reset.head())
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
names=['Month', 'Day'])
RangeIndex(start=0, stop=153, step=1)
measurement Month Day Ozone Solar.R Temp Wind
0 5 1 41.0 190.0 67.0 7.4
1 5 2 36.0 118.0 72.0 8.0
2 5 3 12.0 149.0 74.0 12.6
3 5 4 18.0 313.0 62.0 11.5
4 5 5 NaN NaN 56.0 14.3
7.4.6 Chaining methods
This is a cool bit of code, not sure where it belongs but I’ll stick it here for now until I write it into proper section.
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
# Print the head of ebola_melt
print(ebola_melt.head())
Date Day type_country counts str_split type country
0 1/5/2015 289 Cases_Guinea 2776.0 [Cases, Guinea] Cases Guinea
1 1/4/2015 288 Cases_Guinea 2775.0 [Cases, Guinea] Cases Guinea
2 1/3/2015 287 Cases_Guinea 2769.0 [Cases, Guinea] Cases Guinea
3 1/2/2015 286 Cases_Guinea NaN [Cases, Guinea] Cases Guinea
4 12/31/2014 284 Cases_Guinea 2730.0 [Cases, Guinea] Cases Guinea
7.4.7 Concatenation
You can join DataFrames together using the pd.concat()
function, which seems like the equivalent of R’s rbind()
and cbind()
functions.
row_bind_df = pd.concat([df1, df2])
col_bind_df = pd.concat([df1, df2], axis=1)
Similar to the dplyr bind_rows()
and bind_cols()
functions, it looks like it helps you out by including the union of columns (rather than just the intersection) and uses NaN
to fill in missing values.
7.4.8 Joining (Merging)
You can join DataFrames using the pd.merge()
function.
merged_df = pd.merge(left=df1, right=df2, left_on='key1', right_on='key2')
7.4.9 Types
You can change the type of a column in a DataFrame using the .astype()
method.
df1.col1 = df1.col1.astype('category')
There are also some more feature-rich conversion functions which allow you to specify how edge-cases are handled. One such function is pd.to_numeric()
.
df1['col1'] = pd.to_numeric(df1['col1'], errors='coerce')
7.4.10 Apply
You can use the .apply()
method to a DataFrame to apply a function to cols (axis=0) or rows (axis=1). You can also apply to a Series directly.
Some examples:
# Normal function applied to a Series
def recode_col1(value) :
# this function does nothing
return value
df1['col1_recode'] = df1.col1.apply(recode_col1)
# Lambda function applied to a series (basically removes dollar signs)
d1['col1_re'] = df['col1'].apply(lambda x: re.findall('\d+\.\d+', x)[0])
7.4.11 Duplicates
Pandas DataFrames have a .drop_duplicates()
method which you can apply to a DataFrame to remove duplicates.