3 Importing and Exporting

3.1 Opening Text Files

3.1.1 Plain Text

These are the easiest files to open.

file = open('huck_finn.txt', mode='r') # 'r' is to read
text = file.read()
file.close()

The .read() method reads the entire file. You can also use .readline(), which reads one line at a time.
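
For example, a minimal sketch that reads the first two lines one at a time (same file as above):

file = open('huck_finn.txt', mode='r')
print(file.readline())  # first line
print(file.readline())  # second line
file.close()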

You can also manage files using a “context manager”:

with open('huck_finn.txt', mode='r') as file:
    print(file.read())

This has the advantage that the file is closed automatically when the block exits, so you don't need to call .close() yourself.

3.1.2 Tabular Data

3.1.2.1 NumPy arrays

import numpy as np
data = np.loadtxt('MNIST.txt', delimiter=',')

Some additional useful arguments include:

  • skiprows - set to an integer (typically 1) to skip header rows
  • usecols - pass a list of integers (e.g. usecols=[0,2]) to select specified columns
  • dtype - numpy arrays must all be a single data type, so this lets you cast everything to a type (for example dtype=str)
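
A minimal sketch combining these arguments (the column indices here are purely illustrative):

import numpy as np

# skip the header row, keep only columns 0 and 2, and read everything as strings
data = np.loadtxt('MNIST.txt', delimiter=',', skiprows=1, usecols=[0, 2], dtype=str)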

You can also load data with multiple data types using np.genfromtxt(), and if you pass the dtype=None argument then it will automatically detect column types for you.

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

When you do this, the result is a structured array rather than a 2d numpy array. This isn’t ideal, which is why pandas exists. Also note that the function np.recfromcsv() does the same as np.genfromtxt() except that it sets dtype=None, delimiter=',' and names=True by default - handy!

3.1.2.2 Pandas DataFrames

These are the standard for almost all data work in Python. The Pandas chapter covers working with DataFrames in more detail.

import pandas as pd
data = pd.read_csv('winequality-red.csv')

Key options include:

  • sep - delimiter character
  • nrows - specify the number of rows to read
  • header - set to None (not 0!) if your data doesn’t have a header row
  • na_values - strings to treat as missing values
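
A minimal sketch combining these options (the file name and na_values strings are illustrative):

import pandas as pd

# read the first 100 rows of a headerless, tab-separated file,
# treating 'NA' and '-1' as missing values
data = pd.read_csv('some_data.tsv', sep='\t', header=None, nrows=100,
                   na_values=['NA', '-1'])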

Pandas has some handy helpers, including:

  • .head() - shows the top 5 rows
  • .values - returns a numpy array
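
For example, with the DataFrame loaded above:

print(data.head())  # first 5 rows
arr = data.values   # underlying numpy array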

3.2 Relational Databases

3.2.1 Using SQLAlchemy

from sqlalchemy import create_engine
engine = create_engine('sqlite:///Northwind.sqlite')

You can retrieve the table names from the database using the .table_names() method of the engine object.
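
For example (in newer SQLAlchemy releases .table_names() has been removed; sqlalchemy.inspect(engine).get_table_names() does the same job):

print(engine.table_names())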

To query the database using the engine (note that in SQLAlchemy 2.0+ raw SQL strings need to be wrapped in sqlalchemy.text() before being passed to .execute()):

con = engine.connect()
rs = con.execute('select * from Album')
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
con.close()

You can also do this using a context manager:

with engine.connect() as con:
    rs = con.execute('select LastName, Title from Employee')
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

You can also use this engine directly with Pandas:

df = pd.read_sql_query('select * from Orders', engine)

3.3 Exotic Filetypes

3.3.1 Excel

import pandas as pd
data = pd.ExcelFile('urbanpop.xlsx')
print(data.sheet_names)
df1 = data.parse('1960-1966') # sheet name as a string, or sheet index as an integer

Key options include:

  • skiprows - rows to skip, given as a list of zero-indexed row numbers
  • names - can be used to provide (or override) the column names
  • parse_cols - can be used to specify which columns to import (renamed to usecols in newer pandas versions)
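
A minimal sketch applying a couple of these to the second sheet (the replacement column names are illustrative):

# parse the second sheet, skip its first row and rename the columns
df2 = data.parse(1, skiprows=[0], names=['Country', 'Urban population'])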

3.3.2 Pickle (serialised)

import pickle
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
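
Writing a pickle is symmetric (a sketch, assuming d is any picklable Python object):

import pickle
with open('data.pkl', 'wb') as file:
    pickle.dump(d, file)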

3.3.3 HDF5

HDF5 stands for Hierarchical Data Format 5. It’s the standard for storing large quantities of numerical data, and can scale to exabytes.

import h5py
data = h5py.File('LIGO_data.hdf5', 'r') # 'r' is to read

The file object behaves like a dict, and you can explore it using the same techniques:

for key in data.keys():
    print(key)

Because it is hierarchical, you can drill down further to find out what is there:

for key in data['meta'].keys():
    print(key)

You can read a dataset's values into memory by indexing it (the .value attribute used in older h5py versions was removed in h5py 3.0):

strain = data['strain']['Strain'][()]
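
Datasets can also be sliced like NumPy arrays, which avoids loading everything into memory at once (a sketch; the sample count is arbitrary):

num_samples = 10000
sample = data['strain']['Strain'][:num_samples]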

3.4 Working with the file system

You can programmatically interrogate the file system using the os module.

import os
wd = os.getcwd()
os.listdir(wd)

You can also use the glob module to find files on the file system:

# Import necessary modules
import pandas as pd
import glob

# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)

# Print the file names
print(csv_files)

# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])

# Print the head of csv2
print(csv2.head())

3.5 Retrieving files from the web

from urllib.request import urlretrieve
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
dat = urlretrieve(url, 'winequality-red.csv')

Pandas can also do this for you without saving an intermediate file. This works with all of the Pandas reader functions, including read_csv and read_excel:

df = pd.read_csv(url, sep=';')  # this particular file is semicolon-delimited

3.5.1 Working with HTTP requests directly

from urllib.request import urlopen, Request
url = "http://www.datacamp.com/teach/documentation"
request = Request(url)
response = urlopen(request)
print(type(response))
html = response.read()
response.close()

3.5.2 Using the requests package

The requests package is the equivalent of R’s httr package, and is the most popular Python package for working with HTTP requests.

# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Package the request, send it and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)

3.6 Parsing HTML

You can parse HTML using the BeautifulSoup package.

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as HTML: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

You can also navigate the tag tree using the BeautifulSoup object:

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

3.7 Working with JSON

import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
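
Writing JSON back out is the mirror image (a sketch; the output file name is illustrative):

with open('a_movie_copy.json', 'w') as json_file:
    json.dump(json_data, json_file, indent=2)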

3.8 Working with APIs

You can query APIs directly with the requests package:

# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)

You can pull out the JSON directly from the object using the .json() method:

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

You can drill down through multiple levels of nested JSON by chaining keys (this example assumes a more deeply nested response, such as one from the Wikipedia API, rather than the OMDb response above):

json_data['query']['pages']['24768']['extract']