Looping Over Data Sets


Questions:

  • How can I process many data sets with a single command?

Objectives:

  • Be able to read and write globbing expressions that match sets of files.

  • Use glob to create lists of files.

  • Write for loops to perform operations on files given their names in a list.

Key points:

  • Use a for loop to process files given a list of their names.

  • Use glob.glob to find sets of files whose names match a pattern.

  • Use glob and for to process batches of files.

Use a for loop to process files given a list of their names.

  • A filename is a character string.

  • Lists can contain character strings, so a list can hold a set of filenames.

import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165877
gdpPercap_2007    277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64
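
Because each filename is an ordinary string, the loop body can also work on the name itself, not only on the file's contents. The sketch below is illustrative only: it assumes the region is the last underscore-separated part of the file name, which holds for the two names used above but is not something the data files guarantee. It does not read any files.

```python
from pathlib import PurePosixPath

filenames = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']

regions = []
for filename in filenames:
    # Drop the directory and the .csv extension: 'gapminder_gdp_africa'
    stem = PurePosixPath(filename).stem
    # Assumption: the region is the text after the last underscore.
    region = stem.split('_')[-1]
    regions.append(region)
    print(filename, '->', region)
```

A label extracted this way could be used, for example, to title a plot or name an output file for each data set.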

Use glob.glob to find sets of files whose names match a pattern.

  • In Unix, the term “globbing” means “matching a set of files with a pattern”.

  • The most common patterns are:

    • * meaning “match zero or more characters”

    • ? meaning “match exactly one character”

  • Python’s standard library contains the glob module, which provides pattern-matching functionality.

  • The glob module contains a function, also called glob, that matches file patterns.

  • E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.

  • The result is a (possibly empty) list of character strings.

import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data\\gapminder_all.csv', 'data\\gapminder_gdp_africa.csv', 'data\\gapminder_gdp_americas.csv', 'data\\gapminder_gdp_asia.csv', 'data\\gapminder_gdp_europe.csv', 'data\\gapminder_gdp_oceania.csv']
print('all PDB files:', glob.glob('*.pdb'))
all PDB files: []
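
To see the difference between * and ? without depending on which files exist on disk, the wildcard rules can be tried on a plain list of names. This sketch uses the standard library's fnmatch module, which applies the same wildcards to strings. The file names in the list are made up for illustration.

```python
import fnmatch

# Hypothetical file names, used only to illustrate the two wildcards.
names = ['notes.txt', 'data1.csv', 'data2.csv', 'data10.csv', 'data.csv']

# '*' matches zero or more characters, so every name of the form
# data<anything>.csv qualifies, including 'data.csv'.
star_matches = fnmatch.filter(names, 'data*.csv')

# '?' matches exactly one character, so 'data.csv' (zero characters)
# and 'data10.csv' (two characters) are excluded.
question_matches = fnmatch.filter(names, 'data?.csv')

print(star_matches)      # ['data1.csv', 'data2.csv', 'data10.csv', 'data.csv']
print(question_matches)  # ['data1.csv', 'data2.csv']
```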

Use glob and for to process batches of files.

  • This works best when files are named and stored systematically and consistently, so that simple patterns find the right data.

for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
data\gapminder_all.csv 298.8462121
data\gapminder_gdp_africa.csv 298.8462121
data\gapminder_gdp_americas.csv 1397.7171369999999
data\gapminder_gdp_asia.csv 331.0
data\gapminder_gdp_europe.csv 973.5331947999999
data\gapminder_gdp_oceania.csv 10039.595640000001

  • This pattern matches the whole data set (gapminder_all.csv) as well as the per-region files.

  • Use a more specific pattern in the exercises to exclude the whole data set.

  • Note that the minimum of the entire data set is also the minimum of one of the regional data sets (Africa, in the output above), which is a useful check on correctness.
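
One way to see how a narrower pattern changes the match is to compare two patterns against the same set of names. The sketch below creates empty placeholder files in a temporary directory rather than using the lesson's data directory. The narrower pattern 'gapminder_gdp_*.csv' is one possible choice, assumed here for illustration, not necessarily the one the exercises intend.

```python
import glob
import os
import tempfile

# Empty placeholder files that reuse three of the lesson's file names.
names = ['gapminder_all.csv', 'gapminder_gdp_africa.csv', 'gapminder_gdp_asia.csv']

with tempfile.TemporaryDirectory() as tmpdir:
    for name in names:
        open(os.path.join(tmpdir, name), 'w').close()

    # Escape the directory part in case the temporary path contains
    # characters that glob would treat as wildcards.
    base = glob.escape(tmpdir)

    # Broad pattern: matches the whole data set and the regional files.
    broad = sorted(os.path.basename(p)
                   for p in glob.glob(os.path.join(base, 'gapminder_*.csv')))

    # Narrower pattern: requires 'gdp_' after the prefix, so
    # gapminder_all.csv no longer matches.
    narrow = sorted(os.path.basename(p)
                    for p in glob.glob(os.path.join(base, 'gapminder_gdp_*.csv')))

print(broad)
print(narrow)
```

The results are sorted because glob.glob does not promise any particular order.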

Exercise: Determining Matches

Which of these files is not matched by the expression glob.glob('data/*as*.csv')?

  1. data/gapminder_gdp_africa.csv

  2. data/gapminder_gdp_americas.csv

  3. data/gapminder_gdp_asia.csv

  4. 1 and 2 are not matched.

Exercise: Minimum File Size

Modify this program so that it prints the number of records in the file that has the fewest records.

import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')

Note that shape is an attribute of the data frame, not a method: dataframe.shape is a tuple containing the number of rows and the number of columns.

Exercise: Comparing Data

Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.