Looping Over Data Sets
questions:
"How can I process many data sets with a single command?"
objectives:
"Be able to read and write globbing expressions that match sets of files."
"Use glob to create lists of files."
"Write for loops to perform operations on files given their names in a list."
keypoints:
"Use a for loop to process files given a list of their names."
"Use glob.glob to find sets of files whose names match a pattern."
"Use glob and for to process batches of files."
Use a for loop to process files given a list of their names.
A filename is a character string.
And lists can contain character strings.
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())
data/gapminder_gdp_africa.csv gdpPercap_1952 298.846212
gdpPercap_1957 335.997115
gdpPercap_1962 355.203227
gdpPercap_1967 412.977514
gdpPercap_1972 464.099504
gdpPercap_1977 502.319733
gdpPercap_1982 462.211415
gdpPercap_1987 389.876185
gdpPercap_1992 410.896824
gdpPercap_1997 312.188423
gdpPercap_2002 241.165877
gdpPercap_2007 277.551859
dtype: float64
data/gapminder_gdp_asia.csv gdpPercap_1952 331.0
gdpPercap_1957 350.0
gdpPercap_1962 388.0
gdpPercap_1967 349.0
gdpPercap_1972 357.0
gdpPercap_1977 371.0
gdpPercap_1982 424.0
gdpPercap_1987 385.0
gdpPercap_1992 347.0
gdpPercap_1997 415.0
gdpPercap_2002 611.0
gdpPercap_2007 944.0
dtype: float64
Use glob.glob to find sets of files whose names match a pattern.
In Unix, the term "globbing" means "matching a set of files with a pattern".
The most common patterns are:
* meaning "match zero or more characters"
? meaning "match exactly one character"
Python's standard library contains the glob module to provide pattern matching functionality.
The glob module contains a function also called glob to match file patterns.
E.g., glob.glob('*.txt') matches all files in the current directory whose names end with .txt.
The result is a (possibly empty) list of character strings.
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))
all csv files in data directory: ['data\\gapminder_all.csv', 'data\\gapminder_gdp_africa.csv', 'data\\gapminder_gdp_americas.csv', 'data\\gapminder_gdp_asia.csv', 'data\\gapminder_gdp_europe.csv', 'data\\gapminder_gdp_oceania.csv']
print('all PDB files:', glob.glob('*.pdb'))
all PDB files: []
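The difference between * and ? is easiest to see on a fixed list of names. The following is a minimal sketch, assuming the standard-library fnmatch module, which applies the same wildcard rules to plain strings rather than to files on disk. The file names are made up for illustration and are not part of the lesson data.

```python
import fnmatch

# Hypothetical file names, used only to illustrate the wildcards.
names = ['run1.csv', 'run2.csv', 'run10.csv', 'notes.txt']

# '*' matches zero or more characters, so every 'run...csv' name qualifies.
star_matches = fnmatch.filter(names, 'run*.csv')

# '?' matches exactly one character, so 'run10.csv' is excluded.
question_matches = fnmatch.filter(names, 'run?.csv')

print(star_matches)
print(question_matches)
```

Because no files are read, this kind of check is a convenient way to try out a pattern before pointing glob.glob at a real directory.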
Use glob and for to process batches of files.
It helps a lot if the files are named and stored systematically and consistently, so that simple patterns will find the right data.
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
data\gapminder_all.csv 298.8462121
data\gapminder_gdp_africa.csv 298.8462121
data\gapminder_gdp_americas.csv 1397.7171369999999
data\gapminder_gdp_asia.csv 331.0
data\gapminder_gdp_europe.csv 973.5331947999999
data\gapminder_gdp_oceania.csv 10039.595640000001
This output includes the combined data set (gapminder_all.csv) as well as the per-region data.
Use a more specific pattern in the exercises to exclude the whole data set.
Note, however, that the minimum of the entire data set is also the minimum of one of the regional data sets, which is a nice check on correctness.
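One way to narrow the pattern can be sketched as follows. The sketch assumes a throwaway temporary directory populated with empty files named like the lesson's data, so it runs without the real CSV files. The pattern 'gapminder_gdp_*.csv' is one possible choice, not the only one.

```python
import glob
import os
import tempfile

# Empty stand-in files; only their names matter for pattern matching.
file_names = ['gapminder_all.csv', 'gapminder_gdp_africa.csv', 'gapminder_gdp_asia.csv']

with tempfile.TemporaryDirectory() as data_dir:
    for name in file_names:
        open(os.path.join(data_dir, name), 'w').close()

    # Broad pattern: matches the combined file and the regional files.
    broad = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(data_dir, 'gapminder_*.csv'))
    )

    # Narrower pattern: requires 'gdp_' in the name, so the combined file is skipped.
    regional = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(data_dir, 'gapminder_gdp_*.csv'))
    )

print(broad)
print(regional)
```

The results are sorted because glob.glob does not promise any particular order.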
Exercise: Determining Matches
Which of these files is not matched by the expression glob.glob('data/*as*.csv')?
1. data/gapminder_gdp_africa.csv
2. data/gapminder_gdp_americas.csv
3. data/gapminder_gdp_asia.csv
4. 1 and 2 are not matched.
See Solution
1 is not matched by the glob.
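The answer can be checked without any files on disk. This is a rough sketch, assuming fnmatch.fnmatchcase from the standard library as a stand-in for glob's matching. Unlike glob, its * can also match path separators, which makes no difference for these three names.

```python
import fnmatch

pattern = 'data/*as*.csv'
candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]

# Record, for each candidate, whether the pattern matches it.
results = {name: fnmatch.fnmatchcase(name, pattern) for name in candidates}

for name, matched in results.items():
    print(name, 'matched' if matched else 'not matched')
```

Only the Africa file lacks the letters "as" in sequence, so it is the one the pattern skips.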
Exercise: Minimum File Size
Modify this program so that it prints the number of records in the file that has the fewest records.
import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
Note that the shape attribute of a data frame is a tuple containing its number of rows and columns.
See Solution
import glob
import pandas as pd
fewest = float('Inf')
for filename in glob.glob('data/*.csv'):
    dataframe = pd.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print('smallest file has', fewest, 'records')
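The reason for starting from float('Inf') can be seen in isolation. The following is a minimal sketch in which the record counts are made-up numbers standing in for dataframe.shape[0] of each file, so no data files are needed.

```python
# Hypothetical record counts, one per file (not taken from the lesson data).
record_counts = [120, 45, 18, 60]

# Infinity is larger than any real count, so the first file always replaces it.
fewest = float('inf')
for count in record_counts:
    fewest = min(fewest, count)

print('smallest file has', fewest, 'records')
```

If the list of files were empty, fewest would stay at infinity, which is a visible hint that the pattern matched nothing.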
Exercise: Comparing Data
Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart.
See Solution
This solution builds a useful legend by using the string split method to extract the region from the path "data/gapminder_gdp_a_specific_region.csv". The pathlib module also provides useful abstractions for file and path manipulation, like returning the name of a file without the file extension.
import glob
import pandas as pd
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 1)
for filename in glob.glob('data/gapminder_gdp*.csv'):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`),
    # and then remove the `.csv` extension from that string.
    region = filename.split('_')[-1][:-4]
    # average only the numeric columns; recent pandas versions raise an error on the text `country` column otherwise.
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
plt.legend()
plt.show()
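The pathlib alternative mentioned in the solution could look roughly like this. It is a sketch, not part of the original solution: region_from_path is a hypothetical helper name, and the pure path classes are used only so that both Unix-style and Windows-style paths can be shown on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath


def region_from_path(path):
    """Return the text after the last underscore in the file name, without the extension."""
    # .stem drops the directory and the final extension, e.g. 'gapminder_gdp_africa'.
    return path.stem.split('_')[-1]


posix_region = region_from_path(PurePosixPath('data/gapminder_gdp_africa.csv'))
windows_region = region_from_path(PureWindowsPath('data\\gapminder_gdp_asia.csv'))

print(posix_region)
print(windows_region)
```

In the plotting loop, pathlib.Path(filename) would play the same role, and using .stem avoids counting the characters in the extension by hand.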