# Import data from a comma separated values file
import csv
# Calculate statistics
import statistics
# Represent data as tables
import pandas
# Visualize and interact with tables
import itables
=True) itables.init_notebook_mode(all_interactive
Introduction
I want to import football play data from a comma-separated values file and store it as a matrix represented as a list of lists.
Import dependencies
Functions
def read_comma_separated_values_as_matrix(filename: str):
"""Read a comma separated values file into a matrix represented as a list of lists.
Arguments:
filename (str): The name of the comma separated values file to import.
Returns:
(list): This list of lists contains the data from the comma separated values file.
"""
with open(filename, 'rt') as file_handle:
# Reference: https://stackoverflow.com/questions/30076209/import-csv-file-into-a-matrix-array-in-python
= csv.reader(file_handle)
reader = list(reader)
data_as_list_of_lists return data_as_list_of_lists
def cast_list_values_as_floats(this_list: list):
"""Try to convert a list of arbitrary values to floats.
Arguments:
this_list (list): A list of values.
Returns
(list): A list of floats.
"""
try:
= [float(value) for value in this_list]
float_values except:
= []
float_values return float_values
def calculate_summary_statistics(this_list: list):
"""Calculate summary statistics for a list of quantitative data.
Arguments:
this_list (list): A list of quantitative values.
Returns:
(dict): A dictionary of name summary statistics.
"""
= {
summary_statistics "name": 0,
"mean": 0,
"median": 0,
"min": 0,
"max": 0,
"variance": 0,
"standard-deviation": 0,
"count": 0
}# I store the name of the list as the first item in the list or the zero-index item.
"name"] = this_list[0]
summary_statistics[
# First I want to confirm that all the values in my list are quantitative. I will ignore the first element since that is the name of this list.
= cast_list_values_as_floats(this_list[1:])
this_list_floats
# If my list lengths are uneven it indicates to me that not all of the values could be converted to floats and I will ignore this data.
if len(this_list_floats) != len(this_list) - 1:
return summary_statistics
"mean"] = statistics.mean(this_list_floats)
summary_statistics["median"] = statistics.median(this_list_floats)
summary_statistics["min"] = min(this_list_floats)
summary_statistics["max"] = max(this_list_floats)
summary_statistics["variance"] = statistics.variance(this_list_floats)
summary_statistics["standard-deviation"] = statistics.stdev(this_list_floats)
summary_statistics["count"] = len(this_list_floats)
summary_statistics[return summary_statistics
def transpose_matrix(this_matrix: list):
"""Transpose a matrix represented as a list of lists.
Arguments:
this_matrix (list): A matrix represented as a list of lists.
Returns:
(list): A transposed matrix.
"""
return [list(this_list) for this_list in zip(*this_matrix)]
Methods
Organize my data as a column-oriented matrix
A matrix is a structure for organizing data into columns and rows. In Python I represent a matrix as a list of lists.
In the parlance of relational databases a row-oriented database is one where all of the dimensions of a particular thing are linked together. In this dataset each row in the comma separated values table corresponds to a single play in a football game. My data will initially be row-oriented because Python reads files line-by-line. This means that each list in my matrix will contain all of the metrics for one particular play.
This is not particularly useful for me because I want to identify patterns within each dimension of data and use those dimensions to make predictions. I want to transform my dataset to by column-oriented so that I can do that. A column-oriented database is one where all of the values of a particular metric are stored together which makes it easier to pick out only the metrics I am interested in and use statistics to measure them. Instead of each list in my matrix storing all the metrics for a particular play I will organize it so that each list stores all the vales for a particular metric such as Yards
.
Read the data into a row-oriented matrix
I will read the football plays data from the comma separated values file into a list of lists and then check the length of the outer list and the first inner list as a validation measure.
= "football-plays-2024.csv"
football_plays_filename = read_comma_separated_values_as_matrix(football_plays_filename)
row_oriented_matrix print(f"This matrix contains ({len(row_oriented_matrix)}) rows and ({len(row_oriented_matrix[0])}) columns.")
This matrix contains (53284) rows and (45) columns.
Transpose matrix so that it is column-oriented
= transpose_matrix(row_oriented_matrix)
columnar_matrix print(f"This matrix contains ({len(columnar_matrix)}) columns and ({len(columnar_matrix[0])}) rows.")
This matrix contains (45) columns and (53284) rows.
Get all of my column names
Now I want to validate that my data is organized as expected by getting the first item in all of my columns. I expect this to be the name of the column because I have not parsed out the names from the data.
= [column[0] for column in columnar_matrix]
column_names pandas.DataFrame(column_names)
Loading ITables v2.4.2 from the init_notebook_mode cell...
(need help?) |
Generate summary statistics for each column
This calculate_summary_statistics()
function takes in a list as input and outputs a dictionary of summary statistics. A dictionary is a mapping between a key such as Yards
and a value such as 3.6
. I can also think of it as a pair of same length lists where each item in one list is connected to an item in the other list.
= [calculate_summary_statistics(column) for column in columnar_matrix]
list_of_summary_statistics len(list_of_summary_statistics)
45
View the mean of every metric
I can display a dictionary as a dataframe by putting square brackets around the variable name. I am also going to ignore metrics with a mean of zero since I am assuming that they are uninformative.
= {summary['name']: summary['mean'] for summary in list_of_summary_statistics if summary['mean'] != 0}
metric_means pandas.DataFrame([metric_means])
Loading ITables v2.4.2 from the init_notebook_mode cell...
(need help?) |
Organize summary statistics into a table
I have calculated summary statistics column of my data. Now I want to organize that data into a table so that I can visualize it. I have created a dictionary of summary metrics for each dimension of my data such as Yards
and I have stored all of those dictionaries in a list called summary_statistics_list
. What I want is a table that looks something like the following.
name | mean | median | standard-deviation |
---|---|---|---|
Yards | 3.6 | 4.2 | 0.5 |
IsRush | 0.6 | 0.4 | 0.01 |
This again looks like a matrix and so I think I will try to convert my list of dictionaries into a list of lists, which is how I am representing a matrix. Before I merge my dictionaries into a super dictionary I want to remove any dictionaries with a count
value of zero because that indicates that I was not able to use any of the values and they will just add visual clutter.
Remove uninformative dictionaries
= [dictionary for dictionary in list_of_summary_statistics if dictionary['count'] > 0]
clean_list_of_summary_statistics print(f"There are ({len(clean_list_of_summary_statistics)}) columns which passed filtration out of the total of ({len(list_of_summary_statistics)}) columns for which I generated statistics.")
There are (29) columns which passed filtration out of the total of (45) columns for which I generated statistics.
Merge individual dictionaries into a super dictionary
I recorded my thinking for using the list comprehension merged inside a dictionary comprehension in this issue.
= {key: [dictionary[key] for dictionary in clean_list_of_summary_statistics] for key in clean_list_of_summary_statistics[0].keys()}
super_summary_statistics_dictionary len(super_summary_statistics_dictionary)
8
Visualize summary statistics table
pandas.DataFrame(super_summary_statistics_dictionary)
Loading ITables v2.4.2 from the init_notebook_mode cell...
(need help?) |