Introduction

I want to import football play data from a comma-separated values file and store it as a matrix represented as a list of lists.

Import dependencies

# Import data from a comma separated values file
import csv
# Calculate statistics
import statistics
# Represent data as tables
import pandas
# Visualize and interact with tables
import itables
itables.init_notebook_mode(all_interactive=True)

init_notebook_modetrusted

Functions

def read_comma_separated_values_as_matrix(filename: str):
    """Read a comma separated values file into a matrix represented as a list of lists.

    Arguments:
        filename (str): The name of the comma separated values file to import.
    Returns:
        (list): This list of lists contains the data from the comma separated values file.
    """
    with open(filename, 'rt') as file_handle:
        # Reference: https://stackoverflow.com/questions/30076209/import-csv-file-into-a-matrix-array-in-python
        reader = csv.reader(file_handle)
        data_as_list_of_lists = list(reader)
        return data_as_list_of_lists

def cast_list_values_as_floats(this_list: list):
    """Try to convert a list of arbitrary values to floats.

    Arguments:
        this_list (list): A list of values.
    Returns
        (list): A list of floats.
    """
    try:
        float_values = [float(value) for value in this_list]
    except:
        float_values = []
    return float_values

def calculate_summary_statistics(this_list: list):
    """Calculate summary statistics for a list of quantitative data.

    Arguments:
        this_list (list): A list of quantitative values.
    Returns:
        (dict): A dictionary of name summary statistics.
    """
    summary_statistics = {
        "name": 0,
        "mean": 0,
        "median": 0,
        "min": 0,
        "max": 0,
        "variance": 0,
        "standard-deviation": 0,
        "count": 0
    }
    # I store the name of the list as the first item in the list or the zero-index item.
    summary_statistics["name"] = this_list[0]

    # First I want to confirm that all the values in my list are quantitative. I will ignore the first element since that is the name of this list.
    this_list_floats = cast_list_values_as_floats(this_list[1:])

    # If my list lengths are uneven it indicates to me that not all of the values could be converted to floats and I will ignore this data.
    if len(this_list_floats) != len(this_list) - 1:
        return summary_statistics
    
    summary_statistics["mean"] = statistics.mean(this_list_floats)
    summary_statistics["median"] = statistics.median(this_list_floats)
    summary_statistics["min"] = min(this_list_floats)
    summary_statistics["max"] = max(this_list_floats)
    summary_statistics["variance"] = statistics.variance(this_list_floats)
    summary_statistics["standard-deviation"] = statistics.stdev(this_list_floats)
    summary_statistics["count"] = len(this_list_floats)
    return summary_statistics

def transpose_matrix(this_matrix: list):
    """Transpose a matrix represented as a list of lists.

    Arguments:
        this_matrix (list): A matrix represented as a list of lists.
    Returns:
        (list): A transposed matrix.
    """
    return [list(this_list) for this_list in zip(*this_matrix)]

Methods

Organize my data as a column-oriented matrix

A matrix is a structure for organizing data into columns and rows. In Python I represent a matrix as a list of lists.

In the parlance of relational databases a row-oriented database is one where all of the dimensions of a particular thing are linked together. In this dataset each row in the comma separated values table corresponds to a single play in a football game. My data will initially be row-oriented because Python reads files line-by-line. This means that each list in my matrix will contain all of the metrics for one particular play.

This is not particularly useful for me because I want to identify patterns within each dimension of data and use those dimensions to make predictions. I want to transform my dataset to by column-oriented so that I can do that. A column-oriented database is one where all of the values of a particular metric are stored together which makes it easier to pick out only the metrics I am interested in and use statistics to measure them. Instead of each list in my matrix storing all the metrics for a particular play I will organize it so that each list stores all the vales for a particular metric such as Yards.

Read the data into a row-oriented matrix

I will read the football plays data from the comma separated values file into a list of lists and then check the length of the outer list and the first inner list as a validation measure.

football_plays_filename = "football-plays-2024.csv"
row_oriented_matrix = read_comma_separated_values_as_matrix(football_plays_filename)
print(f"This matrix contains ({len(row_oriented_matrix)}) rows and ({len(row_oriented_matrix[0])}) columns.")

This matrix contains (53284) rows and (45) columns.

Transpose matrix so that it is column-oriented

columnar_matrix = transpose_matrix(row_oriented_matrix)
print(f"This matrix contains ({len(columnar_matrix)}) columns and ({len(columnar_matrix[0])}) rows.")

This matrix contains (45) columns and (53284) rows.

Get all of my column names

Now I want to validate that my data is organized as expected by getting the first item in all of my columns. I expect this to be the name of the column because I have not parsed out the names from the data.

column_names = [column[0] for column in columnar_matrix]
pandas.DataFrame(column_names)

Loading ITables v2.4.2 from the init_notebook_mode cell... (need help?)

Generate summary statistics for each column

This calculate_summary_statistics() function takes in a list as input and outputs a dictionary of summary statistics. A dictionary is a mapping between a key such as Yards and a value such as 3.6. I can also think of it as a pair of same length lists where each item in one list is connected to an item in the other list.

list_of_summary_statistics = [calculate_summary_statistics(column) for column in columnar_matrix]
len(list_of_summary_statistics)

View the mean of every metric

I can display a dictionary as a dataframe by putting square brackets around the variable name. I am also going to ignore metrics with a mean of zero since I am assuming that they are uninformative.

metric_means = {summary['name']: summary['mean'] for summary in list_of_summary_statistics if summary['mean'] != 0}
pandas.DataFrame([metric_means])

Loading ITables v2.4.2 from the init_notebook_mode cell... (need help?)

Organize summary statistics into a table

I have calculated summary statistics column of my data. Now I want to organize that data into a table so that I can visualize it. I have created a dictionary of summary metrics for each dimension of my data such as Yards and I have stored all of those dictionaries in a list called summary_statistics_list. What I want is a table that looks something like the following.

name	mean	median	standard-deviation
Yards	3.6	4.2	0.5
IsRush	0.6	0.4	0.01

This again looks like a matrix and so I think I will try to convert my list of dictionaries into a list of lists, which is how I am representing a matrix. Before I merge my dictionaries into a super dictionary I want to remove any dictionaries with a count value of zero because that indicates that I was not able to use any of the values and they will just add visual clutter.

Remove uninformative dictionaries

clean_list_of_summary_statistics = [dictionary for dictionary in list_of_summary_statistics if dictionary['count'] > 0]
print(f"There are ({len(clean_list_of_summary_statistics)}) columns which passed filtration out of the total of ({len(list_of_summary_statistics)}) columns for which I generated statistics.")

There are (29) columns which passed filtration out of the total of (45) columns for which I generated statistics.

Merge individual dictionaries into a super dictionary

I recorded my thinking for using the list comprehension merged inside a dictionary comprehension in this issue.

super_summary_statistics_dictionary = {key: [dictionary[key] for dictionary in clean_list_of_summary_statistics] for key in clean_list_of_summary_statistics[0].keys()}
len(super_summary_statistics_dictionary)

Visualize summary statistics table

pandas.DataFrame(super_summary_statistics_dictionary)

Loading ITables v2.4.2 from the init_notebook_mode cell... (need help?)