Introduction
dfsummarizer is a Python package that aims to provide an easy and intuitive way
to summarize the columns of a dataframe. It natively handles data sets larger
than available memory by processing the file in chunks. It generates a set of standard
statistics for numerical variables, calculates similar statistics for dates, and
examines the length of text variables.
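The chunked approach can be illustrated with a minimal, standard-library sketch. It is not dfsummarizer's actual implementation: the names `RunningStats` and `summarize_numeric_column`, the `chunk_size` parameter, and the choice of Welford's online update are all assumptions made for illustration. The point is only that count, minimum, maximum, mean, and standard deviation can be maintained while holding one chunk of rows in memory at a time.

```python
import csv
import io
import math
from itertools import islice


class RunningStats:
    """Hypothetical accumulator: updates summary statistics one value at a time."""

    def __init__(self):
        self.count = 0
        self.minimum = math.inf
        self.maximum = -math.inf
        self._mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, value):
        # Welford's online update: earlier values never need to be revisited.
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        delta = value - self._mean
        self._mean += delta / self.count
        self._m2 += delta * (value - self._mean)

    @property
    def mean(self):
        return self._mean if self.count else math.nan

    @property
    def std(self):
        # Sample standard deviation (n - 1 in the denominator).
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else math.nan


def summarize_numeric_column(handle, column, chunk_size=1000):
    """Summarize one numeric CSV column, reading chunk_size rows at a time."""
    reader = csv.DictReader(handle)
    stats = RunningStats()
    while True:
        # At most chunk_size rows are held in memory at once.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        for row in chunk:
            text = (row[column] or "").strip()
            if text:  # empty cells are treated as missing and skipped
                stats.update(float(text))
    return stats


# Small in-memory example; a real file handle works the same way.
sample = io.StringIO("id,price\n1,1.0\n2,2.0\n3,\n4,3.0\n5,6.0\n")
result = summarize_numeric_column(sample, "price", chunk_size=2)
print(result.count, result.minimum, result.maximum, result.mean)
```

Statistics such as the median or exact distinct counts cannot be updated this way, which is one reason a chunked summarizer may fall back on estimates for them.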
Motivation
Data frame summarization is a standard task in data science, and the default options in the Python ecosystem provide only partial functionality.
The package has two goals: a command-line app that generates a Markdown (or LaTeX) table summarizing a dataset, and a library of reusable functions that can be integrated into other apps.
Limitations
The summary is currently calculated over the entire dataset. TODO: Sample-based summary.
For large data sets, the number of unique values is currently estimated using the Flajolet-Martin algorithm, which is suboptimal for low-cardinality columns. TODO: Implement a hybrid version that tracks exact counts until the cardinality exceeds a defined threshold.
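One way the hybrid idea might look is sketched below. This is not code from dfsummarizer: the class name `HybridDistinctCounter`, the `threshold` and `num_sketches` parameters, and the use of `hashlib.blake2b` as the hash function are assumptions chosen for illustration. The estimator is the stochastic-averaging variant of Flajolet-Martin probabilistic counting, which may differ from the variant the package uses. The sketch keeps an exact set of values while the cardinality is small and updates the probabilistic bitmaps from the first value onward, so nothing is lost when it switches over.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction factor


def _hash64(value):
    """Deterministic 64-bit hash of a value's repr (an illustrative choice)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def _trailing_zeros(n, width=64):
    """Number of trailing zero bits in n; width is returned for n == 0."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


class HybridDistinctCounter:
    """Hypothetical hybrid: exact below the threshold, estimated above it."""

    def __init__(self, threshold=1000, num_sketches=64):
        self.threshold = threshold
        self.exact = set()  # set to None once the threshold is exceeded
        self.bitmaps = [0] * num_sketches

    def add(self, value):
        h = _hash64(value)
        # Stochastic averaging: part of the hash picks a bitmap,
        # the remainder decides which bit of that bitmap to set.
        bucket = h % len(self.bitmaps)
        rest = h // len(self.bitmaps)
        self.bitmaps[bucket] |= 1 << _trailing_zeros(rest)
        if self.exact is not None:
            self.exact.add(value)
            if len(self.exact) > self.threshold:
                self.exact = None  # free the set and rely on the bitmaps

    def estimate(self):
        if self.exact is not None:
            return len(self.exact)  # exact count for low-cardinality columns
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):  # index of the lowest unset bit
                r += 1
            total += r
        m = len(self.bitmaps)
        return int(m / PHI * 2 ** (total / m))


low = HybridDistinctCounter(threshold=100)
for value in ["a", "b", "a", "c", "b"]:
    low.add(value)
print(low.estimate())  # exact path: 3

high = HybridDistinctCounter(threshold=100)
for value in range(20000):
    high.add(value)
print(high.estimate())  # approximate; exact value depends on the hash
```

The threshold trades memory for accuracy: a larger value keeps more columns on the exact path at the cost of a larger set per column.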