dfsummarizer package
Submodules
dfsummarizer.FlajoletMartin module
dfsummarizer.config module
dfsummarizer.dfsummarizer module
dfsummarizer.dfsummarizer: provides entry point main().
dfsummarizer.funcs module
- dfsummarizer.funcs.analyse_df(df)
Given a pandas DataFrame that is already in memory, generate a table of summary statistics and descriptors.
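As a rough illustration of what such a summary table can look like, here is a minimal sketch. The helper name `summarise_frame` and the chosen columns (dtype, nulls, unique, min, mean, max) are assumptions made for this example; they are not taken from the dfsummarizer source and may differ from what `analyse_df` actually reports.

```python
import pandas as pd


def summarise_frame(df):
    """Hypothetical helper: one summary row per column of an in-memory DataFrame."""
    rows = []
    for col in df.columns:
        series = df[col]
        numeric = pd.api.types.is_numeric_dtype(series)
        rows.append({
            "name": col,
            "dtype": str(series.dtype),
            "nulls": int(series.isna().sum()),
            "unique": int(series.nunique()),  # distinct non-null values
            # Numeric descriptors are left empty for non-numeric columns.
            "min": series.min() if numeric else None,
            "mean": series.mean() if numeric else None,
            "max": series.max() if numeric else None,
        })
    return pd.DataFrame(rows).set_index("name")


example = pd.DataFrame({"x": [1, 2, None], "y": ["a", "b", "a"]})
summary = summarise_frame(example)
print(summary)
```

Building the table from a list of per-column dictionaries keeps the sketch easy to extend with further descriptors.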
- dfsummarizer.funcs.analyse_file_in_chunks(path_to_file)
Given a path to a large dataset, load it iteratively in chunks and build up the statistics needed to summarise the whole dataset.
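The general idea of chunked summarisation can be sketched with the `chunksize` argument of `pandas.read_csv`, which yields one DataFrame per chunk. The function name `summarise_in_chunks`, the CSV-only input, and the tracked statistics below are assumptions for illustration, not the package's actual implementation.

```python
import pandas as pd


def summarise_in_chunks(path_to_file, chunksize=10_000):
    """Hypothetical sketch: accumulate per-column statistics one chunk at a time."""
    stats = {}
    # Passing chunksize makes read_csv return an iterator of DataFrames.
    for chunk in pd.read_csv(path_to_file, chunksize=chunksize, encoding="latin1"):
        for col in chunk.columns:
            series = chunk[col]
            entry = stats.setdefault(
                col, {"rows": 0, "nulls": 0, "min": None, "max": None}
            )
            entry["rows"] += len(series)
            entry["nulls"] += int(series.isna().sum())
            # Running min and max only make sense for numeric chunks with data.
            if pd.api.types.is_numeric_dtype(series) and series.notna().any():
                low, high = series.min(), series.max()
                entry["min"] = low if entry["min"] is None else min(entry["min"], low)
                entry["max"] = high if entry["max"] is None else max(entry["max"], high)
    return stats
```

Only statistics that can be merged across chunks (counts, sums, minima, maxima) fit this pattern directly. Exact distinct counts cannot be merged this way without keeping every value, which is presumably why the package ships a FlajoletMartin module for approximate counting.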
- dfsummarizer.funcs.count_lines(path_to_file)
Return the total number of lines in a file, counted in a way that works regardless of file size.
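One common way to count lines without loading the file into memory is to stream it in fixed-size binary blocks and count newline bytes. The sketch below uses only the standard library; the name `count_lines_sketch` and the `block_size` parameter are hypothetical, and the package's own `count_lines` may work differently.

```python
def count_lines_sketch(path_to_file, block_size=1024 * 1024):
    """Hypothetical sketch: count newline bytes while reading fixed-size blocks."""
    total = 0
    with open(path_to_file, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:  # an empty bytes object signals end of file
                break
            total += block.count(b"\n")
    return total
```

Memory use is bounded by `block_size`, whatever the size of the file. Because the sketch counts newline characters, a final line without a trailing newline is not included in the total.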
- dfsummarizer.funcs.len_or_null(val)
Alternative len function that simply returns numpy.NA for invalid values. This is needed to get sensible results when running len over a column that may contain nulls.
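A null-tolerant length function can be sketched as below. This example assumes the missing-value marker is `numpy.nan`, which is the value NumPy actually exposes; the name `len_or_null_sketch` is hypothetical and the real function may handle invalid input differently.

```python
import numpy as np


def len_or_null_sketch(val):
    """Hypothetical sketch: return len(val), or NaN when val has no length."""
    try:
        return len(val)
    except TypeError:
        # None, floats (including NaN) and ints have no len().
        return np.nan  # assumption: NaN stands in for the missing length


lengths = [len_or_null_sketch(v) for v in ["abc", None, "de", float("nan")]]
print(lengths)
```

Returning NaN rather than raising lets the function be mapped over a whole column, and NaN-aware aggregations such as `numpy.nanmean` can then skip the missing entries.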
- dfsummarizer.funcs.load_complete_dataframe(path_to_file)
We load the entire dataset into memory, using the file extension to determine the expected format. We use encoding='latin1' because it appears to permit loading the largest variety of files. The representation of strings may not be perfect, but that is not important for generating a summary of the entire dataset.
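Dispatching on the file extension might look like the sketch below. The function name `load_by_extension` and the set of handled extensions are assumptions for illustration; the formats actually supported by `load_complete_dataframe` are not listed here. The latin1 codec maps every possible byte to a character, so decoding never fails, although text written in another encoding may be displayed incorrectly.

```python
from pathlib import Path

import pandas as pd


def load_by_extension(path_to_file):
    """Hypothetical sketch: choose a pandas reader from the file extension."""
    suffix = Path(path_to_file).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path_to_file, encoding="latin1")
    if suffix in {".tsv", ".tab"}:
        return pd.read_csv(path_to_file, sep="\t", encoding="latin1")
    if suffix in {".xls", ".xlsx"}:
        # Excel files are binary, so no text encoding is passed.
        return pd.read_excel(path_to_file)
    raise ValueError(f"Unsupported file extension: {suffix!r}")
```

Raising on an unknown extension is a design choice made for this sketch; falling back to a default reader would be an equally reasonable alternative.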