dfsummarizer package

Submodules

dfsummarizer.FlajoletMartin module

class dfsummarizer.FlajoletMartin.FMEstimator[source]

Bases: object

estimate()[source]
update(val)[source]
update_all(vals)[source]
dfsummarizer.FlajoletMartin.get_prefix(seed)[source]
dfsummarizer.FlajoletMartin.trailing_zeros(n)[source]
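
The names in this module suggest a Flajolet-Martin style distinct-count sketch. The following is a minimal illustration of that general technique, not the package's implementation. It uses a simplified variant that tracks the largest trailing-zero count seen per salted hash and takes the median of 2^R across hashes. The class name SketchEstimator, the salt format, the choice of SHA-256, and the num_hashes parameter are all assumptions made for this example.

```python
import hashlib
import statistics


def trailing_zeros(n: int) -> int:
    """Count trailing zero bits of n (0 is treated as having none here)."""
    if n == 0:
        return 0
    # n & -n isolates the lowest set bit; its bit_length gives its position + 1.
    return (n & -n).bit_length() - 1


class SketchEstimator:
    """Hypothetical stand-in illustrating a Flajolet-Martin style estimator."""

    def __init__(self, num_hashes: int = 32):
        # One salt per hash function (assumed scheme, for illustration only).
        self.prefixes = [f"salt{i}:".encode() for i in range(num_hashes)]
        self.max_zeros = [0] * num_hashes
        self.seen_any = False

    def update(self, val) -> None:
        data = repr(val).encode()
        self.seen_any = True
        for i, prefix in enumerate(self.prefixes):
            digest = hashlib.sha256(prefix + data).digest()
            h = int.from_bytes(digest[:8], "big")
            self.max_zeros[i] = max(self.max_zeros[i], trailing_zeros(h))

    def update_all(self, vals) -> None:
        for v in vals:
            self.update(v)

    def estimate(self) -> float:
        if not self.seen_any:
            return 0.0
        # The median across hash functions damps the variance of a single 2**R.
        return statistics.median(2 ** r for r in self.max_zeros)
```

Because only the maximum per hash is stored, feeding the same values again leaves the estimate unchanged, which is what makes this kind of sketch usable across chunks of a large file. The result is a rough order-of-magnitude estimate, not an exact count.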

dfsummarizer.config module

dfsummarizer.dfsummarizer module

dfsummarizer.dfsummarizer: provides entry point main().

dfsummarizer.dfsummarizer.main()[source]
dfsummarizer.dfsummarizer.print_usage(args)[source]

Command line application usage instructions.

dfsummarizer.funcs module

dfsummarizer.funcs.after_decimal(n)[source]
dfsummarizer.funcs.analyse_df(df)[source]

Given a pandas DataFrame that is already in memory, generate a table of summary statistics and descriptors.

dfsummarizer.funcs.analyse_file(path_to_file)[source]
dfsummarizer.funcs.analyse_file_in_chunks(path_to_file)[source]

Given a path to a large dataset, iteratively load it in chunks and build up the statistics needed to summarise the whole dataset.
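
As a rough sketch of the chunked approach described above, pandas can read a CSV in fixed-size pieces via the chunksize argument of read_csv, and per-chunk counts can be accumulated into running totals. The function name summarise_in_chunks, the tracked statistics (row and null counts), and the tiny in-memory CSV are assumptions for illustration; the package's own bookkeeping may differ.

```python
import io

import pandas as pd


def summarise_in_chunks(source, chunksize=2):
    """Accumulate per-column row and null counts one chunk at a time."""
    totals = {}
    # With chunksize set, read_csv yields DataFrames instead of loading everything.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for col in chunk.columns:
            stats = totals.setdefault(col, {"rows": 0, "nulls": 0})
            stats["rows"] += len(chunk)
            stats["nulls"] += int(chunk[col].isna().sum())
    return totals


# Small in-memory example: five data rows, one missing value per column.
csv_text = "a,b\n1,x\n2,\n3,z\n,w\n5,v\n"
summary = summarise_in_chunks(io.StringIO(csv_text))
```

Counts and sums combine across chunks by simple addition. Statistics such as the number of distinct values do not, which is one reason an approximate estimator can be useful in this setting.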

dfsummarizer.funcs.booleanize(x)[source]
dfsummarizer.funcs.clean_dict(df, col)[source]
dfsummarizer.funcs.coerce_dates(df)[source]
dfsummarizer.funcs.combine_dicts(a, b, op=<built-in function add>)[source]
dfsummarizer.funcs.count_lines(path_to_file)[source]

Return the total number of lines in a file, counted in a way that works regardless of file size.
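
One common way to count lines without holding the whole file in memory is to read fixed-size binary blocks and count newline bytes. The sketch below shows that idea; the function name count_lines_buffered and the block_size default are assumptions, and the package may count lines differently.

```python
def count_lines_buffered(path, block_size=1024 * 1024):
    """Count newline characters by reading the file in fixed-size blocks."""
    count = 0
    with open(path, "rb") as fh:
        while True:
            block = fh.read(block_size)
            if not block:  # empty bytes object signals end of file
                break
            count += block.count(b"\n")
    return count
```

Note that this counts newline characters, so a final line with no trailing newline is not included in the total.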

dfsummarizer.funcs.extract_file_extension(path_to_file)[source]
dfsummarizer.funcs.generate_final_summary(temp, total_chunks)[source]
dfsummarizer.funcs.get_first_non_null_type(thecolumn, startpoint)[source]
dfsummarizer.funcs.get_padded_number(n)[source]
dfsummarizer.funcs.get_padded_val(val)[source]
dfsummarizer.funcs.get_padded_val2(val, spacer)[source]
dfsummarizer.funcs.get_percent_spacer(p)[source]
dfsummarizer.funcs.get_spaces(spacer)[source]
dfsummarizer.funcs.get_type_spacer(t)[source]
dfsummarizer.funcs.infer_type(thetype, unicount, uniques)[source]
dfsummarizer.funcs.infer_type_2(thecolumn, startpoint, unicount, uniques)[source]
dfsummarizer.funcs.isNaN(num)[source]
dfsummarizer.funcs.len_or_null(val)[source]

Alternative len function that simply returns numpy.NA for invalid values. This is needed to get sensible results when running len over a column that may contain nulls.
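
The idea can be sketched as a wrapper that falls back to a null marker when len is not defined for the value. This example uses math.nan as the marker so that it needs only the standard library; that choice, and the name len_or_nan, are assumptions rather than the package's behaviour.

```python
import math


def len_or_nan(val):
    """Return len(val), or NaN when val has no length (e.g. None or a float)."""
    try:
        return len(val)
    except TypeError:
        # Missing values in a column are often None or float NaN; neither has a length.
        return math.nan
```

Applied element-wise over a column of strings with gaps, a function like this yields a numeric column whose missing entries stay missing instead of raising an error.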

dfsummarizer.funcs.load_complete_dataframe(path_to_file)[source]

We load the entire dataset into memory, using the file extension to determine the expected format. We use encoding='latin1' because it appears to permit loading the widest variety of files. Strings may not be represented perfectly, but this is not important for generating a summary of the entire dataset.
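
A minimal sketch of extension-based loading is shown below. The function name load_by_extension and the two supported extensions are assumptions for illustration; the formats the package actually handles are not listed here. Latin-1 maps every possible byte to a character, so decoding with it never fails, although text written in another encoding may come out with the wrong characters.

```python
import os

import pandas as pd


def load_by_extension(path):
    """Pick a pandas reader from the file extension (illustrative subset only)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path, encoding="latin1")
    if ext == ".tsv":
        return pd.read_csv(path, sep="\t", encoding="latin1")
    raise ValueError(f"Unsupported file extension: {ext!r}")
```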

dfsummarizer.funcs.print_csv(s)[source]
dfsummarizer.funcs.print_latex(summary)[source]
dfsummarizer.funcs.print_markdown(s)[source]
dfsummarizer.funcs.round_down(n, decimals=0)[source]

Round down a number to a specified number of decimal places.
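
A common way to round down to a given number of decimal places is to scale, floor, and scale back. The sketch below assumes that approach and uses a hypothetical name, round_down_sketch; it is not necessarily how the package implements round_down.

```python
import math


def round_down_sketch(n, decimals=0):
    """Round n toward negative infinity at the given number of decimal places."""
    factor = 10 ** decimals
    return math.floor(n * factor) / factor
```

With this approach, negative numbers move away from zero because floor rounds toward negative infinity. Binary floating-point representation can also make a value that looks exact in decimal land slightly below its intended boundary after scaling.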

dfsummarizer.funcs.update_temp_summary(temp, df, startpoint)[source]

Module contents