Klib highlights

4/1/2023


klib.pool_duplicate_subsets() can be applied as a further step, which ultimately reduces the dataset to only 3.8 MB (from 51 MB originally). This function “pools” columns together based on several settings. Specifically, the pooling is achieved by finding duplicates in subsets of the data and encoding the largest possible subset with sufficient duplicates as integers. These integer codes are then added to the original data, which allows dropping the previously identified and now encoded columns. While the encoding itself does not lead to a loss of information, some details might get lost in the aggregation step. While this is unlikely, it is advised to use the “exclude” setting to specifically exclude the target column as well as features that provide sufficient informational content by themselves.

As can be seen in *cat_plot()*, the “carrier” column is made up of a few very frequent values (the top 4 values account for roughly 75%), while in “tailnum” the top 4 values barely make up 2%. This allows us to pool and encode “carrier” and similar columns, while “tailnum” remains in the dataset. Using this procedure, 56006 duplicate rows are identified in the subset, i.e., 56006 rows in 10 columns are encoded into a single column of dtype integer, greatly reducing the memory footprint and the number of columns, which should speed up model training.

All of these functions were run with their relatively “soft” default settings. Many parameters are available, allowing more restrictive data cleaning where needed. Furthermore, the function klib.mv_col_handling() provides a sophisticated selection mechanism for columns with relatively many missing values.
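To make the pooling idea more concrete, here is a minimal sketch in plain pandas. It is not klib's implementation: the toy data, the helper name `pool_columns`, and the output column name `pooled_vars` are all assumptions made for illustration, and the subset of columns is chosen by hand, whereas the text describes klib searching for the largest suitable subset on its own.

```python
import pandas as pd

# Toy flight-style data; all values are made up for illustration.
df = pd.DataFrame(
    {
        "carrier": ["UA", "UA", "AA", "UA", "AA", "DL"],
        "origin": ["EWR", "EWR", "JFK", "EWR", "JFK", "LGA"],
        "dest": ["ORD", "ORD", "MIA", "ORD", "MIA", "ATL"],
        "tailnum": ["N1", "N2", "N3", "N4", "N5", "N6"],
        "dep_delay": [2, 15, -3, 40, 0, 7],
    }
)


def pool_columns(data, cols, new_col="pooled_vars"):
    """Hypothetical helper: replace `cols` with one integer code
    per unique combination of their values."""
    # Rows sharing the same combination of values get the same integer.
    codes = data.groupby(cols, sort=False).ngroup()
    # Drop the now-encoded columns and attach the single code column.
    pooled = data.drop(columns=cols)
    pooled[new_col] = codes.astype("int64")
    return pooled


# Hand-picked subset with many repeated combinations; "tailnum" is
# left out because its values are (almost) all distinct.
subset = ["carrier", "origin", "dest"]

# Rows that repeat an earlier combination within the subset.
n_duplicates = int(df.duplicated(subset=subset).sum())

pooled = pool_columns(df, subset)
print(n_duplicates)
print(pooled)
```

Three columns collapse into one integer column here, while “tailnum” and “dep_delay” stay untouched, which mirrors the “carrier” versus “tailnum” distinction described above. Keeping the target column out of `subset` plays the role that the “exclude” setting plays in klib.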
