Sharing insights I gathered on my journey to Data Science and Machine Learning
Project maintained by WKirgsn. Hosted on GitHub Pages, theme by mattgraham.
Auto-Downsizing data
10 Feb 2018
There is a vast amount of data published on Kaggle Datasets that one can download and play around with for experimental purposes.
However, more often than not this data is stored with data types that are overkill for the datum’s intended use.
Consider a hypothetical example: a mini data frame with a single ‘age’ column stored as float32.
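Such a data frame might look like the sketch below. The ages are made up for illustration, and the variable name `df` is just an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ages stored as 32-bit floats.
df = pd.DataFrame({"age": np.array([23, 35, 58, 41, 19], dtype=np.float32)})

print(df.dtypes)                      # the age column is float32
print(df["age"].to_numpy().itemsize)  # 4 bytes per stored value
print(df["age"].to_numpy().nbytes)    # 20 bytes for the five values
```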
Problem
The numpy data type used here, float32, takes four bytes for every age datum stored in this mini data frame, on disk as well as in RAM.
Yet it is plausible that a feature such as ‘age’ would not even get close to the upper limit of a single unsigned byte, which can hold whole numbers from 0 to 255.
For small datasets this is no concern, but when working with Big Data and millions of entries in a database, disk and RAM space get exhausted unnecessarily fast.
Converting the demonstrated example is as simple as this:
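A minimal sketch of such a conversion, assuming the ages are whole numbers between 0 and 255 (the toy values are again made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": np.array([23, 35, 58, 41, 19], dtype=np.float32)})

# The cast is only safe for whole numbers within the uint8 range [0, 255].
assert df["age"].between(0, 255).all()
assert (df["age"] % 1 == 0).all()

df["age"] = df["age"].astype(np.uint8)

print(df["age"].dtype)              # uint8
print(df["age"].to_numpy().nbytes)  # 5, one byte per value instead of four
```

Without the two checks, `astype` would silently truncate fractional ages and mangle out-of-range ones, so it pays to keep them around.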
Now nobody’s got time to skim through a dataset of multiple GBs just to identify the potential candidates for data type reduction.
It should come naturally to have an automatic function or class dealing with this reduction by just checking min and max values.
Solution
The following mini class might come in handy for all reduction intentions. Please note that reducing data types based on min and max values assumes that no future data will be added to the dataset that does not fit into the newly converted data type.
This class can be used in the following way:
The smallest data types to convert to can be controlled via ‘conv_table’, an optional argument of the class’s init function.
By default, all integer types are considered as targets, while floats are reduced to float32 at most.
A neat thing about the presented implementation: it even takes advantage of multiple CPU cores, when available, via the joblib Python library.
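How a per-column reduction can be spread across workers with joblib is sketched below. The helper names `downcast_column` and `reduce_parallel` are assumptions, and the per-column work is delegated to pandas’ own `pd.to_numeric(..., downcast=...)` rather than to the class from this post. The `prefer="threads"` hint is also a choice made for this sketch to keep it lightweight; dropping it lets joblib fall back to its default process-based backend.

```python
import numpy as np
import pandas as pd
from joblib import Parallel, delayed


def downcast_column(col):
    """Downcast a single numeric column with pandas' built-in helper."""
    if col.dtype.kind == "f":
        return pd.to_numeric(col, downcast="float")  # float32 at the smallest
    if col.dtype.kind in "iu":
        target = "unsigned" if col.min() >= 0 else "integer"
        return pd.to_numeric(col, downcast=target)
    return col  # non-numeric columns are left alone


def reduce_parallel(df, n_jobs=-1):
    # n_jobs=-1 asks joblib to use all available cores; one task per column.
    cols = Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(downcast_column)(df[name]) for name in df.columns
    )
    return pd.concat(cols, axis=1)


# Made-up data for illustration
df = pd.DataFrame({
    "count": np.array([1, 2, 3, 250], dtype=np.int64),
    "delta": np.array([-120, 4500, 300, 0], dtype=np.int64),
    "ratio": np.array([0.25, 0.5, 0.125, 0.75], dtype=np.float64),
    "label": ["a", "b", "c", "d"],
})

small = reduce_parallel(df)
print(small.dtypes)  # count: uint8, delta: int16, ratio: float32, label: object
```

Since every column is independent of the others, the work parallelises trivially; for a handful of small columns the scheduling overhead will likely outweigh the gain.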
Conclusion
Dealing with inappropriate data types can now be a thing of the past thanks to automated reducing classes.
Converting the dataset and saving it to disk again can save a considerable amount of disk space and, even more crucially, lowers the allocated RAM such that more actual data can be taken into account.