
Capital Redeployment Algorithm

Some capital investment experts say there are two basic things that lead to a high-performing portfolio:

  • A risk-balanced, diversified portfolio and
  • Periodic asset redeployment

The former has no definitive answer and should be adapted to each investor individually. Generally speaking, higher-risk assets can make up a larger share of a younger saver’s total investment, since they have more time to ride out adverse market trends, whereas the older generation should stick to steadier securities. For a good overview of different portfolio types see Investopedia’s guideline on popular portfolio types. Diversification, however, is comfortably achieved with passively managed funds such as ETFs, which come with large cost savings and outperform actively managed funds almost all the time - the Economist confirms.

The second virtue of a successful stock investment, on the other hand, is dead simple: reassess your capital as often as you feel comfortable with, at least once a year, or as soon as your asset weightings drift apart drastically, say by 5 percentage points in a single asset.
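As a tiny, hedged sketch of such a drift check (account names, balances and targets are invented for illustration):

# minimal drift check; account names, balances and targets are invented
balances = {'Stocks': 3700, 'Bonds': 6300}
targets = {'Stocks': 0.3, 'Bonds': 0.7}

total = sum(balances.values())
drift = {k: balances[k] / total - targets[k] for k in balances}
print(drift)                                       # roughly 0.07 / -0.07
print(any(abs(d) > 0.05 for d in drift.values()))  # True -> time to redeploy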

What is Asset Redeployment?

Reassessing capital means transferring money to your different assets in such a way that each asset again holds the share of the total capital that you initially assigned to it.

Imagine, for example, you want 30% of your capital in stocks and the other 70% in bonds, since that is a risk apportionment you can live with. The next day the stock market booms and you see your stock capital soaring to undreamed-of heights. At first you jump for joy, but then you halt, as one glance at the new capital weights makes you shiver: stocks and bonds are fifty-fifty. Your risk portfolio is suddenly highly imbalanced and you are burning for action; a money transfer is necessary. You go ahead and deposit the required amount into your bond asset.
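To put rough numbers on that thought experiment (the figures are invented for illustration): if stocks and bonds each sit at 10,000 EUR after the boom, the deposit x into bonds has to satisfy

(10000 + x) / (20000 + x) = 0.7
<=> x = (0.7 * 20000 - 10000) / (1 - 0.7) ≈ 13,333 EUR

so restoring the 30/70 split with a single deposit already costs more than either position is currently worth.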

Nothing simpler than that!

But hold on! I don’t have two assets, I have five or six! Besides that, I don’t place large sums on my portfolio at once, I make small monthly payments.

Calculating the different amounts necessary for multiple assets that change in different directions is non-trivial and gave me headaches in the beginning. Very soon it became clear that it has to be solved iteratively by an algorithm.

The following Python script shows how it can be done.

The Algorithm

"""Mapping for current balance on each kind of capital and the desired 
weight. Dict with identifier keys and dict-values that have current and 
target pairs."""
CURRENT_TARGET_DICT = {
    'Stocks':
        {'actual':  6752,
         'target_weight': 0.3},
    'Commodities':
        {'actual': 3448,
         'target_weight': 0.15},
    'Long-Term Bonds':
        {'actual': 8360,
         'target_weight': 0.4},
    'Mid-Term Bonds':
        {'actual': 3089,
         'target_weight': 0.15}
}

# The monthly rate that is remitted and split among all accounts
MONTHLY_RATE = 500


def main():
    max_months_until_targets_met = 24
    total_capital = sum(v['actual'] for k, v in CURRENT_TARGET_DICT.items())

    redeployment_split_plan = {}
    # find amount of months until portfolio targets met
    for _m in range(1, max_months_until_targets_met+1):
        further_increase_m = False
        for acc, current_dict in CURRENT_TARGET_DICT.items():
            rate = \
                (total_capital+_m*MONTHLY_RATE)*current_dict['target_weight']\
                - current_dict['actual']
            if 0 <= rate <= _m*MONTHLY_RATE:
                # current amount of months is feasible for this account
                redeployment_split_plan[acc] = {'total_required': rate}
                continue
            else:
                # too few months to meet portfolio targets
                further_increase_m = True
                break

        if further_increase_m:
            continue
        else:
            total_remit = sum(v['total_required'] for k, v in
                              redeployment_split_plan.items())
            required_months = int(total_remit // MONTHLY_RATE)
            # calculate schedule
            for acc, plan in redeployment_split_plan.items():
                rate = \
                    (plan['total_required'] / total_remit)*MONTHLY_RATE

                plan['schedule'] = [rate, ]*required_months

            # schedule calculation finished for all accounts
            break
    else:
        raise ValueError('ERR: No schedule found within the next {} months'
                         .format(max_months_until_targets_met))

    report_schedule(redeployment_split_plan, total_capital)


def report_schedule(plan, _total_capital):
    print('## Redeployment Plan ##\n')
    for acc, info in plan.items():
        print(acc)
        print('Current ratio: {:.3f} - target: {:.3f}'
              .format(CURRENT_TARGET_DICT[acc]['actual']/_total_capital,
                      CURRENT_TARGET_DICT[acc]['target_weight']))
        print('Total Required: {:.2f} EUR'.format(info['total_required']))
        print('Remit {:.2f} EUR for {:} months'.format(info['schedule'][0],
                                                  len(info['schedule'])))
        print('After that, continue with a rate of {:.2f} EUR\n'.format(
            MONTHLY_RATE*CURRENT_TARGET_DICT[acc]['target_weight']))


if __name__ == '__main__':
    main()

The user only has to adapt CURRENT_TARGET_DICT and MONTHLY_RATE at the beginning of the script to her/his own investment situation. With the dummy values shown in the script, the following output is to be expected:

## Redeployment Plan ##

Stocks
Current ratio: 0.312 - target: 0.300
Total Required: 192.70 EUR
Remit 64.23 EUR for 3 months
After that, continue with a rate of 150.00 EUR

Long-Term Bonds
Current ratio: 0.386 - target: 0.400
Total Required: 899.60 EUR
Remit 299.87 EUR for 3 months
After that, continue with a rate of 200.00 EUR

Commodities
Current ratio: 0.159 - target: 0.150
Total Required: 24.35 EUR
Remit 8.12 EUR for 3 months
After that, continue with a rate of 75.00 EUR

Mid-Term Bonds
Current ratio: 0.143 - target: 0.150
Total Required: 383.35 EUR
Remit 127.78 EUR for 3 months
After that, continue with a rate of 75.00 EUR

Why not withdraw from assets with too much weight?

The general approach in reassessing the asset weights is to remit positive amounts to each asset in an appropriate ratio. The reason for this is that withdrawing money from funds can lead to taxable profit, which would impair your earnings, since the point in time at which you reassess your assets is seldom the right moment to sell shares with regard to the market course.

Mathematical Background

The approach is straightforward and will be illustrated with two hypothetical assets. With x1 and x2 being the unknown rates to remit to assets one and two, m the unknown number of months required, B the total monthly rate remittable, S1 and S2 the target shares of assets one and two, and K1 and K2 the current balances, we have the following equations:

x1 + x2 = m * B
K1 + x1 = (K1 + K2 + m*B) * S1
K2 + x2 = (K1 + K2 + m*B) * S2

<=> x1 = (K1 + K2 + m*B) * S1 - K1  (1)
<=> x2 = (K1 + K2 + m*B) * S2 - K2  (2)

0 <= x1 <= m * B    (3)
0 <= x2 <= m * B    (4)

Eqs. (1) and (2) give us the amount to remit to each asset as soon as we have a value for the unknown m (number of months). Since negative rates are not allowed, the algorithm counts up m and attempts to get non-negative rates for each asset until a feasible number of months is found.
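Here is a minimal two-asset sketch of that search (K1, K2, S1, S2 and B are invented dummy values), just to make the counting-up of m concrete:

# toy two-asset version of the search for m; all figures are dummy values
K1, K2 = 5000, 1000   # current balances
S1, S2 = 0.5, 0.5     # target weights
B = 500               # monthly rate

for m in range(1, 25):
    x1 = (K1 + K2 + m * B) * S1 - K1   # eq. (1)
    x2 = (K1 + K2 + m * B) * S2 - K2   # eq. (2)
    if x1 >= 0 and x2 >= 0:            # eqs. (3) and (4); note x1 + x2 == m * B
        print('m =', m, '-> x1 =', x1, ', x2 =', x2)
        break

With these dummy numbers the loop stops at m = 8 with x1 = 0 and x2 = 4000, i.e. all of the next eight monthly rates go into the second asset.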

Conclusion

Once or twice a year, when it comes down to reassessing your portfolio, you can help yourself with this script to find the required remittances such that your risk profile is healthy again.

Please feel free to convert this algorithm to your preferred programming language.

If you want to read more on successfully building up a portfolio you can follow up on this Investopedia article.


High Cardinality and Custom Aggregations

Every now and then, transforming data column by column, e.g. by adding the log of each floating-point feature to the dataset, is not enough and more ambitious transformations are desired. Perhaps you would like a certain feature transformed depending on the value the corresponding observation holds in another feature.

Pandas aggregate function

The Pandas library comes with a highly optimized aggregation function that covers the most common statistical summaries by default.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
                       index=pd.date_range('1/1/2000', periods=10))
>>> df['D'] = np.tile([0, 1], 5)
>>> df
                   A         B         C  D
2000-01-01  0.506738  0.063401 -0.864185  0
2000-01-02  0.067840  0.272242 -1.425997  1
2000-01-03  0.371210  0.558194 -0.151807  0
2000-01-04 -0.363128 -0.882355  1.254074  1
2000-01-05  0.735415 -0.072763 -0.634464  0
2000-01-06  0.796265  1.525935  0.086280  1
2000-01-07 -0.048933  0.178912 -0.039579  0
2000-01-08 -0.730741 -1.026607  1.095198  1
2000-01-09 -1.649780  1.245154 -0.610097  0
2000-01-10  0.003570 -0.132191 -0.196222  1

>>> df.agg(['min', 'max', 'sum'])
            A         B         C  D
min -1.649780 -1.026607 -1.425997  0
max  0.796265  1.525935  1.254074  1
sum -0.311544  1.729923 -1.486800  5

The agg function is short for aggregation and takes either strings of known function names, such as min or sum, or homebrewed custom aggregation functions. One could also get these statistical characteristics by other means, but the pandas aggregation is nevertheless worth a try, since it is backed by a C implementation that makes it very fast.
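Custom callables can be mixed with the built-in names; as a small illustration (the lambda here is just an invented peak-to-peak statistic, not part of the example above):

>>> # mix a built-in name with a homebrewed aggregation (value range per column)
>>> df.agg(['mean', lambda s: s.max() - s.min()])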

Categorical features with high cardinality

Imagine you are given a categorical feature with very high cardinality, say, the postal code of a citizen encoded as 200 integers from 0 to 199. Postal codes have no greater-less relationship, but the enumeration suggests one. While decision-tree-based models are capable of ignoring the artificial ordering of categorical feature encodings, plain linear models desperately fall for it. The result is over-fitting.

>>> # Assume 10 postal codes for this demo
>>> df['PostalCode'] = np.asarray(list(range(10)))
>>> df
                   A         B         C  D  PostalCode
2000-01-01  0.506738  0.063401 -0.864185  0           0
2000-01-02  0.067840  0.272242 -1.425997  1           1
2000-01-03  0.371210  0.558194 -0.151807  0           2
2000-01-04 -0.363128 -0.882355  1.254074  1           3
2000-01-05  0.735415 -0.072763 -0.634464  0           4
2000-01-06  0.796265  1.525935  0.086280  1           5
2000-01-07 -0.048933  0.178912 -0.039579  0           6
2000-01-08 -0.730741 -1.026607  1.095198  1           7
2000-01-09 -1.649780  1.245154 -0.610097  0           8
2000-01-10  0.003570 -0.132191 -0.196222  1           9

One Hot Encoding

A popular way around this is so-called one-hot encoding, that is, converting a single categorical feature with n distinct values into n boolean features. For low-cardinality features, such as gender (male or female), this is a pretty neat thing, but for our postal code we end up with sparse features that bloat RAM and may even hinder statistical models from learning latent patterns as the feature space explodes.

>>> # One hot encode
>>> encoded_df = pd.get_dummies(df['PostalCode'], prefix='postalcode_')
>>> encoded_df
            postalcode__0  postalcode__1  postalcode__2  postalcode__3  \
2000-01-01              1              0              0              0   
2000-01-02              0              1              0              0   
2000-01-03              0              0              1              0   
2000-01-04              0              0              0              1   
2000-01-05              0              0              0              0   
2000-01-06              0              0              0              0   
2000-01-07              0              0              0              0   
2000-01-08              0              0              0              0   
2000-01-09              0              0              0              0   
2000-01-10              0              0              0              0   

            postalcode__4  postalcode__5  postalcode__6  postalcode__7  \
2000-01-01              0              0              0              0   
2000-01-02              0              0              0              0   
2000-01-03              0              0              0              0   
2000-01-04              0              0              0              0   
2000-01-05              1              0              0              0   
2000-01-06              0              1              0              0   
2000-01-07              0              0              1              0   
2000-01-08              0              0              0              1   
2000-01-09              0              0              0              0   
2000-01-10              0              0              0              0   

            postalcode__8  postalcode__9  
2000-01-01              0              0  
2000-01-02              0              0  
2000-01-03              0              0  
2000-01-04              0              0  
2000-01-05              0              0  
2000-01-06              0              0  
2000-01-07              0              0  
2000-01-08              0              0  
2000-01-09              1              0  
2000-01-10              0              1  

Alternatives to One-Hot-Encoding

Great effort is put into finding alternatives to one-hot encoding in order to overcome its drawbacks, especially for high cardinality. To mention a few, there are the supervised ratio and the weight of evidence. Another one worth mentioning is described in this paper, which I would prefer naming after its author, Micci-Barreca. These transformations were especially useful for the recent Porto Seguro Kaggle competition, where high-cardinality features were present.

The idea is to transform the categorical values into continuous values that correspond to their target frequency. For example, if the categorical value 2 very often comes with the boolean target True or 1, then all observations with a categorical 2 will be transformed to, e.g., 0.97.
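The simplest flavour, the supervised ratio, is literally just the mean of a binary target per category. As a one-line sketch on the toy dataframe from above (with D as the target):

>>> # supervised ratio: share of positive targets per postal code
>>> df.groupby('PostalCode')['D'].mean()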

Please note that these transformations depend heavily on the target distribution over the specific categorical feature. Since we are effectively introducing target information into our independent variables, we might run into data leakage, where we use too much information from the very target variable we are trying to predict. In the worst case we get over-fitting. Again, there are several ways to mitigate that, for instance by adding noise to the transformations.
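A minimal sketch of the noise idea (the variable encoded and the noise level are assumptions for illustration, not part of the transformations below):

import numpy as np

# assumed: `encoded` is a pandas Series holding a target-encoded feature
noise_level = 0.01  # arbitrary; tune for your data
encoded_noisy = encoded * (1 + noise_level * np.random.randn(len(encoded)))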

Custom Functions for Pandas aggregate

Now, how can we apply those alternatives? Let’s consider column D to be our target and say we want to transform PostalCode. For a more representative example we now distribute D and PostalCode randomly over 10000 observations. We will apply the weight-of-evidence and the Micci-Barreca transformations.

import numpy as np
import pandas as pd


def _woe(s, tp, tn):
    """Weight of evidence

    woe_i = ln(P_i/TP) - ln(N_i/TN)

    :param s: pandas groupby obj
    :param tp: total positives in full series (target prior)
    :param tn: total negatives in full series
    """
    p = s.sum()
    nom = p / tp
    den = (s.count() - p) / tn
    return np.log(nom / den)


def _micci_barreca_encode(s, prior, min_samples_leaf=1, smoothing=1):
    """Micci-Barreca encoding

    This transformation outputs something between supervised ratio and target
    prior, depending on the smoothing level.

    :param s: pandas groupby obj
    :param prior: target prior (mean of the target over the full series)
    """
    smoothing = \
        1 / (1 + np.exp(-(s.count() - min_samples_leaf) / smoothing))
    return prior * (1 - smoothing) + s.mean() * smoothing


if __name__ == '__main__':
    n_observations = 10000
    df = pd.DataFrame({'D': np.random.randint(2, size=n_observations),
                       'PostalCode': np.random.randint(10, size=n_observations)
                      })
    print('Original Table:')
    print(df.head(10))
    n_positives = df['D'].sum()   # total count of positive targets
    n_total = df['D'].count()
    aggregation_agenda = \
        {'_woe': lambda x: _woe(x, n_positives, n_total - n_positives),
         '_micci': lambda x: _micci_barreca_encode(x, n_positives / n_total,
                                                   min_samples_leaf=100,
                                                   smoothing=10),
         }
    col = 'PostalCode'
    transformed_df = \
        df.groupby([col], as_index=False).D\
            .agg(aggregation_agenda)\
            .rename(columns={agg_key: col+agg_key for
                             agg_key in aggregation_agenda.keys()})
    print('Transformation/Mapping table:')
    print(transformed_df.head(10))

    df = df.merge(transformed_df, how='left', on=col)
    print('Merged Table:')
    print(df.head(10))

After creating the artificial dataset, the number of positive targets and the total count are computed, which are needed for both the weight of evidence and the Micci-Barreca encoding. We summarize the transformations in an aggregation_agenda that is then used by pandas’ agg function to compile the output. Note the use of the groupby function, which groups the target values per categorical value.

The last step is merging both tables, that is, the original and the transformed features. The above code will output the following:

Original Table:
   D  PostalCode
0  0           5
1  1           8
2  1           3
3  1           0
4  1           4
5  1           5
6  1           1
7  0           3
8  0           9
9  1           6

Transformation/Mapping table:
   PostalCode  PostalCode_micci  PostalCode_woe
0           0          0.507952       -0.012996
1           1          0.501496       -0.038825
2           2          0.514535        0.013348
3           3          0.515122        0.015699
4           4          0.497996       -0.052824
5           5          0.503462       -0.030960
6           6          0.509845       -0.005424
7           7          0.534010        0.091444
8           8          0.516771        0.022304
9           9          0.511263        0.000254

Merged Table:
   D  PostalCode  PostalCode_micci  PostalCode_woe
0  0           5          0.503462       -0.030960
1  1           8          0.516771        0.022304
2  1           3          0.515122        0.015699
3  1           0          0.507952       -0.012996
4  1           4          0.497996       -0.052824
5  1           5          0.503462       -0.030960
6  1           1          0.501496       -0.038825
7  0           3          0.515122        0.015699
8  0           9          0.511263        0.000254
9  1           6          0.509845       -0.005424

Do not forget to drop the original PostalCode column when working with the merged table!
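That is a one-liner, sketched here for completeness:

df = df.drop('PostalCode', axis=1)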

Conclusion

We have seen how pandas can be leveraged to create new features from high-cardinality features, among others, that go beyond simple statistics such as mean or standard deviation. It should be clear that one-hot encoding is not always the best choice and that alternative approaches are available. Getting familiar with the concepts of aggregate, groupby and merge helps everyone facilitate high-performance data analyses with just a few lines of code.


Pandas and sklearn pipelines

Having to deal with a lot of labeled data, one won’t get around using the great pandas library sooner or later. Its benefits over raw numpy are obvious.

Now, pandas is a library that came up some time after numpy. The bad thing about this: some other great tools started growing immensely without having pandas at hand, so they had to be built upon numpy.

With that I am talking about sklearn and in particular its awesome mechanisms for pipelines and feature unions. While most people involved in Data Science know what sklearn is, few have actually used the latter. One of the reasons for this collectively unused potential might be the lack of pandas support.

Here’s a brief refresher on pipelines. Creating a pipeline is as simple as this:

import pandas as pd
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline, make_pipeline

# create pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('polynomials', PolynomialFeatures()),
                ])

# alternatively
pipe = make_pipeline(StandardScaler(), PolynomialFeatures())

Transforming data through this pipeline is also straightforward:

df = pd.read_csv('path/to/data.csv')
transformed = pipe.fit_transform(df)

The fit_transform function is just a consecutive call of fit() and transform(), the same syntax as for sklearn’s famous regressor and classifier models. You get the idea of pipelines.

Unfortunately there is a considerable drawback.

Problem

The returned objects of pipelines and especially feature unions are numpy arrays. This is partly due to the internals of pipelines and partly due to the elements of the pipeline themselves, that is, sklearn’s statistical models and transformers such as StandardScaler. When you rely on your transformed dataset retaining the pandas dataframe structure, e.g. because your model has column-specific processing, there is no out-of-the-box solution.

Please note that there is a wrapper available, called sklearn-pandas. Before I came up with the following solution I tried that wrapper, and I felt it has some implausible limitations, such as no pandas output in case of a default transformation, or a very cluttered syntax, among others. I kind of lost control over some pipelines when they were mixed with feature unions, so that special handling wasn’t feasible anymore.

Solution

Facing certain challenges in my own machine learning routines that I couldn’t tackle with the aforementioned wrapper, I found that a homebrew approach is not as complicated as it might seem. Moreover, with the following blueprints I gained full control over each element in a pipeline, which can be arbitrarily mixed with feature unions, and I still get a pandas dataframe as the returned object.

Let’s dive right into it!

Transformation Chains

Each pipeline and feature union consists of elements that are chained, working on top of each other or in parallel, hand in hand. Here’s a compiled list of some transformers that are pandas-aware.

Simple Transformations

Imagine you want your dataset to be expanded with the data itself but transformed, e.g. by taking its square root or its log. You would approach this with a simple feature union, but what are the elements of that union? Examine the following:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleTransformer(BaseEstimator, TransformerMixin):
    """Apply given transformation."""
    def __init__(self, trans_func, untrans_func, columns):
        self.transform_func = trans_func
        self.inverse_transform_func = untrans_func
        self.cols = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = self._get_selection(x)
        return self.transform_func(x) if callable(self.transform_func) else x

    def inverse_transform(self, x):
        return self.inverse_transform_func(x) \
            if callable(self.inverse_transform_func) else x

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols

Each class you intend to put in a pipeline or feature union should inherit from BaseEstimator and TransformerMixin. While the former makes your class accessible to hyperparameter search methods such as GridSearch, the latter supplies a robust fit_transform function. For the base classes the only mandatory functions are fit and transform, but with __init__ you can give your transformer some configuration at initialization, and inverse_transform enables the inverse transform call on the full pipeline, provided each element supports it. Furthermore, the get_feature_names function is crucial when you want the names of transformed features in a feature union to be accessible, which is a key feature as we’ll see later.

Let’s take a closer look at the content. This class gets a transformation and an inverse-transformation instruction for transform() and inverse_transform() respectively. Besides that, at init we can specify the particular columns that shall be transformed. Note that this can be arbitrarily adapted to your own needs. If you need all given columns processed, just omit the first line in the transform function.

So far so good. When expanding my dataset I often want the actual dataset to also be present in my feature union, which is not retained by default. We can work around this with an identity transformer. We can even use our new class SimpleTransformer for that - just pass None for trans_func and untrans_func.

A very basic feature union that expects the pandas dataframe format can look like this:

from sklearn.pipeline import FeatureUnion
import numpy as np

all_feature_names = ['Age', 'Gender', 'Height', 'Weight', 'y1', 'y2']

simple_union = FeatureUnion([('simple_trans_y',
                               SimpleTransformer(np.sqrt, np.square,
                                                 ['y1', 'y2'])
                              ),
                              ('identity',
                               SimpleTransformer(None, None,
                                ['Age', 'Gender', 'Height', 'Weight'])
                              )
                             ])

Don’t be confused by all those brackets - the FeatureUnion class just takes a list of tuples as argument, where each tuple consists of a name and a transformer. Here, a dataset is simply filtered by subsets of all_feature_names, and the columns y1 and y2 are transformed into their square roots (assume their content to be floating-point numbers). Well, the output of this union will still be a numpy matrix, but hold on, the magic is still to come. So far we have seen how to apply transformations to specific columns, but the dataframe format is not retained yet. Let’s examine a few more examples before we get to this next step.

Scaling Transformations

The classic. Scaling your dataset with a StandardScaler or MinMaxScaler is what data scientists do for a living, since a lot of linear models rely on this preprocessing in order to learn the latent patterns. But what if we want only specific columns to be scaled? What if I need separate scaling for my independent and dependent features, since I want to invert the scaling of my predictions later, which naturally comprise only the target variables?

class Scaler(BaseEstimator, TransformerMixin):
    """scales selected columns only with given scaler"""
    def __init__(self, scaler, columns):
        self.scaler = scaler
        self.cols = columns

    def fit(self, X, y=None):
        X = self._get_selection(X)
        self.scaler.fit(X, y)
        return self

    def transform(self, X):
        X = self._get_selection(X)
        return self.scaler.transform(X)

    def inverse_transform(self, X):
        return self.scaler.inverse_transform(X)

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols

This scaler class is pretty straightforward and calls, in each of its functions, the corresponding function of the scaler that was given during initialization. The difference to the SimpleTransformer is that we now also do something during fitting, but no surprises here. Build the feature union like this:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

scaling_union = FeatureUnion([('scaler_x',
                               Scaler(StandardScaler(),
                                      ['Age', 'Gender', 'Height', 'Weight'])),
                              ('scaler_y',
                               Scaler(StandardScaler(),
                                      ['y1', 'y2']))
                              ])

You get the idea.

Rolling Transformations

You work with time series data and want your dataset expanded with rolling statistics? No problem, just consider this:

class RollingFeatures(BaseEstimator, TransformerMixin):
    """This Transformer adds rolling statistics"""
    def __init__(self, columns, lookback=10):
        self.lookback = lookback
        self.cols = columns
        self.transformed_cols = None

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = self._get_selection(X)
        feat_d = {'std': X.rolling(self.lookback).std(),
                  'mean': X.rolling(self.lookback).mean(),
                  'sum': X.rolling(self.lookback).sum()
                  }
        for k in feat_d:
            feat_d[k].columns = \
                ['{}_rolling{}_{}'.format(c, self.lookback, k) for
                 c in X.columns]
        df = pd.concat(list(feat_d.values()), axis=1)
        self.transformed_cols = list(df.columns)
        return df

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.transformed_cols

Here, for the first time, we come up with completely new feature names. Note that we have a new class variable, transformed_cols, to keep track of the columns that were generated here.

Cleaning the DataFrame

As simple as this:

class DFCleaner(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        X.dropna(inplace=True)
        X.reset_index(drop=True, inplace=True)
        return X

By now you should have realized that the possibilities are endless. Every transformation you would apply to your data is ultimately convertible into a transformer class. Now let’s see how to put them into shape such that these chains are actually stackable.

Retain the DataFrame format for the output

For this challenge we can exploit a simple trick. The FeatureUnion class has a method called get_feature_names that exposes the feature names of each transformer, although its output is a numpy matrix. To work around the numpy output we can make each feature union a two-step pipeline, where the union denotes the first step and a transformer fetching the actual feature names represents the second step. Sounds crazy? Check this out:

class FeatureUnionReframer(BaseEstimator, TransformerMixin):
    """Transforms preceding FeatureUnion's output back into Dataframe"""
    def __init__(self, feat_union, cutoff_transformer_name=True):
        self.union = feat_union
        self.cutoff_transformer_name = cutoff_transformer_name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, np.ndarray)
        if self.cutoff_transformer_name:
            cols = [c.split('__')[1] for c in self.union.get_feature_names()]
        else:
            cols = self.union.get_feature_names()
        df = pd.DataFrame(data=X, columns=cols)
        return df

    @classmethod
    def make_df_retaining(cls, feature_union):
        """With this method a feature union will be returned as a pipeline
        where the first step is the union and the second is a transformer that
        re-applies the columns to the union's output"""
        return Pipeline([('union', feature_union),
                         ('reframe', cls(feature_union))])

This class does the job. The optional bool argument gives you the freedom to keep or cut off the prefix that a feature union prepends to each feature name, which is the name of the respective transformer. The classmethod makes things even more comfortable, as we don’t need to instantiate the class explicitly.

See the following example, where it all falls into place:

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np

x_feats = ['Age', 'Gender', 'Height', 'Weight']
y_feats = ['y1', 'y2']

featurize_union = make_union(SimpleTransformer(np.sqrt, np.square, y_feats),
                             SimpleTransformer(None, None, x_feats),
                             RollingFeatures(x_feats, lookback=10)
                             )
scaling_union = make_union(Scaler(StandardScaler(), x_feats),
                           Scaler(StandardScaler(), y_feats)
                          )

featurize_pipe = FeatureUnionReframer.make_df_retaining(featurize_union)
scaling_pipe = FeatureUnionReframer.make_df_retaining(scaling_union)

pipe = make_pipeline(featurize_pipe,
                     DFCleaner(),
                     scaling_pipe)

The output of pipe will be our dataset as a pandas DataFrame with all transformations applied. Awesome!
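To see the whole thing in action, here is a hedged usage sketch with an invented dummy dataframe whose columns match x_feats and y_feats above (any dataframe with these columns would do, assuming the transformer classes from this post are defined):

import pandas as pd

# dummy data purely for demonstration
n = 100
dummy_df = pd.DataFrame({
    'Age': np.random.randint(18, 80, n).astype(float),
    'Gender': np.random.randint(0, 2, n).astype(float),
    'Height': np.random.normal(175, 10, n),
    'Weight': np.random.normal(75, 15, n),
    'y1': np.random.rand(n),
    'y2': np.random.rand(n),
})

transformed = pipe.fit_transform(dummy_df)
print(type(transformed))   # a pandas DataFrame, not a numpy array
print(transformed.columns)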

Conclusion

The presented concepts are arbitrarily extensible and give full control over the transformations that are applied to the dataset, while the output is retained as a DataFrame. All key features remain, such as hyperparameter optimization with GridSearch or the avoidance of test-set leakage through fit() and transform().

Bonus

You want to train Keras LSTMs on time series data and need the dataset reordered for batch training? At the same time, the set must be convertible back for a comparison of the predictions with the actual target data? Fear not, this will help:

class ReSamplerForBatchTraining(BaseEstimator, TransformerMixin):
    """This transformer sorts the samples according to a
    batch size for batch training"""
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.indices, self.columns = [], []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        # cut the tail so that len(X) is a multiple of the batch size
        trunc_idx = len(X) % self.batch_size
        if trunc_idx > 0:
            X = X.iloc[:-trunc_idx, :]

        # reorder
        new_idcs = np.tile(np.arange(self.batch_size), len(X) //
                           self.batch_size)
        assert len(X) == new_idcs.shape[0], \
            "{} != {}".format(len(X), new_idcs.shape[0])
        X['new_idx'] = new_idcs
        X.sort_values(by='new_idx', ascending=True, inplace=True)
        self.indices = X.index
        X.reset_index(drop=True, inplace=True)
        X.drop(['new_idx'], axis=1, inplace=True)
        self.columns = X.columns
        return X

    def inverse_transform(self, X):
        # columns undefined
        inversed = pd.DataFrame(X, index=self.indices).sort_index()
        return inversed
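A quick usage sketch (batch size and dataframe are invented for illustration; 23 rows so the tail actually gets truncated):

# toy dataframe with 23 rows and a batch size of 8 -> 16 rows remain
toy = pd.DataFrame({'a': np.arange(23), 'b': np.arange(23) * 2.0})

resampler = ReSamplerForBatchTraining(batch_size=8)
reordered = resampler.transform(toy)
restored = resampler.inverse_transform(reordered.values)  # back in original order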

You’re welcome.


Auto-Downsizing data

There is a vast amount of data published on Kaggle Datasets that one can download and play around with for experimental purposes. However, more often than not this data is stored with data types that are overkill for the datum’s intended use.

Consider this contrived example:

>>> import pandas as pd
>>> import numpy as np

>>> df = pd.DataFrame({'name': ['Walter', 'David', 'Jamie', 'Kendra', 'Zoey'], 
                       'age': [28, 31, 54, 44, 51]},
                      dtype=np.float32)

>>> df
    age    name
0  28.0  Walter
1  31.0   David
2  54.0   Jamie
3  44.0  Kendra
4  51.0    Zoey

Problem

The numpy data type used here, float32, takes four bytes for every age datum stored in this mini data frame. Yet it is plausible that a feature such as ‘age’ will never get close to the upper limit of a single unsigned byte, that is, 255.

For small datasets there is no concern, but when working with Big Data and millions of entries in a database, the occupied disk and RAM space may get exhausted unreasonably and unnecessarily fast.

Converting the demonstrated example is as simple as this:

>>> df.age = df.age.astype(np.uint8)

Now, nobody’s got time to skim through a dataset of multiple gigabytes just to identify potential candidates for data type reduction. It would be natural to have an automatic function or class dealing with this reduction by simply checking min and max values.

Solution

The following mini class might come in handy for all reduction intentions. Please note that reducing data types based on min and max values assumes there won’t be any future data added to the dataset that could not fit into the newly converted data type.

import numpy as np
import pandas as pd

from joblib import Parallel, delayed

__AUTHOR__ = 'Kirgsn'

class Reducer:
    """
    Class that reduces a pandas dataframe's columns to the smallest
    sufficient numpy data types, given a dict of increasingly large
    candidate types, in order to save memory.
    """
    memory_scale_factor = 1024**2  # memory in MB

    def __init__(self, conv_table=None):
        """
        :param conv_table: dict with np.dtypes-strings as keys
        """
        if conv_table is None:
            self.conversion_table = \
                {'int': [np.int8, np.int16, np.int32, np.int64],
                 'uint': [np.uint8, np.uint16, np.uint32, np.uint64],
                 'float': [np.float16, np.float32, ]}
        else:
            self.conversion_table = conv_table

    def _type_candidates(self, k):
        for c in self.conversion_table[k]:
            i = np.iinfo(c) if 'int' in k else np.finfo(c)
            yield c, i

    def reduce(self, df, verbose=False):
        """Takes a dataframe and returns it with all data transformed to the
        smallest necessary types.

        :param df: pandas dataframe
        :param verbose: If True, outputs more information
        :return: pandas dataframe with reduced data types
        """
        ret_list = Parallel(n_jobs=-1)(delayed(self._reduce)
                                                (df[c], c, verbose) for c in
                                                df.columns)

        return pd.concat(ret_list, axis=1)

    def _reduce(self, s, colname, verbose):

        # skip NaNs
        if s.isnull().any():
            if verbose:
                print(colname, 'has NaNs - Skip..')
            return s

        # detect kind of type
        coltype = s.dtype
        if np.issubdtype(coltype, np.integer):
            conv_key = 'int' if s.min() < 0 else 'uint'
        elif np.issubdtype(coltype, np.floating):
            conv_key = 'float'
        else:
            if verbose:
                print(colname, 'is', coltype, '- Skip..')
            return s

        # find right candidate
        for cand, cand_info in self._type_candidates(conv_key):
            if s.max() <= cand_info.max and s.min() >= cand_info.min:

                if verbose:
                    print('convert', colname, 'to', str(cand))
                return s.astype(cand)

        # reaching this code is bad; probably there are inf or other extreme values
        print(("WARNING: {} " 
               "doesn't fit the grid with \nmax: {} "
               "and \nmin: {}").format(colname, s.max(), s.min()))
        print('Dropping it..')

This class can be used the following way:

import pandas as pd

dataset = pd.read_csv('path/to/data')
reducer = Reducer()
dataset = reducer.reduce(dataset)

The candidate data types can be controlled via conv_table, an optional argument of the class’ init function. The default tries all integer widths and, for floats, converts to at most float32 (never float64). A neat thing about this implementation: it even takes advantage of multiple CPU cores, when available, via the joblib python library.
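If, say, you don’t trust float16 for your use case, a custom table might look like this (a sketch of one possible configuration, not a recommendation):

import numpy as np

# keep the integer candidates, but never downcast floats below float32
custom_table = {'int': [np.int8, np.int16, np.int32, np.int64],
                'uint': [np.uint8, np.uint16, np.uint32, np.uint64],
                'float': [np.float32, ]}

reducer = Reducer(conv_table=custom_table)
dataset = reducer.reduce(dataset)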

Conclusion

Dealing with inappropriate data types can now be a thing of the past by utilizing such reducing classes. Converting the dataset and saving it to disk again can save a considerable amount of disk space and, even more crucially, lessens the allocated RAM so that more actual data can be taken into account.