Many capital investment experts say there are two basic things that lead to a high-performing portfolio:
A risk-balanced, diversified portfolio and
Periodic asset redeployment
The former has no one-size-fits-all answer and should be tailored to each investor individually.
Generally speaking, higher-risk assets can make up a larger share of younger savers' total investment, since they have more time to sit out adverse market trends, whereas the older generation should stick to steadier securities. For a good overview of different portfolio types see Investopedia's guideline on popular portfolio types. Diversification, however, is comfortably achieved with passively managed funds such as ETFs, which come with large cost savings and outperform actively managed funds almost all the time - the Economist confirms.
The second virtue of successful stock investing, on the other hand, is dead simple:
Rebalance your capital as often as you feel comfortable with: at least once a year, or as soon as your asset weightings drift apart drastically, say by 5% in one asset weighting.
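For instance, a quick drift check takes only a few lines of Python; the balances below and the 5% threshold are made up for illustration and are not part of the script presented later:

# hypothetical balances and target weights; 5% is a personal threshold
balances = {'Stocks': 8000, 'Bonds': 12000}
targets = {'Stocks': 0.3, 'Bonds': 0.7}

total = sum(balances.values())
drifted = {asset: balances[asset] / total - targets[asset]
           for asset in balances
           if abs(balances[asset] / total - targets[asset]) > 0.05}
print(drifted)  # assets whose weight is off by more than 5 percentage points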
What is Asset Redeployment?
Redeploying capital means transferring money to your different assets in such a way that each asset again holds the share you initially assigned to it.
Imagine, for example, you want 30% of your capital in stocks and the other 70% in bonds, because that is a risk apportionment you can live with.
The next day the stock market booms and you see your stock capital soaring to undreamed-of heights.
For a moment you jump for joy, but then you halt: one glance at the new capital weights makes you shiver, because stocks and bonds are now fifty-fifty.
Your risk profile is suddenly highly imbalanced and you are burning for action: a money transfer is necessary.
You go ahead and deposit the required amount into your bond asset.
Nothing simpler than that!
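To attach hypothetical numbers to this two-asset example: say stocks jumped from 30,000 EUR to 70,000 EUR while bonds stayed at 70,000 EUR; the deposit that restores the 70% bond weighting without selling anything can be computed like this:

stocks, bonds = 70000.0, 70000.0   # hypothetical balances after the boom
target_bond_weight = 0.7

# total capital needed so that the unchanged stock position is again 30%
required_total = stocks / (1 - target_bond_weight)
deposit_into_bonds = required_total * target_bond_weight - bonds
print(round(deposit_into_bonds, 2))  # -> 93333.33 EUR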
But hold on!
I don’t have two assets, I have five or six!
Besides that, I don't put large sums into my portfolio at once; I make small monthly payments.
Calculating the amounts needed for multiple assets that drift in different directions is non-trivial and gave me headaches in the beginning.
It soon became clear that this is best solved iteratively with an algorithm.
The following Python script shows how it can be done.
The Algorithm
"""Mapping for current balance on each kind of capital and the desired
weight. Dict with identifier keys and dict-values that have current and
target pairs."""CURRENT_TARGET_DICT={'Stocks':{'actual':6752,'target_weight':0.3},'Commodities':{'actual':3448,'target_weight':0.15},'Long-Term Bonds':{'actual':8360,'target_weight':0.4},'Mid-Term Bonds':{'actual':3089,'target_weight':0.15}}# The monthly rate that is remitted and split among all accounts
MONTHLY_RATE=500defmain():max_months_until_targets_met=24total_capital=sum(v['actual']fork,vinCURRENT_TARGET_DICT.items())redeployment_split_plan={}# find amount of months until portfolio targets met
for_minrange(1,max_months_until_targets_met+1):further_increase_m=Falseforacc,current_dictinCURRENT_TARGET_DICT.items():rate= \
(total_capital+_m*MONTHLY_RATE)*current_dict['target_weight']\
-current_dict['actual']ifabs(rate)<=_m*MONTHLY_RATEandnotrate<0:# current amount of months is feasible for this account
redeployment_split_plan[acc]={'total_required':rate}continueelse:# too few months to meet portfolio targets
further_increase_m=Truebreakiffurther_increase_m:continueelse:total_remit=sum(v['total_required']fork,vinredeployment_split_plan.items())required_months=int(total_remit//MONTHLY_RATE)# calculate schedule
foracc,planinredeployment_split_plan.items():rate= \
(plan['total_required']/total_remit)*MONTHLY_RATEplan['schedule']=[rate,]*required_months# schedule calculation finished for all accounts
breakelse:raiseValueError('ERR: No schedule found within the next {} months'.format(max_months_until_targets_met))report_schedule(redeployment_split_plan,total_capital)defreport_schedule(plan,_total_capital):print('## Redeployment Plan ##\n')foracc,infoinplan.items():print(acc)print('Current ratio: {:.3f} - target: {:.3f}'.format(CURRENT_TARGET_DICT[acc]['actual']/_total_capital,CURRENT_TARGET_DICT[acc]['target_weight']))print('Total Required: {:.2f} EUR'.format(info['total_required']))print('Remit {:.2f} EUR for {:} months'.format(info['schedule'][0],len(info['schedule'])))print('After that, continue with a rate of {:.2f} EUR\n'.format(MONTHLY_RATE*CURRENT_TARGET_DICT[acc]['target_weight']))if__name__=='__main__':main()
The user only has to adapt CURRENT_TARGET_DICT and MONTHLY_RATE at the beginning of the script to his or her own investment situation.
With the dummy values shown in the script, the following output is to be expected:
## Redeployment Plan ##
Stocks
Current ratio: 0.312 - target: 0.300
Total Required: 192.70 EUR
Remit 64.23 EUR for 3 months
After that, continue with a rate of 150.00 EUR
Long-Term Bonds
Current ratio: 0.386 - target: 0.400
Total Required: 899.60 EUR
Remit 299.87 EUR for 3 months
After that, continue with a rate of 200.00 EUR
Commodities
Current ratio: 0.159 - target: 0.150
Total Required: 24.35 EUR
Remit 8.12 EUR for 3 months
After that, continue with a rate of 75.00 EUR
Mid-Term Bonds
Current ratio: 0.143 - target: 0.150
Total Required: 383.35 EUR
Remit 127.78 EUR for 3 months
After that, continue with a rate of 75.00 EUR
Why not withdraw from overweight assets?
The general approach to rebalancing the asset weights is to remit positive amounts to each asset in an appropriate ratio.
The reason is that withdrawing money from funds can realize taxable profits, which would impair your earnings, and the moment you rebalance is seldom the right moment to sell shares with regard to the market trend.
Mathematical Background
The approach is straightforward and will be illustrated with two hypothetical assets.
With x1 and x2 being the unknown rates to remit to assets one and two, m being the unknown number of months required, B being the total monthly rate that can be remitted, S1 and S2 being the percentual target shares for assets one and two, and K1 and K2 being the current balances, we have the following equations:

x1 = S1 * (K1 + K2 + m * B) - K1    (1)
x2 = S2 * (K1 + K2 + m * B) - K2    (2)

In words: after m months the total capital is K1 + K2 + m * B, each asset should then hold its target share of that total, and the rate to remit is the difference between this target amount and the current balance. Since S1 + S2 = 1, the two rates automatically add up to the total budget m * B.
With eq. (1) and (2) we have the formulas for the amount to remit to each asset as soon as we have a value for the unknown m (number of months).
Since negative rates are not allowed, the algorithm will count up m and attempt to get non-negative rates for each asset until a proper amount of months is found.
Conclusion
Once or twice a year, when it is time to rebalance your portfolio, this script helps you find the remittance values required to make your risk profile healthy again.
Please feel free to convert this algorithm to your preferred programming language.
If you want to read more on successfully building up a portfolio you can follow up on this Investopedia article.
Every now and then, transforming data column by column, e.g. by adding the log of each floating-point feature to the dataset, is not enough and more ambitious transformations are desired.
Perhaps you would like a certain feature transformed depending on the value the corresponding observation holds in another feature.
Pandas aggregate function
The pandas library comes with a highly optimized aggregation function that covers the most common statistical summaries by default.
The agg function (short for aggregation) takes either strings of known function names, such as min or sum, or homebrewed custom aggregation functions.
One could obtain these statistics by other means, but the pandas aggregation is nevertheless worth a try since it runs on a C implementation in the background, making it very fast.
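As a small sketch with a made-up data frame, agg accepts both function names and custom callables:

import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b'], 'value': [1.0, 3.0, 5.0]})

# built-in aggregations by name plus a homebrewed one
summary = df.groupby('group')['value'].agg(['min', 'sum',
                                            lambda s: np.log(s).mean()])
print(summary)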
Categorical features with high cardinality
Imagine you are given a categorical feature with a very high cardinality, say, the postal code of a citizen encoded into 200 integer numbers from 0 to 199.
Postal codes have no greater-than/less-than relationship, but the enumeration suggests one.
While decision-tree-based models are capable of ignoring the artificial ordering of categorical feature encodings, ordinary linear models fall for it.
The result is over-fitting.
>>> # Assume 10 postal codes for this demo
>>> df['PostalCode'] = np.asarray(list(range(10)))
>>> df
                   A         B         C  D  PostalCode
2000-01-01  0.506738  0.063401 -0.864185  0           0
2000-01-02  0.067840  0.272242 -1.425997  1           1
2000-01-03  0.371210  0.558194 -0.151807  0           2
2000-01-04 -0.363128 -0.882355  1.254074  1           3
2000-01-05  0.735415 -0.072763 -0.634464  0           4
2000-01-06  0.796265  1.525935  0.086280  1           5
2000-01-07 -0.048933  0.178912 -0.039579  0           6
2000-01-08 -0.730741 -1.026607  1.095198  1           7
2000-01-09 -1.649780  1.245154 -0.610097  0           8
2000-01-10  0.003570 -0.132191 -0.196222  1           9
One Hot Encoding
A popular workaround is so-called one-hot encoding, that is, converting a single categorical feature with n distinct values into n boolean features.
For low-cardinality features, such as gender (male or female), this is a pretty neat thing, but for our postal code we end up with sparse features that bloat RAM and may even hinder statistical models from learning latent patterns as the feature space explodes.
>>> # One hot encode
>>> encoded_df = pd.get_dummies(df['PostalCode'], prefix='postalcode_')
>>> encoded_df
            postalcode__0  postalcode__1  postalcode__2  postalcode__3  \
2000-01-01              1              0              0              0
2000-01-02              0              1              0              0
2000-01-03              0              0              1              0
2000-01-04              0              0              0              1
2000-01-05              0              0              0              0
2000-01-06              0              0              0              0
2000-01-07              0              0              0              0
2000-01-08              0              0              0              0
2000-01-09              0              0              0              0
2000-01-10              0              0              0              0

            postalcode__4  postalcode__5  postalcode__6  postalcode__7  \
2000-01-01              0              0              0              0
2000-01-02              0              0              0              0
2000-01-03              0              0              0              0
2000-01-04              0              0              0              0
2000-01-05              1              0              0              0
2000-01-06              0              1              0              0
2000-01-07              0              0              1              0
2000-01-08              0              0              0              1
2000-01-09              0              0              0              0
2000-01-10              0              0              0              0

            postalcode__8  postalcode__9
2000-01-01              0              0
2000-01-02              0              0
2000-01-03              0              0
2000-01-04              0              0
2000-01-05              0              0
2000-01-06              0              0
2000-01-07              0              0
2000-01-08              0              0
2000-01-09              1              0
2000-01-10              0              1
Alternatives to One-Hot-Encoding
Great effort has been put into finding alternatives to one-hot encoding in order to overcome its drawbacks, especially for high-cardinality features.
To mention a few, there are the supervised ratio and the weight of evidence.
Another one worth mentioning is described in this paper, which I prefer to name after its author, Micci-Barreca.
These transformations were especially useful for the recent Porto Seguro Kaggle Competition, where high-cardinality features were present.
The idea is to transform the categorical values to continuous values corresponding to their target frequency.
For example, if the categorical value 2 very often comes with the boolean target True or 1, then all observations with a categorical 2 will be transformed to e.g. 0.97.
Please note that these transformations are heavily dependent on the target distribution over the specific categorical feature.
Since we are introducing target information into our independent variables, we might run into data leakage, where we leak information from the very target variable we are trying to predict.
In the worst case we’ll get over-fitting.
Again, there are several ways to mitigate that, for instance, by adding noise to the transformations.
Custom Functions for Pandas aggregate
Now how can we apply those alternatives?
Let’s consider column D to be our target and we want to transform the PostalCode.
For a more representative example, we now distribute D and the PostalCode randomly over 10,000 observations.
We will apply the weight-of-evidence and Micci-Barreca transformations.
import numpy as np
import pandas as pd


def _woe(s, tp, tn):
    """Weight of evidence

    woe_i = ln(P_i/TP) - ln(N_i/TN)

    :param s: pandas groupby obj
    :param tp: total positives in full series (target prior)
    :param tn: total negatives in full series
    """
    p = s.sum()
    nom = p / tp
    den = (s.count() - p) / tn
    return np.log(nom / den)


def _micci_barreca_encode(s, tp, min_samples_leaf=1, smoothing=1):
    """Micci Barreca encoding

    This transformation outputs something between supervised ratio and target
    prior, depending on smoothing level.

    :param s: pandas groupby obj
    :param tp: total positives in full series
    """
    smoothing = \
        1 / (1 + np.exp(-(s.count() - min_samples_leaf) / smoothing))
    return tp * (1 - smoothing) + s.mean() * smoothing


if __name__ == '__main__':
    n_observations = 10000
    df = pd.DataFrame(
        {'D': np.random.randint(2, size=n_observations),
         'PostalCode': np.random.randint(10, size=n_observations)})
    print('Original Table:')
    print(df.head(10))

    target_prior = df['D'].sum()
    target_size = df['D'].count()
    aggregation_agenda = \
        {'_woe': lambda x: _woe(x, target_prior,
                                target_size - target_prior),
         '_micci': lambda x: _micci_barreca_encode(x, target_prior,
                                                   min_samples_leaf=100,
                                                   smoothing=10),
         }
    col = 'PostalCode'
    transformed_df = \
        df.groupby([col], as_index=False).D \
          .agg(aggregation_agenda) \
          .rename(columns={agg_key: col + agg_key
                           for agg_key in aggregation_agenda.keys()})
    print('Transformation/Mapping table:')
    print(transformed_df.head(10))

    df = df.merge(transformed_df, how='left', on=col)
    print('Merged Table:')
    print(df.head(10))
After creating the artificial dataset, the target's prior and size are computed, which is necessary for both the weight-of-evidence and the Micci-Barreca encoding.
We summarize the transformations in an aggregation_agenda, which is then used by the pandas agg function to compile the output.
Note the use of the groupby function, which gives us the target values grouped per categorical value.
The last step is merging the two tables, i.e. the original and the transformed features.
The above code prints the original table, the transformation/mapping table and the merged table.
Do not forget to drop the original PostalCode column when working with the merged table!
Conclusion
We have seen how pandas can be leveraged to create new features from high-cardinality categorical features, going beyond simple statistics such as mean or standard deviation.
It should be clear that one-hot encoding is not always the best choice and that alternative approaches are available.
Getting familiar with the concepts of agg, groupby and merge enables high-performance data analyses with just a few lines of code.
When dealing with a lot of labeled data, sooner or later you cannot get around using the great pandas library.
Its benefits over raw numpy are obvious.
However, pandas is a library that came up some time after numpy.
The downside: some other great tools started growing immensely without pandas at hand, so they had to be built upon numpy.
With that I am talking about sklearn and in particular their awesome mechanisms for pipelines and feature unions.
While most people involved in data science know what sklearn is, few have actually worked with its pipelines and feature unions.
One reason for this collectively unused potential might be the lack of pandas support.
Here’s a brief refresher on pipelines. Creating a pipeline is as simple as this:
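A minimal sketch, assuming a transformation-only pipeline with made-up step names:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# two chained transformation steps
pipe = Pipeline([('standardize', StandardScaler()),
                 ('rescale', MinMaxScaler())])

X = np.random.rand(100, 3)
X_transformed = pipe.fit_transform(X)   # fit() and transform() in one call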
The fit_transform function is just a consecutive call of fit() and transform(), with the same syntax as sklearn's famous regression and classification models.
You get the idea of pipelines.
Unfortunately there is a considerable drawback.
Problem
The objects returned by pipelines, and especially by feature unions, are numpy arrays. This is partly due to the internals of pipelines and partly due to the elements of the pipeline themselves, that is, sklearn's statistical models and transformers such as StandardScaler.
When you rely on your transformed dataset retaining the pandas DataFrame structure, e.g. because your model does column-specific processing, there is no out-of-the-box solution.
Please note that there is a wrapper available called sklearn-pandas.
Before I came up with the following solution I tried that wrapper, but I felt it has some implausible limitations, such as no pandas output for default transformations and, among other things, a rather cluttered syntax.
I somewhat lost control over pipelines when they were mixed with feature unions, so special handling was no longer feasible.
Solution
Facing certain challenges in my own machine learning routines that I could not tackle with the aforementioned wrapper, I found that a homebrew approach is not as complicated as it might seem.
Moreover, with the following blueprints I gained full control over each element in a pipeline, pipelines can be arbitrarily mixed with feature unions, and I still get a pandas DataFrame as the returned object.
Let’s dive right into it!
Transformation Chains
Each pipeline and feature union consists of chained elements that work on top of each other or in parallel, hand in hand.
Here’s a compiled list of some transformers that are pandas-aware.
Simple Transformations
Imagine you want your dataset to be expanded with transformed copies of the data itself, e.g. its square root or its log.
You would approach this with a simple feature union, but what are its elements? Examine the following:
# imports assumed for all transformer snippets in this post
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SimpleTransformer(BaseEstimator, TransformerMixin):
    """Apply given transformation."""

    def __init__(self, trans_func, untrans_func, columns):
        self.transform_func = trans_func
        self.inverse_transform_func = untrans_func
        self.cols = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        x = self._get_selection(x)
        return self.transform_func(x) if callable(self.transform_func) else x

    def inverse_transform(self, x):
        return self.inverse_transform_func(x) \
            if callable(self.inverse_transform_func) else x

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols
Each class you intend to put in a pipeline or feature union should inherit from BaseEstimator and TransformerMixin.
While the former makes your class accessible to hyperparameter search methods such as GridSearchCV, the latter provides a robust fit_transform function.
For the base classes the only mandatory functions are fit and transform, but with __init__ you can give your transformer some configuration at initialization, and inverse_transform enables the inverse transform call on the full pipeline if each element supports it. Furthermore, the get_feature_names function is crucial when you want the names of the transformed features in a feature union to be accessible, which is a key feature as we'll see later.
Let’s take a closer look at the content.
This class receives a transformation and an inverse-transformation function, used by transform() and inverse_transform() respectively.
Besides that, at init we can specify the columns that should be transformed.
Note that this can be arbitrarily adapted to your own needs.
If you want all given columns to be processed, just omit the first line of the transform function.
So far so good.
When expanding my dataset I often want the original data itself to be present in the feature union as well, which is not the case by default.
We can work around this by having an identity transformer.
We can even use our new class SimpleTransformer for that - just pass None for trans_func and untrans_func.
A very basic feature union that expects the pandas dataframe format can look like this:
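A minimal sketch, assuming hypothetical column names y1, y2 and y3 collected in all_feature_names:

import numpy as np
from sklearn.pipeline import FeatureUnion

all_feature_names = ['y1', 'y2', 'y3']   # hypothetical column names

union = FeatureUnion([
    # keep the original columns untouched (identity transformer)
    ('identity', SimpleTransformer(None, None, columns=all_feature_names)),
    # add the square root of y1 and y2 as additional features
    ('sqrt', SimpleTransformer(np.sqrt, np.square, columns=['y1', 'y2'])),
])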
Don’t be confused by all those brackets - the FeatureUnion class just takes as argument a list of tuples where each tuple consists of a name and a transformer.
Here, a dataset is simply filtered by subsets of all_feature_names where the columns y1 and y2 are transformed into their square root (assume their content to be floating point numbers).
Well, the output of this union will still be a numpy matrix, but hold on, the magic is still to come.
So far we have seen how to apply transformations to specific columns, but the DataFrame format is not yet retained.
Let’s examine a few more examples before we get to this next step.
Scaling Transformations
The classic. Scaling your dataset with a StandardScaler or MinMaxScaler is what data scientists do for a living, since many linear models rely on this preprocessing to learn the latent patterns.
But what if we want only specific columns to be scaled?
What if I need separate scalers for my independent and dependent features because I later want to invert the scaling of my predictions, which naturally comprise only the target variables?
class Scaler(BaseEstimator, TransformerMixin):
    """scales selected columns only with given scaler"""

    def __init__(self, scaler, columns):
        self.scaler = scaler
        self.cols = columns

    def fit(self, X, y=None):
        X = self._get_selection(X)
        self.scaler.fit(X, y)
        return self

    def transform(self, X):
        X = self._get_selection(X)
        return self.scaler.transform(X)

    def inverse_transform(self, X):
        return self.scaler.inverse_transform(X)

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.cols
This scaler class is pretty straightforward: each of its methods calls the corresponding method of the scaler that was given during initialization.
The difference from the SimpleTransformer is that we now also do something during fitting, but there are no surprises here.
Build the feature union like this:
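A minimal sketch with assumed column names x1, x2 and y1, each selection getting its own scaler instance:

from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler, MinMaxScaler

union = FeatureUnion([
    # scale the (hypothetical) feature columns with a StandardScaler
    ('scale_features', Scaler(StandardScaler(), columns=['x1', 'x2'])),
    # scale the (hypothetical) target column separately with a MinMaxScaler
    ('scale_targets', Scaler(MinMaxScaler(), columns=['y1'])),
])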
You work with time series data and want your dataset expanded with rolling statistics?
No problem, just consider this:
class RollingFeatures(BaseEstimator, TransformerMixin):
    """This Transformer adds rolling statistics"""

    def __init__(self, columns, lookback=10):
        self.lookback = lookback
        self.cols = columns
        self.transformed_cols = None

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = self._get_selection(X)
        feat_d = {'std': X.rolling(self.lookback).std(),
                  'mean': X.rolling(self.lookback).mean(),
                  'sum': X.rolling(self.lookback).sum()}
        for k in feat_d:
            feat_d[k].columns = \
                ['{}_rolling{}_{}'.format(c, self.lookback, k)
                 for c in X.columns]
        df = pd.concat(list(feat_d.values()), axis=1)
        self.transformed_cols = list(df.columns)
        return df

    def _get_selection(self, df):
        assert isinstance(df, pd.DataFrame)
        return df[self.cols]

    def get_feature_names(self):
        return self.transformed_cols
Here, for the first time, we come up with completely new feature names.
Note that we have a new instance variable, transformed_cols, to keep track of the columns generated here.
By now you should have realized that the possibilities are endless.
Every transformation you would apply on your data is ultimately convertible into a transformer class.
Now let’s see how to put them into shape such that these chains are actually stackable.
Retain the DataFrame format for the output
For this challenge we can exploit the following simple trick.
The FeatureUnion class has a method called get_feature_names that exposes the feature names of each transformer, even though the union's output is a numpy matrix.
To work around the numpy output, we can wrap each feature union in a two-step pipeline: the union is the first step, and a transformer that re-applies the actual feature names is the second.
Sounds crazy?
Check this out:
class FeatureUnionReframer(BaseEstimator, TransformerMixin):
    """Transforms preceding FeatureUnion's output back into Dataframe"""

    def __init__(self, feat_union, cutoff_transformer_name=True):
        self.union = feat_union
        self.cutoff_transformer_name = cutoff_transformer_name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, np.ndarray)
        if self.cutoff_transformer_name:
            cols = [c.split('__')[1] for c in self.union.get_feature_names()]
        else:
            cols = self.union.get_feature_names()
        df = pd.DataFrame(data=X, columns=cols)
        return df

    @classmethod
    def make_df_retaining(cls, feature_union):
        """With this method a feature union will be returned as a pipeline
        where the first step is the union and the second is a transformer that
        re-applies the columns to the union's output"""
        return Pipeline([('union', feature_union),
                         ('reframe', cls(feature_union))])
This class does the job.
The optional bool argument gives you the freedom to keep the default prefix of feature union chains, which is the name of the transformer.
The class method makes things even more comfortable, as we do not need to instantiate the class explicitly.
See the following example where everything falls into place:
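A minimal sketch with made-up column names, combining the transformers from above:

import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# made-up demo data
df = pd.DataFrame(np.random.rand(50, 4), columns=['x1', 'x2', 'y1', 'y2'])

union = FeatureUnion([
    ('identity', SimpleTransformer(None, None, columns=['y1', 'y2'])),
    ('scaled', Scaler(StandardScaler(), columns=['x1', 'x2'])),
    ('rolling', RollingFeatures(columns=['x1'], lookback=10)),
])

# wrap the union so that its numpy output is reframed into a DataFrame again
pipe = FeatureUnionReframer.make_df_retaining(union)

df_transformed = pipe.fit_transform(df)
print(df_transformed.head())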
The output of pipe will be our dataset as Pandas DataFrame with all transformations applied. Awesome!
Conclusion
The presented concepts are arbitrarily extensible and give full control over the transformations applied to the dataset, while the output is retained as a DataFrame.
All key features remain, such as hyperparameter optimization with GridSearchCV or the avoidance of test-set leakage through the fit()/transform() split.
Bonus
You want to train Keras LSTMs on time series data and you need the dataset reordered for batch training?
At the same time, the set must be convertible back for comparing the predictions with the actual target data?
Fear not, this will help:
class ReSamplerForBatchTraining(BaseEstimator, TransformerMixin):
    """This transformer sorts the samples according to a
    batch size for batch training"""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.indices, self.columns = [], []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        # cut the tail so that the length is a multiple of batch_size
        trunc_idx = len(X) % self.batch_size
        if trunc_idx:
            X = X.iloc[:-trunc_idx, :]
        # reorder
        new_idcs = np.tile(np.arange(self.batch_size),
                           len(X) // self.batch_size)
        assert len(X) == new_idcs.shape[0], \
            "{} != {}".format(len(X), new_idcs.shape[0])
        X['new_idx'] = new_idcs
        X.sort_values(by='new_idx', ascending=True, inplace=True)
        self.indices = X.index
        X.reset_index(drop=True, inplace=True)
        X.drop(['new_idx'], axis=1, inplace=True)
        self.columns = X.columns
        return X

    def inverse_transform(self, X):
        # columns undefined
        inversed = pd.DataFrame(X, index=self.indices).sort_index()
        return inversed
There is a vast amount of data published on Kaggle Datasets that one can download and play around with for experimental purposes.
However, more often than not this data is stored with data types that are overkill for the datum's intended use.
Consider this virtual example:
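A possible mini data frame with made-up values:

import numpy as np
import pandas as pd

# made-up mini data frame with the age stored as float32
df = pd.DataFrame({'age': np.array([25, 38, 41], dtype=np.float32)})
print(df.dtypes)   # age    float32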
Problem
The numpy data type used here, float32, takes four bytes of space for every age datum stored in this mini data frame.
Yet it is plausible that a feature such as 'age' will never get close to the upper limit of a single-byte unsigned integer, that is, 255.
For small datasets this is no concern, but when working with big data and millions of entries in a database, the occupied disk and RAM space gets exhausted unreasonably and unnecessarily fast.
Converting the demonstrated example is as simple as this:
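A one-liner, assuming the age column from the sketch above holds whole numbers within 0..255:

# downcast age from float32 to an unsigned single-byte integer
df['age'] = df['age'].astype(np.uint8)
print(df.dtypes)   # age    uint8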
Now, nobody has time to skim through a dataset of multiple GBs just to identify potential candidates for data type reduction.
It comes naturally to want an automatic function or class that handles this reduction by simply checking min and max values.
Solution
The following mini class might come in handy for all reduction purposes. Please note that reducing data types based on min and max values assumes no future data will be added to the dataset that could not fit into the newly converted data type.
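A rough, single-core sketch of such a reducer; the class name, method name and conv_table layout are my own assumptions, and the parallelized joblib variant mentioned below is not part of this sketch:

import numpy as np
import pandas as pd


class DataTypeReducer:
    """Sketch: downcast numeric columns based on their min/max values."""

    def __init__(self, conv_table=None):
        # candidate target dtypes, checked from smallest to largest
        self.conv_table = conv_table or {
            'int': [np.int8, np.int16, np.int32, np.int64],
            'uint': [np.uint8, np.uint16, np.uint32, np.uint64],
            'float': [np.float32],
        }

    def reduce(self, df):
        df = df.copy()
        for col in df.columns:
            col_type = df[col].dtype
            if np.issubdtype(col_type, np.integer):
                # unsigned types are only valid for non-negative columns
                candidates = (self.conv_table['uint'] if df[col].min() >= 0
                              else self.conv_table['int'])
                for cand in candidates:
                    info = np.iinfo(cand)
                    if df[col].min() >= info.min and df[col].max() <= info.max:
                        df[col] = df[col].astype(cand)
                        break
            elif np.issubdtype(col_type, np.floating):
                for cand in self.conv_table['float']:
                    info = np.finfo(cand)
                    if df[col].min() >= info.min and df[col].max() <= info.max:
                        df[col] = df[col].astype(cand)
                        break
        return df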
This class can be used the following way:
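With the sketch above (again, class and method names are assumptions), usage could look like this:

df = pd.DataFrame({'age': np.array([25, 38, 41], dtype=np.float32),
                   'clicks': np.array([12, 7, 30000], dtype=np.int64)})

reducer = DataTypeReducer()       # optionally pass a custom conv_table
reduced_df = reducer.reduce(df)
print(reduced_df.dtypes)          # age -> float32, clicks -> uint16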
The data types that columns may be converted to can be controlled via conv_table, an optional argument of the class' init function.
The default allows conversion to any integer type and, for floats, down to float32 at most.
A neat thing about the presented implementation: it even takes advantage of multiple CPU cores, when available, via the joblib Python library.
Conclusion
Dealing with inappropriate data types can now be a thing of the past by utilizing such reducing classes.
Converting the dataset and saving it to disk again can save a considerable amount of disk space and, even more importantly, reduces the allocated RAM so that more actual data can be taken into account.