I spent most of the last two weeks learning ROOT tools and exploring ways to showcase the VarTransform method with plots. The idea that appealed to me the most was a histogram: display the variance of each variable and mark which variables were selected and which were rejected. Below is the kind of plot I had in mind:
To draw this variance histogram, one needs the variance of each variable, which is not directly accessible in TMVA because it is calculated internally in the VarTransform() method. Hence, to give the user more freedom, I tweaked TMVA a bit.
This is how my thought process went:
Currently, the variance of each variable is calculated only inside the VarTransform method and is not stored anywhere. TMVA has a class, VariableInfo, which stores all the necessary information about variables: one can set the min, max, mean and RMS of each variable and read them back wherever needed. It seemed like the perfect place to add a new pair of set and get methods for the variance, including GetVariance(), since it already has such Set and Get methods for the other norm parameters. After adding these methods, I changed my VarTransform method to set the variance of each variable once it has been calculated. However, I still could not access the variances, because DefaultDataSetInfo() (a method of the DataLoader class) is private. Since the user should be able to get all the details about the dataset that TMVA calculates internally, I added a method GetDataSetInfo() to the DataLoader class, which returns a DataSetInfo object. With these two changes in place, I was able to access the variance of each variable.
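To make these two changes concrete, here is a minimal, self-contained sketch of the idea. It does not use ROOT: the classes below are simplified stand-ins that only borrow the TMVA class names, and the setter name SetVariance() as well as all signatures are assumptions made for illustration, not the actual TMVA code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for TMVA's VariableInfo (not the real class).
class VariableInfo {
public:
    explicit VariableInfo(std::string expression)
        : fExpression(std::move(expression)) {}

    const std::string& GetExpression() const { return fExpression; }

    // Accessors in the style of the existing norm parameters.
    void   SetMean(double mean) { fMean = mean; }
    double GetMean() const { return fMean; }

    // New accessors for the variance (SetVariance is an assumed name).
    void   SetVariance(double variance) { fVariance = variance; }
    double GetVariance() const { return fVariance; }

private:
    std::string fExpression;
    double fMean = 0.0;
    double fVariance = 0.0;
};

// Simplified stand-in for DataSetInfo: it just owns the variables.
class DataSetInfo {
public:
    void AddVariable(const std::string& expression) {
        fVariables.emplace_back(expression);
    }
    std::size_t GetNVariables() const { return fVariables.size(); }
    VariableInfo& GetVariableInfo(std::size_t i) {
        assert(i < fVariables.size());
        return fVariables[i];
    }

private:
    std::vector<VariableInfo> fVariables;
};

// Simplified stand-in for DataLoader.
class DataLoader {
public:
    void AddVariable(const std::string& expression) {
        DefaultDataSetInfo().AddVariable(expression);
    }

    // New public accessor that exposes the otherwise private info object.
    DataSetInfo& GetDataSetInfo() { return DefaultDataSetInfo(); }

private:
    DataSetInfo& DefaultDataSetInfo() { return fInfo; }
    DataSetInfo fInfo;
};
```

With this shape, user code can reach a stored variance through `loader.GetDataSetInfo().GetVariableInfo(i).GetVariance()` once some earlier step has set it.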
But there is still an issue that needs to be handled. The variance of each variable is set only when the VarTransform method is called. Ideally, the user would first like to know the variance of each variable, and might want to analyse the dataset with plots, before specifying the threshold for selecting variables. Hence, to calculate and set the norm parameters of the variables (mean, variance, etc.), I created a new method CalcNorm() and call it from the VarTransform() method whenever necessary. The user can now call CalcNorm() directly to get an idea of the mean, variance, etc. of each variable.
VarTransform() can then be applied later if some variables need to be selected. I also added functionality to calculate the variance of both variables and targets.
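The split between "compute the norm parameters first, select variables later" can be sketched as follows. This is a standalone illustration, not the TMVA implementation: the NormParams struct, the free-function form of CalcNorm(), the SelectByVariance() helper, the use of the population variance (dividing by N), and the strict "greater than threshold" rule are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the per-variable norm parameters.
struct NormParams {
    double min = 0.0;
    double max = 0.0;
    double mean = 0.0;
    double variance = 0.0;
};

// events[i][j] is the value of variable j in event i.
// Computes min, max, mean and population variance of every variable
// in a single pass per variable (Welford's running update).
std::vector<NormParams> CalcNorm(const std::vector<std::vector<double>>& events) {
    assert(!events.empty());
    const std::size_t nVars = events.front().size();
    std::vector<NormParams> params(nVars);

    for (std::size_t j = 0; j < nVars; ++j) {
        NormParams& p = params[j];
        p.min = p.max = events.front()[j];
        double m2 = 0.0;    // running sum of squared deviations from the mean
        std::size_t n = 0;  // number of events seen so far

        for (const auto& event : events) {
            assert(event.size() == nVars);
            const double x = event[j];
            p.min = std::min(p.min, x);
            p.max = std::max(p.max, x);
            ++n;
            const double delta = x - p.mean;
            p.mean += delta / static_cast<double>(n);
            m2 += delta * (x - p.mean);
        }
        p.variance = m2 / static_cast<double>(n);
    }
    return params;
}

// Returns the indices of the variables whose variance is above the threshold.
std::vector<std::size_t> SelectByVariance(const std::vector<NormParams>& params,
                                          double threshold) {
    std::vector<std::size_t> selected;
    for (std::size_t j = 0; j < params.size(); ++j) {
        if (params[j].variance > threshold) selected.push_back(j);
    }
    return selected;
}
```

A user would call `CalcNorm(events)` first, inspect or plot the variances, and only then pick a threshold to pass to `SelectByVariance()`.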
In addition to the above, I wanted to print the output of CalcNorm() in tabular form so that it looks neat and readable. Because the variable expressions have different lengths, the output printed to std::cout came out misaligned. To line up the table columns, I needed the maximum length of the variable and target names, so I added two new methods to the DataSetInfo class, including GetTargetNameMaxLength(); the class already had a similar method for the maximum length of the class names.
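The alignment trick can be sketched in a few lines of standard C++. This is only an illustration of the idea, not the code in the branch: VariableSummary, MaxNameLength(), FormatNormTable() and the fixed 12-character numeric columns are hypothetical names and choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical row of the summary table.
struct VariableSummary {
    std::string name;
    double mean;
    double variance;
};

// Length of the longest variable name; plays the role of a
// "name max length" helper.
std::size_t MaxNameLength(const std::vector<VariableSummary>& rows) {
    std::size_t maxLen = 0;
    for (const auto& row : rows) maxLen = std::max(maxLen, row.name.size());
    return maxLen;
}

// Builds the table as a string; the name column is padded to the longest
// name (or to the 8-character header "Variable" if that is longer).
std::string FormatNormTable(const std::vector<VariableSummary>& rows) {
    const int nameWidth =
        static_cast<int>(std::max<std::size_t>(MaxNameLength(rows), 8));

    std::ostringstream out;
    out << std::left << std::setw(nameWidth) << "Variable" << " | "
        << std::right << std::setw(12) << "Mean" << " | "
        << std::setw(12) << "Variance" << '\n';

    out << std::fixed << std::setprecision(4);
    for (const auto& row : rows) {
        out << std::left << std::setw(nameWidth) << row.name << " | "
            << std::right << std::setw(12) << row.mean << " | "
            << std::setw(12) << row.variance << '\n';
    }
    return out.str();
}
```

Printing is then just `std::cout << FormatNormTable(rows);`. Because every row pads the name to the same width, the separators line up regardless of how long each variable expression is.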
The following diagram sums up the changes:
All these changes can be viewed in the commit history of this new branch I created: get-variance. This notebook demonstrates the above updates.