Progress Update

I spent most of the last two weeks learning the ROOT tools and exploring ways to showcase the VarTransform method through plots. The idea that appealed to me most is a histogram: display the variance of each variable and mark which variables were selected and which were rejected. Below is the kind of plot I had in mind:

To draw this variance histogram, one needs the variance of each variable, which is not directly accessible in TMVA because it is calculated internally in the VarTransform() method. Hence, to give the user more freedom, I tweaked TMVA a bit.

This is how my thought process went:

Currently, the variance of each variable is calculated only inside the VarTransform method and is not stored anywhere. TMVA has a class, VariableInfo, which stores all the necessary information about variables: one can set the min, max, mean and RMS of each variable and read them back wherever they are needed. Since it already has Set and Get methods for the other norm parameters, it seemed the perfect place to add a new pair of methods, SetVariance() and GetVariance(). After adding these methods, I changed my VarTransform method to set the variance of each variable once it has been calculated. But I was still not able to access the variances, because DefaultDataSetInfo() (a method in the DataLoader class) is private. Since the user should be able to get all the details about the dataset that TMVA calculates internally, I added a method GetDataSetInfo() to the DataLoader class, which returns a DataSetInfo object. After making these two changes, I was able to access the variance of each variable.
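To make the shape of these two changes concrete, here is a minimal sketch. It does not use the real TMVA classes: VarInfoSketch and LoaderSketch are hypothetical stand-ins I made up for illustration, and only the names SetVariance() and GetVariance() come from the actual change.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TMVA::VariableInfo, reduced to the parts
// needed to illustrate the new variance accessors.
class VarInfoSketch {
public:
   explicit VarInfoSketch(std::string expression) : fExpression(std::move(expression)) {}

   const std::string &GetExpression() const { return fExpression; }

   // Example of an accessor pair of the kind that already existed.
   void SetMean(double mean) { fMean = mean; }
   double GetMean() const { return fMean; }

   // New accessor pair, mirroring the existing ones.
   void SetVariance(double variance)
   {
      assert(variance >= 0.0); // a variance can never be negative
      fVariance = variance;
   }
   double GetVariance() const { return fVariance; }

private:
   std::string fExpression;
   double fMean = 0.0;
   double fVariance = 0.0;
};

// Hypothetical stand-in for the loader: the point is only that the
// per-variable information is reachable through a public getter.
class LoaderSketch {
public:
   void AddVariable(const std::string &expression) { fVariables.emplace_back(expression); }
   std::vector<VarInfoSketch> &GetVariableInfos() { return fVariables; }

private:
   std::vector<VarInfoSketch> fVariables;
};
```

With something like this in place, the transformation code can call SetVariance() after computing each variance, and user code can read the value back through the loader's public getter.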

But there is still an issue that needs to be handled. The variance of each variable is set only when the VarTransform method is called. Ideally, the user would first like to know the variance of each variable, and might want to analyse the dataset through plots, before specifying the threshold for selecting variables. Hence, to calculate and set the norm parameters of the variables (mean, variance, etc.), I created a new method CalcNorm() and call it from the VarTransform() method whenever necessary. Now the user can call CalcNorm() directly to get an idea of the mean, variance, etc. of each variable. VarTransform() can be applied later if some variables need to be selected. I also added functionality to calculate the variance of variables and targets in the TMVA::VariableTransformBase::CalcNorm() method.
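The kind of computation a CalcNorm()-style pass performs can be sketched as follows. This is not the TMVA code: NormSketch, CalcNormSketch and SelectByVariance are hypothetical names, the variance here is the population variance (dividing by N), and a variable is kept only if its variance is strictly greater than the threshold. Both of those choices are my assumptions for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the norm parameters of one variable.
struct NormSketch {
   double min;
   double max;
   double mean;
   double variance;
};

// Compute min, max, mean and population variance of one variable's values.
NormSketch CalcNormSketch(const std::vector<double> &values)
{
   assert(!values.empty()); // the statistics are undefined for no events

   double sum = 0.0;
   double lo = values.front();
   double hi = values.front();
   for (double v : values) {
      sum += v;
      lo = std::min(lo, v);
      hi = std::max(hi, v);
   }
   const double mean = sum / static_cast<double>(values.size());

   // Second pass: average squared deviation from the mean.
   double squaredDev = 0.0;
   for (double v : values) {
      squaredDev += (v - mean) * (v - mean);
   }
   const double variance = squaredDev / static_cast<double>(values.size());

   return {lo, hi, mean, variance};
}

// Return the indices of variables whose variance exceeds the threshold
// (strict comparison is an assumption of this sketch).
std::vector<std::size_t> SelectByVariance(const std::vector<double> &variances, double threshold)
{
   std::vector<std::size_t> selected;
   for (std::size_t i = 0; i < variances.size(); ++i) {
      if (variances[i] > threshold) {
         selected.push_back(i);
      }
   }
   return selected;
}
```

Separating the two steps is what lets the user look at the variances first and pick the threshold afterwards.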

In addition to the above, I wanted to print the output of CalcNorm() in tabular form so that it looks neat and readable. Because the variable expressions have different lengths, the output printed to std::cout was misaligned. To align the columns, I needed the maximum length of the variable and target names, so I added two new methods to the DataSetInfo class, GetVariableNameMaxLength() and GetTargetNameMaxLength(). The class already had a similar method for class names, GetClassNameMaxLength().

The following diagram sums up the changes:
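The alignment trick can be sketched like this: find the longest name, then pad the name column to that width. RowSketch, MaxNameLength and FormatNormTable are hypothetical names for this illustration, and the column widths and precision are arbitrary choices, not what TMVA prints.

```cpp
#include <algorithm>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical row of the table: one variable with two of its norm parameters.
struct RowSketch {
   std::string name;
   double mean;
   double variance;
};

// Length of the longest variable name, the role played by a
// Get...NameMaxLength()-style helper.
std::size_t MaxNameLength(const std::vector<RowSketch> &rows)
{
   std::size_t maxLen = 0;
   for (const auto &row : rows) {
      maxLen = std::max(maxLen, row.name.size());
   }
   return maxLen;
}

// Build a table whose name column is as wide as the longest name
// (or the header "Variable", whichever is longer).
std::string FormatNormTable(const std::vector<RowSketch> &rows)
{
   const std::size_t headerLen = std::string("Variable").size();
   const int nameWidth = static_cast<int>(std::max(MaxNameLength(rows), headerLen));
   const int numWidth = 12;

   std::ostringstream out;
   out << std::left << std::setw(nameWidth) << "Variable"
       << std::right << std::setw(numWidth) << "Mean"
       << std::setw(numWidth) << "Variance" << '\n';

   out << std::fixed << std::setprecision(4);
   for (const auto &row : rows) {
      out << std::left << std::setw(nameWidth) << row.name
          << std::right << std::setw(numWidth) << row.mean
          << std::setw(numWidth) << row.variance << '\n';
   }
   return out.str();
}
```

Returning a string instead of writing straight to std::cout keeps the sketch easy to check; the caller can simply stream the result to std::cout.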

diagram

All these changes can be viewed in the commit history of this new branch I created: get-variance. This notebook demonstrates the above updates.

