In the past two weeks, I had two meetings over Skype with my mentor Sergei Gleyzer. One was a one-to-one meeting and the other was a general meeting of all TMVA developers. The one-to-one was a short Skype call on 19th May, in which we discussed the work done so far and the first task I’d be focusing on. On 27th May, we had our first kickoff session to mark the beginning of the Google Summer of Code program and to get introduced to fellow TMVA developers and mentors. We explained our projects briefly and reported our progress over the past weeks. It lasted about 45 minutes and was very motivating for me. All the mentors are really cool and positive. We’ll be having such general meetings once or twice a month.
After my first Skype meet on 13th May, it took me 2 days to set up the blog because there are so many blog-generating platforms and themes to choose from. The more I tried, the more confused I got. I finally chose Jekyll because it is very customisable, with various beautiful themes, Liquid templating and syntax highlighting options available. After setting up the blog and publishing my first post, I got started by writing my first code for TMVA. I’ll be implementing all the proposed methods in C++ and showcasing/testing them in Python with Jupyter notebooks.
I created a new class for the first feature extraction method that I had proposed: Variance Threshold. As the name suggests, this method computes the variance of every feature in a dataset and keeps the ones whose variance lies above a specific threshold. The threshold is generally provided by the user; otherwise the default value is 0, i.e. only the features that have the same value in all the samples are removed.
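To make the idea concrete, here is a minimal sketch of variance thresholding in plain Python. It is not the TMVA implementation: the function name `variance_threshold`, the list-of-rows data layout and the use of the population variance are all my assumptions for illustration.

```python
from statistics import pvariance


def variance_threshold(samples, threshold=0.0):
    """Return the indices of the features whose variance is strictly above `threshold`.

    `samples` is a list of rows, each row holding one value per feature.
    Hypothetical helper for illustration; population variance is an assumption.
    """
    n_features = len(samples[0])
    keep = []
    for j in range(n_features):
        # Collect the j-th feature across all samples.
        column = [row[j] for row in samples]
        if pvariance(column) > threshold:
            keep.append(j)
    return keep


# Toy dataset: feature 0 is constant, features 1 and 2 vary.
samples = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 10.0],
    [1.0, 9.0, 20.0],
]

default_selection = variance_threshold(samples)       # default threshold of 0
strict_selection = variance_threshold(samples, 3.0)   # higher threshold
```

With the default threshold only the constant feature is dropped, while a threshold of 3.0 also drops the middle feature, whose variance is 8/3.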
In the following posts on this blog, I’ll be using TMVA nomenclature when describing a method. It is not a standard nomenclature, so one should be aware of the following basic terms in the TMVA toolkit:
- Variables - Features of a dataset are called variables in TMVA toolkit.
- Events - Samples of a dataset are called events in TMVA. Each event belongs to one class - signal or background.
- Trees - ROOT Trees are like spreadsheets which contain variables of different datatypes. They play a role similar to CSV files, and we can access the variable values for each event. A ROOT Tree can have a branch for each variable or for a group of variables. Trees are stored in files with the extension “.root”.
Last week, I analysed the code of all six variable transformations implemented in TMVA and observed something common to all of them: they take a variable x or a class and transform it to, say, Ax. In TMVA, we can also apply such transformations to a manually selected subset of variables, but none of the existing transformations takes any parameters. The Variance Threshold transformation, however, needs a threshold value, and it is not appropriate to pass it a list of variables because it selects variables automatically based on their variance. After discussing this with my mentors, I decided to implement the method in the DataLoader class, where it takes the transformation name and the threshold value in a single string and returns a new DataLoader containing only the variables selected by variance.
newloader = loader.VarTransform("VT(1.5)")
where VT stands for Variance Threshold and 1.5 is the threshold value. If no value is given, as in
newloader = loader.VarTransform("VT")
the threshold is assumed to be zero.
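Passing the name and the threshold in a single string means the string has to be parsed before the transformation can run. The sketch below shows one way this could be done in Python. The function `parse_transform` and its error messages are hypothetical and only illustrate the "VT(1.5)" / "VT" convention described above; the actual C++ code in TMVA may parse the string differently.

```python
import re


def parse_transform(option):
    """Split a string like "VT(1.5)" into (name, threshold).

    Hypothetical helper: a missing or empty argument falls back to a
    threshold of 0.0, mirroring the default described in the post.
    """
    match = re.fullmatch(r"\s*([A-Za-z]+)\s*(?:\(\s*([^()]*?)\s*\))?\s*", option)
    if match is None:
        raise ValueError(f"Malformed transformation string: {option!r}")

    name, argument = match.group(1), match.group(2)
    if name != "VT":
        raise ValueError(f"Unknown transformation: {name!r}")

    # No parentheses, or empty parentheses: use the default threshold.
    if argument is None or argument == "":
        return name, 0.0

    try:
        threshold = float(argument)
    except ValueError:
        raise ValueError(f"Threshold is not a number: {argument!r}") from None
    return name, threshold


with_value = parse_transform("VT(1.5)")
with_default = parse_transform("VT")
```

Rejecting malformed strings with an explicit error, rather than silently falling back to a default, is in the same spirit as the error checks mentioned below.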
I have implemented the above VarTransform method and the code can be viewed here. I tried to keep the structure of the code general so that it can be easily extended to other transformations if needed. Error checks are added to ensure that it doesn’t abort unexpectedly. I also created a Jupyter notebook to showcase this method.
Before the next meeting on Friday, I have the following TODOs:
- Understand the ROOT TTree, TBranch and TNtuple data structures in order to create and manage datasets.
- Add one synthetic dataset and one physics dataset to the notebook.
Hopefully I’ll be more productive in the upcoming weeks.