Package description

Gian Michele Innocenti edited this page Jul 6, 2019 · 15 revisions

Intro

This package is meant to run fast parallel analysis and machine learning optimization using modern servers with Python and pandas. In order to start your analysis you need a list of unmerged flat ROOT TTrees for data and MC. For full compatibility, it is recommended to produce your TTrees using the same format presented in https://github.com/ginnocen/ALICETreeCreator. The TTrees have to be saved in a folder preserving the standard Grid folder structure (e.g. production/child_1/0001/AnalysisResults.root).

In this tutorial we will go step by step through the package. You will learn its main functionalities and how to run a real optimization on a small dataset.

General introduction:

The package performs the following operations:

  • Conversion: the flat ROOT TTrees are converted into pandas DataFrames saved in pickle format.
  • Skimming: a layer of selection is applied to select your candidates according to a given variable. If you want to do a cross-section measurement vs pT, you will have to indicate the name of the pT variable and the pT ranges you want to consider.
  • ML files creation: a subset of the MC and data is merged and used to optimise the selection strategy. For the ML optimization, the signal is taken from MC and the background from the data side-bands.
  • Optimisation: the selection strategy is optimised using recent ML algorithms from scikit-learn, XGBoost, and Keras. Trained models are saved and made available for analysis.
  • Model application on data and MC: the unmerged dataframes are processed. Candidates are selected according to standard analysis cuts or according to a loose cut on the ML probability.
  • Data merging: merged dataframes are created with the candidates selected by the standard analysis cuts or by the ML probability.
  • Invariant mass and efficiency building: on the merged dataset, invariant mass spectra and efficiency plots are created and stored in the same ROOT format as the regular task output (AnalysisResults.root).
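As an illustration, the skimming and ML-sample preparation steps above can be sketched with pandas. This is a minimal toy example, not the package's actual code: the column names (pt_cand, inv_mass), the pT bin edges, and the side-band window are hypothetical placeholders.

```python
import pandas as pd

# Toy stand-in for a converted TTree: in the real package the DataFrame
# comes from a flat ROOT TTree converted and saved as a pickle file.
df = pd.DataFrame({
    "pt_cand":  [0.5, 1.2, 2.8, 3.5, 5.1, 7.9],
    "inv_mass": [2.10, 2.28, 2.29, 2.45, 2.27, 2.60],
})

# Skimming: split the candidates into pT ranges (hypothetical bin edges).
pt_bins = [(1, 3), (3, 6)]
skimmed = {b: df.query(f"{b[0]} <= pt_cand < {b[1]}") for b in pt_bins}
for (lo, hi), sub in skimmed.items():
    sub.to_pickle(f"skimmed_{lo}_{hi}.pkl")  # one pickle per pT range

# ML files: take the background from the data side-bands of the signal
# peak (peak position and window width are illustrative numbers only).
peak, window = 2.286, 0.03
background = df[(df.inv_mass - peak).abs() > 2 * window]
```

In the real package these choices (variable names, bin edges, side-band definition) are read from the analysis database described below rather than hard-coded.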

General structure:

To run the package on any analysis you have to configure two databases:

  • default_complete.yaml is used to activate or deactivate a given step in the analysis chain (conversion, skimming, optimization). The first entry of the database also specifies which analysis you want to perform (the default is LcpK0s_multiplicity_test). More details on how to configure this database can be found here.
  • database_ml_parameters_case.yml is the database (one for each analysis) that contains all the parameters used for running the analysis. If you want to include a brand-new analysis, you will have to create a new file of this type where you will store all the important parameters of your analysis. More details on how to configure this database can be found here.
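For orientation, an entry in database_ml_parameters_case.yml might look roughly like the sketch below. The keys and values shown here are illustrative guesses at the kind of information stored (variable names, pT ranges, input directories), not the package's actual schema:

```yaml
# Hypothetical sketch of an analysis database entry (keys are illustrative)
LcpK0s_multiplicity_test:
  var_pt: pt_cand              # name of the pT variable used for skimming
  sel_skim_binmin: [1, 2, 4]   # lower edges of the pT ranges
  sel_skim_binmax: [2, 4, 8]   # upper edges of the pT ranges
  inputs:
    data:
      unmerged_tree_dir: /path/to/data/production/child_1
    mc:
      unmerged_tree_dir: /path/to/mc/production/child_1
```

Consult the configuration documentation linked above for the real parameter names expected by the package.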

How to run a simple test:

If you just want to get started with a very small dataset, please download this folder https://www.dropbox.com/sh/05qj33dxu8ksyqn/AADq6i1Ip-Svl2nMQZTV2mTpa?dl=0 and place it in your home directory, e.g. /Users/gianmicheleinnocenti/samplesMLexample on macOS.

Then simply run the following script from the directory MachineLearningHEP/machine_learning_hep:

python do_entire_analysis.py

You will be running an example analysis of Lc vs multiplicity, configured in the default database (LcpK0s_multiplicity_test).