This document explains how to generate a microbenchmark from the most representative function invocation of the hottest function of a benchmark. To do so, the following high-level steps are required:
- Profile the benchmark to get the hottest function of the benchmark.
- Profile the function of interest to get the behavior per invocation.
- Select the invocation of interest based on clustering analysis.
- Trace the invocation of interest to generate the corresponding Microprobe test files (MPTs).
- Generate a microbenchmark from the MPTs.
- (Optional) Trace the microbenchmark's detailed memory access pattern and regenerate the microbenchmark using the MPTs and the detailed memory access pattern.
The rest of the document provides details on each of the steps.
The initial profiling of the benchmark is done using performance counters. Depending on the performance counter being sampled, the definition of hottest changes. It is up to the user to select the type of analysis to perform.
For instance, using CYCLES, the functions with the most execution time will be the hottest. If one uses INSTRUCTIONS, the functions executing most of the instructions will be the hottest. One can use any other performance counter available via the perf interface to focus the analysis on other aspects. E.g. one can sample the branch misprediction counter to focus the analysis on the functions with poor branch prediction, and similarly for memory misses, pipeline vector stalls, etc.
In this example, we are going to focus on CYCLES, which is typically the first type of analysis one performs before going into more detailed ones. The command to execute is as follows:
chop sample -data CHOPSTIX_DB -events CYCLES -period 100000 -- BINARY ARGUMENTS
where:
- CHOPSTIX_DB is the path of the database where all the samples and extra information (see following steps) will be stored.
- -events specifies the event to be sampled.
- BINARY is the binary executable (ELF) to execute.
- ARGUMENTS are the arguments needed by the BINARY.
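For instance, assuming a hypothetical benchmark binary ./mybench that takes a single input file, the invocation could look like this:

# Hypothetical example: sample CYCLES every 100000 events while running ./mybench.
chop sample -data chopstix.db -events CYCLES -period 100000 -- ./mybench input.dat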
Note that measurements are done on the real system. Therefore, when sampling counters that can be affected by other activity on the system, one needs to minimize the measurement noise by minimizing any other system activity. Also, one can execute the above command N times, and the results will be accumulated so that noise can be minimized.
Once the sampling is done, one can postprocess all the sampled information to list which functions got the most samples, i.e. generated most of the events, and are therefore the hottest from that performance counter's point of view. To do so, execute the following commands:
chop disasm -data CHOPSTIX_DB BINARY $(ldd BINARY)
chop count -data CHOPSTIX_DB
chop annotate -data CHOPSTIX_DB
chop list functions -data CHOPSTIX_DB
The commands above first disassemble the executed binary and the dynamic libraries used (note the $(ldd BINARY) subcommand). If you already know that the functions to trace are in the main BINARY, there is no need to disassemble the dynamic libraries (which can be costly in both space and time). The samples are then counted and annotated onto the control flow graph (CFG) that ChopStiX generated during the disasm step. The final command lists all the functions, sorted by the number of samples / coverage with respect to the entire benchmark. The top one is the hottest function (HOTFUNC from now on).
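As noted above, if the functions of interest are in the main BINARY, the dynamic libraries can be skipped; with the hypothetical names used so far, the full postprocessing sequence would be:

# Disassemble only the main binary, then count, annotate and list.
chop disasm -data chopstix.db ./mybench
chop count -data chopstix.db
chop annotate -data chopstix.db
chop list functions -data chopstix.db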
ChopStiX provides a helper script to automatically select functions based on coverage, function size, and the maximum number of functions to select. For instance, the command:
chop-score-table CHOPSTIX_DB 80 10 100
will dump the stats of at most 10 functions reaching a minimum of 80% coverage, where each function has at least 100 instructions in size (static function size).
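For example, to widen the search to 20 functions with a 90% coverage target while accepting functions as small as 50 instructions (reusing the hypothetical database name from before):

chop-score-table chopstix.db 90 20 50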
Once a HOTFUNC has been selected, the next step is to determine which particular invocation we want to trace and reproduce. This is not straightforward, since hot functions are typically executed multiple times and often exhibit very different behaviors over time depending on the input parameters. In such a case, the question is: which particular invocation (or set of invocations) best represents the typical behavior of the function?
First, we need to obtain the corresponding addresses in memory where the function starts and ends, i.e. the entry and exit points of the function. One can do so manually by inspecting the code, but ChopStiX provides a helper script that facilitates the process:
chop-marks BINARY HOTFUNC
will return the addresses of the begin/end points for the function (as well as the library to profile, if the function is within a dynamic library). From now on, we call them HOTFUNC_MARKS. This information is going to be used for profiling the function in the next step, as well as for tracing the selected invocation in the tracing step.
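For example, with the hypothetical binary and a hot function named compute_kernel, the marks can be captured in a shell variable and reused verbatim in the following commands. The exact output format of chop-marks (assumed here to be command line flags carrying the begin/end addresses) may vary between versions:

# Capture the begin/end marks for later reuse (hypothetical names).
HOTFUNC_MARKS=$(chop-marks ./mybench compute_kernel)
echo "$HOTFUNC_MARKS"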
Now that we know the HOTFUNC_MARKS that define the region of interest (ROI), we can profile its performance using the following command:
chop-perf-invok HOTFUNC_MARKS -o HOTFUNC_PROFILE -max 1000000 -- BINARY ARGUMENTS
The command will generate a CSV file named HOTFUNC_PROFILE with a maximum of 1M entries. The stats include instruction counts, cycle counts, execution time information, etc. Therefore, they are subject to experimental noise. As in the first step, you might want to minimize the activity on the rest of the system, or repeat the experiment several times, to improve the robustness of the approach.
You can manually analyze the stats of the generated CSV to understand in detail the behavior of each invocation, and manually select a particular invocation you find worth tracing (a quick inspection example follows below). Alternatively, you can use the clustering analysis explained in the rest of the section to perform the selection automatically.
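For a quick manual inspection, standard command line tools are sufficient; no particular column layout is assumed here, so check the CSV header first:

head -1 HOTFUNC_PROFILE                  # show the column names
column -s, -t < HOTFUNC_PROFILE | less   # pretty-print one row per invocation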
ChopStiX provides a set of support scripts to analyze the generated CSV, create clusters, and select a representative invocation for each of them. To do so, execute:
cti_cluster instr_ipc_density --max-clusters 20 --min-clusters-weight-percentage 1 --target-coverage-percentage 80 --output CLUSTER_INFO --plot-path CLUSTER_PLOT --benchmark-name BINARY_NAME --function-name HOTFUNC HOTFUNC_PROFILE
where:
- CLUSTER_INFO is the output JSON file that will contain the clustering information.
- CLUSTER_PLOT is the path to the plot that will be generated to visually debug the clustering analysis.
- BINARY_NAME is the benchmark name.
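Continuing the hypothetical example, the invocation could look as follows:

cti_cluster instr_ipc_density --max-clusters 20 --min-clusters-weight-percentage 1 --target-coverage-percentage 80 --output mybench.json --plot-path mybench.png --benchmark-name mybench --function-name compute_kernel mybench_profile.csv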
The command will generate a maximum of 20 clusters, or fewer if the total coverage reaches 80% first. Each cluster will have at least 1% of coverage. More parameters exist to control the clustering process; please refer to the actual command line help for further details. Once the CLUSTER_INFO is generated, a summary can be viewed using:
cti_cluster_info summary CLUSTER_INFO
and a representative of a cluster can be obtained using:
cti_cluster_info representative -c CLUSTER_ID CLUSTER_INFO
where:
- CLUSTER_ID is the ID of the cluster, which you can get from the previous summary. Cluster IDs are sorted by coverage; therefore, selecting cluster ID 0 will select a representative of the most common behavior.
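For example, to inspect the clusters and pick a representative of the largest one (this sketch assumes the representative subcommand prints just the invocation number; otherwise, copy the number manually):

cti_cluster_info summary mybench.json
HOTFUNC_INVOCATION=$(cti_cluster_info representative -c 0 mybench.json)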
After executing the previous command, we obtain the invocation number of the most representative behavior of the function, HOTFUNC_INVOCATION from now on. The next step is to trace that particular HOTFUNC_INVOCATION of the HOTFUNC.
To do so, execute:
chop trace HOTFUNC_MARKS -indices HOTFUNC_INVOCATION -max-traces 1 -trace-dir TRACEDIR -gzip BINARY ARGUMENTS
where:
- TRACEDIR is the output trace directory.
The command above will trace the ROI defined by HOTFUNC_MARKS on the HOTFUNC_INVOCATION invocation. We only need to generate one trace; therefore, we set -max-traces 1 to stop tracing as soon as the traced invocation ends. The raw dump will be placed into TRACEDIR. To convert it into Microprobe test files (MPTs) to be used later on, execute the following command:
chop-trace2mpt --trace-dir TRACEDIR -o MPTBASENAME --gzip
where:
- MPTBASENAME will be the output base name of the MPTs generated.
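Putting the two commands together with the hypothetical names used so far:

# Trace the selected invocation and convert the raw dump to MPTs.
chop trace $HOTFUNC_MARKS -indices $HOTFUNC_INVOCATION -max-traces 1 -trace-dir mybench.trace -gzip ./mybench input.dat
chop-trace2mpt --trace-dir mybench.trace -o mybench_kernel --gzip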
From the MPTs generated in the previous step, we can use the Microprobe framework to process them and convert them into different formats. Check the Microprobe documentation for all the available possibilities.
A typical use case is to convert the MPT into a self-runnable ELF, i.e. the traced function invocation will be converted into an executable that continuously executes the function in an endless loop. To do so, you can use the following command:
mp_mpt2elf -t MPTFILE -T MICROPROBE_TARGET -O UBENCH_NAME.s --raw-bin --compiler COMPILER --wrap-endless --reset --compiler-flags="COMPILER_FLAGS" --fix-long-jump
where:
- MPTFILE is the MPT file generated in the previous step.
- MICROPROBE_TARGET is a Microprobe target, e.g. z16-z16-z64_linux_gcc or power_v310-power10-ppc64_linux_gcc.
- COMPILER is the path to the compiler binary.
- COMPILER_FLAGS are the compiler flags to be used.
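For instance, on a POWER10 system (file names and compiler flags below are hypothetical; the exact MPT file name produced by chop-trace2mpt may include an invocation index and a .gz suffix):

mp_mpt2elf -t mybench_kernel.mpt -T power_v310-power10-ppc64_linux_gcc -O mybench_ubench.s --raw-bin --compiler /usr/bin/gcc --wrap-endless --reset --compiler-flags="-O2" --fix-long-jump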
The command will generate an assembly file UBENCH_NAME.s and its corresponding compiled program UBENCH_NAME.elf. Microprobe will maintain all the original code and addresses. The extra initialization code added wraps the ROI in an endless loop and resets the user registers to their original state, so that on each iteration the function is executed with the same initial register values. You can debug the generated code by opening the assembly file.
You can check the functional correctness of UBENCH_NAME.elf by executing it and verifying its behavior and performance profile (a quick check example follows below). For most simple functions, the workflow stops here, and you have succeeded in generating a small version (microbenchmark) of the most representative function invocation of the hottest function of a benchmark.
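One simple sanity check is to compare the microbenchmark's performance profile against the profiled behavior of the selected invocation, e.g. its IPC. Since the microbenchmark runs in an endless loop, bound the measurement with timeout:

perf stat -e cycles,instructions -- timeout 10s ./UBENCH_NAME.elf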
Still, there might be errors, or the performance might not be as expected. This is because we only reset the state of the registers; there might be ROIs whose functional correctness or performance profile depends on the state of memory, which we do not reset at each iteration. To fix such errors and mismatches, we need to proceed to the next step.
The final microbenchmark tuning step traces the memory activity of a single microbenchmark iteration of UBENCH_NAME.elf and generates an enriched MPT from MPTFILE that includes the memory access trace information. To do so, execute:
chop-trace-mem -output MPTFILE_MEMTRACE -base-mpt MPTFILE -output-mpt MPTFILE_WITH_MEMTRACE -- UBENCH_NAME.elf
then, you can repeat the previous microbenchmark generation step as follows:
mp_mpt2elf -t MPTFILE_WITH_MEMTRACE -T MICROPROBE_TARGET -O UBENCH_NAME_WITH_MEM.s --raw-bin --compiler COMPILER --wrap-endless --reset --compiler-flags="COMPILER_FLAGS" --fix-long-jump
and the final UBENCH_NAME_WITH_MEM.elf will be generated.