From ec03ada247f1f574034aeaeb94f89e6f81f2acf5 Mon Sep 17 00:00:00 2001 From: Devin Matthews Date: Wed, 17 Jul 2024 14:53:10 -0500 Subject: [PATCH 1/4] Essentially finished plugin/control tree documentation (first draft). --- docs/PluginHowTo.md | 1295 ++++++++++++++++++++++++ docs/diagrams/mmbp_algorithm_color.png | Bin 0 -> 28861 bytes 2 files changed, 1295 insertions(+) create mode 100644 docs/PluginHowTo.md create mode 100644 docs/diagrams/mmbp_algorithm_color.png diff --git a/docs/PluginHowTo.md b/docs/PluginHowTo.md new file mode 100644 index 0000000000..cb27f23660 --- /dev/null +++ b/docs/PluginHowTo.md @@ -0,0 +1,1295 @@ +# Contents + +* **[Introduction](PluginHowTo.md#introduction)** + * **[Example Plugin](PluginHowTo.md#example-plugin)** + * **[Creating a New Plugin](PluginHowTo.md#creating-a-new-plugin)** + * **[Building a Plugin](PluginHowTo.md#building-a-plugin)** +* **[Kernels](PluginHowTo.md#kernels)** + * **[Accessing Kernels](PluginHowTo.md#accessing-kernels)** + * **[Reference Kernels](PluginHowTo.md#reference-kernels)** + * **[Optimized Kernels](PluginHowTo.md#optimized-kernels)** + * **[Mapping Kernels to Subconfigurations](PluginHowTo.md#mapping-kernels-to-subconfigurations)** +* **[Custom Operations](PluginHowTo.md#custom-operations)** + * **[Example: `bli_gemmt_ex`](PluginHowTo.md#example-bli_gemmt_ex)** + * **[The Control Tree](PluginHowTo.md#the-control-tree)** + * **[Modifying the Control Tree](PluginHowTo.md#modifying-the-control-tree)** + * **[Modifications to Blocking](PluginHowTo.md#modifications-to-blocking)** + * **[Modifications to Packing](PluginHowTo.md#modifications-to-packing)** + * **[Modifications to Computation](PluginHowTo.md#modifications-to-computation)** + * **[SYRKD](PluginHowTo.md#syrkd)** +* **[API Reference](PluginHowTo.md#api-reference)** + * **[Registration](PluginHowTo.md#registration)** + * **[Helper Functions](PluginHowTo.md#helper-functions)** + * **[Context 
Initialization](PluginHowTo.md#context-initialization)** + * **[Context Query](PluginHowTo.md#context-query)** + * **[Control tree modification](PluginHowTo.md#control-tree-modification)** + +# Introduction + +A BLIS plugin is a piece of user-defined code that provides additional linear algebra functionality, but leverages BLIS's internal framework for high performance. Through a plugin, users can: + +* Provide customized or optimized [kernels](PluginHowTo.md#kernels), and access internal BLIS kernels. +* Define new, custom linear algebra [operations](PluginHowTo.md#custom-operations) which extend the level-3 BLAS (for example, `GEMM`). + +Plugins are defined completely externally to BLIS (that is, the BLIS source code is not required). However, an installed copy of BLIS 2.0 or later is required (assumed installed to `$PREFIX`) in order to configure or build a plugin. Building a plugin then results in a shared and/or static library which can be distributed or linked into your code. The template and example files generated by BLIS are all in C99, but C++ is also supported. + +## Example Plugin + +A new plugin is created by running `$PREFIX/share/blis/configure-plugin `, where `` is the name you wish to give to the plugin which must be a valid C99 identifier. By default, this generates a fully-functioning example plugin containing the following files: + +
├─ [Makefile](PluginHowTo.md#makefile) **
+├─ [config.mk](PluginHowTo.md#configmk) **
+├─ [config_registry](PluginHowTo.md#config_registry) **
+├─ [bli_plugin_\.h](PluginHowTo.md#bli_plugin_nameh)
+├─ [bli_plugin_register.c](PluginHowTo.md#bli_plugin_registerc)
+├─ config
+│  ├─ \
+│  ├─ ...
+│  └─ \
+│     ├─ [bli_kernel_defs_\.h](PluginHowTo.md#configconfigbli_kernel_defs_configh)
+│     ├─ [bli_plugin_init_\.c](PluginHowTo.md#configconfigbli_plugin_init_configc)
+│     └─ [make_defs.mk](PluginHowTo.md#configarchmake_defsmk)
+├─ ref_kernels
+│  ├─ [bli_plugin_init_ref.c](PluginHowTo.md#ref_kernelsbli_plugin_init_refc)
+│  ├─ [my_kernel_1_ref.c](PluginHowTo.md#ref_kernelsmy_kernel_1_refc-and-my_kernel_2_refc) *
+│  └─ [my_kernel_2_ref.c](PluginHowTo.md#ref_kernelsmy_kernel_1_refc-and-my_kernel_2_refc) *
+├─ kernels
+│  ├─ \
+│  ├─ ...
+│  └─ zen3
+│     └─ [my_kernel_1_zen3.c](PluginHowTo.md#kernelszen3my_kernel_1_zen3c) *
+└─ obj **
+   └─ [\](PluginHowTo.md#objconfig) **
+
+ +Files marked with `*` (and some portions of other files) are for example only and can be omitted by passing the `--disable-examples` flag to `configure-plugin`. Files and directories marked with `**` are only required when you are ready to build the plugin and can be disabled with `--disable-build`. The remaining files and directories constitute the plugin "template". If you want to later generate only build files then these files (which presumably already exist) can be skipped with `--disable-templates`. + +#### `Makefile` + +The Makefile for your plugin is automatically generated by `configure-plugin` and should not be modified. Targets `make` and `make clean` are supported and will build your plugin based on the flags given during configuration. + +#### `config.mk` + +This file is also generated by `configure-plugin` and should not need to be modified. + +#### `config_registry` + +This file provides the mapping from kernel sets to subconfigurations and configuration families. See [Mapping Kernels to Subconfigurations](PluginHowTo.md#mapping-kernels-to-subconfigurations) for more details. + +#### `bli_plugin_.h` + +This file is the main header for the plugin. It should be `#include`d in order to use the functionality provided by the plugin. ***Note:*** the name and contents of this header are a suggestion---feel free to structure your plugin however you like! + +The example file contains several sections: + +* Macros defining arguments to be passed to the registration functions. The example given uses externally-provided arrays to store the generated kernel, blocksize, and preference IDs. Many alternative strategies are possible, e.g. passing a struct, passing individual pointers/references to IDs, or using global variables and passing no arguments (defining these macros to be empty). You can also pass in any other arguments you might need during registration. 
Macros are preferred to define the parameters since the parameter list is used in several different files and in generated code. + +* Enumerations providing convenient names by which kernel/blocksize/preference IDs can be obtained. In the example, these are offsets into the arrays passed into the `bli_plugin_register_`. So, calling code could look up the kernel ID for kernel #2 as `kerids[MY_KERNEL_2]`. This section is entirely optional if you prefer a different way of accessing kernel IDs. + +* Prototypes for kernels. A prototype (and preferably a typedef) is recommended for each kernel you write so that you can provide type safety when calling kernels. Note that both kernels are assumed to have reference implementations (one for each enabled subconfiguration, expanded using the `INSERT_GENTCONF` macro to generate prototypes automatically), while a special "optimized" kernel #2 is available for double-precision operations on Zen 3 hardware. The latter prototype is given only for example---your plugin code would not need to know whether or not an optimized kernel is available and would only need to look up kernels by ID. The file `config/zen3/bli_plugin_init_zen3.c` handles registering this optimized kernel so that it can be automatically selected when running on Zen 3. + +* Prototypes for the plugin registration function (`bli_plugin_register_`) and configuration-specific initialization functions. The former function can be named and structured however you like, but we recommend keeping the latter (configuration-specific) functions as-is. + +#### `bli_plugin_register.c` + +This file implements the function `bli_plugin_register_` and illustrates how to register new kernels, along with associated blocksizes and kernel preferences. 
Each registration function generates a new, unique ID which must be saved and communicated to the rest of the plugin (for example, via global variables or arguments passed in to the function `bli_plugin_register_`) so that it can be used later. This function also calls `bli_plugin_register__` for each architecture which was enabled at configure time (see [`bli_plugin_init_.c`](PluginHowTo.md#configconfigbli_plugin_init_configc)). + +Any code using the plugin should call this function (which you can rename if you like) before making use of any plugin functionality. + +#### `config//bli_kernel_defs_.h` + +This file provides macros specific to one subconfiguration, such as the register blocksizes for the BLIS `GEMM` microkernel. You can add any macros or other definitions here that you want to be available to any code being compiled for the corresponding subconfiguration. Note that configuration families (e.g. `x86_64`) supersede individual subconfigurations. + +#### `config//bli_plugin_init_.c` + +This file initializes the "context" with any kernels, blocksizes, or kernel preferences which are optimized for the corresponding subconfiguration. It also calls the reference initialization function in [`ref_kernels/bli_plugin_init_ref.c`](PluginHowTo.md#ref_kernelsbli_plugin_init_refc) for the matching configuration. A full example is given for the `zen3` subconfiguration. If no optimized kernels have been written for a particular subconfiguration, then no modifications are necessary. See [Mapping Kernels to Subconfigurations](PluginHowTo.md#mapping-kernels-to-subconfigurations) for more information about how optimized kernels and subconfigurations are related. + +#### `config//make_defs.mk` + +This file contains additional build variables or compiler-/architecture-specific flags for each subconfiguration. Typically these files should not be modified in order to achieve the best performance and maintain compatibility with BLIS. 
+ +#### `ref_kernels/bli_plugin_init_ref.c` + +This file handles initialization of the context with [reference](PluginHowTo.md#reference-kernels) kernels. This file is compiled once for each enabled subconfiguration, resulting in functions `bli_plugin_init___ref`. Whenever you add a new reference kernel, blocksize, or kernel preference, you must also add code to initialize it here. + +#### `ref_kernels/my_kernel_1_ref.c` and `my_kernel_2_ref.c` + +These are example reference kernels. Note that the kernels are instantiated for the four standard datatypes (single and double precision, for both real and complex domains), indicated by the letters `sdcz`. Your kernels can use the same macros to help with instantiation of different types (or combinations of types), or you can use a different mechanism such as C++ templates. + +#### `kernels/zen3/my_kernel_1_zen3.c` + +This is an example optimized kernel. Typically optimized kernels are written with a specific data type or combination of data types in mind. In this example, only a double-precision real version is implemented, specifically for the Zen 3 architecture. + +#### `obj/` + +This folder will contain the built object files and static and/or shared library for the plugin. Only one sub-folder is created corresponding to the configuration for which BLIS was built. + +## Creating a New Plugin + +To create a "blank" plugin without any build files or example code, execute `$PREFIX/share/blis/configure-plugin --init ` in the directory where you want the plugin to exist. At this point, you can start adding your own: + +* Kernels, [see below](PluginHowTo.md#kernels) for more details + 1. Create a reference kernel. The file must be in the `ref_kernels` directory in order to be compiled correctly. Your kernel can have any name and interface, but should ideally be implemented for all supported data types and should be architecture-agnostic. + 2. Register your kernel in the `bli_plugin_register.c` file. + 3. 
Initialize the context with pointer(s) to your reference kernel in the `ref_kernels/bli_plugin_init_ref.c` file. + 4. [Optional] Implement optimized versions in the appropriate `kernels/` directories, and initialize them in `config//bli_plugin_init_.c`. +* Blocksizes + 1. Register the blocksizes in `bli_plugin_register.c`. + 2. Provide default values in `ref_kernels/bli_plugin_init_ref.c`. All data types should be given a default value. + 3. [Optional] Provide values for configuration-specific optimized implementations in `config//bli_plugin_init_.c`. +* Kernel preferences + 1. Register the kernel preferences in `bli_plugin_register.c`. + 2. Provide default values in `ref_kernels/bli_plugin_init_ref.c`. All data types should be given a default value. + 3. [Optional] Provide values for configuration-specific optimized implementations in `config//bli_plugin_init_.c`. + +You will also need to provide a way to get registered kernel/blocksize/preference IDs back to your code by filling in the `plugin__params` and `plugin__params_only` macros in `bli_plugin_.h`, saving to global variables, etc. + +## Building a Plugin + +Before building your plugin on a particular system, you must reconfigure to build using `$PREFIX/share/blis/configure-plugin --build []` in the plugin directory. Note that you do not need to provide the plugin name if it can be guessed from the name of `bli_plugin_.h`. There are several flags which can be used to control how your plugin will be built: + +| Flag | Explanation | +|-----------------------|-------------| +| -p PATH,
--path=PATH | Look for the plugin source in PATH instead of the current directory. This option is used to build the plugin out-of-tree. | +| -e SYMBOLS,
--export-shared[=SYMBOLS] | Specify the subset of library symbols that are exported within a shared library. Valid values for SYMBOLS are: 'public' (the default) and 'all'. By default, only functions and variables that belong to public APIs are exported in shared libraries. However, the user may instead export all symbols in BLIS, even those that were intended for internal use only. Note that the public APIs encompass all functions that almost any user would ever want to call, including the BLAS/CBLAS compatibility APIs as well as the basic and expert interfaces to the typed and object APIs that are unique to BLIS. Also note that changing this option to 'all' will have no effect in some environments, such as when compiling with clang on Windows. | +| --enable-rpath,
--disable-rpath | Enable (disabled by default) setting an install_name for dynamic libraries on macOS which starts with @rpath rather than the absolute install path. | +| --disable-shared,
--enable-shared | Disable (enabled by default) building BLIS as a shared library. If the shared library build is disabled, the static library build must remain enabled. | +| --disable-static,
--enable-static | Disable (enabled by default) building BLIS as a static library. If the static library build is disabled, the shared library build must remain enabled. | +| -d DEBUG, --enable-debug[=DEBUG] | Enable debugging symbols in the library. If argument DEBUG is given as 'opt', then optimization flags are kept in the framework, otherwise optimization is turned off. | +| --enable-verbose-make,
--disable-verbose-make | Enable (disabled by default) verbose compilation output during make. | +| -f, --force | Overwrite any files in the current directory which are normally copied by configure-plugin, for example 'Makefile' and 'config_registry'. | +| --enable-asan,
--disable-asan | Enable (disabled by default) compiling and linking BLIS framework code with the AddressSanitizer (ASan) library. Optimized kernels are NOT compiled with ASan support due to limitations of register assignment in inline assembly. WARNING: ENABLING THIS OPTION WILL NEGATIVELY IMPACT PERFORMANCE. Please use only for informational/debugging purposes. | +| --enable-arg-max-hack
--disable-arg-max-hack | Enable (disabled by default) build system logic that will allow archiving/linking the static/shared library even if the command plus command line arguments exceeds the operating system limit (ARG_MAX). | + +After configuring, you can now build using `make`. **Your plugin is always built for the same subconfiguration or configuration family that BLIS was.** This means that build configuration should ideally be done on the target system, unless you are using an installation of BLIS which is configured for a "fat build" for a full configuration family, such as `x86_64`. The final shared and/or static library is available in the `obj/` directory, where `` is the configuration that BLIS and your plugin are built for. + +# Kernels + +Kernels are the high-performance pieces of code at the heart of BLIS. A kernel usually does one simple computational operation on one or more input matrices, vectors, or scalars. For example, one of the workhorse kernels in BLIS is the `GEMM` microkernel, which computes a small matrix multiplication of `MR*k` and `k*NR` matrices, where `MR` and `NR` are constants depending on the architecture. You can write kernels which are intended to replace or extend existing BLIS kernels, or for any other operation which you might encounter in your code which needs a high-performance, architecture-specific solution. + +The BLIS plugin architecture supports two types of user-supplied kernels: reference kernels and optimized kernels. The former type of kernel is coded once (typically in standard C or C++), and compiled separately for any architecture which might be encountered. Then, at runtime BLIS will select the appropriate version of the kernel for the current hardware. Reference kernels typically do not achieve the highest performance, but are useful for less performance-sensitive operations such as data movement (which is bandwidth limited and not FLOP limited). 
For performance-critical kernels, you can additionally provide optimized kernels. These kernels are specific to one hardware architecture or family of related architectures, and are also often datatype-specific. These kernels also often employ compiler intrinsics or inline assembly which is not portable. If you provide an optimized kernel for a hardware architecture which is detected at runtime, BLIS will automatically select this kernel in preference to the reference kernel. + +In addition to kernels, BLIS plugins support providing blocksizes (for example, the `MR` and `NR` parameters above) as well as kernel preferences (essentially, the logical true/false equivalent of blocksizes) which control or define the behavior of kernels. These too are looked up based on the actual hardware encountered at runtime, and come in reference (essentially, default) and optimized flavors. While internal BLIS kernels endeavor to operate correctly for any kind of input (although they work most efficiently for inputs which conform to the corresponding block sizes and preferences), your kernels are not required to support arbitrary inputs or parameters. You only have to provide the functionality that you know you will need! + +## Accessing Kernels + +Kernels, blocksizes, and kernel preferences are accessed through the "context", which reflects the kernel set available for the hardware on which BLIS is running. Initially, kernels and their parameters must be registered. This creates a slot in the context to hold pointers, blocksizes, or other data, and then returns a unique ID. Next, this slot must be filled with user-supplied data (pointers to reference kernels, default blocksizes, etc.), using the supplied IDs. If optimized kernels or parameters are available these are then written over the reference data. All of these steps happen during plugin registration which must happen before any computations are performed with the plugin (although BLIS itself can be used). 
Finally, at any point after plugin registration, the current context can be obtained and then queried using the unique IDs: + +```C++ +const cntx_t* cntx = bli_gks_query_cntx(); + +my_fun_ptr kernel = ( my_fun_ptr )bli_cntx_get_ukr_dt( BLIS_DOUBLE, MY_KERNEL_ID, cntx ); + +kernel(...); +``` + +The process for registering and initializing kernels is detailed below. + +## Reference Kernels + +A reference kernel must first be registered. This should happen in `bli_plugin_register_` defined in `bli_plugin_register.c` (although you can change the function and file names): + +```C++ +err_t err; +kerid_t id; + +err = bli_gks_register_ukr( &id ); +if ( err != BLIS_SUCCESS ) + //handle error +``` + +Note that for registration we don't need to know anything about the actual kernel yet. Next, the pointers to the reference kernels must be supplied in the file `ref_kernels/bli_plugin_init_ref.c` (again, you can change the filename, but it must reside in `ref_kernels`, and it is not recommended to change the function name or signature since this must match `bli_plugin_register.c` and is generated automatically for each subconfiguration): + +```C++ +func_t ptrs; +gen_func_init( &ptrs, PASTECH(my_kernel,BLIS_CNAME_INFIX,BLIS_REF_SUFFIX) ); +bli_cntx_set_ukr( MY_KERNEL_ID, &ptrs, cntx ); +``` + +The `func_t` struct contains a function pointer for each data type. In this example the helper macro `gen_func_init` is used to automatically generate the correct symbol name for each type and for the current subconfiguration (since this file is compiled once for each enabled subconfiguration). It is strongly recommended to use the provided macros and naming convention for reference kernels. However, you are free to use any method you like to fill the entries of the `func_t` struct, *with pointers to the reference function of the correct type and for the correct subconfiguration*. The kernel is now fully initialized and can be used safely on any hardware which BLIS was configured for. 
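
The idea of one function pointer per datatype can be illustrated without BLIS at all. The sketch below is a stand-alone mock and not BLIS code: `mock_dt_t`, `mock_func_t`, `mock_func_init`, `mock_func_get`, and the two `scale` kernels are hypothetical names invented for this illustration. It only shows the general pattern of storing type-erased function pointers in a per-datatype table and casting back to the typed signature before the call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; these are NOT BLIS types. */
typedef enum { MOCK_FLOAT = 0, MOCK_DOUBLE = 1, MOCK_NUM_DT = 2 } mock_dt_t;

/* Type-erased function pointer, one slot per datatype. */
typedef void (*mock_void_fp)( void );
typedef struct { mock_void_fp ptr[ MOCK_NUM_DT ]; } mock_func_t;

/* Typed signatures which callers cast back to before invoking a kernel. */
typedef void (*mock_sscale_ft)( size_t n, float  alpha, float*  x );
typedef void (*mock_dscale_ft)( size_t n, double alpha, double* x );

/* "Reference" kernels: portable C, one per datatype. */
void mock_sscale_ref( size_t n, float alpha, float* x )
{
    for ( size_t i = 0; i < n; i++ ) x[ i ] *= alpha;
}

void mock_dscale_ref( size_t n, double alpha, double* x )
{
    for ( size_t i = 0; i < n; i++ ) x[ i ] *= alpha;
}

/* Fill every slot of the table with the matching reference kernel. */
void mock_func_init( mock_func_t* f )
{
    f->ptr[ MOCK_FLOAT  ] = ( mock_void_fp )mock_sscale_ref;
    f->ptr[ MOCK_DOUBLE ] = ( mock_void_fp )mock_dscale_ref;
}

/* Look up the type-erased pointer for a given datatype. */
mock_void_fp mock_func_get( const mock_func_t* f, mock_dt_t dt )
{
    assert( ( int )dt >= 0 && ( int )dt < MOCK_NUM_DT );
    return f->ptr[ dt ];
}
```

A caller would initialize the table once with `mock_func_init`, then retrieve and cast a pointer, for example `( ( mock_dscale_ft )mock_func_get( &f, MOCK_DOUBLE ) )( n, alpha, x )`. The cast back to the original typed signature is what makes the call well defined in C.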
+ +## Optimized Kernels + +If an optimized kernel implementation is available (as a function in a file in some `kernels/` folder), it should be initialized in the appropriate file `config//bli_plugin_init_.c`. For example: + +```C++ +bli_cntx_set_ukrs +( + cntx, + + MY_KERNEL_ID, BLIS_DOUBLE, bli_dmy_kernel_zen3, + + BLIS_VA_END +); +``` + +Here, it is not necessary to provide an optimized implementation for all datatypes. The automatically-generated template code and build system will handle building the correct files and calling the initialization functions for subconfigurations which are enabled in the BLIS installation you are using. So, you can simply provide optimized implementations for any hardware which is important to you and it will be picked up and used if possible. + +## Mapping Kernels to Subconfigurations + +It may seem strange that optimized kernel implementations are written in the `kernels` folder, but are initialized in the `config` folder. In fact, the sub-folders of these two directories are not even the same! This is because in BLIS, multiple *subconfigurations* (roughly mapping to specific hardware architectures), as well as *configuration families* (for example, all `x86_64` architectures), can use kernels from one (or more) of the folders in `kernels`, called *kernel sets*. The mapping from kernel sets to configurations is defined by the `config_registry` file. Essentially, this means that when adding an optimized kernel, you should initialize the kernel in each configuration which maps the kernel set where you defined the kernel. Conversely, this also means that if you define the kernel in a kernel set which is not mapped by any enabled configuration, then the kernel will not exist and linking will fail. + +By default, this file contains the mapping known by BLIS at the time of plugin creation. 
Thus, it might be a good idea to periodically reconfigure your plugin in order to pick up new `config` or `kernels` sub-folders and entries in `config_registry`. Instead, or in addition, you can define your own mappings in `config_registry` to reflect how your particular kernels should be used. *Note that this mapping only affects kernels in your plugin, and does not affect reference kernels.* See [here](ConfigurationHowTo.md) for more information on subconfigurations, configuration families, and mapping of kernel sets. + +# Custom Operations + +BLIS is written as a framework, meaning that user-written code can be inserted in order to achieve new functionality. For example, consider the mathematical operation $\mathop{\text{tri}}(C) := \mathop{\text{tri}}(\alpha A D A^T + \beta C)$ where $D$ is a diagonal matrix and the function `tri` operates only on the upper or lower part of a matrix. If $D$ were the identity matrix, then this would be a standard level-3 BLAS operation, `SYRK`, so we call this BLAS-like operation `SYRKD`. While it is technically not necessary to use the plugin infrastructure to implement `SYRKD` using BLIS, extending BLAS operations typically requires new kernels which are conveniently managed as a plugin. However, the code discussed in this section does not need to exist in the plugin directory (although it can be placed in the top-level plugin directory) but should have access to the kernel, blocksize, and kernel preference IDs registered by the plugin. + +Because $A D A^T = A (A D)^T = (A D) A^T$, it is actually even more closely related to the operation `GEMMT`, which implements $\mathop{\text{tri}}(C) := \mathop{\text{tri}}(\alpha \mathop{\text{trana}}(A) \mathop{\text{tranb}}(B) + \beta C)$ where the functions `trana` and `tranb` optionally transpose the operand. Essentially, this is just `GEMM` where we know the result will in fact be symmetric even though $A \ne B^T$. 
Then we can see that `SYRKD` is the same thing as `GEMMT` with $B = AD$, $\mathop{\text{trana}}(A)=A$ and $\mathop{\text{tranb}}(B)=B^T$. So, let's implement `SYRKD` by: + +1. Starting with the high-level code which defines `GEMMT`. + +2. Writing a kernel to handle the multiplication $A D$ when packing the "virtual" matrix $B$. + +3. Modifying the `GEMMT` operation to use our custom packing kernel. + +4. Supplying additional data so that the packing kernel can address $D$ (in addition to $A$ which is passed as a normal parameter of `GEMMT`). + +## Example: `bli_gemmt_ex` + +Consider the following code which implements `GEMMT`: + +```C +/* + * Step 0: + */ +void bli_gemmt_ex + ( + const obj_t* alpha, + const obj_t* a, + const obj_t* b, + const obj_t* beta, + const obj_t* c, + const cntx_t* cntx, + const rntm_t* rntm + ) +{ + /* + * Step 1: Make sure BLIS is initialized. + */ + bli_init_once(); + + /* + * Step 2: Check the operands for consistency and check for cases where + * we can exit early (alpha = 0, m = 0, etc.). + */ + if ( bli_error_checking_is_enabled() ) + bli_gemmt_check( alpha, a, b, beta, c, cntx ); + + if ( bli_l3_return_early_if_trivial( alpha, a, b, beta, c ) == BLIS_SUCCESS ) + return; + + /* + * Step 3: Determine if we can and should use the 1m method for + * cases with all complex operands. + */ + num_t dt = bli_obj_dt( c ); + ind_t im = BLIS_NAT; + + if ( bli_obj_dt( a ) == bli_obj_dt( c ) && + bli_obj_dt( b ) == bli_obj_dt( c ) && + bli_obj_is_complex( c ) ) + // Usually BLIS_NAT if a complex microkernel is available, + // otherwise BLIS_1M. + im = bli_gemmtind_find_avail( dt ); + + /* + * Step 4: Alias A, B, and C so that we have local mutable copies and + * to take care of implicit transpose, sub-matrix references, + * etc. 
+ */ + obj_t a_local; + obj_t b_local; + obj_t c_local; + bli_obj_alias_submatrix( a, &a_local ); + bli_obj_alias_submatrix( b, &b_local ); + bli_obj_alias_submatrix( c, &c_local ); + + /* + * Step 5: Create a "default" control tree. + */ + if ( cntx == NULL ) cntx = bli_gks_query_cntx(); + gemm_cntl_t cntl; + bli_gemm_cntl_init + ( + im, + BLIS_GEMMT, + alpha, + &a_local, + &b_local, + beta, + &c_local, + cntx, + &cntl + ); + + /* + * Step 6: Execute the control tree in parallel. + */ + bli_l3_thread_decorator + ( + &a_local, + &b_local, + &c_local, + cntx, + ( cntl_t* )&cntl, + rntm + ); +} +``` + +### Step 0: Function signature + +The function name and signature are entirely up to you. Your function can take `obj_t`s as parameters, but you can also construct `obj_t`s internally based on whatever function parameters you define (see Step 4 for more on this). + +For `SYRKD`, we would need to add a `const obj_t* D` parameter, and the `const obj_t* B` parameter can be removed since we know that $B = A^T$. + +### Step 1: Initialize BLIS + +This step is mandatory and must be done before calling any other BLIS APIs used here. Most BLIS API calls (like `bli_gemm`) check for initialization themselves, but the control tree and thread decorator APIs do not. + +### Step 2: Error and early exit checks + +BLIS has some standard checks for typical level-3 BLAS operations, as well as checks for conditions which enable an early exit. You may use these functions, but for new operations you may need to include additional checks. Also note that `bli_l3_return_early_if_trivial` assumes that `C` is a dense matrix and will attempt to scale by `beta` if exiting early. If `C` refers to a matrix-like object with alternative layout then you will need to check for early exit conditions manually. + +For example, when implementing `SYRKD`, all of the checks done by `bli_gemmt_check` are still relevant (relative size/shape of `A`, `B`, `C`, triangular `C`, etc.). 
We would also want to check that `C` is symmetric, and that `D` is a vector and has the correct length. In a complex Hermitian version (e.g. `HERKD`) we might also want to enforce that `D` is real. Because `C` is a normal, dense matrix, we can also call `bli_l3_return_early_if_trivial` safely. + +### Step 3: Check for 1m or "natural" complex execution + +This step is optional. If your operation doesn't support complex operands, or if you don't want to support the 1m method (which requires additional kernels, see below), then you can always default to `BLIS_NAT` as the execution method. + +The functions `bli_XXXind_find_avail` are likely not useful for custom code, but you can check if an optimized complex-domain `GEMM` microkernel is available by using: + +```C +bool c_optimized = ! bli_gks_cntx_ukr_is_ref( BLIS_SCOMPLEX, BLIS_GEMM_UKR, cntx ); +``` + +Note that if the complex-domain `GEMM` microkernel is not optimized then using `BLIS_NAT` may decrease performance. + +### Step 4: Alias local matrices + +Your function may or may not operate on normal, dense matrices represented as BLIS `obj_t`s. If so, then code similar to that used for `GEMMT` is recommended, since it handles implicit transposition of the operands, cases where a sub-matrix of a larger matrix is indicated, etc. + +**If instead you are using "matrix-like" operands, then you will still need to construct an `obj_t`.** In this case, the `obj_t` simply indicates the size and shape of the object when logically viewed as a matrix. For example, a dense tensor can be mapped onto a matrix by collecting some tensor dimensions as the "rows" and the remaining dimensions as the "columns". The locations of elements are determined by the tensor strides, ordering of dimensions within rows and columns, etc. and do not translate directly to a matrix row-major or column-major layout. This is OK! 
BLIS will simply keep track of the matrix partitioning, and as long as you provide a custom packing kernel which knows *how* to access the data, BLIS will tell your kernel *what* data (logical sub-matrix) to pack. The same concept applies to the computational (`GEMM`) kernel. + +BLIS only needs `obj_t`s for the `A`, `B`, and `C` matrices (as they are defined in `GEMM` and related operations). Note that the `obj_t` representing the output matrix (`C`) should have row stride (`rs`) and column stride (`cs`) values set to indicate a "row-preference" (`rs < cs`) or "column-preference" (`cs < rs`). All other `obj_t`s only need to have the matrix length and width set, unless you are using default kernels which then need a full matrix specification. If your operation references other data or objects (like the `D` operand in `SYRKD`) or your matrix-like objects need data which doesn't fit in an `obj_t`, then this information will be provided separately (see below). + +For `SYRKD`, the `A` and `C` matrices are already `obj_t`s and we should just alias the sub-matrices. The `obj_t` for `B` can be constructed from `A` with a transpose. For `D`, we have also chosen to pass in an `obj_t` representing a vector (mathematically, the diagonal of a matrix), and so we can just alias a sub-matrix to clean it up. + +### Step 5: Create the control tree + +The control tree determines exactly what operations are done during execution and their parameters (see more below). Custom operations will typically begin with a "default" control tree which corresponds to the most similar level-3 BLAS operation. The particular level-3 operation used does matter: for example `TRMM` will only operate on the upper or lower part of a matrix, and other factors such as threading can be affected as well. In order to accommodate custom operations, the control tree will then need to be modified. This is discussed in the following sections. 
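
Before customizing a control tree, it can help to pin down exactly what the finished operation has to compute. The following is a naive, unoptimized sketch of the real-domain `SYRKD` update, $\mathop{\text{tri}}(C) := \mathop{\text{tri}}(\alpha A D A^T + \beta C)$, restricted to the lower triangle. It uses plain C with no BLIS dependency; the function name `syrkd_lower_ref`, the row-major layout, and the argument order are assumptions made for this illustration only and are not part of BLIS.

```c
#include <stddef.h>

/*
 * Hypothetical reference routine (not a BLIS API).
 *
 * a : m x k matrix, row-major, leading dimension lda
 * d : the k diagonal entries of D, stored as a vector
 * c : m x m matrix, row-major, leading dimension ldc;
 *     only the lower triangle (j <= i) is read and written
 */
void syrkd_lower_ref
     (
       size_t m, size_t k,
       double alpha, const double* a, size_t lda,
                     const double* d,
       double beta,        double* c, size_t ldc
     )
{
    for ( size_t i = 0; i < m; i++ )
    {
        /* j <= i selects the lower triangle, mirroring GEMMT-style output. */
        for ( size_t j = 0; j <= i; j++ )
        {
            double sum = 0.0;

            /* (A D A^T)_{ij} = sum_p A_{ip} * D_{pp} * A_{jp} */
            for ( size_t p = 0; p < k; p++ )
                sum += a[ i*lda + p ] * d[ p ] * a[ j*lda + p ];

            c[ i*ldc + j ] = alpha * sum + beta * c[ i*ldc + j ];
        }
    }
}
```

Because only entries with `j <= i` are touched, the strictly upper part of `c` is left unchanged, which is the behavior a triangular-output template provides. A routine like this is mainly useful as a correctness check against the optimized implementation.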
 + +For `SYRKD`, since the output matrix is symmetric (stored as triangular), we should use `GEMMT` as the template for the control tree. Note that `SYRK`, `SYR2K`, `HERK`, and `HER2K` all use the `GEMMT` control tree. + +### Step 6: Execute the control tree + +This step should be essentially the same for all operations. The `A`, `B`, and `C` objects are those `obj_t`s created earlier (which may be purely logical matrices, with only a length and width, if you provide custom kernels). The control tree and context are passed in, as well as a "runtime" (`rntm_t`) object. Typically, the pointer to the runtime is `NULL`, which uses default settings for threading. If you want to customize threading then you can also pass in a custom `rntm_t` object. + +## The Control Tree + +![BLIS GEMM algorithm](diagrams/mmbp_algorithm_color.png) + +*Figure 1: The GEMM algorithm in BLIS* + +A typical `GEMM` operation in BLIS is depicted visually in Fig. 1. The matrix objects `A`, `B`, and `C` (represented as `obj_t`s) only provide limited information about size/shape and, for normal dense matrices, data location and layout. The rest of the information about how to execute the operation, including in what order to partition the matrix dimensions, what blocksizes to use, what kernels to use for packing and computation, what parts of the matrices to operate on (for triangular/symmetric matrices), how to apply threading, etc. is all stored in the "control tree". This is a tree-based data structure, where each node indicates a primitive operation to be performed, which is executed by a specific kernel. The built-in control tree nodes are: + +- Partitioning along the "m", "n", or "k" dimensions (as defined in `GEMM`). + +- Packing of the `A` or `B` matrix. Packing moves data into a specialized layout which provides better data locality. 
While packing kernels should typically place some sort of data into a packed buffer (in a format which the computational kernel can understand), they could also perform any operation on the input matrix while doing so. In general, we could denote this as $A_{packed} = pack(op(A,...))$, where the ellipsis indicates additional information that can be stored in the control tree. + +- Computation (`GEMM` and `GEMMTRSM`). Only `GEMM`-like computation can be customized currently. This operation doesn't have to actually perform a `GEMM` computation ($C = \alpha A B + \beta C$). Rather, it can perform any operation $C = op(A,B,C,...)$ where the ellipsis indicates additional information that can be stored in the control tree. + +The control tree manages the flow of data through the processor caches by partitioning the matrices, using the estimated number and timing of memory accesses performed by the kernels. Thus, for operations which are truly `GEMM`-like, most of the control tree would not need to be modified. Operations which deal with less, more, or simply different data may need adjustments to blocksizes, or in extreme cases, the structure of the control tree. Altering the control tree structure and adding new custom control tree nodes are beyond the scope of this tutorial. + +## Modifying the Control Tree + +In order to get BLIS to do anything other than `GEMM` (or the related level-3 BLAS operation used as a template for the control tree), the control tree must be modified. Currently, the following aspects of the control tree can be modified: + +- The blocksizes used when partitioning. The 5 "standard" blocksizes `MC`, `NC`, `KC`, `MR`, and `NR` can be modified for `GEMM`-like operations. Note that modifying `MR` and `NR` in particular may require changes to your packing and/or computational kernels, since this affects the layout of packed data. + +- The packing "microkernel", which is used to pack a small portion of a matrix (at most `MR x KC` or `NR x KC`). 
If data is to be packed in the same format as for a normal dense matrix, but read from a different layout and/or with operations applied during packing, then changing only the packing microkernel is likely sufficient. + +- The packing "variant", which is responsible for a) obtaining a memory buffer of the appropriate size for the packed data, b) setting up the packed matrix object, and c) actually performing the packing (typically by calling a packing microkernel, however packing variants can be implemented many different ways). Modifying the packing variant may be necessary if data is to be packed in a substantially different format or other situations beyond the scope of a simple microkernel. + +- The computational microkernel (which is the `GEMM` microkernel by default). The computational microkernel reads data from the packed buffers and then does "something" with it, typically in combination with data from the matrix `C`. Usually, the `C` matrix also represents an output which is written by the microkernel. Note that when replacing only the microkernel, both packed data and the representation of output data in `C` must be similar to the standard `GEMM` case. So, this modification is most effective for relatively `GEMM`-like operations. + +- The computational variant, which is responsible for performing some operation on the entirety of the packed data as well as the matrix `C`. By modifying the entire computational variant, the user has complete freedom in the amount and format of data which is packed and in the data represented by `C`. + +Between modifying the control tree and execution, users should always call `bli_gemm_cntl_finalize` in order to maintain consistency between various parameters (see below). + +Note that when modifying one component it may also be necessary to modify other components in order to maintain consistency. 
For example, changing the pack format used by the packing microkernel would require a computational microkernel which can read the new format. Likewise, changing the register blocksizes (`MR` and/or `NR`) might require writing new packing or computational microkernels to take advantage of the new blocksize. The default microkernels are written to accept any register blocksize, although they are most efficient with the default `MR` and `NR`. + +In the case of `SYRKD`, the only changes necessary are applying multiplication of either `A` or `B` by the diagonal matrix `D` (represented as a vector) during packing. So, it is only necessary to write a new packing microkernel and insert it into the control tree. + +## Modifications to Blocking + +The BLIS framework applies blocking (partitioning) to the matrices during the computation, resulting in smaller and smaller submatrices which ideally fit into successive levels of the cache hierarchy. The control tree nodes control which dimension is blocked at which level. Modifying the actual control tree structure is beyond the scope of this tutorial. However, it may often be necessary to control the blocksize used at each level. For example, the blocksize used in the top-level partitioning, `NC`, can be modified using: + +```C +blksz_t nc; +// S D C Z +bli_blksz_init_easy( &nc, 8092, 5150, 2244, 2244 ); +bli_gemm_cntl_set_nc( &nc, &cntl ); +``` + +where `cntl` is a `gemm_cntl_t` object which has been initialized with `bli_gemm_cntl_init`. Note that the `blksz_t` object used to set `NC` contains values for each datatype. Supplying values for all datatypes (even though the operation being performed uses a specific and known set of datatypes) is recommended because the control tree is sometimes constructed to use blocksizes and even microkernels from related datatypes, such as in mixed-domain computations. + +The function `bli_blksz_init` also lets users specify extended blocksizes. 
For cache blocking (`MC`, `NC`, `KC`), the extended blocksize is the maximum partition size and the normal blocksize is the default partition size. If the remainder block after partitioning by the default size is less than max minus default, then it is merged with another block. This can be useful for smoothing out performance drops for matrices just larger than the blocksize. For register blocksizes (`MR` and `NR`), the extended blocksize is the size used for computing the size of the packed buffer and the leading dimension of the individual sub-matrices packed by the microkernel. If it is larger than the default blocksize then there will be unused elements in the pack buffer. This can be useful for e.g. aligning packed sub-matrices on cache line boundaries. + +BLIS requires that `MC` (default and extended) be a multiple of `MR` (default only), and likewise for `NC` and `NR`. Also, in operations with triangular/symmetric/Hermitian matrices, BLIS may also require `KC` to be a multiple of either `MR` or `NR`. The function `bli_gemm_cntl_finalize` tweaks the cache blocksizes, if necessary, to maintain these relationships. + +Finally, note that changing, in particular, the register blocksizes may require changes in user-defined microkernels if any assumptions about register blocking are hard-coded. + +## Modifications to Packing + +Packing the `A` and `B` matrices is represented as nodes in the control tree. These control tree nodes contain a pointer to the packing variant, as well as to a "params" structure containing any additional data needed. Packing variants have the following signature: + +```C +void pack_variant( const obj_t* a, + obj_t* p, + const cntx_t* cntx, + const cntl_t* cntl, + thrinfo_t* thread ); +``` + +where `a` is the matrix to pack and `p` is an uninitialized matrix object which should represent the packed matrix on return. 
You can set a custom packing variant with: + +```C +bli_gemm_cntl_set_pack[ab]_var( &pack_variant, &cntl ); +``` + +If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](????) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: + +```C +my_params_t params; +// initialize params... +bli_gemm_cntl_set_pack[ab]_params( &params, &cntl ); +``` + +and obtained within the packing variant using: + +```C +my_params_t* params = ( my_params_t* )bli_packm_cntl_variant_params( cntl ); +``` + +If instead only the packing microkernel needs to be modified, you can set a new packing microkernel with: + +```C +func_t pack_ukr; +bli_func_init( &pack_ukr, &spack_ukr, &dpack_ukr, &cpack_ukr, &zpack_ukr ); +bli_gemm_cntl_set_pack[ab]_ukr_simple( &pack_ukr, &cntl ); +``` + +for packing without datatype conversion (mixed-precision/mixed-domain), or in the most general case: + +```C +func2_t pack_ukr; +bli_func2_init( &pack_ukr, ... ); +bli_gemm_cntl_set_pack[ab]_ukr( &pack_ukr, &cntl ); +``` + +Packing microkernels must have a signature compatible with: + +```C +void xpack_ukr( struc_t strucc, + diag_t diagc, + uplo_t uploc, + conj_t conjc, + pack_t schema, + bool invdiag, + dim_t panel_dim, + dim_t panel_len, + dim_t panel_dim_max, + dim_t panel_len_max, + dim_t panel_dim_off, + dim_t panel_len_off, + dim_t panel_bcast, + const void* kappa, + const void* c, inc_t incc, inc_t ldc, + void* p, inc_t ldp, + const void* params, + const cntx_t* cntx ); +``` + +Note that the packing microkernel also receives a params pointer. This pointer is `NULL` by default but can be set using `bli_gemm_cntl_set_pack[ab]_params` just as for the packing variant params. 
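
As a rough, self-contained illustration of the work a custom packing microkernel performs, the sketch below packs one small panel while applying a scale factor to each element. It is not the BLIS interface: `pack_scaled_panel`, its reduced argument list, and the per-row `scale` array are hypothetical and use plain C types in place of `dim_t`/`inc_t`. The sketch also assumes a packed layout of `p[i*bcast + r + j*ldp]` with zero-filled edge regions; treat that as an assumption rather than a specification.

```C
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a packing microkernel (NOT the BLIS
 * signature). Reads a panel_dim x panel_len panel of c, with stride incc
 * along the panel dimension and ldc along the panel length, and writes
 *   p[i*bcast + r + j*ldp] = kappa * scale[i] * c[i*incc + j*ldc]
 * for r = 0..bcast-1. Elements outside the panel, up to panel_dim_max and
 * panel_len_max, are set to zero. */
void pack_scaled_panel( size_t panel_dim, size_t panel_len,
                        size_t panel_dim_max, size_t panel_len_max,
                        size_t bcast, double kappa,
                        const double* scale,
                        const double* c, size_t incc, size_t ldc,
                        double* p, size_t ldp )
{
    assert( panel_dim <= panel_dim_max && panel_len <= panel_len_max );
    assert( ldp >= panel_dim_max * bcast );

    for ( size_t j = 0; j < panel_len_max; j++ )
        for ( size_t i = 0; i < panel_dim_max; i++ )
        {
            /* Only read c and scale for elements inside the actual panel. */
            int    inside = ( i < panel_dim && j < panel_len );
            double val    = inside ? kappa * scale[i] * c[i*incc + j*ldc] : 0.0;

            /* Each packed element is replicated bcast times. */
            for ( size_t r = 0; r < bcast; r++ )
                p[i*bcast + r + j*ldp] = val;
        }
}
```

For instance, packing a 2 x 3 column-major panel (`incc = 1`, `ldc = 2`) into a buffer with `panel_dim_max = 4` and `ldp = 4` leaves two scaled values followed by two zeros in each packed column. A real microkernel would additionally need to honor the `schema`, `conjc`, and structure arguments listed above.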
 + +## Modifications to Computation + +Similar to packing, the computation phase of the operation is represented as a control tree node. The computational variant must have the following signature: + +```C +void comp_variant( const obj_t* a, + const obj_t* b, + const obj_t* c, + const cntx_t* cntx, + const cntl_t* cntl, + thrinfo_t* thread ); +``` + +You can set a custom computational variant with: + +```C +bli_gemm_cntl_set_var( &comp_variant, &cntl ); +``` + +If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](????) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: + +```C +my_params_t params; +// initialize params... +bli_gemm_cntl_set_params( &params, &cntl ); +``` + +and obtained within the computational variant using: + +```C +my_params_t* params = ( my_params_t* )bli_gemm_var_cntl_params( cntl ); +``` + +If instead only the computational microkernel needs to be modified, you can set a new microkernel with: + +```C +func_t comp_ukr; +bli_func_init( &comp_ukr, &scomp_ukr, &dcomp_ukr, &ccomp_ukr, &zcomp_ukr ); +bli_gemm_cntl_set_ukr_simple( &comp_ukr, &cntl ); +``` + +for computation without datatype conversion for output to `C` (mixed-precision/mixed-domain), or in the most general case: + +```C +func2_t comp_ukr; +bli_func2_init( &comp_ukr, ... 
); +bli_gemm_cntl_set_ukr( &comp_ukr, &cntl ); +``` + +Computational microkernels must have a signature compatible with: + +```C +void xcomp_ukr( dim_t m, + dim_t n, + dim_t k, + const void* alpha, + const void* a, + const void* b, + const void* beta, + void* c, inc_t rs_c, inc_t cs_c, + const auxinfo_t* auxinfo, + const cntx_t* cntx ); +``` + +As with packing microkernels, a params pointer is also available to computational microkernels and can be set using `bli_gemm_cntl_set_params`. The params pointer is stored in the `auxinfo_t` struct, and can be obtained with `bli_auxinfo_params( auxinfo )` (see also the [auxinfo API](???)). + +## SYRKD + +Taking all of the above considerations into account, we can finally implement `SYRKD` (for double-precision operands only, some error checking omitted): + +```C +typedef struct +{ + const double* d; + inc_t incd; +} syrkd_params; + +/* + * This function should ideally be defined in a plugin and registered with the context. + * Then, it could be obtained using `bli_cntx_get_ukr_dt` below for the appropriate + * hardware detected at runtime. 
 + */ +void dsyrkd_pack( struc_t strucc, + diag_t diagc, + uplo_t uploc, + conj_t conjc, + pack_t schema, + bool invdiag, + dim_t panel_dim, + dim_t panel_len, + dim_t panel_dim_max, + dim_t panel_len_max, + dim_t panel_dim_off, + dim_t panel_len_off, + dim_t panel_bcast, + const double* kappa_ptr, + const double* c, inc_t incc, inc_t ldc, + double* p, inc_t ldp, + const syrkd_params* params, + const cntx_t* cntx ) +{ + inc_t incd = params->incd; + const double* d = params->d + panel_dim_off*incd; + double kappa = *kappa_ptr; + + // Note: the loop index over panel_len is named j so that it does not shadow the packed buffer p. + for (dim_t j = 0;j < panel_len;j++) + for (dim_t i = 0;i < panel_dim;i++) + for (dim_t r = 0;r < panel_bcast;r++) + p[i*panel_bcast + r + j*ldp] = kappa * d[i*incd] * c[i*incc + j*ldc]; + + bli_dset0s_edge + ( + panel_dim*panel_bcast, panel_dim_max*panel_bcast, + panel_len, panel_len_max, + p, ldp + ); +} + +void dsyrkd + ( + const obj_t* alpha, + const obj_t* a, + const obj_t* d, + const obj_t* beta, + const obj_t* c, + const cntx_t* cntx, + const rntm_t* rntm + ) +{ + bli_init_once(); + + if ( bli_error_checking_is_enabled() ) + bli_syrk_check( alpha, a, beta, c, cntx ); + + // Additional checks for D... 
 + + obj_t b; + bli_obj_alias_with_trans( BLIS_TRANSPOSE, a, &b ); + + if ( bli_l3_return_early_if_trivial( alpha, a, &b, beta, c ) == BLIS_SUCCESS ) + return; + + num_t dt = bli_obj_dt( c ); + ind_t im = BLIS_NAT; + + obj_t a_local; + obj_t b_local; + obj_t c_local; + obj_t d_local; + bli_obj_alias_submatrix( a, &a_local ); + bli_obj_alias_submatrix( &b, &b_local ); + bli_obj_alias_submatrix( c, &c_local ); + bli_obj_alias_submatrix( d, &d_local ); + + if ( cntx == NULL ) cntx = bli_gks_query_cntx(); + gemm_cntl_t cntl; + bli_gemm_cntl_init + ( + im, + BLIS_GEMMT, + alpha, + &a_local, + &b_local, + beta, + &c_local, + cntx, + &cntl + ); + + func_t pack_ukr; + bli_func_set_dt( &dsyrkd_pack, BLIS_DOUBLE, &pack_ukr ); + + syrkd_params params; + params.d = ( const double* )bli_obj_buffer( &d_local ); + params.incd = bli_obj_vector_inc( &d_local ); + + bli_gemm_cntl_set_packb_ukr_simple( &pack_ukr, &cntl ); + bli_gemm_cntl_set_packb_params( &params, &cntl ); + + bli_gemm_cntl_finalize + ( + BLIS_GEMMT, + &a_local, + &b_local, + &c_local, + &cntl + ); + + bli_l3_thread_decorator + ( + &a_local, + &b_local, + &c_local, + cntx, + ( cntl_t* )&cntl, + rntm + ); +} +``` + +# API Reference + +## Registration + +```C++ +err_t bli_gks_register_ukr( siz_t* ukr_id ); +``` + +Register a new microkernel, which may have a different implementation for each supported data type. + + + + +
Parameters:
  • ukr_id – A pointer to a value which will be set to the unique ID of the new kernel.
Returns: An error code which is BLIS_SUCCESS on success.
+ +```C++ +err_t bli_gks_register_ukr2( siz_t* ukr_id ); +``` + +Register a new microkernel, which may have a different implementation for each *pair* of supported data types. + + + + +
Parameters:
  • ukr_id – A pointer to a value which will be set to the unique ID of the new kernel.
Returns: An error code which is BLIS_SUCCESS on success.
+ +```C++ +err_t bli_gks_register_blksz( siz_t* bs_id ); +``` + +Register a new blocksize, which may have a different integral value for each supported data type. + + + + +
Parameters:
  • bs_id – A pointer to a value which will be set to the unique ID of the new blocksize.
Returns: An error code which is BLIS_SUCCESS on success.
+ +```C++ +err_t bli_gks_register_ukr_pref( siz_t* ukr_pref_id ); +``` + +Register a new microkernel preference, which may have a different logical value for each supported data type. + + + + +
Parameters:
  • ukr_pref_id – A pointer to a value which will be set to the unique ID of the new preference.
Returns: An error code which is BLIS_SUCCESS on success.
+ +## Helper Functions + +```C++ +void_fp bli_func_get_dt( num_t dt, + const func_t* func ); +``` + +TODO + +```C++ +void bli_func_set_dt( void_fp fp, + num_t dt, + func_t* func ); +``` + +```C++ +void bli_func_copy_dt( num_t dt_src, const func_t* func_src, + num_t dt_dst, func_t* func_dst ); +``` + +```C++ +func_t* bli_func_create( void_fp ptr_s, + void_fp ptr_d, + void_fp ptr_c, + void_fp ptr_z ); +``` + +```C++ +void bli_func_init( func_t* f, + void_fp ptr_s, + void_fp ptr_d, + void_fp ptr_c, + void_fp ptr_z ); +``` + +```C++ +void bli_func_init_null( func_t* f ); +``` + +```C++ +void bli_func_free( func_t* f ); +``` + +```C++ +void_fp bli_func2_get_dt( num_t dt1, + num_t dt2, + const func2_t* func ); +``` + +```C++ +void bli_func2_set_dt( void_fp fp, + num_t dt1, + num_t dt2, + func2_t* func ); +``` + +```C++ +func2_t* bli_func2_create( void_fp ptr_ss, void_fp ptr_sd, void_fp ptr_sc, void_fp ptr_sz, + void_fp ptr_ds, void_fp ptr_dd, void_fp ptr_dc, void_fp ptr_dz, + void_fp ptr_cs, void_fp ptr_cd, void_fp ptr_cc, void_fp ptr_cz, + void_fp ptr_zs, void_fp ptr_zd, void_fp ptr_zc, void_fp ptr_zz ); +``` + +```C++ +void bli_func2_init( func2_t* f, + void_fp ptr_ss, void_fp ptr_sd, void_fp ptr_sc, void_fp ptr_sz, + void_fp ptr_ds, void_fp ptr_dd, void_fp ptr_dc, void_fp ptr_dz, + void_fp ptr_cs, void_fp ptr_cd, void_fp ptr_cc, void_fp ptr_cz, + void_fp ptr_zs, void_fp ptr_zd, void_fp ptr_zc, void_fp ptr_zz ); +``` + +```C++ +void bli_func2_init_null( func2_t* f ); +``` + +```C++ +void bli_func2_free( func2_t* f ); +``` + +```C++ +dim_t bli_blksz_get_def( num_t dt, + const blksz_t* b ); +``` + +```C++ +dim_t bli_blksz_get_max( num_t dt, + const blksz_t* b ); +``` + +```C++ +void bli_blksz_set_def ( dim_t val, + num_t dt, + blksz_t* b ); +``` + +```C++ +void bli_blksz_set_max( dim_t val, + num_t dt, + blksz_t* b ); +``` + +```C++ +void bli_blksz_copy( const blksz_t* b_src, + blksz_t* b_dst ); +``` + +```C++ +void bli_blksz_copy_if_nonneg( const blksz_t* b_src, + 
blksz_t* b_dst ); +``` + +```C++ +void bli_blksz_copy_def_dt( num_t dt_src, const blksz_t* b_src, + num_t dt_dst, blksz_t* b_dst ); +``` + +```C++ +void bli_blksz_copy_max_dt( num_t dt_src, const blksz_t* b_src, + num_t dt_dst, blksz_t* b_dst ); +``` + +```C++ +void bli_blksz_copy_dt( num_t dt_src, const blksz_t* b_src, + num_t dt_dst, blksz_t* b_dst ); +``` + +```C++ +blksz_t* bli_blksz_create( dim_t b_s, dim_t b_d, dim_t b_c, dim_t b_z, + dim_t be_s, dim_t be_d, dim_t be_c, dim_t be_z ); +``` + +```C++ +blksz_t* bli_blksz_create_ed( dim_t b_s, dim_t be_s, + dim_t b_d, dim_t be_d, + dim_t b_c, dim_t be_c, + dim_t b_z, dim_t be_z ); +``` + +```C++ +void bli_blksz_init( blksz_t* b, + dim_t b_s, dim_t b_d, dim_t b_c, dim_t b_z, + dim_t be_s, dim_t be_d, dim_t be_c, dim_t be_z ); +``` + +```C++ +void bli_blksz_init_ed( blksz_t* b, + dim_t b_s, dim_t be_s, + dim_t b_d, dim_t be_d, + dim_t b_c, dim_t be_c, + dim_t b_z, dim_t be_z ); +``` + +```C++ +void bli_blksz_init_easy( blksz_t* b, + dim_t b_s, dim_t b_d, dim_t b_c, dim_t b_z ); +``` + +```C++ +void bli_blksz_free( blksz_t* b ); +``` + +```C++ +bool bli_mbool_get_dt( num_t dt, const mbool_t* mb ); +``` + +```C++ +void bli_mbool_set_dt( bool val, num_t dt, mbool_t* mb ); +``` + +```C++ +mbool_t* bli_mbool_create( bool b_s, + bool b_d, + bool b_c, + bool b_z ); +``` + +```C++ +void bli_mbool_init( mbool_t* b, + bool b_s, + bool b_d, + bool b_c, + bool b_z ); +``` + +```C++ +void bli_mbool_free( mbool_t* b ); +``` + +```C++ +#define PASTECH(...) +``` + +```C++ +#define PASTEMAC(...) 
 +``` + +```C++ +#define gen_func_init( func_p, opname ) +``` + +```C++ +#define gen_func_init_ro( func_p, opname ) +``` + +```C++ +#define gen_func_init_co( func_p, opname ) +``` + +## Context Initialization + +```C++ +err_t bli_cntx_set_ukr( siz_t ukr_id, const func_t* func, cntx_t* cntx ); +``` + +```C++ +void bli_cntx_set_ukr_dt( void_fp fp, num_t dt, siz_t ukr_id, const func_t* func, cntx_t* cntx ); +``` + +```C++ +err_t bli_cntx_set_ukr2( siz_t ukr_id, const func2_t* func, cntx_t* cntx ); +``` + +```C++ +void bli_cntx_set_ukr2_dt( void_fp fp, num_t dt1, num_t dt2, siz_t ukr_id, const func_t* func, cntx_t* cntx ); +``` + +```C++ +err_t bli_cntx_set_blksz( siz_t bs_id, const blksz_t* blksz, siz_t mult_id, cntx_t* cntx ); +``` + +```C++ +void bli_cntx_set_blksz_def_dt( num_t dt, siz_t bs_id, dim_t bs, cntx_t* cntx ); +``` + +```C++ +void bli_cntx_set_blksz_max_dt( num_t dt, siz_t bs_id, dim_t bs, cntx_t* cntx ); +``` + +```C++ +err_t bli_cntx_set_ukr_pref( siz_t ukr_pref_id, const mbool_t* prefs, cntx_t* cntx ); +``` + +```C++ +err_t bli_cntx_set_ukr_pref_dt( bool pref, num_t dt, siz_t ukr_pref_id, cntx_t* cntx ); +``` + +```C++ +void bli_cntx_set_ukrs( cntx_t* cntx, + siz_t ukr0_id, num_t dt0, void_fp ukr0_fp, + siz_t ukr1_id, num_t dt1, void_fp ukr1_fp, + siz_t ukr2_id, num_t dt2, void_fp ukr2_fp, + ..., + BLIS_VA_END ); +``` + +```C++ +void bli_cntx_set_ukr2s( cntx_t* cntx, + siz_t ukr0_id, num_t dt1_0, num_t dt2_0, void_fp ukr0_fp, + siz_t ukr1_id, num_t dt1_1, num_t dt2_1, void_fp ukr1_fp, + siz_t ukr2_id, num_t dt1_2, num_t dt2_2, void_fp ukr2_fp, + ..., + BLIS_VA_END ); +``` + +```C++ +void bli_cntx_set_blkszs( cntx_t* cntx, + siz_t bs0_id, const blksz_t* blksz0, siz_t bm0_id, + siz_t bs1_id, const blksz_t* blksz1, siz_t bm1_id, + siz_t bs2_id, const blksz_t* blksz2, siz_t bm2_id, + ..., + BLIS_VA_END ); +``` + +```C++ +void bli_cntx_set_ukr_prefs( cntx_t* cntx, + siz_t ukr_pref0_id, num_t dt0, bool ukr_pref0, + siz_t ukr_pref1_id, num_t dt1, bool 
ukr_pref1, + siz_t ukr_pref2_id, num_t dt2, bool ukr_pref2, + ..., + BLIS_VA_END ); +``` + +## Context Query + +```C++ +const cntx_t* bli_gks_query_cntx(); +``` + +```C++ +const cntx_t* bli_gks_lookup_id( arch_t id ); +``` + +```C++ +const func_t* bli_cntx_get_ukrs( siz_t ukr_id, const cntx_t* cntx ); +``` + +```C++ +void_fp bli_cntx_get_ukr_dt( num_t dt, siz_t ukr_id, const cntx_t* cntx ); +``` + +```C++ +const func2_t* bli_cntx_get_ukr2s( siz_t ukr_id, const cntx_t* cntx ); +``` + +```C++ +void_fp bli_cntx_get_ukr2_dt( num_t dt1, num_t dt2, siz_t ukr_id, const cntx_t* cntx ); +``` + +```C++ +const blksz_t* bli_cntx_get_blksz( siz_t bs_id, const cntx_t* cntx ); +``` + +```C++ +dim_t bli_cntx_get_blksz_def_dt( num_t dt, siz_t bs_id, const cntx_t* cntx ); +``` + +```C++ +dim_t bli_cntx_get_blksz_max_dt( num_t dt, siz_t bs_id, const cntx_t* cntx ); +``` + +```C++ +siz_t bli_cntx_get_bmult_id( siz_t bs_id, const cntx_t* cntx ); +``` + +```C++ +const blksz_t* bli_cntx_get_bmult( siz_t bs_id, const cntx_t* cntx ); +``` + +```C++ +dim_t bli_cntx_get_bmult_dt( num_t dt, siz_t bs_id, const cntx_t* cntx ); +``` + +```C++ +const mbool_t* bli_cntx_get_ukr_prefs( siz_t ukr_pref_id, const cntx_t* cntx ); +``` + +```C++ +bool bli_cntx_get_ukr_prefs_dt( num_t dt, siz_t ukr_pref_id, const cntx_t* cntx ); +``` + +## Control tree modification + +```C++ +BLIS_EXPORT_BLIS void bli_gemm_cntl_init + ( + ind_t im, + opid_t family, + const obj_t* alpha, + obj_t* a, + obj_t* b, + const obj_t* beta, + obj_t* c, + const cntx_t* cntx, + gemm_cntl_t* cntl + ); +``` + +```C++ +BLIS_EXPORT_BLIS void bli_gemm_cntl_finalize + ( + opid_t family, + const obj_t* a, + const obj_t* b, + const obj_t* c, + gemm_cntl_t* cntl + ); +``` + +```C++ +gemm_ukr_ft bli_gemm_cntl_ukr( gemm_cntl_t* cntl ); +``` + +```C++ +bool bli_gemm_cntl_row_pref( gemm_cntl_t* cntl ); +``` + +```C++ +const void* bli_gemm_cntl_params( gemm_cntl_t* cntl ); +``` + +```C++ +l3_var_oft bli_gemm_cntl_var( gemm_cntl_t* cntl ); +``` + +```C++ +packm_ker_ft 
bli_gemm_cntl_packa_ukr( gemm_cntl_t* cntl ); +``` + +```C++ +pack_t bli_gemm_cntl_packa_schema( gemm_cntl_t* cntl ); +``` + +```C++ +const void* bli_gemm_cntl_packa_params( gemm_cntl_t* cntl ); +``` + +```C++ +packm_var_oft bli_gemm_cntl_packa_var( gemm_cntl_t* cntl ); +``` + +```C++ +packm_ker_ft bli_gemm_cntl_packb_ukr( gemm_cntl_t* cntl ); +``` + +```C++ +pack_t bli_gemm_cntl_packb_schema( gemm_cntl_t* cntl ); +``` + +```C++ +const void* bli_gemm_cntl_packb_params( gemm_cntl_t* cntl ); +``` + +```C++ +packm_var_oft bli_gemm_cntl_packb_var( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_mr_def( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_mr_pack( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_nr_def( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_nr_pack( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_mc_def( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_mc_max( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_nc_def( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_nc_max( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_kc_def( gemm_cntl_t* cntl ); +``` + +```C++ +dim_t bli_gemm_cntl_kc_max( gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_ukr( const func2_t* ukr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_ukr_simple( const func_t* ukr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_row_pref( const mbool_t* row_pref, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_params( const void* params, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_var( l3_var_oft var, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packa_ukr( const func2_t* ukr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packa_ukr_simple( const func_t* ukr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packa_schema( pack_t schema, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packa_params( const 
void* params, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packa_var( packm_var_oft var, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packb_ukr( const func2_t* ukr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packb_ukr_simple( const func_t* ukr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packb_schema( pack_t schema, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packb_params( const void* params, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_packb_var( packm_var_oft var, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_mr( const blksz_t* mr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_nr( const blksz_t* nr, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_mc( const blksz_t* mc, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_nc( const blksz_t* nc, gemm_cntl_t* cntl ); +``` + +```C++ +void bli_gemm_cntl_set_kc( const blksz_t* kc, gemm_cntl_t* cntl ); +``` \ No newline at end of file diff --git a/docs/diagrams/mmbp_algorithm_color.png b/docs/diagrams/mmbp_algorithm_color.png new file mode 100644 index 0000000000000000000000000000000000000000..24e94e2aa50592518ee2f9a4bd13465e1e397b20 GIT binary patch literal 28861 zcmb@ubzIbKv@bd$B~lUs(jn3z(v36%0#Z_v(miwuNQoegq|zNjw}>zT(k0yt-JSOt z_dfgVd(OH4+|Thvd6{8;^Lt{gZ>;r9_-hq89IPi;5C{ZEL0(!N0=ee^fuNvcqJu|L zIdRFsHyTrUb!7;|ixC0|2!=o|!9xM-5D1hD0@-{Afe0l-AS8}yjcOv`33L-BIcdlp z^1p8_`ElTp2afW(F5tWTzrV8GHewKn!z%@8$u}NTJG1^i8e=!&``QqvLAp5>Vwo>_ z=W$DedZ^cx3JMgr0S#7Jn*7jlv#iYTe%HyYe{SBqI)$q6$&gRR3zn{Br3OUwqUC=J zI2E$^L?><^PhEz(L2N1cC}od&qlKY?-zt`|fxqvnLGHbavwxa?in~g`D8iR5MxFsh z3X@hQH*<)XCKx|f5Nj?NUp<(~gHTHTQPzSocq%NApEj6?7Hdwr(<$w8`M?7L5zAC{ z*UKNVa)pj8xiyWB@q%;1sGYekUt^0_;8wNKc7=`l9Di;3W}nkcp;(ZjXEFU;J}6if8F)z~11=v8zn^G6=0Yjr zfj?d6FhS9~;Yx>(`HPYy3|%nMyqyahRTA=zy7|QL6ReyD(@}r|A64?^hkBV(k1T3&TwY$@#sRCqv)HFf6&8I@-e_tX&(u3Hv9j_s-ZEhH^!AR-ja%3J1YZ_a 
zh4)6bPS$d$!VMtyb>urcJNT%yyB$&w);B0P-KNdj^1Vto;&X)zZQa$kxjiyMdsk69 zxSS@U5uUTA%;sLxXJ&sZj7j1xe}8Ykz>vUwp?-Mwj^G>77l zCmz^5UE_0Fl6G}-GieV*v$V33eEBlyGEJu2Zg_MQ6;e=M&fnOYg9igkb_Stka8XuP zX5!?8A8(A@bHyXZ4(+hCv&;RVd>DelPus#{+J%iA(V`=%g!{f>lN)=UO1ib{EGBI* zgPN9Bh(a9W_nI1F@LVDT^9H3d2&^?O@a|Becj7c&+Ep;v@VeI{DNKd$Fc?94dOF0? zt63Hb<-PV3Pg8l;XG_l@x9*$ZP1I}}mW7$xm)Bp^Yi7+$D^1-}ei)PohlACymwmVG zVFG5sS)cPip9plvR)`%HX8==jbWugp8>z6OSaLM)f><#B9}+uy0F2(*WfNeys@;iU#mk5xra-YBf@I9OXt zDJdysKA6kf<$Dm?A+Q)s7+EGOIW{_4z)bz1J6|1Z#y6%=Z?4Ls?|CCuh{x>(;&U0w z#EQy ztl&M{KNn$eST<_=x8mOWG$(XO!6#8xm?!p+inGMRAbz;hJKfrM{-YMOm?&;LQy(KD zu%K9#um}&n=ua1y;aQlO0~7ghG}Z=jtdl368KLug+S-kr#IX#Ey_{`tfPtH!F46e8t|G7jbfCTU6>``KctDUmf#~bvRK-t3`UxhELFpz-6DG z;3$D@bNPQv!2h3_`Tyb*(%D&AC}Nod>(=;E>faV9LFWFu;|9LoXlr-uosk9l$Dhwbd`waUb`9o_TkTF1`Wz$-3V{p?vfBBok7ORP^%g}xRS4`| zect$1=2IYlmj4^byPCJlJ*Ibqm(O#Xm=XsN8doipxELrOAhvv|Lz;c8YDBaO1W(Ax z7tgMkJrK#xzG#xK?|T~ErS?3-nhQhU`1OOm_c;hp5h^?%F=`R2=%KMcTU&VOL7S#c zv$#pUXrUeVB(A=XYdcY3fLt84457Gq{j338bZUC~p7gwtO1=ipitoiu0w+;(Q+O8X zSxwZkKn0r~fy28H;}`5SEif<{v)*861r9wXokt=iTCCuH#$W&lvWv&wV+MY^U}OcC zo0^I&dpnf1Fw0Jg{d31H$e0u?13&<94{h^C&u5>BZWQ~@!Ptt=MCAQ210f(K>((V73pZKz&J^V;!Xq|<0DF{#Owf}ovzFi@+ zqhw>!Y$Nq@lIKv%{Kt7T>cChoXnUQzXXU<_lL0{7-0JUa zVfazc!3l^`JZXqAO@v8iTlAO6!$ub;&IbX;XK^5O)lEdGRA+Dw)V2Pc#TE?w|CodI z!wVm$yQjg)578V5=M&iAp#}%>o-TyoN zLKDEH5((RxuCs|+brpJqiMkZ^P&z0LX8lFoy=9@a?nmf`xbVIVX7;sjjOV-qX`VRi{Mu zkCKg!#-Fx_c~bgcA20{ivGrO*PA+`*y;>kqX0J4;;7gu{I9YB@1~_x!<`OT$SakoI z^nRdc=Jm|bzWCJ=wo=O`V8<2azUDNc!?AJs9x7sN#F0KSoazY_- z{IaB>LAOd3CUkM*2l!@a$M?Jk(56{STU*o)uKlYUz*Z~Xmg38(i6$QE+s-B|P&WAr zA>ku2G4cBLQA{9aXb0v;DN9RB3BWUkcVf}^^70sirD*@8@P0yZf3Vv)^=?Q_7u zR!XY+f9k$ZPZ;N=Un9Z#Sn;kM~bVOLjr`waVZRqMWmG|U{ zI;&CmdtsY6lIH#`&(YvcU(5x|W&jzaCJmS5nXEKv(jo||J8T}=2p+TUD{9jbENANl z)9ZAicyejt*!)LXBv2(d!KICKYwayBby4Q18`I|?FZe=xkgF&v8_El zir@iSA}p(n5!5|in{kGnBYJ>%I)FWwmX@YC&pOSGZZZbryA+g^NXyBE=A|)i;Q#OP z<$nc9rqYr?S+(c=33So$I5)rO*4^Evy{^r%Vi)I<0r5MEgoFfAu5V}lmklD1JGeY{ 
zY9^cypPYc(17tF*s0iz9-rv8!?0sd!&FQpL2M#}}!RItC=)M-LCSfXv_9-*bYE78e z#cI+}oBv(fHKpxrBXLIv&K9?*J#F;U-sgt?Hz#QZ<YaRLu`kzqMWUtb@%B%tk^UZodimrg)* zvP%=!b?mLut*3!-0B``uS=(-31QvCN}b21$HWK#dd40jx!~274|Z zcq}*yEVv2*z(x5WG1o8Cp8AE3A%d>ikW*SMczzF~gA2S52+A)E4|FtFiHk$N*v|D56mrK-uTdpWl(n-t=(v!lrm1 zT|ryGvYVTmhm2`lgy1V(IC2vL0Kk+Q=V$ZGn4sXq$6PsCV*@C72?04548M%93dmPL zK+AcB$@kB?89I(cqQovOvx{SR z!0y^u1`<$BTRUHjit8OjCy~9ud{9~~$9xb-DN=x58-xJm*k@iTVxjGk_OQ&kHBpX@ zjinr<{7N&^MISfnt`C*cr~9jwyuouznCKI z$&ZMqUsfa7dh-kk)!P4y%xyyzA^b{3YU*B%$hSX6IFby0yq0zVJQ+}bn;HOpA;twy zNdWbh;wST!k&$WN%3|O>^bkow#J|b@AQ2GA&s;B$DlkKA2X$YqTvg_0YeEZ>1a`uS zMQ*J8?A}?@%&Tg%KdyKI#-vF9K%*5vmLo;Jg)?iXJ*WB;CJ%R0-Nzv6Qk+*Ubd2(_ zV0+lKc4P^l>i?L_U)7 zmX+b6SURjczTcxG7fB#lJ&g)s3qwCzcW97l7e>n*(2oDWs?j4=E1IDB=FPfaGBI(?UUdd777fU+$xOy;9Y3V`Z{x)$!(h+# zp<|kxuLsrJo%AK0Tjd%@+KZx(T&eysHIUhaiK@#@N%LrWzsWQbs^1&6LmFT7!|BOb z5@!MqimzlaX>-Vt;lvOcUlIuUd;}P5)(~-c!E!Tu#cL#=HNmGuP~Oa%2on5u`F!~w zJQqoQotX%DMCS;;BlDO#(NdSE-hYrjDO5S||7?$*Ab^p!f@zJ+%(zgQfR0fF4!~|T zqR(aM+&Ub0xB^kVlSPLO0$R)^ac>prV&q@dDTJ#GJ2mJ(C9Bs>MC*mgYInTwJMW-2 zIoTW&RA6~wJ<6oR_E>{*s~|h-XKs>VO4XK3WP(7)bED;EWy)sYs&(gYYdQt=~$g5bsJIN{Fg_J7O3-9pA=@$O)$lerG4brAD1zy#>odGh2* zq@Q32xu7!xpWRfjaw;F0K|)}vh^O=1z@zU&jKR_XSHHKmimmbi@Lu2E-gXj3Dba5j zUT672@oho*GWb96p3DG8%Cdll_A3DO94&tCj~ql6+~`m2q#{zLck&cCe!Y2-t$aAT zrQDuM#H*VK003AhXinueAo!sJb^0U-z|m!c!25l z&%gw-l2##hXpG($P$C&FQh+p}BOBY?=5i*z!bbEh%pP%BV}Ji*LXeE`pOXdf3}U`X z-;nJSbmSqztT*#fwmtQjOc8=1@t`lvII9>gr#F3LnZ{2ms_}Lqn3S0y{MU(#29P1Y zj@cf6TDw2Gbz+g0Dc0y%?l(#(^9b-+QN|7~Kt2CVqL6{h5mLWf|{oCaTc<20) zuOrK!vM>QwWORUm?wuj7p=RD0>o~objg5`kG5KVPU_a9XU~?u$y$9UBnwpw|(o(R> z)?4HDbzA73V6b)5l-i6DSze}3!lKRw+*S+U@Yhy}s%mdz_hJspWP?rJ30TDaFb+gM zHUc%PePHV((3lj=CU~tR{vfyy0M$nqG!a=(Xau8z*m^e$=l}o$FbDGpkY%BCL^Os6PQge3hv)Rg~EyiTkd*?1G3pPw)D z%-Y6g%syr=Eq@T{mSFwda4q7eO$g}WiWCwaM;rlRQPd$4)J_AS*CRu_H(s}vb~hle%gFx8>mM+-YrnQ9uLPsEDnFLEcCSOf9shQ9KsGu2);IYFN&1rSvwIf%i;asN+K2ZY1?c+R==x+D}oE3Ylq+ 
zA1H<>&_J~nD%0|xAqBOs?9y){`hLQ23Nm8?O1|Vwp+NzM*vG5rIJ6e#AwO_j%c^6lQvq1XNo9%ve1ZZlNhAY60;C$SAejUC3Y95ax-J*70`2s~8Hn~HpYnmP(Mab# zRl2D@;jz0m!-;Re9Ed`Qw8I?O^1z(o<*)5n3}8?I3?+q}r!&@>H?DaOHO&kDl(H3Ntd9I8roP4o45Lo$53d#t8rqCg4nvt%H~CT*#rwMs2A*UV(q!Fw`TUKMQFKvJ zQT#NM0b^x5@7Sy>xCNS`$j&iVV9dYy>Pezk_F9Z9UV;h}b-}F(8cj_8En$HTZ1!v5 z$W+(Wy-YNRrrDSLwF3fh?7L-8Ehu{M_^@lHgVh%DdG(;Wa-{6WPu!M|99TBeYHy-* zMe;{;YiqS|kUl(Nvv#@RXa0E~9!`H@2H#@s@~$6*)-U*(kqtq(_}n)uHqiaG){}z(gD6U`87_YJ{uz5Y(x{SGwc4Jr7qRPiY(K z-1iJ$I85gES4f{>_jboUGX*-3`3wi-Z)@wHOP!y*XCBM*c<}S`eg}WxiR8?NiVJTn zVsnGcbx%UR5&SsrAyUlx{>#}sFXg$i(AYu3q1h}0gv;zVJ`jOArQ;=+5eY{&Z<}$FIbJ6KezzAeqY(<+EkM#d+X(wD=xd}EW7 z;esw}1uYZ+S%B5A7Supc?oD9_12)K?WbTf28>t8iG<$VLlOxafUTv_F5^NreKR+Uf zS=H^6#$QV|N-AK|Zm5wMulI8#_gTgPAaZqe&6x@5`VU2gMEBU2m4FT0uz99ZsUp2C z=WThDBA<`;aTs6$I#O zz1K3ap7fqcA@;c<`Cw87Tk0uAplxSaP7sIEoO5cjkcW}`EL5k;;xjO@esw>~jI^7s zH3o1CJcNRxqMx7+f#l!dx_`d%V1Ob|@(U{h`{n~#sG?U{yfdWjB2n+)v`y>yRzl|Y z@&Rc|kZ{r~%H|L2s8QU7xP!82ZNOc1cU zmLc_DB>r1(|8%n~!^*YBc{d%$+c;c7mNTwBf1mo!2SWC(cCiq*vZFt-UJrBB|Q%6RWO ze_oBK^lqJ99cUt`Dr(PnzxLhfqJ`3*&1N7|CPRBXIrWcC89dWJ!Ac`jhWMx+yZ-nZ zj9;}~S=l~Hl1wz`)vV0W+q)#5k}6FW>G;g^%<%CJVz-sE z^>9zs9sw24Rd?s=gT zXLp$ll-YUg?2sH>7HL7rA3|KC-@+iI_NDC!;@;TvEaX$7t+Fn`2vDg@~*eIW#*2p!| z9f-&L%3L4hOzjIX*hFJPrKg-_FhIB#xol z0}^m{HCEmQY>|rcZV>P(3B0J1VOPuVNEyqfC3Nk&tZSYOJ7gC5c|nVdilj6(HD%*s z1)@Ib(FnNdX!G3}{ybdqP4(KcXCVxov3KuQW>lI$&Yk59$LRZk;<6i_HMchXtSa-qn-&FIz{kWzb_RtbG9^9UF+Vws%pONaC)INdZ)=SDvx zJatO70N*Jx2eNz9^Eoec^Ua9`G>dbU)2SOHX<@Utyz3<%rS6ly!1Xs*4+t55Y@KYD zy)xQ#tF!Bz?{5@S~h8;P=xP0Dt}im~x$ zs>83CC0T1ZwXANsVI}Cvw&PBJ-rvNJJ3DML_QJfPrBuPJ7RnMaA&$jXIz17}KV};6 z7RN*M9#ukKhEjccmf=A#Vd(w2(Ty>Ds@^`=D_gy?X@_S!uI-1^o*g8hd!&pSdZOQR|>D#727vEN8GCK*T{mfey7W~frG9_^vT*x(sm+P*WPdq zW{=&mUglUHzq&mC$(Z!q4-s@jAPmGw-uT$}B9U4vLd!lgT+4JCj8$0+>5RIG%1+DA z4J-_y&G^t>L|amy#mko+rS*1eb~>k*3_k zkHMpGF`o_xSK`>`(EZ!%qk&53)_ClncApt1_`NU=`FjxHO!GG?qG^BvI&k&2%9?P& zBXV+b7(QY~!!a~prz%ZDhd3)&h+i)732R{N0a(T+=lc`S|L`yz;mM{~iw)Ta*CjVd 
z4+Qb`6Er{+iLD9XY8yZ`RqUImcHAOKx#{?ccQUH8Ir#LXKS%DQ>W$ie`nTi|1UYbR zxuN?=8w&J7D0?rn8=*=v6-j(?c>D=7MtAqzqk5a$?<3jlw`jTd(pNA@}1|V6K507`uZMjQB%Jg`(#x z`R|}qW(myK6INo3HA6At&2EU$D z3M#UXKgk?vsHqvdj$twj0yMPZ=`E77AlQg-<@80d5s~Fd~4N+q_UT5-iqk`TW zVQ;na65v9e$s)z2g9HDiIlj6&lSf>&TBdv8E-=~=%8?T7sX4&?Se6N}bY{sNNNrS> zzX!aWCov}76=wB7UYZc74?*eCq?Z8-=U{$qTy60_vj?^mTDDv?Un2_|bbmpTLdf;^ zx^;5b?akG}MB6PfXY;;Pc{KD7H+*Gx&HC)MByaJz7?SXK3b5i}NuRt7!RDUbL|iaN z(23oin`RdqKlx94@oSv^AAqev=r}zJEp(7U&9}aDK>2{WwJo0j1uiI?MZ?n`9-l(< zaOXfpi&V_Vqq{T*l-+?pVdl#F%JfITJrKvw{4pA^=fWTcmAYt30sl?bvo(Yyd!{xUSs5p51DT?oD{|MRMU=0b9hkigdb1svR9 zQr~A01k8--osfwwU~HiF!klJVDd10WX0n*1Vi|ywie&jHC7@{5NAdNOY1~X!B0nt} z!L(f#Q1uk>55}Z5tl$@+QtC^Npnev&ri~E5)hgDst6`O#M;vuh%6W=?1yGC{)LwtY zAY40lpkNFXZw`r<@S8!P;Mq_AmJuk#0*3_{2jJhaZmxgjTS}1uxbl}pPb#fq-ZB{^ zG4a-TF=qA*xc3icx}$JtmQ%|Z-2cs%yt|mg({l?R^}$8QZ$OakytmMO46J-$x?Ag- zWV(ZnEno=%;YJ#wcJ({1W_v`L70e}|Ez6h98kE8#%V1Os%JQ7Q(D&YSE61LMKGR#s zYG@GL;4aa!nPDOhF?N+JMcS-8da9vwpL^cG(X-ecHlK5&&sH$x;3@HX<&Z|H#%6<= znT#N_LxEVNHD&R$fr^!6g3dWZnp~e_=;cgC@{|R;o%{~FpP2%sA=}0v<<&RZeib7} z)lMHRV&cjh)~;2iZ{@qmOBb^nE~MC-n#Ipd0nFR~Olw|} z3J;a!zyI6)iVJ)$CM?75;D;QNtZd~`Ia%${czoc?17r9-D9h76jd(k!0P3c|i_lFp zu9^gulNo1K7@LjOpqo&-g0?I-Vcm4P8)tQ4*X6>{q&^ z7SQ5f%yoTtdbWTE53G~2-5&mozdIua$EPEz;xYj0%5Sw|!(Hz(rw(@`Zu67pNpkjq zckx|_sjx;Q4YilX4J9El& zb928T=sQcnO#%W8>f67WGAt;Xlh||v6B8ecoG%2)kk^lZ_R%SVKhIhoH&5@#g4;lq z44}b;d^wdW_kxoPIKG>LgF>BELatCo@9MB9sPX`e0wj7j=nLW^2bdY&mkXmj<~{L| zX9^S4RJU=vmT|y4XikqiR#r-6Yj5^E0u{dP>a%VE4^E@jW(Jfx8ca}9zriX8!4Pb& z3!E^B9J0eOa9IE}Xfgkf9>dOb{>e(6MCB=E6bOKs5C2){F8?2eZmTio7|pRk+CA_0gkPa=z3);CLG*JNB>q7TF8@y}2M4Gu9qWXY(<-Nh2LJ;wr?_ zAv@z32*RUl!kx6kyI)NGl2FX#U{>kbJottdrhtpAic{d`w}|4{(=pgQg3NZH(GxZE zfX*o3jDmkbLG6KqE^zbYYqKi#cFv++2NqT9mTI8}NrDDa(7t<}e1q)J`C>o%gpI=d zR}6f)J&|1x_C(>$CP<{irTOFE7!(R9#UIolUUdb#FfPaSQB!U_-?0C9|A*)Hg+R0@ zHK7gXk2G&To8w+O-}vwCO&~D++A>@D9%-x>^8V(+K80>~v_`em?`~!;Ow4o$`2Nuq z7(SzHABo1;U$%L|we&0I+{2XBPsd-ke@kJH9EFs$Sw<}w;2&J|eXK(eK0Oz95bs_& 
z3f&KvPq%?1MFY~F7hmYv9U%G9WS@5KWLhatnIQz)QJ;>Ja$@&}dU}T#*$>@IzSWbjg3u z-P8y^SC26A{0zTFi(csSUlnt_KXvwb%$mLEEb_mV!og-eZtadWZr6)tP&UK`y&_+| zFU~s#o&smpi30SP1-8?qq^8C(7SpXgL!!F&$Xyb+DGuK~!xp3~|H(aV;;-xUj`3`{6due2Va-UEK|4)chtI!fK)6VecFk-8V% z8L=oI%qOcG=}1E)kJ_}CkGbD3mQB_zb?4s&Ji9#%ZAQ13xN2q2JHy7unFH0|4Oe7w z8xoNpaT)hG47jwqDsohQ1}Uw8sqnry91>cZ^vQQ0Y_U zP>8);Oe@niL{z%ZX9QyGzGli{MCG~iMh~#yWubZUCn4=rC7)0wKQ7QS(X0y$FFBzBi(i&W7 z0q+fbw^{L`jgkDV-4S3r0{<+P!=s9=SXWn8(tYsLeSjbll9ln#d9H;+fJK)s%x^`b zHAn4r)t4bU?RrmyY&D4JxBbv(^*8&8XOUz)uQfE{hWsWB+zli-6&;wj*kYF~Lj2*t zIRsH|Ui}jb0WcE_X4699upH2*`VsVeDnOKgZC{==97XGcSoBk=p-V3W`lZ&SMd}&h#*T!gl-e_$lI10s7N^FfCN}WO+ zEjGxMoBR#?^F%Du|MG##6JM_`=pT!=)AVVs;KE;O_oMqQ#XTnw??Te!ohZ+$L%FNz zsvL+X$DJmkUUyAUL2!5O2&h}{+`6vJv%a-o4Ss)l4m3oa9Kihi=PEAo@|ju&lePhGBw9G?+RS?+d%p7^h!`#nx= z2s~MBcJ{VZ|27NB4c9cces@RUeVSJfC>;sv^YR>#`%iAyE#*FQbcW)nko&92UWwlA z-z_wH-;O@pIe1vIDhsv83^3xihgS$<>!5ViT>L?%$2t{B*}E?*g*9w$2V- zGHVz_JK*|#1go;&`*F`ELkU{;5~HmvYS3%u_VuDV5uok9<7=w??+muC1QyQ;M_7zb z=CB|?oc>fa$DfS@x7HN+f4r;Y6Q#d~-&6g(NV>s(pmRO zL&0q?{((h#m9tNt=u}#YjhDWSxNM>t^1ZD-AlXcGm>-U)ad`2)k?}}PvSoIX2A-Dwm87HQ+_xb%sv+FG=3r-gvV)Jly;7^M}ujpNo&STv^Z2q6D0Z z4cI0nI>)H?D;c9WGhVe0_U-iUvfAejS3?uSyBKr8=FT|x6h%+M>r|!oFyxod6tSQ{ zZi<`_&?}JRI&c3rn_=D12Gc8VU+~^rkvMwVJL2x~Jn@rcOWJ}=liCi3QRW#YL z+IxRv-Z+u>F+h&{^OlzJa+~(glhn5Mi>i6^GrX@)tv%xcLG||^_k9SG`l90x>l()N zs8*9stp7NfAh7ebmb2xOjYFUu*Hkt$M(bi(^~&+8k19S1r(4OhjJb5@fW$DS`f1^_ z+M0Cx!6>`#x!$72AFO5J-r2)NkDdd&2{Q zK+Zoep?wUa2EOEOb-p39b_e`2a#4bMgkgW?rnD|=?^)Zu4@AD3`5Ij>%sN0tU%4A` zWM1<#Vy{xLen?Pb#O}Mc-}ydm&7H88LvKPr3wVQtqF#sJlhCWBBm6F(D->)NVRR4= z5#IKh%@4`N&Pnp-MKEfh+J&oNx=m)MKeeZE5wJcM{XqOFBW!-J*I@l`6?vQAX;A$k z-kcvAb?em=78bcc|HzTj-pUTy1>E_p`os5Yb6B5psgZjV_?IbL$DIh;nMH4AjzN3f zJF1`4{HDJvJ6z9V-0eHQ9h!o^2{A8!nI56q(7Nxon1uXOD`20O=pVj_5--F8OB_Ac zyA-s4>-&7FWBNcs!{;S34n)rO2OO^zG>80lgSRrR;}u&$_~+%6V8?Z%+;3VnNcPy3 zso!sItnP958wyd{^82=N*DrISnKT~&-I^cJEhdNL%{oTgUm!DqzxZ^3o__;Au-{3} zW{|ZYguia-Hd;EfS5gZb 
zDAM#*Pt#2ACcSEs2Bp?CWh3v}S=ve0N2>&A4(cy}XNc^w4(>31l`cdKsdzu1bUxPM zUO3&{dTl=I`jMJ^FwhnFEFV&3$d-1;}|Zdwu2Yi*0RZlb~)R3{J?YzcCrPUsqGpDOz_i zf|GJ<3DLP~S9~RMmQ~%WCw2>2?iZ<>SOvS@S!hmpAMRaw>kH1fcymK`TzBrLdpl{_ zQFe75Q&WRvD=v=Tw|ID?w+sIvX%0$u<}-d6ZMTXeh~|rqu?6a*cfXRI&TrX=@X~+X zT%L(zjV&*|CkA(2nQM-7d)f50K&0VGewrT#$ZRTA%58o(yd$^MV4Lx{%nJBP6LU1I zy>EnWPdGrj(nJM!z4njZFMpz>^+8tlbtCjmcHEk~mo7=<&Kg7XD0}4fStL8uSx138~lgOq!l`(qU zZSDC{2-`{z$51rgUNP@p0MKN&$#X_Uh9!JG$MXin?c60CC9U1C#8U-#??=U>Z0MH9 z`ohHD{r0xkYJ$%>Jn&ekBg%zN7ph%2q^8F|TTVA}+qbc^7Y|X*l|7il0ME=$WGD++ z=ZiMuT(w@K;Gv22Pqw+M0$WvV=b-J}@lYW@N3$#vjoNc~og%?zV46Jq13}o`6MqD~ z(hH+yj0Au!ez))Y{3>VBz32tIRI1@M3&gCC`=U>4?zY_g#qKYkYFVN*Y`57xRWQjLnZA?%`{L=rikSU$ri>atBOajgh5P zE;O(VJbfB7d>bXoOvwDShtY}aNlix*4BBRifAtK1#IbLUTvb z5DL+qZq18jX|3=Fjl(!IJtys>kSS*>)2M9n$1zhv@Wg-9fSJAh?Y5m}_cH zfqR8Uh>42joB_`~U6Mt-!Y-#6GWE2L`1?18wuO0~Vk$sJDf<%<+uKXFR`-K5m%~O7 z2h+5BGw55ohz*sY#ExmlU3Y){3GEgD=(p$o+IbY+PTx$U8TX(^5lIy*8yMVqO{$(3 zS?8J&xRlpV;1upCn`IH1{d32mO$YO;{0x-cm35*N{JzwX@p2rx{p4q>OesM|JMfb` z^1JU%$jCeV!CpYc{~+O6^iU~1+hcvtEzV`b;3+(&uX}|8&!5vJNCmg$6 zr79-jarO93IxlV6i36o~1|t{icuNZ^H7_UJ-C`!oGpdl{!=2%LTRE=>$O%sxP#kW8`WI~2h$E1Uw2dX zv2)V#3+rdds4_Gql^rP>XYF;*bRT2RU;25JXU*8fz*;WgMtbY8=I{dEEpt)LteutZ zV$}b3HvH1ra8D}Fwy)M`9@`VyO&HiMm+5X|ALrkED{_0#c>1|m+rAQlCw}m-ueN~& zBVZgyPQ7%K-6QRDYjcDOEs6YZx2_!BLoW@mo#Ud#_Dp*0U|d{n+E*Gj!u(LSwQ)B# ztT_e*T}&7)!hkfnoU9E80z}C@ygO&fw@B#NUl75gN)c_FSoOa?3QE4In(1rwK8U^x z9T2|}SVegd-cariO1+u8H?%6Gf{Ys|4?=PMa=(%_m<;)pcx*gK<2rmZ8^dn2v-hJC zjt5t+x^Y@)E?c@lcV? 
z^=zlBmvWD5kwtpEyKQe$x25%*iI;N?$~>v+WVx}{qy30sui44e3n^qDWjGJx2TO^G ztulG7+w&2_?MYTdj?Wch3dC#-DHCz!Lk5Pq7sB@EOy`DkIZZ~}FAk-l+5@k=?yHc| z+qv9EuQl{KG~(TlzD>-#UihGLy_F!oMa4Opg;mLJ5ZXU_WOl_2ebX35P%a0%rtdis zxPwt!Whn6~%5=9qrwck*Y5+B{W8I#{*k=^}CT~}Z=k)GD1fs|ilMa2jo|o5AwGcaD zeVq_a<)B;jq5!y?DK;2iEiCO2uc&&VeHfkxj}v2d8-De_st|YSvK|-q>B8B`SM(UL z)LK6f$B(@nIvO?0^F%+3Ok15OQj&o78NkUocdV{^PDlkU^ktdv!&+_|K-qElnYmE+ z+Cjc={f$Mmg@7@*ShsDdov3eaO?vTGCLU3${%^|QK;OL|g$886%X?pOG_Jj}ShfI1EMy6m(19C%_9 zau!CXt^!HQdV4Tht`W5oCf)d=xV0R}jjcJu$Ys@{;gu)%0(^~je-1`0I^L{J);sUn zX}lIY7QhyC(*ej`*XPU4A>jb1?s)mlvk)qGlVP0)3T$hq~f+Vs@59cSgM@HnM z%(9i^6Hky3`w8#*auV;`d(a)VK)uX$E6nqPeJds0ZI(Q}GTZX_a#G{7x`GxQsdG^+ zJ{~GBj~m@+9?gK4ibkn_y^RxO0)^Y0fxql~KYzJ!4l`sbYPBAH;?sY7c3=6hsjyW%Mpj$bROwv9&dc;SF@F3Ik8S2r2KcC`ERqCTp~rLrQ|-D#PZ zhx@epz$&T9&y|ER!$r-oCsaFX&~<17)a{APl@NJ%^T$pr_HS z+K3N<65!ywrAM}>A&=2IMa{qC(~_1rTd^eWCFGH zKBf1Mfy~dGZo#8y1h-Y2nk9y(`oq5QHw1bxprNSK`PwYSKm%MM`7c7fX`}_>(w3Qd zs4tvza$bDK7fH%fCJgZp2bWpeEFk%HvPnuwpK<9EUb+L8?6e;5o1GnNIq%rX@BBb? z2ObO*bsy4@4-(p<+9IZQP@^_BZ^reswf}{sXrrE02@Fd>54UU}d1Z14P)< z*>bNLhBxkkalwLVm)Neuak9AYPJpqK#T#6`HC}a664g!REmhQq&#r0R_(p zEdh!yP$Al1&yId#kWdXwO5?6bGSGe7J;{w=^LVyAno}6)#-csrYShJp6;1AkMd9wj zv#b6)&u_A|b~z+7b_^n4b6)C<#=V}%)2qvc_QDtMeOB&sSSTg&&m8g#rT<+N zWw5SYA+#&*NK-5nMSZseevJqy#8Gez!`bf-I9&<>pS{sC#EA#|euLj%zvzJKu7cyw z2hT_i(4@SO#UE5C6xt$=N$ZL1XrQB4AYfn?su}>j^KDgZ4?~P2%cz*~!3i3ouhg@* z4}FG2!_BW{l1yiNw7Eg=rr}$G=s@=e$ci$5-84@5gv}-7d>4k;|GPCk2A^-(M!&Xl z69py3F>BsNuN(s;5M;7)9-P%jWpJ0aO#X^Dw^$o9^7Cy z@Y>lS9Tac1{pnu*P0Ky}JE*8bz^Ax?t^|IFVAOrkBYAZrtIgTJ#VHc?;fCEwd}&%~ z%qwnK0(4eDz<~`=H2@zgaCQYg{6(KbTc3+RE}P&|;ZnC?K4%npXdUwJgdh5U97+U- z=V$RPw(f4*aojC0n>hBL9>%P}nGfD|WNU#EHxy}JKz_tCYjF9T@8?Xz*d8XD8`sH^ z@YTF6b^4%>kJ|VD)O8h5QGVNg1hD`?Km;WuBm|^GQV|3s6j3^)ySot-q?HZ<>5^_~ zDFFp3>2g3yI)``&tQEnO~QzL@Wv@9cf{FSdNz$d5eGVT1lSJbHD~iXYY? 
zrXV1u+I9dN;}0tY#OaP`Sn&0Z_Er=0?sFuaYjwL1nLrN;mO2-( zKm-Zw$Q!aBG0rw!oc{@s*MoxtFzQItnA($64n&-elFKV9W+ABjUQM~sS+YRd)A4nD zEtKU*#pYtz=j8W=;WAtKb#)Xlfz{YUhx2D9$HwyRD44G#fmGC-CYG6O-2Srf3WmbV{KVk9Xmp!$9cbsJDgeZ84oVV^nPwCE2a=7LDAK*eWy<% z8Zd@cC5i~wF)>Qm%r&+c2oIosO;g6D7xR+GTnMI(%bmvll&t*oOPmG7@~7x z?lp>=hsjh|2e+STZmsBYBDEaWd;k)D=p zu9aXO`2BK`L-s!5%&yyzjBu-_a^N?C{JcwIHG1c)YJS_vZ%_%D(Nicb+sLVrC%c?l z@0`)Zg6JavOp=xI3VLrc|K{6@ttEcvX8FDM(x+1&*oskGPrVXjR|~mnT`$RhM;4fU zzC!DDN~=f^S0|Vrv{Jw^y*E;tq*4h_2~D1AOdcmM!~sMM6bnzsjvx{Pi$Q6Fj|S{W z>?Q>hG@pNl8+cn?)yt-DmtnP8#_Yg4SO731kd?;sp zF-gJg5ysY+3hgpU!53+$0CM1x1O$!j|ZAJcyj~X@xtM` zTe;QAk&D!^ar;pyCGRDa=D|tGUVPQ@m(P750>UV(e|<}oLU9~ItYx76BWCnj3=-_p z`@HmBqeL_oya5zJ(FE++p2hB{K@937`G zjG!P&ccHKs*#(uJVzbptgI-4-@(12O_>spKm40>&7kE>Ti=pwpen-0Loa$J*TGv%C zQ?ee-U{5Ywrd<0RQ^_ZAM}XS-Tby1Nw%Rlsv*uC}K5* zu_=?>dFOeJll|^bU;h~6=&;s;!Kv1&18v8RGRR8*n%QN3Cy3bE_ z+QG8l#;YJFgray*$jvMaPYRS1&#H(vu|uYO*VH&t#_RC@}EIy|5|I~^gEB*dXo|N zeaFQQjGh!pfPIh?!^~z|J2{0swY0PZi!*FctgC*o0LLnT?FEIDNF;%1%F76T4L&Wi z91gY^$fM)a50q`Xy^;ESaEvG}0)vgzMJ}~ta=<#XNpX<@ zz`Au59k}D+tQqJx7^?XL=SwP)gqH5bg9i^L^dUd9tNHoKD$0tOj?@KHDaGx{Z;9(O z&Un7&{(GSsKochR8xfIPWxWQRUcb+NTv^Mjxvk41Tbb;>Vkunm8(*MDjG9=1((gyX z_|N2N^sb5kaxT(#%eJOg=F!(+->PN_#nvGeo!8PRE@jZAzZkWLfgHRyH4tYF#x&s_ z=}u)Nf|>dK^DU(xuV;XY^G~Mmx_5zP1dCuCeMEk4huUYH*dK?11-+mV%-LY%YxLg5 zy1VGpPbt-^Q_m1;AsS=-?*tUWZ@PO#&QDW_zH;VoMP;v#OolzD_IxE|EZ4C+eWGx` zke3rM72?Gp3T5x`84lWaCczVcV;)otkZ)7sox{hS$z{S7o~u>ZimmP2+B~niaTGhi zGF^5m#0pB}}ry%4w zT!(gTOU5NH)F%lJ9jqc%2gevL!AZVfC9bO_JVSeBx8#p7HH~T6NrZ{+i9nd_^=f;X z$PZeIitlV=AK!6sSbHF~pe?V`)}w`%TsZYCPPS9mtl1fT7==&eOYEOSrZNR%kRyEB zS{H4+QYa=CisG~UENPr{5#h2=UDPUqZkwvFHaME}|`wR#kKX8WXPF}5%LKd_5!yB-Yc!s$VO zJ6udrsRyqBvuGi^sJ+L6uP}yT+5g_Gzaw^#q0~f?Sj#!QkmQz^WpVPOpc;K}&E}Lm z8%a<#@jljE`fDI&#t~VmVqI~Ny!?P9(ROG^Ozy74llUNQJTOYR--|&GRfi8n-%mGQ zyixUWI4Z?)tO{BAm%u@;Z)W$@%WZe0uAw0Sy$LoI8yCIYCo3EVO3x}<5yQLUkSlh@ zfre;_%8@AWLrU9mN|Eo)CcTmzv3m|dja#anidJQ#5vOzNQVXKOr`M>h(1B8kcZ+L1 
zW}04_xoWta2M7kripcG>g*@TdAJ&sW)Nbk1<*@u_Hfg1Or6zymGbONIxYn?)2l8r+ zrhNp2*YkF{(-+*jdLsyc;P6rKs+X4B!{(7K=T9165F7ggqQD}M;wA-(}2k(5pxacIh%bt~tU?o^XhP1*B8Wc1lxUtqP-L4Tp%&{1n=Zt*vLn1EF&k4NKXZTHU}qH;EKAW;|tRB9scQOt*g33s8f^MD8S;-alU9)wRdw!(=K3h_gN+_Lz%qSr>*lF zsbOJSoPIskRbG)4#8o0ozT0B^-pz0HO`tNpC%oAoSW?I7i4@hR>b=uh+%TsI0y_%C z*Uib8-Ga#dr zqxY7I!2F!qH`u$4qS1|zLg9>USxlb9W?z{qJKup%?yH7-N`ab4 zZP6O(Yq&AMDA%OLon`D9<`G)USR(}kt)~H28r-bukq#k+YPL(TZY0%;hO?MMQ9Q4L z<;l_NHt#%IZ8~I|Q7({02jZ0l`k0&JLHLU2fp^7xeaTGNxMTM;ymy$aC#P{SEvn_L z%(VF5GcPvij$XY>LFp>))`lACwe4)<{@z!kqNHvA>+aZS?4LYpe@KC7Yi#=%-Flig!+UG~k8K91 zsaw7``)%&@o(6!O*)nmn&@_m+#>TgV3Br(s7tR9v5)26OQhH;zdOTI;iVxF6*27Lg z^+f(c5?49vQo4dbnPB~7z7 zInBUH%~1klsj$RLtQW9=VS7ygJSUfzdHE+Dh{3!R%032akbI`EaT#{W@y=J{wd@F# zJz=ks(RrVt#PomVPdo#^%GerfPA_A4Nr=7|qLAHus~mMPTy>8_A9BeFIO18=&kG4;3mIzrTCMeQvb6t8J_c(=M>Yl?bN2mq;$*4id151r+46y$};EMp7YQl#A2s$TT zV5|UCblCMiz#r*a<_jDMjQ)@k_a-;Hf(dD@N~Mgx(K-91`6&Q)I`X2?3N`B`Dady7 z@=5B^m=sTj;7xtToQoxP6~4#Wvs!Tuho)p)CWKRy#WZU*p64&Wn3tlSB~8R1_C^V{ zYk557@T7HF!I8>j<+~pN>FM}tn7gPGYd%pqxj$&0Z69Qvy}wwpo@l5hKNDj}4z>#O z5id6H3-*ijBw{)9RiaVcc#C3)Dn>k9Ipx#dDpJfm_$IJ96{F+W7G4aS8Zt2_@t*b% zy$N15RtHLx!B@m=P5>Ya8Q>Pa&8+O1)~j)`5^M>OzB&E0%2=iSZpvLGK>LO25XM99 zWuh+GU#Tg$Z-HAgy`_9`S`O6>pF>@Q#8J=gVK353HE)t(aH}Wkc;2eq#U#CL#Nw-SRO&qIJWJ<9y+Dmx5Oo+)QHcqr}YtOAdp($sFyStQP zxN#M>*sNxOP=mr0YF^}@H34QY?1jL*02uKy5fPP`Omo7jYK$KO7^J|Udv^Lg^70DZ zLzBI}bw2Q~iBGU==8+@wewigir*@E3H5CqIg9PMU-Z8*r32krf=X1T!Z62OSw~MHF zrT97p!dsANI2y%|vzt61>mC1_46wGf8r3N|M8x4PN6skWH9 zxnM(?QYJXp()>UN8O^WVBsQ&X;qbiQE@$OQ(cbi4+qM0+ZFw>wGm_+kQOdZ(V4c|n zTy~!oBB99~R!$i2YuJh;R{|v*vZ*>M>S~ofKV)2*yW`S$xSU@v_U59-cH(?xWyH?li-+})i5(;I-8rbKn2YYsM=6DCqJ>1 z4uc4F=A?}b{8uD)ooruy`Xo3wG|#9=%6SyO<=3+wONDRLX*}DrSw*m-|u0eK%tIehA=z4O@d~NrXa4u$G8+ysiw}`kdY{<+qrDrLV5_ zTefoI5veg%7Z*Y6L>|-GS}2rI@Qm5*n*;p3Q*C1nd>NFKh(FI3_Z8(XwoVrRZcxlE z|6?%%ULx1E+TJIMxI1+%L_hbAnqTo*G?VBmwS$)oc%mQ|)Ft~P+f#OI{8x<@fq>nC zP@AS2Z7*^Y)Pzu*g(~@`3;6H3eH`I^tiLjYIzsNrNP#aQhL;|jLI%10B`8$Xiz@W^ 
zqQEG`79a#jKAOMdDp~yc8XsxX!UgtB5*%POLWQlRLm!m(>6sS9l1htFECh%fLCRw` zBd=+vFxm@s!|t)sRVPq_!8X%IU@&tLik_gpgVp*!^+wp0Pr*m1=u}%nBmdm`h-W`g z;%_+h$(Wl)NTe{(`Bm>JLeBsFxkVU0zyVaXg z${sMht$u8MG%I5Yh1Rc~wmH4yXMYkaQLZIZJx12KLqZOB0F{d2l~CMYUG_d+Gw0G? zLV?rQPz@s@#aNptk62Wea~<4Tdmv5o&&b+Ny(8+iOmk4AKN6{wvu;d2{1CkwRd_X% zZ0q^O-A8^ltz6Y?IIuX?Rr&W=gr36C1#X z$H8a>52oh#8w>IS$bU^V%v|~)ADH`Q6l1Oy*560h%>||mWBN{Q9NF1kcnwwa5|c82}my}ke^h! z2%eL$*xw_Ogm4Ig(vYXTbC+gDwABi7ipdXHFzDY>#@+jyP$2N z8JeBqiB)vRTfuqSd@_>Rhqwi%O^B5^18$Va{M$~06pOG}i$pL*+pVsrm%&+IQHCWX zZpf@{XQdt4OMu30+(kcO6I=h;Cjfs4*Y%$eIWoJ^i|klVfZrV~)l{ z8ibj5%Xa*Z{M`pTwo?wYCXlcj24XcaV&iZt*7oPdJ4~_>J}!FP$_{!5z1eq`I&Qnx zIaHpX+}@9zATX5C-0?yUwCL%0C4Q_zs22zS>FR8V4st{>ox^-NDrC>qWm+52H_?YN z{`o5-N~oS;lOb|R&bGFc3?jYB7wEozj4z(z@lvPyH0(R$>{=hg(n*MD0TyCoC&E3{ zi@enU%$Y#1;|GZYt6J$<5baRHb_{XzupnH$w=GzGiotd{H(T!urys}Gp#G*Tilur# zXuzh+v_qfOoz_1|fPTp<#cCon0KJZ{>1obJbu?s`u5+%_Pph`^-auD0`Rc&tkP0Hr zyc?uLn9c=a*LsgnDbw>GZ_#7TMCrz}CGBojF8=I> zb_T)M*w6XAKWBw-u~Z|Bop?L-x%-cFWCcBV)*2S!usK3WPOD2E*0uI}!zv1ZYs|c1GOJ= zbpTNQ6aU_%k?~t<0&vKe~Txds&4aUn*;7{Vf39TuC|3J!Xtb7I~;5 z1UmWLW?pcae_(H4cBl%zx2D^mJrpT=V)f+*$YylLZ)>X^4%<}O zy~1v+C%@8VjJcu3CeE|BwseBu+?tr#iDjn8|41-hxEPO3~cX5(hCjC}F+E$ICnK!I0WFMU>cgxY*_`Up=$1sqaRfLKFB%e_ADe~k0R!ra8`i}-doCeXy z>XsqEtY2d4Z|DXX;lJ42eR%6#0iab(ha?V+$1zS}e{4DrGGYb7(#c~?K_hIZ5Hs)V zW%zvSxytkBsijV#s+?&kq?itb=A6{}Gr`Lxk3#}wFhB37Xs+6kFVClH9 zedZ0r4}!B;lGFX1U5qK7{y5D$DyHXspu|w$UTUG)-Tt>=coGveS?nA} zDNJS@N)g{Ds^M(-LrNit02qtrkErz?8y@`LUh7sM<6~WkcG!|3r_dpVhV!vb?3~J7gfiLQu|Y*$o?@W*dvkI?v3uBAh)5tgt01C-@8ZPpv0IiZUUot zHzHgV+MP6Qd-R2m${Wsqa(wHYT}2q6hYX{1EgEKKeDC94cagBTeH>F{E+$g zSkAj;3gNLRqe`B{T$qT&2;LD+++pM(@5z`o|IBG6w?jWX!`V-Xpm!VbePzDy5X%3r-hO;9u;rp@ zzc5!`snuvCv}ai9U-~fb!{l)G=M~@%z3D6%-{4#uEfGog-1FXz?i*4_T_YM>yvL|=!Ra01YkQOgqjB;Oz$G_w3u* z)(;Qez9#Y$Kvy7f_&DN8KDaWv>2^lkf=&z2-s!P>3YDKTC;`Xn&VM*{?sYZCZZ31-QATE_WEHz6_p@TS25=;7 z4PgU~X)g8tu6;4+HPdydX54F`$`GgcVEE!vKUPEtD8RZAYU%Jw2m0TOnG~Z 
zad6+7?>>V7H+NeUIlzX44=ZkgqlKW55T+p%^sA8NZpnr{IM~9&bAH&!05`P_1IuAG zAt3K}7dlKpCn%SV=JU`OixhwA9*;DbiekD>@HoNv$Vr&K;OF6Kbxvfu@lmT*xn6N} zfb0R>=Rp-b7_ggHC5}b<@)L$`K)8bQ4`MOCBE)w5+jjw;VuH)Zi(tkn{xv-2KHb0D z%zQ_mH}A7%D8rr(hX#AM93)j840Ig~1@-L=;U5GS2Nw?;hX5NF|1(a0K~7FV4n7tR z4nYo%$(mp6|Kk8lYXegw*Z=>3k8|3Xc79C%zJi0PrJ=opuBFZYf6URQw;UWpk2%K1 zz(`Qq^r@k}xV58|fvJ_TxU~y{n~(nvmkTeaz#SeI4i*mFf-j%o^ThvrUfIyj-qhL( c@kHVw8|NKfB|i0L_%%XCQbD3n{Dtp-0T&4G8~^|S literal 0 HcmV?d00001 From beee9fa99cd0804553f489c8377aba8b86394424 Mon Sep 17 00:00:00 2001 From: Devin Matthews Date: Fri, 20 Sep 2024 10:56:46 -0500 Subject: [PATCH 2/4] WIP --- docs/PluginHowTo.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/docs/PluginHowTo.md b/docs/PluginHowTo.md index cb27f23660..0c4f5e726c 100644 --- a/docs/PluginHowTo.md +++ b/docs/PluginHowTo.md @@ -17,12 +17,14 @@ * **[Modifications to Packing](PluginHowTo.md#modifications-to-packing)** * **[Modifications to Computation](PluginHowTo.md#modifications-to-computation)** * **[SYRKD](PluginHowTo.md#syrkd)** + # Introduction @@ -459,7 +461,7 @@ where `a` is the matrix to pack and `p` is an uninitialized matrix object which bli_gemm_cntl_set_pack[ab]_var( &pack_variant, &cntl ); ``` -If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](????) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: +If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](frame/3m/gemm/bli_gemm_cntl.h) can be used. 
However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: ```C my_params_t params; @@ -533,7 +535,7 @@ You can set a custom computational variant with: bli_gemm_cntl_set_var( &comp_variant, &cntl ); ``` -If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](????) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: +If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](frame/3m/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: ```C my_params_t params; @@ -578,7 +580,7 @@ xcomp_ukr( dim_t m, const cntx_t* cntx ); ``` -As with packing microkernels, a params pointer is also available to computational microkernels and can be set using `bli_gemm_cntl_set_params`. The params pointer is stored in the `auxinfo_t` struct, and can be obtained with `bli_auxinfo_params( &auxinfo )` (see also the [auxinfo API](???)). +As with packing microkernels, a params pointer is also available to computational microkernels and can be set using `bli_gemm_cntl_set_params`. The params pointer is stored in the `auxinfo_t` struct, and can be obtained with `bli_auxinfo_params( &auxinfo )` (see also the [auxinfo API](frame/base/bli_auxinfo.h)). 
## SYRKD @@ -714,6 +716,7 @@ void dsyrkd } ``` + \ No newline at end of file From 16becb7161950eeda3dcc857fe0d55da1a4d5971 Mon Sep 17 00:00:00 2001 From: Devin Matthews Date: Fri, 20 Sep 2024 11:08:09 -0500 Subject: [PATCH 3/4] WIP --- README.md | 19 +++++++++++++++++++ docs/PluginHowTo.md | 6 +++--- 2 files changed, 22 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 05104b976a..676bd69513 100644 --- a/README.md +++ b/README.md @@ -106,6 +106,18 @@ all of which are available for free via the [edX platform](http://www.edx.org/). What's New ---------- + * **Plugin feature now available!** BLIS addons (see below) provided a way to +quickly extend BLIS's operation support or define new custom BLIS APIs for your application. +BLIS plugins extend this support to completely external code, needing only an installed BLIS +package (no source required). BLIS plugins also allow users to define their own kernels +and blocksizes, combined with the cross-architecture support provided by the BLIS framework. +Finally, user plugins can utilize the new API for modifying the BLIS "control tree" which +defines the mathematical operation to be computed, as well as information controlling packing, +partitioning, etc. Users can now modify the control tree to implement new linear algebra +operations not already included in BLIS. See the [documentation](docs/PluginHowTo.md) for +an overview of these features and step-by-step guides for creating plugins and modifying +the control tree to implement an example operation "SYRKD". + * **BLIS selected for the 2023 James H. Wilkinson Prize for Numerical Software!** We are thrilled to announce that Field Van Zee and Devin Matthews were chosen to receive the [2023 James H. Wilkinson Prize for Numerical Software](https://www.siam.org/prizes-recognition/major-prizes-lectures/detail/james-h-wilkinson-prize-for-numerical-software). @@ -529,6 +541,13 @@ use the multithreading features of BLIS.
overview of BLIS's mixed-datatype functionality and provides a brief example of how to take advantage of this new code. + * **[Extending BLIS functionality](docs/PluginHowTo.md).** This document provides an +overview of BLIS's mechanisms for extending functionality through user-defined code. +BLIS has a plugin infrastructure which allows users to define their own kernels, +blocksizes, and kernel preferences which are compiled and managed by the BLIS framework. +BLIS also provides an API for modifying the "control tree" which can be used to +implement novel linear algebra operations. + * **[Performance](docs/Performance.md).** This document reports empirically measured performance of a representative set of level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries diff --git a/docs/PluginHowTo.md b/docs/PluginHowTo.md index 0c4f5e726c..3755c5dd67 100644 --- a/docs/PluginHowTo.md +++ b/docs/PluginHowTo.md @@ -461,7 +461,7 @@ where `a` is the matrix to pack and `p` is an uninitialized matrix object which bli_gemm_cntl_set_pack[ab]_var( &pack_variant, &cntl ); ``` -If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](frame/3m/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: +If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](../frame/3m/gemm/bli_gemm_cntl.h) can be used. 
However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: ```C my_params_t params; @@ -535,7 +535,7 @@ You can set a custom computational variant with: bli_gemm_cntl_set_var( &comp_variant, &cntl ); ``` -If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](frame/3m/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: +If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](../frame/3m/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: ```C my_params_t params; @@ -580,7 +580,7 @@ xcomp_ukr( dim_t m, const cntx_t* cntx ); ``` -As with packing microkernels, a params pointer is also available to computational microkernels and can be set using `bli_gemm_cntl_set_params`. The params pointer is stored in the `auxinfo_t` struct, and can be obtained with `bli_auxinfo_params( &auxinfo )` (see also the [auxinfo API](frame/base/bli_auxinfo.h)). +As with packing microkernels, a params pointer is also available to computational microkernels and can be set using `bli_gemm_cntl_set_params`. The params pointer is stored in the `auxinfo_t` struct, and can be obtained with `bli_auxinfo_params( &auxinfo )` (see also the [auxinfo API](../frame/base/bli_auxinfo.h)). 
## SYRKD From 5fe5575dd1f34ea2a6c5f1d8f0b54a0236ac11a1 Mon Sep 17 00:00:00 2001 From: Devin Matthews Date: Fri, 20 Sep 2024 11:10:34 -0500 Subject: [PATCH 4/4] Duh --- docs/PluginHowTo.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/PluginHowTo.md b/docs/PluginHowTo.md index 3755c5dd67..e9f6b93fd0 100644 --- a/docs/PluginHowTo.md +++ b/docs/PluginHowTo.md @@ -461,7 +461,7 @@ where `a` is the matrix to pack and `p` is an uninitialized matrix object which bli_gemm_cntl_set_pack[ab]_var( &pack_variant, &cntl ); ``` -If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](../frame/3m/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: +If the default parameter structure which is included in the `gemm_cntl_t` object (and which is pointed to by the default value of the params pointer) can be used by your custom packing variant, then the [packing parameters API](../frame/3/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: ```C my_params_t params; @@ -535,7 +535,7 @@ You can set a custom computational variant with: bli_gemm_cntl_set_var( &comp_variant, &cntl ); ``` -If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](../frame/3m/gemm/bli_gemm_cntl.h) can be used. 
However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: +If the default parameter structure which is included in the `gemm_cntl_t` object can be used by your custom computational variant, then the [computational parameters API](../frame/3/gemm/bli_gemm_cntl.h) can be used. However, if different information is required then you can create your own structure (which must be treated as read-only during the operation) and insert it into the control tree with: ```C my_params_t params;