Hello reader! I am Micky, and in this document I am going to describe in detail what I have explored and done towards the parallelisation of the MOOSE solvers, primarily KSOLVE. It covers the things I learned about MOOSE, the scope for parallelisation in MOOSE, the challenges in achieving it, what I explored (in full detail, including the code), and the parallelised solver.

ABOUT MOOSE

Here I will walk through the code that was written as a test script to check whether we got a speed-up or not. The code snippets are taken from the following:
https://github.com/mickyaero/moose-core/blob/issue4/tests/gpu/test_gpu_ksolve.py

The neuron is divided into many voxels (compartments), and the reactions/computations take place in these voxels. There are many voxels and a lot of reactant species in each voxel; species can also diffuse between voxels. To create the voxels, we make a bigger compartment which is then subdivided into smaller compartments (the voxels). The length of the bigger compartment is fixed first, and then the number of voxels required. If the voxel type is chosen to be a cylinder, then the radius at both of its ends is also specified. The following snippet describes the above:-
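(The snippet below is a representative sketch using the standard MOOSE Python API; the exact names and values in the linked test script may differ.)

# Sketch based on the standard MOOSE Python API; values are illustrative.
import moose

model = moose.Neutral( '/model' )
compt = moose.CylMesh( '/model/compt' )  # the bigger, cylindrical compartment
compt.x0 = 0.0            # position of one end of the cylinder (m)
compt.x1 = 1e-3           # position of the other end: total length 1 mm
compt.r0 = 1e-6           # radius at one end (m)
compt.r1 = 1e-6           # radius at the other end
compt.diffLength = 1e-6   # length of each voxel, so 1000 voxels in total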
Now that the compartments for the computation have been made, we need to specify which solver will be used in them. The following snippet explains that:-
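(Again a sketch using the standard MOOSE API: Ksolve is the deterministic ODE solver and Stoich ties it to the compartment. The stoich.path field, which actually builds the system, is set later, once the pools and reactions exist.)

ksolve = moose.Ksolve( '/model/compt/ksolve' )  # deterministic ODE solver
stoich = moose.Stoich( '/model/compt/stoich' )
stoich.compartment = compt   # solve in the voxelised compartment made above
stoich.ksolve = ksolve       # use KSOLVE as the numerical engine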
Now the reactions that will take place in the voxels are specified. The following snippets explain the process of defining reactions in the voxels:-
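(A sketch of a minimal reversible reaction a <-> b between two pools; the actual test script may define more species and reactions.)

a = moose.Pool( '/model/compt/a' )       # a chemical species (pool)
b = moose.Pool( '/model/compt/b' )
reac = moose.Reac( '/model/compt/reac' ) # the reaction a <-> b
reac.Kf = 0.1    # forward rate constant
reac.Kb = 0.01   # backward rate constant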
Now the substrates and products of the reaction are fixed:-
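(Continuing the sketch: a is wired up as the substrate and b as the product, and the initial concentrations are set. The values are illustrative.)

moose.connect( reac, 'sub', a, 'reac' )  # 'a' is the substrate
moose.connect( reac, 'prd', b, 'reac' )  # 'b' is the product
a.concInit = 1.0   # initial concentration (mM)
b.concInit = 0.0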
We have now defined the reactions that will take place in the voxels. We will now build and run the system:-
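(Setting stoich.path is what actually builds the stoichiometry for the solver; reinit initialises everything and start runs the simulation. A sketch; the run time in the actual test differs.)

stoich.path = '/model/compt/##'  # include everything under compt in the solver
moose.reinit()                   # initialise the whole model
moose.start( 100.0 )             # run the simulation for 100 seconds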
Apart from this, tutorials are also available on the MOOSE website, which can help you understand MOOSE better:- Tutorials

We are now done with the script writing in MOOSE!

Important Files For Parallelisation

For parallelising a solver, it should be known how the solver works and from where it is called. The file Clock.cpp in the scheduling folder controls the simulation time and calls the required functions. In Clock.cpp, handleStep is the function that dispatches these calls, and the number of ticks is the number of times the function is called. For implementing Cuda, the GPU code has to be called from the Clock.cpp file. For OpenMP too, this is one of the files where things have to be changed, as explained in the subsequent sections. For parallelising KSOLVE, Ksolve.cpp, VoxelPoolsBase.cpp and VoxelPools.cpp should be understood, as these are involved in extracting the data of the voxels. Data extraction is required if the calculations have to be ported completely to the GPU, as explained in the Cuda section.

Becoming Familiar With MOOSE

MOOSE has a huge code base, and understanding it takes some time. The following links may come in handy while exploring MOOSE:-
https://moose.ncbs.res.in/?q=documentation
http://vivekvidyasagaran.weebly.com/moose/some-computational-neuroscience-theory
You will get comfortable with it when you write code in it, then break the code, and then debug it :).
Using OpenMP With KSOLVE

This section will follow this guideline:-
- Logic of the code implemented
- Where is it implemented?
- How can the user run the code?
- Numerical tests implemented
- What are the outcomes of the tests?
- What else should be done?
OpenMP was implemented in various ways. The first was parallelising the for loop of the process function. This process function is called many times, and each time it is called threads are created and then destroyed at the end of the loop. The thread creation and destruction overhead was huge: instead of a speed-up, a slowdown was observed. To replicate the results, apply OpenMP to the for loop in the process function of the following file:-
Ksolve.cpp (inside the process function):

// The pragma shown is the straightforward placement that was tried;
// threads are spawned and torn down on every call to process.
#pragma omp parallel for
for ( size_t i = 0; i < nvPools; i++ )
{
    voxelPools_[i].advance( p );
}
The results obtained confirmed the slowdown: the parallel version was slower than the serial one.
We can get a speed-up if the threads are persistent: they should be created once in the Clock.cpp file, parallelise the for loop of the process function across all timesteps, and be destroyed only once, after all the timesteps have finished. So, to overcome the overhead, a new strategy was used in which the threads are created and destroyed just once. Here, the outer for loop in the Clock.cpp handleStart function was parallelised, and in Ksolve.cpp each thread was made to work on a fixed subset of the voxels (roughly 1/4th of the total, with four threads). This way four copies (one per thread) of the for loop in handleStart were made, but the process function's calculations were divided among them. The result: we got a speed-up, but the answers were wrong. The following changes can be made to replicate the results:- Clock.cpp
omp_set_dynamic( 0 );
omp_set_num_threads( 4 );
#pragma omp parallel
{
    for ( isRunning_ = ( activeTicks_.size() > 0 );
          isRunning_ && currentStep_ < nSteps_; currentStep_ += stride_ )
    {
        size_t thread_no = omp_get_thread_num();
        // Each thread takes every 4th voxel, starting at its own index.
        // (The voxel loop itself lives in Ksolve.cpp's process function;
        // it is shown inline here to illustrate the division of work.)
        for ( size_t k = 0; k < 4; ++k )
        {
            if ( k == thread_no )
            {
                for ( size_t i = thread_no; i < nvPools; i = i + 4 )
                {
                    voxelPools_[i].advance( p );
                }
                break;
            }
        }
        ..
    }
}
It gave a speed-up, but the answers were different. This was because the threads were all modifying a variable that was supposed to be updated only once per timestep: currentStep_. Every thread executed its own copy of the loop and so incremented currentStep_, advancing the step counter too fast, which gave wrong answers. The solution is that currentStep_ should be changed by just one of the threads; to achieve this, the increment was done by thread number 0 only, with a barrier so the other threads wait for it. This was done by making the following changes in Clock.cpp:
if ( omp_get_thread_num() == 0 )
{
    // Only thread 0 advances the step counter and the simulation time.
    currentStep_ = currentStep_ + stride_;
    currentTime_ = info_.currTime = dt_ * endStep;
}
#pragma omp barrier   // the other threads wait here until thread 0 is done
The above was done, but it broke the code and gave segmentation faults. We spent a lot of time trying to get rid of the error but could not, so we left OpenMP at this stage and proceeded towards Cuda. This was all that was tried with OpenMP; it can be made to work if the segmentation fault is removed or if a different strategy to parallelise with OpenMP is implemented.

Using Cuda With KSOLVE

This section will follow this guideline:-
- Logic of the code implemented
- Where is it implemented?
- How can the user run the code?
- Numerical tests implemented
- What are the outcomes of the tests?
- What else should be done?

We can use CUDA in KSOLVE either by using the GPU-accelerated library Thrust (along with the boost library) or by writing our own code for the GPU. In KSOLVE, using the library does not seem helpful; I think it would slow the code down rather than speed it up. The reason is that KSOLVE solves an AX = B equation, but the matrices A and B are not very large, and what the library provides are functions for parallelising a single AX = B solve. The library would help if the matrices were very large and each solve involved a lot of computation, but that is not the case in KSOLVE. What is huge in MOOSE is the number of times the equation is solved, so if we used the library functions, the data would be ported to the GPU, the equation solved, and the data transferred back on every step. That gives a slowdown rather than a speed-up, so the idea of using the libraries was dropped (although the functions written in these libraries are extremely efficient).

So, writing our own code for the GPU was tried. As mentioned before, each computation is small but the number of such computations is very large, so we thought of doing each computation on a different GPU thread: transfer the whole problem to the GPU, solve it entirely there, and copy back only the final solution. This strategy is bound to give a massive speed-up, but it means we need to write our own solvers for the device. The strategy can be summarised as follows: I have 'n' computations (each one solving a matrix equation), I launch 'n' threads, each thread does one computation, and the results are put back at the end.

An rk4 solver was written for the GPU and integrated with KSOLVE. To test it, we used dummy data for the matrices rather than the actual matrix data. The dummy data was transferred to the GPU and the rk4 solver was run on it; after doing the computations for the required number of timesteps, the data is transferred back to the CPU. So, at every timestep the GPU function is called, but the data is not transferred to the GPU at every timestep, as it is already there; this saves a lot of time, so a speed-up is expected. Initially, if the system is run for a short time (few timesteps), the GPU is very slow; but as the time for which the simulation runs grows, the CPU time exceeds the GPU time and we get a speed-up. A minimal sketch of the one-thread-per-computation idea is given below.
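(The sketch below illustrates the idea in Python using Numba's CUDA backend, for brevity; the actual project code is CUDA C++, linked at the end of this section, and a scalar decay ODE stands in for the real per-voxel systems. All names here are hypothetical.)

# Illustrative only: Numba/CUDA sketch of one-thread-per-voxel RK4.
# The real per-voxel state is a vector of pool concentrations; a scalar
# ODE dy/dt = -k*y is used here to keep the sketch short.
import numpy as np
from numba import cuda

@cuda.jit
def rk4_step( y, dt, k ):
    i = cuda.grid( 1 )            # global thread index == voxel index
    if i < y.shape[0]:
        y0 = y[i]
        k1 = -k * y0
        k2 = -k * ( y0 + 0.5 * dt * k1 )
        k3 = -k * ( y0 + 0.5 * dt * k2 )
        k4 = -k * ( y0 + dt * k3 )
        y[i] = y0 + ( dt / 6.0 ) * ( k1 + 2*k2 + 2*k3 + k4 )

nvoxels = 100000
d_y = cuda.to_device( np.ones( nvoxels ) )   # transfer once, before the run

threads = 256
blocks = ( nvoxels + threads - 1 ) // threads
for step in range( 1000 ):
    # The kernel is launched every timestep, but the data stays on the GPU.
    rk4_step[blocks, threads]( d_y, 1e-3, 2.0 )

y = d_y.copy_to_host()   # transfer back once, at the end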
Below is the graph:- the red line is the CPU time and the blue line is the GPU time. What happens at the start is not visible in this graph, so below is the data:-
We get a speed-up as the simulation time is increased. The above results show that Cuda can give a speed-up if used with MOOSE. Having got a speed-up with dummy data, we then tried to extract the actual data of the matrices and run the code on it. This is where things got stuck: the code for extracting the data of the matrices was written and was working fine on its own, but when the data-extraction code and the solver code were integrated, we got a segmentation fault. The segmentation fault is not completely resolved, but its cause has been identified; the discussion regarding it can be found here. The gpu code can be found here. The priority now would be to test the solver with the actual data, and then to improve the solver to make it more efficient. The Thrust library could also be tried with KSOLVE to see how it actually performs.