
% commentary and interpretation of the main outcomes or findings, relative significance of findings, implications of results, limitations and future work.

% mpbenchmark: novel solution, C++ and SIMD suitable for both desktop and embedded processors to assess multi-core performance. State-of-the-art achievement; limitation would be how short the application is for more high-end CPUs. Future work could involve having a larger portion of application benchmark. SMT failed to deliver performance on x86 processor.

The first objective was met with strong results, surpassing the initial aim. A novel benchmark was developed in modern \texttt{C++} that exceeded the capabilities of the previous \texttt{mpbenchmark}~\cite{mpbenchmark_paper} in several respects, particularly compared with its \texttt{C} and \texttt{Ada} implementations. The new software design was modular and object-oriented, enhancing scalability. An alternative solution was also developed to assess CPU performance using SIMD intrinsics, with the application automatically selecting \texttt{AVX2} or \texttt{NEON} based on the CPU type. Results from desktop (\texttt{x86}) processors indicated negligible performance gains from SMT (Intel's ``Hyper-Threading''). The latest Raspberry Pi 5, equipped with the \texttt{Cortex-A76} processor, outperformed the older Raspberry Pi 4's \texttt{Cortex-A72} by 75\%. On \texttt{ARM}-based CPUs, using single-precision floating-point arithmetic with \texttt{NEON} instructions yielded a 42--51\% performance gain but limited precision to four decimal places, a trade-off between performance and numerical precision.
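
The selection between \texttt{AVX2} and \texttt{NEON} described above can be illustrated with a minimal sketch. It assumes the choice is made at compile time through the compiler's predefined macros, with a scalar fallback; the project's actual selection mechanism may differ, and the names \texttt{simd\_sum} and \texttt{simd\_backend} are hypothetical and introduced only for this example.

```cpp
#include <cstddef>
#include <vector>

// Assumption: the SIMD backend is chosen at compile time from the
// compiler's predefined target macros. This is an illustrative sketch,
// not the benchmark's actual code.
#if defined(__AVX2__)
  #include <immintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

// Hypothetical helper: reports which code path was compiled in.
const char* simd_backend() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__ARM_NEON)
    return "NEON";
#else
    return "scalar";
#endif
}

// Hypothetical helper: sums n single-precision values, using 8-wide
// (x86) or 4-wide (ARM) vector additions where available.
float simd_sum(const float* data, std::size_t n) {
    std::size_t i = 0;
    float total = 0.0f;
#if defined(__AVX2__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);  // unaligned store, no alignment needed
    for (float v : lanes) total += v;
#elif defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4) {
        acc = vaddq_f32(acc, vld1q_f32(data + i));
    }
    float lanes[4];
    vst1q_f32(lanes, acc);
    for (float v : lanes) total += v;
#endif
    // Scalar tail; this is also the complete loop when no SIMD path exists.
    for (; i < n; ++i) total += data[i];
    return total;
}
```

Compiling the same source with an \texttt{AVX2}-enabled \texttt{x86} target or an \texttt{ARM} target with \texttt{NEON} would pick the matching branch without changes to the calling code.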

Future research and development would focus on extending the benchmark with more complex and demanding calculations to better assess the performance of high-end CPUs. The \texttt{x86} processor used in this project completed the benchmark application in less than 10 milliseconds. A workload this short can give a slightly misleading assessment of CPU capabilities, even when multiple threads are used. While the benchmark is effective for testing embedded processors, it may be insufficient for assessing modern high-end CPUs, which now often feature more than 10 physical cores.

% MobileNet: novel solution and a dramatic performance boost. Desktop SMT failed to deliver again. On embedded devices need to tailor the solution to the target CPU. Limitation would be to improve the MobileNet to include the usage of modern C++ features. Future work would involve investigating SIMD optimisation on embedded/ARM CPUs for optimal performance. 
The second objective achieved novel results. The popular image classification model \texttt{MobileNet}~\cite{mobilenet_paper,mobilenet_repo} was parallelised using \texttt{OpenMP}. The optimised application achieved a substantial 87\% reduction in runtime on the test \texttt{x86} processor when all system threads were utilised, though SMT did not yield any performance gains. On Raspberry Pi devices, the picture was more complex: optimal performance was achieved only when the application's multi-threaded architecture was specifically tailored to the target CPU. The optimised \texttt{MobileNet} application saw performance gains of 59.9\% and 65.2\% on the Raspberry Pi 4 and 5, respectively, when all system threads were employed. However, SIMD optimisations on the Raspberry Pi devices did not result in a significant performance improvement. In terms of application runtime, the Raspberry Pi 5 outperformed the Raspberry Pi 4 by 58\%. Furthermore, the optimal configuration on the Raspberry Pi 5 parallelised three regions of the application, compared with only two on the Raspberry Pi 4, indicating that the \texttt{Cortex-A76} processor is better suited to exploiting \texttt{OpenMP} parallel regions.
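
As an illustration of the kind of \texttt{OpenMP} parallel region discussed above, the following minimal sketch parallelises an element-wise loop with a configurable thread count. It is not taken from the \texttt{MobileNet} code base: the function \texttt{scale\_and\_bias} and its parameters are hypothetical stand-ins for a per-layer loop, and the sketch assumes the loop iterations are independent.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one data-parallel loop of a network layer.
// Each output element depends only on the matching input element, so
// the iterations can safely be divided among threads.
std::vector<float> scale_and_bias(const std::vector<float>& in,
                                  float scale, float bias, int threads) {
    std::vector<float> out(in.size());
    const long long n = static_cast<long long>(in.size());
    if (threads < 1) threads = 1;  // num_threads requires a positive value

    // If the compiler is invoked without OpenMP support (for example,
    // without -fopenmp on GCC or Clang), the pragma is ignored and the
    // loop simply runs serially with the same result.
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long long i = 0; i < n; ++i) {
        out[i] = in[i] * scale + bias;
    }
    return out;
}
```

Making the thread count a parameter is one way to tailor each parallel region to the target CPU, for instance by enabling a region on one board and leaving it serial on another.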

Future work would focus on improving the software design of the \texttt{MobileNet} application. Enhancements could aim to increase modularity and leverage modern \texttt{C++} features. Another area for investigation is why the \texttt{OpenMP} SIMD directive did not yield performance gains. This could be explored by employing \texttt{NEON} instructions manually and benchmarking the application to analyse the effects.

% DeBaTE-FI: unprecedented performance improvement in both local setup and the main setup. Limitation would be redesigning the code of the application to make it more modular. Future work would be to collect more results from the main setup and use the C++ library instead of the obsolete Python telnet library.
The third objective, the most challenging of all, was also met with strong results, surpassing the initial aims. The \texttt{DeBaTE-FI} platform's application software, specifically its multiprocessing and multithreading capabilities, was optimised to reduce runtime and enhance scalability. The optimised solution reduced runtime by 62.1\% in the local setup and by 43.5\% in the main setup, confirming that the performance improvements carry over beyond the local environment. Both results represent significant and unprecedented improvements in performance. Additionally, an alternative solution was developed that integrates an open-source \texttt{C++} library into the \texttt{Python} application for \texttt{telnet} communication. This solution also showed strong performance gains in the local setup, though not as substantial as those of the optimised \texttt{Python} solution, and it was not tested in the main setup due to project time constraints. Thus, this project not only enhanced the performance of the \texttt{DeBaTE-FI} platform but also offers a high-performance \texttt{C++} library that can replace the obsolete \texttt{Python} \texttt{telnetlib} library the platform currently uses.
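
For reference, the runtime reductions reported above can be restated as speedup factors. The conversion below assumes each reduction $r$ is measured relative to the unoptimised runtime $T_{\mathrm{base}}$, with $T_{\mathrm{opt}} = (1 - r)\,T_{\mathrm{base}}$; the symbols are introduced here purely for illustration.

\[
S = \frac{T_{\mathrm{base}}}{T_{\mathrm{opt}}} = \frac{1}{1 - r},
\qquad
S_{\mathrm{local}} = \frac{1}{1 - 0.621} \approx 2.64,
\qquad
S_{\mathrm{main}} = \frac{1}{1 - 0.435} \approx 1.77 .
\]

Under that assumption, the optimised solution runs roughly 2.6 times faster than the original in the local setup and roughly 1.8 times faster in the main setup.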

Future work would involve collecting more detailed results from the main setup to further analyse the optimised solution's performance. This task is time-consuming due to the long runtime of the application. Another area worth researching is the application's design, which currently remains convoluted, making it difficult to enhance, to debug, and to search for further optimisations. The project also proposes using the developed \texttt{C++} library for \texttt{telnet} communication to replace the application's deprecated \texttt{Python} \texttt{telnetlib} library.