HPL Benchmarking

Writeup on my first attempt at HPL benchmarking

Posted Jun 1, 2026 Updated Jun 1, 2026

By Alan Ong


1. Starting with AI

Being a complete beginner in this field, the only way I knew how to start was to follow AI’s instructions for the project. In this writeup, I will skip the process of setting up and connecting to the NUS Compute Cluster. Instead, I will focus on what I did, what I learnt, and my takeaways from this small and fun project. Note that this writeup is not standard, nor does it follow a competition structure. It is purely for my own understanding and ease of reading.

1.1. Hardware Reconnaissance

Goal: Mathematically derive the hardware constraints of the target CPU nodes (xcn) to calculate the Theoretical Peak Performance (Rpeak).

Steps Taken:

Requested an interactive node session via srun --partition=normal --pty bash -l
Executed lscpu to determine CPU architecture and core counts
Executed free -g to map physical memory limits

Hardware Profile (Single xcn Node):

Processors: 2x Intel Xeon E5-2620 v3 (Haswell) @ 2.40 GHz
Physical Cores: 12 (2 * 6) (Hyperthreading ignored to prevent cache contention)
Instruction Set: avx2 / fma (16 double-precision FLOPs per clock cycle per core)
RAM: 62 GB usable

Takeaway & Mathematical Target: The Rpeak for a single node is calculated as:

12 cores * 2.4 GHz * 16 FLOPs = 460.8 GFLOPS
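The same arithmetic as a small Python sketch, using only the figures from the hardware profile above. The function name `rpeak_gflops` is just an illustrative choice, and the 16 FLOPs per cycle figure is assumed to be double precision per core.

```python
# Figures taken from the hardware profile above.
SOCKETS = 2
CORES_PER_SOCKET = 6
CLOCK_GHZ = 2.4
FLOPS_PER_CYCLE = 16  # assumed: double precision, AVX2 + FMA, per core


def rpeak_gflops(nodes: int = 1) -> float:
    """Theoretical peak in GFLOPS: nodes * cores * clock (GHz) * FLOPs per cycle."""
    cores_per_node = SOCKETS * CORES_PER_SOCKET
    return nodes * cores_per_node * CLOCK_GHZ * FLOPS_PER_CYCLE


print(f"1 node : {rpeak_gflops(1):.1f} GFLOPS")  # 460.8
print(f"4 nodes: {rpeak_gflops(4):.1f} GFLOPS")  # 1843.2
```

Keeping the node count as a parameter makes it easy to reuse the same number later when comparing a multi-node Rmax against its Rpeak.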

1.2. Software Stack

Goal: Locate the MPI communicator and BLAS math engine required to orchestrate and execute the benchmark.

Steps Taken:

Attempted standard module avail commands, which failed. Sourced /etc/profile to initialize the environment
Discovered the cluster does not use isolated modules for base compute; software is installed globally
Executed which mpicc to locate the MPI compiler (/usr/bin/mpicc)
Executed ldconfig -p | grep -i -E "mkl|blas" to map installed math libraries

Obstacles Faced:

GPU vs. CPU Libraries: The system scan revealed highly optimized NVIDIA libraries (libcublas.so). However, as the constraints strictly forbid GPU usage, these had to be avoided.
Missing Intel MKL: The cluster lacked an architecture-specific Intel Math Kernel Library (MKL). The only viable CPU option was the generic, unoptimized Reference BLAS (/lib/x86_64-linux-gnu/libblas.so).

Takeaway: The lack of an optimized BLAS library established a severe performance bottleneck from the outset. Reference BLAS relies on generic arithmetic rather than leveraging the CPU’s advanced vector instructions (avx2).

1.3. Compilation

Goal: Compile the raw HPL source code (C/Fortran) directly on the target architecture to ensure hardware compatibility.

Step-by-Step:

Downloaded and extracted hpl-2.3.tar.gz
Duplicated a generic Linux configuration template (Make.Linux_PII_CBLAS) into Make.soc_cluster
Edited the configuration file:
- Cleared legacy MPdir paths, allowing the global mpicc compiler to handle MPI mapping automatically.
- Hardcoded the LAdir to point directly to the Reference BLAS location found in section 1.2.
- Swapped the standard gcc compiler for the mpicc wrapper.
Executed make arch=soc_cluster to build the xhpl binary.

Takeaway: Compiling natively on the execution hardware (Linux x86_64) rather than on a local machine is the simplest way to avoid architectural incompatibility failures, since the binary is built against the same CPU architecture and libraries it will run on.

1.4. Tuning & Single Node Baseline

Goal: Prove the engine works on a single core before scaling, and establish a baseline GFLOPS score.

Step-by-Step:

Configured HPL.dat for a tiny, single-core test:
- N (Problem Size): 1000
- NB (Block Size): 192 (Standard Intel sweet-spot for cache chunking)
- P x Q (Grid): 1 x 1
Attempted to execute via ./xhpl directly, which failed due to a lack of Slurm PMI support in the cluster’s Open MPI build.
Executed the supervised launch: mpirun -np 1 ./xhpl

Obstacles Faced:

The Legacy Parser: Initial modifications to HPL.dat corrupted the file spacing, causing the vintage C parser to crash (Value of NB less than 1). This required replacing the file header with a strictly formatted template.
Slurm/MPI Conflicts: The execution environment requires a strict hierarchy. Open MPI crashed when attempting to spawn multiple tasks without an explicit Slurm --ntasks allocation, and crashed again when launched without the mpirun supervisor.

Takeaway: The baseline 1-core run achieved ~2.98 GFLOPS, roughly 7.8% of the single-core theoretical peak (2.4 GHz * 16 FLOPs = 38.4 GFLOPS). This is consistent with the concern raised in 1.2: the unoptimized Reference BLAS throttles what the hardware can deliver.

1.5. Multi-Node Scaling and Network Optimization

Goal: Maximize the cluster score utilizing the maximum allowed resources: 4 nodes (48 physical cores total) running under automated background batch execution.

Step-by-Step:

Configured HPL.dat for the scaled allocation:
- N (Problem Size): 20000 (Optimized to balance numerical intensity and unoptimized engine runtime limits)
- NB (Block Size): 192
- P x Q (Process Grid): 6 * 8 = 48 tasks total.
Authored a batch file (submit.sh) to request 4 nodes and 12 tasks per node under Slurm.
Successfully isolated and resolved multiple cascading network bugs during runtime debugging.
Executed the final benchmark via sbatch submit.sh and verified the output matrix residuals.
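A short sketch of how the candidate P x Q grids for a given rank count can be enumerated. It assumes the usual HPL guidance of keeping P <= Q and the grid close to square; `process_grids` is a hypothetical helper, not part of HPL.

```python
import math


def process_grids(ranks: int) -> list[tuple[int, int]]:
    """Return every (P, Q) with P * Q == ranks and P <= Q, flattest grid first."""
    return [
        (p, ranks // p)
        for p in range(1, math.isqrt(ranks) + 1)
        if ranks % p == 0
    ]


grids = process_grids(48)
print(grids)      # [(1, 48), (2, 24), (3, 16), (4, 12), (6, 8)]
print(grids[-1])  # (6, 8): the most nearly square option
```

For 48 ranks the last entry is the 6 x 8 grid used in this run.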

Advanced Network Obstacles Resolved:

Phantom Network Trap: Open MPI automatically selected a dynamic management interface (eno2 on the 192.168.56.0/21 subnet) instead of the cluster’s primary data plane. This triggered firewall drop rules, hanging the allocation indefinitely.
Interface Name Asymmetry: Attempting to force the physical device name (eno0) failed because interface naming strings varied across the allocated node boards (xcnc[1-4]). This was bypassed by configuring Open MPI to include resources strictly by IP subnet block: --mca btl_tcp_if_include 192.168.48.0/21.
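Intra-Node Shared Memory Crash: Ranks residing on the same motherboard crashed with Read -1, expected 1228800, errno = 1 when using Open MPI’s shared-memory transport (vader/shm). errno 1 is EPERM (operation not permitted), which suggests a permissions problem with copying data between processes rather than an exhausted file size. The workaround was to disable the shared-memory transport completely and force all traffic, including intra-node traffic, through the TCP layer via --mca btl tcp,self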
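To illustrate why selecting by subnet works where interface names did not, here is a minimal sketch using Python's standard `ipaddress` module. The two subnets and the two `--mca` options come from the text above; the host addresses and the helper name `on_data_plane` are made up for illustration.

```python
import ipaddress

# Subnets quoted in the text above.
DATA_PLANE = ipaddress.ip_network("192.168.48.0/21")
MGMT_PLANE = ipaddress.ip_network("192.168.56.0/21")


def on_data_plane(addr: str) -> bool:
    """True if addr belongs to the subnet Open MPI is told to include."""
    return ipaddress.ip_address(addr) in DATA_PLANE


# Hypothetical node addresses, for illustration only.
print(on_data_plane("192.168.50.17"))  # True
print(on_data_plane("192.168.56.17"))  # False

# The launch line assembled from the flags described above.
mpirun_cmd = [
    "mpirun",
    "--mca", "btl", "tcp,self",
    "--mca", "btl_tcp_if_include", str(DATA_PLANE),
    "./xhpl",
]
print(" ".join(mpirun_cmd))
```

A /21 block spans eight consecutive third-octet values, so 192.168.48.0/21 covers 192.168.48.0 through 192.168.55.255 and does not overlap the 192.168.56.0/21 management range.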

1.6. Results

Metric	Performance	Efficiency
Theoretical Peak (Rpeak)	1,843.20 GFLOPS	-
This run (Rmax)	22.27 GFLOPS	1.21%

We can observe a 98.8% gap between our first run and the theoretical peak. The 1-core run achieved ~2.98 GFLOPS, which with perfect scaling would give ~143 GFLOPS on 48 cores. However, we only achieved 22.27 GFLOPS, about 16% of that ideal and a long way off the theoretical best under our current constraints.
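The figures in this paragraph can be reproduced with a few lines of Python. All inputs are the numbers reported above; the variable names are illustrative only.

```python
RPEAK_4_NODES = 1843.20   # GFLOPS, theoretical peak for 4 nodes
RMAX = 22.27              # GFLOPS, measured 48-rank run
SINGLE_CORE = 2.98        # GFLOPS, measured 1-core baseline
CORES = 48

efficiency = RMAX / RPEAK_4_NODES    # fraction of theoretical peak
ideal_scaled = SINGLE_CORE * CORES   # perfect scaling of the 1-core result
scaling_efficiency = RMAX / ideal_scaled

print(f"Efficiency vs Rpeak   : {efficiency:.2%}")           # 1.21%
print(f"Ideal 48-core result  : {ideal_scaled:.2f} GFLOPS")  # 143.04 GFLOPS
print(f"Efficiency vs scaling : {scaling_efficiency:.1%}")   # 15.6%
```

Separating the two ratios is useful: the first measures the distance from the hardware limit, the second measures how much was lost purely by going from 1 rank to 48.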

Here are the three primary reasons for the performance gap:

A. The Execution Engine (Missing AVX2 Vectorization)

The Problem: The cluster lacks a pre-compiled Intel Math Kernel Library (MKL). Consequently, the HPL binary was linked against the OS-default Reference BLAS (libblas.so).
The Impact: Reference BLAS is generic, largely unvectorized code with no architecture-specific tuning. It ignores the Intel Haswell chip’s Advanced Vector Extensions (AVX2) and FMA units, which can execute up to 16 double-precision floating-point operations per cycle per core.
The Fix: Compiling an architecture-aware library like OpenBLAS from source using -march=haswell flags to unlock the silicon’s vectorization capabilities.

B. Data Starvation (Problem Size Constraints)

The Problem: Due to the slow Reference BLAS engine, the matrix size was restricted to N = 20,000 to prevent the job from exceeding Slurm’s 2-hour timeout limit.
The Impact: A matrix of this size is computationally trivial for 48 physical cores. Because the processors chew through their assigned data chunks instantly, the mathematical workload (compute time) is overshadowed by the time spent passing those chunks back and forth over the network (communication time). The CPUs spend the majority of the run idling.
The Fix: Once a vectorized BLAS library is installed, the problem size should be scaled up to roughly N = 157,000 to fill about 80% of the cluster’s 248 GB of RAM (an N x N matrix of doubles needs 8 * N^2 bytes, so N = 140,000 fills only about 63%), keeping the CPUs compute bound.
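A rough way to size N from available memory, sketched in Python. It assumes decimal gigabytes, 8 bytes per matrix element, and rounding N down to a multiple of NB; `hpl_problem_size` and `matrix_gb` are hypothetical helper names.

```python
import math


def matrix_gb(n: int) -> float:
    """Memory for an n x n matrix of 8-byte doubles, in decimal GB."""
    return 8 * n * n / 1e9


def hpl_problem_size(total_mem_gb: float, fraction: float, nb: int) -> int:
    """Largest N (a multiple of nb) whose matrix fits in fraction * total memory."""
    usable_bytes = total_mem_gb * 1e9 * fraction
    n = math.isqrt(int(usable_bytes // 8))
    return (n // nb) * nb


n = hpl_problem_size(total_mem_gb=248, fraction=0.80, nb=192)
print(n)                                # 157440
print(f"{matrix_gb(n):.1f} GB")
print(f"{matrix_gb(140_000):.1f} GB")   # 156.8 GB for N = 140,000
```

In practice some headroom is needed for the operating system and MPI buffers, which is why a fraction around 0.8 is used rather than 1.0.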

C. Communication Saturation (MPI Topology & TCP Overhead)

The Problem: The baseline run utilized a flat MPI topology, treating all 48 cores as entirely independent networked entities (a 6x8 process grid) communicating over a standard 192.168.48.0/21 Ethernet subnet.
The Impact: Ethernet has significantly higher latency and lower bandwidth than dedicated supercomputing interconnects (like InfiniBand). With 48 independent ranks all exchanging data over TCP, the network becomes congested and per-message protocol overhead adds up.
The Fix: Implementing Hybrid Parallelism. By launching only 4 MPI tasks (one per node) to handle network traffic, and using OpenMP to spawn 12 intra-node threads that share local RAM, the number of ranks talking over the network drops from 48 to 4 and all intra-node communication stays in shared memory.

2. Learning

Following the initial exposure from AI, I continued my journey with some more in-depth learning of how high performance computing actually works, and the optimisations that I can be making. This section serves as my takeaway from the videos and readings that I have done, and possible optimisations that I have learnt.

Disclaimer: This section has nothing to do with the HPL project. Move on to section 3 to continue.

2.1. MIT 6.172 (Performance Engineering of Software Systems)

Given a basic Python program for matrix multiplication, the run time is abysmal, reaching only 0.00075% of peak. Several optimisations were made in lecture 1 to improve the run time.

The optimisations include:

C: Rewrite the program in C. Python is an interpreted language whereas C is a compiled language.
Interchange loops: Changing the order of the nested loops gives an access pattern with more cache hits, which saves time.
Optimisation flags: You can tell clang how much to optimise (-O0, -O1, -O2, -O3).
Parallel loops: Use more cores and run loops in parallel with each other.
Tiling: Manage cache for data reuse, tiled matrix multiplication with parallelism.
Parallel Divide and Conquer: Tiling for two-level cache, recursive matrix multiplication with parallelism, however when unoptimised, this takes a lot of time due to the overhead of recursion. We can coarsen the recursion to get a much lower running time.
Compiler vectorisation: Using vector hardware, SIMD with compiler, we can tell the compiler to use the vector instructions specific to the hardware.
AVX intrinsics: Provides direct access to hardware vector operations.
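To make the loop-interchange and tiling ideas concrete, here is a minimal pure-Python sketch. It only demonstrates the loop structure and that both versions produce the same result; Python itself will not show the cache speedups the lecture measures in C. The function names and the tile size are illustrative choices.

```python
def matmul_ikj(a, b):
    """Matrix multiply with i-k-j loop order: the inner loop walks rows of b and c."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation, but done tile by tile so each small block is reused."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # Work on one tile; min() handles edges that are not full tiles.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_ikj(a, b))    # [[58.0, 64.0], [139.0, 154.0]]
print(matmul_tiled(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

In a compiled language the tile size would be chosen so that the working blocks of a, b and c fit in cache together, which is the same idea as HPL's NB parameter.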

After all of these optimisations, we will get a result of 41% of peak. Much better, but why aren’t we getting all of peak?

Reaching 100% of a supercomputer’s theoretical peak performance is physically impossible because that metric assumes a frictionless environment where CPUs perform continuous math without ever waiting. In reality, performance is strictly governed by the laws of physics and hardware architecture: data takes crucial nanoseconds to travel from main memory to the processor (the memory wall), coordinating calculations across multiple physical machines introduces unavoidable network latency, and running all cores at maximum capacity generates immense heat that forces the silicon to temporarily throttle its clock speeds to avoid melting.

2.2. SIMD (Single Instruction, Multiple Data) and Vectorization

In a normal scalar operation, i.e. Single Instruction Single Data, we have 2 operands and an operator. With SIMD, we have 2 vectors and an operator, and the instruction is applied to every element of the vectors at the same time. This is achieved through SIMD registers, which are wider than ordinary 64-bit registers and hold several values at once. A 128-bit register, for example, can be treated as two 64-bit integers or four 32-bit integers. In C/C++ this is accomplished with intrinsics, which tell the compiler to place the corresponding SIMD instructions in the assembly code.
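The "same bits, different lanes" idea can be illustrated in Python with the standard `struct` module. This is only an emulation under the assumption of a 128-bit register and little-endian layout; real SIMD code would use intrinsics in C/C++, and `simd_add_i32x4` is a made-up name.

```python
import struct

# 16 bytes = 128 bits, the size of one SSE register (assumed for illustration).
register = struct.pack("<4i", 1, 2, 3, 4)

as_four_i32 = struct.unpack("<4i", register)  # view as four 32-bit lanes
as_two_i64 = struct.unpack("<2q", register)   # view the same bytes as two 64-bit lanes

print(len(register) * 8)  # 128
print(as_four_i32)        # (1, 2, 3, 4)
print(as_two_i64)         # (8589934593, 17179869187)


def simd_add_i32x4(x, y):
    """Emulate one SIMD add: the same '+' applied to all four lanes at once."""
    return tuple(a + b for a, b in zip(x, y))


print(simd_add_i32x4((1, 2, 3, 4), (10, 20, 30, 40)))  # (11, 22, 33, 44)
```

The hardware performs the four additions in a single instruction, whereas this emulation loops over the lanes one by one.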

2.3. Message Passing

In distributed supercomputing, message passing is the essential communication mechanism that allows independent physical machines to act as a single computational brain, but it is also the primary bottleneck for scaling. Because processors on different motherboards cannot directly access each other’s memory (RAM), they must explicitly package data into digital envelopes, attach routing instructions, and push them across physical network cables using software like MPI. This architecture introduces a significant penalty known as communication overhead. Every message sent incurs network latency and requires careful coordination; if two nodes each block waiting for a message from the other, the entire system can deadlock.

3. Optimisations

Given the massive communication overhead and network deadlocks observed in the initial 4-node cluster attempt, it became clear that scaling out was actively harming performance. Therefore, the strategy was pivoted to scaling up. The following optimizations document the process of abandoning the distributed network entirely and focusing all efforts on squeezing maximum performance out of a single, highly dense compute node.

3.1. OpenBLAS

The default BLAS library relies on unvectorized, single-instruction single-data (SISD) operations, which leaves the processor’s 256-bit wide YMM registers unutilized. Building OpenBLAS with TARGET=HASWELL selects its Haswell-optimised kernels, which use Advanced Vector Extensions (AVX2) and fused multiply-add (FMA) instructions. Each FMA instruction operates on four 64-bit double-precision numbers at once and counts as two floating-point operations per element; with two FMA units per core this gives the 16 FLOPs per cycle used in section 1.1. In principle this moves the workload away from the scalar bottleneck and toward the compute-bound ceiling.

Baseline Performance: 22.27 GFLOPS
Optimised Performance: 31.70 GFLOPS
Performance Gain: ~42% speedup

3.2. Scaling it up

Scaled the global matrix size (N) to 60,000 and consolidated the process grid to a highly dense, single-node architecture (P=1, Q=1), dynamically routing all workload parallelization through 24 localized OpenMP threads.

Reason: At smaller problem sizes (e.g., N = 20,000), the benchmark is heavily bottlenecked by operational overhead rather than raw mathematical computation. By scaling the matrix up to 60,000, we filled approximately 28.8 GB of the node’s local memory (8 * N^2 bytes). This shifts the workload toward being compute-bound, so that the 24 threads are kept fed with AVX2 floating-point operations. Furthermore, by scaling up the memory footprint on a single motherboard rather than scaling out across the cluster, we bypassed all inter-node network latency, TCP overhead, and firewall packet inspection.
Pre-Scaling Performance: 31.70 GFLOPS
Final Peak Performance: 34.71 GFLOPS
Performance Gain: 9.5% speedup

3.3. Micro-Tuning (Thread Pinning & Cache-Line Alignment)

Implemented strict CPU thread affinity binding (OMP_PROC_BIND=true, OMP_PLACES=cores) and executed an automated block-size sweep (NB ∈ {128,192,256}) at a controlled matrix scale (N = 30,000).

Reason:

- Thread Pinning: By default, the Linux kernel is free to migrate OpenMP threads between physical cores as it balances load. Each migration causes cache-miss penalties, because data loaded into the old core’s local L1/L2 cache has to be fetched again on the new core. Forcing strict thread-to-core affinity prevents this migration and keeps data localized.
- Cache Matching: The block size (NB) governs the granularity of the matrix factorization blocks. If NB is too large, the data block overflows the high-speed L2/L3 cache and spills into the high-latency main system RAM. If NB is too small, the execution pipeline stalls due to excessive loop overhead and data requests. Running a systematic sweep identifies the block size that best fits the processor’s cache hierarchy.
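A minimal sketch of how the pinning settings could be passed to a launch from Python. OMP_PROC_BIND and OMP_PLACES are the values quoted above; OMP_NUM_THREADS is the standard OpenMP variable for the thread count and is an assumption here, as is the helper name `pinned_env`. The actual launch is left commented out so the snippet can be run anywhere.

```python
import os


def pinned_env(threads: int, base=None) -> dict:
    """Copy an environment and add OpenMP thread-affinity settings to it."""
    env = dict(os.environ if base is None else base)
    env.update(
        {
            "OMP_NUM_THREADS": str(threads),  # assumed: standard OpenMP thread count
            "OMP_PROC_BIND": "true",          # do not let threads migrate
            "OMP_PLACES": "cores",            # one place per physical core
        }
    )
    return env


env = pinned_env(24)
cmd = ["mpirun", "-np", "1", "./xhpl"]
print({k: v for k, v in env.items() if k.startswith("OMP_")})

# On the cluster, the run itself would be something like:
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```

Setting the variables in a copied environment rather than in `os.environ` keeps the sweep script itself unaffected between runs.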

Block Size (NB)	Execution Time (s)	Performance (GFLOPS)
128	514.10	35.01
192	Timeout / Cache Thrashing	N/A
256	Timeout / Cache Thrashing	N/A
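The GFLOPS column follows from N and the wall time. The sketch below uses the operation count that HPL reports against, (2/3) * N^3 + (3/2) * N^2; the function name `hpl_gflops` is illustrative.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an HPL run of order n that took the given wall time."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# N = 30,000 and 514.10 s are the values from the NB = 128 row above.
print(f"{hpl_gflops(30_000, 514.10):.2f} GFLOPS")
```

Because the operation count grows with N^3, doubling N multiplies the work by about eight, which is why the timeout rows matter so much at larger problem sizes.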

3.4. The Non-Linearity of Scaled Optimization

From the last benchmark, the optimal cache-aligned block size (NB = 128) discovered during the N = 30,000 sweep was applied to the maximum-density global matrix (N = 60,000), utilizing strict thread pinning.

Peak Micro-Tuned Performance (N = 30,000): 35.01 GFLOPS
Scaled Final Performance (N=60,000): 28.99 GFLOPS

Analysis & Hardware Bottleneck: Contrary to expectations, applying the micro-tuned NB=128 block size to the maximally scaled matrix resulted in a ~17% performance degradation. One plausible explanation is Translation Lookaside Buffer (TLB) thrashing and per-block instruction overhead. While NB=128 worked well for the smaller global matrix, quadrupling the total matrix area at N = 60,000 generated roughly four times as many blocks. The CPU overhead required to manage, fetch, and synchronize this volume of small computational chunks may have outweighed the gains from the L2 cache fit.

Conclusion: This run suggests that in HPC hardware tuning, cache granularity (NB) should be re-tuned as the global problem size (N) changes. For matrices exceeding 20 GB of system memory on this specific xcne architecture, accepting minor L2 cache misses with a larger block size (e.g., NB = 192) yielded higher sustained throughput than strictly optimizing for L2 cache residency.

3.5. Motherboard Routing & Topology Micro-Tuning

Executed a high-speed parameter sweep of High-Performance Linpack’s internal broadcast topologies (BCAST 0 through 5) at a localized matrix scale (N = 20,000, NB = 128).

Reason: In a dense, single-node compute environment (24 physical cores operating on a shared memory bus), network latency is eliminated, but internal data movement can still become a micro-bottleneck. BCAST selects the algorithm HPL uses to broadcast each factored panel across the process grid; the default (1-Ring) passes data sequentially from one process to the next. Testing the alternatives shows whether any of them moves data more efficiently on this node. With the 1 x 1 process grid from 3.2 there is only one MPI rank, so large differences are not expected, and the spread measured here is only about 1%.

Routing Algorithm	BCAST Value	Execution Time (s)	Performance (GFLOPS)
1-Ring (Default)	0	176.36	30.24
Modified 1-Ring	1	176.66	30.19
2-Ring	2	175.45	30.40
Modified 2-Ring	3	175.25	30.44
Long (bandwidth-reducing)	4	174.84	30.51
Long (bandwidth-reducing, modified)	5	175.24	30.44

3.6. MKL

Replacing the generic, open-source OpenBLAS linear algebra library with the proprietary Intel Math Kernel Library (MKL).

Theoretical Advantage: OpenBLAS ships hand-written kernels for many CPU architectures (AMD, Intel, ARM), selected at build time with options such as TARGET=HASWELL. Because the SoC Compute Cluster xcne nodes utilize Intel Xeon processors, linking the High-Performance Linpack (HPL) binary against Intel MKL would swap these for Intel’s own proprietary, hand-tuned kernels, which are engineered specifically for Xeon vector units (AVX2 on these Haswell parts; AVX-512 is not available on this generation). MKL might also reduce the Translation Lookaside Buffer (TLB) thrashing suspected during the N = 60,000 run, though that remains a hypothesis.
Execution Blockers: Implementation of this optimization was halted by strict system administration constraints. Reconnaissance of the xlogin and compute nodes confirmed that standard HPC module managers for proprietary compilers (e.g., module avail intel) are unconfigured. Further bare-metal directory probing confirmed the physical absence of the Intel compiler suite within standard enterprise installation paths (e.g., /opt/intel). Without root or sudo privileges to install the requisite libraries, the execution environment is strictly confined to the GNU compiler and the OpenBLAS stack.
Impact Analysis: If administrative access were granted to install and dynamically link MKL (libmkl_core.so), a large speedup would be expected, but it is bounded by the hardware: the single-node Rpeak from section 1.1 is 460.8 GFLOPS. Reaching 80% of that would mean about 369 GFLOPS, roughly 10x the ~35 GFLOPS measured with OpenBLAS. This is a projection, not a measurement. The benchmarks documented in this report are the best results achieved within the restricted, open-source software environment provided.

3.7. Final Results

A final, unified benchmark was executed to combine all discovered optimizations at the largest problem size used in this project (N = 60,000, about 28.8 GB of matrix data). To counteract the degradation observed in section 3.4, the block size was raised to NB=192. Furthermore, the CPU threads were pinned to physical cores (OMP_PROC_BIND=true), and the fastest broadcast setting from the sweep in section 3.5 (BCAST = 4) was used.

Results:

Sustained Performance (N = 60,000): 34.76 GFLOPS
Execution Time: 4,142 Seconds (1.15 Hours)

Analysis & Conclusion: This unified run recovered the performance lost in section 3.4. With the larger block size at the larger problem size, the system sustained 34.76 GFLOPS for over an hour, close to the micro-tuned best of 35.01 GFLOPS. The small difference may be due to clock-speed reduction under prolonged, fully loaded AVX execution, though ordinary run-to-run variation could explain it as well. This is still only about 7.5% of the 460.8 GFLOPS single-node Rpeak from section 1.1, but it is the best configuration found for this node under the open-source compiler and library constraints.

4. Conclusions

Overall this was a pretty fun experience. From learning about the history of optimising performance on a single core to figuring out how to get multiple cores to work together more efficiently, I found myself asking “why is this so annoying?” every 5 minutes. I still do not fully understand all of the theory behind this project, but I look forward to understanding more as I explore other areas of this field.

HPC

This post is licensed under CC BY 4.0 by the author.
