Overview
My algorithm processes three decimal numbers at a time (10 digits × 3 + 2 line breaks in between = 32 bytes, which fits in a 256-bit ymm register). Let’s call this a unit operation.
unit operation (3 numbers): .---.
We process a _block_ of data backwards: we start at the end and keep applying the unit operation, moving the pointer back 3 numbers each time, until we reach the beginning of the block.
block: |<.---.---.---.---.---.---.---.---.---.---.---.---.|
We process 8 blocks at the same time; all blocks have the same size and are positioned in memory one after another, without gaps. We call this collection of 8 blocks a superblock. Each block software-prefetches 12 cache lines ahead (i.e., backwards) into L1.
superblock (8 blocks): |<---.---.---|<---.---.---|<---.---.---|<---.---.---|<---.---.---|<---.---.---|<---.---.---|<---.---.---|
We split the entire input into a collection of 4 superblocks of about 130 MB each, plus a fifth superblock of a few MB. Superblocks are processed sequentially.
superblock 1 (8 blocks): |<--|<--|<--|<--|<--|<--|<--|<--|
superblock 2 (8 blocks): |<--|<--|<--|<--|<--|<--|<--|<--|
superblock 3 (8 blocks): |<--|<--|<--|<--|<--|<--|<--|<--|
superblock 4 (8 blocks): |<--|<--|<--|<--|<--|<--|<--|<--|
superblock 5 (8 blocks, remaining few MBs): |.....|
A few details
I don’t have access to a Haswell machine, and even if I had, I wouldn’t really know how to interpret the perf counters related to memory access. So the information in this section is mostly speculation based on timing measurements.
Why is processing multiple blocks simultaneously faster than one at a time?
I think the main reason is to exploit the CPU’s inherent instruction-level parallelism. A single unit operation in this problem (processing 3 numbers) is mostly sequential and keeps the CPU ports idle most of the time. Running multiple unit operations at the same time allows us to better utilize the CPU ports.
Why is processing 8 blocks at the same time faster than 7 or fewer?
I think the answer here is that the more the better, and 8 doesn’t hit any of the bottlenecks that could slow things down. For example:
- Haswell’s L2 streamer prefetcher should easily handle 8 parallel streams (or 16 if for some reason moving across 4K pages consumes two streams).
- We need 8 64-bit registers to store pointers, 16 are available, so we don’t spill to memory.
As a side note: why do we care about hardware prefetchers when we use software prefetches? Software prefetches are useful because, unlike hardware ones, they can cross 4KB page boundaries. On the other hand, software prefetches consume instructions, while hardware prefetchers don’t. But the main reason to care about hardware prefetchers is that L2 prefetchers are not bound by the line fill buffer limit (10 entries), only by the larger superqueue limit (16 entries). And it’s those limits that mostly determine the RAM-to-CPU transfer speed on a single core, not the bandwidth of the RAM chips.
Why is processing 8 blocks at a time faster than 9 or more?
As we increase the number of blocks above 8, the bottlenecks listed in the previous section start playing a role. But I think the main reason for the sharp drop above 8 has to do with RAM bank/rank assignment (see the next question).
What is a good size for a block?
I’m not sure what the optimum value is but there are two main concerns that I think are relevant here.
The first factor is that crossing a 4KB page boundary slows things down. For example, hardware prefetchers need to retrain, and accessing an unaligned address that spans a 4KB page boundary can take tens of cycles. If all blocks encounter such speedbumps around the same time, this can result in unnecessary CPU pipeline stalls. So keeping the block size away from a multiple of 4KB (or even of 2KB) mitigates this problem.
The second factor is RAM organization into ranks and banks. Accessing two addresses more than 16KB apart that map to the same bank and rank is slow because of bank thrashing. I’m assuming the highload.fun server RAM is organized into 8 banks and 2 ranks. This means we have only 16 different combinations, and since each of the 8 blocks can in principle touch two such combinations at once (because blocks prefetch data, and also are not perfectly in sync with one another), this requires a careful arrangement: there’s no room for error here. This also suggests that going above 8 simultaneous blocks would inevitably result in bank thrashing.
Paper [1] below shows that a given physical address is translated into bank/rank ids as (a[14] xor a[18], a[15] xor a[19], a[16] xor a[20], a[17] xor a[21]), where a[14] is the 14-th bit of the address. This suggests that we should align blocks so that bits a[18..14] are the same, but bits a[21..19] are always different.
The block size I ended up using is 2^24 - 2^19 + 0x4c0 = 16,254,144 bytes, which strikes a reasonable compromise between these two factors (4KB pages and bank thrashing). Measurements confirm that most other choices of the block size tend to be slower.
Why don’t we just treat the whole input as one superblock?
The theoretical answer is that the input size varies from run to run, and if we just split the input into 8 blocks every time, we could sometimes be unlucky. By using fixed-size superblocks, we remove this uncertainty. How much this matters depends on the particular scheme of dividing the input into blocks, but when I was originally working on the problem, I noticed that splitting the input into four superblocks of size input_size/4 each led to a significant speed up over treating the entire input as a single superblock. I kept investigating why, and this is how I got to the superblock scheme described above.
Why do we process the data backwards?
Processing data backwards ensures we’re always aligned to the last digit rather than the first digit, which significantly benefits the computational part of the algorithm (the unit operation). There’s no benefit to processing data backwards from the memory access perspective; if anything, it may be slightly slower than accessing data in the forward direction.
Why do we access unaligned data?
While alignment should not matter for memory access when it occurs inside a single cache line, it is slower when that access spans multiple cachelines, which is about 50% of the time. Accessing unaligned data that spans a 4KB page boundary is particularly slow. But all of these memory access costs are outweighed by the computational benefits coming from being always aligned to the last digit.
How do we keep the blocks in sync as we process data?
Since different numbers have different numbers of digits, the blocks will not stay perfectly in sync as processing proceeds. For simplicity, let’s assume each number has 9 or 10 digits, with 50% probability each. That gives a standard deviation of 0.5 bytes per number, which means √50,000,000 × 0.5 ≈ 3,500 bytes for the entire input of about 50 million numbers. So if the whole input were a single block, we would expect the difference between blocks to be on the order of a single 4KB page. For a 16 MB block, that number is about 600 bytes. Of course, this calculation is not exact, but the order of magnitude should be correct. So inter-block de-sync should not be a huge problem.
Still, I observed a small benefit from periodically re-syncing the blocks. Every 50,000 numbers or so, the algorithm lets all blocks “catch up” with the fastest one. The impact of this was maybe 300 μs (30 points) or so.
FYI: this is a repost of an article that I originally sent to the Telegram channel pre-upgrade.