↑ Blog ↑ A Failed Experiment with Siso ↦

NULL BITMAP on SIMD

This was a guest post contributed to the Two-Year ZINE Special Edition of NULL BITMAP.

I’ve been learning about how to write SIMD code recently, and I’ve come to appreciate that there’s two distinct skills involved: knowing how to transform a given loop into one that has sufficient parallelism to apply SIMD, and knowing how to fit a parallel loop’s computation into the restricted set of available SIMD instructions. We’ll be looking at only the first part of that today: learning a pattern of how to transform one loop type into its parallel version.

Simple examples of SIMD optimizations tend to show off the speedups on problems that are already embarrassingly parallel. map is an easy function to parallelize: there’s no dependencies between loop iterations, so each element can be handled in parallel trivially. reduce with a commutative operation is similarly simple: repeatedly apply the operation over pairs of elements in parallel until there’s only one value left. Many problems fit the shape of a map() and a reduce(). Suppose we want to compute the numerical sum of an array of ASCII digits: this involves a map() to convert each to its integer representation, and then a reduce() to sum them together. And we can neatly fuse those in one loop:

Scalar	SIMD
int sum_ascii_digits( char* digits, size_t length) { int sum = 0; for (int i = 0; i < length; i++) { sum += digits[i] - '0'; } return sum; }	int sum_ascii_digits(char* digits, size_t length) { // Process 4 digits at a time across 4 lanes. int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0; int i = 0; for (; i+3 < length; i+=4) { sum1 = digits[i] - '0'; sum2 = digits[i+1] - '0'; sum3 = digits[i+2] - '0'; sum4 = digits[i+3] - '0'; } for (i < length; i++) { sum1 += digits[i] - '0'; } // Sum the lanes by adding pairs in parallel. sum1 += sum3; sum2 += sum4; sum1 += sum2; return sum1; }

Scalar

SIMD

int sum_ascii_digits(
    char* digits,
    size_t length)
{
  int sum = 0;
  for (int i = 0;
       i < length;
       i++) {
    sum += digits[i] - '0';
  }
  return sum;
}

int sum_ascii_digits(char* digits,
                     size_t length) {
  // Process 4 digits at a time across 4 lanes.
  int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
  int i = 0;
  for (; i+3 < length; i+=4) {
    sum1 = digits[i] - '0';
    sum2 = digits[i+1] - '0';
    sum3 = digits[i+2] - '0';
    sum4 = digits[i+3] - '0';
  }
  for (i < length; i++) {
    sum1 += digits[i] - '0';
  }
  // Sum the lanes by adding pairs in parallel.
  sum1 += sum3; sum2 += sum4;
  sum1 += sum2;
  return sum1;
}

Such transformations make a lot of sense in my head, because it’s transforming an iteration over an array from one element at a time to many elements at a time, but the overall linear structure of the algorithm stays almost exactly the same. Any loop that’s just a map() and reduce() of a commutative operation can be transformed in this fashion.

There’s other patterns for SIMD solutions which I find fascinating because they require contorting the scalar solution of a linear scan into a completely different shape of an algorithm. We’re going to be looking at the slightly more difficult problem of parsing integers instead. Given a string of 8 digits, I want to know what unsigned 64-bit int they represent. The scalar solution to this is a pretty simple linear scan:

uint64_t parse_int(char* digits) {
  uint64_t parsed = 0;
  for (int i = 0; i < 8; i++) {
    parsed = parsed * 10 + digits[i] - '0';
  }
  return parsed;
}

However, there’s not a lot of opportunity for parallelism in this. Each iteration depends on applying an operation (multiply by 10) to the output of the previous iteration.

If we fully multiply out

(((d[0] * 10 + d[1])*10 + d[2]) * 10 + d[3]) * 10 + ... + d[7]

We’re left with

d[0] * 10^7 + d[1] * 10^6 + d[2] * 10^5 + ... + d[7] * 10^0

And we can write an algorithm that follows that style of computation instead, by building up the power of 10:

uint64_t parse_int2(char* digits) {
  uint64_t parsed = 0;
  uint64_t power_of_ten = 1;
  for (int i = 7; i > 0; i--) {
    parsed += (digits[i] - '0') * power_of_ten;
    power_of_ten *= 10;
  }
  return parsed;
}

However, this still leaves us computing O(digits) numbers of powers of 10, and each iteration still depends on the result of the previous iteration. But can go another step further. Extracting out a common factor of 10^4 from the first half of the elements, and then we’re left with two very similar computations to perform, with one final multiply-and-add:

(d[0] * 10^3 +     d[1] * 10^2 + ... + d[3] * 10^0) * 10^4 +
(d[4] * 10^(3) + d[5] * 10^2 + ... + d[n]     * 10^0)

This looks like a nice optimization, as it means we never need to compute a power of 10 higher than half the number of elements. But we can also apply the same extraction of common factors again and again:

((d[0] * 10^1 + d[1]) * 10^2 +
 (d[2] * 10^1 + d[3]))       * 10^4 +
((d[4] * 10^1 + d[5]) * 10^2 +
 (d[6] * 10^1 + d[7]))

Which leaves us with a nice tree-shaped computation where we can perform all the multiply-and-adds that require 10^1, then compute 10^2 and perform all the multiply-and-adds which require that value, then compute 10^4 and … ad. nauseum. We only ever need to compute squares of 10!

Level 0: ['1', '2', '3', '4', '5', '6', '7', '8']
          │*10+ │    │*10+ │   │*10+ │   │*10+ │
          └──┬──┘    └──┬──┘   └──┬──┘   └──┬──┘
Level 1:     12        34         56        78
              │ *10^2+  │          │  *10^2+ │
              └────┬────┘          └────┬────┘
Level 2:          1234                 5678
                   │      *10^4 +        │
                   └──────────┬──────────┘
Level 3:                   12345678

This idea can be translated back into C:

uint64_t parse_int_tree(char* digits) {
    // For the sake of clarity, first transform the digits into an
    // array of the integral values.
    uint64_t parsed[8] = {0};
    for (int i = 0; i < 8; i++) {
        parsed[i] = digits[i] - '0';
    }
    // Now follow the tree-shaped computation.  We multiply-and-add pairs
    // of elements, assigning the result back to the left-hand side.
    uint64_t power_of_ten = 10;
    for (int level = 0; level < 3; level++) {
        for (int i = 0; i < 8; i = i + (2 << level)) {
            parsed[i] = parsed[i] * power_of_ten + parsed[i+(1<<level)];
        }
        power_of_ten *= power_of_ten;
    }
    return parsed[0];
}

And now, we finally have a loop where each iteration doesn’t depend on the previously resulting value. We’ve achieved parallelism!

From there, we could unroll our loops (#pragma clang loop unroll(full)!) to get a straight line of instructions to execute, but gcc/clang will do that for you already. In the true spirit of SIMD, we can further optimize this by packing the operations for multiple digits into one value. In SIMD land, you’ll typically see this as a significant amount of masking and shifting. We mask to find each of the tens digits, we shift it to line up with the ones digits, perform the multiply-and-add, and then use a wider mask to do the same for hundreds and ten thousands. This is the SIMD-within-a-register (SWAR) technique:

#include <endian.h>

uint64_t parse_int_swar(char* digits) {
  uint64_t digits_bytes = *(uint64_t*)digits;
  uint64_t digits_bcd = digits_bytes - 0x3030303030303030UL;
  // If the host is little endian, then loading it as a uint64_t
  // will mean the least significant byte is the most significant
  // digit, and it's mentally easier to think of it the other way.
  // This mental ease costs us one `bswap` instruction.
  digits_bcd = htobe64(digits_bcd);

  uint64_t tens_upper_mask = 0xFF00FF00FF00FF00UL;
  uint64_t tens_lower_mask = 0x00FF00FF00FF00FFUL;
  uint64_t level_one = ((digits_bcd & tens_upper_mask) >> 8) * 10 +
                       (digits_bcd & tens_lower_mask);

  uint64_t hundreds_upper_mask = 0xFFFF0000FFFF0000UL;
  uint64_t hundreds_lower_mask = 0x0000FFFF0000FFFFUL;
  uint64_t level_two = ((level_one & hundreds_upper_mask) >> 16) * 100 +
                       (level_one & hundreds_lower_mask);

  uint64_t tenK_upper_mask = 0xFFFFFFFF00000000UL;
  uint64_t tenK_lower_mask = 0x00000000FFFFFFFFUL;
  uint64_t level_three = ((level_two & tenK_upper_mask) >> 32) * 10000 +
                         (level_two & tenK_lower_mask);

  return level_three;
}

In general, any fold comprised of commutative operations can be computed in this fashion to unlock parallelism. SIMD-ifying code is easy when it’s already embarrassingly parallel. The fun is in trying to find the right way to contort seemingly serial algorithms into parallel ones!

So, what did our optimizations achieve?

Benchmark              Time             CPU   Iterations
--------------------------------------------------------
parse_int1         0.426 ns        0.425 ns   1667920355
parse_int2         0.421 ns        0.420 ns   1665745819
parse_int_tree     0.484 ns        0.483 ns   1483969012
parse_int_swar     0.421 ns        0.420 ns   1666246273

Nothing! But it sure was fun!

If you’re interested in more of this, highload.fun gives a nice framework and set of challenges for trying to get practice at applying SIMD to real problems. What we’ve looked at is only a small portion of the first "parsing integers" challenge.

↑ Blog ↑ A Failed Experiment with Siso ↦

See discussion of this page on Reddit, HN, and lobsters.