There is a method to the madness of performance!

Eduardo Madrid

75 mins
intermediate
advanced
11:00-12:30, Friday, 5th July 2024

We have something crazy to share with you:

We just wanted a good comparison to assess the performance of our programming techniques and library. We chose to implement very well known and important tasks, such as string length and conversion of a string to an integer, because this would let us compare against hopefully the best, like the standard C library implementation in GLIBC. It does not get more real than comparing against a project that literally runs the world.

To our astonishment, not only did our performance hold its ground, it comprehensively beat that of GLIBC, and anything else!

The craziest thing about this is that GLIBC's code shows a large amount of work, sophisticated programming techniques, and the use of platform-specific, handcrafted assembler that goes to extremes such as, to name one example, "saving some code size by microfusing VPMINU with the [32 byte data load]". Yet all of that effort is beaten by simple, understandable code that, as you can see, did not take much effort either; and without us even meaning to, our object code size is a small fraction of theirs.

How is this madness possible: working much less, writing direct and understandable code, and still getting at times twice the performance, at times portable C++ matching handcrafted assembly with SIMD instructions?

There is a method to this madness!

  1. You put the knowledge of what performs well into infrastructure code, so that programmers who use the library benefit automatically; in practice, only C++ allows you to do this (see the sketch after this list).
  2. This infrastructure needs to compose in a way that lets the compiler optimizer fully do its job.
  3. It needs to help the programmer express things in performance-friendly ways, a key to saving effort.
  4. Avoiding pitfalls that would keep the processor from using its full capabilities, especially those related to "out of order" execution.
  5. Avoiding the pitfall of "additive bias".
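To make the first point concrete, here is a minimal sketch of what "knowledge of what performs well, baked into infrastructure code" can look like. This is an illustration under my own assumptions, not code from the talk or from the library being presented:

```cpp
#include <cstdint>

// Hypothetical infrastructure: a branchless "select", written once, so every
// user of the library gets the branch-free code path without having to know
// the trick. select(mask, a, b) yields a where mask is all ones and b where
// mask is all zeros.
inline constexpr std::uint64_t
select(std::uint64_t mask, std::uint64_t a, std::uint64_t b) {
    return (a & mask) | (b & ~mask);  // data dependency, no branch emitted
}

// A user-facing helper composed on top of it: branchless maximum.
inline constexpr std::uint64_t
branchlessMax(std::uint64_t a, std::uint64_t b) {
    // 0 - (a < b) is all ones when a < b, all zeros otherwise.
    return select(std::uint64_t(0) - std::uint64_t(a < b), b, a);
}
```

Because these helpers are small, inlinable, and ordinary C++, the optimizer can compose them freely through user code, which is the point of items 2 and 3 above.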

About "what performs well", examples include the following and more:

  1. "Branchless" programming: you do perhaps more total work, but deterministically
  2. Converting control dependencies into data dependencies
  3. Organizing the code to pay the performance penalty of a branch only once, and where it hurts the least
  4. Letting the optimizer keep assumptions
  5. Explicit "speculative" computation, and also "implicit".

From the benchmarks and source code comparisons you will see exactly how the professionally made code did not follow these recommendations and hurt itself; better yet, you'll see how it is possible to make a library that makes it natural to follow these recommendations!... and how this can only be achieved in C++!

generic programming
templates
assembler
compiler explorer

Eduardo Madrid

I'm the lead of a little Open Source project that teaches useful lessons about how to program in C++ for top performance. I've been a team and technology leader at big companies, including Snap, the maker of Snapchat, and at hedge funds that do automated and high-frequency trading.