Branchless Coding in Go
Updated: 2021-05-10
So, I was recently inspired by a coworker’s design to pack what would have been JSON (JavaScript Object Notation) data into a bunch of bit fields in an array of 64-bit elements. Binary arithmetic, bit-packing, and high performance are fun for me, so I thought of some ways to improve it, for learning purposes. Then I had an idea: maybe I can unpack/pack bits without using conditional branching (if statements).
Why does this matter?#
Performance and security. Modern CPUs often do branch prediction when they reach a conditional jump, which is essentially a guess. If the CPU guesses wrong, it needs to unwind its speculative work and then execute the other branch of the if statement. A wrong guess is both a performance hit and a potential information leak: attackers running on the same machine can time how long an operation takes to infer the CPU’s guess, as with Spectre. For sensitive operations, such as cryptography and server-side APIs, timing attacks can reveal privileged information.
The fewer branches we have in our code, the faster the CPU can execute it. There are fewer guesses to get wrong =)
Simple unpack#
Unpacking a packed binary format usually requires a bunch of bitwise operations. Assuming we have an 8-bit binary structure that holds 8 boolean config options, here’s how we might unpack it:
tests := make([]bool, 8)
...
var input uint8 = uint8(*num)
// set each boolean to a bit in the input
if input&(1<<0) != 0 {
tests[0] = true
}
if input&(1<<1) != 0 {
tests[1] = true
}
if input&(1<<2) != 0 {
tests[2] = true
}
if input&(1<<3) != 0 {
tests[3] = true
}
if input&(1<<4) != 0 {
tests[4] = true
}
if input&(1<<5) != 0 {
tests[5] = true
}
if input&(1<<6) != 0 {
tests[6] = true
}
if input&(1<<7) != 0 {
tests[7] = true
}
...
I shortened it a bit for readability. Essentially, we check each input bit, one at a time, and store the result to a bool. Now, let’s look at the assembly output:
go build -o /tmp/binextract-if -gcflags='-S' binextract-if.go
Comments added for clarity:
404 b.go:19 MOVQ 86+96(SP), AX ; AX = input
409 b.go:19 NOP ;
416 b.go:19 TESTB $1, AL ; flag = AL & 0b0001
418 b.go:19 JEQ 862 ; if flag == 0 { jmp }
424 b.go:20 MOVQ 85+112(SP),CX ; CX = tests[]
429 b.go:20 MOVB $1, (CX) ; tests[0] = true
432 b.go:22 TESTB $2, AL ; flag = AL & 0b0010
434 b.go:22 JEQ 440 ; if flag == 0 { jmp }
436 b.go:23 MOVB $1, 1(CX) ; tests[1] = true
440 b.go:25 TESTB $4, AL
442 b.go:25 JEQ 448
444 b.go:26 MOVB $1, 2(CX)
448 b.go:28 TESTB $8, AL
450 b.go:28 JEQ 456
452 b.go:29 MOVB $1, 3(CX)
456 b.go:31 TESTB $16, AL
458 b.go:31 JEQ 464
460 b.go:32 MOVB $1, 4(CX)
464 b.go:34 TESTB $32, AL
466 b.go:34 JEQ 472
468 b.go:35 MOVB $1, 5(CX)
472 b.go:37 TESTB $64, AL
474 b.go:37 JEQ 480
476 b.go:38 MOVB $1, 6(CX)
480 b.go:40 TESTB $-128, AL
482 b.go:40 JEQ 488
484 b.go:41 MOVB $1, 7(CX)
488 b.go:44 MOVQ CX, (SP)
...
862 b.go:23 MOVQ 85+112(SP),CX ; same MOVQ as location 424
867 b.go:19 JMP 432 ; (appendix A for explanation)
Cheat sheet:
MOV | move the value on the left to the location on the right |
NOP | no operation (do nothing) |
TEST | bitwise AND of left and right, setting CPU flags without storing the result |
JEQ | if the last operation resulted in zero, jump to the given location |
$ | constant: $1 == 0b0001, $2 == 0b0010, $4 == 0b0100, etc. |
AX | CPU register; used for comparisons here |
AL | lower 8 bits of AX |
CX | CPU register; holds the address of the output array |
That’s a lot of conditional branching. Where there’s conditional branching, there’s CPU branch prediction, and where there’s branch prediction, there’s performance loss and a potential vector for timing attacks like Spectre. For occasionally reading a config file, this is negligible. If this is hot code that runs frequently with sensitive information, perhaps we should remove the branches.
Branchless unpack#
Can we unpack these boolean values without conditional jumps? Yep! We can even do it in Go, and without a ternary operator =)
tests := make([]bool, 8)
...
var input uint8 = uint8(*num)
...
// set each boolean to a bit in the input
tests[0] = (input & (1 << 0)) != 0
tests[1] = (input & (1 << 1)) != 0
tests[2] = (input & (1 << 2)) != 0
tests[3] = (input & (1 << 3)) != 0
tests[4] = (input & (1 << 4)) != 0
tests[5] = (input & (1 << 5)) != 0
tests[6] = (input & (1 << 6)) != 0
tests[7] = (input & (1 << 7)) != 0
So, this is nearly the same thing, except we convert the bit fields into booleans instead of using if statements. C would let us convert ints to booleans implicitly, but Go doesn’t allow converting booleans to or from any other type. However, any expression that you can use inside an if condition can also be assigned directly to a boolean variable ;-)
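For example (a trivial illustration of mine, not code from the program above), any comparison can be stored straight into a bool:
isAdult := age >= 18 // same result as: if age >= 18 { isAdult = true }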
Assembly output:
404 n.go:19 MOVQ 86+96(SP), AX ; AX = input
409 n.go:19 TESTB $1, AL ; flag = AL & 0b0001
411 n.go:19 MOVQ 85+112(SP),CX ; CX = tests[]
416 n.go:19 SETNE (CX) ; tests[0] = flag
419 n.go:20 TESTB $2, AL ; flag = AL & 0b0010
421 n.go:20 SETNE 1(CX) ; tests[1] = flag
425 n.go:21 TESTB $4, AL
427 n.go:21 SETNE 2(CX)
431 n.go:22 TESTB $8, AL
433 n.go:22 SETNE 3(CX)
437 n.go:23 TESTB $16, AL
439 n.go:23 SETNE 4(CX)
443 n.go:24 TESTB $32, AL
445 n.go:24 SETNE 5(CX)
449 n.go:25 TESTB $64, AL
451 n.go:25 SETNE 6(CX)
455 n.go:26 TESTB $-128, AL
457 n.go:26 SETNE 7(CX)
461 n.go:28 MOVQ CX, (SP)
SETNE | sets target to true (0x01) if previous test is non-zero |
Check it out, no branching! And with no branching, there’s no branch prediction - at least not in this code =). Essentially, Go takes advantage of the SETNE x86 instruction to conditionally set the boolean variable instead of using conditional jumps.
Branchless pack#
Now that we have a binary unpacker that runs in constant time, how do we write the reverse? Go does not allow us to directly convert boolean values into bits. We can’t just bit-shift them as integers like we would in C:
#include <stdio.h>
#include <stdbool.h> // C99 bool support
int main() {
bool a=true;
bool b=true;
bool c=false;
bool d=true;
printf("%x %x %x %x\n", d, c, b, a);
// only one line =D
unsigned int out = a | (b<<1) | (c<<2) | (d<<3);
printf("result: 0x%x\n", out);
return 0;
}
➞ unbool.c
Aside: it’s rare for C to beat Go in code golf; is this some kind of achievement?
Also, this code only works with C99-style bools, where true is always 1 and never any other non-zero integer.
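For contrast, the direct Go translation of that one-liner doesn’t compile; the compiler rejects it with an error along the lines of “invalid operation: operator << not defined on bool”:
// out := a | b<<1 | c<<2 | d<<3 // does not compile: Go will not treat bools as integers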
In Go, we have to use if to convert bools into bits. I’m only going to do 4 bits here in the interest of clarity.
var out uint32
pa := flag.Bool("a", false, "a")
pb := flag.Bool("b", false, "b")
pc := flag.Bool("c", false, "c")
pd := flag.Bool("d", false, "d")
flag.Parse()
a := *pa
b := *pb
c := *pc
d := *pd
fmt.Printf("%t %t %t %t\n", a, b, c, d)
// set a bit in the output for each boolean
if a {
out |= 0x1
}
if b {
out |= 0x2
}
if c {
out |= 0x4
}
if d {
out |= 0x8
}
fmt.Printf("result: %b\n", out)
And the assembly dump:
go build -o /tmp/binextract-noif -gcflags='-S' bool.go
659 u.go:29 MOVBLZX "".a+87(SP), AX ; AX = a
664 u.go:29 MOVL AX, CX ; CX = AX
666 u.go:29 ORL $2, AX ; AX |= 0b0010
669 u.go:35 MOVBLZX "".b+86(SP), DX ; DX = b
674 u.go:35 TESTQ DX, DX ; flag = DX != 0
677 u.go:35 CMOVLNE AX, CX ; if flag { CX = AX }
680 u.go:32 MOVL CX, AX ; AX = CX
682 u.go:32 ORL $4, CX ; CX |= 0b0100
685 u.go:35 MOVBLZX "".c+85(SP), DX ; DX = c
690 u.go:35 TESTQ DX, DX ; flag = DX != 0
693 u.go:35 CMOVLNE CX, AX ; if flag { AX = CX }
696 u.go:35 MOVL AX, CX ; CX = AX
698 u.go:35 ORL $8, AX ; AX |= 0b1000
701 u.go:38 MOVBLZX "".d+84(SP), DX ; DX = d
706 u.go:38 TESTQ DX, DX ; flag = DX != 0
709 u.go:38 CMOVLNE AX, CX ; if flag { CX = AX }
712 u.go:38 MOVL CX, (SP) ; out = CX
CMOVLNE | assign left to right if previous result is non-zero |
➞ Full unbool.go assembly dump
Okay, this one is a bit harder to pick apart. Go is being clever. To understand what’s going on, know that Go treats the bool type as 0 or 1. It’s like C99 bools, but much more strict.
When it assigns AX = a, a will be either 0 or 1.
MOVBLZX "".a+87(SP), AX
Next, we store the value of AX into CX.
MOVL AX, CX
Then we set the 0b0010 bit on AX. Yep, we set this before checking whether b is true.
ORL $2, AX
Now we move b into DX (MOVBLZX does a zero-extend).
MOVBLZX "".b+86(SP), DX
and test if it’s non-zero (true)
TESTQ DX, DX
If b is non-zero, set CX to AX. Remember that AX has the “if it’s true” value.
CMOVLNE AX, CX
Whatever the result, we reset AX to CX’s current value.
MOVL CX, AX
And then we repeat this until we have all our booleans packed. Note that AX and CX alternate roles.
It seems that the Go compiler is much smarter than I expected. It sees the layout, figures out what we want to accomplish, then replaces the branches in the source code with conditional move instructions! Once again, constant-time code execution with no branch prediction. Clever Go =)
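One more way to write the pack (my own variant, not from the original program) is to push the single if into a tiny helper. Whether this particular shape also compiles down to SETcc/CMOV should be verified with -gcflags='-S' rather than assumed:
// b2u converts a bool to 0 or 1; the if inside it is the same kind the
// compiler just turned into conditional moves above.
func b2u(b bool) uint32 {
    var v uint32
    if b {
        v = 1
    }
    return v
}

func pack4(a, b, c, d bool) uint32 {
    return b2u(a) | b2u(b)<<1 | b2u(c)<<2 | b2u(d)<<3
}
As always in this post, trust the disassembly rather than the source when deciding whether the jumps are really gone.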
Practical use-case#
With this new knowledge, I wrote some server-side JWT validation code that runs in mostly constant time. Since the JWT validation is part of initial authentication, timing attacks could be used to determine which checks are run against the JWT, to search for potential weaknesses. The code I wrote runs the validation checks and stores each error condition as one bit in an error status variable. After all the checks are complete, a simple if valError != 0 check determines whether any errors occurred, and a success or error response is returned to the caller. This also has the neat side effect of making it easier to write unit tests that check for combinations of validation errors.
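As a rough sketch of that pattern (the error names and check parameters here are made-up stand-ins, not the actual validation code from that project), it looks something like this:
// Hypothetical error bits, one per validation check.
const (
    errExpired uint32 = 1 << iota
    errBadIssuer
    errBadAudience
    errBadSignature
)

// packValidationErrors records each already-computed check result as one
// bit in the error status; nothing short-circuits early.
func packValidationErrors(expired, badIssuer, badAudience, badSig bool) uint32 {
    var valError uint32
    if expired {
        valError |= errExpired
    }
    if badIssuer {
        valError |= errBadIssuer
    }
    if badAudience {
        valError |= errBadAudience
    }
    if badSig {
        valError |= errBadSignature
    }
    return valError
}

// After every check has run, one decision at the very end:
// if valError != 0 { return authError }
Note that the individual comparisons behind those booleans (signature and claim checks) also need constant-time handling, e.g. subtle.ConstantTimeCompare from crypto/subtle, which is outside this sketch.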
The typical way to defend against timing attacks on authentication is to set a minimum wait time. This way, any short-circuit behavior is masked, and the timing is identical whether all checks were run or not. A minimum wait time is even more important if authentication does a database lookup, such as checking whether a user account exists: the timing difference there is typically larger, so it is more detectable over the network. To my knowledge, MySQL and Postgres do not have any mechanism to prevent leaking information through timing.
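A sketch of that approach, for comparison, using the standard time package. The 25ms floor is a made-up placeholder that would need to be tuned above your real worst-case time, and checkCredentials is a hypothetical stand-in:
// respondAfterFloor runs a possibly short-circuiting check, then pads the
// elapsed time up to a fixed floor so callers can't tell how far it got.
func respondAfterFloor(check func() bool) bool {
    const minWait = 25 * time.Millisecond // placeholder; tune above worst case
    start := time.Now()
    ok := check()
    if elapsed := time.Since(start); elapsed < minWait {
        time.Sleep(minWait - elapsed)
    }
    return ok
}

// usage:
// ok := respondAfterFloor(func() bool { return checkCredentials(user, pass) })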
Pros and Cons#
Minimum wait time:
- easier to implement
- can be used when your code depends on another service or database that does not run in constant time
- requires you to time your code so you can set the timer to a little longer than max run time
- if your code timing changes, minimum wait time may need to be updated, since code that takes longer than MWT will stand out
Constant-time:
- harder to implement
- will not protect services or databases your code calls if they expose information via timing
- no MWT to configure and re-configure for code updates
- you still need to test and time your code to make sure timing information is not leaking
Conclusion#
Defending against side-channel attacks is a challenge. Find the critical security paths (such as login, authentication, and validation) and harden those. For cryptography, it’s vital to get this right. For everything else, proceed normally and hope the Spectre bug doesn’t bite on that shared cloud hosting you’re probably using.
For performance, remember that only the hot spots really need optimization, and there is usually some low-hanging fruit that will give a bigger performance boost than branchless coding. Example: using a non-garbage-collected language =P
Thanks to Matija Lesar for giving valuable feedback on this post.
Question/Comments? Found a mistake? Email me =)
Appendix A#
So… while writing this, I found an interesting assembly output in the if version.
404 b.go:19 MOVQ 86+96(SP), AX
409 b.go:19 NOP
416 b.go:19 TESTB $1, AL
418 b.go:19 JEQ 862 ; jump to 862
424 b.go:20 MOVQ 85+112(SP),CX ; this line is also at 862
429 b.go:20 MOVB $1, (CX) ; the line we needed to skip
432 b.go:22 TESTB $2, AL ; return here from 867
434 b.go:22 JEQ 440
436 b.go:23 MOVB $1, 1(CX)
...
862 b.go:23 MOVQ 85+112(SP),CX ; same MOVQ as location 424
867 b.go:19 JMP 432 ; this could have been JEQ
Wait, why are 424 and 862 the same instruction? Why did we jump over this just to repeat it and then jump back? Why jump to the end of the code block, when we only needed to skip the assignment at 429? What’s up with that?
I originally thought it was some clever compiler trick, or perhaps a bug, but I couldn’t think of a reason it should do this.
Then I went to #go-nuts on freenode to ask about it.
2021-03-22 18:08
<fizzie> I haven’t been looking at Go assembly output, but if it’s anything like C compilers, I think the problem is in assuming there would need to be a reason for a compiler to do something, other than “an implementation detail related to the compiler’s internal data structures caused it”.
<fizzie> FWIW, the output I get is substantially similar, except without the extra NOP. Just a jump to the end, that “duplicate” operation, and a jump back. If I had to guess, it’d be something like, at the time when it was making that decision, it didn’t realize that those two branches would end up with equivalent code, so it had to generate a separate “else” branch which it located past the end of the other block of code.
<Gnum4n> but the mov happens before the branch. Go moved it to after the branch on its own
<fizzie> Does the mov really “happen” at any point in particular? It’s not like there’s anything explicit in the source code that says “put this in a register”.
<Gnum4n> oh! I see what you mean. just because I created the array before line 20 in local function scope doesn’t mean that Go knows that it should be loaded into a register before line 19, so it assumes that it needs to load it on line 20 or before 23
<fizzie> Something like that, yes. FWIW, it doesn’t exactly explain how it realized it can do that register load just once before all the other ifs (instead of translating them the same way), but that’s the part that I think tends to boil down to “it made a lot of transformations through a lot of intermediate representations, and that’s how the chips fell”.
So the surprise isn’t what the compiler did (the weird duplication), but what it didn’t do, which is de-duplicate the MOV and jump to where we would expect.
Appendix B#
You may have noticed that input values are provided via command-line flags. I initially hard-coded the values, but the Go compiler converted them into one big constant and optimized away all the decision-making. Clever compiler.
Updates#
HackerNews#
Thank you so much to everyone who took the time to read and give feedback! Wow, I was not expecting this big of a response! So many good ideas!
I’m very grateful for all the thoughtful comments. I read them all. One of them stands out:
My advice for the author and anybody else considering this is to break out the confirmed-correct assembler code into its own non-Go object and then link it in; otherwise, you’re depending on the compiler to never change and inadvertently introduce branches. Since the functionality of the code wouldn’t change it would be difficult to check with a unit test. (I guess you could add one that did the assembly step and then grepped for jump instructions, but that has problems of its own.)
– ericbarrett
And then I realized how much I’m trusting the compiler to not change. This is a tough one. If the code really does depend on constant-time execution, then we should verify that the machine code generated actually is what we want it to be. And we can’t just do it manually, either. How do we solve this and also make it part of the build pipeline, with proper CI/CD? Oof.
The three main options appear to be:
- Set aside the compiled object code to be linked in statically
- Write the assembly by hand
- Verify that the correct machine code has been generated in a test case
Options 1 and 2 would limit us to one machine architecture (probably x86). Option 3 would probably be difficult to do right.
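For what it’s worth, option 3 might look roughly like the test below (imports: os/exec, path/filepath, regexp, testing). The package layout, the function name main.unpackBranchless, and the list of jump mnemonics are all assumptions of mine, and grepping disassembly is exactly as fragile as the comment above warns:
func TestUnpackHasNoConditionalJumps(t *testing.T) {
    bin := filepath.Join(t.TempDir(), "prog")
    if out, err := exec.Command("go", "build", "-o", bin, ".").CombinedOutput(); err != nil {
        t.Fatalf("build failed: %v\n%s", err, out)
    }
    // Disassemble only the function we care about (hypothetical name).
    out, err := exec.Command("go", "tool", "objdump", "-s", "main.unpackBranchless", bin).Output()
    if err != nil {
        t.Fatalf("objdump failed: %v", err)
    }
    // Fail if any conditional jump mnemonic shows up in its disassembly.
    if regexp.MustCompile(`\bJ(EQ|NE|LT|LE|GT|GE|CS|CC|HI|LS|MI|PL)\b`).Match(out) {
        t.Errorf("conditional jumps found:\n%s", out)
    }
}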
If I figure this out, I’ll probably do another blog post on it. For now, it’s probably best to use this technique as part of a layered defense. Defense in depth, as they say.
An alternative C++ method#
Yannick Le Pennec sent me an alternative way to do this in C++:
[…] your unpack function (for an array of 8 bytes) can be implemented in terms of the BMI2 instruction PDEP like so:
#include <cstdint>
#include <cstring>
#include <immintrin.h>

bool tests[8];

auto bmi2_unpack(std::uint8_t input) {
    auto output = _pdep_u64(input, 0x101010101010101);
    std::memcpy(&tests, &output, 8);
}
This gives very tight codegen:
movabs rax, 72340172838076673
movzx  edi, dil
pdep   rdi, rdi, rax
mov    QWORD PTR tests[rip], rdi
ret
Similarly, the pack function can be implemented in terms of BMI2’s PEXT.
The intrinsic version has to be written by hand, because unfortunately neither GCC nor clang seem to be able to fold the simple/branchless implementations into a PDEP by themselves. I hope that one day optimizers will reach that stage :)
Thanks for sharing, Yannick. This is very cool! Assuming we do want to unpack into an array of booleans, this does all 8 of them in a single 64-bit CPU instruction. I had to look up how PDEP works to understand the code: it takes the low-order bits of the source and deposits them, in order, at the positions of the set bits in the mask. With the mask 0x0101010101010101, each of the 8 input bits lands in the lowest bit of its own byte, which is exactly an 8-byte array of bools.
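Here’s that same deposit written out as a plain Go loop, purely to show what this particular _pdep_u64 call computes (an illustration, not a constant-time or single-instruction replacement):
// pdepBytes mimics _pdep_u64(input, 0x0101010101010101): bit i of the
// input ends up in the lowest bit of byte i of the result.
func pdepBytes(input uint8) uint64 {
    var out uint64
    for i := uint(0); i < 8; i++ {
        out |= uint64((input>>i)&1) << (8 * i)
    }
    return out
}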
Unfortunately for us Go devs, I don’t think Go has any plans to support compiler intrinsics. Go’s assembly also has to live in separate, hand-written functions that the compiler can’t inline or reason about, and I’ve seen benchmarks where the hand-written assembly loses just because Go can’t make optimization assumptions about it.