Fixpoint

2022-06-13

Fixed-width bit-fiddling tuneups for gbw-signer

Filed under: Bitcoin, Hash functions, JWRD, Scheme, Software — Jacob Welsh @ 02:02

As part of ongoing back-burner preparations for getting into hardware development, I've been reading through my undergrad Digital Design text.(i) The material's been familiar enough to make for easy reading, yet with enough new detail coming to light to keep it interesting.(ii) A chapter on datapath components (shift registers, adders, comparators and the like) connected for me most of all with my various software implementations of arbitrary-width integer arithmetic and bitwise operations, especially as employed for the cryptographic operations in the Gales Bitcoin Wallet signer. So my hunger turned to a long-awaited (by me at least) reread and rework of bit-ops.scm.

There were two known problems - or inconveniences at the very least - that I had in mind to make progress on through this focus. The first was dreadful performance : on the order of 10 seconds for address generation and transaction signing, and that's on x86_64, a fast but notoriously anti-secure architecture that we may need to move away from, and sooner rather than later. I still consider it a feature that so much of the stack is implemented in high-level language ; thus the general approach is to keep most of it there and pick out simple but critical components to be lowered to higher-cost, higher-performance implementation levels, be it C, assembly language, or even hardware.(iii) But even before that, the first step is to tighten up the code just within the high-level environment, getting rid of algorithmic inefficiency or other waste so that the lowering may proceed from a good starting point, if it's even still necessary. There's also the option of further optimizing the language implementation itself, which would have wider benefits (though not necessarily larger ones for this instance).

The second known problem was the susceptibility of gbw-signer to side-channel attacks, that is, reconstruction of secret bits through analysis of timing or of the machine's power consumption, electromagnetic or (ultra)sonic emissions during private key operations. This was due to having built it on my earlier and more general bignum.scm, which supports integers of practically unlimited magnitude and thus necessarily variable storage size and timings, under the notion that this was acceptable because proper airgapping was to be mandatory anyway. Meanwhile from late 2017 to 2018 we discovered (at our own expense, though with a good bit of fun mixed in with the disappointments, it must be said) that design and construction of portable yet reliable Faraday cages was farther outside the reach of our expertise and connections than anticipated.(iv)

On the performance front, there was indeed a substantial inefficiency baked into bit-ops.scm at the structural level. Thus the idea was to tackle that first, while the code was still quite compact (being used only for hash functions in which the most complex inter-bit operation was mere addition), then work on expanding it to cover the full contingent of constant-time arithmetic needed to support asymmetric cryptography and specifically ecdsa.scm.

That structural inefficiency wasn't an accident, and removing it does come with a cost, because after all there are always costs. It was an extension of the Lispy design principle of dynamic typing, where data structures are tagged with their own metadata. This allows for concise yet generic and robust code (one procedure handling objects of different types and sizes by looking up the tag, without callers needing to specify) and strong detection of programming errors, at least as far as passing operands of incorrect type or size. A drawback of the Lisp environment(v) is that very little of this checking can be done ahead of time, a property of the language inherited by code written in it. That means there's ongoing runtime overhead, and errors are caught later than one might like : still before causing memory corruption (as would happen in a low-level language) yet potentially after deployment, when unexpected interruption of service can sometimes be just as damaging. Such risk can of course be mitigated by more thorough testing.

In this case, each fixed-width integer was encoded as a list of fixnum "words", essentially digits in a place-value system with large power-of-two base. This list was then wrapped with an outer tag to indicate the bit width, as this might not match the maximum implied by the length of the list - especially since the fixnum type can have non-power-of-two width, some bits being reserved for its tag. Thus, client code could call a single procedure such as ufz+ (for unsigned fixed-width integer add) for "FZ" objects of any width. The error of adding, say, a 32-bit to a 256-bit FZ would be flagged as such, with a more helpful diagnostic than might come from the underlying system for running off the end of a list ; while the error of adding a 32-bit to a 33-bit FZ would also be flagged, which might otherwise have been missed entirely.
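To make the old scheme concrete, here is a toy sketch of such a tagged encoding and its per-call width check. All names are hypothetical and the word size is shrunk to 8 bits for readability ; this is not the actual bit-ops.scm code, and it assumes the width is a multiple of the word size.

```scheme
;; Hypothetical sketch of the old tagged encoding, not the actual
;; bit-ops.scm code. An FZ is a width tag consed onto a little-endian
;; list of word-sized "digits". Toy 8-bit words for illustration.
(define word-bits 8)
(define word-base (expt 2 word-bits))

(define (make-fz width words) (cons width words))
(define (fz-width fz) (car fz))
(define (fz-words fz) (cdr fz))

;; Generic add : one procedure for FZ objects of any width, but it must
;; re-check and re-derive everything from the tag on every single call.
(define (ufz+ a b)
  (if (not (= (fz-width a) (fz-width b)))
      (error "ufz+: width mismatch" (fz-width a) (fz-width b))
      (make-fz (fz-width a)
               (let loop ((xs (fz-words a)) (ys (fz-words b)) (carry 0))
                 (if (null? xs)
                     '()
                     (let ((s (+ (car xs) (car ys) carry)))
                       (cons (modulo s word-base)
                             (loop (cdr xs) (cdr ys)
                                   (quotient s word-base)))))))))
```

The final carry out of the top word is simply dropped, giving modular (wrap-around) addition as a hash function wants.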

However, though it's been a while, I don't recall making any such coarse blunders in practice, while the overhead of the safety net turned out to be considerable, and not just because of the tagging and checking steps themselves. The inability of the FZ operator procedures to know their bit width in advance led to extensive redundant computations, perhaps the most egregious being a quotient computed just about every time a new FZ was constructed.

Thus, the main shift is to restructure those operators by wrapping them up into a family (or "class" if you will) to be instantiated separately for each required bit width, allowing any fixed parameters to be derived in advance, and further to drop the width tagging and checking. The operator implementations end up getting even simpler, while the client code for the most part only expands by having to stash the set-width operator instances up front for repeat use. Functional programming shines here, with all this happening naturally e.g. through curried arguments and message lookups without the need for any "object-oriented" language features per se.

In the course of the readthrough I found an embarrassingly backwards comment documenting the FZ representation. The padding it mentioned was (and remains) in fact in the high bits of the high word. Going through the code to verify that everything in fact worked that way made for a good exercise and prompted the addition of various inline comments to list function preconditions or loop invariants.

With the first pass done, it remained to update the client code in hashes.scm. This was not a simple search-and-replace (though it did involve some of those) and required some closer inspection to ensure the right operators ended up in the right places.

Once the dust settled and everything was again hooked up properly, I was delighted to find a roughly 400% speedup of the two hash functions (RIPEMD160 and SHA256), which translated to a 20% speedup of the overall Bitcoin address generation process (most of its time is spent in elliptic curve operations rather than hashing). Encouraged by the result, I did a quick second pass to collect some further low-hanging fruit identified previously in comments, bringing the total hashing speedup to 500%.

The new implementation passes both the gbw-signer tests and some earlier unpublished test code which targets the hash library specifically. This uses randomly generated binary strings of every byte length from zero to 128 and compares results between hashes.scm and the corresponding hashing commands in the "openssl" binary.

A pain point in presenting these changes in patch format was indentation, since all the operators get wrapped in an extra level. My approach was to preserve indentation as much as possible while content was changing, then add a final patch for an automated bulk reindent.(vi)

The work is slated for inclusion in the next fetch-bitcoind release, though it doesn't seem significant enough to warrant a new release just by itself so it's presented as patches only for now.

Patch Seals Tree
gbw-signer_static_bit_ops_1.vpatch jfw Browse
gbw-signer_static_bit_ops_2.vpatch jfw Browse
gbw-signer_static_bit_ops_reindent.vpatch jfw Browse

The logical next step is the conversion of the rest of the crypto to constant-time operations. My guess is that this will initially drag overall performance down to worse than it was before, but it would come with a substantial security benefit, and with further avenues for optimization should feedback from practical use warrant it.
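As one small illustration of what that conversion entails, consider the standard technique of selecting between two values by masking rather than branching on a secret bit, so the same instructions execute regardless of the bit's value. The sketch below is hypothetical (not taken from gbw-signer) and assumes SRFI 151-style bitwise operators ; and of course, code at the Scheme level is only truly constant-time once the underlying operations have been lowered to a level where that property can be guaranteed.

```scheme
;; Constant-time conditional select, sketched hypothetically ; assumes
;; SRFI 151-style bitwise operators. Given a secret bit (0 or 1),
;; pick a or b without branching on the bit.
(define word-bits 8)
(define word-mask (- (expt 2 word-bits) 1))

(define (ct-select bit a b)
  ;; mask is all-ones when bit is 1, all-zeros when bit is 0
  (let ((mask (bitwise-and (- bit) word-mask)))
    (bitwise-ior (bitwise-and mask a)
                 (bitwise-and (bitwise-xor mask word-mask) b))))
```

Built up word by word, such selects replace the data-dependent branches and loop bounds that leak secret bits through timing.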

  i. Frank Vahid, 2007, Wiley. "That sounds very rigorous", one observer noted unenviously ; yet he's the one who does a hundred pushups a day or some such. Psychology's a funny thing.
  ii. For instance, did you know that Buridan's Ass, the philosophical curiosity, is actually a fundamental problem in electronics ? There's no method known or perhaps even possible to correctly build sequential circuits - in the sense that flip-flop input timing requirements are satisfied and every stored bit resolves cleanly to either a zero or a one in bounded time - when dealing with asynchronous inputs from the outside world, which is to say, in any nontrivial application. The best that can be done is to get the probability of such potentially catastrophic metastable states down to acceptably tiny levels.
  iii. An early example of this was extending the interpreter with a set of "fixnum" operations, more closely modeling the underlying machine arithmetic with all its overflow headaches, rather than the idealized mathematical integers discussed by the Scheme standard (which must nonetheless overflow in space and time). Next there was the addition of a double-width multiplication operator mapping to platform-specific assembly.
  iv. Even with constant-time code it's still something worth considering, because whatever input device you use to supply a decryption password will most likely still be leaking in some of the noted ways.
  v. As contrasted with language families that developed static yet flexible type systems, like ML (SML, OCaml), Miranda (Haskell), and Ada.
  vi. This perhaps requires the discerning reader to replicate it in his own Lisp-aware editor, though arguably that was already necessary : since Lisp relies on parentheses for parse tree structure, and on a set of tidy indentation rules to spare the reader from counting parentheses, misleading indentation could make for a subtle avenue of sabotage.


Powered by MP-WP. Copyright Jacob Welsh.