A tetrad of tuneups for bitcoind

2021-11-13

A tetrad of tuneups for bitcoind

Filed under: Bitcoin, JWRD, Software — Jacob Welsh @ 15:54

I'm releasing a series of four changes that alleviate basic usability and functionality pain points that came up repeatedly in classroom and production use of the Bitcoin reference implementation. As far as I can tell, this establishes JWRD Computing as not just the most active but the sole current maintainer of the rough yet uncorrupted codebase inherited from Satoshi Nakamoto. Not that this was any great feat, or something we were aiming for, but if it's to be done then someone's got to do it.

The changes are presented in the traditional Bitcoin Foundation Vpatch format, because until such time as I may reach a higher VaMP-y plane I still find it the best available option for publishing and discussing code with due consideration given to meaning, source and context.

`bitcoin_help_tuneups`

The first patch pertains to the self-documentation of the system. A wallet encryption feature was bolted on at a surprisingly late stage in the project's youth, and was left disabled by default. It had the perverse behavior of hiding the presence of and disabling the help messages for commands pertaining to encrypted wallet operations while in its initial unencrypted state. Even more confusingly, for some commands the help would change based on the new requirements once encrypted, and the encryptwallet command would disappear once it was run. Of course it wouldn't make sense to use those commands outside of their respective domain, but that's no reason to make them harder to discover and learn about or recall; and note that the help-hiding was implemented deliberately, as in by adding extra code, not as a natural side effect of the commands' necessary enforcement of their preconditions. So I've simply removed the offending checks and adapted the help messages to serve equally well in all states and make the encryption-dependent differences explicit.

The other help tuneups include accepting -help for option style consistency,⁽ⁱ⁾ expanding on the warning displayed in the so-called Safe Mode (recall Satoshi was a Windows guy) so that the hapless recipient stands some chance of figuring out what the hell is happening, removing the dubious-at-best notion of "needing to upgrade", and documenting the otherwise secret -disablesafemode option. That last one is important because one possible cause of Safe Mode is a deep blockchain fork; perhaps you indeed don't want to risk sending transactions under such conditions, but then again it may be the time when you most need your machine to do as it's told!

Besides that there's some minor formatting cleanups, dead code removal and centralization of option parsing logic that I couldn't resist, from the vicinity of the changes I was making otherwise. Actually this is a recurring theme when I work in this codebase: in order to get anything done I either need to do it in the same stupid way as the existing examples, or else do it the simple and obvious way but create inconsistency, or else get sucked into fixing ALL of it. In general I think the middle option there (sacrificing consistency) is the way to go because it helps clarify what the "right" way even is, paving the way for eventual cleanups to fix the rest if it even remains relevant.⁽ⁱⁱ⁾ Just for some ready instances of patterns I've declined to follow: the GUI code was incompletely removed so there are calls to wxMessageBox that's now just a printf wrapper; there's plentiful use of the _("message") translation hook; and there's the CRITICAL_BLOCK() macro with its twisted implementation.

`bitcoin_permissive_mode`

The second patch adds a new boolean option -permissive, off by default. Its current use is to disable the "ban hammer" introduced in the malleus_mikehearnificarum patch, allowing the node to maintain connections with "heathen" peers. The hypothesis is that at this point the "honest nodes" as Satoshi put it are vastly outnumbered by the monkey nodes, and that ignoring this situation only serves to isolate one's node, contributing to a sometimes unbearably slow initial sync process. The latter at least is unproven though and the situation could change, so it seems most appropriate to me at this stage to leave the matter to the operator's discretion.

Following a sort of conservation law, the same patch removes the -caneat option such that the eatblock command is always available. If you don't want to use the command then simply... don't! There's no cause for an excess of caution because locally-fed blocks receive the exact same processing as those received from the network.

These first two patches have been sitting on my desk for a while now and I've been running my local node in permissive mode with no trouble, including syncing up after a couple weeks offline.⁽ⁱⁱⁱ⁾

`bitcoin_boost_prune_built_libs`

The third patch came more recently when I was craving a break from the MySQL writing and returns attention to the build system with a high-leverage change (as in, tiny change with big result). It was driven by a desire to get the Big Three dependencies - Boost, BerkeleyDB and OpenSSL - cut down to size and properly integrated into the V-tree so that we can finally have a referentially complete system, at least down to the libc/libstdc++ level, on which basis to begin to comprehend what the existing code actually does. There's a ways to go on that, but I started chipping by taking the biggest of the three, which would be Boost, and cracking open the documentation. It's structured as a loose collection of about a hundred C++ libraries for diverse purposes; most are "header-only libraries" - a very C++ sort of derangement, no doubt contributing to the slow build time, large binary and massive spew of redundant compiler warnings when building bitcoind - such that only 12 are separately compiled, as in, compiled up-front into ".a" libraries by the bjam machinery. It further turns out that of that dozen, only four are actually used by bitcoind! (Plus possibly one indirectly: at least part of boost_system is pulled in by boost_filesystem.) And yes, bjam readily provides a way to specify which parts to build. Who knew?

A two to threefold reduction in any resource demand is usually a valuable find; it only takes two or three of these to reach the order-of-magnitude sort of result that makes a palpable difference even to the casual observer. So how does this one work out in practice? Here are my build timing measurements, all from the same machine and with a make clean between each run:

Test	Before	After	Speedup
Boost only, serial (`make -C build boost.stamp`)	8m21.6s	1m54.9s	4.37x
Overall build, parallel (`make -j3`)	11m0.2s	4m47.9s	2.29x

Then there are the qualitative benefits. Some previous bjam overrides (NO_BZIP2, NO_ZLIB and NO_COMPRESSION) are no longer needed as they pertain only to unused libraries. Out of curiosity I tested the build on a glibc system (Gentoo, gcc 4.9.4, glibc 2.23): the unpatched version failed with some obscure error in one of the unused Boost libs (something about uintptr_t if I recall), while the pruned version built fine.^(iv) And while I haven't yet tried it on CentOS I'm fairly confident that any Python related troubles should be history.

There were some surprises on the way to this result. At first I derived the list of compilation-required libraries as the intersection of the list from the Boost docs and a grep-extracted set of #includes from the Bitcoin sources. Then I compared this against the libraries actually linked into the executables according to the Makefile, finding the additional presence of boost_unit_test_framework (for the test_bitcoind binary only) but also the absence of boost_iostreams. Headers for the latter were included from bitcoinrpc.cpp; on first inspection I couldn't find any actual code using them, but it was hard to tell due to the deliberate undermining of namespaces. So I removed the includes, one at a time, finding that it all still built successfully. Not quite convinced, I also tried to confirm identical binary outputs. I was hindered in this effort by what I traced down to a lock debugging feature, used by debug-enabled builds only but embedding file line numbers into the compiled code even for normal builds. I worked around this by leaving some comments to preserve line numbering, then indeed got identical binaries, at least after running strip to remove gcc's debug info. I'm pretty sure the cause of divergence there was having the source directory at different paths for the two builds.

`bitcoin_tx_fee_cleanup`

The fourth and final patch is the most involved, and makes a good start along a path sketched out in the log. In brief, the situation was that there were two different arbitrary constants in the code pertaining to minimum fees, which would sometimes get applied when originating or relaying transactions, depending on output values, the alignment of the planets, the precession of the equinoxes and god knows what else; this came up because the particular numbers had become excessive for current market conditions. Thus I have removed the infantile magic,^(v) adding a user-facing knob to set the minimum relay fee for accepting unconfirmed transactions into the node's local pool, and modifying the behavior of the existing fee knob for wallet-originated transactions to do what you'd have expected it to do all along.

I didn't go as far as implementing a unified "set" command for the now multiple run-time mutable settings; the main obstacle being that I don't adequately grasp how the data type conversions on both sides of the JSON serialization work (or don't work, as the case may be). Such a command would likely have to handle an argument of varying type while the current style of one command per setting allows static type signatures. I did however change the nTransactionFee variable to use thread-safe accessors since I'd worked it out for the new nMinRelayTxFee. It's not clear whether this is strictly required - quite possibly not as the actual concurrency involved is quite limited despite all the noise about threading - but at least it removes a potential hazard for future changes. It's easy to check that there's no possibility of deadlock introduced: all critical sections using the lock in question release it promptly without blocking on anything else.

Some behavioral differences worth noting for those accustomed to the old mess: there's no more special allowance for rate-limited free transactions i.e. the fee-based threshold always applies. Fees are now expressed in BTC per binary kilobyte,^(vi) based on exact transaction size, instead of BTC per weirdly-rounded-up decimal kilobyte. To be precise, it's floor(size*rate/1024) where it used to be (1 + floor(size/1000))*rate, and of course it's done with proper integer arithmetic with overflow checks. The difference is minimal for a 1 kB transaction, but for, say, a 250 kB byte transaction, the fee will come to about a quarter of your -paytxfee setting where before it would always be at least the value of the setting.

To rephrase for emphasis: with this patch, your node will pay significantly lower fees on typical sub-kilobyte transactions, which may result in confirmation delays unless you adjust your configured fee to compensate.

The block creation code had to be adjusted as well; I did the bare minimum there so it still prioritizes transactions using a strange metric based on input ages and output values rather than fee. This doesn't look too hard to fix but will take more study of the code than I cared to do on this pass, especially given that I'm not aware of any serious current efforts to use this code for mining - and I suspect there would be other things to fix before this one if there were. Please do let me know if you're interested in this.

I went through a few rounds of testing and tweaking for this patch; just when I thought it was all set to sign I realized I'd better test the invalid-input case too for the new -minrelaytxfee option, which turned up the annoyance that it did the whole multi-minute node startup process before parsing the option and reporting the error. So I moved the parsing part earlier, and tried to do likewise with the existing -paytxfee, then realized the latter one probably needs to stay put so as to override another scrap of code that loads a fee setting from the database.^(vii) These are details that I suspect would be entirely missed when reading just the patch on its own, a point perhaps worth some deeper pondering.

Download

Patch	Seals	Tree
bitcoin_help_tuneups.vpatch	jfw	View
bitcoin_permissive_mode.vpatch	jfw	View
bitcoin_boost_prune_built_libs.vpatch	jfw	View
bitcoin_tx_fee_cleanup.vpatch	jfw	View

Coming soon [update: okay, so it took two+ months but it happened] I have a script that will make it a whole lot easier to deploy this all on a fresh machine, but I expect I've given the alert reader quite enough to chew on for now!

As opposed to just -? or --help. See, you'd ask the thing for help in the expected manner and instead it would just sit there stupidly locking up your terminal while "loading the block index" and whatever other "boot time" madness. [^]
I'm speaking here of content, not whitespace style; do not freaking mix tabs and spaces omg! Not within a line and not between lines. Sure, all-tabs indentation might be preferable but we haven't bit the "full reformat" bullet so let it stay all-spaces for now. In vimrc terms this means:
```
autocmd FileType c,cpp setlocal tabstop=4
autocmd FileType c,cpp setlocal shiftwidth=4
autocmd FileType c,cpp setlocal expandtab
```
[^]
In September I picked up shop and made the road trip from Maine to Florida, then was stuck briefly at this backwater Hilton that figured the split trash and recycles bin as an important hotel room feature but the Ethernet port as optional. At least I'm in good company with this sort of trouble. [^]
I then further tried it with a -static link flag; due to glibc's mandatory use of dynamic plugins for the DNS resolver, this revealed that DNS has not been entirely removed from the binary. This applies even on musl. It's pulled in by a broad sockets layer in the boost/asio.hpp header-lib. [^]
Back in my days of podcast consumption - which I suppose would at least constitute a juvenile rather than infantile phase - I once heard of a theory in literary criticism that held that Magic was an allegory or code for Madness. There were some plausible examples from pop culture; I don't know that I entirely bought it, as a universal thing, but the two do have a lot in common and the equivalence certainly works when it comes to computer code. [^]
Because requiring long division when you can just as well use bit shifts is just barbaric, that's why. [^]
That GUI-driven settings-in-database thing should probably just be removed, seeing as there's no user interface for it anymore nor documentation. [^]

7 Comments »

As far as I can tell, this establishes JWRD Computing as not just the most active but the sole current maintainer of the rough yet uncorrupted codebase inherited from Satoshi Nakamoto.

There's still time until 2028 and surely even after that, what's the hurry?!

Actually this is a recurring theme when I work in this codebase: in order to get anything done I either need to do it in the same stupid way as the existing examples, or else do it the simple and obvious way but create inconsistency, or else get sucked into fixing ALL of it.

Fwiw, in my experience this is a recurring theme whenever working with legacy/unmaintained or even simply large-and-old-enough codebases, really. At any rate, the "inconsistency" of having something clean while there still is otherwise around a lot of dirty stuff is simply the first step of any cleaning process in practice (as opposed to in theory where everything is always fully consistent and all cleaning of everything can be and is also best done fully in one single step and so on and so forth).

and that ignoring this situation only serves to isolate one's node, contributing to a sometimes unbearably slow initial sync process.

Not that I see any problem at all with actually letting the user decide (oh, how unclean of the user to want to ever run some code that might talk to "heathens" at all, I know) but how do you jump to the conclusion that it's this a significant factor that slows down the initial sync process?

It further turns out that of that dozen, only four are actually used by bitcoind! (Plus possibly one indirectly: at least part of boost_system is pulled in by boost_filesystem.) And yes, bjam readily provides a way to specify which parts to build. Who knew?

Ha, well done!

These are details that I suspect would be entirely missed when reading just the patch on its own, a point perhaps worth some deeper pondering.

This is an important point I would say and it actually does feed into one of the reasons why I'm still pondering the next step with VaMP - I start suspecting that the whole bare-bones approach of "here, read this diff and bring from home ALL its context" is not such a literate approach to coding as it might like to brand itself (even if it arguably was, back in the 1980's, perhaps). Sure, it can be argued along the lines of "there should have been then a comment in the code pointing out this sort of context-based detail." I can even agree that such a comment would certainly be a useful addition to the patch itself but I'm not all that convinced that it really is much of a solution to what seems to me a deeper problem.

Comment by Diana Coman — 2021-11-13 @ 21:02
There's still time until 2028 and surely even after that, what's the hurry?!

And that's just the one we *know* about from MP's grep-driven survey. But no worries, somebody will have fixed boost by then!

At any rate, the "inconsistency" of having something clean while there still is otherwise around a lot of dirty stuff is simply the first step of any cleaning process in practice

Indeed, it's more akin to mopping half of a long-neglected floor than to mowing half of a lawn. "But it makes the other half look worse by comparison" - good!

how do you jump to the conclusion that it's this a significant factor that slows down the initial sync process?

To be clear, I don't put it as firmly as a conclusion at this stage; but the question stands as to why I even suspected it, "wovon man nicht sprechen kann, darueber muss man schweigen" and all. I dug into my logs and found there was more history to it than I first recalled here:

1. In July-August 2019, I was dependent on a PRB (0.11) node on my main, Gentoo machine which I had allowed to run critically low on disk space. Rather than mess with HDD upgrades I was trying to solve both problems at once by getting TRB running on a Gales machine with new SSD. In order to get it synced as fast as possible I wanted to feed it by local connection to the existing node, meaning it would need to not ban said node.

2. At the same time, Robinson already had his coin on a TRB node and was trying to fund a deedbot wallet for Pizarro hosting. He consulted some websites to pick an economical transaction fee and sent, but it got stuck. He guessed that maybe his TRB-only peers had a higher minimum relay threshold set (though now we know this wasn't how it worked as there was no threshold setting per se). One thing we tried was rebuilding without the ban-hammer to get it relayed to prb-net, as an easier first step before implementing the missing "getrawtransaction" so as to try pushing directly to block explorers.

3. In September 2019 my new TRB node finished syncing; just after it reached the tip it stopped getting anything but bastard/orphan blocks. I tried lifting the ban-hammer, I suppose because it was in my toolbox and fresh in mind, and it synced up straightaway. I also noticed I seemed to be feeding a number of other nodes, in that I'd log 2-5 "received getdata for: block ..." after each incoming block; this despite not having any unusually wide or fast connectivity otherwise. A few weeks later I restarted the node for unrelated reasons, going back to a fully-strict build, which over the next days would again get hung up receiving no usable blocks. Again I restarted with the relaxed version and again it got synced.

4. In February 2020, mod6 reported a number of nodes stuck at block 618406. I tried restarting with the relaxed version, which again worked, and shared my notes; also noted in #o. Since apparently lobbes didn't manage to crank the link archiver I've now fished out the paste ref (and begin to wonder why I even used pastes instead of just putting text files in my own web space in the first place).

5. In February 2021 I noted my node was persistently lagging again, though at least getting blocks in small bursts, hourly or so. Again I tried relaxing it, with a proper vpatch this time (though just removing the ban, not adding an option as seen here); at first it didn't help, but my "addnodes" list had enough TRB peers to fill up its 8 connection slots. Upon pruning this list it again got moving, and appeared to be feeding 4 other nodes. In May 2021 Robinson reported running for months without a hitch using the same patch.

As far as contrary evidence, I do seem to recall intervals of block starvation even with a relaxed node, but haven't found mention of it on this dig.

I can even agree that such a comment would certainly be a useful addition to the patch itself but I'm not all that convinced that it really is much of a solution to what seems to me a deeper problem.

That's a fair point about moar comments. Not sure it's quite worth the bother to regrind at this point just for this; perhaps if any other issues come up with the patch. Otherwise I can bear it in mind for next time.

But it's at best a closer-to-the-code variant of commenting on it here; in any case it requires the author to notice and call it out, which does nothing for the patch reader whose proper aim I'd say is to mentally reconstruct the meaning of the change without taking anyone's word for it. I haven't managed to conceive of any way around "bringing from home all the context" though. You could perhaps - at review time if not at signing time - drop the magic number of three lines and have diff show the whole changed file as context.

Some MP refs that came to mind on the question - just to capture them, with no particular conclusion - were on the fundamental function of talmudic commentary in maintaining this kind of codebase and his search for a universal comment format.

Comment by Jacob Welsh — 2021-11-16 @ 06:50
From my last link - ugh, did the log conversion somehow manage to swallow semicolons (in addition to less-thans not html-escaped which I think I noticed somewhere recently) ?! The correct line #2576215 is "mp_en_viaje: maybe ;; is the right choice ? ..." and likewise #2576234 is "she declines; ..."

Comment by Jacob Welsh — 2021-11-16 @ 06:57
There's http://trilema.com/2019/forum-logs-for-21-feb-2014/ for an html-injection example (in my browser the last third of the log and everything following get swallowed). So far I'm seeing it in the pre-2016 ones only.

Comment by Jacob Welsh — 2021-11-28 @ 20:59
bitcoin_tx_fee_cleanup leaves a no-longer-used variable declaration in play,
```
double dPriority = -(*mapPriority.begin()).first;
```
This would have been easily caught except that it gets buried under the mountain of other unresolved compiler warnings. Doesn't seem worth the bother to regrind just for this; there's various other unused bits lying around that could be garbage-collected in one go at a later point.

Comment by Jacob Welsh — 2021-12-13 @ 18:50
Edited to add the big red warning about low fees.

Comment by Jacob Welsh — 2022-01-18 @ 19:39
Edited footnote ii to change "set" to "setlocal", which matters if you're switching between files of different formats within the same Vim session.

Comment by Jacob Welsh — 2022-05-10 @ 18:27