Fixpoint

2021-12-07

Crashproofing bitcoind, the easy and the not-so-easy

Filed under: Bitcoin, Hardware, JWRD, Software — Jacob Welsh @ 06:45

I've been chasing down a nasty and persistent kernel and/or hardware bug that has my system crashing with one kernel panic or another - a different manifestation every time! - that's led me all the way to such wonders as learning to break into the just-crashed kernel in its state of suspended animation by serial cable with a remote GDB.(i) And this is on one of our new boxes, the famous apu1 that was supposed to be this sane and functional AMD64 computer and perhaps even a lifeboat off those old, dusty, cranky and increasingly unobtainable Corebooted Thinkpads.(ii)

One scrap of good news, such as it is, that comes out of this is that I'm getting plenty of opportunity (born of necessity) to observe the post-crash recovery behavior of bitcoind. Indeed, having the thing chewing full-speed through a few hours of syncing is a reliable way - and thus far the only way I've found - to reproduce the kernel crash. Now, it's known to employ the Berkeley DB transaction logging facilities which in theory should provide a strong guarantee of crash consistency. I've already noted my standard method of shutting the thing down is with kill -9(iii) to bypass the time-consuming and AFAIK pointless "flushing" of the database when it's asked to shut down the polite way. And so far I've had no problems from this; but the hard system crash proved a stronger test. Here it is on the rebound:

11/27/21 01:23:12 Loading block index...
11/27/21 01:24:15 LoadBlockIndex(): hashBestChain=0000000000000000000308b01be818324df7fa4a15a1fa0af5096511c2acab05 height=710928
11/27/21 01:24:15 ERROR: CheckProofOfWork() : nBits below minimum work
11/27/21 01:24:15 ERROR: CBlock::ReadFromDisk() : errors in block header
11/27/21 01:24:15 ERROR: LoadBlockIndex() : block.ReadFromDisk failed
11/27/21 01:24:15 block index 62933ms
11/27/21 01:24:15 Loading wallet...
[snip]
11/27/21 01:24:15 Bitcoin: Error loading blkindex.dat

11/27/21 01:24:15 DBFlush(false)
11/27/21 01:24:15 addr.dat refcount=0
11/27/21 01:24:15 addr.dat flush
11/27/21 01:24:15 blkindex.dat refcount=0
11/27/21 01:24:15 blkindex.dat flush
11/27/21 01:45:14 wallet.dat refcount=0(iv)
11/27/21 01:45:14 wallet.dat flush
11/27/21 01:45:14 StopNode()
11/27/21 01:45:14 DBFlush(true)
11/27/21 01:45:15 Bitcoin exiting

OK so... now what?! Game over, sorry for your loss. Would you like to play again, starting from the genesis block of 2009? I swear, the filesystem wasn't this bad in Windows 98 and I'd bet it wasn't in MS-DOS either.

Of course I had a coherent backup of the full database ready to go, but it was several months old because it's a pain to do correctly.(v)

As a result, my node is unable to make forward progress at all; I can feed it blocks but unless I interrupt it frequently for a lengthy full backup it'll be right back to square one on the next crash. Obviously the crash has to be resolved one way or the other too, but the two problems together add up to a much worse situation than either one individually.

So whence comes this gap between database theory and practice? It's because it doesn't *just* use BDB, oh no, that would be called "making a decision" or something, all totalitarian like; no, Bitcoin is a free spirit and also writes the raw blocks on disk in its own threadbare, I-can't-believe-it's-not-a-database binary format (those blk####.dat files), then uses BDB to index the data within those. So the one is not covered by the transactional guarantees of the other.

But what if I told you there was a single system call which, if inserted in the right place, could completely solve the problem - and with it perhaps the vast majority of supposed "database corruption" suffered for all these years by Bitcoin node operators in the field ? (At least by those above the "my antivirus put it in quarantine, maybe computers should get face masks too?" level of cattle.)

And what if I further told you that this call was already there in the code, but saddled with conditions so that it only ran some of the time ?

Would you then guess that these conditions were towards the purpose of "optimization" - by which we mean "cheapening" ? You know, just like the DRAM chips, and the CPUs, and the PCI bus, and the system programming language,(vi) and the people writing the code, and the people using the machines, and the system that "educated" them, and the plastic clothes they wear and the plastic battery-powered remote-controlled cars they drive and the chemically-enhanced food they eat and the absolutely everything else ?

Well, that was my guess at any rate and as it seems to go so often, survival in the modern world comes down to figuring out which part to remove, short of burning the whole thing down. Although the latter option isn't necessarily off the table.

Here we go:

diff -uNr a/bitcoin/src/main.h b/bitcoin/src/main.h
--- a/bitcoin/src/main.h 3cd26bb2979bc7ba58683345760c51e3f7ed974df52d84ed6855ed6e294c18776da4d5f5aa2461cb2bece6e3d9388a035cd485812b29900b406d0472904ea384
+++ b/bitcoin/src/main.h 593317b93b1ca134595672039e5abb0b0dd23efa85d5a7eb170b078fde135c96413bbe8b40cc308c3796dc5a83456903592f3e966d07475e2336c3e6887a6f4f
@@ -880,10 +880,7 @@

         // Flush stdio buffers and commit to disk before returning
         fflush(fileout);
-        if (!IsInitialBlockDownload() || (nBestHeight+1) % 500 == 0)
-        {
-            fsync(fileno(fileout));
-        }
+        fsync(fileno(fileout));

         return true;
     }

(Since the context that made it into that patch is pretty meager, we can bring more from home.)
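(And for the avoidance of doubt about what that one call buys, here's the pattern in miniature: plain C against POSIX stdio, with the function name being mine rather than anything in the tree.)

```c
#include <stdio.h>
#include <unistd.h>

/* Append a record durably: the data must clear two layers of
 * buffering.  fflush() moves it from the stdio buffer into the
 * kernel's page cache; fsync() forces it from there onto the disk
 * itself.  Omit the fsync - as the old conditional did for most
 * blocks during initial sync - and the record is hostage to the
 * next crash. */
int append_record(FILE *f, const void *buf, size_t len)
{
    if (fwrite(buf, 1, len, f) != len)
        return -1;
    if (fflush(f) != 0)
        return -1;
    if (fsync(fileno(f)) != 0)
        return -1;
    return 0;
}
```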

My node is currently running with this change; so far so good and the sync rate isn't noticeably affected. Which stands to reason, because the amount of reads/seeks/updates needed to verify and connect the transactions in a block would add up to way more I/O activity than simply writing out that block in one contiguous chunk, not even considering the CPU-bound part of checking signatures. That's the premature optimization angle here, and why that fits right in to action from cause versus toward purpose: it was not known and likely could not have been known at the time the code was written how significant the performance improvement would even be - though the reliability implications certainly could have been foreseen, on the basis of ~any experience with computers.

I'm not putting this out as a final vpatch yet though because it brings up some issues in the related code that checks blocks on startup; that is, the part that emits the errors seen above in the event of corruption.(vii) One part can certainly be improved without the other; they're related more in the sense of dance partners than of the opposing limbs of either dancer; still it makes sense to me to keep them together if possible, touching as they do on the same subject and with the fsync part being so trivial. I've set down those issues in the log but to summarize:

  • The node spends a substantial amount of "bootup" time doing structural checks on the last 2500-and-change blocks, far more time than it takes to load the whole set of block headers ("block index") into memory (itself perhaps a questionable design but fundamental in the current implementation).
  • There's no reason in sight for this magic number to be so high, though I don't see much alternative to picking some number. Zero isn't enough, one might do but is playing with fire, while "just check them all" is absurd to do on every startup. I suggest 24 hours * 6 expected blocks per hour = 144 as a more reasonable choice: still quite conservative as far as the sort of damage that hits with a bias toward recent data, especially since the fsync patch should remove the only established instance of such; but substantially faster, specifically about 1/6 the time of loading the block index at present (which fraction will tend to shrink over time as the block index continues growing).
  • There is an option to force a full check when desired, which needs to get mentioned in the help.
  • There is recovery code for one but not another species of check failure; its usefulness and correctness are both unclear to me but at minimum it needs to make the "last N blocks" safety margin measure from the first *good* block it finds rather than proceeding as if the check had succeeded when in fact it's just that the limit had been reached. This change would be simple (in the sense of concise) and conservative (in the sense of not making things any worse than before).
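To make that last point concrete, here's a toy model of the corrected accounting (my own names, with booleans standing in for the per-block structural check; the real thing walks block index entries): the depth budget only starts counting at the first block that checks out, so a damaged run at the tip can't exhaust the limit and pass for success.

```cpp
#include <vector>

// Toy model: fBlockOk[h] stands in for "block at height h passes its
// structural check".  Walk back from the tip; bad blocks found before
// any good one are tip damage (the recoverable species) and do not
// consume the depth budget.  Returns the lowest height verified, or
// -1 on a bad block found beneath good data (unrecoverable).
int CheckDepthFloor(const std::vector<bool>& fBlockOk, int nCheckDepth)
{
    bool fGoodSeen = false;
    int nCounted = 0;
    for (int h = (int)fBlockOk.size() - 1; h >= 0; h--) {
        if (!fBlockOk[h]) {
            if (fGoodSeen)
                return -1;      // corruption beneath good blocks
            continue;           // tip damage: budget not yet started
        }
        fGoodSeen = true;
        if (++nCounted >= nCheckDepth)
            return h;           // nCheckDepth good blocks verified
    }
    return 0;                   // walked all the way to genesis
}
```

With a depth of 3 and two bad blocks at the tip, this still verifies three good blocks beneath the damage rather than quitting two blocks early.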

I'm presently open to questions & comments on the subject.

Update (2022-01-18): patches published

Here are the signed patches as discussed. The fsync change is as shown above. As the "checkblocks" cleanup is a bit more involved, I'll reformat its manifest entry here for convenience.(viii)

  • Let the -checkblocks option take a positive integer value to set the check depth limit or zero for no limit.
  • Drop the default from 2500 to 144.
  • Break the implementation out from a long-winded CTxDB member function whose class context was irrelevant to it.
  • Simplify its loop bounds (which revealed a corner case of genesis block possibly going unchecked), fix off-by-one on the nominal depth limit and make it count from the first *good* block found.
  • Add progress reporting and improve error messages and commentary.
  • Exit before loading other things when it fails.
  • NOTE: This breaks CLI/config file compatibility with the prior sloppy undocumented situation where any -checkblocks value at all would enable full checking.
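In miniature, the new option semantics look something like this (a standalone sketch with a hypothetical parser standing in for the codebase's actual argument machinery; the clamping of out-of-range values is my own reading):

```cpp
#include <cstdlib>

// Hypothetical stand-in for the option handling: a positive value
// sets the check depth, zero (or anything out of range) means check
// the whole chain, and the default drops from 2500 to 144.
int ResolveCheckDepth(const char* pszArg, int nBestHeight)
{
    int nCheckDepth = pszArg ? std::atoi(pszArg) : 144;
    if (nCheckDepth <= 0 || nCheckDepth > nBestHeight)
        nCheckDepth = nBestHeight;      // no limit: full check
    return nCheckDepth;
}
```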
Patch (seals):
  • bitcoin_fsync_all_blocks.vpatch (jfw)
  • bitcoin_checkblocks_cleanup.vpatch (jfw)
  1. While the novelty of this escape from the carefully maintained safe-space/jail of user-space tools made for a bit of a thrill, mind-reading powers aren't as great as one might naively imagine: you still have to figure out where to look, what questions to ask, and how to interpret; and you still don't see causes directly, only downstream effects. Kinda why the old Linus was against adding it (archived). [^]
  2. And did I mention this comes after working around this *other* nonsense where the SATA 6Gbps link repeatedly resets with complaints of either bus or device errors ?! I put libata.force=3.0Gbps,noncq in the boot flags with the thought of reducing the signaling clock speed and the volume of potentially buggy code in play, both on the Linux side and in the black box of the Samsung SSD controller on the other end of the link; no more errors. And yes, this SSD was working fine and with the same kernel version in the Thinkpad it was just transplanted from. [^]
  3. Actually, svc -kd /service/bitcoin the Gales way because I'm fancy like that and who wants to mess with PIDs? But it's the same effect. [^]
  4. See that timestamp delta? Even in death it burdens us with its spurious latency. [^]
  5. The only way I know is by shutting down the node first; you can probably keep all but the last blk####.dat from a prior backup but in any case you're waiting for the copy of the whole ~80GB blkindex.dat during this downtime. [^]
  6. You realize that Lisp, at least in its primitive form, came a couple decades *before* C, yes? [^]
  7. In fact that was the first place I started looking, in the vague hope that it could be made to recover from the corruption by rewinding and re-fetching the damaged block or some such. It can't, at least by readily available means, because a block's correct data is needed in order to unwind it (specifically to reverse its transactions). [^]
  8. The one-paragraph limitation of the TMSR-V manifest format is starting to grate. Earlier, well-used version control tools didn't have this limitation and I'm struggling to see any upside in it. [^]

9 Comments »

  1. This was a pleasure to read, I found it to have a good balance of technical details, the environment driving them and commentary on operation in practice.

    Just so I have it clear, not up for debate is flushing to file system for each block, right ?

    There's no reason in sight for this magic number to be so high, though I don't see much alternative to picking some number.

    Myeah, magic numbers are indeed braindamaged. That being said, it makes sense to at least pick a good default and a days' worth of blocks looks like a good balance between usability and security, since when was the last time there was a fork even in the double digit block count ? March 2013 ?

    How about a third option to mitigate the magic number infection : allow -checkblock to take an argument to let the operator decide how deep to check. If no argument, check from genesis. Then you have a more rational default on startup, -checkblock without arguments continues functioning as is, but then the operator has a clean way to decide for himself how deep he wants to check.

    Good find on -checkblock ; aye on adding it to the help in any case.

    at minimum it needs to make the "last N blocks" safety margin measure from the first *good* block it finds rather than proceeding as if the check had succeeded when in fact it's just that the limit had been reached. This change would be simple (in the sense of concise) and conservative (in the sense of not making things any worse than before).

    Sounds like a smart addition to me.

    I think I covered your questions, let me know otherwise.

    Comment by Robinson Dorion — 2021-12-08 @ 01:03

  2. This was a pleasure to read, I found it to have a good balance of technical details, the environment driving them and commentary on operation in practice.

    Cheers, I enjoyed writing it too.

    Just so I have it clear, not up for debate is flushing to file system for each block, right ?

    Well, if there's a case to be made against it I'm certainly open to hearing, though I can't imagine what it would be in light of the consequences illustrated here, and I wasn't asking for one.

    One argument might be that crashproofing as a whole is a software solution to a hardware problem; except in current practice the software underpinnings are also profoundly unreliable, and there's no available hardware I know of that can save the full system state across a power outage, which is what you'd need to ditch fsync(). What does exist are RAID cards with battery-backed writeback caches; the theory is that they reduce the costs of the fsync (since the data just needs to be flushed to the card's cache rather than all the way to spinning platters or erased flash cells) but don't remove the need for it, so they should work fine as intended in conjunction with the patch. Indeed I'd say instead that the "every 500 blocks" was a (defective) software solution for a hardware performance problem.

    when was the last time there was a fork even in the double digit block count ? March 2013 ?

    Possible future events have the pesky property of not being constrained by past observations, but I don't think that's a concern here. The blocks already had to make it through full validation to get accepted by the node in the first place; what this code does is just a startup sanity self-test, akin to the BIOS POST or the routine filesystem check when a machine boots. The question is "how many blocks deep might some data have gotten corrupted without our noticing the first time for whatever reason"; my argument for keeping it something greater than one is the unknown of what those reasons might be.

    allow -checkblock to take an argument to let the operator decide how deep to check. If no argument, check from genesis.

    Special meaning for "no argument" is possibly fragile/confusing: how would you specify it in bitcoin.conf? Is "-checkblocks" the same as "-checkblocks=" or "-checkblocks= " (note the whitespace) ?

    It looks to me like the previous code idiotically took any -checkblocks option at all to mean "true"; see how it doesn't use GetBoolArg (which itself is sloppy enough but at least has some kind of model behind it). So we're kinda boxed in if we don't want to break compatibility.

    We could of course define a new option, which would include specifying its interactions with the old one in some least-surprising way; but in any case adding new optionality does come with a cost, for code review and testing as well as for the user. In this case it seems possibly more of a deflection of responsibility. What extra information does the user have that would allow him to make a better decision here than the developer? "His level of tolerance for this narrow species of unknown risks" I suppose. There's some kind of balancing act here.

    Just checked prb (0.11ish) and I see they changed it to -checkblocks=N where N=0 means all and the default is 288. Perhaps that's the way to go.

    Comment by Jacob Welsh — 2021-12-08 @ 03:08

  3. Well, if there's a case to be made against it I'm certainly open to hearing, though I can't imagine what it would be in light of the consequences illustrated here, and I wasn't asking for one.

    Alright, I don't have a case against.

    Indeed I'd say instead that the "every 500 blocks" was a (defective) software solution for a hardware performance problem.

    Yeah, I can see it.

    Possible future events have the pesky property of not being constrained by past observations, but I don't think that's a concern here.

    True.

    The question is "how many blocks deep might some data have gotten corrupted without our noticing the first time for whatever reason"; my argument for keeping it something greater than one is the unknown of what those reasons might be.

    Right, I follow that.

    Alright, I wasn't aware of the implementation details on checkblocks, thanks for clarifying. I agree that adding optionality isn't free, but only having 2 magic number options, a recent check and a check from genesis, also carries a cost and the latter will only grow as the chain lengthens. How long is checkblocks taking at present ? It seems to me that the cost of adding the optionality goes under the cost of inheriting shitcode account in the ledger.

    In this case it seems possibly more of a deflection of responsibility. What extra information does the user have that would allow him to make a better decision here than the developer? "His level of tolerance for this narrow species of unknown risks" I suppose. There's some kind of balancing act here.

    Re better decision than the developer, this quote comes to mind.

    The questions are, is boxing the operator into check recent or check from genesis a legitimate responsibility of the developer ? Or is it the responsibility of the developer to slowly chip away at removing magic numbers from the code ?

    That's how I'm seeing it at least.

    the default is 288. Perhaps that's the way to go.

    Ok.

    Comment by Robinson Dorion — 2021-12-08 @ 18:39

  4. but only having 2 magic number options, a recent check and a check from genesis, also carries a cost and the latter will only grow as the chain lengthens

    Nitpick perhaps, but checking from genesis is magic-free; and it's a valuable tool to have for when you're dubious of the whole data set rather than just recent parts, as indeed I was with this flaky machine.

    How long is checkblocks taking at present ?

    It took 7.4 hours for 683071 blocks.

    It seems to me that the cost of adding the optionality goes under the cost of inheriting shitcode account in the ledger.

    I could see it.

    Re better decision than the developer, this quote comes to mind.

    Does it merely come to mind, or do you mean something more by it? I see what's depicted there as a social failure, a rejection of the actual hierarchy with a hallucinated self-serving alt-hierarchy, quite in the "open sores" tradition where if you don't personally code then you're presumed not to have something to say on the subject.

    Possibly closer to the point here are the discussions of optionality in the MP-WP context, most prominently here. Basically, that we can't figure out a correct choice up front isn't our failing but does indicate an immature project for not having seen enough useful users yet.

    the default is 288. Perhaps that's the way to go.

    There was more packed in there than just the choice of number though - for instance the 0 meaning "check all", and the break in meaning e.g. where -checkblocks=1 previously meant "all" and now means "minimum". I don't think it's a problem tbh, especially since no prior mention in the help. Just want to confirm we're on the same page tho.

    Comment by Jacob Welsh — 2021-12-08 @ 20:50

  5. Nitpick perhaps, but checking from genesis is magic-free; and it's a valuable tool to have for when you're dubious of the whole data set rather than just recent parts, as indeed I was with this flaky machine.

    Ok, and I agree that it's valuable.

    Does it merely come to mind, or do you mean something more by it? I see what's depicted there as a social failure, a rejection of the actual hierarchy with a hallucinated self-serving alt-hierarchy, quite in the "open sores" tradition where if you don't personally code then you're presumed not to have something to say on the subject.

    Well, you did say, "What extra information does the user have that would allow him to make a better decision here than the developer?" The point is, there could be A LOT of extra information an active and technically informed operator who doesn't code for his bread has that the developer doesn't.

    Basically, that we can't figure out a correct choice up front isn't our failing but does indicate an immature project for not having seen enough useful users yet.

    So that tells me that adding the optionality is worth it.

    There was more packed in there than just the choice of number though

    Ah, right. So, they did add in the optionality ? Is it something like, checkblocks=0 checks from genesis and checkblocks=600000 checks starting 600k blocks deep ?

    Comment by Robinson Dorion — 2021-12-08 @ 22:08

  6. Is it something like, checkblocks=0 checks from genesis and checkblocks=600000 checks starting 600k blocks deep ?

    More or less. Technically they did the 'from genesis' part as (brace yourself...)

    if (nCheckDepth <= 0)
        nCheckDepth = 1000000000; // suffices until the year 19000
    

    and it works from the most recent block backwards (and still has the off-by-one I noted).

    Come to think of it, for the default I'm liking my original 144 if only to emphasize that it's our decision and they aren't the reference or basis for anything. Only reason I checked was so that we don't end up loading the same option with such different behavior that it creates pointless confusion for people switching.

    Comment by Jacob Welsh — 2021-12-09 @ 02:03

  7. Updated. Meanwhile I've replaced the flaky apu1 by falling back to what still works for now, refurbing another Thinkpad.

    Comment by Jacob Welsh — 2022-01-18 @ 19:27

  8. I applied the bitcoin_fsync_all_blocks patch by hand shortly after this was published and the node has been running smoothly since.

    For a datapoint, it took about 2.25 hours at current height to check all the blocks. I think the status update every 1k blocks is nice.

    01/18/22 20:51:57 Loading block index...
    01/18/22 20:52:22 LoadBlockIndex(): hashBestChain=0000000000000000000880d32fd1321a69593493cd2b18d654350741bfbd6a55 height=719329
    01/18/22 20:52:22 CheckDiskBlocks() at height=719329 ...
    01/18/22 20:52:40 CheckDiskBlocks() at height=718329 ...
    01/18/22 20:53:00 CheckDiskBlocks() at height=717329 ...
    01/18/22 20:53:18 CheckDiskBlocks() at height=716329 ...
    01/18/22 20:53:39 CheckDiskBlocks() at height=715329 ...
    01/18/22 20:53:59 CheckDiskBlocks() at height=714329 ...
    01/18/22 20:54:18 CheckDiskBlocks() at height=713329 ...
    01/18/22 20:54:36 CheckDiskBlocks() at height=712329 ...
    01/18/22 20:54:57 CheckDiskBlocks() at height=711329 ...
    01/18/22 20:55:17 CheckDiskBlocks() at height=710329 ...
    01/18/22 20:55:37 CheckDiskBlocks() at height=709329 ...

    ...

    01/18/22 23:09:27 CheckDiskBlocks() at height=10329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=9329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=8329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=7329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=6329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=5329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=4329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=3329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=2329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=1329 ...
    01/18/22 23:09:27 CheckDiskBlocks() at height=329 ...
    01/18/22 23:09:27 block index 8249728ms
    01/18/22 23:09:27 Loading wallet...

    Comment by Robinson Dorion — 2022-01-19 @ 22:16

  9. Cool. 7.4h / 2.3h ~~ 3.2x slowdown for apu1 vs thinkpad; about as expected.

    Comment by Jacob Welsh — 2022-01-20 @ 17:02


Powered by MP-WP. Copyright Jacob Welsh.