Fixpoint

2022-03-01

#jwrd Logs for Mar 2022

Filed under: #jwrd logs, Logs — Jacob Welsh @ 12:02
Day changed to 2022-03-01
[12:02] caai: jfw: Guten Morgen, yes, you are correct..... Dorion slipped into the coversation unbeknonwnst to me, expect for the small fact that his nick was at the beginning of the line
[12:07] caai: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Feb-2022/#3432 - Alright. I will await your next instruction
[12:07] sourcerer: 2022-02-28 15:44:24 (#jwrd) jfw: I'm most likely going to have to dig into the code; but this at least gives me some specific questions to guide the search.
[12:08] caai: The node has been stopped
[12:11] caai: rm debug-snip* has been run
[12:13] caai: block files 720919-720922 have not been touched
[12:37] caai: As root, 'date MMDDhhmmCCYY' was run, then 'hwclock -w' | I am not sure how the clock offset could be relevant, either, however, it is nice to have the correct time on the thinkpad
Day changed to 2022-03-07
[14:15] caai: dorion/jfw: My suggestion for the March schedule for continued traning: Operator: March 23 - 23:00 UTC | Management March 30 - 23:00 UTC
[21:35] jfw: caai: works for me; noted.
[21:35] jfw checking in from the tranquil canals of Fort Lauderdale, Florida
[22:04] jfw: caai: on the "invalid chain" issue, I've done an initial call trace (seeing what code calls what and under what conditions in order to produce the observed effect), it's been educative at the very least in that it's cleared up a misconception of mine that led to the "how is this even possible ?!" ; I'm still not quite satisfied as to how it happened in the first place though, or how the program
[22:04] jfw: handles the situation.
[22:06] jfw: that is, my curiosity isn't satisfied as to how it happened, and my sensibilities aren't satisfied that the program is adequately handling the situation, even if it can be blamed on a hardware error.
[22:14] jfw: the misconception was that bitcoind validates a block and then accepts it into its permanent storage, or at least does the two tasks indivisibly e.g. using a database transaction. the truth is messier, it does use a transaction but it only does cursory validation (e.g. proof-of-work) prior to accepting it as far as disk and memory block index; then it connects the transactions in the block and if
[22:14] jfw: that fails then the block gets left in some 'invalid' state, where it remains unexamined by 'checkblocks' but still present, in that the node won't attempt to re-validate if it sees it anew. "Or something like that."
Day changed to 2022-03-08
[14:00] caai: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3445 perfect. Quedamos asi.
[14:00] sourcerer: 2022-03-07 14:15:17 (#jwrd) caai: dorion/jfw: My suggestion for the March schedule for continued traning: Operator: March 23 - 23:00 UTC | Management March 30 - 23:00 UTC
[14:10] caai: re: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3452 Interesting.
[14:10] sourcerer: 2022-03-07 22:14:09 (#jwrd) jfw: that fails then the block gets left in some 'invalid' state, where it remains unexamined by 'checkblocks' but still present, in that the node won't attempt to re-validate if it sees it anew. "Or something like that."
[22:51] caai: and what do you consider the next best step to be?
Day changed to 2022-03-09
[03:42] jfw: caai: assuming it's still not all that urgent for you to get a node back up and running, I plan to continue study & documentation of the relevant code for a limited amount of further time, then consider whether a change is warranted (or feasible) to improve its resiliency. If I don't go for that, your best option will be to restore from a backup if available or else restart the sync from zero (in
[03:42] jfw: either case preserving only wallet.dat and bitcoin.conf)
[14:31] caai: jfw: Sounds good. I would say that it is not urgent, but very important, obviously. I propose that we set March 23rd as the deadline to have studied, documented and determined whether it is feasible to improve its resiliency or restart the sync from zero. Bearing in mind that the tasks will be executed by you and you are a busy man, please let me know if that deadline is to plausible.
[14:34] caai: is plausible*
[14:34] caai: dorion: want to schedule a 45-min lunch this week or next?
[16:26] jfw: caai: march 23 sounds good for the cutoff.
[18:46] dorion: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3445 -- works for me.
[18:46] sourcerer: 2022-03-07 14:15:17 (#jwrd) caai: dorion/jfw: My suggestion for the March schedule for continued traning: Operator: March 23 - 23:00 UTC | Management March 30 - 23:00 UTC
[18:46] dorion: caai, I'm booked tomorrow and friday. could do lunch next tuesday.
Day changed to 2022-03-10
[02:55] caai: dorion: next tuesday works for me. athens or elsewhere?
[03:11] jfw: caai: taking it to a private message window may be best for such details (do "/msg dorion"), at least if you can stay connected to await reply
[03:14] jfw: but I'll cast my idle & absentee vote for Carthage
[13:52] caai: jfw: alright. I will send private messages in that manner next time
Day changed to 2022-03-18
[21:07] jfw: caai: I've strengthened my grasp quite a bit on the bitcoin block acceptance code; I would still like to conclusively determine what caused your node to initially "wedge" after block 720920, and the next step I'm seeing there is to collect further log snippets.
[21:09] jfw: can you do: "grep -FC 10 REORGANIZE ~/.bitcoin/debug.log > reorgs.log"
[21:10] jfw: the node doesn't need to be running but this may take a while to complete due to the large size of the debug log.
[21:11] jfw: then, as even the output may be lengthy and we really only care about the latest, do a "tail -n 1000 reorgs.log > reorgs-tail.log", then paste reorgs-tail.log via deedbot.
Day changed to 2022-03-20
[03:21] jfw: A question/comment from me on the ever exciting topic of real estate
[16:36] caai: jfw: as per your instructions, please note the reorgs-tail.log: http://paste.deedbot.org/?id=8pxH
[21:28] jfw: caai: we are making progress, if slowly; the "CheckBlock() : hashMerkleRoot mismatch" errors are the relevant part.
[21:30] jfw: this means the hash stored in the block header doesn't match the transaction data, as loaded from disk; I think we can further assume the hash in the header is correct, because that's included in the proof-of-work check, which would have triggered a different error had it failed.
[22:27] jfw: caai: what I'm thinking, to go for the kill here, is to write a simple RPC (via a new vpatch) that will allow us to extract the faulty block on disk and see exactly what's wrong with it by comparing to the good one. if this confirms the corruption theory, I think we'd even be able to surgically correct it in place, because as a non-accepted block, its transaction data hasn't yet been recorded
[22:27] jfw: elsewhere in the database.
Day changed to 2022-03-21
[01:33] jfw: I've written a bitcoin_getblockindex_etc.vpatch to provide visibility into the block index data structure, which includes the disk coordinates we need among other things; testing now.
[01:36] dorion: jfw, nice. sounds good.
[13:12] caai: jfw: excellent. I am glad to see that we are making progress. please let me know how the testing goes.
[14:40] jfw: caai: testing went smoothly; I started on a writeup last night but that still seems to be the slowest part of the process for me (and it covers the wider situation rather than just the patch which itself is relatively simple). I'll go ahead and upload the patch so you can proceed.
[14:50] jfw: caai: http://fixpoint.welshcomputing.com/v/bitcoin/bitcoin_getblockindex_etc.vpatch , http://fixpoint.welshcomputing.com/v/bitcoin/bitcoin_getblockindex_etc.vpatch.jfw.sig
[14:51] jfw: you know the drill to import these and get it rebuilt?
[14:57] jfw: caai: once built, installed & started, the goal is to paste the result of "bitcoind getblockindex 00000000000000000008eceb0d016d2fcdc91048c23fd7913a16b14b39986384" (the hash of the block it's choking on).
Day changed to 2022-03-22
[04:15] jfw: kaboom! http://fixpoint.welshcomputing.com/2022/mapping-the-bitcoind-block-acceptance-code-with-bonus-patch-for-index-access/
[15:42] caai: jfw: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3464 -- today I reviewd my notes to ensure that I am importing these patches and rebuilding correctly. This should be done by the end of my next work session.
[15:42] sourcerer: 2022-03-09 18:46:15 (#jwrd) dorion: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3445 -- works for me.
[15:45] caai: The correct message that I responded to: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-March-2022/#3486
[15:45] sourcerer: 2022-03-21 14:51:08 (#jwrd) jfw: you know the drill to import these and get it rebuilt?
[15:48] caai: jfw: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3488 -- nice!
[15:48] sourcerer: 2022-03-22 04:15:57 (#jwrd) jfw: kaboom! http://fixpoint.welshcomputing.com/2022/mapping-the-bitcoind-block-acceptance-code-with-bonus-patch-for-index-access/
[16:10] jfw: caai: sounds good.
[20:53] dorion: jfw, caai, fyi I applied the patch and ran the command, my result is here : http://dorion-mode.com/wp-content/uploads/2022/03/blockindex-720921
[21:23] jfw: dorion: looks right; the file and blockpos fields will vary from one node to the next.
[21:24] jfw: caai: I'll go ahead and preload the next step for once you've got that...
[21:26] jfw: we want to extract the raw block data from the collection; this is stored in the ~/.bitcoin/blkNNNN.dat files, while the index data tells us where exactly.
[21:27] jfw: thus the NNNN there is given by the "file" field, zero-padded; so in dorion's example it would be blk0151.dat
[21:28] jfw: then "blockpos" gives the offset within the file. to extract the block we'll use the general binary copying tool "dd", as follows:
[21:30] jfw: cd ~/.bitcoin ; dd if=blkNNNN.dat of=stored-block-720921 bs=1 count=933059
[21:32] jfw: oops, missed an argument, it also needs: skip=<blockpos>
[21:33] jfw: where <blockpos> comes from the getblockindex output.
[21:35] jfw: "if" means input file, "of" - output file, "bs" - block size, the measure for the remaining numbers (1 byte), "count" - number of "bs"-blocks to copy; "skip" - number of "bs"-blocks to skip over from the start of the file.
[21:36] jfw: the 933059 is the block size which I obtained separately.
[21:36] jfw: the size of the bitcoin block in question, that is.
[21:38] jfw: this done, paste the block by first encoding it for a plaintext channel: base64 stored-block-720921 | deedbot-paste.py
Day changed to 2022-03-24
[14:04] caai: jfw: Today, I reviewed the later portion of the video which outlines the Bitcoin node build.
[14:12] caai: I seem to have downloaded the patches correctly and moved them into their corresponding directory, however, I believe I am doing something incorrectly with step 6. Put depwards in place and verify checksums. Since my operator study time is up for today, I will continue focusing on this step tomorrow.
[17:09] jfw: caai: the depwads are an inconvenient aspect of the build for sure; but you already have them from prior builds, so it should just be matter of figuring out where they are and where they need to be moved or (better) copied to.
[17:12] jfw: I suspect the importance of being aware of the working directory hasn't quite sunk in yet; especially since you may change it in the course of investigating or experimenting, which may throw you off when following instructions that assume it's somewhere else based on a prior step
[17:14] jfw: so really, keeping track of both where you are and where the instructions are given relative to.
Day changed to 2022-03-25
[06:41] caai: jfw: the rebuild was successful. As per your clue, I copied them from the prior build.
[06:53] caai: As a side note, speaking about working directories, I noticed that I had created the following directory hierarchy during the last build: src/bitcoin/patches/bitcoin_dumpblock_no_losers | To my understanding, I should not have done the build inside the patches directory. It should have been: src/bitcoin/bitcoin_dumpblock_no_losers
[16:39] jfw: caai: indeed I would not suggest putting anything besides patches/seals within those subdirs.
[16:40] jfw: not because it necessarily wouldn't work, more just because it's confusing what goes where.
[16:43] jfw: caai: are you clear on the next steps then?
Day changed to 2022-03-26
[17:16] caai: Next, I started bitcoind in the .bitcoind directory. Then, I ran 'bitcoind getblockindex 00000000000000000008eceb0d016d2fcdc91048c23fd7913a16b14b39986384'
[17:19] caai: The resulting output is: error: {"code":-32601,"message":"Message not found"}
[17:21] caai: Correction: {"code":-32601,"message":"Method not found"}
Day changed to 2022-03-27
[17:44] jfw: caai: sounds like it's still the pre-patched version running
[17:44] jfw: did some irc message get dropped before that "Next," ?
[17:45] jfw: (it just reads like it would follow a "First...")
[17:46] jfw: "unpatched" rather, pre-application of the getblockindex patch
Day changed to 2022-03-28
[10:59] caai: When I wrote "Next", I was using it as a conjunction to connect my statement that the rebuild was successful. No text was lost.
[11:13] caai: Going back to the rebuild; after I copied the patches into their corresponding directories, I ran v.pl .wot and then v.pl flow. bitcoin_getblockindex_etc.vpatch shows at the end.
[11:16] caai: Then, I copied the depwads from the previous build into src/bitcoin/bitcoin_getblockindex/bitcoin/deps
[11:25] caai: (Before the previous step, I ran v.pl p bitcoin_getblockindex patches/bitcoin_getblockindex_etc.vpatch)
[11:28] caai: Once the files from the previous rebuild were in deps, I ran sha512sum -c Manifest.sha512. All 3 files were 'OK'
[11:34] caai: Finally, I move up one directory '..cd', and I ran 'make -j3'. There was a build, resulting in a signficant amount of output
[11:35] caai: 'cd ..'*
[14:58] jfw: caai: ok, sounds fine so far; what did you do with the resulting executable?
[15:06] jfw: or to make it a bit clearer in the interests of efficiency, what I specifically didn't see there is the part where you install it to your path such that "running bitcoind" would actually refer to the new version.
[15:07] jfw: how's the poor machine to know that you wanted to make the new one globally active rather than just compiling it for development purposes in its local directory?
Day changed to 2022-03-29
[13:06] jfw: caai: I'm around, if you're up for bitcoind work
[13:29] caai: jfw: Good morning, let this study session begin
[13:33] jfw: an' the tip o' the mornin' to ye.
[13:38] caai: Is the resulting executuble this file? 'src/bitcoin/bitcoin_getblockindex/bitcoin/build/bitcoind'?
[13:39] jfw: yes it is.
[13:43] caai: I have move it to /usr/local/bin
[13:45] jfw: good; try a "type bitcoind" for good measure to ensure that's what's being found.
[13:46] jfw: ( /local/bin/bitcoind would be equivalent on Gales. )
[13:47] caai: I see (/local/bin/bitcoind)
[13:49] jfw: good; was that indeed the missing step last time?
[13:51] caai: Alright. I started bitcoind 5 minutes ago but then killed it. However, kill -9 was unable to 'locate the process', so I 'Ctrl-C' the process. Now, I am waiting ..........
[13:52] jfw: aw.
[13:53] jfw: that could take 20 minutes +
[13:53] caai: Yes, could I just restart my machine?
[13:54] jfw: it would be equivalent to doing a kill -9; which I suppose we might as well.
[13:54] jfw: probably you got the wrong PID before.
[13:55] jfw: (part of why I push daemontools)
[13:55] jfw: gah
[13:58] jfw: caai: welcome back. I take it you went for the windows solution (reboot)?
[13:58] caai: hahaha
[14:00] caai: We will circle back to the kill -9 problem. It has happened a few times. I didn't want to bring it up because I thought it best to solve one problem at a time.
[14:01] caai: I have fired up bitcoind within the .bitcoind directory
[14:01] jfw: caai: alright
[14:04] jfw: caai: although the results show that this particular new-problem-avoidance cost us about 15 minutes, in which time we might well have solved it
[14:08] jfw: caai: now that you've rebooted do you need the getblockindex command again for pasting purposes?
[14:09] jfw: preemptively: http://fixpoint.welshcomputing.com/2022/jwrd-logs-for-Mar-2022/#3487
[14:09] sourcerer: 2022-03-21 14:57:30 (#jwrd) jfw: caai: once built, installed & started, the goal is to paste the result of "bitcoind getblockindex 00000000000000000008eceb0d016d2fcdc91048c23fd7913a16b14b39986384" (the hash of the block it's choking on).
[14:13] caai: Sorry for the delay. I was searching for what I should call the file that I am sending the output to.
[14:15] caai: http://paste.deedbot.org/?id=64Lo
[14:16] jfw: caai: whatever works for you as far as the filename goes, it's just temporary.
[14:16] jfw: paste looks good
[14:17] caai: Unfortuntely, my study time for the day has come to an end. I will continue tomorrow morning. Thanks for you help!
[14:17] jfw: caai: was the next part clear?
[14:17] sourcerer: 2022-03-22 21:26:41 (#jwrd) jfw: we want to extract the raw block data from the collection; this is stored in the ~/.bitcoin/blkNNNN.dat files, while the index data tells us where exactly.
[14:21] jfw: based on the block index data I'll expand the full extraction command: dd if=blk0151.dat of=stored-block-720921 bs=1 count=933059 skip=1889677924
[14:22] jfw: caai: you're welcome & catch you later.
Day changed to 2022-03-30
[01:59] jfw: Today I've got the pastebin service dependency cut while resolving the protocol bloat - http://welshcomputing.com/paste/
[01:59] sourcerer: 2022-02-23 00:04:38 (#jwrd) jfw: ACTION looks at his own script for the task, has trouble seeing why http requests and url-encoding should be necessary just to dump some bytes in a bin and get back an identifier
[02:01] jfw: 19 lines of bash/ksh, durable commits & no RDBMS or PHP etc.
[02:06] jfw: running on a stock (albeit with the very latest tweaks) Gales install.
[13:39] dorion: jfw, cool. I tried the exact example on the www, but didn't receive a response.
[13:40] jfw: dorion: try manually using 'nc -v' and typing into stdin, might show what's up. guessing firewlal.
[13:40] jfw: *firewall
[13:56] dorion: jfw, ayup, working now. thanks.
[14:11] jfw: caai: how goes the block-wrangling?
[15:06] caai: jfw: Good morning! I ran "cd ~/.bitcoin", then "dd =ifblk0151.dat of=stored-block-720921 bs=1 count=933059 skip=<blockpos>"
[15:10] caai: the output reads: "ksh: syntax error: `newline' unexpected"
[15:27] caai: dd if=blk0151.dat*
[15:32] jfw: sounds like you typed <blockpos> literally?
[22:51] caai: Yes, I did
[22:55] caai: I now ran "dd if=blk0151.dat of=stored-block-720921 bs=1 count=933059 skip=1889677924" | No output was givn (verbose); it appears to have completed successfully.
[22:56] jfw: caai: how do you mean "verbose" if there was no output?
[22:57] caai: I was impovising and apparently hit the wrong note
[22:58] caai: I suppose what I said doesn't make sense
[23:02] caai: improvising*
Day changed to 2022-03-31
[00:56] jfw: we got this to the point of some error when uploading to deedbot paste; rather than troubleshoot that I'd rather use my new one which is looking to be right on time.
[00:57] jfw: caai: the trouble is you'll probably hit the same firewall issue that dorion did, since you're running a similar config; but give it a try: nc welshcomputing.com 4 < stored-block-720921
[12:48] caai: jfw: Good morning
[12:50] caai: Yes, upon running the command, it appears to hit a firewall issue: "nc: we can't connect to remote host (your ip address): connection refused
[14:20] jfw: caai: ok. do you know if or where you have your router config files backed up, and specifically pf.conf?
[14:22] jfw: I'm thinking the easiest process will be to edit that on the laptop, then 'scp' the new version to the router as /etc/pf.conf, then ssh in and activate it using 'pfctl -f /etc/pf.conf'
[14:22] jfw: (it can also be edited directly on the router but only a basic 'vi' is available there, which tends to throw people a bit)
[14:24] jfw: the edit we need to make is to add TCP port 4 to the list of permitted ports for outbound connections. see if you can figure that out, otherwise I could perhaps supply a diff.
[15:22] caai: alright. I accessed the /etc/pf.conf file on the router via an ssh login. From there, I opened the pf.conf file using vim. I see a line that says "tcp_out=". Is this the line where I should add port 4?
[15:31] dorion: caai, nice going, yeah.
[15:32] dorion: do you see the pattern there ? where the variables for the various ports are defined in the lines above tcp_out ?
[15:33] dorion: caai, once you define the paste variable and add it to the tcp_out list, write and quite and run :
[15:34] dorion: pfctl -f /etc/pf.conf
[15:36] dorion: maybe you already recalled that, or maybe I should've let you look that ^^^ up from the from networking lesson.
[17:04] jfw: or from what I said a couple lines up :P
[17:10] jfw: caai: yes, you can simply add 4 to the tcp_out list, or define it as a variable as dorion suggests which has the benefit of making clearer what it's for.
[17:28] caai: alright. with vi, I added another variable "tcp_port = 4" and added it to the list" $tcp_port".
[17:29] caai: and then ran pfctl -f /etc/pf.conf
[17:33] caai: after that, I ran the command again: nc welshcomputing.com 4 < stored-block-720921 | output: http://welshcomputing.com/paste/b
[17:52] jfw: "tcp_port = 4" <- that's not quite the idea; it *is* a tcp port but so are all the others. I'd call it "paste_port" or the like.
[18:00] caai: alright. I modifed the port name to "paste_port = 4" and added it to the list "$paste_port". I then ran "pfctl -f /etc/pf.conf" again.
[18:00] caai: modified*
[18:02] caai: I then ran: nc welshcomputing.com 4 < stored-block-720921 | Here is the output: http://welshcomputing.com/paste/c
[18:06] jfw: caai: good, and either way the paste worked (even on the un-base64-encoded binary file), well done.
[18:07] jfw: and at last I can see that indeed that block is corrupted, and it's a single bit flip.
[18:09] jfw: shortly I'll work out a command to try to repair it
[18:20] jfw: from 'cmp' I find it's byte #6519 that differs within the block; but this counts from 1, so we subtract 1 to get the byte offset (because, for example, if the first byte differed, its file offset would be zero)
[18:20] jfw: to this we add the offset of the block within the larger blk0151.dat: 1889677924+6518 = 1889684442
[18:22] jfw: the erroneous byte is 0xFD while the correct one is 0xED (hex notation)
[18:27] jfw: thus, from ~/.bitcoin, with the node stopped: printf '\xED' | dd of=blk0151.dat conv=notrunc bs=1 count=1 seek=1889684442
[18:27] jfw: make sure to get that exactly right - better, make a backup copy of that block file first.
[18:28] jfw: (just a local backup, eg "cp blk0151.dat blk0151.dat.bak")
[18:29] jfw: once successful, resume the node and we'll see if it worked.
[18:30] jfw off

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by MP-WP. Copyright Jacob Welsh.