Fixpoint

2020-04-28

Build system overhaul for bitcoind

Filed under: Bitcoin, Historia, Software — Jacob Welsh @ 21:19

Background

The Real Bitcoin's build system for some years has consisted at the top level of a number of GNU Makefiles and a thing called "Rotor", building on an earlier "Stator". According to its 2015 introduction by Stanislav Datskovskiy, it served to compile the "bitcoind" executable deterministically and with full static linking, given a reasonable starting environment. The key to this magic was "buildroot", essentially a miniature, non-self-hosting Linux distribution designed for cross-compiling embedded systems.

The determinism came from capturing a full set of dependencies all the way down to the compiler. This came at the cost of adding considerable complexity to the process, as what was formerly a mere application took on all the potential problems involved in bootstrapping an operating system from an unpredictable environment, in addition to the already intricate build systems of the required libraries Boost, Berkeley DB (BDB) and OpenSSL. In an early sign of trouble, Michael Trinque found it wouldn't build BDB on his system without some CPU architecture specific configuration. In my own experience, I got it to work once, but when demonstrating to some friends on fairly similar Gentoo systems I'd built for them, it failed in multiple different ways. Ultimately I couldn't be bothered to track them all down, partly because of how unbearably slow it was: to try any change you would have to repeat the whole toolchain bootstrap.

The V cryptographic source code management system was introduced, with Bitcoin as its first user, shortly after Rotor; somehow the Rotor scripts and patches didn't end up in the V-tree proper, meaning that they in addition to the library and toolchain sources had to be rounded up in order to do offline (i.e. reliable) builds or study the code.

Finally, having already taken on the publication of a Linux distribution with similar static linking and deterministic bootstrapping goals, but going further in providing a self-sufficient system with native compilers, I had little desire to be stuck maintaining two different such beasts.

The vpatch

Thus I now present bitcoin_system_compiler.vpatch (with seals on the shelf). Building on my previous raw transactions patch, it:

  • Rewrites the Makefiles almost entirely. This includes the "makefile.unix" inherited from earlier developers, greatly simplifying it and eliminating historical baggage such as dynamic linking tweaks, Ubuntu bug workarounds, linking of "libssl" and way too much "sed" magic. Additions include compiler warning flags (resulting in quite a bit of warning spew, some of which might be interesting) and building "test_bitcoin" by default. Automatic header dependency analysis is preserved.
  • Removes some minor GNUisms like "tar xvfz" in order to work on BusyBox systems.
  • Brings "openssl-004-musl-termios.patch" formerly found in the external rotor sources into the tree.
  • Adds "boost-no-demangler.patch", discussed below.
  • Removes the various "src/obj" directories and moves all compiler output to the "build" directory. (For instance, this makes it easier to "grep" or "diff" the code without tripping on binary files.)
  • Avoids copying the "bitcoind" binary all around and removes the "bin" subdirectory: one place is enough.
  • Corrects the oversight that a library build failure would be ignored on a second "make" invocation because the mere extracted directories were used as the targets for dependency calculation.
  • Fixes parallel "make" by forcibly serializing the recursion into OpenSSL's ever-so-special custom build system.
  • Avoids recursing into "deps" on a top-level "make clean" so that dependency tarballs won't need to be re-downloaded. (Ultimately these need to get cleaned up and imported directly to the tree.)
  • Tweaks the BDB configuration to prevent "libtool" from attempting to build unwanted shared librarires.
  • Tweaks the Boost "bjam" invocation (the "compression" module gets built at install time if suppressed only at build time) and removes some "|| true" constructs that caused failures to be ignored.

It still supports the "make ONLINE=1" mode to download out-of-tree dependencies into the "deps" subdirectory from deedbot; these are reduced to the three essentials (Boost, BDB, OpenSSL).

In short, it makes both development and deployment much less painful, with a sane starting system as the price of admission.

The undemangling

Special mention is in order for the new boost-no-demangler.patch. Gales Linux uses an older branch of GCC that didn't receive fixes when the long-existing "stack clashing" (archived) family of attacks was stirred up in 2017, meaning some applications could end up vulnerable, especially those using the hazardous yet popular "alloca" or variable-length array features.

As a first step in investigating this, I enabled a number of warnings by default in the GCC configuration relating to excessive or dynamic stack frame size. While these warnings produce many false positives, they've done nicely to illuminate some suspicious code, such as binutils/libiberty/cp-demangle.c. Got that all read? Me neither... but so what, that "libiberty" is just an internal part of the toolchain, or so say the docs, right? And it's "well-known" that you don't want to feed untrusted input to the linker and friends. But wait: the GCC build system copies that code into libstdc++; from there it gets linked into C++ programs. This happens even if the program doesn't use the nonstandard "__cxa_demangle" extension it provides, by way of the default exception handler ("terminate called after throwing an instance of std::whatever").

So my gcc-4.7.4-demangler-amputation.patch, included in the Gales toolchain, removes __cxa_demangle along with the copying of cp-demangle.c, and simplifies the termination function to print raw symbol name, as it previously did anyway for the case of demangler failure (hah! - we learn that they knew their code doesn't always work). These names can be fed to "c++filt" to decode them manually if need be. It then turns out that Boost contains a couple uses of __cxa_demangle - none of them in components actually needed by bitcoind, harrumph - so the Boost patch simply cuts out the assumption that GNU compilers support it.

Stats

$ diffstat bitcoin_system_compiler.vpatch
 .gitignore                          |   21 ----
 Makefile                            |   21 ----
 bin/Makefile                        |   13 --
 bin/Manifest.sha512                 |    1
 build/Makefile                      |   97 +++++++++++++--------
 build/Makefile.rotor                |   56 ------------
 deps/Makefile                       |  166 +-----------------------------------
 deps/Manifest.sha512                |   17 ---
 deps/boost-no-demangler.patch       |   49 ++++++++++
 deps/openssl-004-musl-termios.patch |   46 +++++++++
 manifest                            |    1
 src/makefile.unix                   |  145 -------------------------------
 src/obj-test/.gitignore             |    2
 src/obj/.gitignore                  |    2
 src/obj/nogui/.gitignore            |    2
 src/obj/test/.gitignore             |    2
 verify.mk                           |    5 -
 17 files changed, 168 insertions(+), 478 deletions(-)

Looking only at the "make" code, 471 lines across seven files is reduced to 112 lines across three: quite the improvement I should think!

Future directions

Some work that sorely needs doing, as suggested earlier, is getting those external libraries under control, through some combination of pruning their code to just the necessary parts, replacing their build systems, and importing to the tree, or changing bitcoind code to eliminate them.

2020-04-22

New and improved raw transaction support for bitcoind

Filed under: Bitcoin, Software — Jacob Welsh @ 05:25

Previously on Fixpoint, I introduced my implementation of the getrawtransaction command and reviewed an existing one for sendrawtransaction. There were some more and less serious problems and stylistic considerations outstanding. To recap, for the "get":

  1. Not a proper vpatch.
  2. Loose parsing, ignoring trailing non-hex or excess characters in the txid argument.
  3. Directly accessing the mapTransactions object without acquiring the cs_mapTransactions lock.
  4. Unnecessarily exposing the mapTransactions object globally.

And for the "send", (with some points not previously noted, indicated by *, for my teeth grow sharper with the passage of time):

  1. Needing a regrind to build on other work from the intervening years.
  2. Redundant inclusion of CDataStream constructor defaults SER_NETWORK and VERSION.
  3. Loose parsing, ignoring excess data in the hex argument after the first decoded transaction.
  4. Refusal to re-broadcast transactions already in the node's memory pool.
  5. Returning -5 for already-seen transactions, a code otherwise used only for invalid arguments or inexistant objects. *
  6. New TransactionSeen function, clumsy in both usage and implementation.
  7. Supposed "missing inputs" condition that can't occur and wouldn't distinguish error codes if it did.
  8. Application of anti-spam checks by way of calling AcceptToMemoryPool with fCheckInputs=true.
  9. No reason provided for some rejection cases. *
  10. "txHash" breaking the conventional Hungarian notation, suggesting it's a CTransaction instance. *

Now meet bitcoin_rawtx_get_send.vpatch.

It addresses points 1, 3, 4, 5, 6, 8, 9, 10, 11, and 14. The risky exposure of mapTransactions is replaced with specific accessor functions that take care of locking. Each of these also turns out to simplify other code in main.cpp: always a nice sign that you're onto a good interface.

#2 I've discussed previously and don't see as a problem with this patch.

#7 I've let slide; please speak up if you think it's important.

#12 I've preserved deliberately, on the theory that it's better to reject transactions likely to be rejected by peers so the sender has a chance to determine the cause and fix it. This strikes me as a weak argument though; perhaps better would be to have a way to force it if necessary.

#13 is tricky because AcceptToMemoryPool is a monster of a function that could fail for a number of reasons, and doesn't make that reason available to the caller. It does at least write to the debug log so I've made the error message provide that referral.

The two previous patches added up to +99/-2 lines; the new unified one to +85/-10.

Further, I've now made the whole Real Bitcoin patch set and known seals available on my own reference code shelves.(i) I went through them all afresh to identify those on which I was comfortable adding my own signature; for the rest I used my "jfw_unchecked" subordinate key (both keys are available through my contact page).

On that shelf you may also notice a novel bitcoin_system_compiler.vpatch. It amounts to a full overhaul of the build system and will be the subject of an upcoming article.

  1. At some point I intend to put a formal, efficient, GNU bloatware free sync mechanism in place. For now, the best I can suggest to grab the whole collection in one shot is:

    wget --mirror --no-parent http://fixpoint.welshcomputing.com/v/

    [^]

2020-04-11

Preliminary report on the bitcoind hex conversion mess

Filed under: Bitcoin, Politikos, Software — Jacob Welsh @ 16:46

As I previously noted about my getrawtransaction patch,

It does accept invalid hex strings, perhaps a flaw in that SetHex method.

Indeed, SetHex in the base_uint class is completely permissive: it accepts leading whitespace, optional "0x" or "0X", then however many hex digits it finds, up to the length of the template specialization (160 or 256 bits), zero-filling if too short and ignoring excess if too long. It also includes some pointer arithmetic I believe has technically undefined behavior: comparing after decrementing past the start of the object. And it's one source of the obnoxious byte order reversal in bitcoind hash I/O, with GetHex being its counterpart.

The users of the function are the RPC commands listsinceblock and gettransaction. There's another SetHex function in CBigNum, similar but taking arbitrary length, entirely unused.

On the other hand, ParseHex, the subject of mod6_phexdigit_fix.vpatch, is oriented toward "dumps" in that it allows space between digit pairs and doesn't reverse byte order. It's used in some tests, in the hard-coded genesis block output in main.cpp, in the mining RPC commands getwork and getmemorypool, and in sendrawtransaction as seen in polarbeard_add_sendrawtransaction.vpatch.

My conclusion is that getrawtransaction is doing the right thing here, in that it shouldn't be made a special case, but the hex parsing should be cleaned up generally. If it were just me, I'd rip it all out and use a single, strict, non-reversing hex decoder. But there's no telling how much outside code or data has built up around the old rules. Parties interested in the matter, in the sense of having a meaningful amount of coin on the line, are encouraged to write in.(i)

  1. Not that I expect this to do much on its own, as politics involves involves actively going out to find people where they are and talk to them. [^]

2020-03-07

JFW's 130 top Trilema picks to date

Filed under: Bitcoin, Hardware, Historia, Lex, Paidagogia, Philosophia, Politikos, Software, Vita — Jacob Welsh @ 16:25

Inquiring minds have asked of me to please shed a bit more light on what this Republic thing and that Popescu fellow in particular are all about. Is there more to it than the ravings that first meet the eye, of sluts and slaves and scandalous sexual predations and every "ism" and trigger word known to man or woman? What's the value I see in it that keeps me coming back? And what's the plan for this world domination thing anyway?

I gave the most accurate response I could, if not the most helpful: see, all you gotta do is read a couple thousand articles in multiple languages averaging maybe a thousand words each, a couple times over, and likely a bunch of the imported cultural environment and extensive chat logs besides, and then all will become clear! At least as clear as it can be so far. At least I think it will. But what would I know, I'm a long ways from being there.

Well great, so couldn't I at least give an executive summary? Not exactly an easy task either. Short of that, here's an attempt at picking some of the especially interesting, informative or significant articles on Trilema from my reading so far, a map of sorts of enticing entries to the rabbit hole.

The very unfair process that articles went through to make this list was as follows:

  1. I extracted an initial set of 957 items from my presently accessible browsing history, using some CLI magic.(i)
  2. I narrowed the list to those where I believed I recalled something of the article, going off the title alone. This brought it down to 424.
  3. I further selected based on roughly the above "interesting, informative or significant" standard in my subjective perception, again by memory from title alone.(ii) I also ended up skipping some that would have met this by way of having especially horrified me; not sure if I've done anyone any favors thus, but there it is.

The ordering within each publication year is merely alphabetical (because I can't quite see a pressing need to do it better in this context).

Enjoy... if you dare. What can I say, it's not for everyone.

2012

2013

2014

2015

2016

2017

2018

2019

2020

  • The slap and human dignity
  • Fin.

    1. You know Firefox keeps this in a SQL database, yes? Because they told you about it in the manual, and documented the schema and all? [^]
    2. At times I was overpowered by the temptation to go check, with the inevitable expenditure of time on re-reading which, useful as it can be, I hadn't planned on getting drawn into just now. And while my shiny tools got this down to a minimal "this button to keep, that button to skip" flow, they were entirely powerless to speed up the thinking. [^]

2020-03-04

Bitcoin transactions and their signing, 2: attachment

Filed under: Bitcoin, Software — Jacob Welsh @ 20:10

Having outlined the shape of the building block provided by digital signatures, we now face the potential problem of how to attach signatures to the messages they sign. The one hard requirement for any attachment scheme is that the verification function can work, that is, can answer unambiguously whether a signature is valid for a specific message and key. I will explore the space of possible approaches here,(i) then describe the one used in Bitcoin.

The simplest approach is to say: "what problem?" That is, treat the message and signature as separate objects (bitstrings, numbers, files or however you like to think of them) and use some external system to organize them. This is known in the traditional GPG toolset as detached signing. It has its advantages, besides the obvious "less work to implement". The original, unmodified message is directly available to the reader and his tools. New signatures can be added to a collection without duplicating or modifying the message object, and thus without needing further verification that they in fact refer to the same message. These properties are exploited in present manifestations of the V version control concept.

Assuming one does indeed want attached signatures, then, the first option is to package the message and signature together in some container format. Depending on how it's done, this can preserve the advantage that at least a semblance of the original message is readily visible in plain text, as with GPG clearsigning.(ii) New signatures can be added either with support from the container format, producing a single multiply-signed document, or without such support, either by nesting (such that each new signature references the previous stack) or duplication.

A second option, when the message represents a formal data structure, is to embed signatures in that structure itself in an application-specific way. At first sight this appears to be a circular data dependency: how can a signature be computed for a message that includes a representation of that signature?(iii) However, this can be worked around by applying a transformation to clip or whiteout the signature field at both signing and verification time.

The third and final option is to generalize the previous into a flexible or perhaps even universal embedding scheme. For example, signatures can be wrapped in whatever comment delimiters are available in a programming language, as seen in Mircea Popescu's recent proposal.(iv)

Bitcoin transactions, we can now say, use option #2: format-specific embedding, though with some added complications as follows.

The signature on each input is wrapped using the "script" encoding, in a field originally named "scriptSig", and its interpretation is determined by a corresponding script in the linked output being spent, originally "scriptPubKey". If we constrain our interest to transactions in the standard pay-to-pubkey-hash form, these considerations reduce to a formality.

The whiteout procedure is basically to replace the scriptSig on each input with an empty script. This implies the signatures are independent of each other. The twist, though, is that for the input for which a signature is being computed, the scriptSig is replaced instead by the corresponding scriptPubKey. I can't see any security advantage in doing this, since the previous output is already referenced by a unique identifier(v) covered by the signature. The result is that a different message must be signed for each input, and transaction verification takes quadratic time with respect to the number of inputs. This makes for a good reminder that the Bitcoin protocol externalizes much of the cost of transacting onto all node operators, and unless a satisfactory solution to that tough problem is deployed, transaction throughput must be kept a scarce resource.

To be continued.

  1. I struggled more than usual in writing about these, perhaps indicating I didn't grasp them as well as I'd thought. I don't claim to be equipped to discuss why one choice might be philosophically preferable to others; yet neither can I take a "purely technical" approach since cryptography is necessarily shaped as much from above by its utility to human society as from below by mathematical possibility. Maybe search the logs? [^]
  2. That format however incurs further complexity from tackling the additional perceived problems of linefeed normalization and in-band bracketing for inclusion in a larger text, with the drawback of having to quote instances of the magical bracket sequence in the signed message. [^]
  3. Such a message can be conceived as a fixpoint of the hash-sign-attach pipeline, but finding one in practice would seem to constitute a severe break in the cryptographic primitives. [^]
  4. It's not yet clear to me if or how this can be implemented reliably. For starters, how would you distinguish actual signatures from, say, quoted signatures, without knowing the lexical rules of the target language? How would the "whiteout" work to produce the same hash after addition of new signatures, without knowing same? [^]
  5. Well, not quite unique but at least identifying its contents including the scriptPubKey in question, to the extent you trust SHA256. And if you don't trust that, the signature hash would seem to be the bigger problem. [^]

2020-02-25

Bitcoin transactions and their signing, 1

Filed under: Bitcoin, Software — Jacob Welsh @ 23:42

As my offline Bitcoin signer nears completion, it's a good time to introduce just what Bitcoin transactions are anyway, how they are signed, and not least of all how it could go horribly wrong if we're not careful. This first part will cover the basics that I consider required knowledge for anyone who handles the currency.

A Bitcoin transaction is a message with particular structure and binary encoding rules,(i) specifying the transfer of given quantities from one set of accounts to another.

Transactions are composed of inputs and outputs. Each output specifies a monetary value and a destination address.(ii) Each input contains a reference to a previous transaction output(iii) and a signature authorizing its spending. In a quirk of implementation, the "accounts" mentioned above don't explicitly exist in the system; outputs are considered either unspent or spent in full by inclusion in a subsequent transaction. Your available balance, then, is the total value of unspent outputs for which you are able to issue valid signatures. Since the amount to be sent isn't usually an exact sum of previous outputs, a "change" output is added so as to overshoot and send the excess back to the original owner.

Observing that the scheme as presented so far rests on the strength of the signature, let's briefly expand on that concept, leaving the mathematical details as a black box for present purposes. A digital signature scheme provides three high-level operations: key generation, signing, and verification. Key generation takes some cryptographic entropy as input and produces a public/private key pair. Signing takes a fixed-length message hash, a private key, and possibly some further entropy and produces a signature. Verification answers whether a purported signature is valid for a given hash and public key. This gives a high degree of confidence that the signature could only have been issued by someone with knowledge of the private key (as long as some underlying unproven mathematical assumptions hold, which they appear to have so far despite ample incentive to break them). Note the distinct advantage over traditional pen-and-paper signatures: simply seeing one does not grant an ability to forge it or pass it off as covering some other message, despite the susceptibility of digital information to perfect copying and easy modification.

To be continued.

  1. Due to an unfortunate misallocation of brain cycles by Satoshi and the others who imagined themselves Bitcoin developers in the early days, there's a whole cocktail of encodings with, for example, at least four different ways to represent integers. While this makes for some added implementation complexity, the details aren't especially important for normal usage. [^]
  2. Technically a "script", but for simplicity we'll consider only the standard "pay-to-pubkey-hash" form. [^]
  3. Except in the case of "coinbase" transactions which issue mining rewards. [^]

2020-01-20

Draft gbw-node frontend, part 6

Filed under: Bitcoin, Software — Jacob Welsh @ 21:32

Continued from:

The first of the input/output commands is to print a table of possibly-spendable outputs in the format required by the offline signer. While the Bitcoin protocol refers to transactions by 256-bit hash, the more compact confirmation coordinates (height, index) are included for convenience in the comment field. The queries are a bit lengthy, since we now join several tables to build the flat output file, but aren't doing anything too fancy once you break them down. The only difference when a tag is specified is the extra join to filter on its ID.

In some cases, BLOB fields need to be converted back to str.(i)

def cmd_unspent_outs(argv):
	'''
	unspent-outs [TAG]

	Display the unspent outputs table for addresses with the given TAG (or all watched addresses), as required by the offline wallet, ordered by age.
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=? \
				ORDER BY block_height DESC', (tag_id,))
	else:
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				WHERE spent IS NULL \
				ORDER BY block_height DESC')
	for a, v, hash, n_out, height, n_tx in r:
		stdout.write('%s %s %s %s #blk %s tx %s\n' % (format_address(str(a)), format_coin(v), b2lx(hash), n_out, height, n_tx))

Idea: Add a command to print the outputs table for a given raw transaction. For example, this would enable spending unconfirmed or too recently confirmed outputs in a pinch, without requiring any further changes. Or more generally: all the data conversion code is already here so might as well make it accessible.

Next we proceed to the accounting commands, as they're really just another kind of output command. The balance of an address set is the total value of unspent outputs to addresses in the set.

def cmd_balance(argv):
	'''
	balance [TAG]

	Display confirmed balance of addresses with the given TAG (or all watched addresses).
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output WHERE spent IS NULL')
	bal, = r.fetchone()
	stdout.write('%s\n' % format_coin(bal))

Things get tricker for the register report as it attempts to usefully summarize several things in a small space. In particular, summing the incoming and outgoing value per transaction seems to require separate queries since the join criteria differ.(ii)

def cmd_register(argv):
	'''
	register [TAG]

	Display a tab-delimited transaction register report for addresses with the given TAG (or all watched addresses). Columns are:

	- confirmation block height
	- number of transaction within block
	- total deposits (new outputs)
	- total withdrawals (spent outputs)
	- running balance
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
	else:
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
	bal = 0
	for height, n, o_val, i_val in merge_moves(outs.fetchall(), ins.fetchall()):
		bal = bal + o_val - i_val
		stdout.write('%s\t%s\t%s\t%s\t%s\n' % (height, n, format_coin(o_val), format_coin(-i_val), format_coin(bal)))

A helper is used to join the two possibly uneven lists by transaction, inserting zeros for transactions found on only one side. Perhaps it could all be done in SQL with subqueries and some type of outer joins, but I wasn't quite seeing it, so resorted to the low level with an algorithm reminiscent of the merging step of classical mergesort.

# Merge ordered lists of total input and output values per transaction into single table with columns for both.
def merge_moves(outs, ins):
	i = o = 0

	while True:
		if o == len(outs):
			for height, n, val in ins[i:]:
				yield (height, n, 0, val)
			return
		o_height, o_n, o_val = outs[o]
		o_key = (o_height, o_n)

		if i == len(ins):
			for height, n, val in outs[o:]:
				yield (height, n, val, 0)
			return
		i_height, i_n, i_val = ins[i]
		i_key = (i_height, i_n)

		if o_key < i_key:
			yield (o_height, o_n, o_val, 0)
			o += 1
		elif i_key < o_key:
			yield (i_height, i_n, 0, i_val)
			i += 1
		else:
			yield (o_height, o_n, o_val, i_val)
			i += 1
			o += 1

Next, the input commands. For sanity's sake, we exclude newlines in tag names as implicitly required by the tags listing format.

def cmd_watch(argv):
	'''
	watch [TAG]

	Import a set of addresses to watch linewise from stdin, optionally named by the given TAG. Addresses can be associated with multiple tags using multiple watch commands.
	'''
	tag_id = None
	if len(argv) > 0:
		name = argv.pop(0)
		if '\n' in name:
			die('newline not allowed in tag name')
		tag_id = insert_or_get_tag_id(name)
	while True:
		l = stdin.readline()
		if len(l) == 0:
			break
		addr_id = insert_or_get_address_id(parse_address(l.rstrip('\n')))
		if tag_id is not None:
			try:
				db.execute('INSERT INTO address_tag (address_id, tag_id) VALUES (?,?)',
						(addr_id, tag_id))
			except IntegrityError:
				pass
		db.commit()

def cmd_push(argv):
	'''
	push

	Import raw hex transactions linewise from stdin and send to bitcoind.
	'''
	while True:
		line = stdin.readline()
		if len(line) == 0:
			break
		tx_hex = line.rstrip('\n')
		stdout.write('txid %s\n' % rpc('sendrawtransaction', tx_hex))

General or command-specific help, and a command registry allowing abbreviation:

def cmd_help(argv):
	'''
	help [COMMAND]

	Display help for a given command or list all commands.
	'''
	if len(argv) > 0:
		name = argv.pop(0)
		name, func = get_command(name)
		doc = getdoc(func)
		if doc is None:
			stdout.write('No help for %r\n' % name)
		else:
			stdout.write('gbw-node %s\n' % doc)
	else:
		stdout.write('''Usage: gbw-node COMMAND [ARGS]

Available commands (can be abbreviated when unambiguous):

%s
''' % '\n'.join([name for name, f in cmds]))

cmds = (
	('help', cmd_help),
	('scan', cmd_scan),
	('reset', cmd_reset),
	('tags', cmd_tags),
	('addresses', cmd_addresses),
	('unspent-outs', cmd_unspent_outs),
	('watch', cmd_watch),
	('push', cmd_push),
	('balance', cmd_balance),
	('register', cmd_register),
)

def get_command(name):
	rows = [r for r in cmds if r[0].startswith(name)]
	if len(rows) == 0:
		die('command not found: %s' % name)
	if len(rows) > 1:
		die('ambiguous command %s. Completions: %s' % (name, ' '.join([r[0] for r in rows])))
	return rows[0]

When invoked as a program (as opposed to imported elsewhere e.g. for testing), we connect to the database, enable foreign key constraints, and boost cache size and checkpoint interval from the meager defaults. These can be tuned if needed to optimize the scan process for your machine. Finally we dispatch to the given command.

Ideally, we'd create the database from schema here if not found.

def main():
	global db
	signal.signal(signal.SIGINT, signal.SIG_DFL)
	require_dir(gbw_home)
	db = sqlite3.connect(gbw_home + '/db', timeout=600) # in seconds
	db.execute('PRAGMA foreign_keys=ON')
	db.execute('PRAGMA cache_size=-8000') # negative means in KiB
	db.execute('PRAGMA wal_autocheckpoint=10000') # in pages (4k)
	if len(argv) < 2:
		die('missing command', help=True)
	get_command(argv[1])[1](argv[2:])

if __name__ == '__main__':
	main()

This concludes the node frontend. Congratulations if you've followed thus far! There's no magic in programming, just a ruthless decomposition of bigger problems into smaller ones, a search for useful and robust abstractions -- and of course a whole lot of background reading and practice.

In the next month or two I will be completing the missing pieces of the signer; meanwhile, the code here is quite ready to play with. Import some addresses, run a scan, run the reports, and let me know how it goes in the comments below.

  1. Well, so far the only such case is format_address, so perhaps it should just be changed to allow passing a buffer. [^]
  2. It's looking like the COALESCE trick is pointless here, since rows are only generated by the join when matching outputs are present; that is, the SUM aggregation is always getting at least one row. Was I overzealous before? I don't recall if I observed an actual problem here rather than just in cmd_balance. It does no harm to leave in though, at least as far as correctness. [^]

2020-01-19

Draft gbw-node frontend, part 5

Filed under: Bitcoin, Software — Jacob Welsh @ 19:02

Continued from:

Command implementations

The core scanning logic is in a helper function that takes a block's height and a memory view of its contents.

Referential integrity between blocks is ensured by scanning sequentially by height; that is, all relevant tx and output records from prior blocks will be known by the time we see the inputs that spend them. However, as far as I know this topological ordering is not guaranteed for the transaction sequence within a block (eg. tx 1 could spend outputs of tx 2, or vice versa) so we do separate passes over the transaction list for outputs and inputs.

def scan_block(height, v):
	stdout.write('block %s' % height)
	# [perf] computing every tx hash
	(blkhash, prev, time, target, txs), size = load_block(v)

The performance comment above was just to note some not-strictly-necessary work being done, in case the scan ended up horribly slow.(i)

An output is relevant if its script is standard and pays a known address. At least with foreign key constraints enabled, we can't insert an output until the tx record it references exists, but we don't know whether to insert the tx until we see if any of its outputs are relevant, so we again use a two-pass approach.

	count_out = 0
	n_tx = 0
	for (hash, size, txins, txouts) in txs:
		matched_outs = []
		for n, txout in enumerate(txouts):
			val, script = txout
			a = out_script_address(script)
			if a is not None:
				#print format_address(a)
				addr_id = get_address_id(a)
				if addr_id is not None:
					matched_outs.append((n, addr_id, val))
		if len(matched_outs) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, addr_id, val in matched_outs:
				insert_output(tx_id, n, addr_id, val)
			count_out += len(matched_outs)
		n_tx += 1
	stdout.write(' new-outs %s' % count_out)

An input is relevant if it spends a known output. Recall that insert_input updates the corresponding output to create the back-reference, indicating it has been spent.

	# Inputs scanned second in case an output from the same block is spent.
	# Coinbase (input of first tx in block) doesn't reference anything.
	count_in = 0
	n_tx = 1
	for (hash, size, txins, txouts) in txs[1:]:
		matched_ins = []
		for n, txin in enumerate(txins):
			prevout_hash, prevout_n, scriptsig = txin
			prevout_tx_id = get_tx_id(prevout_hash)
			if prevout_tx_id is not None:
				prevout_id = get_output_id(prevout_tx_id, prevout_n)
				if prevout_id is not None:
					matched_ins.append((n, prevout_id))
		if len(matched_ins) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, prevout_id in matched_ins:
				insert_input(tx_id, n, prevout_id)
			count_in += len(matched_ins)
		n_tx += 1
	stdout.write(' spent-outs %s\n' % count_in)

Assorted helpers: handling usage errors; looking up a tag ID that must exist.

def die(msg, help=False):
	stderr.write('gbw-node: %s\n' % msg)
	if help:
		cmd_help([])
	exit(-1)

def require_tag(name):
	i = get_tag_id(name)
	if i is None:
		die('tag not found: %r' % name)
	return i

The entry point for any user command "X" is the function "cmd_X", having help text in its docstring and taking a list of any supplied CLI arguments past the command name.

First, the sync commands. The scan process commits one database transaction per block.

def cmd_scan(argv):
	'''
	scan

	Iterate blocks from bitcoind, indexing transaction inputs and outputs affecting watched addresses. May be safely interrupted and resumed.

	NOT PRESENTLY SAFE TO RUN CONCURRENT INSTANCES due to the dumpblock to named pipe kludge.
	'''
	db.execute('PRAGMA synchronous=NORMAL')
	height = db.execute('SELECT scan_height FROM state').fetchone()[0]
	blockcount = max(-1, rpc('getblockcount') - CONFIRMS)
	while height < blockcount:
		height += 1
		scan_block(height, memoryview(getblock(height)))
		db.execute('UPDATE state SET scan_height = ?', (height,))
		db.commit()

def cmd_reset(argv):
	'''
	reset

	Reset the scan pointer so the next scan will proceed from the genesis block, to find transactions associated with newly watched addresses.
	'''
	db.execute('UPDATE state SET scan_height = -1')
	db.commit()

Next, commands to query the watched address sets (not in the original spec but trivial and clearly useful).

def cmd_tags(argv):
	'''
	tags

	List all tag names.
	'''
	for name, in db.execute('SELECT name FROM tag'):
		stdout.write(name + '\n')

def cmd_addresses(argv):
	'''
	addresses [TAG]

	List addresses with the given TAG (or all watched addresses).
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address FROM address \
				JOIN address_tag ON address.address_id=address_tag.address_id \
				WHERE tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT address FROM address')
	for a, in r:
		stdout.write(format_address(str(a)) + '\n')

To be continued.

  1. I've found the Python profiler quite useful so far compared to such guesswork; still, optimization is something of a balance between experimentally-driven efforts and not doing obviously wasteful things from the start. [^]

Draft gbw-node frontend, part 4

Filed under: Bitcoin, Software — Jacob Welsh @ 04:36

Continued from:

Common database operations

As an internal convention, a "get_X_id" function will return the database ID for the row in table "X" named by its bulkier external reference, or None if not found. Similarly, "insert_or_get_X_id" will insert a row if needed and in either case return the ID. Some of these have only a single caller, but I find that collecting the various similar queries in one place and wrapping them into tidy functions helps readability.

The mapping of Python to SQLite types is fairly straightforward, except that buffer is needed to specify a BLOB.

The "parameter substitution" feature is used throughout, avoiding improper mixing of code and data that could manifest as SQL injection or thrashing the compiled statement cache.

def get_address_id(a):
	r = db.execute('SELECT address_id FROM address WHERE address=?', (buffer(a),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_address_id(a):
	i = get_address_id(a)
	if i is not None:
		return i
	return db.execute('INSERT INTO address (address) VALUES (?)', (buffer(a),)).lastrowid

def get_tx_id(hash):
	r = db.execute('SELECT tx_id FROM tx WHERE hash=?', (buffer(hash),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tx_id(hash, blkhash, height, n, size):
	try:
		return db.execute('INSERT INTO tx (hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)',
				(buffer(hash), buffer(blkhash), height, n, size)).lastrowid
	except IntegrityError:
		# XXX check equality?
		return get_tx_id(hash)

I now think we should indeed catch that condition (differing transactions with identical hash), especially given the possibility of TXID collisions. Perhaps I left it out from excessive worry about scan performance. Or just laziness.

The mixture of check-first and try-first styles seen above also doesn't sit well. The possibility of TOCTTOUs,(i) depending on the details of transaction isolation level, would seem to make a strong case for try-first. It's a minor point though; the worst case here would be an uncaught IntegrityError halting the program gracefully.

def insert_output(tx_id, n, addr_id, val):
	try:
		db.execute('INSERT INTO output (tx_id, n, address_id, value) VALUES (?,?,?,?)',
				(tx_id, n, addr_id, val))
	except IntegrityError:
		r = db.execute('SELECT address_id, value FROM output WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()
		if r != (addr_id, val):
			raise Conflict('output differs from previous content', tx_id, n, (addr_id, val), r)

def insert_input(tx_id, n, prevout_id):
	try:
		input_id = db.execute('INSERT INTO input (tx_id, n) VALUES (?,?)', (tx_id, n)).lastrowid
	except IntegrityError:
		input_id = db.execute('SELECT input_id FROM input WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()[0]
	db.execute('UPDATE output SET spent=? WHERE output_id=?', (input_id, prevout_id))

def get_output_id(tx_id, n):
	r = db.execute('SELECT output_id FROM output WHERE tx_id=? AND n=?', (tx_id, n)).fetchone()
	return None if r is None else r[0]

def get_tag_id(name):
	r = db.execute('SELECT tag_id FROM tag WHERE name=?', (name,)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tag_id(name):
	i = get_tag_id(name)
	if i is not None:
		return i
	return db.execute('INSERT INTO tag (name) VALUES (?)', (name,)).lastrowid

Next up, we'll finally get to implementing the commands themselves. To be continued.

  1. The "time of check to time of use" race condition. You know, like sitting down when some trickster's meanwhile moved the chair. [^]

2020-01-17

Draft gbw-node frontend, part 3

Filed under: Bitcoin, Software — Jacob Welsh @ 18:02

Continued from:

Base58

Bitcoin addresses are conventionally written in a special-purpose encoding and include a hash truncated to 32 bits for error detection. As the reference implementation explains:

Why base-58 instead of standard base-64 encoding?
- Don't want 0OIl characters that look the same in some fonts and could be used to create visually identical looking account numbers.
- A string with non-alphanumeric characters is not as easily accepted as an account number.
- E-mail usually won't line-break if there's no punctuation to break at.
- Doubleclicking selects the whole number as one word if it's all alphanumeric.

Of course, all these points would have been answered just as well by hexadecimal, and without the various burdens: case-sensitivity for the user (the surest way I've found to read these out is the fully explicit: "one five big-A three little-X ..."); more code for the implementer; and more work for the machine (as the lack of bit alignment demands a general base conversion algorithm).

We start with lookup tables to convert the digits 0-57 to the specified alphabet and back. I was once surprised to learn the scope of iteration variables in a Python "for" loop is not restricted to the loop body: a potential source of referential confusion, reflecting the language's casual approach to mutation. Thus, when at the global scope I like to ensure throwaway names, like "index" and "character" here, are safely contained in a function.

base58_alphabet = (string.digits + string.uppercase + string.lowercase).translate(None, '0OIl')
base58_inverse = [None]*256
def init_base58_inverse():
	for index, character in enumerate(base58_alphabet):
		base58_inverse[ord(character)] = index
init_base58_inverse()

To do base conversion we'll need to treat byte sequences as integers with the same ordering conventions as the reference code. Otherwise put: to decode from base-256 to abstract integers. Python 2 doesn't have a builtin for this. The algorithm is not optimal, but the base-58 part will be worse anyway.

def bytes_to_int(b):
	"Convert big-endian byte sequence to unsigned integer"
	i = 0
	for byte in b:
		i = (i << 8) + ord(byte)
	return i

To complete the bytes-to-ASCII converter we extract digits from the integer, least significant first, by iterated division with remainder by 58. Since the conversion to integer loses track of field width, the convention is to pad with the same number of base-58 zeros as there were base-256 leading zeros in the input. In further fallout from using a non-bit-aligned encoding, these are not naturally constant time or constant control-flow operations.

For the same bit cost of the error detection code we could have had error correction. But that would have required, like, math, and stuff.

def b2a_base58check(data):
	data += sha256d(data)[:4]

	leading_zeros = 0
	for b in data:
		if b != '\x00':
			break
		leading_zeros += 1

	data_num = bytes_to_int(data)

	digits = []
	while data_num:
		data_num, digit = divmod(data_num, 58)
		digits.append(digit)
	digits.extend([0] * leading_zeros)

	return ''.join(base58_alphabet[digit] for digit in reversed(digits))

Converting back to bytes uses the inverse operation at each step, but now there are cases of invalid input to reject: digits outside the specified alphabet and corruption detected by the checksum. (The precise function decomposition is a bit arbitrary and asymmetrical I'll admit.)

class Base58Error(ValueError):
	pass

class BadDigit(Base58Error):
	pass

class BadChecksum(Base58Error):
	pass

def a2b_base58(data):
	digits = [base58_inverse[ord(b)] for b in data]
	if None in digits:
		raise BadDigit

	leading_zeros = 0
	for digit in digits:
		if digit != 0:
			break
		leading_zeros += 1

	data_num = 0
	for digit in digits:
		data_num = 58*data_num + digit

	data_bytes = []
	while data_num:
		data_bytes.append(data_num & 0xFF)
		data_num = data_num >> 8
	data_bytes.extend([0] * leading_zeros)

	return ''.join(chr(b) for b in reversed(data_bytes))

def a2b_base58check(data):
	data = a2b_base58(data)
	payload = data[:-4]
	check = data[-4:]
	if check != sha256d(payload)[:4]:
		raise BadChecksum
	return payload

Finally we apply this encoding to Bitcoin addresses, which have a fixed 160-bit width plus an extra "version" byte that becomes the familiar leading "1".

class BadAddressLength(ValueError):
	pass

class BadAddressVersion(ValueError):
	pass

def parse_address(a):
	b = a2b_base58check(a)
	if len(b) != 21:
		raise BadAddressLength
	if b[0] != '\x00':
		raise BadAddressVersion(ord(b[0]))
	return b[1:]

def format_address(b):
	return b2a_base58check('\x00' + b)

All this format conversion groundwork out of the way, we'll start talking to the database and putting it all together. To be continued!

Older Posts »

Powered by MP-WP. Copyright Jacob Welsh.