Bitcoin transactions and their signing, 2: attachment

Filed under: Bitcoin, Software — Jacob Welsh @ 20:10

Having outlined the shape of the building block provided by digital signatures, we now face the potential problem of how to attach signatures to the messages they sign. The one hard requirement for any attachment scheme is that the verification function can work, that is, can answer unambiguously whether a signature is valid for a specific message and key. I will explore the space of possible approaches here,(i) then describe the one used in Bitcoin.

The simplest approach is to say: "what problem?" That is, treat the message and signature as separate objects (bitstrings, numbers, files or however you like to think of them) and use some external system to organize them. This is known in the traditional GPG toolset as detached signing. It has its advantages, besides the obvious "less work to implement". The original, unmodified message is directly available to the reader and his tools. New signatures can be added to a collection without duplicating or modifying the message object, and thus without needing further verification that they in fact refer to the same message. These properties are exploited in present manifestations of the V version control concept.

Assuming one does indeed want attached signatures, then, the first option is to package the message and signature together in some container format. Depending on how it's done, this can preserve the advantage that at least a semblance of the original message is readily visible in plain text, as with GPG clearsigning.(ii) New signatures can be added either with support from the container format, producing a single multiply-signed document, or without such support, either by nesting (such that each new signature references the previous stack) or duplication.

A second option, when the message represents a formal data structure, is to embed signatures in that structure itself in an application-specific way. At first sight this appears to be a circular data dependency: how can a signature be computed for a message that includes a representation of that signature?(iii) However, this can be worked around by applying a transformation to clip or whiteout the signature field at both signing and verification time.

The third and final option is to generalize the previous into a flexible or perhaps even universal embedding scheme. For example, signatures can be wrapped in whatever comment delimiters are available in a programming language, as seen in Mircea Popescu's recent proposal.(iv)

Bitcoin transactions, we can now say, use option #2: format-specific embedding, though with some added complications as follows.

The signature on each input is wrapped using the "script" encoding, in a field originally named "scriptSig", and its interpretation is determined by a corresponding script in the linked output being spent, originally "scriptPubKey". If we constrain our interest to transactions in the standard pay-to-pubkey-hash form, these considerations reduce to a formality.

The whiteout procedure is basically to replace the scriptSig on each input with an empty script. This implies the signatures are independent of each other. The twist, though, is that for the input for which a signature is being computed, the scriptSig is replaced instead by the corresponding scriptPubKey. I can't see any security advantage in doing this, since the previous output is already referenced by a unique identifier(v) covered by the signature. The result is that a different message must be signed for each input, and transaction verification takes quadratic time with respect to the number of inputs. This makes for a good reminder that the Bitcoin protocol externalizes much of the cost of transacting onto all node operators, and unless a satisfactory solution to that tough problem is deployed, transaction throughput must be kept a scarce resource.

To be continued.

  1. I struggled more than usual in writing about these, perhaps indicating I didn't grasp them as well as I'd thought. I don't claim to be equipped to discuss why one choice might be philosophically preferable to others; yet neither can I take a "purely technical" approach since cryptography is necessarily shaped as much from above by its utility to human society as from below by mathematical possibility. Maybe search the logs? [^]
  2. That format however incurs further complexity from tackling the additional perceived problems of linefeed normalization and in-band bracketing for inclusion in a larger text, with the drawback of having to quote instances of the magical bracket sequence in the signed message. [^]
  3. Such a message can be conceived as a fixpoint of the hash-sign-attach pipeline, but finding one in practice would seem to constitute a severe break in the cryptographic primitives. [^]
  4. It's not yet clear to me if or how this can be implemented reliably. For starters, how would you distinguish actual signatures from, say, quoted signatures, without knowing the lexical rules of the target language? How would the "whiteout" work to produce the same hash after addition of new signatures, without knowing same? [^]
  5. Well, not quite unique but at least identifying its contents including the scriptPubKey in question, to the extent you trust SHA256. And if you don't trust that, the signature hash would seem to be the bigger problem. [^]


Bitcoin transactions and their signing, 1

Filed under: Bitcoin, Software — Jacob Welsh @ 23:42

As my offline Bitcoin signer nears completion, it's a good time to introduce just what Bitcoin transactions are anyway, how they are signed, and not least of all how it could go horribly wrong if we're not careful. This first part will cover the basics that I consider required knowledge for anyone who handles the currency.

A Bitcoin transaction is a message with particular structure and binary encoding rules,(i) specifying the transfer of given quantities from one set of accounts to another.

Transactions are composed of inputs and outputs. Each output specifies a monetary value and a destination address.(ii) Each input contains a reference to a previous transaction output(iii) and a signature authorizing its spending. In a quirk of implementation, the "accounts" mentioned above don't explicitly exist in the system; outputs are considered either unspent or spent in full by inclusion in a subsequent transaction. Your available balance, then, is the total value of unspent outputs for which you are able to issue valid signatures. Since the amount to be sent isn't usually an exact sum of previous outputs, a "change" output is added so as to overshoot and send the excess back to the original owner.

Observing that the scheme as presented so far rests on the strength of the signature, let's briefly expand on that concept, leaving the mathematical details as a black box for present purposes. A digital signature scheme provides three high-level operations: key generation, signing, and verification. Key generation takes some cryptographic entropy as input and produces a public/private key pair. Signing takes a fixed-length message hash, a private key, and possibly some further entropy and produces a signature. Verification answers whether a purported signature is valid for a given hash and public key. This gives a high degree of confidence that the signature could only have been issued by someone with knowledge of the private key (as long as some underlying unproven mathematical assumptions hold, which they appear to have so far despite ample incentive to break them). Note the distinct advantage over traditional pen-and-paper signatures: simply seeing one does not grant an ability to forge it or pass it off as covering some other message, despite the susceptibility of digital information to perfect copying and easy modification.

To be continued.

  1. Due to an unfortunate misallocation of brain cycles by Satoshi and the others who imagined themselves Bitcoin developers in the early days, there's a whole cocktail of encodings with, for example, at least four different ways to represent integers. While this makes for some added implementation complexity, the details aren't especially important for normal usage. [^]
  2. Technically a "script", but for simplicity we'll consider only the standard "pay-to-pubkey-hash" form. [^]
  3. Except in the case of "coinbase" transactions which issue mining rewards. [^]


From the forum log, 30 December 2019

Filed under: Paidagogia, Software, Summaries — Jacob Welsh @ 18:48

From #trilema: 2019-12-30.

MP clarified for trinque that nobody's proposing all software be installed everywhere, and his own point was more about making gradual code and design cleanups so the mess becomes much smaller. A V tree could grow in different directions for different purposes, with a GUI environment likely being an early such split.

He weighed in on the Lisp critique thread. The latest ASDF was wrecked by Francois-Rene Rideau aka fare when the elder Lispers were no longer around to stop it. MP's experience with Lisp code had been that most ended up being Python anyway. He cited the late Erik Naggum on the trouble with free Lisp implementations lacking the more powerful features, but lambasted him for checking out of life before he could do anything useful, along with recent TMSR dropouts. He formulated the purpose of Lisp as for solving closed problems in the most efficient way, once problems are adequately understood.

In computer graphics, the nonsense of "XML shaders" was more popular than diana_coman knew.

Hanbot noted the use of Scheme scripting in Gimp, apropos of both practical Lisp usage and bypassing large dependency trees as seen in ImageMagick.

Mocky said he'll catch up.

MP and diana_coman picked up a discussion from blog comments about Eulora work, wherein various parts are stalled; the project outgrew the world it came from, in terms of both tools and artists. MP explored the art question by comparing valuation of a lone fighter aircraft, to a black-box Go-playing AI, to a cube of colored pixels: all being piles of accumulated tinkering, results of another's process, useless on their own without access to the process. Maintaining infrastructure to cater to the current crop of artists would be costly; growing new talent likewise. Hanbot had once done some Blender character modeling work but got stuck on the last step of exporting to a usable format; unfortunately she didn't recall documenting the struggle. MP posed the practical options as either trawling the entire Internet for artwork and sorting out what's usable, or machine-generating it and doing similar, at least knowing it's in a usable format. Diana_coman wasn't enthused by either. MP was interested in doing the first anyway, inviting unemployed readership to talk.

For the first time ever, MP found himself struggling with recent publication volume. He and diana_coman pointed out some kinks in trinque's blog setup; trinque started looking into it.


Draft gbw-node frontend, part 6

Filed under: Bitcoin, Software — Jacob Welsh @ 21:32

Continued from:

The first of the input/output commands is to print a table of possibly-spendable outputs in the format required by the offline signer. While the Bitcoin protocol refers to transactions by 256-bit hash, the more compact confirmation coordinates (height, index) are included for convenience in the comment field. The queries are a bit lengthy, since we now join several tables to build the flat output file, but aren't doing anything too fancy once you break them down. The only difference when a tag is specified is the extra join to filter on its ID.

In some cases, BLOB fields need to be converted back to str.(i)

def cmd_unspent_outs(argv):
	unspent-outs [TAG]

	Display the unspent outputs table for addresses with the given TAG (or all watched addresses), as required by the offline wallet, ordered by age.
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=? \
				ORDER BY block_height DESC', (tag_id,))
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				WHERE spent IS NULL \
				ORDER BY block_height DESC')
	for a, v, hash, n_out, height, n_tx in r:
		stdout.write('%s %s %s %s #blk %s tx %s\n' % (format_address(str(a)), format_coin(v), b2lx(hash), n_out, height, n_tx))

Idea: Add a command to print the outputs table for a given raw transaction. For example, this would enable spending unconfirmed or too recently confirmed outputs in a pinch, without requiring any further changes. Or more generally: all the data conversion code is already here so might as well make it accessible.

Next we proceed to the accounting commands, as they're really just another kind of output command. The balance of an address set is the total value of unspent outputs to addresses in the set.

def cmd_balance(argv):
	balance [TAG]

	Display confirmed balance of addresses with the given TAG (or all watched addresses).
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=?', (tag_id,))
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output WHERE spent IS NULL')
	bal, = r.fetchone()
	stdout.write('%s\n' % format_coin(bal))

Things get tricker for the register report as it attempts to usefully summarize several things in a small space. In particular, summing the incoming and outgoing value per transaction seems to require separate queries since the join criteria differ.(ii)

def cmd_register(argv):
	register [TAG]

	Display a tab-delimited transaction register report for addresses with the given TAG (or all watched addresses). Columns are:

	- confirmation block height
	- number of transaction within block
	- total deposits (new outputs)
	- total withdrawals (spent outputs)
	- running balance
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
	bal = 0
	for height, n, o_val, i_val in merge_moves(outs.fetchall(), ins.fetchall()):
		bal = bal + o_val - i_val
		stdout.write('%s\t%s\t%s\t%s\t%s\n' % (height, n, format_coin(o_val), format_coin(-i_val), format_coin(bal)))

A helper is used to join the two possibly uneven lists by transaction, inserting zeros for transactions found on only one side. Perhaps it could all be done in SQL with subqueries and some type of outer joins, but I wasn't quite seeing it, so resorted to the low level with an algorithm reminiscent of the merging step of classical mergesort.

# Merge ordered lists of total input and output values per transaction into single table with columns for both.
def merge_moves(outs, ins):
	i = o = 0

	while True:
		if o == len(outs):
			for height, n, val in ins[i:]:
				yield (height, n, 0, val)
		o_height, o_n, o_val = outs[o]
		o_key = (o_height, o_n)

		if i == len(ins):
			for height, n, val in outs[o:]:
				yield (height, n, val, 0)
		i_height, i_n, i_val = ins[i]
		i_key = (i_height, i_n)

		if o_key < i_key:
			yield (o_height, o_n, o_val, 0)
			o += 1
		elif i_key < o_key:
			yield (i_height, i_n, 0, i_val)
			i += 1
			yield (o_height, o_n, o_val, i_val)
			i += 1
			o += 1

Next, the input commands. For sanity's sake, we exclude newlines in tag names as implicitly required by the tags listing format.

def cmd_watch(argv):
	watch [TAG]

	Import a set of addresses to watch linewise from stdin, optionally named by the given TAG. Addresses can be associated with multiple tags using multiple watch commands.
	tag_id = None
	if len(argv) > 0:
		name = argv.pop(0)
		if '\n' in name:
			die('newline not allowed in tag name')
		tag_id = insert_or_get_tag_id(name)
	while True:
		l = stdin.readline()
		if len(l) == 0:
		addr_id = insert_or_get_address_id(parse_address(l.rstrip('\n')))
		if tag_id is not None:
				db.execute('INSERT INTO address_tag (address_id, tag_id) VALUES (?,?)',
						(addr_id, tag_id))
			except IntegrityError:

def cmd_push(argv):

	Import raw hex transactions linewise from stdin and send to bitcoind.
	while True:
		line = stdin.readline()
		if len(line) == 0:
		tx_hex = line.rstrip('\n')
		stdout.write('txid %s\n' % rpc('sendrawtransaction', tx_hex))

General or command-specific help, and a command registry allowing abbreviation:

def cmd_help(argv):
	help [COMMAND]

	Display help for a given command or list all commands.
	if len(argv) > 0:
		name = argv.pop(0)
		name, func = get_command(name)
		doc = getdoc(func)
		if doc is None:
			stdout.write('No help for %r\n' % name)
			stdout.write('gbw-node %s\n' % doc)
		stdout.write('''Usage: gbw-node COMMAND [ARGS]

Available commands (can be abbreviated when unambiguous):

''' % '\n'.join([name for name, f in cmds]))

cmds = (
	('help', cmd_help),
	('scan', cmd_scan),
	('reset', cmd_reset),
	('tags', cmd_tags),
	('addresses', cmd_addresses),
	('unspent-outs', cmd_unspent_outs),
	('watch', cmd_watch),
	('push', cmd_push),
	('balance', cmd_balance),
	('register', cmd_register),

def get_command(name):
	rows = [r for r in cmds if r[0].startswith(name)]
	if len(rows) == 0:
		die('command not found: %s' % name)
	if len(rows) > 1:
		die('ambiguous command %s. Completions: %s' % (name, ' '.join([r[0] for r in rows])))
	return rows[0]

When invoked as a program (as opposed to imported elsewhere e.g. for testing), we connect to the database, enable foreign key constraints, and boost cache size and checkpoint interval from the meager defaults. These can be tuned if needed to optimize the scan process for your machine. Finally we dispatch to the given command.

Ideally, we'd create the database from schema here if not found.

def main():
	global db
	signal.signal(signal.SIGINT, signal.SIG_DFL)
	db = sqlite3.connect(gbw_home + '/db', timeout=600) # in seconds
	db.execute('PRAGMA foreign_keys=ON')
	db.execute('PRAGMA cache_size=-8000') # negative means in KiB
	db.execute('PRAGMA wal_autocheckpoint=10000') # in pages (4k)
	if len(argv) < 2:
		die('missing command', help=True)

if __name__ == '__main__':

This concludes the node frontend. Congratulations if you've followed thus far! There's no magic in programming, just a ruthless decomposition of bigger problems into smaller ones, a search for useful and robust abstractions -- and of course a whole lot of background reading and practice.

In the next month or two I will be completing the missing pieces of the signer; meanwhile, the code here is quite ready to play with. Import some addresses, run a scan, run the reports, and let me know how it goes in the comments below.

  1. Well, so far the only such case is format_address, so perhaps it should just be changed to allow passing a buffer. [^]
  2. It's looking like the COALESCE trick is pointless here, since rows are only generated by the join when matching outputs are present; that is, the SUM aggregation is always getting at least one row. Was I overzealous before? I don't recall if I observed an actual problem here rather than just in cmd_balance. It does no harm to leave in though, at least as far as correctness. [^]


Draft gbw-node frontend, part 5

Filed under: Bitcoin, Software — Jacob Welsh @ 19:02

Continued from:

Command implementations

The core scanning logic is in a helper function that takes a block's height and a memory view of its contents.

Referential integrity between blocks is ensured by scanning sequentially by height; that is, all relevant tx and output records from prior blocks will be known by the time we see the inputs that spend them. However, as far as I know this topological ordering is not guaranteed for the transaction sequence within a block (eg. tx 1 could spend outputs of tx 2, or vice versa) so we do separate passes over the transaction list for outputs and inputs.

def scan_block(height, v):
	stdout.write('block %s' % height)
	# [perf] computing every tx hash
	(blkhash, prev, time, target, txs), size = load_block(v)

The performance comment above was just to note some not-strictly-necessary work being done, in case the scan ended up horribly slow.(i)

An output is relevant if its script is standard and pays a known address. At least with foreign key constraints enabled, we can't insert an output until the tx record it references exists, but we don't know whether to insert the tx until we see if any of its outputs are relevant, so we again use a two-pass approach.

	count_out = 0
	n_tx = 0
	for (hash, size, txins, txouts) in txs:
		matched_outs = []
		for n, txout in enumerate(txouts):
			val, script = txout
			a = out_script_address(script)
			if a is not None:
				#print format_address(a)
				addr_id = get_address_id(a)
				if addr_id is not None:
					matched_outs.append((n, addr_id, val))
		if len(matched_outs) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, addr_id, val in matched_outs:
				insert_output(tx_id, n, addr_id, val)
			count_out += len(matched_outs)
		n_tx += 1
	stdout.write(' new-outs %s' % count_out)

An input is relevant if it spends a known output. Recall that insert_input updates the corresponding output to create the back-reference, indicating it has been spent.

	# Inputs scanned second in case an output from the same block is spent.
	# Coinbase (input of first tx in block) doesn't reference anything.
	count_in = 0
	n_tx = 1
	for (hash, size, txins, txouts) in txs[1:]:
		matched_ins = []
		for n, txin in enumerate(txins):
			prevout_hash, prevout_n, scriptsig = txin
			prevout_tx_id = get_tx_id(prevout_hash)
			if prevout_tx_id is not None:
				prevout_id = get_output_id(prevout_tx_id, prevout_n)
				if prevout_id is not None:
					matched_ins.append((n, prevout_id))
		if len(matched_ins) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, prevout_id in matched_ins:
				insert_input(tx_id, n, prevout_id)
			count_in += len(matched_ins)
		n_tx += 1
	stdout.write(' spent-outs %s\n' % count_in)

Assorted helpers: handling usage errors; looking up a tag ID that must exist.

def die(msg, help=False):
	stderr.write('gbw-node: %s\n' % msg)
	if help:

def require_tag(name):
	i = get_tag_id(name)
	if i is None:
		die('tag not found: %r' % name)
	return i

The entry point for any user command "X" is the function "cmd_X", having help text in its docstring and taking a list of any supplied CLI arguments past the command name.

First, the sync commands. The scan process commits one database transaction per block.

def cmd_scan(argv):

	Iterate blocks from bitcoind, indexing transaction inputs and outputs affecting watched addresses. May be safely interrupted and resumed.

	NOT PRESENTLY SAFE TO RUN CONCURRENT INSTANCES due to the dumpblock to named pipe kludge.
	db.execute('PRAGMA synchronous=NORMAL')
	height = db.execute('SELECT scan_height FROM state').fetchone()[0]
	blockcount = max(-1, rpc('getblockcount') - CONFIRMS)
	while height < blockcount:
		height += 1
		scan_block(height, memoryview(getblock(height)))
		db.execute('UPDATE state SET scan_height = ?', (height,))

def cmd_reset(argv):

	Reset the scan pointer so the next scan will proceed from the genesis block, to find transactions associated with newly watched addresses.
	db.execute('UPDATE state SET scan_height = -1')

Next, commands to query the watched address sets (not in the original spec but trivial and clearly useful).

def cmd_tags(argv):

	List all tag names.
	for name, in db.execute('SELECT name FROM tag'):
		stdout.write(name + '\n')

def cmd_addresses(argv):
	addresses [TAG]

	List addresses with the given TAG (or all watched addresses).
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address FROM address \
				JOIN address_tag ON address.address_id=address_tag.address_id \
				WHERE tag_id=?', (tag_id,))
		r = db.execute('SELECT address FROM address')
	for a, in r:
		stdout.write(format_address(str(a)) + '\n')

To be continued.

  1. I've found the Python profiler quite useful so far compared to such guesswork; still, optimization is something of a balance between experimentally-driven efforts and not doing obviously wasteful things from the start. [^]

Draft gbw-node frontend, part 4

Filed under: Bitcoin, Software — Jacob Welsh @ 04:36

Continued from:

Common database operations

As an internal convention, a "get_X_id" function will return the database ID for the row in table "X" named by its bulkier external reference, or None if not found. Similarly, "insert_or_get_X_id" will insert a row if needed and in either case return the ID. Some of these have only a single caller, but I find that collecting the various similar queries in one place and wrapping them into tidy functions helps readability.

The mapping of Python to SQLite types is fairly straightforward, except that buffer is needed to specify a BLOB.

The "parameter substitution" feature is used throughout, avoiding improper mixing of code and data that could manifest as SQL injection or thrashing the compiled statement cache.

def get_address_id(a):
	r = db.execute('SELECT address_id FROM address WHERE address=?', (buffer(a),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_address_id(a):
	i = get_address_id(a)
	if i is not None:
		return i
	return db.execute('INSERT INTO address (address) VALUES (?)', (buffer(a),)).lastrowid

def get_tx_id(hash):
	r = db.execute('SELECT tx_id FROM tx WHERE hash=?', (buffer(hash),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tx_id(hash, blkhash, height, n, size):
		return db.execute('INSERT INTO tx (hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)',
				(buffer(hash), buffer(blkhash), height, n, size)).lastrowid
	except IntegrityError:
		# XXX check equality?
		return get_tx_id(hash)

I now think we should indeed catch that condition (differing transactions with identical hash), especially given the possibility of TXID collisions. Perhaps I left it out from excessive worry about scan performance. Or just laziness.

The mixture of check-first and try-first styles seen above also doesn't sit well. The possibility of TOCTTOUs,(i) depending on the details of transaction isolation level, would seem to make a strong case for try-first. It's a minor point though; the worst case here would be an uncaught IntegrityError halting the program gracefully.

def insert_output(tx_id, n, addr_id, val):
		db.execute('INSERT INTO output (tx_id, n, address_id, value) VALUES (?,?,?,?)',
				(tx_id, n, addr_id, val))
	except IntegrityError:
		r = db.execute('SELECT address_id, value FROM output WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()
		if r != (addr_id, val):
			raise Conflict('output differs from previous content', tx_id, n, (addr_id, val), r)

def insert_input(tx_id, n, prevout_id):
		input_id = db.execute('INSERT INTO input (tx_id, n) VALUES (?,?)', (tx_id, n)).lastrowid
	except IntegrityError:
		input_id = db.execute('SELECT input_id FROM input WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()[0]
	db.execute('UPDATE output SET spent=? WHERE output_id=?', (input_id, prevout_id))

def get_output_id(tx_id, n):
	r = db.execute('SELECT output_id FROM output WHERE tx_id=? AND n=?', (tx_id, n)).fetchone()
	return None if r is None else r[0]

def get_tag_id(name):
	r = db.execute('SELECT tag_id FROM tag WHERE name=?', (name,)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tag_id(name):
	i = get_tag_id(name)
	if i is not None:
		return i
	return db.execute('INSERT INTO tag (name) VALUES (?)', (name,)).lastrowid

Next up, we'll finally get to implementing the commands themselves. To be continued.

  1. The "time of check to time of use" race condition. You know, like sitting down when some trickster's meanwhile moved the chair. [^]


Draft gbw-node frontend, part 3

Filed under: Bitcoin, Software — Jacob Welsh @ 18:02

Continued from:


Bitcoin addresses are conventionally written in a special-purpose encoding and include a hash truncated to 32 bits for error detection. As the reference implementation explains:

Why base-58 instead of standard base-64 encoding?
- Don't want 0OIl characters that look the same in some fonts and could be used to create visually identical looking account numbers.
- A string with non-alphanumeric characters is not as easily accepted as an account number.
- E-mail usually won't line-break if there's no punctuation to break at.
- Doubleclicking selects the whole number as one word if it's all alphanumeric.

Of course, all these points would have been answered just as well by hexadecimal, and without the various burdens: case-sensitivity for the user (the surest way I've found to read these out is the fully explicit: "one five big-A three little-X ..."); more code for the implementer; and more work for the machine (as the lack of bit alignment demands a general base conversion algorithm).

We start with lookup tables to convert the digits 0-57 to the specified alphabet and back. I was once surprised to learn the scope of iteration variables in a Python "for" loop is not restricted to the loop body: a potential source of referential confusion, reflecting the language's casual approach to mutation. Thus, when at the global scope I like to ensure throwaway names, like "index" and "character" here, are safely contained in a function.

base58_alphabet = (string.digits + string.uppercase + string.lowercase).translate(None, '0OIl')
base58_inverse = [None]*256
def init_base58_inverse():
	for index, character in enumerate(base58_alphabet):
		base58_inverse[ord(character)] = index

To do base conversion we'll need to treat byte sequences as integers with the same ordering conventions as the reference code. Otherwise put: to decode from base-256 to abstract integers. Python 2 doesn't have a builtin for this. The algorithm is not optimal, but the base-58 part will be worse anyway.

def bytes_to_int(b):
	"Convert big-endian byte sequence to unsigned integer"
	i = 0
	for byte in b:
		i = (i << 8) + ord(byte)
	return i

To complete the bytes-to-ASCII converter we extract digits from the integer, least significant first, by iterated division with remainder by 58. Since the conversion to integer loses track of field width, the convention is to pad with the same number of base-58 zeros as there were base-256 leading zeros in the input. In further fallout from using a non-bit-aligned encoding, these are not naturally constant time or constant control-flow operations.

For the same bit cost of the error detection code we could have had error correction. But that would have required, like, math, and stuff.

def b2a_base58check(data):
	data += sha256d(data)[:4]

	leading_zeros = 0
	for b in data:
		if b != '\x00':
		leading_zeros += 1

	data_num = bytes_to_int(data)

	digits = []
	while data_num:
		data_num, digit = divmod(data_num, 58)
	digits.extend([0] * leading_zeros)

	return ''.join(base58_alphabet[digit] for digit in reversed(digits))

Converting back to bytes uses the inverse operation at each step, but now there are cases of invalid input to reject: digits outside the specified alphabet and corruption detected by the checksum. (The precise function decomposition is a bit arbitrary and asymmetrical I'll admit.)

class Base58Error(ValueError):

class BadDigit(Base58Error):

class BadChecksum(Base58Error):

def a2b_base58(data):
	digits = [base58_inverse[ord(b)] for b in data]
	if None in digits:
		raise BadDigit

	leading_zeros = 0
	for digit in digits:
		if digit != 0:
		leading_zeros += 1

	data_num = 0
	for digit in digits:
		data_num = 58*data_num + digit

	data_bytes = []
	while data_num:
		data_bytes.append(data_num & 0xFF)
		data_num = data_num >> 8
	data_bytes.extend([0] * leading_zeros)

	return ''.join(chr(b) for b in reversed(data_bytes))

def a2b_base58check(data):
	data = a2b_base58(data)
	payload = data[:-4]
	check = data[-4:]
	if check != sha256d(payload)[:4]:
		raise BadChecksum
	return payload

Finally we apply this encoding to Bitcoin addresses, which have a fixed 160-bit width plus an extra "version" byte that becomes the familiar leading "1".

class BadAddressLength(ValueError):

class BadAddressVersion(ValueError):

def parse_address(a):
	b = a2b_base58check(a)
	if len(b) != 21:
		raise BadAddressLength
	if b[0] != '\x00':
		raise BadAddressVersion(ord(b[0]))
	return b[1:]

def format_address(b):
	return b2a_base58check('\x00' + b)

All this format conversion groundwork out of the way, we'll start talking to the database and putting it all together. To be continued!


Draft gbw-node frontend, part 2

Filed under: Bitcoin, Software — Jacob Welsh @ 18:52

Continued from schema and part 1. Source:

What follows is a kluge and probably the sorest point of the thing in my mind. I will attempt to retrace the decision tree that led to it.

Ultimately what we're after is a way to examine all confirmed transactions so as to filter for relevant ones and record these in the database ("scanning"). Since on disk they're already grouped into blocks of constrained size, it makes sense from an I/O standpoint to load them blockwise. But how? We could read directly from the blk####.dat files, but this would go against the "loose coupling" design. Concretely, this manifests as several messes waiting to happen: do we know for sure that only validated blocks get written to these files? How will we detect and handle blocks that were once valid but abandoned when a better chain was found? Might we read a corrupt block if the node is concurrently writing? Much more sensible would be to work at the RPC layer which abstracts over internal storage details and provides the necessary locking.

Unfortunately there is presently no getblock method in TRB, only dumpblock which strikes me as more of a debugging hack than anything else, returning its result by side effect through the filesystem. I worried this could impose a substantial cost in performance and SSD wear, once multiplied across the many blocks and possible repeated scans, especially if run on a traditional filesystem with more synchronous operations than the likes of ext4. It occurred to me to use a named pipe to allow the transfer to happen in memory without requiring changes to the TRB code. I took this route, after some discussion with management confirmed that minimizing TRB changes was preferable. I hadn't really spelled out the thinking on files versus pipes, or anticipated what now seems a fundamental problem with the pipe approach: if the reading process aborts for any reason, the write operation in bitcoind can't complete; in fact, due to coarse locking, the whole node ends up frozen. (This can be cleaned up manually e.g. by cat ~/.gbw/blockpipe >/dev/null).

Given this decision, the next problem was the ordering of reads and writes, as reading the block data through the pipe must complete before the RPC method will return. Thus some sort of threading is needed: either decomposing the RPC client into separate send and receive parts so as to read the pipe in between, or using a more general facility. Hoping to preserve the abstraction of the RPC code, I went with the latter in the form of Python's thread support. I tend to think this was a mistake; besides all the unseen complexity it invokes under the hood, it turned out the threading library provides no reliable way to cancel a thread, which I seemed to need for the case where the RPC call returns an error without sending data. I worked around by making the reader a long-lived "daemon" thread, which ensures it terminates with the program, and using an Event object to synchronize handoff of the result through a global variable.

getblock_thread = None
getblock_done = Event()
getblock_result = None
def getblock_reader(pipe):
	global getblock_result
	while True:
		fd = os_open(pipe, O_RDONLY)
		getblock_result = read_all(fd)

def getblock(height):
	global getblock_thread
	pipe = gbw_home + '/blockpipe'
	if getblock_thread is None:
		getblock_thread = Thread(target=getblock_reader, args=(pipe,))
		getblock_thread.daemon = True
	if not rpc('dumpblock', height, pipe):
		raise ValueError('dumpblock returned false')
	return getblock_result

The astute reader may notice a final and rather severe "rake" waiting to be stepped on: a fixed filesystem path is used for the named pipe, so concurrent scan processes will interfere with the result of corrupted blocks and all sorts of possible failures following. While I'm not sure there's any reason to run such concurrent scans, at least this possibility should be excluded. I'm thinking the process ID would make a sufficiently unique identifier to add to the filename, but then the pipe will need to be removed afterward, and what if the process dies first? Yet more code to clean up possibly-defunct pipes? Or one could keep the fixed name and use filesystem locking. At any rate I hope this all makes it clear why I'm not too enamored of dumpblock.

To be continued.


Errata for gbw-node drafts to date, and Bitcoin txid collisions

Filed under: Bitcoin, Software — Jacob Welsh @ 18:11

I've discovered a few mistakes in my wallet code published so far, one from pure carelessness and two from insufficient cognizance of environmental hazards, which together seem interesting enough for a dedicated article.

The first is in the HTTP Basic authentication header in the JSON-RPC client: I slipped in an erroneous space character after the colon in the base64 input, in the course of a hasty and perhaps excessive last-minute style cleanup. I discovered this through basic testing as it broke authentication altogether. I will update the article but preserve the original file for whoever cares to diff.

The second, found in code not yet written up, is the obnoxious but standardized behavior of the SQL SUM function, whereby the sum of the empty set is NULL rather than zero. In Python this becomes the None object then my code passes it to functions that expect integers. I first tried to work around at the Python level, but soon found this to be awkward especially for queries returning multiple columns where some were perfectly well-behaved integers. A fix closer to the source of the problem is found in the standard SQL coalesce function, though having to use this as boilerplate around every use of SUM is not exactly satisfying.

The above fixes are in: draft2/

The third and deepest might not even be a problem in practice, but seems to warrant further investigation. From the schema:

CREATE UNIQUE INDEX i_tx_hash ON tx(hash);

The problem is that Bitcoin doesn't guarantee uniqueness of transaction contents - miners can use identical coinbase inputs - despite the fact that the implementation assumes unique transaction hashes! The possibility of collision was realized in 2010,(i) condemning all future implementers to bugwise compatibility, whatever that means. The Power Rangers in their boundless wisdom addressed this in BIP 30 except without solving all that much. Further discussions I've been chewing on include Mirco… Mezzo… Macroflation—Overheated Economy (archived) and of course the forum log. Takeaways so far are that 1) the sky is probably (still) not falling regarding Bitcoin itself, but 2) this could possibly be used by malicious peers to hard-wedge syncing TRB nodes.

The conservative approach for my program would seem to be leaving the schema as is and letting SQL throw an integrity error if you try to monitor the addresses in question. Relaxing the unique constraint should be possible with some further changes to the code, but the question would arise of how exactly the quasi-duplicate outputs should be interpreted. Please speak up in the comments if you know of further references or conclusions on this topic!

  1. Block pair 91722/91880 paying address 1GktTvnY8KGfAS72DhzGYJRyaQNvYrK9Fg and 91812/91842 paying address 16va6NxJrMGe5d2LP6wUzuVnzBBoKQZKom. [^]


Draft gbw-node frontend, part 1

Filed under: Bitcoin, Software — Jacob Welsh @ 01:18

The database schema for the "node" part of Gales Bitcoin Wallet covered, we now proceed to the frontend program that puts it to work: collecting data from bitcoind, parsing its various binary encodings and extracting something useful.

Source: draft1/ draft2/ (and the previous schema-node.sql).

You'd be well advised to read the downloaded thing prior to executing, especially since it's in an unsigned draft state. As for what's necessary to graduate to a vpatch I'd be ready to sign, my thinking is that it's this review and annotation process itself, plus whatever important changes come out of it, and the previously suggested schema tweaks (since changing that is the most obnoxious part once deployed).

At present there's not much of an installation process and the database is initialized manually. I'd suggest creating some directory to hold the two sources. Then from that directory:

$ chmod +x
$ mkdir ~/.gbw
$ sqlite3 ~/.gbw/db < schema-node.sql
$ ./ help

In preparing this code for publication I observed that I had continued by force of habit (and editor settings) with the Python style guidelines of four-space indents and some fixed line width limit, in opposition to Republican doctrine. I've attempted to clean it up such that line breaks occur only for good reasons, though I can't say I'm happy with how my browser wraps the long lines. And it's not like I expect the poor thing to know good indentation rules for every possible programming language now... wut do?!


We start with the usual Pythonistic pile of imports. The ready libraries are a big reason the language is hard to beat for getting things working quickly, and at the same time a dangerous temptation toward thinking you don't need to care what's inside them.

# J. Welsh, December 2019

from os import getenv, open as os_open, O_RDONLY, O_WRONLY, mkdir, mkfifo, read, write, close, stat
from stat import S_ISDIR, S_ISFIFO
from sys import argv, stdin, stdout, stderr, exit
from socket import socket
from threading import Thread, Event
from binascii import a2b_hex, b2a_hex
from base64 import b64encode
from struct import Struct
from hashlib import sha256 as _sha256
from decimal import Decimal
from inspect import getdoc
import errno
import signal
import string
import json
import sqlite3
from sqlite3 import IntegrityError

The above are all in the standard library, assuming they're enabled on your system. The ones that stick out like sore thumbs to me are threading and decimal; more on these to come.

As the comments say:

# Safety level: scanning stops this many blocks behind tip

# There's no provision for handling forks/reorgs. In the event of one deeper than CONFIRMS, a heavy workaround would be:
#   $ sqlite3 ~/.gbw/db
#   sqlite> DELETE FROM output;
#   sqlite> DELETE FROM input;
#   sqlite> DELETE FROM tx;
#   sqlite> .exit
#   $ gbw-node reset
#   $ gbw-node scan

At least a semi-automated and lighter-touch recovery procedure would certainly be nice there.

gbw_home = getenv('HOME') + '/.gbw'
bitcoin_conf_path = getenv('HOME') + '/.bitcoin/bitcoin.conf'

# Further knobs in main() for database tuning.
db = None

This Is The Database; Use It.

For reasons I don't quite recall (probably interpreting hashes as integers, combined with pointer type punning - an unportable C programming practice common in Windows-land), bitcoind ended up reversing byte order compared to the internal representation for hex display of certain things including transaction and block hashes. Thus we have "bytes to little-endian hex" wrappers.

b2lx = lambda b: b2a_hex(b[::-1])
lx2b = lambda x: a2b_hex(x)[::-1]

Not taking any chances with display of monetary amounts, a function to convert integer Satoshi values to fixed-point decimal BTC notation. The remainder/modulus operators have varying definitions between programming languages (sometimes even between implementations of the same language!) when it comes to negative inputs, so we bypass the question.

def format_coin(v):
	neg = False
	if v < 0:
		v = -v
		neg = True
	s = '%d.%08d' % divmod(v, 100000000)
	if neg:
		return '-' + s
	return s

Preloading and giving more intelligible names to some "struct" based byte-packing routines.

u16 = Struct('<H')
u32 = Struct('<I')
u64 = Struct('<Q')
s64 = Struct('<q')
unpack_u16 = u16.unpack
unpack_u32 = u32.unpack
unpack_u64 = u64.unpack
unpack_s64 = s64.unpack
unpack_header = Struct('<i32s32sIII').unpack
unpack_outpoint = Struct('<32sI').unpack

Some shorthand for hash functions.

def sha256(v):
	return _sha256(v).digest()

def sha256d(v):
	return _sha256(_sha256(v).digest()).digest()

An exception type to indicate certain "should not happen" database inconsistencies.

class Conflict(ValueError):

For reading a complete stream from a low-level file descriptor; experience has led me to be suspicious of Python's file objects.

def read_all(fd):
	parts = []
	while True:
		part = read(fd, 65536)
		if len(part) == 0:
	return ''.join(parts)

Ensuring needed filesystem objects exist.

def require_dir(path):
	except OSError, e:
		if e.errno != errno.EEXIST:
		if not S_ISDIR(stat(path).st_mode):
			die('not a directory: %r' % path)

def require_fifo(path):
	except OSError, e:
		if e.errno != errno.EEXIST:
		if not S_ISFIFO(stat(path).st_mode):
			die('not a fifo: %r' % path)

RPC client

Bitcoind uses a password-authenticated JSON-RPC protocol. I expect this is one of the more concise client implementations around.

class JSONRPCError(Exception):
	"Error returned in JSON-RPC response"

	def __init__(self, error):
		super(JSONRPCError, self).__init__(error['code'], error['message'])

	def __str__(self):
		return 'code: {}, message: {}'.format(*self.args)

Some of this code was cribbed from earlier experiments on my shelf. The fancy exception class above doesn't really look like my style; it may have hitchhiked from an outside JSON-RPC library.

The local bitcoin.conf is parsed to get the node's credentials. This is done lazily to avoid unnecessary error conditions for the many commands that won't be needing it.

bitcoin_conf = None
def require_conf():
	global bitcoin_conf
	if bitcoin_conf is None:
		bitcoin_conf = {}
		with open(bitcoin_conf_path) as f:
			for line in f:
				line = line.split('#', 1)[0].rstrip()
				if not line:
				k, v = line.split('=', 1)
				bitcoin_conf[k.strip()] = v.lstrip()

Side note: I detest that "global" keyword hack. It's "necessary" only because variable definition is conflated with mutation in the single "=" operator, and completely misses the case of a nested function setting a variable in an outer but not global scope. ("So they added 'nonlocal' in Python 3, solves your problem!!")

def rpc(method, *args):
	host = bitcoin_conf.get('rpcconnect', '')
	port = int(bitcoin_conf.get('rpcport', 8332))
	auth = 'Basic ' + b64encode('%s:%s' % (
		bitcoin_conf.get('rpcuser', ''),
		bitcoin_conf.get('rpcpassword', '')))
	payload = json.dumps({'method': method, 'params': args})
	headers = [
		('Host', host),
		('Content-Type', 'application/json'),
		('Content-Length', len(payload)),
		('Connection', 'close'),
		('Authorization', auth),
	msg = 'POST / HTTP/1.1\r\n%s\r\n\r\n%s' % ('\r\n'.join('%s: %s' % kv for kv in headers), payload)
	sock = socket()
	sock.connect((host, port))
	response = read_all(sock.fileno())
	headers, payload = response.split('\r\n\r\n', 1)
	r = json.loads(payload, parse_float=Decimal)
	if r['error'] is not None:
		raise JSONRPCError(r['error'])
	return r['result']

I could see removing the "parse_float=Decimal", and thus the corresponding import, as we won't be calling here any of the problematic interfaces that report monetary values as JSON numbers. But then, I'd also see value in one RPC client implementation that can just be copied for whatever use without hidden hazards.

Bitcoin data parsing

Now things might get interesting. To parse the serialized data structures in a manner similar to the C++ reference implementation and hopefully efficient besides, I used memory views, basically bounds-checking pointers.(i)

# "load" functions take a memoryview and return the object and number of bytes consumed.

def load_compactsize(v):
	# serialize.h WriteCompactSize
	size = ord(v[0])
	if size < 253:
		return size, 1
	elif size == 253:
		return unpack_u16(v[1:3])[0], 3
	elif size == 254:
		return unpack_u32(v[1:5])[0], 5
		return unpack_u64(v[1:9])[0], 9

def load_string(v):
	# serialize.h Serialize, std::basic_string and CScript overloads
	n, i = load_compactsize(v)
	return v[i:i+n].tobytes(), i+n

def vector_loader(load_element):
	# serialize.h Serialize_impl
	def load_vector(v):
		n, i = load_compactsize(v)
		r = [None]*n
		for elem in xrange(n):
			r[elem], delta = load_element(v[i:])
			i += delta
		return r, i
	return load_vector

def load_txin(v):
	# main.h CTxIn
	i = 36
	txid, pos = unpack_outpoint(v[:i])
	scriptsig, delta = load_string(v[i:])
	i += delta
	i += 4 # skipping sequence
	return (txid, pos, scriptsig), i

load_txins = vector_loader(load_txin)

def load_txout(v):
	# main.h CTxOut
	i = 8
	value, = unpack_s64(v[:i])
	scriptpubkey, delta = load_string(v[i:])
	return (value, scriptpubkey), i+delta

load_txouts = vector_loader(load_txout)

def load_transaction(v):
	# main.h CTransaction
	i = 4 # skipping version
	txins, delta = load_txins(v[i:])
	i += delta
	txouts, delta = load_txouts(v[i:])
	i += delta
	i += 4 # skipping locktime
	hash = sha256d(v[:i])
	return (hash, i, txins, txouts), i

load_transactions = vector_loader(load_transaction)

def load_block(v):
	# main.h CBlock
	i = 80
	head = v[:i]
	version, prev, root, time, target, nonce = unpack_header(head)
	hash = sha256d(head)
	txs, delta = load_transactions(v[i:])
	return (hash, prev, time, target, txs), i+delta

The code dig to come up with this magic for identifying standard pay-to-pubkey-hash outputs and extracting the enclosed addresses was ugly.

def out_script_address(s):
	if len(s) == 25 and s[:3] == '\x76\xA9\x14' and s[23:] == '\x88\xAC':
		return s[3:23]
	return None

To be continued.(ii)

Updated for errata.

  1. I'm just now noticing these were added in 2.7, ugh... sorry, 2.6 users. [^]
  2. My blog will be going on hiatus as far as new articles until early January. There's quite a ways to go on this file and I might not make it all the way through on this pass. If the suspense gnaws, you can always keep reading the source! [^]
« Newer PostsOlder Posts »

Powered by MP-WP. Copyright Jacob Welsh.