Fixpoint

2020-01-20

Draft gbw-node frontend, part 6

Filed under: Bitcoin, Software — Jacob Welsh @ 21:32

Continued from: schema, part 1, part 2, part 3, part 4, part 5.

The first of the input/output commands is to print a table of possibly-spendable outputs in the format required by the offline signer. While the Bitcoin protocol refers to transactions by 256-bit hash, the more compact confirmation coordinates (height, index) are included for convenience in the comment field. The queries are a bit lengthy, since we now join several tables to build the flat output file, but aren't doing anything too fancy once you break them down. The only difference when a tag is specified is the extra join to filter on its ID.

In some cases, BLOB fields need to be converted back to str.(i)

def cmd_unspent_outs(argv):
	'''
	unspent-outs [TAG]

	Display the unspent outputs table for addresses with the given TAG (or all watched addresses), as required by the offline wallet, ordered by age.
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=? \
				ORDER BY block_height DESC', (tag_id,))
	else:
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				WHERE spent IS NULL \
				ORDER BY block_height DESC')
	for a, v, hash, n_out, height, n_tx in r:
		stdout.write('%s %s %s %s #blk %s tx %s\n' % (format_address(str(a)), format_coin(v), b2lx(hash), n_out, height, n_tx))

Idea: Add a command to print the outputs table for a given raw transaction. For example, this would enable spending unconfirmed or too recently confirmed outputs in a pinch, without requiring any further changes. Or more generally: all the data conversion code is already here so might as well make it accessible.
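
A minimal sketch of how that might look, reusing the parsers from part 1 (the command name tx-outs and the exact output format are hypothetical; an unconfirmed transaction naturally has no block coordinates for the comment field):

def cmd_tx_outs(argv):
	'''
	tx-outs

	Display the outputs table for raw hex transactions read linewise from stdin. (Hypothetical.)
	'''
	while True:
		line = stdin.readline()
		if len(line) == 0:
			break
		raw = a2b_hex(line.rstrip('\n'))
		(hash, size, txins, txouts), consumed = load_transaction(memoryview(raw))
		for n, (val, script) in enumerate(txouts):
			a = out_script_address(script)
			if a is not None:
				stdout.write('%s %s %s %s\n' % (format_address(a), format_coin(val), b2lx(hash), n))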

Next we proceed to the accounting commands, as they're really just another kind of output command. The balance of an address set is the total value of unspent outputs to addresses in the set.

def cmd_balance(argv):
	'''
	balance [TAG]

	Display confirmed balance of addresses with the given TAG (or all watched addresses).
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output WHERE spent IS NULL')
	bal, = r.fetchone()
	stdout.write('%s\n' % format_coin(bal))

Things get trickier for the register report, as it attempts to usefully summarize several things in a small space. In particular, summing the incoming and outgoing value per transaction seems to require separate queries, since the join criteria differ.(ii)

def cmd_register(argv):
	'''
	register [TAG]

	Display a tab-delimited transaction register report for addresses with the given TAG (or all watched addresses). Columns are:

	- confirmation block height
	- number of transaction within block
	- total deposits (new outputs)
	- total withdrawals (spent outputs)
	- running balance
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
	else:
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
	bal = 0
	for height, n, o_val, i_val in merge_moves(outs.fetchall(), ins.fetchall()):
		bal = bal + o_val - i_val
		stdout.write('%s\t%s\t%s\t%s\t%s\n' % (height, n, format_coin(o_val), format_coin(-i_val), format_coin(bal)))

A helper is used to join the two possibly uneven lists by transaction, inserting zeros for transactions found on only one side. Perhaps it could all be done in SQL with subqueries and some type of outer join, but I wasn't quite seeing it, so I resorted to the low level with an algorithm reminiscent of the merging step of classical mergesort.

# Merge ordered lists of total input and output values per transaction into single table with columns for both.
def merge_moves(outs, ins):
	i = o = 0

	while True:
		if o == len(outs):
			for height, n, val in ins[i:]:
				yield (height, n, 0, val)
			return
		o_height, o_n, o_val = outs[o]
		o_key = (o_height, o_n)

		if i == len(ins):
			for height, n, val in outs[o:]:
				yield (height, n, val, 0)
			return
		i_height, i_n, i_val = ins[i]
		i_key = (i_height, i_n)

		if o_key < i_key:
			yield (o_height, o_n, o_val, 0)
			o += 1
		elif i_key < o_key:
			yield (i_height, i_n, 0, i_val)
			i += 1
		else:
			yield (o_height, o_n, o_val, i_val)
			i += 1
			o += 1
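
To illustrate with made-up values: rows sharing (height, n) coordinates merge into one, while unmatched rows get a zero on the missing side.

outs = [(100, 0, 5000), (102, 1, 2500)]
ins  = [(100, 0, 1000), (101, 3, 4000)]
list(merge_moves(outs, ins))
# -> [(100, 0, 5000, 1000), (101, 3, 0, 4000), (102, 1, 2500, 0)]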

Next, the input commands. For sanity's sake, we exclude newlines in tag names as implicitly required by the tags listing format.

def cmd_watch(argv):
	'''
	watch [TAG]

	Import a set of addresses to watch linewise from stdin, optionally named by the given TAG. Addresses can be associated with multiple tags using multiple watch commands.
	'''
	tag_id = None
	if len(argv) > 0:
		name = argv.pop(0)
		if '\n' in name:
			die('newline not allowed in tag name')
		tag_id = insert_or_get_tag_id(name)
	while True:
		l = stdin.readline()
		if len(l) == 0:
			break
		addr_id = insert_or_get_address_id(parse_address(l.rstrip('\n')))
		if tag_id is not None:
			try:
				db.execute('INSERT INTO address_tag (address_id, tag_id) VALUES (?,?)',
						(addr_id, tag_id))
			except IntegrityError:
				pass
		db.commit()

def cmd_push(argv):
	'''
	push

	Import raw hex transactions linewise from stdin and send to bitcoind.
	'''
	while True:
		line = stdin.readline()
		if len(line) == 0:
			break
		tx_hex = line.rstrip('\n')
		stdout.write('txid %s\n' % rpc('sendrawtransaction', tx_hex))

General or command-specific help, and a command registry allowing abbreviation:

def cmd_help(argv):
	'''
	help [COMMAND]

	Display help for a given command or list all commands.
	'''
	if len(argv) > 0:
		name = argv.pop(0)
		name, func = get_command(name)
		doc = getdoc(func)
		if doc is None:
			stdout.write('No help for %r\n' % name)
		else:
			stdout.write('gbw-node %s\n' % doc)
	else:
		stdout.write('''Usage: gbw-node COMMAND [ARGS]

Available commands (can be abbreviated when unambiguous):

%s
''' % '\n'.join([name for name, f in cmds]))

cmds = (
	('help', cmd_help),
	('scan', cmd_scan),
	('reset', cmd_reset),
	('tags', cmd_tags),
	('addresses', cmd_addresses),
	('unspent-outs', cmd_unspent_outs),
	('watch', cmd_watch),
	('push', cmd_push),
	('balance', cmd_balance),
	('register', cmd_register),
)

def get_command(name):
	rows = [r for r in cmds if r[0].startswith(name)]
	if len(rows) == 0:
		die('command not found: %s' % name)
	if len(rows) > 1:
		die('ambiguous command %s. Completions: %s' % (name, ' '.join([r[0] for r in rows])))
	return rows[0]

When invoked as a program (as opposed to imported elsewhere e.g. for testing), we connect to the database, enable foreign key constraints, and boost cache size and checkpoint interval from the meager defaults. These can be tuned if needed to optimize the scan process for your machine. Finally we dispatch to the given command.

Ideally, we'd create the database from schema here if not found.
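
A sketch of how that might look, assuming the schema ships as a schema-node.sql file alongside the program (the function name and path handling are hypothetical):

def require_db(path):
	# Hypothetical: create the file and load the schema on first run,
	# replacing the manual sqlite3 step from part 1.
	from os.path import exists, dirname, join
	fresh = not exists(path)
	conn = sqlite3.connect(path, timeout=600)
	if fresh:
		with open(join(dirname(__file__), 'schema-node.sql')) as f:
			conn.executescript(f.read())
	return conn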

def main():
	global db
	signal.signal(signal.SIGINT, signal.SIG_DFL)
	require_dir(gbw_home)
	db = sqlite3.connect(gbw_home + '/db', timeout=600) # in seconds
	db.execute('PRAGMA foreign_keys=ON')
	db.execute('PRAGMA cache_size=-8000') # negative means in KiB
	db.execute('PRAGMA wal_autocheckpoint=10000') # in pages (4k)
	if len(argv) < 2:
		die('missing command', help=True)
	get_command(argv[1])[1](argv[2:])

if __name__ == '__main__':
	main()

This concludes the node frontend. Congratulations if you've followed thus far! There's no magic in programming, just a ruthless decomposition of bigger problems into smaller ones, a search for useful and robust abstractions -- and of course a whole lot of background reading and practice.

In the next month or two I will be completing the missing pieces of the signer; meanwhile, the code here is quite ready to play with. Import some addresses, run a scan, run the reports, and let me know how it goes in the comments below.

  1. Well, so far the only such case is format_address, so perhaps it should just be changed to allow passing a buffer. [^]
  2. It's looking like the COALESCE trick is pointless here, since rows are only generated by the join when matching outputs are present; that is, the SUM aggregation is always getting at least one row. Was I overzealous before? I don't recall if I observed an actual problem here rather than just in cmd_balance. It does no harm to leave in, though, at least as far as correctness goes. [^]

2020-01-19

Draft gbw-node frontend, part 5

Filed under: Bitcoin, Software — Jacob Welsh @ 19:02

Continued from: schema, part 1, part 2, part 3, part 4.

Command implementations

The core scanning logic is in a helper function that takes a block's height and a memory view of its contents.

Referential integrity between blocks is ensured by scanning sequentially by height; that is, all relevant tx and output records from prior blocks will be known by the time we see the inputs that spend them. However, as far as I know this topological ordering is not guaranteed for the transaction sequence within a block (e.g. tx 1 could spend outputs of tx 2, or vice versa), so we do separate passes over the transaction list for outputs and inputs.

def scan_block(height, v):
	stdout.write('block %s' % height)
	# [perf] computing every tx hash
	(blkhash, prev, time, target, txs), size = load_block(v)

The performance comment above was just to note some not-strictly-necessary work being done, in case the scan ended up horribly slow.(i)

An output is relevant if its script is standard and pays a known address. At least with foreign key constraints enabled, we can't insert an output until the tx record it references exists, but we don't know whether to insert the tx until we see if any of its outputs are relevant, so we again use a two-pass approach.

	count_out = 0
	n_tx = 0
	for (hash, size, txins, txouts) in txs:
		matched_outs = []
		for n, txout in enumerate(txouts):
			val, script = txout
			a = out_script_address(script)
			if a is not None:
				#print format_address(a)
				addr_id = get_address_id(a)
				if addr_id is not None:
					matched_outs.append((n, addr_id, val))
		if len(matched_outs) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, addr_id, val in matched_outs:
				insert_output(tx_id, n, addr_id, val)
			count_out += len(matched_outs)
		n_tx += 1
	stdout.write(' new-outs %s' % count_out)

An input is relevant if it spends a known output. Recall that insert_input updates the corresponding output to create the back-reference, indicating it has been spent.

	# Inputs scanned second in case an output from the same block is spent.
	# Coinbase (input of first tx in block) doesn't reference anything.
	count_in = 0
	n_tx = 1
	for (hash, size, txins, txouts) in txs[1:]:
		matched_ins = []
		for n, txin in enumerate(txins):
			prevout_hash, prevout_n, scriptsig = txin
			prevout_tx_id = get_tx_id(prevout_hash)
			if prevout_tx_id is not None:
				prevout_id = get_output_id(prevout_tx_id, prevout_n)
				if prevout_id is not None:
					matched_ins.append((n, prevout_id))
		if len(matched_ins) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, prevout_id in matched_ins:
				insert_input(tx_id, n, prevout_id)
			count_in += len(matched_ins)
		n_tx += 1
	stdout.write(' spent-outs %s\n' % count_in)

Assorted helpers: handling usage errors; looking up a tag ID that must exist.

def die(msg, help=False):
	stderr.write('gbw-node: %s\n' % msg)
	if help:
		cmd_help([])
	exit(-1)

def require_tag(name):
	i = get_tag_id(name)
	if i is None:
		die('tag not found: %r' % name)
	return i

The entry point for any user command "X" is the function "cmd_X", having help text in its docstring and taking a list of any supplied CLI arguments past the command name.

First, the sync commands. The scan process commits one database transaction per block.

def cmd_scan(argv):
	'''
	scan

	Iterate blocks from bitcoind, indexing transaction inputs and outputs affecting watched addresses. May be safely interrupted and resumed.

	NOT PRESENTLY SAFE TO RUN CONCURRENT INSTANCES due to the dumpblock to named pipe kludge.
	'''
	db.execute('PRAGMA synchronous=NORMAL')
	height = db.execute('SELECT scan_height FROM state').fetchone()[0]
	blockcount = max(-1, rpc('getblockcount') - CONFIRMS)
	while height < blockcount:
		height += 1
		scan_block(height, memoryview(getblock(height)))
		db.execute('UPDATE state SET scan_height = ?', (height,))
		db.commit()

def cmd_reset(argv):
	'''
	reset

	Reset the scan pointer so the next scan will proceed from the genesis block, to find transactions associated with newly watched addresses.
	'''
	db.execute('UPDATE state SET scan_height = -1')
	db.commit()

Next, commands to query the watched address sets (not in the original spec but trivial and clearly useful).

def cmd_tags(argv):
	'''
	tags

	List all tag names.
	'''
	for name, in db.execute('SELECT name FROM tag'):
		stdout.write(name + '\n')

def cmd_addresses(argv):
	'''
	addresses [TAG]

	List addresses with the given TAG (or all watched addresses).
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address FROM address \
				JOIN address_tag ON address.address_id=address_tag.address_id \
				WHERE tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT address FROM address')
	for a, in r:
		stdout.write(format_address(str(a)) + '\n')

To be continued.

  1. I've found the Python profiler quite useful so far compared to such guesswork; still, optimization is something of a balance between experimentally-driven efforts and not doing obviously wasteful things from the start. [^]

Draft gbw-node frontend, part 4

Filed under: Bitcoin, Software — Jacob Welsh @ 04:36

Continued from: schema, part 1, part 2, part 3.

Common database operations

As an internal convention, a "get_X_id" function will return the database ID for the row in table "X" named by its bulkier external reference, or None if not found. Similarly, "insert_or_get_X_id" will insert a row if needed and in either case return the ID. Some of these have only a single caller, but I find that collecting the various similar queries in one place and wrapping them into tidy functions helps readability.

The mapping of Python to SQLite types is fairly straightforward, except that buffer is needed to specify a BLOB.

The "parameter substitution" feature is used throughout, avoiding improper mixing of code and data that could manifest as SQL injection or thrashing the compiled statement cache.

def get_address_id(a):
	r = db.execute('SELECT address_id FROM address WHERE address=?', (buffer(a),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_address_id(a):
	i = get_address_id(a)
	if i is not None:
		return i
	return db.execute('INSERT INTO address (address) VALUES (?)', (buffer(a),)).lastrowid

def get_tx_id(hash):
	r = db.execute('SELECT tx_id FROM tx WHERE hash=?', (buffer(hash),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tx_id(hash, blkhash, height, n, size):
	try:
		return db.execute('INSERT INTO tx (hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)',
				(buffer(hash), buffer(blkhash), height, n, size)).lastrowid
	except IntegrityError:
		# XXX check equality?
		return get_tx_id(hash)

I now think we should indeed catch that condition (differing transactions with identical hash), especially given the possibility of TXID collisions. Perhaps I left it out from excessive worry about scan performance. Or just laziness.
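
Catching it might look something like this sketch (renamed so as not to pass for the deployed code), reusing the Conflict exception from part 1:

def insert_or_get_tx_id_checked(hash, blkhash, height, n, size):
	try:
		return db.execute('INSERT INTO tx (hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)',
				(buffer(hash), buffer(blkhash), height, n, size)).lastrowid
	except IntegrityError:
		tx_id, old_blk, old_height, old_n, old_size = db.execute(
				'SELECT tx_id, block_hash, block_height, n, size FROM tx WHERE hash=?',
				(buffer(hash),)).fetchone()
		if (str(old_blk), old_height, old_n, old_size) != (blkhash, height, n, size):
			raise Conflict('tx differs from previous content', hash)
		return tx_id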

The mixture of check-first and try-first styles seen above also doesn't sit well. The possibility of TOCTTOUs,(i) depending on the details of transaction isolation level, would seem to make a strong case for try-first. It's a minor point though; the worst case here would be an uncaught IntegrityError halting the program gracefully.

def insert_output(tx_id, n, addr_id, val):
	try:
		db.execute('INSERT INTO output (tx_id, n, address_id, value) VALUES (?,?,?,?)',
				(tx_id, n, addr_id, val))
	except IntegrityError:
		r = db.execute('SELECT address_id, value FROM output WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()
		if r != (addr_id, val):
			raise Conflict('output differs from previous content', tx_id, n, (addr_id, val), r)

def insert_input(tx_id, n, prevout_id):
	try:
		input_id = db.execute('INSERT INTO input (tx_id, n) VALUES (?,?)', (tx_id, n)).lastrowid
	except IntegrityError:
		input_id = db.execute('SELECT input_id FROM input WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()[0]
	db.execute('UPDATE output SET spent=? WHERE output_id=?', (input_id, prevout_id))

def get_output_id(tx_id, n):
	r = db.execute('SELECT output_id FROM output WHERE tx_id=? AND n=?', (tx_id, n)).fetchone()
	return None if r is None else r[0]

def get_tag_id(name):
	r = db.execute('SELECT tag_id FROM tag WHERE name=?', (name,)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tag_id(name):
	i = get_tag_id(name)
	if i is not None:
		return i
	return db.execute('INSERT INTO tag (name) VALUES (?)', (name,)).lastrowid

Next up, we'll finally get to implementing the commands themselves. To be continued.

  1. The "time of check to time of use" race condition. You know, like sitting down when some trickster's meanwhile moved the chair. [^]

2020-01-17

Draft gbw-node frontend, part 3

Filed under: Bitcoin, Software — Jacob Welsh @ 18:02

Continued from: schema, part 1, part 2.

Base58

Bitcoin addresses are conventionally written in a special-purpose encoding and include a hash truncated to 32 bits for error detection. As the reference implementation explains:

Why base-58 instead of standard base-64 encoding?
- Don't want 0OIl characters that look the same in some fonts and could be used to create visually identical looking account numbers.
- A string with non-alphanumeric characters is not as easily accepted as an account number.
- E-mail usually won't line-break if there's no punctuation to break at.
- Doubleclicking selects the whole number as one word if it's all alphanumeric.

Of course, all these points would have been answered just as well by hexadecimal, and without the various burdens: case-sensitivity for the user (the surest way I've found to read these out is the fully explicit: "one five big-A three little-X ..."); more code for the implementer; and more work for the machine (as the lack of bit alignment demands a general base conversion algorithm).

We start with lookup tables to convert the digits 0-57 to the specified alphabet and back. I was once surprised to learn the scope of iteration variables in a Python "for" loop is not restricted to the loop body: a potential source of referential confusion, reflecting the language's casual approach to mutation. Thus, when at the global scope I like to ensure throwaway names, like "index" and "character" here, are safely contained in a function.

base58_alphabet = (string.digits + string.uppercase + string.lowercase).translate(None, '0OIl')
base58_inverse = [None]*256
def init_base58_inverse():
	for index, character in enumerate(base58_alphabet):
		base58_inverse[ord(character)] = index
init_base58_inverse()

To do base conversion we'll need to treat byte sequences as integers with the same ordering conventions as the reference code. Otherwise put: to decode from base-256 to abstract integers. Python 2 doesn't have a builtin for this. The algorithm is not optimal, but the base-58 part will be worse anyway.

def bytes_to_int(b):
	"Convert big-endian byte sequence to unsigned integer"
	i = 0
	for byte in b:
		i = (i << 8) + ord(byte)
	return i
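
For example (big-endian, so the first byte is most significant):

bytes_to_int('\x00\x01')  # -> 1
bytes_to_int('\x01\x00')  # -> 256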

To complete the bytes-to-ASCII converter we extract digits from the integer, least significant first, by iterated division with remainder by 58. Since the conversion to integer loses track of field width, the convention is to pad with the same number of base-58 zeros as there were base-256 leading zeros in the input. In further fallout from using a non-bit-aligned encoding, these are not naturally constant time or constant control-flow operations.

For the same bit cost of the error detection code we could have had error correction. But that would have required, like, math, and stuff.

def b2a_base58check(data):
	data += sha256d(data)[:4]

	leading_zeros = 0
	for b in data:
		if b != '\x00':
			break
		leading_zeros += 1

	data_num = bytes_to_int(data)

	digits = []
	while data_num:
		data_num, digit = divmod(data_num, 58)
		digits.append(digit)
	digits.extend([0] * leading_zeros)

	return ''.join(base58_alphabet[digit] for digit in reversed(digits))

Converting back to bytes uses the inverse operation at each step, but now there are cases of invalid input to reject: digits outside the specified alphabet and corruption detected by the checksum. (The precise function decomposition is a bit arbitrary and asymmetrical I'll admit.)

class Base58Error(ValueError):
	pass

class BadDigit(Base58Error):
	pass

class BadChecksum(Base58Error):
	pass

def a2b_base58(data):
	digits = [base58_inverse[ord(b)] for b in data]
	if None in digits:
		raise BadDigit

	leading_zeros = 0
	for digit in digits:
		if digit != 0:
			break
		leading_zeros += 1

	data_num = 0
	for digit in digits:
		data_num = 58*data_num + digit

	data_bytes = []
	while data_num:
		data_bytes.append(data_num & 0xFF)
		data_num = data_num >> 8
	data_bytes.extend([0] * leading_zeros)

	return ''.join(chr(b) for b in reversed(data_bytes))

def a2b_base58check(data):
	data = a2b_base58(data)
	payload = data[:-4]
	check = data[-4:]
	if check != sha256d(payload)[:4]:
		raise BadChecksum
	return payload

Finally we apply this encoding to Bitcoin addresses, which have a fixed 160-bit width plus an extra "version" byte that becomes the familiar leading "1".

class BadAddressLength(ValueError):
	pass

class BadAddressVersion(ValueError):
	pass

def parse_address(a):
	b = a2b_base58check(a)
	if len(b) != 21:
		raise BadAddressLength
	if b[0] != '\x00':
		raise BadAddressVersion(ord(b[0]))
	return b[1:]

def format_address(b):
	return b2a_base58check('\x00' + b)
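
A quick round-trip check, using an arbitrary 20 bytes (here a truncated sha256) as a stand-in for a real public-key hash:

h = sha256('example')[:20]
assert parse_address(format_address(h)) == h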

All this format conversion groundwork out of the way, we'll start talking to the database and putting it all together. To be continued!

2020-01-16

Draft gbw-node frontend, part 2

Filed under: Bitcoin, Software — Jacob Welsh @ 18:52

Continued from schema and part 1. Source: gbw-node.py.

What follows is a kluge and probably the sorest point of the thing in my mind. I will attempt to retrace the decision tree that led to it.

Ultimately what we're after is a way to examine all confirmed transactions so as to filter for relevant ones and record these in the database ("scanning"). Since on disk they're already grouped into blocks of constrained size, it makes sense from an I/O standpoint to load them blockwise. But how? We could read directly from the blk####.dat files, but this would go against the "loose coupling" design. Concretely, this manifests as several messes waiting to happen: do we know for sure that only validated blocks get written to these files? How will we detect and handle blocks that were once valid but abandoned when a better chain was found? Might we read a corrupt block if the node is concurrently writing? Much more sensible would be to work at the RPC layer which abstracts over internal storage details and provides the necessary locking.

Unfortunately there is presently no getblock method in TRB, only dumpblock which strikes me as more of a debugging hack than anything else, returning its result by side effect through the filesystem. I worried this could impose a substantial cost in performance and SSD wear, once multiplied across the many blocks and possible repeated scans, especially if run on a traditional filesystem with more synchronous operations than the likes of ext4. It occurred to me to use a named pipe to allow the transfer to happen in memory without requiring changes to the TRB code. I took this route, after some discussion with management confirmed that minimizing TRB changes was preferable. I hadn't really spelled out the thinking on files versus pipes, or anticipated what now seems a fundamental problem with the pipe approach: if the reading process aborts for any reason, the write operation in bitcoind can't complete; in fact, due to coarse locking, the whole node ends up frozen. (This can be cleaned up manually e.g. by cat ~/.gbw/blockpipe >/dev/null).

Given this decision, the next problem was the ordering of reads and writes, as reading the block data through the pipe must complete before the RPC method will return. Thus some sort of threading is needed: either decomposing the RPC client into separate send and receive parts so as to read the pipe in between, or using a more general facility. Hoping to preserve the abstraction of the RPC code, I went with the latter in the form of Python's thread support. I tend to think this was a mistake; besides all the unseen complexity it invokes under the hood, it turned out the threading library provides no reliable way to cancel a thread, which I seemed to need for the case where the RPC call returns an error without sending data. I worked around this by making the reader a long-lived "daemon" thread, which ensures it terminates with the program, and using an Event object to synchronize handoff of the result through a global variable.

getblock_thread = None
getblock_done = Event()
getblock_result = None
def getblock_reader(pipe):
	global getblock_result
	while True:
		fd = os_open(pipe, O_RDONLY)
		getblock_result = read_all(fd)
		getblock_done.set()
		close(fd)

def getblock(height):
	global getblock_thread
	pipe = gbw_home + '/blockpipe'
	if getblock_thread is None:
		require_fifo(pipe)
		getblock_thread = Thread(target=getblock_reader, args=(pipe,))
		getblock_thread.daemon = True
		getblock_thread.start()
	if not rpc('dumpblock', height, pipe):
		raise ValueError('dumpblock returned false')
	getblock_done.wait()
	getblock_done.clear()
	return getblock_result

The astute reader may notice a final and rather severe "rake" waiting to be stepped on: a fixed filesystem path is used for the named pipe, so concurrent scan processes would interfere with each other, with corrupted blocks and all sorts of possible failures following. While I'm not sure there's any reason to run such concurrent scans, at least this possibility should be excluded. I'm thinking the process ID would make a sufficiently unique identifier to add to the filename, but then the pipe will need to be removed afterward, and what if the process dies first? Yet more code to clean up possibly-defunct pipes? Or one could keep the fixed name and use filesystem locking. At any rate I hope this all makes it clear why I'm not too enamored of dumpblock.
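
For what it's worth, the per-process naming might look like the following sketch (hypothetical names; the cleanup question stays open, since a crashed process still leaves its fifo behind):

def getblock_pipe_path():
	# Hypothetical: name the fifo by PID so concurrent scans don't collide.
	from os import getpid
	return '%s/blockpipe.%d' % (gbw_home, getpid())

def remove_pipe(path):
	# Best-effort removal on normal exit.
	from os import unlink
	try:
		unlink(path)
	except OSError, e:
		if e.errno != errno.ENOENT:
			raise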

To be continued.

2020-01-11

What's on my mind

Filed under: Ego, Politikos — Jacob Welsh @ 20:46

As I continue shaking off the grogginess of a deep holiday blogging slumber,(i) it's time for a tack away from the recent technical series, fascinating as it is, to make room for some other thoughts that have been occupying my head the past couple days.

There were the occurrences of Mr. Ragavan dropping out of our academy and Mr. Datskovskiy reversing his previous confidence in its Master, based on an alleged lack of capacity for independent judgement. The matter was explored quite nicely I thought by Will Haack and commenters; all I really have to add is my own perspective, a narrow selection from the many available facts that I found most relevant, that I may shed light on myself as well as the subject.

I agree with Diana Coman's comment, as I understand it. MP asked her to stop advertising for the #asciilifeform channel; Datskovskiy took this as a demand to tune him out altogether, seemingly so he could consider his prediction fulfilled and sustain his view of MP as an unreasonable and overreaching madman. I'm not presently sure how I should interpret the invitation to leave, though I wasn't especially tuned in to begin with based on time constraints; I'm not inclined to contribute to building a Berlin Wall, as he puts it, if I don't have to, but at minimum it would give me pause about signing up for his ISP service knowing that I could be kicked out on such grounds.

The specimen of deductive reasoning that Datskovskiy took for finding a spine: it's not because it's not; I don't see the need; I'm not convinced; I don't think it conforms to idea. I'm not too surprised by Ragavan's attitude after how he spoke in our own main interaction. I will note that I don't hold it against anyone for having doubts. I know my own grasp of philosophy and what TMSR is really about remains limited; I've been wrong about plenty of ideological things; and we're all limited beings striving to navigate a complex world. As far as I can see, if he had wanted to avoid getting quite so tied up in knots, his best bet would have been to communicate more, or more clearly. This is a point that I still quite need to keep in mind myself.(ii)

What's been more on my mind though are some different relationships altogether. To be continued...

  1. I see myself as having been something of a compressed spring; once released, I overshot the neutral state, doing no writing at all, and am now having to pick back up the habits of time-sensitive existence. [^]
  2. As for instance about my getting stuck on drafting this article. [^]

2020-01-07

Errata for gbw-node drafts to date, and Bitcoin txid collisions

Filed under: Bitcoin, Software — Jacob Welsh @ 18:11

I've discovered a few mistakes in my wallet code published so far, one from pure carelessness and two from insufficient cognizance of environmental hazards, which together seem interesting enough for a dedicated article.

The first is in the HTTP Basic authentication header in the JSON-RPC client: I slipped in an erroneous space character after the colon in the base64 input, in the course of a hasty and perhaps excessive last-minute style cleanup. I discovered this through basic testing as it broke authentication altogether. I will update the article but preserve the original file for whoever cares to diff.

The second, found in code not yet written up, is the obnoxious but standardized behavior of the SQL SUM function, whereby the sum of the empty set is NULL rather than zero. In Python this becomes the None object, which my code then passes to functions that expect integers. I first tried to work around it at the Python level, but soon found this to be awkward, especially for queries returning multiple columns where some were perfectly well-behaved integers. A fix closer to the source of the problem is found in the standard SQL COALESCE function, though having to use this as boilerplate around every use of SUM is not exactly satisfying.
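
The difference is easy to demonstrate from the frontend's own connection (illustration only; the WHERE 0 just forces an empty set):

bal, = db.execute('SELECT SUM(value) FROM output WHERE 0').fetchone()
# bal is None: SUM over zero rows yields NULL.
bal, = db.execute('SELECT COALESCE(SUM(value),0) FROM output WHERE 0').fetchone()
# bal is 0, safe to hand to format_coin.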

The above fixes are in: draft2/gbw-node.py.

The third and deepest might not even be a problem in practice, but seems to warrant further investigation. From the schema:

CREATE UNIQUE INDEX i_tx_hash ON tx(hash);

The problem is that Bitcoin doesn't guarantee uniqueness of transaction contents - miners can use identical coinbase inputs - despite the fact that the implementation assumes unique transaction hashes! The possibility of collision was realized in 2010,(i) condemning all future implementers to bugwise compatibility, whatever that means. The Power Rangers in their boundless wisdom addressed this in BIP 30 except without solving all that much. Further discussions I've been chewing on include Mirco… Mezzo… Macroflation—Overheated Economy (archived) and of course the forum log. Takeaways so far are that 1) the sky is probably (still) not falling regarding Bitcoin itself, but 2) this could possibly be used by malicious peers to hard-wedge syncing TRB nodes.

The conservative approach for my program would seem to be leaving the schema as is and letting SQL throw an integrity error if you try to monitor the addresses in question. Relaxing the unique constraint should be possible with some further changes to the code, but the question would arise of how exactly the quasi-duplicate outputs should be interpreted. Please speak up in the comments if you know of further references or conclusions on this topic!

  1. Block pair 91722/91880 paying address 1GktTvnY8KGfAS72DhzGYJRyaQNvYrK9Fg and 91812/91842 paying address 16va6NxJrMGe5d2LP6wUzuVnzBBoKQZKom. [^]

2019-12-18

Draft gbw-node frontend, part 1

Filed under: Bitcoin, Software — Jacob Welsh @ 01:18

The database schema for the "node" part of Gales Bitcoin Wallet covered, we now proceed to the frontend program that puts it to work: collecting data from bitcoind, parsing its various binary encodings and extracting something useful.

Source: draft1/gbw-node.py draft2/gbw-node.py (and the previous schema-node.sql).

You'd be well advised to read the downloaded thing prior to executing, especially since it's in an unsigned draft state. As for what's necessary to graduate to a vpatch I'd be ready to sign, my thinking is that it's this review and annotation process itself, plus whatever important changes come out of it, and the previously suggested schema tweaks (since changing that is the most obnoxious part once deployed).

At present there's not much of an installation process and the database is initialized manually. I'd suggest creating some directory to hold the two sources. Then from that directory:

$ chmod +x gbw-node.py
$ mkdir ~/.gbw
$ sqlite3 ~/.gbw/db < schema-node.sql
$ ./gbw-node.py help

In preparing this code for publication I observed that I had continued by force of habit (and editor settings) with the Python style guidelines of four-space indents and some fixed line width limit, in opposition to Republican doctrine. I've attempted to clean it up such that line breaks occur only for good reasons, though I can't say I'm happy with how my browser wraps the long lines. And it's not like I expect the poor thing to know good indentation rules for every possible programming language now... wut do?!

Prologue

We start with the usual Pythonistic pile of imports. The ready libraries are a big reason the language is hard to beat for getting things working quickly, and at the same time a dangerous temptation toward thinking you don't need to care what's inside them.

#!/usr/bin/python2
# J. Welsh, December 2019

from os import getenv, open as os_open, O_RDONLY, O_WRONLY, mkdir, mkfifo, read, write, close, stat
from stat import S_ISDIR, S_ISFIFO
from sys import argv, stdin, stdout, stderr, exit
from socket import socket
from threading import Thread, Event
from binascii import a2b_hex, b2a_hex
from base64 import b64encode
from struct import Struct
from hashlib import sha256 as _sha256
from decimal import Decimal
from inspect import getdoc
import errno
import signal
import string
import json
import sqlite3
from sqlite3 import IntegrityError

The above are all in the standard library, assuming they're enabled on your system. The ones that stick out like sore thumbs to me are threading and decimal; more on these to come.

As the comments say:

# Safety level: scanning stops this many blocks behind tip
CONFIRMS = 6

# There's no provision for handling forks/reorgs. In the event of one deeper than CONFIRMS, a heavy workaround would be:
#   $ sqlite3 ~/.gbw/db
#   sqlite> DELETE FROM output;
#   sqlite> DELETE FROM input;
#   sqlite> DELETE FROM tx;
#   sqlite> .exit
#   $ gbw-node reset
#   $ gbw-node scan

At least a semi-automated and lighter-touch recovery procedure would certainly be nice there.

gbw_home = getenv('HOME') + '/.gbw'
bitcoin_conf_path = getenv('HOME') + '/.bitcoin/bitcoin.conf'

# Further knobs in main() for database tuning.
db = None

This Is The Database; Use It.

For reasons I don't quite recall (probably interpreting hashes as integers, combined with pointer type punning - an unportable C programming practice common in Windows-land), bitcoind ended up reversing byte order compared to the internal representation for hex display of certain things including transaction and block hashes. Thus we have "bytes to little-endian hex" wrappers.

b2lx = lambda b: b2a_hex(b[::-1])
lx2b = lambda x: a2b_hex(x)[::-1]

Not taking any chances with display of monetary amounts, a function to convert integer Satoshi values to fixed-point decimal BTC notation. The remainder/modulus operators have varying definitions between programming languages (sometimes even between implementations of the same language!) when it comes to negative inputs, so we bypass the question.

def format_coin(v):
	neg = False
	if v < 0:
		v = -v
		neg = True
	s = '%d.%08d' % divmod(v, 100000000)
	if neg:
		return '-' + s
	return s

Preloading and giving more intelligible names to some "struct" based byte-packing routines.

u16 = Struct('<H')
u32 = Struct('<I')
u64 = Struct('<Q')
s64 = Struct('<q')
unpack_u16 = u16.unpack
unpack_u32 = u32.unpack
unpack_u64 = u64.unpack
unpack_s64 = s64.unpack
unpack_header = Struct('<i32s32sIII').unpack
unpack_outpoint = Struct('<32sI').unpack
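
For instance, the little-endian 32-bit reading of four bytes:

unpack_u32('\x01\x00\x00\x00')  # -> (1,)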

Some shorthand for hash functions.

def sha256(v):
	return _sha256(v).digest()

def sha256d(v):
	return _sha256(_sha256(v).digest()).digest()

An exception type to indicate certain "should not happen" database inconsistencies.

class Conflict(ValueError):
	pass

For reading a complete stream from a low-level file descriptor; experience has led me to be suspicious of Python's file objects.

def read_all(fd):
	parts = []
	while True:
		part = read(fd, 65536)
		if len(part) == 0:
			break
		parts.append(part)
	return ''.join(parts)

Ensuring needed filesystem objects exist.

def require_dir(path):
	try:
		mkdir(path)
	except OSError, e:
		if e.errno != errno.EEXIST:
			raise
		if not S_ISDIR(stat(path).st_mode):
			die('not a directory: %r' % path)

def require_fifo(path):
	try:
		mkfifo(path)
	except OSError, e:
		if e.errno != errno.EEXIST:
			raise
		if not S_ISFIFO(stat(path).st_mode):
			die('not a fifo: %r' % path)

RPC client

Bitcoind uses a password-authenticated JSON-RPC protocol. I expect this is one of the more concise client implementations around.

class JSONRPCError(Exception):
	"Error returned in JSON-RPC response"

	def __init__(self, error):
		super(JSONRPCError, self).__init__(error['code'], error['message'])

	def __str__(self):
		return 'code: {}, message: {}'.format(*self.args)

Some of this code was cribbed from earlier experiments on my shelf. The fancy exception class above doesn't really look like my style; it may have hitchhiked from an outside JSON-RPC library.

The local bitcoin.conf is parsed to get the node's credentials. This is done lazily to avoid unnecessary error conditions for the many commands that won't be needing it.

bitcoin_conf = None
def require_conf():
	global bitcoin_conf
	if bitcoin_conf is None:
		bitcoin_conf = {}
		with open(bitcoin_conf_path) as f:
			for line in f:
				line = line.split('#', 1)[0].rstrip()
				if not line:
					continue
				k, v = line.split('=', 1)
				bitcoin_conf[k.strip()] = v.lstrip()

Side note: I detest that "global" keyword hack. It's "necessary" only because variable definition is conflated with mutation in the single "=" operator, and completely misses the case of a nested function setting a variable in an outer but not global scope. ("So they added 'nonlocal' in Python 3, solves your problem!!")

def rpc(method, *args):
	require_conf()
	host = bitcoin_conf.get('rpcconnect', '127.0.0.1')
	port = int(bitcoin_conf.get('rpcport', 8332))
	auth = 'Basic ' + b64encode('%s:%s' % (
		bitcoin_conf.get('rpcuser', ''),
		bitcoin_conf.get('rpcpassword', '')))
	payload = json.dumps({'method': method, 'params': args})
	headers = [
		('Host', host),
		('Content-Type', 'application/json'),
		('Content-Length', len(payload)),
		('Connection', 'close'),
		('Authorization', auth),
	]
	msg = 'POST / HTTP/1.1\r\n%s\r\n\r\n%s' % ('\r\n'.join('%s: %s' % kv for kv in headers), payload)
	sock = socket()
	sock.connect((host, port))
	sock.sendall(msg)
	response = read_all(sock.fileno())
	sock.close()
	headers, payload = response.split('\r\n\r\n', 1)
	r = json.loads(payload, parse_float=Decimal)
	if r['error'] is not None:
		raise JSONRPCError(r['error'])
	return r['result']
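
Usage is then as simple as, e.g.:

height = rpc('getblockcount')  # the node's tip height, as used later by the scan command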

I could see removing the "parse_float=Decimal", and thus the corresponding import, as we won't be calling here any of the problematic interfaces that report monetary values as JSON numbers. But then, I'd also see value in one RPC client implementation that can just be copied for whatever use without hidden hazards.

Bitcoin data parsing

Now things might get interesting. To parse the serialized data structures in a manner similar to the C++ reference implementation and hopefully efficient besides, I used memory views, basically bounds-checking pointers.(i)

# "load" functions take a memoryview and return the object and number of bytes consumed.

def load_compactsize(v):
	# serialize.h WriteCompactSize
	size = ord(v[0])
	if size < 253:
		return size, 1
	elif size == 253:
		return unpack_u16(v[1:3])[0], 3
	elif size == 254:
		return unpack_u32(v[1:5])[0], 5
	else:
		return unpack_u64(v[1:9])[0], 9
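
For instance, a one-byte and a three-byte encoding:

load_compactsize(memoryview('\xfc'))          # -> (252, 1)
load_compactsize(memoryview('\xfd\x00\x01'))  # -> (256, 3)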

def load_string(v):
	# serialize.h Serialize, std::basic_string and CScript overloads
	n, i = load_compactsize(v)
	return v[i:i+n].tobytes(), i+n

def vector_loader(load_element):
	# serialize.h Serialize_impl
	def load_vector(v):
		n, i = load_compactsize(v)
		r = [None]*n
		for elem in xrange(n):
			r[elem], delta = load_element(v[i:])
			i += delta
		return r, i
	return load_vector

def load_txin(v):
	# main.h CTxIn
	i = 36
	txid, pos = unpack_outpoint(v[:i])
	scriptsig, delta = load_string(v[i:])
	i += delta
	i += 4 # skipping sequence
	return (txid, pos, scriptsig), i

load_txins = vector_loader(load_txin)

def load_txout(v):
	# main.h CTxOut
	i = 8
	value, = unpack_s64(v[:i])
	scriptpubkey, delta = load_string(v[i:])
	return (value, scriptpubkey), i+delta

load_txouts = vector_loader(load_txout)

def load_transaction(v):
	# main.h CTransaction
	i = 4 # skipping version
	txins, delta = load_txins(v[i:])
	i += delta
	txouts, delta = load_txouts(v[i:])
	i += delta
	i += 4 # skipping locktime
	hash = sha256d(v[:i])
	return (hash, i, txins, txouts), i

load_transactions = vector_loader(load_transaction)

def load_block(v):
	# main.h CBlock
	i = 80
	head = v[:i]
	version, prev, root, time, target, nonce = unpack_header(head)
	hash = sha256d(head)
	txs, delta = load_transactions(v[i:])
	return (hash, prev, time, target, txs), i+delta

The dig through the reference code to come up with this magic for identifying standard pay-to-pubkey-hash outputs and extracting the enclosed addresses was an ugly one.

def out_script_address(s):
	# Standard P2PKH script: OP_DUP OP_HASH160 20 ... OP_EQUALVERIFY OP_CHECKSIG
	if len(s) == 25 and s[:3] == '\x76\xA9\x14' and s[23:] == '\x88\xAC':
		return s[3:23]
	return None
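
A quick check of both branches, with arbitrary bytes standing in for a real hash160:

h = '\xab' * 20
assert out_script_address('\x76\xA9\x14' + h + '\x88\xAC') == h
assert out_script_address('\x51') is None  # e.g. a bare OP_TRUE script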

To be continued.(ii)

Updated for errata.

  1. I'm just now noticing these were added in 2.7, ugh... sorry, 2.6 users. [^]
  2. My blog will be going on hiatus as far as new articles until early January. There's quite a ways to go on this file and I might not make it all the way through on this pass. If the suspense gnaws, you can always keep reading the source! [^]

2019-12-16

Draft gbw-node schema

Filed under: Bitcoin, Software — Jacob Welsh @ 22:09

This series will present my initial implementation of a Bitcoin wallet built up from loosely-coupled parts ("simple droplets"). This follows from necessity as much as design philosophy, in that: 1) the only practical way to secure keys against remote compromise is with dedicated offline hardware; 2) additional online code is needed to support this; and 3) the existing bitcoind is a costly and treacherous thing to hack on. Further background on this work is at:

Today we will look at the data model for the awk-like part, which has shaped up as a line-mode command suite run on demand to collect validated block data, index and make accessible the parts needed to track funds and issue transactions.

Source: schema-node.sql.

Compared to a key-value store like BerkeleyDB as used by bitcoind, using SQL allows formalizing the structure and relationships in the data, working at a high level and enforcing integrity along several dimensions. Besides simplifying implementation, this means the human operator can bypass any limitations of the frontend software and work directly with the database if need be.

Because SQL in practice isn't quite as standard as one might like, one needs to pick either a specific implementation or some abstraction layer to target. While I've worked with MySQL, PostgreSQL and SQLite, I'm most familiar with the last, and figure it has some nice properties for this application: simple to install, widely available, with no daemon processes, network listeners, user accounts or passwords to configure. Don't be fooled by the "lite": it's still a large item with a fairly comprehensive and optimized SQL engine, claiming robust concurrency and durability properties (ACID transactions) and large, well-defined and tested limits. It includes a command-line shell sqlite3 for manual operation, dump and restore. You are of course welcome to try other implementations and report findings; I haven't used inherently SQLite-specific features like dynamic typing.

A "watched address" model is used such that only transaction outputs and inputs affecting Bitcoin addresses of interest are recorded. The benefit is much lower storage requirement than indexing the full blockchain; the cost is having to rescan after importing addresses with existing history.(i) Addresses are grouped into sets (tags) such that one node can serve many wallets or include decoy sets.

Prologue

--- Gales Bitcoin Wallet: node (online component) schema
--- J. Welsh, December 2019
--- Dialect: SQLite (3.7.0 for WAL)

SQLite originally used a rollback-journal approach, which is terribly slow for write-heavy uses, at least on mechanical HDDs lacking battery-backed cache, due to requiring at least two fsync operations per transaction. It provides no way to relax durability without sacrificing integrity guarantees, a relaxation that would precisely suit the needs of the scan process. We'll instead use the newer Write-Ahead Logging mode; once activated, it's remembered in the database file.

PRAGMA journal_mode=WAL;
BEGIN;

Transactions

The tx table tracks confirmed Bitcoin transaction metadata. All "_id" fields are database-assigned integers for efficiency of storage and lookup; hashes are stored as raw bytes.

CREATE TABLE tx (
	tx_id        INTEGER PRIMARY KEY,
	hash         BLOB    NOT NULL,
	block_hash   BLOB    NOT NULL,
	block_height INTEGER NOT NULL,
	n            INTEGER NOT NULL, -- position in block
	comment      TEXT,
	size         INTEGER NOT NULL,
	fee          INTEGER
);
CREATE UNIQUE INDEX i_tx_hash ON tx(hash);
CREATE UNIQUE INDEX i_tx_height_n ON tx(block_height, n);

The purist might note that this table isn't entirely normalized since many-to-one block information is stored in two fields; but then, there can also be upsides to avoiding excessive joins. It might be nice to track more block metadata anyway since it's not that big and bitcoind does it quite opaquely.

Comment and fee fields are unused at present: computing fees, while desirable, would seem to require a full transaction index; comments don't seem to fit with the concept of minimizing private information stored by the node.

Transactions can be found either by their 256-bit hash or the more concise block coordinates.

Outputs and inputs

The transaction output is the main structure of interest, holding a fixed parcel of spendable or spent coin.

The REFERENCES clause, besides making relationships explicit, creates a foreign key constraint. Unfortunately SQLite doesn't enforce these by default for compatibility reasons; it must be enabled by the application on each connection. To ensure the sqlite3 shell gets it too, I include in my ~/.sqliterc a PRAGMA foreign_keys=ON;, as well as a .headers on and .mode quote (one per line) for better handling of the blobs.
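
Spelled out, that ~/.sqliterc amounts to:

PRAGMA foreign_keys=ON;
.headers on
.mode quote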

CREATE TABLE output (
	output_id  INTEGER PRIMARY KEY,
	tx_id      INTEGER NOT NULL REFERENCES tx,
	n          INTEGER NOT NULL, -- position in output vector
	address_id INTEGER NOT NULL REFERENCES address,
	value      INTEGER NOT NULL,

SQLite integers take up to 8 bytes as needed, thus can safely represent Satoshi values up to the maximum BTC that will ever exist (1e8 x 21e6 = 2.1e15; 2^63 > 10^18).

	spent      INTEGER REFERENCES input(input_id),

An initial mistake was keeping the spent reference on the input side, following the raw transaction structure. The trouble with this is there's no efficient way to find unspent outputs: "the rows in this table that aren't referenced from another table" are necessarily not contained in that table's index! So instead we keep the link on this side, allowing NULL, and update the row when an input is seen that spends it.

	flags      TEXT
);
CREATE UNIQUE INDEX i_output_txid_n ON output(tx_id, n);
CREATE INDEX        i_output_addrid ON output(address_id);
CREATE UNIQUE INDEX i_output_spent  ON output(spent);

The flags field is presently unused: the idea was to track local state bits like seen or acknowledged.

CREATE TABLE input (
	input_id   INTEGER PRIMARY KEY,
	tx_id      INTEGER NOT NULL REFERENCES tx,
	n          INTEGER NOT NULL -- position in input vector
);
CREATE UNIQUE INDEX i_input_txid_n ON input(tx_id, n);

Having a separate table for inputs is looking pointless now that it's a mere one-to-one reference from output.spent... ah, the joys of schema refactoring.

Addresses and tags

CREATE TABLE address (
	address_id INTEGER PRIMARY KEY,
	address    BLOB NOT NULL
);
CREATE UNIQUE INDEX i_address_address ON address(address);

Like transaction and block hashes, addresses are represented in raw byte form as opposed to Base58Check-encoded. This makes scanning faster but importing and formatting for display slower. "Version" is assumed to be zero, meaning P2SH / "3-addresses" aren't a thing.

CREATE TABLE tag (
	tag_id INTEGER PRIMARY KEY,
	name   TEXT NOT NULL
);
CREATE UNIQUE INDEX i_tag_name ON tag(name);

Unlike the "accounts" seen in bitcoind, addresses can have multiple tags.

CREATE TABLE address_tag (
	address_id INTEGER NOT NULL REFERENCES address,
	tag_id     INTEGER NOT NULL REFERENCES tag,
	PRIMARY KEY (address_id, tag_id)
);
CREATE INDEX i_addrtag_tag ON address_tag(tag_id);

Like many databases, SQLite uses B-tree storage. One implication is that in the case of composite keys (those spanning multiple fields as seen here), a dedicated index is not necessary for the first field as B-trees provide efficient prefix search. Subsequent fields however may benefit from them.

State

Finally, a "singleton" table to track scan state. A more extensible approach to global variables would be textual name and value fields, but I dunno, I like the explicit-ness of mapping variables to column rather than row.

CREATE TABLE state (
	scan_height INTEGER NOT NULL DEFAULT(-1)
);
INSERT INTO state DEFAULT VALUES;

COMMIT;

Stay tuned for the code that puts it all to work. To run it you'll be needing Python 2 (2.7 tested) built with the sqlite3 module, the standalone sqlite3 shell (as loading the schema is manual for now), and of course, a Bitcoin node.

  1. A test scan of 607`849 blocks on 1`102 addresses finding 8`929`351 transactions took 13 hours producing a 1.8GB database file. [^]

2019-12-15

Review of polarbeard_add_sendrawtransaction_rpc.vpatch

Filed under: Bitcoin, Software — Jacob Welsh @ 20:14

This patch to The Real Bitcoin was published somewhere around 2016 by someone who, as I gather, showed up to dump some fairly verbose and undocumented patches, then didn't stick around to advocate for them or keep them updated for the meanwhile-changing tree.

One of them fills a still-unmet requirement of any split wallet so I'm reviewing it in hopes of saving some time on having to redo from scratch. Full patch is here.

+Value sendrawtransaction(const Array& params, bool fHelp)
+{
+    if (fHelp || params.size() < 1 || params.size() > 1)
+        throw runtime_error(
+            "sendrawtransaction <hex string>\n"
+            "Submits raw transaction (serialized, hex-encoded) to local node and network.");
+
+    vector<unsigned char> txData(ParseHex(params[0].get_str()));
+    CDataStream ssData(txData, SER_NETWORK, VERSION);

Above we see the three-argument form of the CDataStream constructor(i) with explicit arguments matching the defaults for the nTypeIn and nVersionIn parameters.

Some documentation of what, if anything, these parameters are for would be nice (though not a deficiency of this patch). For now it seems to me a reasonable demand that the defaults never change, or at any rate that the costs of such a change be borne by whoever makes it. Thus their inclusion here would be redundant, and not really the helpful kind of redundancy at that, i.e. the kind that catches mistakes. I had omitted them in my getrawtransaction patch.

+    CTransaction tx;
+
+    try {
+        ssData >> tx;
+    }
+    catch (std::exception &e) {
+        throw JSONRPCError(-22, "tx decode failed");
+    }
+

The -22 isn't seen elsewhere in the tree, but matches the number used in the PRB implementation, where it's named RPC_DESERIALIZATION_ERROR.

The condition of excess data in the stream after the transaction does not appear to be detected, which I figure would be nice to have.

+    CTxDB txdb("r");
+    uint256 txHash = tx.GetHash();
+    uint256 hashBlock = 0;
+
+    if (TransactionSeen(txHash, hashBlock))
+    {
+        throw JSONRPCError(-5, string("tx already ") +
+            (hashBlock != 0 ?
+                string("included in block ") + hashBlock.ToString() :
+                "in memory pool"));
+    }
+

Here it gets iffy in my opinion. Just because my node sees the transaction, whether in block or mempool, it doesn't follow that other nodes do too. What if I want to re-broadcast?

Secondarily, the use of zero as a sentinel seems mildly unhygienic as it's a legal hash value, though I suppose if a preimage were found then the whole system would be screwed anyway. Along with the use of argument mutation, I see it more as an indication that too much implicit behavior is being packed into this new TransactionSeen interface, such that its (ONE!) caller needs a hack to do what it actually wants.

+    bool fMissingInputs = false;
+
+    if (!tx.AcceptToMemoryPool(txdb, true, &fMissingInputs) || fMissingInputs)
+    {
+        throw JSONRPCError(-22, string("tx rejected") +
+            (fMissingInputs ? ", missing inputs" : ""));
+    }
+

Perhaps a nitpick, but why bother having error codes if you don't use them to distinguish things? Is the human text part of the API now?

Looking into AcceptToMemoryPool, it doesn't actually do anything with that fMissingInputs besides initializing it to false!

By the earlier argument it's also questionable whether that second fCheckInputs parameter should be true. It activates what looks like assorted anti-spam code of dubious value in this context.

+    SyncWithWallets(tx, NULL, true);
+    CInv inv(MSG_TX, txHash);
+    RelayMessage(inv, tx);
+
+    return txHash.GetHex();
+}

Oh yes, there's code to support handling wallets plural, not that anything is done with it.

The CInv boilerplate is demanded by the poor abstraction of RelayMessage. That function includes some 15-minute expiration mechanism whose purpose and workings aren't readily apparent to me.

Generally, it seems awkward that all these steps are needed here; shouldn't there be a function somewhere that handles transactions received from the network, that we'd just reuse? I'm almost afraid to look.

+// return wether a given transaction is found on mempool or already included in a block
+bool TransactionSeen(const uint256 &hash, uint256 &hashBlock)
+{
+    CRITICAL_BLOCK(cs_mapTransactions)
+        if (mapTransactions.count(hash))
+            return true;
+
+    CTransaction tx;
+    CTxDB txdb("r");
+    CTxIndex txindex;
+
+    if (tx.ReadFromDisk(txdb, COutPoint(hash, 0), txindex))
+    {
+        CBlock block;
+        if (block.ReadFromDisk(txindex.pos.nFile, txindex.pos.nBlockPos, false))
+            hashBlock = block.GetHash();
+        return true;
+    }
+
+    return false;
+}

All the objects getting passed around and mutated here are painful to follow. This appears to do what it says, though I'd much rather be rid of it.

Painfully apparent in the debts of the codebase is documentation of locking protocol. Getting this sort of thing wrong results in non-deterministic failures, either data corruption or deadlock depending on the direction of the mistake.

 // make sure all wallets know about the given transaction, in the given block
-void static SyncWithWallets(const CTransaction& tx, const CBlock* pblock = NULL, bool fUpdate = false)
+void SyncWithWallets(const CTransaction& tx, const CBlock* pblock, bool fUpdate)
 {
     BOOST_FOREACH(CWallet* pwallet, setpwalletRegistered)
         pwallet->AddToWalletIfInvolvingMe(tx, pblock, fUpdate);

Parameter defaults move to the declaration; I gather C++ puts the burden of setting these on the generated calling code, which seems sensible for a statically-compiled language.


 void RegisterWallet(CWallet* pwalletIn);
 void UnregisterWallet(CWallet* pwalletIn);
+void SyncWithWallets(const CTransaction& tx, const CBlock* pblock = NULL, bool fUpdate = false);
+bool TransactionSeen(const uint256 &hash, uint256 &hashBlock);
 bool ProcessBlock(CNode* pfrom, CBlock* pblock);
 bool CheckDiskSpace(uint64 nAdditionalBytes=0);
 FILE* OpenBlockFile(unsigned int nFile, unsigned int nBlockPos, const char* pszMode="rb");

Perhaps polarbeard had other patches that make use of this TransactionSeen. I'd rather this sort of patch not introduce globally-visible interfaces unnecessarily.

TransactionSeen is introduced as a globally-visible interface so that the underlying mapTransactions object stays static, in contrast with my more liberal approach.

In conclusion, the patch appears safe to apply and should do for now for the sake of testing my offline wallet, which might help inform any changes needed here. After some testing I might even end up signing it, though I'd much prefer something simpler and more clearly defined.

  1. And here I long for a self-hosted and accurately cross-referenced i.e. language- and context-aware source browser, but thanks to jurov for this approximate one. [^]
