Draft gbw-node frontend, part 6

Filed under: Bitcoin, Software — Jacob Welsh @ 21:32

Continued from:

The first of the input/output commands is to print a table of possibly-spendable outputs in the format required by the offline signer. While the Bitcoin protocol refers to transactions by 256-bit hash, the more compact confirmation coordinates (height, index) are included for convenience in the comment field. The queries are a bit lengthy, since we now join several tables to build the flat output file, but aren't doing anything too fancy once you break them down. The only difference when a tag is specified is the extra join to filter on its ID.

In some cases, BLOB fields need to be converted back to str.(i)

def cmd_unspent_outs(argv):
	"""unspent-outs [TAG]

	Display the unspent outputs table for addresses with the given TAG (or all watched addresses), as required by the offline wallet, ordered by age."""
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=? \
				ORDER BY block_height DESC', (tag_id,))
	else:
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				WHERE spent IS NULL \
				ORDER BY block_height DESC')
	for a, v, hash, n_out, height, n_tx in r:
		stdout.write('%s %s %s %s #blk %s tx %s\n' % (format_address(str(a)), format_coin(v), b2lx(hash), n_out, height, n_tx))

Idea: Add a command to print the outputs table for a given raw transaction. For example, this would enable spending unconfirmed or too recently confirmed outputs in a pinch, without requiring any further changes. Or more generally: all the data conversion code is already here so might as well make it accessible.

Next we proceed to the accounting commands, as they're really just another kind of output command. The balance of an address set is the total value of unspent outputs to addresses in the set.

def cmd_balance(argv):
	"""balance [TAG]

	Display confirmed balance of addresses with the given TAG (or all watched addresses)."""
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output WHERE spent IS NULL')
	bal, = r.fetchone()
	stdout.write('%s\n' % format_coin(bal))

Things get trickier for the register report, as it attempts to usefully summarize several things in a small space. In particular, summing the incoming and outgoing value per transaction seems to require separate queries since the join criteria differ.(ii)

def cmd_register(argv):
	"""register [TAG]

	Display a tab-delimited transaction register report for addresses with the given TAG (or all watched addresses). Columns are:

	- confirmation block height
	- number of transaction within block
	- total deposits (new outputs)
	- total withdrawals (spent outputs)
	- running balance"""
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
	else:
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
	bal = 0
	for height, n, o_val, i_val in merge_moves(outs.fetchall(), ins.fetchall()):
		bal = bal + o_val - i_val
		stdout.write('%s\t%s\t%s\t%s\t%s\n' % (height, n, format_coin(o_val), format_coin(-i_val), format_coin(bal)))

A helper is used to join the two possibly uneven lists by transaction, inserting zeros for transactions found on only one side. Perhaps it could all be done in SQL with subqueries and some type of outer joins, but I wasn't quite seeing it, so I resorted to the low level with an algorithm reminiscent of the merging step of classical mergesort.

# Merge ordered lists of total input and output values per transaction into single table with columns for both.
def merge_moves(outs, ins):
	i = o = 0

	while True:
		if o == len(outs):
			for height, n, val in ins[i:]:
				yield (height, n, 0, val)
			return
		o_height, o_n, o_val = outs[o]
		o_key = (o_height, o_n)

		if i == len(ins):
			for height, n, val in outs[o:]:
				yield (height, n, val, 0)
			return
		i_height, i_n, i_val = ins[i]
		i_key = (i_height, i_n)

		if o_key < i_key:
			yield (o_height, o_n, o_val, 0)
			o += 1
		elif i_key < o_key:
			yield (i_height, i_n, 0, i_val)
			i += 1
		else:
			yield (o_height, o_n, o_val, i_val)
			i += 1
			o += 1
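To make the merge behavior concrete, here's a small standalone sketch run on toy rows: a re-implementation for illustration (names and sample values are mine, not the wallet's), using the same (height, n) keys and zero-filling.

```python
def merge_moves_sketch(outs, ins):
    """Merge two ordered lists of (height, n, value) rows into
    (height, n, out_value, in_value) rows, zero-filling missing sides."""
    i = o = 0
    while o < len(outs) or i < len(ins):
        o_key = outs[o][:2] if o < len(outs) else None
        i_key = ins[i][:2] if i < len(ins) else None
        if i_key is None or (o_key is not None and o_key < i_key):
            h, n, val = outs[o]
            yield (h, n, val, 0)
            o += 1
        elif o_key is None or i_key < o_key:
            h, n, val = ins[i]
            yield (h, n, 0, val)
            i += 1
        else:
            h, n, o_val = outs[o]
            _, _, i_val = ins[i]
            yield (h, n, o_val, i_val)
            o += 1
            i += 1

# Toy data: one tx with deposits only, one with both sides, one with spends only.
outs = [(100, 0, 50), (105, 2, 30)]
ins = [(105, 2, 10), (110, 1, 25)]
rows = list(merge_moves_sketch(outs, ins))
```

The middle transaction (105, 2) appears on both sides and gets a single merged row; the others are padded with zeros, just as the register report requires.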

Next, the input commands. For sanity's sake, we exclude newlines in tag names as implicitly required by the tags listing format.

def cmd_watch(argv):
	"""watch [TAG]

	Import a set of addresses to watch linewise from stdin, optionally named by the given TAG. Addresses can be associated with multiple tags using multiple watch commands."""
	tag_id = None
	if len(argv) > 0:
		name = argv.pop(0)
		if '\n' in name:
			die('newline not allowed in tag name')
		tag_id = insert_or_get_tag_id(name)
	while True:
		l = stdin.readline()
		if len(l) == 0:
			break
		addr_id = insert_or_get_address_id(parse_address(l.rstrip('\n')))
		if tag_id is not None:
			try:
				db.execute('INSERT INTO address_tag (address_id, tag_id) VALUES (?,?)',
						(addr_id, tag_id))
			except IntegrityError:
				pass
	db.commit()

def cmd_push(argv):
	"""push

	Import raw hex transactions linewise from stdin and send to bitcoind."""
	while True:
		line = stdin.readline()
		if len(line) == 0:
			break
		tx_hex = line.rstrip('\n')
		stdout.write('txid %s\n' % rpc('sendrawtransaction', tx_hex))

General or command-specific help, and a command registry allowing abbreviation:

def cmd_help(argv):
	"""help [COMMAND]

	Display help for a given command or list all commands."""
	if len(argv) > 0:
		name = argv.pop(0)
		name, func = get_command(name)
		doc = getdoc(func)
		if doc is None:
			stdout.write('No help for %r\n' % name)
		else:
			stdout.write('gbw-node %s\n' % doc)
	else:
		stdout.write('''Usage: gbw-node COMMAND [ARGS]

Available commands (can be abbreviated when unambiguous):

%s
''' % '\n'.join([name for name, f in cmds]))

cmds = (
	('help', cmd_help),
	('scan', cmd_scan),
	('reset', cmd_reset),
	('tags', cmd_tags),
	('addresses', cmd_addresses),
	('unspent-outs', cmd_unspent_outs),
	('watch', cmd_watch),
	('push', cmd_push),
	('balance', cmd_balance),
	('register', cmd_register),
)

def get_command(name):
	rows = [r for r in cmds if r[0].startswith(name)]
	if len(rows) == 0:
		die('command not found: %s' % name)
	if len(rows) > 1:
		die('ambiguous command %s. Completions: %s' % (name, ' '.join([r[0] for r in rows])))
	return rows[0]

When invoked as a program (as opposed to imported elsewhere e.g. for testing), we connect to the database, enable foreign key constraints, and boost cache size and checkpoint interval from the meager defaults. These can be tuned if needed to optimize the scan process for your machine. Finally we dispatch to the given command.

Ideally, we'd create the database from schema here if not found.

def main():
	global db
	signal.signal(signal.SIGINT, signal.SIG_DFL)
	db = sqlite3.connect(gbw_home + '/db', timeout=600) # in seconds
	db.execute('PRAGMA foreign_keys=ON')
	db.execute('PRAGMA cache_size=-8000') # negative means in KiB
	db.execute('PRAGMA wal_autocheckpoint=10000') # in pages (4k)
	if len(argv) < 2:
		die('missing command', help=True)
	name, func = get_command(argv[1])
	func(argv[2:])

if __name__ == '__main__':
	main()

This concludes the node frontend. Congratulations if you've followed thus far! There's no magic in programming, just a ruthless decomposition of bigger problems into smaller ones, a search for useful and robust abstractions -- and of course a whole lot of background reading and practice.

In the next month or two I will be completing the missing pieces of the signer; meanwhile, the code here is quite ready to play with. Import some addresses, run a scan, run the reports, and let me know how it goes in the comments below.

  1. Well, so far the only such case is format_address, so perhaps it should just be changed to allow passing a buffer. [^]
  2. It's looking like the COALESCE trick is pointless here, since rows are only generated by the join when matching outputs are present; that is, the SUM aggregation is always getting at least one row. Was I overzealous before? I don't recall if I observed an actual problem here rather than just in cmd_balance. It does no harm to leave in though, at least as far as correctness. [^]


Draft gbw-node frontend, part 5

Filed under: Bitcoin, Software — Jacob Welsh @ 19:02

Continued from:

Command implementations

The core scanning logic is in a helper function that takes a block's height and a memory view of its contents.

Referential integrity between blocks is ensured by scanning sequentially by height; that is, all relevant tx and output records from prior blocks will be known by the time we see the inputs that spend them. However, as far as I know this topological ordering is not guaranteed for the transaction sequence within a block (e.g. tx 1 could spend outputs of tx 2, or vice versa), so we do separate passes over the transaction list for outputs and inputs.

def scan_block(height, v):
	stdout.write('block %s' % height)
	# [perf] computing every tx hash
	(blkhash, prev, time, target, txs), size = load_block(v)

The performance comment above was just to note some not-strictly-necessary work being done, in case the scan ended up horribly slow.(i)

An output is relevant if its script is standard and pays a known address. At least with foreign key constraints enabled, we can't insert an output until the tx record it references exists, but we don't know whether to insert the tx until we see if any of its outputs are relevant, so we again use a two-pass approach.

	count_out = 0
	n_tx = 0
	for (hash, size, txins, txouts) in txs:
		matched_outs = []
		for n, txout in enumerate(txouts):
			val, script = txout
			a = out_script_address(script)
			if a is not None:
				#print format_address(a)
				addr_id = get_address_id(a)
				if addr_id is not None:
					matched_outs.append((n, addr_id, val))
		if len(matched_outs) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, addr_id, val in matched_outs:
				insert_output(tx_id, n, addr_id, val)
			count_out += len(matched_outs)
		n_tx += 1
	stdout.write(' new-outs %s' % count_out)

An input is relevant if it spends a known output. Recall that insert_input updates the corresponding output to create the back-reference, indicating it has been spent.

	# Inputs scanned second in case an output from the same block is spent.
	# Coinbase (input of first tx in block) doesn't reference anything.
	count_in = 0
	n_tx = 1
	for (hash, size, txins, txouts) in txs[1:]:
		matched_ins = []
		for n, txin in enumerate(txins):
			prevout_hash, prevout_n, scriptsig = txin
			prevout_tx_id = get_tx_id(prevout_hash)
			if prevout_tx_id is not None:
				prevout_id = get_output_id(prevout_tx_id, prevout_n)
				if prevout_id is not None:
					matched_ins.append((n, prevout_id))
		if len(matched_ins) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, prevout_id in matched_ins:
				insert_input(tx_id, n, prevout_id)
			count_in += len(matched_ins)
		n_tx += 1
	stdout.write(' spent-outs %s\n' % count_in)

Assorted helpers: handling usage errors; looking up a tag ID that must exist.

def die(msg, help=False):
	stderr.write('gbw-node: %s\n' % msg)
	if help:
		cmd_help([])
	exit(1)

def require_tag(name):
	i = get_tag_id(name)
	if i is None:
		die('tag not found: %r' % name)
	return i

The entry point for any user command "X" is the function "cmd_X", having help text in its docstring and taking a list of any supplied CLI arguments past the command name.

First, the sync commands. The scan process commits one database transaction per block.

def cmd_scan(argv):
	"""scan

	Iterate blocks from bitcoind, indexing transaction inputs and outputs affecting watched addresses. May be safely interrupted and resumed.

	NOT PRESENTLY SAFE TO RUN CONCURRENT INSTANCES due to the dumpblock to named pipe kludge."""
	db.execute('PRAGMA synchronous=NORMAL')
	height = db.execute('SELECT scan_height FROM state').fetchone()[0]
	blockcount = max(-1, rpc('getblockcount') - CONFIRMS)
	while height < blockcount:
		height += 1
		scan_block(height, memoryview(getblock(height)))
		db.execute('UPDATE state SET scan_height = ?', (height,))
		db.commit()

def cmd_reset(argv):
	"""reset

	Reset the scan pointer so the next scan will proceed from the genesis block, to find transactions associated with newly watched addresses."""
	db.execute('UPDATE state SET scan_height = -1')
	db.commit()

Next, commands to query the watched address sets (not in the original spec but trivial and clearly useful).

def cmd_tags(argv):
	"""tags

	List all tag names."""
	for name, in db.execute('SELECT name FROM tag'):
		stdout.write(name + '\n')

def cmd_addresses(argv):
	"""addresses [TAG]

	List addresses with the given TAG (or all watched addresses)."""
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address FROM address \
				JOIN address_tag ON address.address_id=address_tag.address_id \
				WHERE tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT address FROM address')
	for a, in r:
		stdout.write(format_address(str(a)) + '\n')

To be continued.

  1. I've found the Python profiler quite useful so far compared to such guesswork; still, optimization is something of a balance between experimentally-driven efforts and not doing obviously wasteful things from the start. [^]

Draft gbw-node frontend, part 4

Filed under: Bitcoin, Software — Jacob Welsh @ 04:36

Continued from:

Common database operations

As an internal convention, a "get_X_id" function will return the database ID for the row in table "X" named by its bulkier external reference, or None if not found. Similarly, "insert_or_get_X_id" will insert a row if needed and in either case return the ID. Some of these have only a single caller, but I find that collecting the various similar queries in one place and wrapping them into tidy functions helps readability.

The mapping of Python to SQLite types is fairly straightforward, except that buffer is needed to specify a BLOB.

The "parameter substitution" feature is used throughout, avoiding improper mixing of code and data that could manifest as SQL injection or thrashing the compiled statement cache.

def get_address_id(a):
	r = db.execute('SELECT address_id FROM address WHERE address=?', (buffer(a),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_address_id(a):
	i = get_address_id(a)
	if i is not None:
		return i
	return db.execute('INSERT INTO address (address) VALUES (?)', (buffer(a),)).lastrowid

def get_tx_id(hash):
	r = db.execute('SELECT tx_id FROM tx WHERE hash=?', (buffer(hash),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tx_id(hash, blkhash, height, n, size):
	try:
		return db.execute('INSERT INTO tx (hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)',
				(buffer(hash), buffer(blkhash), height, n, size)).lastrowid
	except IntegrityError:
		# XXX check equality?
		return get_tx_id(hash)

I now think we should indeed catch that condition (differing transactions with identical hash), especially given the possibility of TXID collisions. Perhaps I left it out from excessive worry about scan performance. Or just laziness.

The mixture of check-first and try-first styles seen above also doesn't sit well. The possibility of TOCTTOUs,(i) depending on the details of transaction isolation level, would seem to make a strong case for try-first. It's a minor point though; the worst case here would be an uncaught IntegrityError halting the program gracefully.
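The try-first style can be seen in isolation with a throwaway in-memory SQLite database (the table and names here are illustrative, not the wallet's schema): the INSERT is attempted unconditionally and the IntegrityError from the unique constraint routes the duplicate case to a SELECT, leaving no window between check and use.

```python
import sqlite3
from sqlite3 import IntegrityError

# Throwaway database with a unique constraint, to exercise the pattern.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT UNIQUE)')

def insert_or_get_tag_id_tryfirst(name):
    # Try-first: attempt the insert, fall back to lookup on conflict.
    try:
        return db.execute('INSERT INTO tag (name) VALUES (?)', (name,)).lastrowid
    except IntegrityError:
        return db.execute('SELECT tag_id FROM tag WHERE name=?', (name,)).fetchone()[0]

a = insert_or_get_tag_id_tryfirst('hot')
b = insert_or_get_tag_id_tryfirst('hot')  # second call hits the except branch
assert a == b
```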

def insert_output(tx_id, n, addr_id, val):
	try:
		db.execute('INSERT INTO output (tx_id, n, address_id, value) VALUES (?,?,?,?)',
				(tx_id, n, addr_id, val))
	except IntegrityError:
		r = db.execute('SELECT address_id, value FROM output WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()
		if r != (addr_id, val):
			raise Conflict('output differs from previous content', tx_id, n, (addr_id, val), r)

def insert_input(tx_id, n, prevout_id):
	try:
		input_id = db.execute('INSERT INTO input (tx_id, n) VALUES (?,?)', (tx_id, n)).lastrowid
	except IntegrityError:
		input_id = db.execute('SELECT input_id FROM input WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()[0]
	db.execute('UPDATE output SET spent=? WHERE output_id=?', (input_id, prevout_id))

def get_output_id(tx_id, n):
	r = db.execute('SELECT output_id FROM output WHERE tx_id=? AND n=?', (tx_id, n)).fetchone()
	return None if r is None else r[0]

def get_tag_id(name):
	r = db.execute('SELECT tag_id FROM tag WHERE name=?', (name,)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tag_id(name):
	i = get_tag_id(name)
	if i is not None:
		return i
	return db.execute('INSERT INTO tag (name) VALUES (?)', (name,)).lastrowid

Next up, we'll finally get to implementing the commands themselves. To be continued.

  1. The "time of check to time of use" race condition. You know, like sitting down when some trickster's meanwhile moved the chair. [^]


Draft gbw-node frontend, part 3

Filed under: Bitcoin, Software — Jacob Welsh @ 18:02

Continued from:


Bitcoin addresses are conventionally written in a special-purpose encoding and include a hash truncated to 32 bits for error detection. As the reference implementation explains:

Why base-58 instead of standard base-64 encoding?
- Don't want 0OIl characters that look the same in some fonts and could be used to create visually identical looking account numbers.
- A string with non-alphanumeric characters is not as easily accepted as an account number.
- E-mail usually won't line-break if there's no punctuation to break at.
- Doubleclicking selects the whole number as one word if it's all alphanumeric.

Of course, all these points would have been answered just as well by hexadecimal, and without the various burdens: case-sensitivity for the user (the surest way I've found to read these out is the fully explicit: "one five big-A three little-X ..."); more code for the implementer; and more work for the machine (as the lack of bit alignment demands a general base conversion algorithm).

We start with lookup tables to convert the digits 0-57 to the specified alphabet and back. I was once surprised to learn the scope of iteration variables in a Python "for" loop is not restricted to the loop body: a potential source of referential confusion, reflecting the language's casual approach to mutation. Thus, when at the global scope I like to ensure throwaway names, like "index" and "character" here, are safely contained in a function.

base58_alphabet = (string.digits + string.uppercase + string.lowercase).translate(None, '0OIl')
base58_inverse = [None]*256
def init_base58_inverse():
	for index, character in enumerate(base58_alphabet):
		base58_inverse[ord(character)] = index
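The scoping behavior mentioned above is easy to observe directly (Python 3 shown here; Python 2 acts the same way at module scope):

```python
# The loop variable remains bound after the loop ends:
for index in range(3):
    pass
assert index == 2  # still visible in the enclosing scope

# Wrapping the loop in a function keeps the throwaway name local:
def contained():
    for inner in range(3):
        pass

contained()
assert 'inner' not in globals()  # the name did not leak out
```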

To do base conversion we'll need to treat byte sequences as integers with the same ordering conventions as the reference code. Otherwise put: to decode from base-256 to abstract integers. Python 2 doesn't have a builtin for this. The algorithm is not optimal, but the base-58 part will be worse anyway.

def bytes_to_int(b):
	"Convert big-endian byte sequence to unsigned integer"
	i = 0
	for byte in b:
		i = (i << 8) + ord(byte)
	return i
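For comparison, a Python 3 rendition of the same loop (iterating bytes yields integers there, so the ord drops out), checked against the builtin big-endian conversion:

```python
def bytes_to_int3(b):
    "Convert big-endian byte sequence to unsigned integer (Python 3 sketch)."
    i = 0
    for byte in b:
        i = (i << 8) + byte  # bytes iterate as ints in Python 3
    return i

assert bytes_to_int3(b'\x01\x00') == 256
assert bytes_to_int3(b'\xff\xff') == 65535
# Agrees with the builtin:
assert bytes_to_int3(b'\x00\xab\xcd') == int.from_bytes(b'\x00\xab\xcd', 'big')
```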

To complete the bytes-to-ASCII converter we extract digits from the integer, least significant first, by iterated division with remainder by 58. Since the conversion to integer loses track of field width, the convention is to pad with the same number of base-58 zeros as there were base-256 leading zeros in the input. In further fallout from using a non-bit-aligned encoding, these are not naturally constant time or constant control-flow operations.

For the same bit cost of the error detection code we could have had error correction. But that would have required, like, math, and stuff.

def b2a_base58check(data):
	data += sha256d(data)[:4]

	leading_zeros = 0
	for b in data:
		if b != '\x00':
			break
		leading_zeros += 1

	data_num = bytes_to_int(data)

	digits = []
	while data_num:
		data_num, digit = divmod(data_num, 58)
		digits.append(digit)
	digits.extend([0] * leading_zeros)

	return ''.join(base58_alphabet[digit] for digit in reversed(digits))

Converting back to bytes uses the inverse operation at each step, but now there are cases of invalid input to reject: digits outside the specified alphabet and corruption detected by the checksum. (The precise function decomposition is a bit arbitrary and asymmetrical I'll admit.)

class Base58Error(ValueError):
	pass

class BadDigit(Base58Error):
	pass

class BadChecksum(Base58Error):
	pass

def a2b_base58(data):
	digits = [base58_inverse[ord(b)] for b in data]
	if None in digits:
		raise BadDigit

	leading_zeros = 0
	for digit in digits:
		if digit != 0:
			break
		leading_zeros += 1

	data_num = 0
	for digit in digits:
		data_num = 58*data_num + digit

	data_bytes = []
	while data_num:
		data_bytes.append(data_num & 0xFF)
		data_num = data_num >> 8
	data_bytes.extend([0] * leading_zeros)

	return ''.join(chr(b) for b in reversed(data_bytes))

def a2b_base58check(data):
	data = a2b_base58(data)
	payload = data[:-4]
	check = data[-4:]
	if check != sha256d(payload)[:4]:
		raise BadChecksum
	return payload

Finally we apply this encoding to Bitcoin addresses, which have a fixed 160-bit width plus an extra "version" byte that becomes the familiar leading "1".

class BadAddressLength(ValueError):
	pass

class BadAddressVersion(ValueError):
	pass

def parse_address(a):
	b = a2b_base58check(a)
	if len(b) != 21:
		raise BadAddressLength
	if b[0] != '\x00':
		raise BadAddressVersion(ord(b[0]))
	return b[1:]

def format_address(b):
	return b2a_base58check('\x00' + b)
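Here's the whole chain condensed into a self-contained Python 3 sketch (bytes in place of Python 2 strings, and int.from_bytes/to_bytes in place of the hand-rolled loops; a re-implementation for experimenting, not the article's code) with a roundtrip check:

```python
from hashlib import sha256

# Base-58 alphabet: 0, O, I, l removed.
B58 = '123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz'
B58_INV = {c: i for i, c in enumerate(B58)}

def sha256d(b):
    return sha256(sha256(b).digest()).digest()

def b2a_base58check(data):
    data += sha256d(data)[:4]
    pad = len(data) - len(data.lstrip(b'\x00'))  # leading zero bytes -> '1's
    num = int.from_bytes(data, 'big')
    out = ''
    while num:
        num, d = divmod(num, 58)
        out = B58[d] + out
    return '1' * pad + out

def a2b_base58check(s):
    num = 0
    for c in s:
        num = num * 58 + B58_INV[c]
    pad = len(s) - len(s.lstrip('1'))  # leading '1's -> zero bytes
    data = b'\x00' * pad + num.to_bytes((num.bit_length() + 7) // 8, 'big')
    payload, check = data[:-4], data[-4:]
    if check != sha256d(payload)[:4]:
        raise ValueError('bad checksum')
    return payload

# Roundtrip on an arbitrary 160-bit hash with the 0x00 version byte:
h160 = b'\x07' * 20
addr = b2a_base58check(b'\x00' + h160)
assert a2b_base58check(addr) == b'\x00' + h160
assert addr.startswith('1')  # version byte 0x00 shows as a leading "1"
```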

All this format conversion groundwork out of the way, we'll start talking to the database and putting it all together. To be continued!


Draft gbw-node frontend, part 2

Filed under: Bitcoin, Software — Jacob Welsh @ 18:52

Continued from schema and part 1. Source:

What follows is a kluge and probably the sorest point of the thing in my mind. I will attempt to retrace the decision tree that led to it.

Ultimately what we're after is a way to examine all confirmed transactions so as to filter for relevant ones and record these in the database ("scanning"). Since on disk they're already grouped into blocks of constrained size, it makes sense from an I/O standpoint to load them blockwise. But how? We could read directly from the blk####.dat files, but this would go against the "loose coupling" design. Concretely, this manifests as several messes waiting to happen: do we know for sure that only validated blocks get written to these files? How will we detect and handle blocks that were once valid but abandoned when a better chain was found? Might we read a corrupt block if the node is concurrently writing? Much more sensible would be to work at the RPC layer which abstracts over internal storage details and provides the necessary locking.

Unfortunately there is presently no getblock method in TRB, only dumpblock which strikes me as more of a debugging hack than anything else, returning its result by side effect through the filesystem. I worried this could impose a substantial cost in performance and SSD wear, once multiplied across the many blocks and possible repeated scans, especially if run on a traditional filesystem with more synchronous operations than the likes of ext4. It occurred to me to use a named pipe to allow the transfer to happen in memory without requiring changes to the TRB code. I took this route, after some discussion with management confirmed that minimizing TRB changes was preferable. I hadn't really spelled out the thinking on files versus pipes, or anticipated what now seems a fundamental problem with the pipe approach: if the reading process aborts for any reason, the write operation in bitcoind can't complete; in fact, due to coarse locking, the whole node ends up frozen. (This can be cleaned up manually e.g. by cat ~/.gbw/blockpipe >/dev/null).

Given this decision, the next problem was the ordering of reads and writes, as reading the block data through the pipe must complete before the RPC method will return. Thus some sort of threading is needed: either decomposing the RPC client into separate send and receive parts so as to read the pipe in between, or using a more general facility. Hoping to preserve the abstraction of the RPC code, I went with the latter in the form of Python's thread support. I tend to think this was a mistake; besides all the unseen complexity it invokes under the hood, it turned out the threading library provides no reliable way to cancel a thread, which I seemed to need for the case where the RPC call returns an error without sending data. I worked around by making the reader a long-lived "daemon" thread, which ensures it terminates with the program, and using an Event object to synchronize handoff of the result through a global variable.

getblock_thread = None
getblock_done = Event()
getblock_result = None
def getblock_reader(pipe):
	global getblock_result
	while True:
		fd = os_open(pipe, O_RDONLY)
		getblock_result = read_all(fd)
		close(fd)
		getblock_done.set()

def getblock(height):
	global getblock_thread
	pipe = gbw_home + '/blockpipe'
	if getblock_thread is None:
		getblock_thread = Thread(target=getblock_reader, args=(pipe,))
		getblock_thread.daemon = True
		getblock_thread.start()
	if not rpc('dumpblock', height, pipe):
		raise ValueError('dumpblock returned false')
	getblock_done.wait()
	getblock_done.clear()
	return getblock_result
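The daemon-thread-plus-Event handoff can be seen stripped of the pipe and RPC details; a generic sketch (all names here are illustrative) of the same synchronization pattern:

```python
from threading import Thread, Event

# A long-lived daemon thread produces results, an Event signals each one
# is ready, and a global carries the payload to the main thread.
result = None
done = Event()

def reader(source):
    global result
    for item in source:
        result = item * 2  # stand-in for reading a block from the pipe
        done.set()

t = Thread(target=reader, args=([21],))
t.daemon = True  # ensures the thread dies with the program
t.start()
done.wait()      # block until the reader hands off a result
done.clear()     # reset for the next round
assert result == 42
```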

The astute reader may notice a final and rather severe "rake" waiting to be stepped on: a fixed filesystem path is used for the named pipe, so concurrent scan processes will interfere, with corrupted blocks and all sorts of possible failures to follow. While I'm not sure there's any reason to run such concurrent scans, at least this possibility should be excluded. I'm thinking the process ID would make a sufficiently unique identifier to add to the filename, but then the pipe would need to be removed afterward, and what if the process dies first? Yet more code to clean up possibly-defunct pipes? Or one could keep the fixed name and use filesystem locking. At any rate, I hope this all makes it clear why I'm not too enamored of dumpblock.

To be continued.


Errata for gbw-node drafts to date, and Bitcoin txid collisions

Filed under: Bitcoin, Software — Jacob Welsh @ 18:11

I've discovered a few mistakes in my wallet code published so far, one from pure carelessness and two from insufficient cognizance of environmental hazards, which together seem interesting enough for a dedicated article.

The first is in the HTTP Basic authentication header in the JSON-RPC client: I slipped in an erroneous space character after the colon in the base64 input, in the course of a hasty and perhaps excessive last-minute style cleanup. I discovered this through basic testing as it broke authentication altogether. I will update the article but preserve the original file for whoever cares to diff.

The second, found in code not yet written up, is the obnoxious but standardized behavior of the SQL SUM function, whereby the sum of the empty set is NULL rather than zero. In Python this becomes the None object, which my code then passes to functions that expect integers. I first tried to work around at the Python level, but soon found this to be awkward, especially for queries returning multiple columns where some were perfectly well-behaved integers. A fix closer to the source of the problem is found in the standard SQL COALESCE function, though having to use this as boilerplate around every use of SUM is not exactly satisfying.
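The behavior is easy to reproduce in a throwaway in-memory database (table name illustrative):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE output (value INTEGER)')

# SUM over the empty set is NULL, surfacing in Python as None:
bal, = db.execute('SELECT SUM(value) FROM output').fetchone()
assert bal is None

# COALESCE substitutes zero right at the source:
bal, = db.execute('SELECT COALESCE(SUM(value),0) FROM output').fetchone()
assert bal == 0
```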

The above fixes are in: draft2/

The third and deepest might not even be a problem in practice, but seems to warrant further investigation. From the schema:

CREATE UNIQUE INDEX i_tx_hash ON tx(hash);

The problem is that Bitcoin doesn't guarantee uniqueness of transaction contents - miners can use identical coinbase inputs - despite the fact that the implementation assumes unique transaction hashes! The possibility of collision was realized in 2010,(i) condemning all future implementers to bugwise compatibility, whatever that means. The Power Rangers in their boundless wisdom addressed this in BIP 30 except without solving all that much. Further discussions I've been chewing on include Mirco… Mezzo… Macroflation—Overheated Economy (archived) and of course the forum log. Takeaways so far are that 1) the sky is probably (still) not falling regarding Bitcoin itself, but 2) this could possibly be used by malicious peers to hard-wedge syncing TRB nodes.

The conservative approach for my program would seem to be leaving the schema as is and letting SQL throw an integrity error if you try to monitor the addresses in question. Relaxing the unique constraint should be possible with some further changes to the code, but the question would arise of how exactly the quasi-duplicate outputs should be interpreted. Please speak up in the comments if you know of further references or conclusions on this topic!

  1. Block pair 91722/91880 paying address 1GktTvnY8KGfAS72DhzGYJRyaQNvYrK9Fg and 91812/91842 paying address 16va6NxJrMGe5d2LP6wUzuVnzBBoKQZKom. [^]


Draft gbw-node frontend, part 1

Filed under: Bitcoin, Software — Jacob Welsh @ 01:18

The database schema for the "node" part of Gales Bitcoin Wallet covered, we now proceed to the frontend program that puts it to work: collecting data from bitcoind, parsing its various binary encodings and extracting something useful.

Source: draft1/ draft2/ (and the previous schema-node.sql).

You'd be well advised to read the downloaded thing prior to executing, especially since it's in an unsigned draft state. As for what's necessary to graduate to a vpatch I'd be ready to sign, my thinking is that it's this review and annotation process itself, plus whatever important changes come out of it, and the previously suggested schema tweaks (since changing that is the most obnoxious part once deployed).

At present there's not much of an installation process and the database is initialized manually. I'd suggest creating some directory to hold the two sources. Then from that directory:

$ chmod +x
$ mkdir ~/.gbw
$ sqlite3 ~/.gbw/db < schema-node.sql
$ ./ help

In preparing this code for publication I observed that I had continued by force of habit (and editor settings) with the Python style guidelines of four-space indents and some fixed line width limit, in opposition to Republican doctrine. I've attempted to clean it up such that line breaks occur only for good reasons, though I can't say I'm happy with how my browser wraps the long lines. And it's not like I expect the poor thing to know good indentation rules for every possible programming language now... wut do?!


We start with the usual Pythonistic pile of imports. The ready libraries are a big reason the language is hard to beat for getting things working quickly, and at the same time a dangerous temptation toward thinking you don't need to care what's inside them.

# J. Welsh, December 2019

from os import getenv, open as os_open, O_RDONLY, O_WRONLY, mkdir, mkfifo, read, write, close, stat
from stat import S_ISDIR, S_ISFIFO
from sys import argv, stdin, stdout, stderr, exit
from socket import socket
from threading import Thread, Event
from binascii import a2b_hex, b2a_hex
from base64 import b64encode
from struct import Struct
from hashlib import sha256 as _sha256
from decimal import Decimal
from inspect import getdoc
import errno
import signal
import string
import json
import sqlite3
from sqlite3 import IntegrityError

The above are all in the standard library, assuming they're enabled on your system. The ones that stick out like sore thumbs to me are threading and decimal; more on these to come.

As the comments say:

# Safety level: scanning stops this many blocks behind tip

# There's no provision for handling forks/reorgs. In the event of one deeper than CONFIRMS, a heavy workaround would be:
#   $ sqlite3 ~/.gbw/db
#   sqlite> DELETE FROM output;
#   sqlite> DELETE FROM input;
#   sqlite> DELETE FROM tx;
#   sqlite> .exit
#   $ gbw-node reset
#   $ gbw-node scan

At least a semi-automated and lighter-touch recovery procedure would certainly be nice there.

gbw_home = getenv('HOME') + '/.gbw'
bitcoin_conf_path = getenv('HOME') + '/.bitcoin/bitcoin.conf'

# Further knobs in main() for database tuning.
db = None

This Is The Database; Use It.

For reasons I don't quite recall (probably interpreting hashes as integers, combined with pointer type punning - an unportable C programming practice common in Windows-land), bitcoind ended up reversing byte order compared to the internal representation for hex display of certain things including transaction and block hashes. Thus we have "bytes to little-endian hex" wrappers.

b2lx = lambda b: b2a_hex(b[::-1])
lx2b = lambda x: a2b_hex(x)[::-1]
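For instance, round-tripping the well-known genesis block hash (a quick sketch; the byte-string literals work the same under Python 2 and 3):

```python
from binascii import a2b_hex, b2a_hex

b2lx = lambda b: b2a_hex(b[::-1])
lx2b = lambda x: a2b_hex(x)[::-1]

# The genesis block hash as bitcoind displays it:
gen = b'000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f'
raw = lx2b(gen)
# Internally the "leading" hex zeros sit at the tail end of the byte string:
assert raw[-4:] == b'\x00\x00\x00\x00'
assert b2lx(raw) == gen
```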

Not taking any chances with display of monetary amounts, a function to convert integer Satoshi values to fixed-point decimal BTC notation. The remainder/modulus operators have varying definitions between programming languages (sometimes even between implementations of the same language!) when it comes to negative inputs, so we bypass the question.

def format_coin(v):
	neg = False
	if v < 0:
		v = -v
		neg = True
	s = '%d.%08d' % divmod(v, 100000000)
	if neg:
		return '-' + s
	return s
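A few spot checks, the negative path being the one a naive modulus would botch:

```python
def format_coin(v):
	neg = False
	if v < 0:
		v = -v
		neg = True
	s = '%d.%08d' % divmod(v, 100000000)
	if neg:
		return '-' + s
	return s

assert format_coin(0) == '0.00000000'
assert format_coin(1) == '0.00000001'
assert format_coin(-1) == '-0.00000001'
assert format_coin(2100000000000000) == '21000000.00000000'
```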

Preloading and giving more intelligible names to some "struct" based byte-packing routines.

u16 = Struct('<H')
u32 = Struct('<I')
u64 = Struct('<Q')
s64 = Struct('<q')
unpack_u16 = u16.unpack
unpack_u32 = u32.unpack
unpack_u64 = u64.unpack
unpack_s64 = s64.unpack
unpack_header = Struct('<i32s32sIII').unpack
unpack_outpoint = Struct('<32sI').unpack
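To make the little-endian convention concrete, a couple of checks; note the header format string adds up to 4+32+32+4+4+4 = 80 bytes, the full block header:

```python
from struct import Struct

u16 = Struct('<H')
u32 = Struct('<I')
unpack_header = Struct('<i32s32sIII').unpack

# Little-endian: least significant byte comes first.
assert u16.unpack(b'\x34\x12') == (0x1234,)
assert u32.unpack(b'\x01\x00\x00\x00') == (1,)
# An 80-byte header unpacks into six fields.
assert len(unpack_header(b'\x00' * 80)) == 6
```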

Some shorthand for hash functions.

def sha256(v):
	return _sha256(v).digest()

def sha256d(v):
	return _sha256(_sha256(v).digest()).digest()
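For instance, the single SHA-256 of the empty string is a published test vector, and the double hash is simply the hash of that digest:

```python
from binascii import b2a_hex
from hashlib import sha256 as _sha256

def sha256(v):
	return _sha256(v).digest()

def sha256d(v):
	return _sha256(_sha256(v).digest()).digest()

assert b2a_hex(sha256(b'')) == b'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'
assert sha256d(b'') == sha256(sha256(b''))
assert len(sha256d(b'')) == 32
```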

An exception type to indicate certain "should not happen" database inconsistencies.

class Conflict(ValueError):
	pass

For reading a complete stream from a low-level file descriptor; experience has led me to be suspicious of Python's file objects.

def read_all(fd):
	parts = []
	while True:
		part = read(fd, 65536)
		if len(part) == 0:
			break
		parts.append(part)
	return ''.join(parts)

Ensuring needed filesystem objects exist.

def require_dir(path):
	try:
		mkdir(path)
	except OSError, e:
		if e.errno != errno.EEXIST:
			raise
		if not S_ISDIR(stat(path).st_mode):
			die('not a directory: %r' % path)

def require_fifo(path):
	try:
		mkfifo(path)
	except OSError, e:
		if e.errno != errno.EEXIST:
			raise
		if not S_ISFIFO(stat(path).st_mode):
			die('not a fifo: %r' % path)

RPC client

Bitcoind uses a password-authenticated JSON-RPC protocol. I expect this is one of the more concise client implementations around.

class JSONRPCError(Exception):
	"Error returned in JSON-RPC response"

	def __init__(self, error):
		super(JSONRPCError, self).__init__(error['code'], error['message'])

	def __str__(self):
		return 'code: {}, message: {}'.format(*self.args)

Some of this code was cribbed from earlier experiments on my shelf. The fancy exception class above doesn't really look like my style; it may have hitchhiked from an outside JSON-RPC library.

The local bitcoin.conf is parsed to get the node's credentials. This is done lazily to avoid unnecessary error conditions for the many commands that won't be needing it.

bitcoin_conf = None
def require_conf():
	global bitcoin_conf
	if bitcoin_conf is None:
		bitcoin_conf = {}
		with open(bitcoin_conf_path) as f:
			for line in f:
				line = line.split('#', 1)[0].rstrip()
				if not line:
					continue
				k, v = line.split('=', 1)
				bitcoin_conf[k.strip()] = v.lstrip()

Side note: I detest that "global" keyword hack. It's "necessary" only because variable definition is conflated with mutation in the single "=" operator, and completely misses the case of a nested function setting a variable in an outer but not global scope. ("So they added 'nonlocal' in Python 3, solves your problem!!")

def rpc(method, *args):
	require_conf()
	host = bitcoin_conf.get('rpcconnect', '')
	port = int(bitcoin_conf.get('rpcport', 8332))
	auth = 'Basic ' + b64encode('%s:%s' % (
		bitcoin_conf.get('rpcuser', ''),
		bitcoin_conf.get('rpcpassword', '')))
	payload = json.dumps({'method': method, 'params': args})
	headers = [
		('Host', host),
		('Content-Type', 'application/json'),
		('Content-Length', len(payload)),
		('Connection', 'close'),
		('Authorization', auth),
	]
	msg = 'POST / HTTP/1.1\r\n%s\r\n\r\n%s' % ('\r\n'.join('%s: %s' % kv for kv in headers), payload)
	sock = socket()
	sock.connect((host, port))
	sock.sendall(msg)
	response = read_all(sock.fileno())
	headers, payload = response.split('\r\n\r\n', 1)
	r = json.loads(payload, parse_float=Decimal)
	if r['error'] is not None:
		raise JSONRPCError(r['error'])
	return r['result']

I could see removing the "parse_float=Decimal", and thus the corresponding import, as we won't be calling here any of the problematic interfaces that report monetary values as JSON numbers. But then, I'd also see value in one RPC client implementation that can just be copied for whatever use without hidden hazards.

Bitcoin data parsing

Now things might get interesting. To parse the serialized data structures in a manner similar to the C++ reference implementation and hopefully efficient besides, I used memory views, basically bounds-checking pointers.(i)

# "load" functions take a memoryview and return the object and number of bytes consumed.

def load_compactsize(v):
	# serialize.h WriteCompactSize
	size = ord(v[0])
	if size < 253:
		return size, 1
	elif size == 253:
		return unpack_u16(v[1:3])[0], 3
	elif size == 254:
		return unpack_u32(v[1:5])[0], 5
	else:
		return unpack_u64(v[1:9])[0], 9

def load_string(v):
	# serialize.h Serialize, std::basic_string and CScript overloads
	n, i = load_compactsize(v)
	return v[i:i+n].tobytes(), i+n
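Here's how the decoder behaves on each size class, in a self-contained sketch adjusted for Python 3, where indexing a memoryview already yields an int (the post's code, being Python 2, needs the ord):

```python
from struct import Struct

unpack_u16 = Struct('<H').unpack
unpack_u32 = Struct('<I').unpack
unpack_u64 = Struct('<Q').unpack

def load_compactsize(v):
	size = v[0]  # Python 3: no ord() needed on a memoryview element
	if size < 253:
		return size, 1
	elif size == 253:
		return unpack_u16(v[1:3])[0], 3
	elif size == 254:
		return unpack_u32(v[1:5])[0], 5
	else:
		return unpack_u64(v[1:9])[0], 9

def load_string(v):
	n, i = load_compactsize(v)
	return v[i:i+n].tobytes(), i+n

assert load_compactsize(memoryview(b'\xfc')) == (252, 1)
assert load_compactsize(memoryview(b'\xfd\x00\x01')) == (256, 3)
assert load_compactsize(memoryview(b'\xfe\x01\x00\x01\x00')) == (65537, 5)
# A length-prefixed string: 3 bytes of payload, trailing garbage ignored.
assert load_string(memoryview(b'\x03abcxxx')) == (b'abc', 4)
```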

def vector_loader(load_element):
	# serialize.h Serialize_impl
	def load_vector(v):
		n, i = load_compactsize(v)
		r = [None]*n
		for elem in xrange(n):
			r[elem], delta = load_element(v[i:])
			i += delta
		return r, i
	return load_vector

def load_txin(v):
	# main.h CTxIn
	i = 36
	txid, pos = unpack_outpoint(v[:i])
	scriptsig, delta = load_string(v[i:])
	i += delta
	i += 4 # skipping sequence
	return (txid, pos, scriptsig), i

load_txins = vector_loader(load_txin)

def load_txout(v):
	# main.h CTxOut
	i = 8
	value, = unpack_s64(v[:i])
	scriptpubkey, delta = load_string(v[i:])
	return (value, scriptpubkey), i+delta

load_txouts = vector_loader(load_txout)

def load_transaction(v):
	# main.h CTransaction
	i = 4 # skipping version
	txins, delta = load_txins(v[i:])
	i += delta
	txouts, delta = load_txouts(v[i:])
	i += delta
	i += 4 # skipping locktime
	hash = sha256d(v[:i])
	return (hash, i, txins, txouts), i

load_transactions = vector_loader(load_transaction)

def load_block(v):
	# main.h CBlock
	i = 80
	head = v[:i]
	version, prev, root, time, target, nonce = unpack_header(head)
	hash = sha256d(head)
	txs, delta = load_transactions(v[i:])
	return (hash, prev, time, target, txs), i+delta
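To see how the loaders compose, here's a round trip over a hand-rolled one-input, one-output transaction with arbitrary field values; the sketch restates the loaders with the minor Python 3 adjustment (memoryview indexing yields ints) so it stands alone:

```python
from struct import Struct
from hashlib import sha256 as _sha256

unpack_u16 = Struct('<H').unpack
unpack_u32 = Struct('<I').unpack
unpack_u64 = Struct('<Q').unpack
unpack_s64 = Struct('<q').unpack
unpack_outpoint = Struct('<32sI').unpack

def sha256d(v):
	return _sha256(_sha256(v).digest()).digest()

def load_compactsize(v):
	size = v[0]
	if size < 253:
		return size, 1
	elif size == 253:
		return unpack_u16(v[1:3])[0], 3
	elif size == 254:
		return unpack_u32(v[1:5])[0], 5
	else:
		return unpack_u64(v[1:9])[0], 9

def load_string(v):
	n, i = load_compactsize(v)
	return v[i:i+n].tobytes(), i+n

def vector_loader(load_element):
	def load_vector(v):
		n, i = load_compactsize(v)
		r = [None]*n
		for k in range(n):
			r[k], delta = load_element(v[i:])
			i += delta
		return r, i
	return load_vector

def load_txin(v):
	txid, pos = unpack_outpoint(v[:36])
	scriptsig, delta = load_string(v[36:])
	return (txid, pos, scriptsig), 36 + delta + 4  # trailing 4 skips sequence

def load_txout(v):
	value, = unpack_s64(v[:8])
	scriptpubkey, delta = load_string(v[8:])
	return (value, scriptpubkey), 8 + delta

load_txins = vector_loader(load_txin)
load_txouts = vector_loader(load_txout)

def load_transaction(v):
	i = 4  # skip version
	txins, delta = load_txins(v[i:])
	i += delta
	txouts, delta = load_txouts(v[i:])
	i += delta
	i += 4  # skip locktime
	return (sha256d(v[:i].tobytes()), i, txins, txouts), i

raw = (b'\x01\x00\x00\x00'               # version
	+ b'\x01'                        # input count
	+ b'\x11'*32 + b'\x00'*4         # outpoint: txid, output index
	+ b'\x00'                        # empty scriptsig
	+ b'\xff'*4                      # sequence
	+ b'\x01'                        # output count
	+ Struct('<q').pack(5000000000)  # value: 50 BTC in Satoshi
	+ b'\x03\x51\x52\x53'            # 3-byte scriptpubkey
	+ b'\x00'*4)                     # locktime

(txhash, size, txins, txouts), consumed = load_transaction(memoryview(raw))
assert consumed == len(raw) == size
assert txins == [(b'\x11'*32, 0, b'')]
assert txouts == [(5000000000, b'\x51\x52\x53')]
assert len(txhash) == 32
```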

The dig through the code to come up with this magic for identifying standard pay-to-pubkey-hash outputs and extracting the enclosed addresses was an ugly one.

def out_script_address(s):
	if len(s) == 25 and s[:3] == '\x76\xA9\x14' and s[23:] == '\x88\xAC':
		return s[3:23]
	return None
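The magic bytes are just the canonical pay-to-pubkey-hash template: OP_DUP OP_HASH160, a 20-byte push, the hash itself, then OP_EQUALVERIFY OP_CHECKSIG, i.e. 76 a9 14 ... 88 ac. A sketch in Python 3 bytes terms:

```python
def out_script_address(s):
	# bytes variant of the same 25-byte template check
	if len(s) == 25 and s[:3] == b'\x76\xa9\x14' and s[23:] == b'\x88\xac':
		return s[3:23]
	return None

h160 = bytes(range(20))  # stand-in for a real HASH160 of a public key
p2pkh = b'\x76\xa9\x14' + h160 + b'\x88\xac'
assert out_script_address(p2pkh) == h160
# Anything else (here, an OP_RETURN-style script) yields None:
assert out_script_address(b'\x6a\x04test') is None
```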

To be continued.(ii)

Updated for errata.

  1. I'm just now noticing these were added in 2.7, ugh... sorry, 2.6 users. [^]
  2. My blog will be going on hiatus as far as new articles until early January. There's quite a ways to go on this file and I might not make it all the way through on this pass. If the suspense gnaws, you can always keep reading the source! [^]


Draft gbw-node schema

Filed under: Bitcoin, Software — Jacob Welsh @ 22:09

This series will present my initial implementation of a Bitcoin wallet built up from loosely-coupled parts ("simple droplets"). This follows from necessity as much as design philosophy, in that: 1) the only practical way to secure keys against remote compromise is with dedicated offline hardware; 2) additional online code is needed to support this; and 3) the existing bitcoind is a costly and treacherous thing to hack on. Further background on this work is at:

Today we will look at the data model for the awk-like part, which has shaped up as a line-mode command suite run on demand to collect validated block data, index and make accessible the parts needed to track funds and issue transactions.

Source: schema-node.sql.

Compared to a key-value store like BerkeleyDB as used by bitcoind, using SQL allows formalizing the structure and relationships in the data, working at a high level and enforcing integrity along several dimensions. Besides simplifying implementation, this means the human operator can bypass any limitations of the frontend software and work directly with the database if need be.

Because SQL in practice isn't quite as standard as one might like, one needs to pick either a specific implementation or some abstraction layer to target. While I've worked with MySQL, PostgreSQL and SQLite, I'm most familiar with the last, and figure it has some nice properties for this application: simple to install, widely available, without daemon processes, network listeners, user accounts and passwords to configure. Don't be fooled by the "lite": it's still a large item with a fairly comprehensive and optimized SQL engine, claiming robust concurrency and durability properties (ACID transactions) and large, well-defined and tested limits. It includes a command-line shell sqlite3 for manual operation, dump and restore. You are of course welcome to try other implementations and report findings; I haven't used inherently SQLite-specific features like dynamic typing.

A "watched address" model is used such that only transaction outputs and inputs affecting Bitcoin addresses of interest are recorded. The benefit is much lower storage requirement than indexing the full blockchain; the cost is having to rescan after importing addresses with existing history.(i) Addresses are grouped into sets (tags) such that one node can serve many wallets or include decoy sets.


--- Gales Bitcoin Wallet: node (online component) schema
--- J. Welsh, December 2019
--- Dialect: SQLite (3.7.0 for WAL)

SQLite originally used a rollback-journal approach, which is terribly slow for write-heavy uses, at least on mechanical HDDs lacking battery-backed cache, as it requires at least two fsync operations per transaction. It provides no way to relax durability without also sacrificing integrity guarantees, though exactly that relaxation would suit the needs of the scan process. We'll instead use the newer Write-Ahead Logging mode; once activated, it's remembered in the database file.

PRAGMA journal_mode=WAL;
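That persistence is easy to demonstrate from Python against a throwaway database file:

```python
import os, sqlite3, tempfile

path = os.path.join(tempfile.mkdtemp(), 'db')
db = sqlite3.connect(path)
mode, = db.execute('PRAGMA journal_mode=WAL').fetchone()
assert mode == 'wal'
db.close()

# A fresh connection finds the mode already set: it lives in the file.
db = sqlite3.connect(path)
mode, = db.execute('PRAGMA journal_mode').fetchone()
assert mode == 'wal'
db.close()
```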


The tx table tracks confirmed Bitcoin transaction metadata. All "_id" fields are database-assigned integers for efficiency of storage and lookup; hashes are stored as raw bytes.

CREATE TABLE tx (
	tx_id        INTEGER PRIMARY KEY,
	hash         BLOB    NOT NULL,
	block_hash   BLOB    NOT NULL,
	block_height INTEGER NOT NULL,
	n            INTEGER NOT NULL, -- position in block
	comment      TEXT,
	size         INTEGER NOT NULL,
	fee          INTEGER
);
CREATE UNIQUE INDEX i_tx_hash ON tx(hash);
CREATE UNIQUE INDEX i_tx_height_n ON tx(block_height, n);

The purist might note that this table isn't entirely normalized since many-to-one block information is stored in two fields; but then, there can also be upsides to avoiding excessive joins. It might be nice to track more block metadata anyway since it's not that big and bitcoind does it quite opaquely.

Comment and fee fields are unused at present: computing fees, while desirable, would seem to require a full transaction index; comments don't seem to fit with the concept of minimizing private information stored by the node.

Transactions can be found either by their 256-bit hash or the more concise block coordinates.

Outputs and inputs

The transaction output is the main structure of interest, holding a fixed parcel of spendable or spent coin.

The REFERENCES clause, besides making relationships explicit, creates a foreign key constraint. Unfortunately SQLite doesn't enforce these by default for compatibility reasons; it must be enabled by the application on each connection. To ensure the sqlite3 shell gets it too, I include in my ~/.sqliterc a PRAGMA foreign_keys=ON;, as well as a .headers on and .mode quote (one per line) for better handling of the blobs.

CREATE TABLE output (
	output_id  INTEGER PRIMARY KEY,
	tx_id      INTEGER NOT NULL REFERENCES tx,
	n          INTEGER NOT NULL, -- position in output vector
	address_id INTEGER NOT NULL REFERENCES address,
	value      INTEGER NOT NULL,

SQLite integers take up to 8 bytes as needed, thus can safely represent Satoshi values up to the maximum BTC that will ever exist (1e8 x 21e6 = 2.1e15; 2^63 > 10^18).

	spent      INTEGER REFERENCES input(input_id),

An initial mistake was keeping the spent reference on the input side, following the raw transaction structure. The trouble with this is there's no efficient way to find unspent outputs: "the rows in this table that aren't referenced from another table" are necessarily not contained in that table's index! So instead we keep the link on this side, allowing NULL, and update the row when an input is seen that spends it.

	flags      TEXT
);
CREATE UNIQUE INDEX i_output_txid_n ON output(tx_id, n);
CREATE INDEX        i_output_addrid ON output(address_id);
CREATE UNIQUE INDEX i_output_spent  ON output(spent);

The flags field is presently unused: the idea was to track local state bits like seen or acknowledged.
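As for the integer range claim above, the arithmetic is quickly confirmed:

```python
SATOSHI = 10**8
MAX_BTC = 21 * 10**6
assert SATOSHI * MAX_BTC == 2100000000000000   # 2.1e15
assert SATOSHI * MAX_BTC < 2**63 - 1           # fits a signed 64-bit integer
```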

CREATE TABLE input (
	input_id   INTEGER PRIMARY KEY,
	tx_id      INTEGER NOT NULL REFERENCES tx,
	n          INTEGER NOT NULL -- position in input vector
);
CREATE UNIQUE INDEX i_input_txid_n ON input(tx_id, n);

Having a separate table for inputs is looking pointless now that it's a mere one-to-one reference from output.spent... ah, the joys of schema refactoring.
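To illustrate the foreign key point made earlier: with the pragma on, SQLite rejects a dangling reference that it would otherwise silently accept. A minimal sketch:

```python
import sqlite3

db = sqlite3.connect(':memory:')
# SQLite ships with foreign keys off; every connection must opt in.
db.execute('PRAGMA foreign_keys=ON')
db.execute('CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL)')
db.execute('CREATE TABLE address_tag ('
	'address_id INTEGER NOT NULL, '
	'tag_id INTEGER NOT NULL REFERENCES tag)')
violated = False
try:
	db.execute('INSERT INTO address_tag VALUES (1, 999)')  # no tag 999 exists
except sqlite3.IntegrityError:
	violated = True
assert violated
```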

Addresses and tags

CREATE TABLE address (
	address_id INTEGER PRIMARY KEY,
	address    BLOB NOT NULL
);
CREATE UNIQUE INDEX i_address_address ON address(address);

Like transaction and block hashes, addresses are represented in raw byte form as opposed to Base58Check-encoded. This makes scanning faster but importing and formatting for display slower. "Version" is assumed to be zero, meaning P2SH / "3-addresses" aren't a thing.

CREATE TABLE tag (
	tag_id     INTEGER PRIMARY KEY,
	name       TEXT NOT NULL
);
CREATE UNIQUE INDEX i_tag_name ON tag(name);

Unlike the "accounts" seen in bitcoind, addresses can have multiple tags.

CREATE TABLE address_tag (
	address_id INTEGER NOT NULL REFERENCES address,
	tag_id     INTEGER NOT NULL REFERENCES tag,
	PRIMARY KEY (address_id, tag_id)
);
CREATE INDEX i_addrtag_tag ON address_tag(tag_id);

Like many databases, SQLite uses B-tree storage. One implication is that in the case of composite keys (those spanning multiple fields as seen here), a dedicated index is not necessary for the first field as B-trees provide efficient prefix search. Subsequent fields however may benefit from them.


Finally, a "singleton" table to track scan state. A more extensible approach to global variables would be textual name and value fields, but I dunno, I like the explicit-ness of mapping variables to column rather than row.



Stay tuned for the code that puts it all to work. To run it you'll be needing Python 2 (2.7 tested) built with the sqlite3 module, the standalone sqlite3 shell (as loading the schema is manual for now), and of course, a Bitcoin node.

  1. A test scan of 607`849 blocks on 1`102 addresses finding 8`929`351 transactions took 13 hours producing a 1.8GB database file. [^]


Review of polarbeard_add_sendrawtransaction_rpc.vpatch

Filed under: Bitcoin, Software — Jacob Welsh @ 20:14

This patch to The Real Bitcoin was published somewhere around 2016 by someone who, as I gather, showed up to dump some fairly verbose and undocumented patches, then didn't stick around to advocate for them or keep them updated for the meanwhile changing tree.

One of them fills a still-unmet requirement of any split wallet so I'm reviewing it in hopes of saving some time on having to redo from scratch. Full patch is here.

+Value sendrawtransaction(const Array& params, bool fHelp)
+{
+    if (fHelp || params.size() < 1 || params.size() > 1)
+        throw runtime_error(
+            "sendrawtransaction <hex string>\n"
+            "Submits raw transaction (serialized, hex-encoded) to local node and network.");
+    vector<unsigned char> txData(ParseHex(params[0].get_str()));
+    CDataStream ssData(txData, SER_NETWORK, VERSION);

Above we see the three-argument form of the CDataStream constructor(i) with explicit arguments matching the defaults for the nTypeIn and nVersionIn parameters.

Some documentation of what if anything these parameters are for would be nice (though not a deficiency of this patch). For now it seems to me a reasonable demand that the defaults never change, or at any rate that the costs of such a change be borne by whoever makes it. Thus their inclusion here would be redundant, and not really the helpful kind of redundancy at that, i.e. the kind that catches mistakes. I had omitted them in my getrawtransaction patch.

+    CTransaction tx;
+    try {
+        ssData >> tx;
+    }
+    catch (std::exception &e) {
+        throw JSONRPCError(-22, "tx decode failed");
+    }

The -22 isn't seen elsewhere in the tree, but matches the number in the PRB implementation and named RPC_DESERIALIZATION_ERROR.

The condition of excess data in the stream after the transaction does not appear to be detected, which I figure would be nice to have.

+    CTxDB txdb("r");
+    uint256 txHash = tx.GetHash();
+    uint256 hashBlock = 0;
+    if (TransactionSeen(txHash, hashBlock))
+    {
+        throw JSONRPCError(-5, string("tx already ") +
+            (hashBlock != 0 ?
+                string("included in block ") + hashBlock.ToString() :
+                "in memory pool"));
+    }

Here it gets iffy in my opinion. Just because my node sees the transaction, whether in block or mempool, it doesn't follow that other nodes do too. What if I want to re-broadcast?

Secondarily, the use of zero as a sentinel seems mildly unhygienic as it's a legal hash value, though I suppose if a preimage were found then the whole system would be screwed anyway. Along with the use of argument mutation, I see it more as an indication that too much implicit behavior is being packed into this new TransactionSeen interface, such that its (ONE!) caller needs a hack to do what it actually wants.

+    bool fMissingInputs = false;
+    if (!tx.AcceptToMemoryPool(txdb, true, &fMissingInputs) || fMissingInputs)
+    {
+        throw JSONRPCError(-22, string("tx rejected") +
+            (fMissingInputs ? ", missing inputs" : ""));
+    }

Perhaps a nitpick, but why bother having error codes if you don't use them to distinguish things? Is the human-readable text part of the API now?

Looking into AcceptToMemoryPool, it doesn't actually do anything with that fMissingInputs besides initializing to false!

By the earlier argument it's also questionable whether that second fCheckInputs parameter should be true. It activates what looks like assorted anti-spam code of dubious value in this context.

+    SyncWithWallets(tx, NULL, true);
+    CInv inv(MSG_TX, txHash);
+    RelayMessage(inv, tx);
+    return txHash.GetHex();

Oh yes, there's code to support handling wallets plural, not that anything is done with it.

The CInv boilerplate is demanded by the poor abstraction of RelayMessage. That function includes some 15-minute expiration mechanism whose purpose and workings aren't readily apparent to me.

Generally, it seems awkward that all these steps are needed here; shouldn't there be a function somewhere that handles transactions received from the network, that we'd just reuse? I'm almost afraid to look.

+// return wether a given transaction is found on mempool or already included in a block
+bool TransactionSeen(const uint256 &hash, uint256 &hashBlock)
+{
+    CRITICAL_BLOCK(cs_mapTransactions)
+        if (mapTransactions.count(hash))
+            return true;
+    CTransaction tx;
+    CTxDB txdb("r");
+    CTxIndex txindex;
+    if (tx.ReadFromDisk(txdb, COutPoint(hash, 0), txindex))
+    {
+        CBlock block;
+        if (block.ReadFromDisk(txindex.pos.nFile, txindex.pos.nBlockPos, false))
+            hashBlock = block.GetHash();
+        return true;
+    }
+    return false;
+}

All the objects getting passed around and mutated here are painful to follow. This appears to do what it says, though I'd much rather be rid of it.

Painfully apparent among the debts of the codebase is documentation of its locking protocol. Getting this sort of thing wrong results in non-deterministic failures: either data corruption or deadlock, depending on the direction of the mistake.

 // make sure all wallets know about the given transaction, in the given block
-void static SyncWithWallets(const CTransaction& tx, const CBlock* pblock = NULL, bool fUpdate = false)
+void SyncWithWallets(const CTransaction& tx, const CBlock* pblock, bool fUpdate)
     BOOST_FOREACH(CWallet* pwallet, setpwalletRegistered)
         pwallet->AddToWalletIfInvolvingMe(tx, pblock, fUpdate);

Parameter defaults move to the declaration; I gather C++ puts the burden of setting these on the generated calling code, which seems sensible for a statically-compiled language.

 void RegisterWallet(CWallet* pwalletIn);
 void UnregisterWallet(CWallet* pwalletIn);
+void SyncWithWallets(const CTransaction& tx, const CBlock* pblock = NULL, bool fUpdate = false);
+bool TransactionSeen(const uint256 &hash, uint256 &hashBlock);
 bool ProcessBlock(CNode* pfrom, CBlock* pblock);
 bool CheckDiskSpace(uint64 nAdditionalBytes=0);
 FILE* OpenBlockFile(unsigned int nFile, unsigned int nBlockPos, const char* pszMode="rb");

Perhaps polarbeard had other patches that make use of this TransactionSeen. I'd rather this sort of patch not introduce globally-visible interfaces unnecessarily.

TransactionSeen is introduced as a globally-visible interface so that the underlying mapTransactions object stays static, in contrast with my more liberal approach.

In conclusion, the patch appears safe to apply and should do for now for the sake of testing my offline wallet, which might help inform any changes needed here. After some testing I might even end up signing it, though I'd much prefer something simpler and more clearly defined.

  1. And here I long for a self-hosted and accurately cross-referenced i.e. language- and context-aware source browser, but thanks to jurov for this approximate one. [^]


Basic getrawtransaction patch proposal for bitcoind

Filed under: Bitcoin, Software — Jacob Welsh @ 17:35

I present a bitcoind patch, of my own authorship, for consideration: a basic but functional getrawtransaction RPC command. I expect the need for it is clear: if it's your node then you shouldn't accept any sort of "I'm sorry Dave, I'm afraid I can't do that" especially regarding the data it already has handy.

To speak plainly about some deficits, this time not my own: the Real Bitcoin project presently rests on some shaky foundations. Despite apparently intricate efforts, a Keccak-based vtree has seen little progress, in that the original patch signers besides mod6 have not signed on to the regrinds of their work, and some recommended patches have not been reground at all, updated for the very necessary manifest spec, or included in the mainline collection. Further, the sorry state of the inherited code, such as magic numbers everywhere, has seen little improvement. Perhaps I will have to take up some of these burdens in time; for now I'll leave it as an entreaty to the elder hands to please find a way to make it happen!

The patch

As in the original introduced somewhere around PRB 0.7.0, this command takes a transaction ID (256 bits of hex), searches first the node's own memory pool then the confirmed transaction database (blkindex.dat) and returns a hex dump of the encoded transaction. Unlike the original it does not support a "verbose" option to give a JSON representation of the data. This task seems to me better suited to an external tool, but I could see including it here if the implementation is concise and obviously correct.

Backporting the original was not possible due to the many intervening changes, though I did consult it to confirm I hadn't missed anything important and matched its numeric error code.

Based on the overall idea in the V version control system of building yggdrasil, I'm breaking from one of the project's prior naming conventions by including "bitcoin" but not the author in the patch name; the author is still recorded in the enclosed manifest file. Due to the problems noted earlier with the prior patch tree it's also not a proper vpatch yet.

Download: bitcoin_getrawtransaction.draft.patch.

To try this you will need to:

  1. Perform a press to asciilifeform_whogaveblox.vpatch;
  2. Manually apply mod6_phexdigit_fix.vpatch (which could be missed otherwise due to lacking a manifest entry);
  3. Manually apply the patch in question.

In detail

I added a manifest entry for the phexdigit fix, to make its inclusion explicit:

--- a/bitcoin/manifest.txt
+++ b/bitcoin/manifest.txt
@@ -28,3 +28,5 @@
 542413 asciilifeform_aggressive_pushgetblocks asciilifeform Issue PushGetBlocks command to any peer that issues 'version' command
 542413 mod6_excise_hash_truncation mod6 Regrind of ben_vulpes original; removes truncation of hashes printed to TRB log file
 543661 asciilifeform_whogaveblox asciilifeform Record the origin of every incoming candidate block (whether accepted or rejected)
+606789 mod6_phexdigit_fix mod6 Fix decoding LUT which wrongly accepted certain invalid characters as hex
+606789 bitcoin_getrawtransaction jfw Add RPC to get transactions from memory pool or database (hex only)

The RPC itself is about as simple as it gets in this codebase. First we try the mempool, as this should be fast and may contain unconfirmed transactions.(i)

--- a/bitcoin/src/bitcoinrpc.cpp
+++ b/bitcoin/src/bitcoinrpc.cpp
@@ -1351,7 +1351,31 @@
     return entry;

+Value getrawtransaction(const Array& params, bool fHelp)
+{
+    if (fHelp || params.size() != 1)
+        throw runtime_error(
+            "getrawtransaction <txid>\n"
+            "Get hex serialization of <txid> from memory pool or database.");

+    uint256 hash;
+    map<uint256, CTransaction>::iterator it;
+    CTransaction tx;
+    CDataStream ssTx;
+    hash.SetHex(params[0].get_str());
+    it = mapTransactions.find(hash);
+    if (it != mapTransactions.end())
+        tx = it->second;

Functions everywhere have to open their own database connections - though some have the luxury of getting one passed in - which then implies a whole caching mechanism so as not to be horribly inefficient. Odin knows why there couldn't just be one global (or at least per-thread) "This Is The Database; Use It" object.

+    else {
+        CTxDB txdb("r");
+        if (!txdb.ReadDiskTx(hash, tx))
+            throw JSONRPCError(-5, "Transaction not found in memory pool or database.");
+    }
+    ssTx << tx;
+    return HexStr(ssTx.begin(), ssTx.end());
+}
 Value backupwallet(const Array& params, bool fHelp)
     if (fHelp || params.size() != 1)

Wiring the function into the RPC dispatch table (I don't recall how I chose where to insert it, as the list was already non-alphabetical; probably based on where it seemed sensible in the help listing):

@@ -1865,6 +1889,7 @@
     make_pair("getreceivedbyaccount",   &getreceivedbyaccount),
     make_pair("listreceivedbyaddress",  &listreceivedbyaddress),
     make_pair("listreceivedbyaccount",  &listreceivedbyaccount),
+    make_pair("getrawtransaction",      &getrawtransaction),
     make_pair("backupwallet",           &backupwallet),
     make_pair("keypoolrefill",          &keypoolrefill),
     make_pair("walletpassphrase",       &walletpassphrase),

The mempool object now needs to be visible between compilation units. I suggest doing a grep to verify this introduces no name conflicts.

--- a/bitcoin/src/main.cpp
+++ b/bitcoin/src/main.cpp
@@ -26,7 +26,7 @@

 CCriticalSection cs_main;

-static map<uint256, CTransaction> mapTransactions;
+map<uint256, CTransaction> mapTransactions;
 CCriticalSection cs_mapTransactions;
 unsigned int nTransactionsUpdated = 0;
 map<COutPoint, CInPoint> mapNextTx;
--- a/bitcoin/src/main.h
+++ b/bitcoin/src/main.h
@@ -46,6 +46,7 @@

 extern CCriticalSection cs_main;
+extern std::map<uint256, CTransaction> mapTransactions;
 extern std::map<uint256, CBlockIndex*> mapBlockIndex;
 extern uint256 hashGenesisBlock;
 extern CBlockIndex* pindexGenesisBlock;

I tested that it builds, successfully fetches transactions from both mempool and database, and returns the expected errors for missing argument or transaction not found. It does accept invalid hex strings, perhaps a flaw in that SetHex method. I've been running the patch in production since around August 10th of this year.

  1. The original cause for my writing the patch was a stuck transaction that wasn't getting relayed to miners or block explorers for unknown reasons. Upon fishing the raw tx from the mempool and submitting it to one such site, a useful error was finally obtained identifying the problem as the S-value normalization mess; the -lows option provided a workaround, after double-spending to self for safety, which was a whole other pain. [^]

Powered by MP-WP. Copyright Jacob Welsh.