Fixpoint

2020-04-07

Selection and other sundries for MP-WP

Filed under: MP-WP — Jacob Welsh @ 04:49

Here are four patches to the V-tree for Mircea Popescu's Wordpress, which amount to an approximation of the changes I've been running since initial setup of Fixpoint, reground to build upon subsequent improvements by billymg and Diana Coman.

I plan to sign them shortly unless cause for revision comes to light (such as for example server-side selection being implemented outside the theme-specific code).

Quoth the manifest:

624752 mp-wp_remove-textselectionjs-pop3-etc jfw Remove the unreliable JS-based selection, posting by POP3 login, and a stray .php.orig file. Neutralize and comment the example pingback updater.
624752 mp-wp_svg-screenshots-and-errorreporting jfw Allow .svg extensions in theme screenshot search. Don't clobber the user's errorreporting level without WP_DEBUG.
624752 mp-wp_serverside-selection jfw Add server-side text selection to example htaccess and themes, roughly as seen in http://trilema.com/2019/proper-html-linking-the-crisis-the-solution-the-resolution-conclusion/ (xmlrpc.php changes not included)
624752 mp-wp_footnote-link-tweaks jfw Avoid turning double-quotes into backquotes in footnote tooltips. Expand the link part of footnote identifiers to cover the pre_identifier and post_identifier strings, for a larger clickable area.

2020-04-02

V in Perl with parsing fix, keksum, and starter, plus the ill-fated vdiff

Filed under: Software, V — Jacob Welsh @ 17:50

Following my prior adventures, I reoriented my efforts toward some simpler changes to the v.pl tree, abandoning hopes of a robust patch creation tool built on Busybox diff.

I've split the changes into two patches. The first is "v_strict_headers", which I think would be of interest to any v.pl user. It tightens vpatch parsing to prevent false-positive header matches that could cause incorrect or nonsensical antecedent information to be extracted from valid vpatches. Following the precedent of the vtools vpatch program, this is done by requiring the string "diff " at the start of a line preceding the header, which works because all other lines of a diff "packet"(i) start with either @, +, -, or space characters. This patch also backfills the manifest file and brings it fully in line with the spec.

The second patch, "v_keksum_busybox", swaps keksum and patch in for ksum and vpatch, making V presses possible again on systems with little more than a C toolchain, Busybox utilities and Perl.

I have also mirrored the rest of the VTree and contributed my own seals, which can be found in the same directory.

For deployment on systems with no previous V, there's a starter tarball which includes the tree pressed to v_keksum_busybox, the keksum code, and an install script. Take a look at what it does, then run as root, from the extracted directory:

# sh install.sh

Download

The ill-fated vdiff

What follows is my abandoned attempt at vdiff in awk, supporting any conforming diff program. It identifies headers using a three-state machine to recognize the ---, +++, @@ sequence. This would still be fooled by a ---, +++ sequence followed immediately by another hunk, except that the lines of context prevent this, unless the change comes at the end of the file in which case there can't be another hunk prior to the next file header.

It works as far as parsing both GNU and Busybox diff output, produces working vpatches in the GNU case, and could even be expanded to do the same for Busybox. But since fully-reproducible output seems to be desirable, I can't presently justify further work in this direction or recommend it over the vtools vdiff.

#!/bin/sh
export LC_COLLATE=C
diff -uNr $1 $2 | awk -v sq=\' '
function shell_quote(s) {
	sub(sq, sq "\\" sq sq, s);
	return sq s sq;
}

function vhash(path) {
	if (path == "/dev/null") return "false";
	qpath = shell_quote(path);
	cmd = "test -e " qpath " && keksum -s256 -l512 -- " qpath;
	gotline = cmd | getline rec;
	close(cmd);
	if (!gotline) return "false";
	split(rec, parts);
	return parts[1];
}

function print_header(line) {
	split(line, parts);
	print parts[1], parts[2], vhash(parts[2]);
}

{
	if (state == 0) {
		if ($0 ~ /^---/) {
			from = $0;
			state = 1;
		}
		else {
			print;
		}
	}
	else if (state == 1) {
		if ($0 ~ /^\+\+\+/) {
			to = $0;
			state = 2;
		}
		else if ($0 ~ /^---/) {
			print from;
			from = $0;
		}
		else {
			print from;
			print;
			state = 0;
		}
	}
	else if (state == 2) {
		if ($0 ~ /^@@/) {
			print_header(from);
			print_header(to);
			print;
			state = 0;
		}
		else if ($0 ~ /^---/) {
			print from;
			print to;
			from = $0;
			state = 1;
		}
		else {
			print from;
			print to;
			print;
			state = 0;
		}
	}
}

END {
	if (state == 1) {
		print from;
	}
	else if (state == 2) {
		print from;
		print to;
	}
}'
  1. Or what else do you call the header and sequence of hunks associated with a single file? [^]

2020-03-31

Adventures in the forest of V

Filed under: Historia, Software, V — Jacob Welsh @ 19:11

It started as what I thought a simple enough job: take the existing SHA512 v.pl I'd been using to press the Bitcoin code, or rather the VTree that grew from it, swap out the hash with my own keksum so as to avoid a hefty and otherwise unnecessary GNAT requirement, add my version of the classic vdiff modified likewise, bundle up a "starter" to cut the bootstrapping knot, and publish the bunch as my own tested and supported offering for wherever a V may be needed.

Such a thing would still require Perl, itself not an insignificant liability. While work had been underway to replace that, the results fell short of completeness, and from the ensuing discussion I decided it would be best to shore up my own grounding in the historical tools before venturing deeper into the frontier. I suppose I should be glad, because I got even more of that grounding - or swamping, more like - than I had asked for.

I.

One pitfall I already knew was that file header lines in the "unified diff" format used by V, which begin with "---" and "+++", cannot be accurately distinguished from deleted lines beginning "--" and inserted lines beginning "++", if parsing linewise and statelessly as done by the original "one-liner" vdiff. This was discovered in practice through an MP-WP patch containing base64-encoded images, and the potential damage is hardly restricted to that; for instance both SQL and Ada programming languages use "--" as comment marker. This was part of the motivation behind vtools, which took the approach of avoiding the system's existing "diff" program in favor of a stripped-down version of the GNU codebase with integrated hashing. My own approach had been more lightweight: tightening up the awk regex to at least reduce the false positive cases. It wasn't too satisfying, but had worked well enough so far.

II.

The first surprise I hit (stupidly late in the process, after I'd already signed my patch and starter) was that the Busybox version of "diff -N" replaces the input or output file path with "/dev/null" for the cases of creation and deletion respectively.

This reflects a larger reservation I have about Busybox code: it's been hacked extensively toward the goal of minimizing executable and memory footprint, which sometimes but only sometimes coincides with clear code and sensible interfaces. In this case, from brief inspection I surmise that it literally uses /dev/null so as to avoid some kind of null check in the downstream code that compares and emits the header. It's clever, but breaks compatibility with the GNU format in unforeseen ways.(i) In fairness to Busybox, the format was poorly specified in the first place - and nobody involved with V apparently found this important enough to remedy either.

III.

Another surprise for me was that the sloppy parsing affects not just diffing but pressing too. At least v.py and v.pl exhibit the same sort of blind regexing in extracting antecedent information from vpatches. (I'd guess that use of somewhat tighter regexes has prevented this from causing trouble in practice yet.) Further, v.pl extracts file paths only from the "---" part of the header which suggests it would indeed be broken by "/dev/null" style patches. On the extended vtools side, vfilter makes yet another assumption not backed by either such documentation as exists for the format or the Busybox version: a line showing a diff pseudo-command at the start of the header.

IV.

Finally, I've noticed what strikes me as a design problem affecting all V implementations, which I haven't seen mentioned before: it's not possible to have the same (path, hash) pair as an output of two different patches in the same VTree. More simply put, you can't have a patch that changes a file back to a previous state, contrary to the suggestion that "adding and removing the null character from the manifest file in every other patch would work" seen in the manifest spec. The reason is that both patches would end up in the antecedent set of a patch referencing either version of the file, in some cases producing a cyclic graph.(ii)

Stay tuned for the aforementioned patch and starter that make progress on a few of these fronts.

  1. A related annoyance I've had is Busybox "diff -qr" doesn't report added or removed directories, while adding -N replaces "Only in ..." messages with the less helpful "Files ... differ". [^]
  2. At this point I must say I wonder why V wasn't made to simply include in the header of each patch the hash of its antecedent patch as a whole. It would have neatly bypassed all these problems, forcing a tree topology and simplifying implementation. Would it have smelled too much like Git, or what? [^]

2020-03-07

JFW's 130 top Trilema picks to date

Filed under: Bitcoin, Hardware, Historia, Lex, Paidagogia, Philosophia, Politikos, Software, Vita — Jacob Welsh @ 16:25

Inquiring minds have asked of me to please shed a bit more light on what this Republic thing and that Popescu fellow in particular are all about. Is there more to it than the ravings that first meet the eye, of sluts and slaves and scandalous sexual predations and every "ism" and trigger word known to man or woman? What's the value I see in it that keeps me coming back? And what's the plan for this world domination thing anyway?

I gave the most accurate response I could, if not the most helpful: see, all you gotta do is read a couple thousand articles in multiple languages averaging maybe a thousand words each, a couple times over, and likely a bunch of the imported cultural environment and extensive chat logs besides, and then all will become clear! At least as clear as it can be so far. At least I think it will. But what would I know, I'm a long ways from being there.

Well great, so couldn't I at least give an executive summary? Not exactly an easy task either. Short of that, here's an attempt at picking some of the especially interesting, informative or significant articles on Trilema from my reading so far, a map of sorts of enticing entries to the rabbit hole.

The very unfair process that articles went through to make this list was as follows:

  1. I extracted an initial set of 957 items from my presently accessible browsing history, using some CLI magic.(i)
  2. I narrowed the list to those where I believed I recalled something of the article, going off the title alone. This brought it down to 424.
  3. I further selected based on roughly the above "interesting, informative or significant" standard in my subjective perception, again by memory from title alone.(ii) I also ended up skipping some that would have met this by way of having especially horrified me; not sure if I've done anyone any favors thus, but there it is.

The ordering within each publication year is merely alphabetical (because I can't quite see a pressing need to do it better in this context).

Enjoy... if you dare. What can I say, it's not for everyone.

2012

2013

2014

2015

2016

2017

2018

2019

2020

  • The slap and human dignity
  • Fin.

    1. You know Firefox keeps this in a SQL database, yes? Because they told you about it in the manual, and documented the schema and all? [^]
    2. At times I was overpowered by the temptation to go check, with the inevitable expenditure of time on re-reading which, useful as it can be, I hadn't planned on getting drawn into just now. And while my shiny tools got this down to a minimal "this button to keep, that button to skip" flow, they were entirely powerless to speed up the thinking. [^]

2020-03-04

Bitcoin transactions and their signing, 2: attachment

Filed under: Bitcoin, Software — Jacob Welsh @ 20:10

Having outlined the shape of the building block provided by digital signatures, we now face the potential problem of how to attach signatures to the messages they sign. The one hard requirement for any attachment scheme is that the verification function can work, that is, can answer unambiguously whether a signature is valid for a specific message and key. I will explore the space of possible approaches here,(i) then describe the one used in Bitcoin.

The simplest approach is to say: "what problem?" That is, treat the message and signature as separate objects (bitstrings, numbers, files or however you like to think of them) and use some external system to organize them. This is known in the traditional GPG toolset as detached signing. It has its advantages, besides the obvious "less work to implement". The original, unmodified message is directly available to the reader and his tools. New signatures can be added to a collection without duplicating or modifying the message object, and thus without needing further verification that they in fact refer to the same message. These properties are exploited in present manifestations of the V version control concept.

Assuming one does indeed want attached signatures, then, the first option is to package the message and signature together in some container format. Depending on how it's done, this can preserve the advantage that at least a semblance of the original message is readily visible in plain text, as with GPG clearsigning.(ii) New signatures can be added either with support from the container format, producing a single multiply-signed document, or without such support, either by nesting (such that each new signature references the previous stack) or duplication.

A second option, when the message represents a formal data structure, is to embed signatures in that structure itself in an application-specific way. At first sight this appears to be a circular data dependency: how can a signature be computed for a message that includes a representation of that signature?(iii) However, this can be worked around by applying a transformation to clip or whiteout the signature field at both signing and verification time.

The third and final option is to generalize the previous into a flexible or perhaps even universal embedding scheme. For example, signatures can be wrapped in whatever comment delimiters are available in a programming language, as seen in Mircea Popescu's recent proposal.(iv)

Bitcoin transactions, we can now say, use option #2: format-specific embedding, though with some added complications as follows.

The signature on each input is wrapped using the "script" encoding, in a field originally named "scriptSig", and its interpretation is determined by a corresponding script in the linked output being spent, originally "scriptPubKey". If we constrain our interest to transactions in the standard pay-to-pubkey-hash form, these considerations reduce to a formality.

The whiteout procedure is basically to replace the scriptSig on each input with an empty script. This implies the signatures are independent of each other. The twist, though, is that for the input for which a signature is being computed, the scriptSig is replaced instead by the corresponding scriptPubKey. I can't see any security advantage in doing this, since the previous output is already referenced by a unique identifier(v) covered by the signature. The result is that a different message must be signed for each input, and transaction verification takes quadratic time with respect to the number of inputs. This makes for a good reminder that the Bitcoin protocol externalizes much of the cost of transacting onto all node operators, and unless a satisfactory solution to that tough problem is deployed, transaction throughput must be kept a scarce resource.

To be continued.

  1. I struggled more than usual in writing about these, perhaps indicating I didn't grasp them as well as I'd thought. I don't claim to be equipped to discuss why one choice might be philosophically preferable to others; yet neither can I take a "purely technical" approach since cryptography is necessarily shaped as much from above by its utility to human society as from below by mathematical possibility. Maybe search the logs? [^]
  2. That format however incurs further complexity from tackling the additional perceived problems of linefeed normalization and in-band bracketing for inclusion in a larger text, with the drawback of having to quote instances of the magical bracket sequence in the signed message. [^]
  3. Such a message can be conceived as a fixpoint of the hash-sign-attach pipeline, but finding one in practice would seem to constitute a severe break in the cryptographic primitives. [^]
  4. It's not yet clear to me if or how this can be implemented reliably. For starters, how would you distinguish actual signatures from, say, quoted signatures, without knowing the lexical rules of the target language? How would the "whiteout" work to produce the same hash after addition of new signatures, without knowing same? [^]
  5. Well, not quite unique but at least identifying its contents including the scriptPubKey in question, to the extent you trust SHA256. And if you don't trust that, the signature hash would seem to be the bigger problem. [^]

2020-02-25

Bitcoin transactions and their signing, 1

Filed under: Bitcoin, Software — Jacob Welsh @ 23:42

As my offline Bitcoin signer nears completion, it's a good time to introduce just what Bitcoin transactions are anyway, how they are signed, and not least of all how it could go horribly wrong if we're not careful. This first part will cover the basics that I consider required knowledge for anyone who handles the currency.

A Bitcoin transaction is a message with particular structure and binary encoding rules,(i) specifying the transfer of given quantities from one set of accounts to another.

Transactions are composed of inputs and outputs. Each output specifies a monetary value and a destination address.(ii) Each input contains a reference to a previous transaction output(iii) and a signature authorizing its spending. In a quirk of implementation, the "accounts" mentioned above don't explicitly exist in the system; outputs are considered either unspent or spent in full by inclusion in a subsequent transaction. Your available balance, then, is the total value of unspent outputs for which you are able to issue valid signatures. Since the amount to be sent isn't usually an exact sum of previous outputs, a "change" output is added so as to overshoot and send the excess back to the original owner.

Observing that the scheme as presented so far rests on the strength of the signature, let's briefly expand on that concept, leaving the mathematical details as a black box for present purposes. A digital signature scheme provides three high-level operations: key generation, signing, and verification. Key generation takes some cryptographic entropy as input and produces a public/private key pair. Signing takes a fixed-length message hash, a private key, and possibly some further entropy and produces a signature. Verification answers whether a purported signature is valid for a given hash and public key. This gives a high degree of confidence that the signature could only have been issued by someone with knowledge of the private key (as long as some underlying unproven mathematical assumptions hold, which they appear to have so far despite ample incentive to break them). Note the distinct advantage over traditional pen-and-paper signatures: simply seeing one does not grant an ability to forge it or pass it off as covering some other message, despite the susceptibility of digital information to perfect copying and easy modification.

To be continued.

  1. Due to an unfortunate misallocation of brain cycles by Satoshi and the others who imagined themselves Bitcoin developers in the early days, there's a whole cocktail of encodings with, for example, at least four different ways to represent integers. While this makes for some added implementation complexity, the details aren't especially important for normal usage. [^]
  2. Technically a "script", but for simplicity we'll consider only the standard "pay-to-pubkey-hash" form. [^]
  3. Except in the case of "coinbase" transactions which issue mining rewards. [^]

2020-01-30

From the forum log, 30 December 2019

Filed under: Paidagogia, Software, Summaries — Jacob Welsh @ 18:48

From #trilema: 2019-12-30.

MP clarified for trinque that nobody's proposing all software be installed everywhere, and his own point was more about making gradual code and design cleanups so the mess becomes much smaller. A V tree could grow in different directions for different purposes, with a GUI environment likely being an early such split.

He weighed in on the Lisp critique thread. The latest ASDF was wrecked by Francois-Rene Rideau aka fare when the elder Lispers were no longer around to stop it. MP's experience with Lisp code had been that most ended up being Python anyway. He cited the late Erik Naggum on the trouble with free Lisp implementations lacking the more powerful features, but lambasted him for checking out of life before he could do anything useful, along with recent TMSR dropouts. He formulated the purpose of Lisp as for solving closed problems in the most efficient way, once problems are adequately understood.

In computer graphics, the nonsense of "XML shaders" was more popular than diana_coman knew.

Hanbot noted the use of Scheme scripting in Gimp, apropos of both practical Lisp usage and bypassing large dependency trees as seen in ImageMagick.

Mocky said he'll catch up.

MP and diana_coman picked up a discussion from blog comments about Eulora work, wherein various parts are stalled; the project outgrew the world it came from, in terms of both tools and artists. MP explored the art question by comparing valuation of a lone fighter aircraft, to a black-box Go-playing AI, to a cube of colored pixels: all being piles of accumulated tinkering, results of another's process, useless on their own without access to the process. Maintaining infrastructure to cater to the current crop of artists would be costly; growing new talent likewise. Hanbot had once done some Blender character modeling work but got stuck on the last step of exporting to a usable format; unfortunately she didn't recall documenting the struggle. MP posed the practical options as either trawling the entire Internet for artwork and sorting out what's usable, or machine-generating it and doing similar, at least knowing it's in a usable format. Diana_coman wasn't enthused by either. MP was interested in doing the first anyway, inviting unemployed readership to talk.

For the first time ever, MP found himself struggling with recent publication volume. He and diana_coman pointed out some kinks in trinque's blog setup; trinque started looking into it.

2020-01-20

Draft gbw-node frontend, part 6

Filed under: Bitcoin, Software — Jacob Welsh @ 21:32

Continued from:

The first of the input/output commands is to print a table of possibly-spendable outputs in the format required by the offline signer. While the Bitcoin protocol refers to transactions by 256-bit hash, the more compact confirmation coordinates (height, index) are included for convenience in the comment field. The queries are a bit lengthy, since we now join several tables to build the flat output file, but aren't doing anything too fancy once you break them down. The only difference when a tag is specified is the extra join to filter on its ID.

In some cases, BLOB fields need to be converted back to str.(i)

def cmd_unspent_outs(argv):
	'''
	unspent-outs [TAG]

	Display the unspent outputs table for addresses with the given TAG (or all watched addresses), as required by the offline wallet, ordered by age.
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=? \
				ORDER BY block_height DESC', (tag_id,))
	else:
		r = db.execute('SELECT address, value, hash, output.n, block_height, tx.n FROM output \
				JOIN address ON output.address_id = address.address_id \
				JOIN tx ON output.tx_id = tx.tx_id \
				WHERE spent IS NULL \
				ORDER BY block_height DESC')
	for a, v, hash, n_out, height, n_tx in r:
		stdout.write('%s %s %s %s #blk %s tx %s\n' % (format_address(str(a)), format_coin(v), b2lx(hash), n_out, height, n_tx))

Idea: Add a command to print the outputs table for a given raw transaction. For example, this would enable spending unconfirmed or too recently confirmed outputs in a pinch, without requiring any further changes. Or more generally: all the data conversion code is already here so might as well make it accessible.

Next we proceed to the accounting commands, as they're really just another kind of output command. The balance of an address set is the total value of unspent outputs to addresses in the set.

def cmd_balance(argv):
	'''
	balance [TAG]

	Display confirmed balance of addresses with the given TAG (or all watched addresses).
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE spent IS NULL AND tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT COALESCE(SUM(value),0) FROM output WHERE spent IS NULL')
	bal, = r.fetchone()
	stdout.write('%s\n' % format_coin(bal))

Things get tricker for the register report as it attempts to usefully summarize several things in a small space. In particular, summing the incoming and outgoing value per transaction seems to require separate queries since the join criteria differ.(ii)

def cmd_register(argv):
	'''
	register [TAG]

	Display a tab-delimited transaction register report for addresses with the given TAG (or all watched addresses). Columns are:

	- confirmation block height
	- number of transaction within block
	- total deposits (new outputs)
	- total withdrawals (spent outputs)
	- running balance
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				JOIN address_tag ON output.address_id = address_tag.address_id \
				WHERE tag_id=? \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n', (tag_id,))
	else:
		outs = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN output ON output.tx_id = tx.tx_id \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
		ins = db.execute('SELECT block_height, tx.n, COALESCE(SUM(value),0) FROM tx \
				JOIN input ON input.tx_id = tx.tx_id \
				JOIN output ON input.input_id = output.spent \
				GROUP BY tx.tx_id \
				ORDER BY block_height, tx.n')
	bal = 0
	for height, n, o_val, i_val in merge_moves(outs.fetchall(), ins.fetchall()):
		bal = bal + o_val - i_val
		stdout.write('%s\t%s\t%s\t%s\t%s\n' % (height, n, format_coin(o_val), format_coin(-i_val), format_coin(bal)))

A helper is used to join the two possibly uneven lists by transaction, inserting zeros for transactions found on only one side. Perhaps it could all be done in SQL with subqueries and some type of outer joins, but I wasn't quite seeing it, so resorted to the low level with an algorithm reminiscent of the merging step of classical mergesort.

# Merge ordered lists of total input and output values per transaction into single table with columns for both.
def merge_moves(outs, ins):
	i = o = 0

	while True:
		if o == len(outs):
			for height, n, val in ins[i:]:
				yield (height, n, 0, val)
			return
		o_height, o_n, o_val = outs[o]
		o_key = (o_height, o_n)

		if i == len(ins):
			for height, n, val in outs[o:]:
				yield (height, n, val, 0)
			return
		i_height, i_n, i_val = ins[i]
		i_key = (i_height, i_n)

		if o_key < i_key:
			yield (o_height, o_n, o_val, 0)
			o += 1
		elif i_key < o_key:
			yield (i_height, i_n, 0, i_val)
			i += 1
		else:
			yield (o_height, o_n, o_val, i_val)
			i += 1
			o += 1

Next, the input commands. For sanity's sake, we exclude newlines in tag names as implicitly required by the tags listing format.

def cmd_watch(argv):
	'''
	watch [TAG]

	Import a set of addresses to watch linewise from stdin, optionally named by the given TAG. Addresses can be associated with multiple tags using multiple watch commands.
	'''
	tag_id = None
	if len(argv) > 0:
		name = argv.pop(0)
		if '\n' in name:
			die('newline not allowed in tag name')
		tag_id = insert_or_get_tag_id(name)
	while True:
		l = stdin.readline()
		if len(l) == 0:
			break
		addr_id = insert_or_get_address_id(parse_address(l.rstrip('\n')))
		if tag_id is not None:
			try:
				db.execute('INSERT INTO address_tag (address_id, tag_id) VALUES (?,?)',
						(addr_id, tag_id))
			except IntegrityError:
				pass
		db.commit()

def cmd_push(argv):
	'''
	push

	Import raw hex transactions linewise from stdin and send to bitcoind.
	'''
	while True:
		line = stdin.readline()
		if len(line) == 0:
			break
		tx_hex = line.rstrip('\n')
		stdout.write('txid %s\n' % rpc('sendrawtransaction', tx_hex))

General or command-specific help, and a command registry allowing abbreviation:

def cmd_help(argv):
	'''
	help [COMMAND]

	Display help for a given command or list all commands.
	'''
	if len(argv) > 0:
		name = argv.pop(0)
		name, func = get_command(name)
		doc = getdoc(func)
		if doc is None:
			stdout.write('No help for %r\n' % name)
		else:
			stdout.write('gbw-node %s\n' % doc)
	else:
		stdout.write('''Usage: gbw-node COMMAND [ARGS]

Available commands (can be abbreviated when unambiguous):

%s
''' % '\n'.join([name for name, f in cmds]))

cmds = (
	('help', cmd_help),
	('scan', cmd_scan),
	('reset', cmd_reset),
	('tags', cmd_tags),
	('addresses', cmd_addresses),
	('unspent-outs', cmd_unspent_outs),
	('watch', cmd_watch),
	('push', cmd_push),
	('balance', cmd_balance),
	('register', cmd_register),
)

def get_command(name):
	rows = [r for r in cmds if r[0].startswith(name)]
	if len(rows) == 0:
		die('command not found: %s' % name)
	if len(rows) > 1:
		die('ambiguous command %s. Completions: %s' % (name, ' '.join([r[0] for r in rows])))
	return rows[0]

When invoked as a program (as opposed to imported elsewhere e.g. for testing), we connect to the database, enable foreign key constraints, and boost cache size and checkpoint interval from the meager defaults. These can be tuned if needed to optimize the scan process for your machine. Finally we dispatch to the given command.

Ideally, we'd create the database from schema here if not found.

def main():
	global db
	signal.signal(signal.SIGINT, signal.SIG_DFL)
	require_dir(gbw_home)
	db = sqlite3.connect(gbw_home + '/db', timeout=600) # in seconds
	db.execute('PRAGMA foreign_keys=ON')
	db.execute('PRAGMA cache_size=-8000') # negative means in KiB
	db.execute('PRAGMA wal_autocheckpoint=10000') # in pages (4k)
	if len(argv) < 2:
		die('missing command', help=True)
	get_command(argv[1])[1](argv[2:])

if __name__ == '__main__':
	main()

This concludes the node frontend. Congratulations if you've followed thus far! There's no magic in programming, just a ruthless decomposition of bigger problems into smaller ones, a search for useful and robust abstractions -- and of course a whole lot of background reading and practice.

In the next month or two I will be completing the missing pieces of the signer; meanwhile, the code here is quite ready to play with. Import some addresses, run a scan, run the reports, and let me know how it goes in the comments below.

  1. Well, so far the only such case is format_address, so perhaps it should just be changed to allow passing a buffer. [^]
  2. It's looking like the COALESCE trick is pointless here, since rows are only generated by the join when matching outputs are present; that is, the SUM aggregation is always getting at least one row. Was I overzealous before? I don't recall if I observed an actual problem here rather than just in cmd_balance. It does no harm to leave in though, at least as far as correctness. [^]

2020-01-19

Draft gbw-node frontend, part 5

Filed under: Bitcoin, Software — Jacob Welsh @ 19:02

Continued from:

Command implementations

The core scanning logic is in a helper function that takes a block's height and a memory view of its contents.

Referential integrity between blocks is ensured by scanning sequentially by height; that is, all relevant tx and output records from prior blocks will be known by the time we see the inputs that spend them. However, as far as I know this topological ordering is not guaranteed for the transaction sequence within a block (eg. tx 1 could spend outputs of tx 2, or vice versa) so we do separate passes over the transaction list for outputs and inputs.

def scan_block(height, v):
	stdout.write('block %s' % height)
	# [perf] computing every tx hash
	(blkhash, prev, time, target, txs), size = load_block(v)

The performance comment above was just to note some not-strictly-necessary work being done, in case the scan ended up horribly slow.(i)

An output is relevant if its script is standard and pays a known address. At least with foreign key constraints enabled, we can't insert an output until the tx record it references exists, but we don't know whether to insert the tx until we see if any of its outputs are relevant, so we again use a two-pass approach.

	count_out = 0
	n_tx = 0
	for (hash, size, txins, txouts) in txs:
		matched_outs = []
		for n, txout in enumerate(txouts):
			val, script = txout
			a = out_script_address(script)
			if a is not None:
				#print format_address(a)
				addr_id = get_address_id(a)
				if addr_id is not None:
					matched_outs.append((n, addr_id, val))
		if len(matched_outs) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, addr_id, val in matched_outs:
				insert_output(tx_id, n, addr_id, val)
			count_out += len(matched_outs)
		n_tx += 1
	stdout.write(' new-outs %s' % count_out)

An input is relevant if it spends a known output. Recall that insert_input updates the corresponding output to create the back-reference, indicating it has been spent.

	# Inputs scanned second in case an output from the same block is spent.
	# Coinbase (input of first tx in block) doesn't reference anything.
	count_in = 0
	n_tx = 1
	for (hash, size, txins, txouts) in txs[1:]:
		matched_ins = []
		for n, txin in enumerate(txins):
			prevout_hash, prevout_n, scriptsig = txin
			prevout_tx_id = get_tx_id(prevout_hash)
			if prevout_tx_id is not None:
				prevout_id = get_output_id(prevout_tx_id, prevout_n)
				if prevout_id is not None:
					matched_ins.append((n, prevout_id))
		if len(matched_ins) > 0:
			tx_id = insert_or_get_tx_id(hash, blkhash, height, n_tx, size)
			for n, prevout_id in matched_ins:
				insert_input(tx_id, n, prevout_id)
			count_in += len(matched_ins)
		n_tx += 1
	stdout.write(' spent-outs %s\n' % count_in)

Assorted helpers: handling usage errors; looking up a tag ID that must exist.

def die(msg, help=False):
	stderr.write('gbw-node: %s\n' % msg)
	if help:
		cmd_help([])
	exit(-1)

def require_tag(name):
	i = get_tag_id(name)
	if i is None:
		die('tag not found: %r' % name)
	return i

The entry point for any user command "X" is the function "cmd_X", having help text in its docstring and taking a list of any supplied CLI arguments past the command name.

First, the sync commands. The scan process commits one database transaction per block.

def cmd_scan(argv):
	'''
	scan

	Iterate blocks from bitcoind, indexing transaction inputs and outputs affecting watched addresses. May be safely interrupted and resumed.

	NOT PRESENTLY SAFE TO RUN CONCURRENT INSTANCES due to the dumpblock to named pipe kludge.
	'''
	db.execute('PRAGMA synchronous=NORMAL')
	height = db.execute('SELECT scan_height FROM state').fetchone()[0]
	blockcount = max(-1, rpc('getblockcount') - CONFIRMS)
	while height < blockcount:
		height += 1
		scan_block(height, memoryview(getblock(height)))
		db.execute('UPDATE state SET scan_height = ?', (height,))
		db.commit()

def cmd_reset(argv):
	'''
	reset

	Reset the scan pointer so the next scan will proceed from the genesis block, to find transactions associated with newly watched addresses.
	'''
	db.execute('UPDATE state SET scan_height = -1')
	db.commit()

Next, commands to query the watched address sets (not in the original spec but trivial and clearly useful).

def cmd_tags(argv):
	'''
	tags

	List all tag names.
	'''
	for name, in db.execute('SELECT name FROM tag'):
		stdout.write(name + '\n')

def cmd_addresses(argv):
	'''
	addresses [TAG]

	List addresses with the given TAG (or all watched addresses).
	'''
	if len(argv) > 0:
		tag_id = require_tag(argv.pop(0))
		r = db.execute('SELECT address FROM address \
				JOIN address_tag ON address.address_id=address_tag.address_id \
				WHERE tag_id=?', (tag_id,))
	else:
		r = db.execute('SELECT address FROM address')
	for a, in r:
		stdout.write(format_address(str(a)) + '\n')

To be continued.

  1. I've found the Python profiler quite useful so far compared to such guesswork; still, optimization is something of a balance between experimentally-driven efforts and not doing obviously wasteful things from the start. [^]

Draft gbw-node frontend, part 4

Filed under: Bitcoin, Software — Jacob Welsh @ 04:36

Continued from:

Common database operations

As an internal convention, a "get_X_id" function will return the database ID for the row in table "X" named by its bulkier external reference, or None if not found. Similarly, "insert_or_get_X_id" will insert a row if needed and in either case return the ID. Some of these have only a single caller, but I find that collecting the various similar queries in one place and wrapping them into tidy functions helps readability.

The mapping of Python to SQLite types is fairly straightforward, except that buffer is needed to specify a BLOB.

The "parameter substitution" feature is used throughout, avoiding improper mixing of code and data that could manifest as SQL injection or thrashing the compiled statement cache.

def get_address_id(a):
	r = db.execute('SELECT address_id FROM address WHERE address=?', (buffer(a),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_address_id(a):
	i = get_address_id(a)
	if i is not None:
		return i
	return db.execute('INSERT INTO address (address) VALUES (?)', (buffer(a),)).lastrowid

def get_tx_id(hash):
	r = db.execute('SELECT tx_id FROM tx WHERE hash=?', (buffer(hash),)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tx_id(hash, blkhash, height, n, size):
	try:
		return db.execute('INSERT INTO tx (hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)',
				(buffer(hash), buffer(blkhash), height, n, size)).lastrowid
	except IntegrityError:
		# XXX check equality?
		return get_tx_id(hash)

I now think we should indeed catch that condition (differing transactions with identical hash), especially given the possibility of TXID collisions. Perhaps I left it out from excessive worry about scan performance. Or just laziness.

The mixture of check-first and try-first styles seen above also doesn't sit well. The possibility of TOCTTOUs,(i) depending on the details of transaction isolation level, would seem to make a strong case for try-first. It's a minor point though; the worst case here would be an uncaught IntegrityError halting the program gracefully.

def insert_output(tx_id, n, addr_id, val):
	try:
		db.execute('INSERT INTO output (tx_id, n, address_id, value) VALUES (?,?,?,?)',
				(tx_id, n, addr_id, val))
	except IntegrityError:
		r = db.execute('SELECT address_id, value FROM output WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()
		if r != (addr_id, val):
			raise Conflict('output differs from previous content', tx_id, n, (addr_id, val), r)

def insert_input(tx_id, n, prevout_id):
	try:
		input_id = db.execute('INSERT INTO input (tx_id, n) VALUES (?,?)', (tx_id, n)).lastrowid
	except IntegrityError:
		input_id = db.execute('SELECT input_id FROM input WHERE tx_id=? AND n=?',
				(tx_id, n)).fetchone()[0]
	db.execute('UPDATE output SET spent=? WHERE output_id=?', (input_id, prevout_id))

def get_output_id(tx_id, n):
	r = db.execute('SELECT output_id FROM output WHERE tx_id=? AND n=?', (tx_id, n)).fetchone()
	return None if r is None else r[0]

def get_tag_id(name):
	r = db.execute('SELECT tag_id FROM tag WHERE name=?', (name,)).fetchone()
	return None if r is None else r[0]

def insert_or_get_tag_id(name):
	i = get_tag_id(name)
	if i is not None:
		return i
	return db.execute('INSERT INTO tag (name) VALUES (?)', (name,)).lastrowid

Next up, we'll finally get to implementing the commands themselves. To be continued.

  1. The "time of check to time of use" race condition. You know, like sitting down when some trickster's meanwhile moved the chair. [^]
Older Posts »

Powered by MP-WP. Copyright Jacob Welsh.