Fixpoint

2019-12-16

Draft gbw-node schema

Filed under: Bitcoin, Software — Jacob Welsh @ 22:09

This series will present my initial implementation of a Bitcoin wallet built up from loosely-coupled parts ("simple droplets"). This follows from necessity as much as design philosophy, in that: 1) the only practical way to secure keys against remote compromise is with dedicated offline hardware; 2) additional online code is needed to support this; and 3) the existing bitcoind is a costly and treacherous thing to hack on. Further background on this work is at:

Today we will look at the data model for the awk-like part, which has shaped up as a line-mode command suite run on demand to collect validated block data, index and make accessible the parts needed to track funds and issue transactions.

Source: schema-node.sql.

Compared to a key-value store like BerkeleyDB as used by bitcoind, using SQL allows formalizing the structure and relationships in the data, working at a high level and enforcing integrity along several dimensions. Besides simplifying implementation, this means the human operator can bypass any limitations of the frontend software and work directly with the database if need be.

Because SQL in practice isn't quite as standard as one might like, one needs to pick either a specific implementation or some abstraction layer to target. While I've worked with MySQL, PostgreSQL and SQLite, I'm most familiar with the last, and figure it has some nice properties for this application: simple to install, widely available, with no daemon processes, network listeners, user accounts or passwords to configure. Don't be fooled by the "lite": it's still a large item with a fairly comprehensive and optimized SQL engine, claiming robust concurrency and durability properties (ACID transactions) and large, well-defined and tested limits. It includes a command-line shell, sqlite3, for manual operation, dump and restore. You are of course welcome to try other implementations and report findings; I haven't used inherently SQLite-specific features like dynamic typing.
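For instance, a full dump and restore through the shell takes two commands (file names here are illustrative, not from the wallet itself):

```shell
# Dump the whole database to portable SQL text.
sqlite3 gbw-node.db .dump > gbw-node-backup.sql

# Rebuild a fresh database from the dump.
sqlite3 gbw-node-restored.db < gbw-node-backup.sql
```

The dump is plain SQL, so it also serves as a portable, diffable backup format.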

A "watched address" model is used such that only transaction outputs and inputs affecting Bitcoin addresses of interest are recorded. The benefit is much lower storage requirement than indexing the full blockchain; the cost is having to rescan after importing addresses with existing history.(i) Addresses are grouped into sets (tags) such that one node can serve many wallets or include decoy sets.

Prologue

--- Gales Bitcoin Wallet: node (online component) schema
--- J. Welsh, December 2019
--- Dialect: SQLite (3.7.0 for WAL)

SQLite originally used a rollback journal, an approach that is terribly slow for write-heavy uses, at least on mechanical HDDs lacking battery-backed cache, because it requires at least two fsync operations per transaction. It also provides no way to relax durability without sacrificing integrity guarantees, and such a relaxation is precisely what the scan process could use. We'll instead use the newer Write-Ahead Logging mode; once activated, it's remembered in the database file.

PRAGMA journal_mode=WAL;
BEGIN;
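To illustrate the durability point (this snippet is mine, not part of the schema): in WAL mode, setting synchronous=NORMAL drops the per-commit fsync while still guaranteeing integrity, so a crash can lose the most recent commits but can't corrupt the database. A sketch, using a throwaway file since WAL requires a real file database:

```python
import os
import sqlite3
import tempfile

# Illustrative file name; WAL mode doesn't apply to :memory: databases.
path = os.path.join(tempfile.mkdtemp(), "gbw-node.db")
conn = sqlite3.connect(path)

# Once set, WAL mode persists in the database file itself.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# NORMAL skips the fsync on each commit: recent transactions may be
# lost on power failure, but the database stays consistent.
conn.execute("PRAGMA synchronous=NORMAL")
```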

Transactions

The tx table tracks confirmed Bitcoin transaction metadata. All "_id" fields are database-assigned integers for efficiency of storage and lookup; hashes are stored as raw bytes.

CREATE TABLE tx (
	tx_id        INTEGER PRIMARY KEY,
	hash         BLOB    NOT NULL,
	block_hash   BLOB    NOT NULL,
	block_height INTEGER NOT NULL,
	n            INTEGER NOT NULL, -- position in block
	comment      TEXT,
	size         INTEGER NOT NULL,
	fee          INTEGER
);
CREATE UNIQUE INDEX i_tx_hash ON tx(hash);
CREATE UNIQUE INDEX i_tx_height_n ON tx(block_height, n);

The purist might note that this table isn't entirely normalized since many-to-one block information is stored in two fields; but then, there can also be upsides to avoiding excessive joins. It might be nice to track more block metadata anyway since it's not that big and bitcoind does it quite opaquely.

Comment and fee fields are unused at present: computing fees, while desirable, would seem to require a full transaction index; comments don't seem to fit with the concept of minimizing private information stored by the node.

Transactions can be found either by their 256-bit hash or the more concise block coordinates.
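Either lookup hits a unique index. A sketch against a scratch copy of the table, with made-up values (zeroed hashes, block 100):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tx (
    tx_id INTEGER PRIMARY KEY, hash BLOB NOT NULL,
    block_hash BLOB NOT NULL, block_height INTEGER NOT NULL,
    n INTEGER NOT NULL, comment TEXT, size INTEGER NOT NULL, fee INTEGER)""")
conn.execute("CREATE UNIQUE INDEX i_tx_hash ON tx(hash)")
conn.execute("CREATE UNIQUE INDEX i_tx_height_n ON tx(block_height, n)")

# Illustrative row: a fake 32-byte hash at block 100, position 0.
h = bytes(32)
conn.execute(
    "INSERT INTO tx(hash, block_hash, block_height, n, size) VALUES (?,?,?,?,?)",
    (h, bytes(32), 100, 0, 250))

# Lookup by 256-bit hash...
by_hash = conn.execute("SELECT tx_id FROM tx WHERE hash=?", (h,)).fetchone()

# ...or by the more concise block coordinates (height, position).
by_coord = conn.execute(
    "SELECT tx_id FROM tx WHERE block_height=? AND n=?", (100, 0)).fetchone()
```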

Outputs and inputs

The transaction output is the main structure of interest, holding a fixed parcel of spendable or spent coin.

The REFERENCES clause, besides making relationships explicit, creates a foreign key constraint. Unfortunately SQLite doesn't enforce these by default, for compatibility reasons; enforcement must be enabled by the application on each connection. To ensure the sqlite3 shell gets it too, I include in my ~/.sqliterc a PRAGMA foreign_keys=ON; as well as .headers on and .mode quote (one per line) for better handling of blob values.
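From application code, the same per-connection pragma looks like this (a minimal sketch with a cut-down pair of tables, not the wallet's actual connection setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite ships with foreign key enforcement off by default;
# it must be switched on for every new connection.
conn.execute("PRAGMA foreign_keys=ON")

conn.execute("CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE address_tag (
    address_id INTEGER NOT NULL,
    tag_id     INTEGER NOT NULL REFERENCES tag)""")

try:
    # Tag 99 doesn't exist, so the REFERENCES clause rejects this insert.
    conn.execute("INSERT INTO address_tag VALUES (1, 99)")
except sqlite3.IntegrityError:
    pass  # foreign key violation, as intended
```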

CREATE TABLE output (
	output_id  INTEGER PRIMARY KEY,
	tx_id      INTEGER NOT NULL REFERENCES tx,
	n          INTEGER NOT NULL, -- position in output vector
	address_id INTEGER NOT NULL REFERENCES address,
	value      INTEGER NOT NULL,

SQLite integers take up to 8 bytes as needed, thus can safely represent Satoshi values up to the maximum BTC that will ever exist (1e8 x 21e6 = 2.1e15; 2^63 > 10^18).

	spent      INTEGER REFERENCES input(input_id),

An initial mistake was keeping the spent reference on the input side, following the raw transaction structure. The trouble with this is there's no efficient way to find unspent outputs: "the rows in this table that aren't referenced from another table" are necessarily not contained in that table's index! So instead we keep the link on this side, allowing NULL, and update the row when an input is seen that spends it.

	flags      TEXT
);
CREATE UNIQUE INDEX i_output_txid_n ON output(tx_id, n);
CREATE INDEX        i_output_addrid ON output(address_id);
CREATE UNIQUE INDEX i_output_spent  ON output(spent);

The flags field is presently unused: the idea was to track local state bits like seen or acknowledged.
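The payoff of keeping the spent link on the output side is that unspent lookups become a simple indexed NULL test. A sketch against a stripped-down copy of the table, with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE output (
    output_id INTEGER PRIMARY KEY, tx_id INTEGER NOT NULL,
    n INTEGER NOT NULL, address_id INTEGER NOT NULL,
    value INTEGER NOT NULL, spent INTEGER, flags TEXT)""")

# Two outputs to address 1; the second is already spent (by input 7).
conn.execute("INSERT INTO output(tx_id, n, address_id, value, spent) "
             "VALUES (1, 0, 1, 5000, NULL)")
conn.execute("INSERT INTO output(tx_id, n, address_id, value, spent) "
             "VALUES (2, 0, 1, 7000, 7)")

# Unspent outputs are simply the rows whose spent link is NULL.
rows = conn.execute(
    "SELECT output_id, value FROM output WHERE address_id=? AND spent IS NULL",
    (1,)).fetchall()

# When an input spending output 1 is later seen, record the link:
conn.execute("UPDATE output SET spent=? WHERE output_id=?", (8, 1))
```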

CREATE TABLE input (
	input_id   INTEGER PRIMARY KEY,
	tx_id      INTEGER NOT NULL REFERENCES tx,
	n          INTEGER NOT NULL -- position in input vector
);
CREATE UNIQUE INDEX i_input_txid_n ON input(tx_id, n);

Having a separate table for inputs is looking pointless now that it's a mere one-to-one reference from output.spent... ah, the joys of schema refactoring.

Addresses and tags

CREATE TABLE address (
	address_id INTEGER PRIMARY KEY,
	address    BLOB NOT NULL
);
CREATE UNIQUE INDEX i_address_address ON address(address);

Like transaction and block hashes, addresses are represented in raw byte form as opposed to Base58Check-encoded. This makes scanning faster but importing and formatting for display slower. "Version" is assumed to be zero, meaning P2SH / "3-addresses" aren't a thing.
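For reference, the Base58Check conversion is short enough to sketch. This is my own illustration in Python 3 syntax (the wallet targets Python 2, so treat it as a sketch of the algorithm rather than the wallet's code):

```python
import hashlib

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def b58check_encode(payload):
    """Encode version byte + hash160 as a Base58Check address string."""
    # Append a 4-byte double-SHA256 checksum.
    chk = hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4]
    data = payload + chk
    n = int.from_bytes(data, "big")
    s = ""
    while n:
        n, r = divmod(n, 58)
        s = ALPHABET[r] + s
    # Each leading zero byte maps to a leading '1' (hence "1-addresses"
    # for version 0).
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + s

def b58check_decode(addr):
    """Decode an address string back to version byte + hash160."""
    n = 0
    for c in addr:
        n = n * 58 + ALPHABET.index(c)
    data = n.to_bytes((n.bit_length() + 7) // 8, "big")
    pad = len(addr) - len(addr.lstrip("1"))
    data = b"\x00" * pad + data
    payload, chk = data[:-4], data[-4:]
    if hashlib.sha256(hashlib.sha256(payload).digest()).digest()[:4] != chk:
        raise ValueError("bad checksum")
    return payload
```

The big-integer arithmetic is why import and display are slower than working on raw bytes.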

CREATE TABLE tag (
	tag_id INTEGER PRIMARY KEY,
	name   TEXT NOT NULL
);
CREATE UNIQUE INDEX i_tag_name ON tag(name);

Unlike the "accounts" seen in bitcoind, addresses can have multiple tags.

CREATE TABLE address_tag (
	address_id INTEGER NOT NULL REFERENCES address,
	tag_id     INTEGER NOT NULL REFERENCES tag,
	PRIMARY KEY (address_id, tag_id)
);
CREATE INDEX i_addrtag_tag ON address_tag(tag_id);

Like many databases, SQLite uses B-tree storage. One implication is that in the case of composite keys (those spanning multiple fields as seen here), a dedicated index is not necessary for the first field as B-trees provide efficient prefix search. Subsequent fields however may benefit from them.
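A tag membership query then joins through the link table, using exactly these indexes. A sketch with illustrative rows (address 1 carries both tags, address 2 only 'decoy'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (address_id INTEGER PRIMARY KEY, address BLOB NOT NULL);
CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE address_tag (
    address_id INTEGER NOT NULL REFERENCES address,
    tag_id     INTEGER NOT NULL REFERENCES tag,
    PRIMARY KEY (address_id, tag_id));
INSERT INTO address(address) VALUES (x'00'), (x'01');
INSERT INTO tag(name) VALUES ('hot'), ('decoy');
INSERT INTO address_tag VALUES (1, 1), (1, 2), (2, 2);
""")

# All addresses under a given tag: i_tag_name finds the tag,
# i_addrtag_tag finds its members.
rows = conn.execute("""
    SELECT a.address_id FROM address a
    JOIN address_tag at ON at.address_id = a.address_id
    JOIN tag t ON t.tag_id = at.tag_id
    WHERE t.name = ?""", ("decoy",)).fetchall()
```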

State

Finally, a "singleton" table to track scan state. A more extensible approach to global variables would be textual name and value fields, but I dunno, I like the explicitness of mapping variables to columns rather than rows.

CREATE TABLE state (
	scan_height INTEGER NOT NULL DEFAULT(-1)
);
INSERT INTO state DEFAULT VALUES;

COMMIT;
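The singleton shape makes the state accesses trivial, since no WHERE clause is ever needed. A sketch of how a scan loop might read and advance the pointer (my illustration, not the frontend's code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (scan_height INTEGER NOT NULL DEFAULT(-1))")
conn.execute("INSERT INTO state DEFAULT VALUES")

# Read the current scan position; -1 means no blocks processed yet.
(height,) = conn.execute("SELECT scan_height FROM state").fetchone()

# Advance it after a block is processed; the table holds exactly one
# row, so the UPDATE needs no WHERE clause.
conn.execute("UPDATE state SET scan_height=?", (height + 1,))
```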

Stay tuned for the code that puts it all to work. To run it you'll be needing Python 2 (2.7 tested) built with the sqlite3 module, the standalone sqlite3 shell (as loading the schema is manual for now), and of course, a Bitcoin node.

  1. A test scan of 607`849 blocks on 1`102 addresses finding 8`929`351 transactions took 13 hours producing a 1.8GB database file. [^]

7 Comments »

  1. [...] database schema for the "node" part of Gales Bitcoin Wallet covered, we now proceed to the frontend program that [...]

    Pingback by Draft gbw-node frontend, part 1 « Fixpoint — 2019-12-18 @ 01:18

  2. > this means the human operator can bypass any limitations of the frontend software and work directly with the database if need be.

    This, incidentally, is huge. Absolutely, only way to do this right. The program-data coupling is the bane of fucking humanity.

    The SQLite choice adds a third item to the traditional mysql or psql republican recital, of course... but I don't think this is indefensible.

    Comment by Mircea Popescu — 2019-12-19 @ 07:56

  3. [...] from schema and part 1. Source: [...]

    Pingback by Draft gbw-node frontend, part 2 « Fixpoint — 2020-01-16 @ 18:52

  4. [...] Schema (source); [...]

    Pingback by Draft gbw-node frontend, part 3 « Fixpoint — 2020-01-17 @ 18:02

  5. [...] Schema (source); [...]

    Pingback by Draft gbw-node frontend, part 4 « Fixpoint — 2020-01-19 @ 04:36

  6. [...] Schema (source); [...]

    Pingback by Draft gbw-node frontend, part 5 « Fixpoint — 2020-01-19 @ 19:02

  7. [...] Schema (source); [...]

    Pingback by Draft gbw-node frontend, part 6 « Fixpoint — 2020-01-20 @ 21:32


Powered by MP-WP. Copyright Jacob Welsh.