Fixpoint

2020-04-02

V in Perl with parsing fix, keksum, and starter, plus the ill-fated vdiff

Filed under: Software, V — Jacob Welsh @ 17:50

Following my prior adventures, I reoriented my efforts toward some simpler changes to the v.pl tree, abandoning hopes of a robust patch creation tool built on Busybox diff.

I've split the changes into two patches. The first is "v_strict_headers", which I think would be of interest to any v.pl user. It tightens vpatch parsing to prevent false-positive header matches that could cause incorrect or nonsensical antecedent information to be extracted from valid vpatches. Following the precedent of the vtools vpatch program, this is done by requiring the string "diff " at the start of a line preceding the header, which works because all other lines of a diff "packet"(i) start with either @, +, -, or space characters. This patch also backfills the manifest file and brings it fully in line with the spec.

The second patch, "v_keksum_busybox", swaps keksum and patch in for ksum and vpatch, making V presses possible again on systems with little more than a C toolchain, Busybox utilities and Perl.

I have also mirrored the rest of the VTree and contributed my own seals, which can be found in the same directory.

For deployment on systems with no previous V, there's a starter tarball which includes the tree pressed to v_keksum_busybox, the keksum code, and an install script. Take a look at what it does, then run as root, from the extracted directory:

# sh install.sh

Download

The ill-fated vdiff

What follows is my abandoned attempt at vdiff in awk, supporting any conforming diff program. It identifies headers using a three-state machine to recognize the ---, +++, @@ sequence. This would still be fooled by a ---, +++ sequence followed immediately by another hunk, except that the lines of context prevent this, unless the change comes at the end of the file in which case there can't be another hunk prior to the next file header.

It works as far as parsing both GNU and Busybox diff output, produces working vpatches in the GNU case, and could even be expanded to do the same for Busybox. But since fully-reproducible output seems to be desirable, I can't presently justify further work in this direction or recommend it over the vtools vdiff.

#!/bin/sh
export LC_COLLATE=C
diff -uNr $1 $2 | awk -v sq=\' '
function shell_quote(s) {
	sub(sq, sq "\\" sq sq, s);
	return sq s sq;
}

function vhash(path) {
	if (path == "/dev/null") return "false";
	qpath = shell_quote(path);
	cmd = "test -e " qpath " && keksum -s256 -l512 -- " qpath;
	gotline = cmd | getline rec;
	close(cmd);
	if (!gotline) return "false";
	split(rec, parts);
	return parts[1];
}

function print_header(line) {
	split(line, parts);
	print parts[1], parts[2], vhash(parts[2]);
}

{
	if (state == 0) {
		if ($0 ~ /^---/) {
			from = $0;
			state = 1;
		}
		else {
			print;
		}
	}
	else if (state == 1) {
		if ($0 ~ /^\+\+\+/) {
			to = $0;
			state = 2;
		}
		else if ($0 ~ /^---/) {
			print from;
			from = $0;
		}
		else {
			print from;
			print;
			state = 0;
		}
	}
	else if (state == 2) {
		if ($0 ~ /^@@/) {
			print_header(from);
			print_header(to);
			print;
			state = 0;
		}
		else if ($0 ~ /^---/) {
			print from;
			print to;
			from = $0;
			state = 1;
		}
		else {
			print from;
			print to;
			print;
			state = 0;
		}
	}
}

END {
	if (state == 1) {
		print from;
	}
	else if (state == 2) {
		print from;
		print to;
	}
}'
  1. Or what else do you call the header and sequence of hunks associated with a single file? [^]

2020-03-31

Adventures in the forest of V

Filed under: Historia, Software, V — Jacob Welsh @ 19:11

It started as what I thought a simple enough job: take the existing SHA512 v.pl I'd been using to press the Bitcoin code, or rather the VTree that grew from it, swap out the hash with my own keksum so as to avoid a hefty and otherwise unnecessary GNAT requirement, add my version of the classic vdiff modified likewise, bundle up a "starter" to cut the bootstrapping knot, and publish the bunch as my own tested and supported offering for wherever a V may be needed.

Such a thing would still require Perl, itself not an insignificant liability. While work had been underway to replace that, the results fell short of completeness, and from the ensuing discussion I decided it would be best to shore up my own grounding in the historical tools before venturing deeper into the frontier. I suppose I should be glad, because I got even more of that grounding - or swamping, more like - than I had asked for.

I.

One pitfall I already knew was that file header lines in the "unified diff" format used by V, which begin with "---" and "+++", cannot be accurately distinguished from deleted lines beginning "--" and inserted lines beginning "++", if parsing linewise and statelessly as done by the original "one-liner" vdiff. This was discovered in practice through an MP-WP patch containing base64-encoded images, and the potential damage is hardly restricted to that; for instance both SQL and Ada programming languages use "--" as comment marker. This was part of the motivation behind vtools, which took the approach of avoiding the system's existing "diff" program in favor of a stripped-down version of the GNU codebase with integrated hashing. My own approach had been more lightweight: tightening up the awk regex to at least reduce the false positive cases. It wasn't too satisfying, but had worked well enough so far.

II.

The first surprise I hit (stupidly late in the process, after I'd already signed my patch and starter) was that the Busybox version of "diff -N" replaces the input or output file path with "/dev/null" for the cases of creation and deletion respectively.

This reflects a larger reservation I have about Busybox code: it's been hacked extensively toward the goal of minimizing executable and memory footprint, which sometimes but only sometimes coincides with clear code and sensible interfaces. In this case, from brief inspection I surmise that it literally uses /dev/null so as to avoid some kind of null check in the downstream code that compares and emits the header. It's clever, but breaks compatibility with the GNU format in unforeseen ways.(i) In fairness to Busybox, the format was poorly specified in the first place - and nobody involved with V apparently found this important enough to remedy either.

III.

Another surprise for me was that the sloppy parsing affects not just diffing but pressing too. At least v.py and v.pl exhibit the same sort of blind regexing in extracting antecedent information from vpatches. (I'd guess that use of somewhat tighter regexes has prevented this from causing trouble in practice yet.) Further, v.pl extracts file paths only from the "---" part of the header which suggests it would indeed be broken by "/dev/null" style patches. On the extended vtools side, vfilter makes yet another assumption not backed by either such documentation as exists for the format or the Busybox version: a line showing a diff pseudo-command at the start of the header.

IV.

Finally, I've noticed what strikes me as a design problem affecting all V implementations, which I haven't seen mentioned before: it's not possible to have the same (path, hash) pair as an output of two different patches in the same VTree. More simply put, you can't have a patch that changes a file back to a previous state, contrary to the suggestion that "adding and removing the null character from the manifest file in every other patch would work" seen in the manifest spec. The reason is that both patches would end up in the antecedent set of a patch referencing either version of the file, in some cases producing a cyclic graph.(ii)

Stay tuned for the aforementioned patch and starter that make progress on a few of these fronts.

  1. A related annoyance I've had is Busybox "diff -qr" doesn't report added or removed directories, while adding -N replaces "Only in ..." messages with the less helpful "Files ... differ". [^]
  2. At this point I must say I wonder why V wasn't made to simply include in the header of each patch the hash of its antecedent patch as a whole. It would have neatly bypassed all these problems, forcing a tree topology and simplifying implementation. Would it have smelled too much like Git, or what? [^]

Powered by MP-WP. Copyright Jacob Welsh.