Fixpoint

2020-06-27

A bevy of fixes for V in Perl

Filed under: Software, V — Jacob Welsh @ 21:49

"Fixes" may be a strong term, in that it could be argued the item in question wasn't exactly broken. It was however darn near unusable in practice, and contained a number of fragile and unstated assumptions about its environment and invocation.

I thought I'd introduce my work by putting it in context with a quick recap of what V is; but then it turned out that my understanding of the matter, and perhaps even my approach to gaining understanding, needed some fixing too.(i) In short, it's nearsighted to define V as an improved kind of version control system; rather, its proper placement is a few levels up in the tree of concepts as a new way of thinking about, talking about, and deploying software, broadly construed. Thus I surmise it's also inadequate to consider the present "V in Perl" artifact, its label notwithstanding, to be a kind of reference implementation of V, but rather an early implementation of a relatively small part of the overall vision.

Still, whether I fully manage to see in this horseless carriage prototype a "transportation revolution to permanently alter the shape of civilization" or "just a faster kind of cart" - or something in between - if I'm to be riding in it then I want the wheels on straight; and I'm plenty capable of seeing to that.

The first patch packs in quite a few relatively small and independent fixes, building off the GNAT-demanding ksum/vpatch branch. Paralleling my previous work, the second swaps in keksum and patch to avoid the secondary compiler requirement (a change that's now more straightforward).

Download

Changes

Taking the main patch, one item at a time, from the manifest:

(1) Eliminate use of external binaries (cat ls sort pwd which) and provide more hygienic directory listing;

Now I'm no Perl monk (and would likely never choose it myself for a new project) but I know you don't have to open subshells and pipelines just in order to read a file or list a directory, really now! The "ls" scraping is a particularly risky pattern due to the possibility of unexpected control characters in filenames. I was surprised to learn that stock Perl lacks an internal "getcwd" function, but the "pwd" and "which" usage turned out to be doing more harm than good anyway and was easily eliminated.
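As a rough illustration of the in-process approach (written in Python rather than Perl, and not taken from v.pl), a directory can be listed without spawning any external program. The function name and the ".vpatch" filter are assumptions made for the example.

```python
import os

def list_patches(patchdir, ext=".vpatch"):
    """Return the sorted names of regular files in patchdir ending in ext.

    Names come straight from the directory-reading calls, so odd characters
    in a filename cannot split or merge entries the way they can when
    scraping "ls" output.
    """
    names = []
    for name in os.listdir(patchdir):
        if name.endswith(ext) and os.path.isfile(os.path.join(patchdir, name)):
            names.append(name)
    return sorted(names)
```

Sorting happens in the program itself, so no "sort" binary is needed either.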

(2) add missing error handling in build_wot;

This is a repeated pattern in the code, and one I probably haven't entirely eliminated. Much like the underlying C, many Perl functions don't raise exceptions but expect the caller to check manually for errors. Much like the underlying C programmers, many Perl programmers don't bother with that pesky error checking stuff. Combine this with mutable variables repeatedly reinitialized by a loop on the assumption that nothing fails, and you get all sorts of interesting leakage possibilities. In this case, if one of the GPG key files placed in ".wot" was invalid, the program would not only fail to notice but label the bad key with the metadata of an unrelated victim that was seen previously.(ii)
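The leakage pattern can be sketched as follows, in Python rather than Perl. All names here are hypothetical, and "fake_parse" merely stands in for importing a key; it is not how v.pl or GPG behaves.

```python
def label_keys_unsafe(key_files, parse):
    """Buggy pattern: 'owner' keeps its value from the previous iteration
    whenever parse() reports failure by returning None."""
    labels = {}
    owner = None
    for path in key_files:
        result = parse(path)
        if result is not None:
            owner = result
        labels[path] = owner  # a stale owner leaks onto the bad key
    return labels

def label_keys_checked(key_files, parse):
    """Checked pattern: a failure is noticed and reported immediately."""
    labels = {}
    for path in key_files:
        owner = parse(path)
        if owner is None:
            raise ValueError("invalid key file: " + path)
        labels[path] = owner
    return labels

def fake_parse(path):
    """Stand-in for a key import: only names ending in .asc count as valid."""
    return path[:-4] if path.endswith(".asc") else None
```

With the inputs "alice.asc" and "junk.bin", the unsafe version labels "junk.bin" as belonging to "alice", while the checked version stops with an error.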

(3) eliminate some variable regex patterns;

Using a complex tool when a simple one suffices - and, predictably, not using it correctly (which would have meant quoting regex metacharacters).
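A small sketch of the difference, in Python rather than Perl; the file names are made up for the example.

```python
import re

name = "fix.1.vpatch"
candidates = ["fix.1.vpatch", "fixA1Bvpatch"]

# Interpolating the name as a pattern: each "." matches any character.
loose = [c for c in candidates if re.fullmatch(name, c)]

# Quoting the metacharacters restores literal matching...
quoted = [c for c in candidates if re.fullmatch(re.escape(name), c)]

# ...but plain string comparison is the simpler tool for the job.
plain = [c for c in candidates if c == name]
```

Here "loose" accepts both candidates, while "quoted" and "plain" accept only the exact name.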

(4) avoid slurping(iii) full vpatches and verbose output into memory;

Thus it should now work (albeit slowly) on vpatch files of arbitrary size relative to the system's main memory, and in verbose mode it flushes the "patch" output to the terminal in closer to real time.
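The general idea of streaming instead of slurping, sketched in Python under the assumption that the input is processed one line at a time; the hunk-counting task is only a placeholder for whatever per-line work is needed.

```python
import io

def count_hunks(stream):
    """Count hunk headers while holding only one line in memory at a time."""
    hunks = 0
    for line in stream:  # file objects yield lines lazily
        if line.startswith("@@"):
            hunks += 1
    return hunks

# An in-memory stand-in for an open vpatch file.
sample = io.StringIO("--- a/f\n+++ b/f\n@@ -1 +1 @@\n-old\n+new\n")
```

The slurping equivalent would call stream.read() first and so need memory proportional to the whole file.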

(5) fix exponential recursion blowup in traverse_press_path and get_all_descendant_nodes;

This was the original motivation for this work, as I'd noted:

Algorithmic inefficiency is a serious drawback of this tool. I suspect the "toposort" is something like O(n^2 log n) while "traverse_press_path" is exponential, like the textbook Fibonacci example of how not to do recursion. This becomes acutely noticeable around 32+ patches.

Exponential traversals turned out to be in two different functions; happily, the memoization needed to avoid revisiting the same subtrees over and over was already inherent in the data structures being built.

After tracking down the second instance, I used some "awk" to build a full list of recursive calls in the program for scrutiny (namely: traverse_press_path verify_ante remove_desc add_desc_edges add_desc_src_files), so I'm fairly confident there are no more heads on this particular hydra.
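To show the kind of blowup involved, here is a sketch in Python on a made-up graph, not v.pl's actual data structures. The "seen" set plays the role that the article says the existing data structures could already fill.

```python
def build_ladder(depth):
    """Hypothetical dependency graph: two nodes per level, each pointing at
    both nodes of the next level, so the number of paths doubles per level."""
    graph = {}
    for level in range(depth):
        for side in "ab":
            nxt = [(level + 1, s) for s in "ab"] if level + 1 < depth else []
            graph[(level, side)] = nxt
    return graph

def visit_naive(graph, node, counter):
    """Revisits shared subtrees once per path leading to them."""
    counter[0] += 1
    for child in graph[node]:
        visit_naive(graph, child, counter)

def visit_memo(graph, node, seen, counter):
    """Skips any node already recorded in 'seen'."""
    if node in seen:
        return
    seen.add(node)
    counter[0] += 1
    for child in graph[node]:
        visit_memo(graph, child, seen, counter)

graph = build_ladder(10)
naive, memo = [0], [0]
visit_naive(graph, (0, "a"), naive)
visit_memo(graph, (0, "a"), set(), memo)
```

On this ten-level graph the naive traversal makes 1023 visits against 19 for the memoized one, and each further level roughly doubles the former while adding only two to the latter.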

(6) allow the patchdir to be a relative path;

It happened that relative paths worked already for the seals and wot directories.

(7) replace some numeric indexing with named variables;
(8) tweak hash program parsing to not require two spaces;
(9) document restricted positioning of global options handled distinctly from commands;

The option-handling code clearly wasn't doing what it was meant to be doing; I haven't exactly fixed that but have at least noted the extant restrictions on where the wotdir/patchdir/sealdir options may be given.

(10) make help and version commands work without a wotdir;

You can't demand I already know how to use a program in order to view its documentation! Well I mean, you can, but... you know what I mean?!

(11) allow patches and seals to share a directory but require standard extensions (.vpatch .sig);

I never quite saw the point of separate subdirectories here, so now you can just ln -s patches .seals and keep them all under "patches", saving a fair amount of pointless shuffling in my own usage. (The default ".seals" path is preserved for compatibility.)

(12) take basenames of patch arguments to allow tab-completeable paths;

That is:

v.pl press a some_big_long_name.vpatch

can now be spelled as

v.pl press a patches/some_big_long_name.vpatch

(i.e. the actual path, although the prefix could be anything). This applies to all subcommands that take patch names.

(13) clean up the tempdir on SIGINT (^C);

Less important now that it doesn't get effectively wedged on exponential algorithms, but still.
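A sketch of the cleanup idea in Python; in Perl the usual mechanism would be a $SIG{INT} handler, and nothing below is taken from v.pl. The helper name and the "press-" prefix are assumptions.

```python
import os
import shutil
import tempfile

def with_tempdir(work):
    """Run work(tmpdir) and remove tmpdir afterwards, even on ^C.

    Python turns SIGINT into a KeyboardInterrupt exception by default, so a
    "finally" clause is enough to cover the interrupt case here.
    """
    tmpdir = tempfile.mkdtemp(prefix="press-")
    try:
        return work(tmpdir)
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
```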

(14) factor out some repetition;
(15) other minor simplifications.
Version 99989.

Stats

$ diffstat v_fix_exptimes_paths_etc.vpatch
 manifest |    1
 v.pl     |  234 +++++++++++++++++++++++++++++++++------------------------------
 2 files changed, 124 insertions(+), 111 deletions(-)

Enjoy!

  1. Full details in the logs:

    jfw: diana_coman: I'm trying for a concise intro/description of V, in the present (post-Republic) context. Does this about capture it: "versioning system that supports owner control of computing by placing primary focus on the change and explicit management of trust through strong cryptography" ?
    diana_coman: jfw - hm, what do you mean by "of computing" there?
    jfw: well, of the operation of one's own computers
    jfw: possibly a bit circular with "ownership of one's own"...
    diana_coman: it's more that the definition as you gave it doesn't do all that much - though it takes a few readings, hm.
    diana_coman: it's a bit tortured on various fences by the looks of it; for one thing, defining it as a versioning system cuts away an important part - the deployment of software that is usually not all that much the traditional concern of versioning systems
    diana_coman: jfw - what's the audience you have in mind there or is this blog/generic?
    jfw: it's the blog, yes - and partly for clarifying it for myself, heh.
    jfw: my grasp of what V does for deployment is basically to say that the other tools traditionally used for it aren't necessary
    diana_coman: ahaha, going for once fully-negative-space there (and that getting rid of all the other "tools" is not a tiny thing either at that, but it's more of a consequence than anything else)
    diana_coman: V is a complete solution in that sense, hence "the other tools [...] aren't necessary"
    jfw: (though um, it's still known to lean on 'wget' etc.)
    diana_coman: well, it also still requires an OS!
    diana_coman: anyways, I wouldn't say that "other tools are not necessary" - it's more that the change is so fundamental that previous tools don't fit /don't have a useful place anymore; other tools though *are* still necessary - only they need to be built
    diana_coman: it changes the whole landscape if you want
    diana_coman: but let's rewind and try to grab it from some more concrete end perhaps
    jfw: alright
    diana_coman: so for one thing, V is not some particular implementation but essentially a paradigm for software
    diana_coman: and software as a whole, not just development, nor even just deployment, it goes all the way to even what software *is*
    diana_coman: sure, one can use V for some narrow part that they care about and it's true that the first implementation was just that, a very narrow thing in fact, but that doesn't mean much.
    diana_coman: and I suppose that the current state of V-use and development otherwise might give the impression that there isn't anything more to it either, huh
    jfw: I suppose I've tried to understand the species based on observations of what's shared by the known instances
    diana_coman: jfw - you know, I think your attempt and question there hits actually deeper (and well done for it, too) than you intended, lol
    jfw: haha, indeed
    diana_coman: jfw - so where did you start from, anyway? from the current implementations of V, is that what you mean by the instances?
    jfw: right
    jfw: heh, you know the one about the blind men and the elephant?
    diana_coman: that kind of locks you unhelpfully into some rather sterile and narrow mindframe, myeah (and I'll leave the tracing of the root cause there to each log reader)
    diana_coman: jfw - hm? doesn't come to mind, no.
    jfw: apparently a story that exists in many versions, but basically each man feels a different part of the elephant and extrapolates a completely different (& quite incomplete) picture of what an elephant is.
    diana_coman: ah, the fable, yes
    diana_coman: I can see the similarity, indeed
    jfw: https://allpoetry.com/The-Blind-Man-And-The-Elephant - possibly the main English version.
    jfw: ponders how to "see true v-elephant with mind's eye"
    diana_coman: the thing is, V is not just a different type of versioning system - a bit like a car is not just a faster cart, hm
    diana_coman: jfw - well, better start from the beginning as it were which indeed is *not* whatever implementation, no matter what claims are made otherwise; e.g. [http://trilema.com/2015/no-such-labs-releases-v-for-victory/?b=change&e=satellites#select][the change similar to that introduced by the understanding and controlling movement in terms of mass, impulse and energy, such as it occurs in the launching of
    diana_coman: satellites]
    diana_coman: damn, it still broke the link, didn't it
    jfw: space between the words in the text, yeah
    diana_coman: jfw - my, yrc can't recall previous line??
    jfw: nope :/
    diana_coman: jfw - why, why, why whyyyyyy
    diana_coman: the change similar to that introduced by the understanding and controlling movement in terms of mass, impulse and energy, such as it occurs in the launching of satellites
    jfw: because it's young still
    diana_coman: so based on the above, you can start perhaps with a broad definition of V as a new way of understanding software - and therefore, as a consequence of this deeper and more precise understanding, the resulting more efficient way of talking about software, developing (version controlling being only one part of that developing) software, deploying software, maintaining software and so on.
    diana_coman: jfw - well, yrc may be young and have all the time ahead of it indeed but what can I say, I'm getting older day by day here so pleaaaase: can haz tab-completion and last-line recall?
    jfw: yes; and kill/yank (cut/paste) for the input is needed too.
    jfw: "manage his investment of trust at all junctures so that he is never required to implicitly trust either an unknown code author, or a code snippet of unknown provenance." - hey I pretty much got that part, right?
    diana_coman: with that broad definition at hand to help you avoid the pitfalls of stupid compartmentalizing, narrow focus, childish pick-and-choose and other numerous afflictions of the "software industry/engineering", the next step is to review the stated principles at the root of it all:
    jfw: (but yes, paradigm rather than particular set of scripts was missing.)
    diana_coman: namely software being the property of those running it and identity being constructed by others' view, upon a fixed support
    diana_coman: jfw - trust is possibly the skin of that particular elephant and at least the word itself has been repeatedly brandied about for sure
    diana_coman: it might have been bandied, but I do like brandied better.
    jfw: mmm, brandytrust!
    diana_coman: quite, it can produce... intoxication!
    jfw: especially hazardous when pregnant with concepts & definitions
    diana_coman: ahahah, indeed!
    diana_coman: looking back at your original definition, I'm afraid there isn't much of it left though.
    diana_coman: making a first attempt at tightening up that previous definition:
    sonofawitch: 2020-06-23 21:56:55 (#ossasepia) diana_coman: so based on the above, you can start perhaps with a broad definition of V as a new way of understanding software - and therefore, as a consequence of this deeper and more precise understanding, the resulting more efficient way of talking about software, developing (version controlling being only one part of that developing) software, deploying software, maintaining software and so on.
    diana_coman: V is a new conceptual framework for software, emerging from a better understanding of what software is and providing as main benefits the means for explicit, verifiable enforcement of software ownership by users as well as the correct incentives and supporting concepts for a qualitative jump in the way software is developed, deployed, maintained and evolved.
    diana_coman: jfw - does the above sound like the sort of concise definition you were looking for?
    diana_coman: it aims for a more practical intro so it necessarily leaves some stuff out/picks some to highlight.
    jfw: diana_coman: it's the sort of definition, yes - I don't know that I'll use it here directly though because if I'm to give a definition I'd want it to be one I fully understand myself (i.e. to have that new understanding of software & be able to explain why it's better)
    jfw: I'll work on getting there but the present article can make do without it.
    diana_coman: jfw - ah, no need to use it directly anywhere, lol; and anyways, if not clear, ask further tomorrow or whenever, sure.
    jfw: yep, & thanks for the pointers.
    diana_coman: yw
    diana_coman: such excellent questions are a pleasure to answer, so...keep asking them!
    [^]

  2. "User error", yes; but as the Python folks say, "errors should never pass silently, unless explicitly silenced" - one reason that I still grade that language a cut above Perl. [^]
  3. It's the official Perl term, what can I say? [^]

2020-04-02

V in Perl with parsing fix, keksum, and starter, plus the ill-fated vdiff

Filed under: Software, V — Jacob Welsh @ 17:50

Following my prior adventures, I reoriented my efforts toward some simpler changes to the v.pl tree, abandoning hopes of a robust patch creation tool built on Busybox diff.

I've split the changes into two patches. The first is "v_strict_headers", which I think would be of interest to any v.pl user. It tightens vpatch parsing to prevent false-positive header matches that could cause incorrect or nonsensical antecedent information to be extracted from valid vpatches. Following the precedent of the vtools vpatch program, this is done by requiring the string "diff " at the start of a line preceding the header, which works because all other lines of a diff "packet"(i) start with either @, +, -, or space characters. This patch also backfills the manifest file and brings it fully in line with the spec.
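The stricter rule might be sketched like this, in Python rather than Perl. It assumes the "diff " line immediately precedes the "---" line, and the trailing "1111" and "2222" fields are placeholders for hashes; none of this is the patch's actual code.

```python
def find_headers(lines):
    """Return (old_path, new_path) pairs for headers introduced by a line
    starting with "diff ", ignoring look-alike content lines."""
    headers = []
    for i in range(len(lines) - 2):
        if (lines[i].startswith("diff ")
                and lines[i + 1].startswith("--- ")
                and lines[i + 2].startswith("+++ ")):
            old = lines[i + 1].split()[1]
            new = lines[i + 2].split()[1]
            headers.append((old, new))
    return headers

vpatch = [
    "diff -uNr a/q.sql b/q.sql",
    "--- a/q.sql 1111",
    "+++ b/q.sql 2222",
    "@@ -1,2 +1,2 @@",
    " SELECT 1;",
    "--- old comment",   # deleted line whose content was "-- old comment"
    "+++ new comment",   # inserted line whose content was "++ new comment"
]
```

Only the real header is reported, because the look-alike pair at the end is not preceded by a "diff " line.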

The second patch, "v_keksum_busybox", swaps keksum and patch in for ksum and vpatch, making V presses possible again on systems with little more than a C toolchain, Busybox utilities and Perl.

I have also mirrored the rest of the VTree and contributed my own seals, which can be found in the same directory.

For deployment on systems with no previous V, there's a starter tarball which includes the tree pressed to v_keksum_busybox, the keksum code, and an install script. Take a look at what it does, then run as root, from the extracted directory:

# sh install.sh

Download

The ill-fated vdiff

What follows is my abandoned attempt at vdiff in awk, supporting any conforming diff program. It identifies headers using a three-state machine to recognize the ---, +++, @@ sequence. A deleted line and an inserted line that merely look like ---, +++ would still fool it if followed immediately by another hunk, except that the trailing lines of context prevent this. The only time a hunk lacks trailing context is when the change comes at the end of the file, in which case there can't be another hunk prior to the next file header.

It works as far as parsing both GNU and Busybox diff output, produces working vpatches in the GNU case, and could even be expanded to do the same for Busybox. But since fully-reproducible output seems to be desirable, I can't presently justify further work in this direction or recommend it over the vtools vdiff.

#!/bin/sh
export LC_COLLATE=C
diff -uNr "$1" "$2" | awk -v sq=\' '
function shell_quote(s) {
	gsub(sq, sq "\\" sq sq, s);
	return sq s sq;
}

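function vhash(path,    qpath, cmd, gotline, rec, parts) {
	# Extra parameters are the awk idiom for local variables; without them
	# split() here would clobber the "parts" array of the caller.
	if (path == "/dev/null") return "false";
	qpath = shell_quote(path);
	cmd = "test -e " qpath " && keksum -s256 -l512 -- " qpath;
	gotline = (cmd | getline rec);
	close(cmd);
	# getline yields 0 at end of input and -1 on error; both mean no hash.
	if (gotline <= 0) return "false";
	split(rec, parts);
	return parts[1];
}

function print_header(line,    parts) {
	split(line, parts);
	print parts[1], parts[2], vhash(parts[2]);
}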

{
	if (state == 0) {
		if ($0 ~ /^---/) {
			from = $0;
			state = 1;
		}
		else {
			print;
		}
	}
	else if (state == 1) {
		if ($0 ~ /^\+\+\+/) {
			to = $0;
			state = 2;
		}
		else if ($0 ~ /^---/) {
			print from;
			from = $0;
		}
		else {
			print from;
			print;
			state = 0;
		}
	}
	else if (state == 2) {
		if ($0 ~ /^@@/) {
			print_header(from);
			print_header(to);
			print;
			state = 0;
		}
		else if ($0 ~ /^---/) {
			print from;
			print to;
			from = $0;
			state = 1;
		}
		else {
			print from;
			print to;
			print;
			state = 0;
		}
	}
}

END {
	if (state == 1) {
		print from;
	}
	else if (state == 2) {
		print from;
		print to;
	}
}'
  1. Or what else do you call the header and sequence of hunks associated with a single file? [^]

2020-03-31

Adventures in the forest of V

Filed under: Historia, Software, V — Jacob Welsh @ 19:11

It started as what I thought a simple enough job: take the existing SHA512 v.pl I'd been using to press the Bitcoin code, or rather the VTree that grew from it, swap out the hash with my own keksum so as to avoid a hefty and otherwise unnecessary GNAT requirement, add my version of the classic vdiff modified likewise, bundle up a "starter" to cut the bootstrapping knot, and publish the bunch as my own tested and supported offering for wherever a V may be needed.

Such a thing would still require Perl, itself not an insignificant liability. While work had been underway to replace that, the results fell short of completeness, and from the ensuing discussion I decided it would be best to shore up my own grounding in the historical tools before venturing deeper into the frontier. I suppose I should be glad, because I got even more of that grounding - or swamping, more like - than I had asked for.

I.

One pitfall I already knew was that file header lines in the "unified diff" format used by V, which begin with "---" and "+++", cannot be accurately distinguished from deleted lines beginning "--" and inserted lines beginning "++", if parsing linewise and statelessly as done by the original "one-liner" vdiff. This was discovered in practice through an MP-WP patch containing base64-encoded images, and the potential damage is hardly restricted to that; for instance both the SQL and Ada programming languages use "--" as a comment marker. This was part of the motivation behind vtools, which took the approach of avoiding the system's existing "diff" program in favor of a stripped-down version of the GNU codebase with integrated hashing. My own approach had been more lightweight: tightening up the awk regex to at least reduce the false positive cases. It wasn't too satisfying, but had worked well enough so far.
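The ambiguity is easy to reproduce. The following Python sketch uses a hypothetical stateless pattern, not the one from the original vdiff, and a made-up diff of an SQL file.

```python
import re

diff_lines = [
    "--- a/schema.sql",
    "+++ b/schema.sql",
    "@@ -1,2 +1,2 @@",
    " CREATE TABLE t (id INT);",
    "--- drop the old table",   # removal of the SQL comment "-- drop the old table"
    "+-- keep the old table",   # addition of the SQL comment "-- keep the old table"
]

# Judging each line in isolation, anything shaped like a header is taken for one.
naive = [l for l in diff_lines if re.match(r"^(---|\+\+\+) ", l)]
```

The list "naive" ends up with three entries: the two real header lines plus the deleted comment.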

II.

The first surprise I hit (stupidly late in the process, after I'd already signed my patch and starter) was that the Busybox version of "diff -N" replaces the input or output file path with "/dev/null" for the cases of creation and deletion respectively.

This reflects a larger reservation I have about Busybox code: it's been hacked extensively toward the goal of minimizing executable and memory footprint, which sometimes but only sometimes coincides with clear code and sensible interfaces. In this case, from brief inspection I surmise that it literally uses /dev/null so as to avoid some kind of null check in the downstream code that compares and emits the header. It's clever, but breaks compatibility with the GNU format in unforeseen ways.(i) In fairness to Busybox, the format was poorly specified in the first place - and nobody involved with V apparently found this important enough to remedy either.

III.

Another surprise for me was that the sloppy parsing affects not just diffing but pressing too. At least v.py and v.pl exhibit the same sort of blind regexing in extracting antecedent information from vpatches. (I'd guess that use of somewhat tighter regexes has prevented this from causing trouble in practice so far.) Further, v.pl extracts file paths only from the "---" part of the header, which suggests it would indeed be broken by "/dev/null" style patches. On the extended vtools side, vfilter makes yet another assumption not backed by either such documentation as exists for the format or the Busybox version: a line showing a diff pseudo-command at the start of the header.

IV.

Finally, I've noticed what strikes me as a design problem affecting all V implementations, which I haven't seen mentioned before: it's not possible to have the same (path, hash) pair as an output of two different patches in the same VTree. More simply put, you can't have a patch that changes a file back to a previous state, contrary to the suggestion that "adding and removing the null character from the manifest file in every other patch would work" seen in the manifest spec. The reason is that both patches would end up in the antecedent set of a patch referencing either version of the file, in some cases producing a cyclic graph.(ii)
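A toy model of the problem, in Python. It assumes only the matching rule described above (a patch's antecedents are all patches that output one of its input (path, hash) pairs) and does not reproduce any particular implementation; the patch names and hashes "A" and "B" are invented.

```python
def antecedents(patches):
    """Map each patch to every patch that outputs one of its inputs.

    'patches' maps a name to (inputs, outputs), each a set of (path, hash)
    pairs.
    """
    producers = {}
    for name, (_, outputs) in patches.items():
        for pair in outputs:
            producers.setdefault(pair, set()).add(name)
    return {name: set().union(*(producers.get(p, set()) for p in inputs))
            for name, (inputs, _) in patches.items()}

def has_cycle(graph):
    """Depth-first search for a back edge."""
    state = {}  # node -> "active" or "done"
    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "active":
            return True
        state[node] = "active"
        found = any(visit(n) for n in graph[node])
        state[node] = "done"
        return found
    return any(visit(n) for n in graph)

patches = {
    "genesis": (set(), {("f", "A")}),
    "change": ({("f", "A")}, {("f", "B")}),
    "revert": ({("f", "B")}, {("f", "A")}),
}
graph = antecedents(patches)
```

Because "revert" outputs the same pair as "genesis", it becomes an antecedent of "change", which is in turn its own antecedent, and has_cycle(graph) reports a cycle.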

Stay tuned for the aforementioned patch and starter that make progress on a few of these fronts.

  1. A related annoyance I've had is that Busybox "diff -qr" doesn't report added or removed directories, while adding -N replaces "Only in ..." messages with the less helpful "Files ... differ". [^]
  2. At this point I must say I wonder why V wasn't made to simply include in the header of each patch the hash of its antecedent patch as a whole. It would have neatly bypassed all these problems, forcing a tree topology and simplifying implementation. Would it have smelled too much like Git, or what? [^]

Powered by MP-WP. Copyright Jacob Welsh.