Fixpoint

2020-03-31

Adventures in the forest of V

Filed under: Historia, Software, V — Jacob Welsh @ 19:11

It started as what I thought a simple enough job: take the existing SHA512 v.pl I'd been using to press the Bitcoin code, or rather the VTree that grew from it, swap out the hash with my own keksum so as to avoid a hefty and otherwise unnecessary GNAT requirement, add my version of the classic vdiff modified likewise, bundle up a "starter" to cut the bootstrapping knot, and publish the bunch as my own tested and supported offering for wherever a V may be needed.

Such a thing would still require Perl, itself not an insignificant liability. While work had been underway to replace that, the results fell short of completeness, and from the ensuing discussion I decided it would be best to shore up my own grounding in the historical tools before venturing deeper into the frontier. I suppose I should be glad, because I got even more of that grounding - or swamping, more like - than I had asked for.

I.

One pitfall I already knew was that file header lines in the "unified diff" format used by V, which begin with "---" and "+++", cannot be accurately distinguished from deleted lines beginning "--" and inserted lines beginning "++", if parsing linewise and statelessly as done by the original "one-liner" vdiff. This was discovered in practice through an MP-WP patch containing base64-encoded images, and the potential damage is hardly restricted to that; for instance both SQL and Ada programming languages use "--" as comment marker. This was part of the motivation behind vtools, which took the approach of avoiding the system's existing "diff" program in favor of a stripped-down version of the GNU codebase with integrated hashing. My own approach had been more lightweight: tightening up the awk regex to at least reduce the false positive cases. It wasn't too satisfying, but had worked well enough so far.

II.

The first surprise I hit (stupidly late in the process, after I'd already signed my patch and starter) was that the Busybox version of "diff -N" replaces the input or output file path with "/dev/null" for the cases of creation and deletion respectively.

This reflects a larger reservation I have about Busybox code: it's been hacked extensively toward the goal of minimizing executable and memory footprint, which sometimes but only sometimes coincides with clear code and sensible interfaces. In this case, from brief inspection I surmise that it literally uses /dev/null so as to avoid some kind of null check in the downstream code that compares and emits the header. It's clever, but breaks compatibility with the GNU format in unforeseen ways.(i) In fairness to Busybox, the format was poorly specified in the first place - and nobody involved with V apparently found this important enough to remedy either.

III.

Another surprise for me was that the sloppy parsing affects not just diffing but pressing too. At least v.py and v.pl exhibit the same sort of blind regexing in extracting antecedent information from vpatches. (I'd guess that use of somewhat tighter regexes has prevented this from causing trouble in practice yet.) Further, v.pl extracts file paths only from the "---" part of the header which suggests it would indeed be broken by "/dev/null" style patches. On the extended vtools side, vfilter makes yet another assumption not backed by either such documentation as exists for the format or the Busybox version: a line showing a diff pseudo-command at the start of the header.

IV.

Finally, I've noticed what strikes me as a design problem affecting all V implementations, which I haven't seen mentioned before: it's not possible to have the same (path, hash) pair as an output of two different patches in the same VTree. More simply put, you can't have a patch that changes a file back to a previous state, contrary to the suggestion that "adding and removing the null character from the manifest file in every other patch would work" seen in the manifest spec. The reason is that both patches would end up in the antecedent set of a patch referencing either version of the file, in some cases producing a cyclic graph.(ii)

Stay tuned for the aforementioned patch and starter that make progress on a few of these fronts.

  1. A related annoyance I've had is Busybox "diff -qr" doesn't report added or removed directories, while adding -N replaces "Only in ..." messages with the less helpful "Files ... differ". [^]
  2. At this point I must say I wonder why V wasn't made to simply include in the header of each patch the hash of its antecedent patch as a whole. It would have neatly bypassed all these problems, forcing a tree topology and simplifying implementation. Would it have smelled too much like Git, or what? [^]

8 Comments »

  1. Re: Busybox and "nobody involved with V apparently found this important enough to remedy either" -- AFAIK all known V users currently use Phf's vdiff or variants thereof, no one is using my original diff/grep/awk-powered one-liner.

    Re: cyclic graphs: I specifically forbade them in the orig. Vtron. (Observe, there is a detector and eggog.) There is no, AFAIK, up-side to ever having a cyclic patch graph, and plentiful cost (e.g. requires a considerably more complicated "tortoise & hare" toposort, not to mention "wtf, why?!". Consider why chess players are forbidden from repeating a previously-seen board position.)

    Re: "why V wasn't made to simply include in the header of each patch the hash of its antecedent patch as a whole" : originally didn't have manifest, either. Observe, in Phf's TRB page, a patch was able to have multiple antecedents. Prior to manifest, there was less forced structure, but at the cost of making potentially nonsensical combinations pressable. In 2018 I posted a demo for an experimental Vtron that would behave as you described. (The algorithm is not mine, but originally Trinque's.) But AFAIK no one tried it.

    Comment by Stanislav Datskovskiy — 2020-03-31 @ 19:34

  2. I have mentioned the problem in IV. before when I was discussing a practice V I had written for myself.

    Comment by whaack — 2020-03-31 @ 20:00

  3. @Stanislav Datskovskiy: I'm aware of Phf's vdiff, I even described it in the text, you know? You're of course welcome to exclude me from "all known V users" if you like, indeed I admittedly hadn't been dealing with nontrivial trees besides TRB so as to feel a need for a Keccak presser previously.

    I'm not at all arguing that a cyclic graph should be allowed. The problem in IV occurs even with use of a manifest. (@whaack, do you still think it doesn't? Maybe I need to spell out an example.)

    Comment by Jacob Welsh — 2020-03-31 @ 22:04

  4. Re #3 -- indeed you had the link. My apologies, missed it on 1st pass.

    Re IV -- it still isn't clear to me why you would construct a tree where "have the same (path, hash) pair as an output of two different patches in the same VTree". If you end up with such a tree, for whatever reason, it is "coarse error of pilotage".

    IMHO if you were to make a change to a program but then decided to revert, ought to at least put a comment, "tried $change here, proved fatal" -- rather than bitwise reversion to a previous state. This would be good practice even if V did not exist -- why burn the work that went into testing the unsuccessful change, by not documenting?

    Comment by Stanislav Datskovskiy — 2020-03-31 @ 22:27

  5. I suppose I'm OK with the "so don't do that" approach; glad to have it out explicitly though so thank you. It also serves as one concrete reason for a "thou shalt test thy vpatch with a full press before publishing".

    Comment by Jacob Welsh — 2020-04-01 @ 18:30

  6. @jfw

    I was just pointing out that I had stumbled upon the same "problem" when writing my own version of topological sort for V. At the time I was unaware that the live versions of V in use also had the same quirk; I was speculating that they leveraged the manifest file to work around the inability to revert a file. You have now clarified that the known versions of V don't do this; I don't think I need an example spelled out.

    Comment by whaack — 2020-04-01 @ 20:08

  7. @whaack, cool.

    Re III, discussion with bvt has been illuminating - in brief, the ^diff matching originated with phf's vpatch and so is pretty entrenched, and there are further possible headaches in store for alternative diff implementations.

    Comment by Jacob Welsh — 2020-04-02 @ 15:21

  8. [...] my prior adventures, I reoriented my efforts toward some simpler changes to the v.pl tree, abandoning hopes of a robust [...]

    Pingback by V in Perl with parsing fix, keksum, and starter, plus the ill-fated vdiff « Fixpoint — 2020-04-02 @ 17:51

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by MP-WP. Copyright Jacob Welsh.