Of coffee and code

2021-05-23

Filed under: Historia, JWRD, Software, V, Vita — Jacob Welsh @ 11:13

This article might end up being about habits, or anti-habits for that matter, either of which would be quite some distance from where it germinated. If it be such a vine, then wind with me for a bit, and perhaps we'll reach something firm to grab onto. Or did you think we humans had it so much better than plants because of some capacity to see where we're going?

Anyway,⁽ⁱ⁾ each time I set to thinking on how to get started, every sentence would run right off into another pile of background seemingly required for its meaning to get across. From one perspective, that could be a symptom of being buried under quite a pile of unwritten articles, each one a weight that could have been a support, a liability that could have been an asset, had it been brought into the light. There's not much choice I can see in the matter of forming habits and anti-habits; for what you do builds on itself, and what you don't builds likewise.

But then again, it could also be that I'm just still inexperienced at translating the self to the medium as it is.

Anyway,⁽ⁱⁱ⁾ ten days ago I decided more or less spontaneously that it was time to quit coffee again. The last time was around 2015, in part as an exercise to prove to myself that it was the sort of habit that I was in control of because "I could totally stop any time, I just choose not to." Among other reasons. I stayed "dry" for a year and some before resuming, first occasionally, then daily because where's the pleasure in going through repeated mini-withdrawals? Better to be fully in or out. I stepped up my game a bit further, getting a French press and home roasting my own beans in weekly batches where previously I'd just done the grinding.⁽ⁱⁱⁱ⁾ It turns out this doesn't require any equipment beyond the humble iron skillet and oven, so long as you keep a close eye on the timing, and the smell that fills the house is quite lovely, almost like baking chocolate brownies. I got my "green" beans by importing though - from the States by suitcase express to Panama, a coffee exporting country. Because of networks, and economics, I guess. And amateurs.^(iv) After a while though I just couldn't be bothered with either the roasting or the French press anymore and returned happily enough to the vulgar plastic-laden auto-dripper.

As for quitting now, it's easy to rationalize either way; sure, it doesn't cost much, in time or otherwise, and it's a pleasant way to start the day, both the consumption and the bit of routine activity to get moving; but then, it's an unnecessary dependency (beyond the obvious level of the thing itself, there's also the needing a bathroom nearby at that time of day), it complicates sleep regulation, there's any number of alternatives for a get-moving routine, and I could use every bit of extra time I can get. It'll have to suffice to say that some bit flipped in my head and it's just not my preference anymore. My intake had been moderate and the withdrawal wasn't too bad this time around: mostly just some general drowsiness with evening headaches that responded fine to basic pain pills.

AAAAnyway,^(v) there's this fresh sensation of time in the morning, where the restraints of habit have been replaced, however temporarily, by this vacuum wherein I'm Free! to jump right in, to anything at all, from the great, uncountably infinite if tightly bounded space of possibility. So where have I been jumping, and how's it working out?

Since I'm far from being without plans, I picked an item that's not been getting enough attention otherwise, namely to get moving with some writing, such as those articles I've been hinting at or otherwise having hinted at for me. But I perceived a gap, in that I was going to be writing about code and yet didn't have under my control any kind of adequate system by which I could link to the specific items under discussion in their larger context. My V patches and seals collection may serve for a reference code shelf, but - going with the metaphor here - a shelf alone, with no facilities for picking something out and sitting down to read, much less indexing and searching, does not make for much of a library.

This had bothered me before, but the pain was becoming acute, especially after the shutdown of a code viewer loosely connected to the Bitcoin Foundation left me with a bunch of dead links.^(vi) Thinking of the approach outlined recently and earlier by my Master^(vii) to starting small on system design, as well as my natural distaste for excessive layers of indirection and misdirection, I thought of how a minimal prototype might work that would serve my current purpose yet not set me up for trouble in future changes or expansions. I concluded that for the simple task of showing a read-only, marked-up view of a tree of code, there was no call for any RDBMMS - Relational Database Management MegaSystem; the necessarily present and tree-structured Unix filesystem would be plenty. Future additions like identifier indexing and patch browsing might benefit from a fancier database; but then, flat files might be fine too and there was no need to decide this yet. The processing would not require a heavy-weight language-and-library system like PHP; in principle it could even be an Awk job. For the web serving part, we're already committed to using Apache, and it would handle this fine no matter what I chose for the other parts; and anyway it has the smallest codebase of the four in that famous LAMP stack, not a meaningful data point in isolation but perhaps more so in light of the maturity of the project and its wide usage.

At this point, in hindsight, I suspect I got carried away by some old and rather strong habits. I figured, "how hard could this be? I'll just give it a quick shot and see what I can come up with." Well... morning gave way to afternoon, evening and a late night ("just one more thing to wrap up!!") and then there was still the writeup to do. I should really learn already that my present minimum "quick shot" at a new thing is two days and change. Some costs of this enduring naivety were that I didn't get to discuss the project with the management side of JWRD ("why should I have to wait around for a meeting? I'm just trying something out real quick!"); that my normally scheduled work didn't happen and I'll probably end up cutting further into the weekend for it (which I wouldn't normally do but we're in a bit of a final sprint); and that I didn't do much explicit documentation of the choices I considered or attempted and my reasoning. So to make the most of the time spent - especially the pile of discoveries of what *doesn't* work, that great universal brokenness abhorred by the juvenile mind yet providing the substrate on which science is built^(viii) - I'll now proceed to document by retracing my path as best I can from memory while still fresh.

In some past undocumented research (mostly August 2019, if timestamps are to be believed) I had been looking at ready-made systems with the more ambitious goal of fully accurate identifier cross referencing (i.e. aware of language syntax and semantics) for C and C++ codebases. This is way harder - but all the more valuable I expect - to do for C++ because of the various namespacing, overloading and template features. The list of what looked at least interesting enough to get an initial bookmark came to: Synopsis, Cxref, DXR, Dehydra, gcc-python-plugin, Source Insight, OpenGrok, Cscope, Doxygen, Woboq, GNU GLOBAL, searchcode, LXR. None stood out as a clear winner and most seemed poor fits either from being too narrow, not accurate, generally too bloated, or coming with intolerable environmental demands. I lacked a clear idea of my evaluation criteria and their relative importance, partly due to the larger lack of a clear need or problem to be solved, so didn't take the search any further, but in general it was LXR, Cxref and Doxygen that came out looking the least-bad, possibly with honorable mention for Cscope if the web-based requirement is dropped.

None of these however would be friendly to a prototyping approach; I would be importing someone else's problems, committing to yet another large framework, investing some nontrivial effort just to get the basics working and a whole lot more to figure out the internals enough to shape them to my future needs or patch the expected gaping holes. Certainly none would come with V integration. So I was tending strongly toward doing my own minimal thing for now and improving it gradually over time as informed by usage. It's probably another of those bad habits that I went ahead with this path based on this bias without even wanting to look closer at the existing options. But then, decisions have to be made and there isn't all the time in the world to make them. But then, that's something of a platitude that doesn't help in figuring out how much is enough. I'll leave it be for now before I get myself any further tangled up.

For the implementation language there were several options in my existing toolkit. I've done web stuff in Python before; the way it's usually done is dubious, with its own special WSGIs and such; of course there's no rule that you have to use those, but I didn't have a ready alternative. Moreover, there's no Python in JWRD's current server-side Web repertoire; it would be quite a large thing to add just for this, and substantially duplicating the functionality of PHP which we are using. At the same time, I'd rather be moving toward less PHP dependence than more; especially if there's no MySQL involved it's not clear that it would be adding much value here. There is my very own Gales Scheme, which I've also used for some relatively basic Web work. It's still a bit on the immature side though; the major lacks coming to mind as likely to cause trouble down the line, leaving me fighting or working around the tool in this application, are fine-grained error recovery (which was left unspecified in the Scheme standard, grr), adequate Unix system interfaces, and regular expressions or BNF style automated parsing. This left me looking at Awk at least for the prototype stage, or possibly going straight to C.

For connecting these to Apache there is the classic and simple mechanism of CGI - Common Gateway Interface, which works by running a separate, short-lived instance of your program for each request, passing the request or other server parameters as environment variables. Both Awk and C would provide a fast process startup time, dodging one of the drawbacks of CGI for slower-loading things like Python. I do have some distaste for the design^(ix) but figure I'm probably experienced enough to use it safely. SCGI (or FastCGI if you're into utterly pointless bloat, needless to say the PHP people went that way) is an alternative with none of those drawbacks, but requires some up-front decoding work to get at the variables, kind of ill-suited to the Awk quickie.

Next up there were the user-facing points to be decided of naming and URL scheme. To minimize potential naming conflicts I chose to support an overall URL prefix, going with "codeview" for my instance as a direct and concise statement of what it is. I considered something involving "V" but it's not yet clear that this would only be for V-based projects; certainly the initial tree and file viewing functions would work on just about anything as long as it can be expanded out into a complete tree per version. I'm not sure if I'll stick with "codeview" to name the project itself. There's apparently some MS-DOS era debugger squatting on the name - quite unjustly, as that would be a viewer for execution state, not code as such - but I rather doubt anyone would care about such a conflict.

After the URL prefix there's a function name, perhaps not strictly necessary yet but allowing for routing to different scripts through simple Apache directory configuration, much in the vein of "/prefix/scriptname.php?document=foo/bar" but without the spurious details so simply "/prefix/scriptname/foo/bar".
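To make the URL scheme concrete, here is a minimal sketch of how a path of the form "/prefix/scriptname/foo/bar" breaks down. It is written in Python purely for illustration; the real routing is done by Apache directory configuration, and the function name `split_codeview_path` as well as the example function name "tree" are assumptions, not taken from the actual code.

```python
def split_codeview_path(url_path, prefix="codeview"):
    """Split '/prefix/function/rest...' into (function, rest).

    Returns None when the path does not fall under the prefix or
    names no function. Hypothetical helper, for illustration only.
    """
    parts = url_path.split("/")
    # A leading slash yields an empty first component:
    # ['', prefix, function, rest...]
    if len(parts) < 3 or parts[0] != "" or parts[1] != prefix or parts[2] == "":
        return None
    function = parts[2]
    rest = "/".join(parts[3:])
    return function, rest


# Example: the function name selects the script, the remainder is the
# path within the code tree.
print(split_codeview_path("/codeview/tree/foo/bar"))
```

The point of the scheme is that nothing after the function name needs decoding: the remainder maps directly onto a filesystem path.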

Remember that part about "complete tree per version"? This could add up to guzzling quite some disk space on the server for largely redundant data. For starters I can just import select versions as desired; later, a more efficient storage scheme could be implemented e.g. deduplicating based on V hashes, or fully integrating with V to generate the "pressed" files on demand from the canonical patches, or compressing the individual files. This could be done transparently and wouldn't require changing the URL scheme.

A related question is what if anything to do about web robots; this could make for quite a "spider trap", serving up slightly-different views of large data sets in the most expanded possible form. At the same time it's not clear that we *don't* want search engines finding it, so I plan to start out permissive with the option to look at robots.txt exclusions - or possibly better "nofollow" links, as that would allow direct human-linked stuff to still be visible - if the bandwidth or load starts to be a measurable issue.

These basic decisions more-or-less down, we now get to the part where the combination of my ignorance and the complexities and deficiencies of everything involved turn what felt like it ought to be the fabled ten-minute spot of work into a monster that devours the whole day and then some.

My first attempt was indeed as an Awk CGI script. Based on the pattern I'd seen in MP-WP, Mediawiki and possibly other PHP apps, I could at least pass the file path as a GET parameter and use mod_rewrite to clean up the visible URLs. Simply by checking docs and observing the CGI variables I found a more straightforward option, the PATH_INFO variable helpfully provided by Apache. There soon arose a question though of how to validate the path and handle abnormal conditions, such as simply nonexistent paths, or directories that might not have a trailing slash in the URL which are traditionally redirected to the canonical form. I found how to set the response status code - you send a special "Status:" header which the server picks up - but checking filesystem conditions in Awk was proving clumsy at best.
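The cases that first attempt had to cover can be sketched as follows. This is not the Awk script, which isn't reproduced here, but a hedged Python illustration of the same CGI logic: read PATH_INFO, check the filesystem, and answer with a "Status:" header. The names `handle_request` and `emit`, the placeholder bodies and the rejection of ".." components are assumptions made for the example.

```python
import os
import sys


def handle_request(environ, root):
    """Decide the response for a CGI request as (status, headers, body).

    environ is the CGI environment (a dict such as os.environ);
    root is the top of the code tree. Hypothetical helper.
    """
    path_info = environ.get("PATH_INFO", "")
    rel = path_info.lstrip("/")
    # Refuse attempts to climb out of the tree.
    if ".." in rel.split("/"):
        return "400 Bad Request", [], "bad path\n"
    full = os.path.join(root, rel)
    if os.path.isdir(full):
        if not path_info.endswith("/"):
            # Redirect a directory to its canonical trailing-slash form.
            location = environ.get("SCRIPT_NAME", "") + path_info + "/"
            return "301 Moved Permanently", [("Location", location)], ""
        return "200 OK", [], "directory\n"
    if os.path.isfile(full):
        return "200 OK", [], "file\n"
    return "404 Not Found", [], "not found\n"


def emit(status, headers, body, out=sys.stdout):
    """Write a CGI response; the server turns 'Status:' into the HTTP status."""
    out.write("Status: %s\r\n" % status)
    out.write("Content-Type: text/plain\r\n")
    for name, value in headers:
        out.write("%s: %s\r\n" % (name, value))
    out.write("\r\n")
    out.write(body)
```

As a CGI program this would be driven by something like `emit(*handle_request(os.environ, "/path/to/tree"))`. Every branch here duplicates behaviour a web server already has for static files.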

So I changed tack to trying it in C. As soon as I needed to build a path string I was reminded that oh yeah, the standard library still hasn't invented something as basic as string concatenation in any way that works. No big deal, I grabbed my small extension library of string routines from previous work. Then I realized I didn't actually need it in that instance because I could "chdir" and use a relative path.

The core part of rendering code lines to HTML was actually pretty simple, with a character-wise reading loop to do the line counting and HTML escaping, barely even qualifying for the label of state machine. In this case I found it tidier than the "regexps are cool so let's do everything by string substitution" approach favored by Awk and friends. Some ways into this work though - probably on taking a break or something - it hit me that I was going about it all wrong. Generating the 404 Not Founds and directory redirects was after all redoing something that Apache already provides in the simpler case of serving static files. I'm essentially just serving up a static file tree with some post-processing; doesn't it have a thingy to hook a script into that stage of its normal functioning? Why yes, they call it "filters" and in particular external filters. Suddenly the apparent mismatch with Awk and reinvention of wheels disappeared. It was a thing of beauty, one of those rare moments where you just know you found the correct thing. Of course, there had to be some catches...
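For a sense of how small that character-wise rendering loop is, here is a rough Python rendition of the idea (the original is C and is not shown). The function name `render_code`, the `L<n>` anchor format and the exact markup are assumptions; only the technique of counting lines and escaping HTML one character at a time comes from the description above.

```python
def render_code(text):
    """Render source text as HTML with numbered, linkable lines.

    Reads one character at a time: start-of-line emits a numbered
    anchor, special characters are escaped, everything else passes
    through unchanged. Hypothetical sketch.
    """
    escapes = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}
    out = []
    lineno = 0
    at_line_start = True
    for ch in text:
        if at_line_start:
            lineno += 1
            out.append('<a id="L%d" href="#L%d">%d</a> ' % (lineno, lineno, lineno))
            at_line_start = False
        if ch == "\n":
            at_line_start = True
            out.append(ch)
        else:
            out.append(escapes.get(ch, ch))
    return "<pre>" + "".join(out) + "</pre>"


print(render_code("if (a < b)\n    return &x;\n"))
```

The only state carried between characters is the line counter and the start-of-line flag, which is why it barely counts as a state machine.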

Somewhere in here I'd also realized that a raw file download option would be trivial to add by having Apache statically serve up the same file tree read by the script under a different URL path. This didn't directly translate when switching from CGI to filter - because filter configuration is done by filesystem path rather than URL path, there didn't seem to be a way to serve both a filtered and raw view of the same tree. A single Unix symlink came to the rescue, allowing me to present the same stuff to Apache at two different paths. Arguably a workaround, but at least an easy one.

The next pain to arise was that my browser would show a previously cached version of the page when I would reload to test a change, necessitating the "no really, reload all the way through whatever caches" command instead.^(x) With Apache handling more aspects of the request, it was adding its ETag and Last-Modified headers based on the underlying file, unaware that the filter could be completely invalidating this data. Besides the slight inconvenience now, I didn't want a situation down the line where I'd made some change to the formatting but people see it inconsistently even after explicit reload. After some frustrated rummaging in both Apache and Mozilla docs - mod_cache and mod_expires were unhelpful, while the browser's treatment of cache tuning issues struck me as entirely broken - I found an awkward if working fix, using mod_headers to remove the offending ones altogether, if not quite at the source then at least not too far downstream. For this one my search spilled all the way to StackOverflow, usually a good sign of desperation.

Once I had code files displaying beautifully, yet another puzzler arose for the case of directory listings. The filter script was being fed the auto-generated HTML index, as if it were a regular file! I first thought to try making Apache generate the indexes in a raw text format so I could then have the filter generate the HTML with consistent style and navigation links, but among the many auto-indexing options there was none for this. Furthermore, it wasn't clear how the filter could reliably distinguish index from file, and it got even worse in the case of a directory containing an index.html file, as its contents would shadow the auto-index! I did find a clean solution here, basically by aiming to work with rather than against that feature: the path where Apache checks for an explicit index prior to generating one can be changed from the relative "index.html" to an absolute one, which can even be a CGI script. Thus the two different functions of marking up files and generating directory listings can be handled by nicely separate scripts.

So it just remained how to generate those listings; the "awk is bad at filesystem operations" came up again but it was easy enough to do by shell command calls, especially with the aid of my trusty shell quoting function from past work.
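The listing step itself is simple once the filesystem is within reach. The sketch below shows the general shape in Python, where directory reading is built in and no shell quoting is needed. It is an illustration under assumptions, not a port of the Awk script: the name `render_listing` and the list markup are made up for the example.

```python
import html
import os
from urllib.parse import quote


def render_listing(dir_path):
    """Return an HTML list of the entries in dir_path, sorted by name.

    Directories get a trailing slash so their links are already in
    canonical form. Names are percent-encoded in the href and
    HTML-escaped in the link text. Hypothetical sketch.
    """
    items = ['<li><a href="../">Up</a></li>']
    for name in sorted(os.listdir(dir_path)):
        if os.path.isdir(os.path.join(dir_path, name)):
            name += "/"
        items.append('<li><a href="%s">%s</a></li>' % (quote(name), html.escape(name)))
    return "<ul>\n" + "\n".join(items) + "\n</ul>\n"
```

Note the two distinct escaping layers: a file name containing "&" or a space must be encoded one way for the URL and another way for the visible text.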

Finally, I spent what felt like way too much time on fine-tuning the page titles and navigation linkages. Feelings make poor chronometers though and I didn't specifically track it so it's hard to say for sure. At one point I was trying for a full "breadcrumb" path, i.e. where you can click any component to jump to its level of the tree, but as the code was getting a bit hairy I just went with the plain "Up" link for this stage. Some complications came from the fact that while physically the files are all in one big tree, logically the top two levels are fixed as project and version, where each named version forms the root of its own source tree. I had made things harder on myself by splitting out the project and version listings into separate functions - in the sense of their own URL prefix and script - but couldn't justify this as they were almost entirely duplicated from the subdirectory listing generator plus extra code to deal with the "wrong" parts of the tree that had to be blocked off or redirected to another script. Those top-level views may still end up specializing but there's no need to complicate the URL scheme to do it.
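For reference, the full "breadcrumb" idea amounts to accumulating a link per path component. A minimal Python sketch follows; the base URL "/codeview/tree/", the function name `breadcrumbs` and the example path are assumptions for illustration, and it ignores the project/version distinction that made the real thing hairier.

```python
import html
from urllib.parse import quote


def breadcrumbs(rel_path, base="/codeview/tree/"):
    """Return HTML linking each component of rel_path to its level of the tree.

    Hypothetical sketch: every component except possibly the last is
    treated as a directory and linked with a trailing slash.
    """
    components = [c for c in rel_path.split("/") if c]
    links = []
    href = base
    for i, comp in enumerate(components):
        href += quote(comp)
        # All but the last component must be directories; the last is
        # one only if the path itself ends in a slash.
        if i < len(components) - 1 or rel_path.endswith("/"):
            href += "/"
        links.append('<a href="%s">%s</a>' % (href, html.escape(comp)))
    return " / ".join(links)


print(breadcrumbs("project/version/src/main.c"))
```

The plain "Up" link is the degenerate case of this: a single relative link to "../" that needs no knowledge of the path at all.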

In all, the present code comes to 109 lines of Awk and 64 lines on the unused C branch, or 167 if you count that string library. Very little really, dwarfed entirely in both space and time by this writeup, which has dragged out to a three-day affair; perhaps I've gone entirely from one extreme to the other and can now work on finding a happier medium, or at least a more sustainable pace for building that writing habit.

Oh, and supposedly now I can finally write that report I originally set out to on Wednesday. Don't hold your breath though as I've got more loudly whistling kettles to tend to now. In the meantime, how about having a look around?

(i) This term promises a return to the main arc after having followed a tangent.
(ii) That "main" arc may of course turn out to be just an osculation of a higher-order curve.
(iii) The proper fresh grind is more important than freshness of roast for preserving flavor and quality generally, and for the same reason it's ground in the first place: the high surface area that allows fast extraction likewise allows fast oxidation. If you're one of those who sweeten their coffee because they think its natural taste is bitter, it probably means you've been doing it wrong.
(iv) Meaning the locals of course, but also meaning me, since I suppose the serious gentleman would be paying visits to Fincas and buying direct.
(v) By this point I trust the concept of recursion is quite clear.
(vi) Apparently it's back now, doubtless after heroic rescue efforts by gentlemen of honor and wealth and privilege. Whatevers, "fool me once... fool me can't get fooled again" as we arbustos say in Tejas. I'm not complaining though, I had no basis to assume it would continue to exist for my benefit. Just like all the kids these days ~~using~~ being used by Microsoft Github.
(vii) Honestly it still feels a little weird - though proper - to write that, even if I turned out in the end to be the most serious pledger. Might be an American thing; the poor word's been well encrusted with barnacles of negative association.
(viii) There's a reference here to something I was just recently reading but can't now find.
(ix) It overloads the purpose of environment variables, as these are normally used to alter the functionality of all manner of programs, passed permissively down the process tree by default; while CGI in practice fails to clearly define the set of variable names so it's possible to get "action at a distance", where one component causes unintended and unpredictable changes to another unrelated component. "HTTPoxy" was a popularized attack drawing attention to one instance of this. It's the exact situation of "dynamically scoped" variables in programming languages, a more subtle issue than the perhaps more familiar "global versus local". It was a debate in the early Lisp world; Common Lisp and Scheme made the jump to lexical scoping by default and basically nobody seriously wants to go back.
(x) Shift-F5 or Shift-Ctrl-R in Linux Firefox. Funny how there doesn't seem to be any mouse interface for it.

6 Comments »

A pleasing read and slick result, congrats !

For the uninformed Calvin Ayres of the world, who falsely claim Bitcoin can have shartcontracts: "There are no loops."

While the report might be a bit late, at least you have this to point to. Cheers to happy mediums and worthwhile habits built at sustainable paces.

Comment by Robinson Dorion — 2021-05-23 @ 18:36
Glad you liked. And let's hear it for the first over-3kword (indeed nearly 4k) Fixpoint article excluding chat logs!

Comment by Jacob Welsh — 2021-05-23 @ 19:52
It looks quite good actually, the direct links to individual lines are absolutely great and I quite like the use of Apache's external filter module with all the flexibility it allows at minimum cost. Basically it looks already usable and a good start (the sort that can be grown as needed) to solving this real problem of code-on-blog, short of running the scripts separately to just produce the static html and then simply serve that (that would be the approach I took to make and update when needed the Euloran Cookbook).

One small question, but since it's something affecting the links so rather costly to change once set: why so many levels in the path there already? At a first look, it seems to me like a bit of a mixture between projects (bitcoin_system_compiler, I guess) and possibly categories of projects (bitcoin/), is that the idea?

Fwiw - I quite enjoyed the unwinding of the coffee-spirals (although yes, it does sound like simply a ton of unwritten articles and otherwise a bit of the "can't start on x because..." trap, perhaps).

Comment by Diana Coman — 2021-05-24 @ 15:54
Thanks @Diana Coman. I did think of pregenerating html but saw it as a kind of caching optimization that has a cost and isn't clearly needed as yet (excluding the considerations of those who like the approach because "running apache is too hard" and such). For instance, even with just these two html and raw views you'd already be more than doubling the required storage space; and more care would be needed to avoid adding more manual steps or delays to the update process.

In the present example bitcoin is the project name and bitcoin_system_compiler is the version name derived from the vpatch; note that most of the old bitcoin patch names didn't include the project name prefix. Then there's yet another "bitcoin" because of the conventional V-tree structuring. It does all seem rather redundant, and I had tried to push back on that in the keksum genesis but the conclusion seemed to be that it was premature as there'd been insufficient collaboration between V projects to judge how much structure would prove necessary.

Comment by Jacob Welsh — 2021-05-24 @ 17:25
Whoops... *.php files were getting interpreted rather than source displayed, in both the tree and raw views. Fixed by adding a "SetHandler None", except for some reason it had to be in a "Location" rather than a "Directory" config block.

Comment by Jacob Welsh — 2021-07-06 @ 23:22
[...] in Spain. That'll be the harder one to kick, but let's see how it goes. Jacob seems to have kicked it just fine. So instead of Panamanian coffee to start the day, I'm substituting in herbal [...]

Pingback by Sober October « Dorion Mode — 2021-10-07 @ 20:28


Fixpoint
