shimweasel.com

You need a novelty budget

2018-08-25T00:00:00+00:00

We measure a lot of things in software engineering these days. Test coverage, time-to-deploy, bugs per line - they’re all good things to keep an eye on. They’re all proxies for risk of failure, either from moving too slowly, or from bugs and downtime destroying the business.

Something that’s not often explicitly controlled, however, is Novelty. One of the dirty secrets of programming is that almost every production codebase contains some dependency that the developers have never used before. Perhaps they’ve written a trivial project to play with it, but mostly they’re relying on community feedback, or if you want to be dismissive, fashion. Civil engineers have solid rules for the way a bridge ought to be constructed: we switch web application frameworks on an annual basis.

Why are we indulging in so much novelty anyway?

There are reasons for this churn. The widespread availability of libraries and frameworks is one of the reasons creating a startup is cheaper and easier than ever before. Your app sits on top of a vast pyramid of code you’ve never seen, and if you want to keep up, you can’t just ignore better options when they come along: your company will suffer, and so will your career (unless you enjoying maintaining pre-millennial line-of-business COBOL apps).

Avoiding the Scylla of endless novelty and the Charybdis of stasis

My technique is to use something called a “novelty budget”. It’s a rough, informal thing: in essence, you decide how much tolerance you have for library code you’ve never deployed on a new project before, and apportion it out in the ways that you think will give you the biggest bang-for-buck. This means that you accept that you’ll pay a cost in missing documentation, unpolished tooling and ungooglable errors for that part of the project, and you make up for it by using the most solid, boring, dependable pieces you can for the rest of it. The name of the game is reducing risk, not optimising for the best possible case you can imagine.

The essence of this approach is that you are trying to balance average and worst case time-to-completion. Any novel component is going to increase your worst case time: the hope is that it improves the average, both now and over the evolution of your app in the coming months and years.

The default tech stack

Let’s assume you’ve decided where you’re spending the biggest chunk of your budget and put that piece aside: I can’t comment on why that’s a good expenditure for you. Whatever it is, every other piece of the project has to tighten its belt to compensate. (Obviously, if one of these is your primary expenditure, feel free to ignore me - this is just a set of defaults.)

Suggestions

choose something boring like Ubuntu or Debian, rather than a fancy hardened BSD variant or exotic unikernel.
Don’t adopt a graph database or fancy distributed database until you’ve satisfied yourself that PostgreSQL or SQLite are really not sufficient. (My defunct startup MeanPath scraped 160 million websites a day using SQLite. The limits are probably higher than you think.)
Try a serverside-rendered site before adopting a SPA framework - you might find that a limited amount of scripting is all you need.
Devops automation is a real timesaver, but don’t feel that you have to go full Docker/Kubernetes right off the bat. While you definitely need a process to build a deployable artifact, it’s entirely possible that artifact is something more akin to a git repo that gets pushed to heroku endpoint than full-on enterprise-grade automation.

Process

When I can, I run my development processes on GitHub via pull reviews, issues, and the rest of that stack: possibly there is a better solution for parts of it, but the more you can stay in the boring zone, the lower your chances of a catastrophic failure. Writing your own process management software is unlikely to be a good idea unless that’s your company’s reason to exist.

Startups are different though, right?

There’s a seductive argument that since the chances of any given startup’s success are small anyway, you might as well crank all the knobs to 11 just to increase the variance. I think this is a bad idea: contrary to the founder legend, most startups are not solving difficult technical problems. The job of the founding engineers is to get something out there that works well-enough, and to do so quickly enough to test the hypothesis on which the startup was founded. Once you get traction, you might hit scaling problems: at that point, you’ll have either revenue or funding to fix them.

Haskell case study

If you don’t care about Haskell, skip this section.

Concretely, my biggest novelty budget expenditure for a project is frequently Haskell. Haskell has some obvious benefits for me: I know- it well, and I can bake out a lot of potential flaws just by leaning hard on the type system. It has a cost, though: IDE tooling is not as seamless as something like Java or Ruby, sometimes libraries are missing, sometimes you hit baffling type errors (especially with more advanced libraries).

The principle of a novelty budget applies within Haskell too, both at a language level and in your choice of libraries.

Language

Haskell gives you a lot of rope to hang yourself, if you’re so inclined: there’s a whole zoo of extensions, and the emphasis in community blog posts on sexy types means that you can easily get the impression that if your app doesn’t have a type-indexed generic free monad at its core, you might as well be writing Perl. It just isn’t true. Sum types, parametric polymorphism and typeclasses already put the language way ahead of almost everything out there, and you can build very solid apps without using any extensions. Treat it as an a la carte menu, not a buffet. (I don’t include things like OverloadedStrings or LambdaCase here - they are minor syntactic conveniences that don’t add significantly to cognitive load.)

Libraries

I tend to use the Yesod/Persistent+Esqueleto stack. It has problems but almost any reasonable thing someone might do with a website has been attempted in Yesod, and there’s usually a solution. There are fascinating database experiments out there with Ferry and Opaleye, but they don’t have the volume of use. Similarly, Servant is a brilliant piece of work, but in my test projects, I inevitably hit some small but critical thing that will not get fixed for days, or weeks, or months, and I can’t afford that time on a commercial project. (Yes, one option is to fix it yourself or sponsor the author to do it: for whatever reason, I’ve found that it’s usually far harder to get payments authorised than it is to spend the equivalent amount of dev time. That’s a company management problem I have no idea how to solve.)

Another approach I’ve seen people have success with is to use WAI directly or very minimal frameworks like Scotty. While they don’t have all the bells and whistles of Yesod, you are very unlikely to end up hitting a hard stop.

Conclusion

I hope I’ve convinced you that this is at least a concept worth thinking about. I should mention that I’ve mostly worked at smallish startups, and it’s entirely possible it doesn’t generalise to AmaGooBookSoft - would love to hear from anyone in that environment.

Feel free to tell me I’m completely off base.

PS. I’m not claiming to have invented this concept - it’s been a term of art in my development conversations for years. I was frankly surprised to find nothing about it written down and would welcome hearing about anything I’ve missed.

References and apologies

Similar
A contrary view from the enterprise

Thanks to Matt Olson, David Maciver and Alec Heller for thoughtful comments on a draft.

Thanks also to my other reviewers who I currently can’t find because Twitter search is terrible, and apologies that this took almost a year to actually find time to polish and publish.

The ThinkPad: an elegant weapon

2018-08-19T00:00:00+00:00

I like ThinkPads.

They’ve never been cool. Ownership marked you as one of two equally unhip tribes - either a drone in a suit, bashing out spreadsheets and TPS reports kilometres up while some entitled bastard makes hamburger mince out of your knees, or a 400lb hacker well-actuallying usenet threads and prefixing “GNU/” on every project in sight like a crazed graffiti artist. Nobody’s ever headlined Lollapalooza on a ThinkPad.

And yet. If laptops were violins, the Thinkpad is what would happen if Stradivarius decided to make his next project bulletproof for funsies: Linux-friendly at a time when debugging X11 configs and sound drivers was a mandatory rite of passage, rugged enough not to be precious about, and sporting a tall screen for extra code space. A ThinkPad was a reliable comrade in the war on complexity.

And the keyboards! It’s as if they’d actually talked to users who typed for a living - hardware buttons for sound control and brightness, a proper function key row, and a bouncy, kinetic key action that let you know the machine was listening.

The last usable keyboard on a laptop was on the T420, in 2011. Apple lured every other manufacturer into a doomed arms race of making laptops thinner and lighter, far past any possible utility. I fully expect to see a monomolecular laptop running MacOS X in the near future, and its keyboard will fail every time Schrödinger’s cat meows.

I don’t mean to discount other brands. My tankish HP, my cute little ASUS netbook, and even my MacBook Pro were instruments of creation. They had keyboards with satisfying, chunky actions, and they invited you to do something cool, not just passively imbibe the work of others.

But somehow, despite the proliferation of tablets, phablets, and voice-activated TVs, the market has decided that the one portable device which was useful for producing things is now primarily for Netflix and Youtube. Who cares if the keyboard causes you to mistype a few words in the comments section? Nobody’s going to be able to distinguish it from the background roar of illiteracy.

The ThinkPad fought the good fight for a long time, longer than almost every other, but even Lenovo folded in the end. The 25th anniversary edition is almost mockery, planting a reasonable keyboard on an otherwise painfully mediocre machine.

I haven’t given up, though. While I write this on a sticker-covered T410 and my travel T420s sits at home charging, my mailbox awaits a T70. I paid $1500 5 months ago for this machine: an unofficial, homebrew chimera, modern innards transplanted into a T60 chassis by the incommunicative wizards of LCDFans, a fever dream of a laptop that might never arrive: this is how deep the yearning goes.

(Thanks to the stubbornly unwebby Alec Heller for edits.)

notes on a better migration system for Persistent

2018-03-18T00:00:00+00:00

Problems with Persistent

I use Persistent for a lot of my Haskell database work. It has some great qualities - I really appreciate the type-based linkage between the database types and my Haskell types, and while its query language is anemic, Esqueleto does a good job of modelling most of SQL (or at least enough that I can get useful work done without constant frustration.)

The biggest problem I hit is with the migration system. It’s computed by comparing the current structure of tables with what the Haskell model thinks they should be, and has some limited support for detecting what changes need to be made: however, it doesn’t allow you to consider two versions of the same database in one program, which makes using Haskell functions to populate new or changed fields & tables impossible.

In previous codebases I’ve ended up serialising the SQL at development time and creating ActiveRecord migrations (hat-tip to Chris Allen for that trick), allowing me to at least use SQL to populate new columns, but I frequently ended up needing to write functionality again in SQL just for the migration.

(It’s also possible that you can solve these problems by not using Persistent, but so far every other database access library I’ve tried either takes a strings-in/strings-out approach which leaves me worried that the code and the database have fallen out of sync, or have used a plethora of fancy types that I find hard to work with.)

Desiderata for a migration library

Should work with Persistent’s datatypes and migrations. It needs to know what DDL changes that persistent would try to make. This doesn’t mean it has to do the same thing (and often won’t - can’t do triggers, constraints, etc): it just means that when complete, persistent should agree that there are no migrations to be run.
It should provide a predicate to that effect, which should be easy to run in a test suite as a sanity check, as well as on startup.
Migrations should be easy to run on startup, in the context of Persistent. this one is probably controversial, but fits my use case. If the migration fails, we should have a way of signalling to the environment that we are a failed deploy, and that the migration should be aborted and the last version of the code substituted back in. (Keter has health checks, though not terribly exhaustive ones - I think it just checks that the given HTTP port is open, so killing that listener might be enough.)

We also need to make sure that if multiple web backends try to run the same migration concurrently that only one makes it through, and that the others don’t interpret their inability to run the already-run migration as evidence that they’re broken.
DDL migrations should be in plain SQL. Data-changing migrations can be in Haskell or SQL: this allows us to back-populate columns using haskell functions with all the context we’re used to. Because we aren’t trying to talk about different structural states of the database at the level of types, this will usually require a two step migration process: add the fields in a nullable way and populate them using Haskell functions, then in the second step, make them non-nullable or foreign-keyed or whatever other database-level constraint needs to apply.

This also implies that you need nested transactions: if the first step succeeds, but the populated fields do not satisfy the constraints you expected them to, the database needs to be rolled back to before the migration started. (There can actually be a third step, where you update your Haskell types to reflect the new constraints: this might be as simple as switching Maybe a to a - this can fail too, if your SQL migration wasn’t correct, but it has no database effects and can be rolled back without having to worry about the database.)

Resolutions, Plans and Systems

2017-12-31T00:00:00+00:00

Traditionally, the new year is when we all make public proclamations of how we’re going to be better humans: this year, we will read more, run more, listen more, eat healthier, and generally be our best selves.

It’s even more traditional to abandon these resolutions about mid-January, so last year, I decided to try something a bit different. First, I was going to try a new thing each month and stick with it for the month. If it worked, I’d continue it. Second, I wasn’t going to talk about it, for the reasons alluded to here *.

Results

So, what happened last year? What worked, what didn’t?

January: Lifting. ✘

I wanted to get back to the steady cadence of lifting weights I had in Vietnam, with a concrete goal of getting back to 350lb squats.

Illness, sleep deprivation and generally being responsible for Arthur in the mornings foiled this, and my current 3x5 squat sets are about 285 lb. He’s now pretty happy to play in the pen down in my basement gym, so this one might get another go this year.

February: Keto. ✔

My February resolution was to lose my gut and to keep it off in a sustainable way. My weight has varied enormously over the last twenty years - from 96kg (after finishing a 1000km bicycle ride through Thailand, Cambodia and Vietnam) to 145kg (in Vietnam, where I was lifting heavy weights regularly but also drinking beer and eating delicious food perhaps a little more regularly than necessary). In 2016, I had a son, who turned out to be both enormous and possessed of approximately three Tasmanian devils’ worth of energy: the idea of being a puffing Fat Dad to such a whirlwind was too horrible to be contemplated.

After a bit of research, I ended up deciding my experiment was going to be with the ketogenic diet. I’ve done paleo before and had some moderate success, but it always buckled under work or life stress. Keto had some very appealing properties: unlimited bacon and butter, for one thing, but also, clearcut boundaries. For some perverse reason, it’s easier for me to regard certain foods as not-food than it is to consume them in a moderate, rationed fashion.

At the beginning of the year, I was 135kg. The software I was using to track it is sadly gone with an old phone, along with the data, but I had lost about 5kg in the first month, and I felt a lot better, so I adopted it as a regular habit. Today I’m orbiting around 120, give or take a kilo, and am more or less comfortable there - I don’t have a sixpack, but my shoulders are wider than my waist. Good enough. (The fact that I appear to be stable is cause for at least as much celebration as the original weight loss - I needed an approach that would work even under stress.)

As an aside, the transition back to carbohydrate-heavy foods for cheat days turned out to be so unpleasant that I cut them out entirely. I’ll occasionally indulge in a beer on a very special occasion, but I always know I’ll pay for it later.

March: No internet arguments. ✔

This one needs to be unpacked a little bit - the intent wasn’t to conduct my online life as a Kumbaya circle. I had noticed, however, that I had been allowing myself to engage emotionally in online arguments (not discussions) with strangers who were not interested in improving their understanding - whether it was Trump supporters who thought I didn’t have the right to comment on gun control until I’d served a tour of duty in the US armed forces, or Clojure enthusiasts who’d downloaded GHC one time then decided that Haskell’s syntax was icky, I was wasting time and emotional energy engaging with people who were not interested in improving their understanding, and didn’t have any intention of changing their mind if their arguments turned out to be wrong. My time is better invested in my wife, my son, and my friends (both online and off).

This worked, mostly. Every now and then I slipped and let an intemperate comment fly, or misread someone as wanting a discussion of policy when they really just wanted a cheer squad, but I didn’t get sucked into the spiraling, angry threads I used to disappear into. Removing Reddit and Hacker News from regular reading helped there too.

April: language learning. ✘

This one was a flop. I started learning Mandarin with Memrise, an excellent tool that helped me get my Vietnamese to haltingly conversational (provided my interlocutor was happy to talk about the weather and how beautiful Da Nang was.) The intent was to try to replace time spent on Twitter, Facebook, online chess and various mobile games (curse you, Polytopia, why must you be so fascinatingly tactical?) with something useful that could be done in isolated pockets of time.

I’m not exactly sure why this failed to become a habit. It’s possible that without the anticipated payoff of actually going to China and being able to talk to locals, it just wasn’t exciting enough to overcome the difficulty of learning all the characters. If I was going to try this again, I think I’d focus on conversational Mandarin and stick to the pinyin script, and perhaps book a trip up front to act as a forcing function.

June: Harmonica. Partial credit.

I wanted something musical I could do with Arthur. The goal was to practice 10 minutes a day, and to be able to play three recognisable songs.

I didn’t come close to the practice requirements past that first month (and I’m pretty sure I annoyed my in-laws by wandering off and practicing while on holiday), and I didn’t get 3 songs, but I can do a pretty reasonable ABC song and improvise something that sounds less squawkily mournful than my initial attempts. Arthur has also started squeaking away on it which is honestly the cutest thing I’ve ever seen.

July: expand career skills. ✘

Unambiguous failure. I started an online course on deep learning, but failed to budget enough time to make progress. Can’t see how this one’s going to work in the future without completely ignoring my family in the evenings, either.

Rest of year

Past July, I didn’t start any new experiments, though I managed to maintain some of those I had started.

Conclusion

So, 2017 was a bit of a mixed bag, but I’m basically happy with it. A healthier diet and emotional equanimity have done wonders for my state of mind, my relationships, and my general happiness with the world, and the failures have helped me scope out my current surrounding terrain for self-improvement. The things that worked seem to have worked because contravening them was clear: it’s pretty obvious when you’re biting down on a slice of pizza or a bad-faith political argument, but less immediately obvious that you’re playing a mobile game rather than pulling out the harmonica or practicing your flash cards.

As you might expect, I’m not going to tell you the content of this year’s goals, but the overriding theme will be to have systems in place to make it as obvious as possible in the moment that I’m contravening my stated goals. I’ll also be trying to make them about processes rather than results: with the best will in the world, I may never lift 500 lb, but I can make sure I lift three times a week.

If you got this far, thanks for reading, and I hope you have a wonderful 2018.

* as ever, it’s more complicated than that

A Modest Scraping Proposal

2017-07-13T00:00:00+00:00

Why scraping libraries in Haskell aren’t good enough

Every time I mention that Haskell webscraping libraries are a bit lacking, somebody points to (http-conduit|wreq) and (tagsoup|taggy) and suggests that it’s just a matter of gluing them together. This is akin to saying that your home-made car is just as good as any other car, and pointing to a brake pad and a steering column.

A robust scraping pipeline needs to handle caching, respect robots.txt, accept input and process output in a variety of formats, gather metadata about the fetched data, handle quotas & rate limits, and log sites/second, error rates and unexpected failures in real time to a dashboard or similar. (Some of these are derived from scrapy, the python scraping framework - others are things I’ve needed along the way. As to why you shouldn’t just use scrapy: it’s slow, untyped, and prone to running out of memory.)

I may actually write this library at some point, but would be just as happy if someone read this and beat me to it.

The shape of the problem

What I mean by scraping is something at least semi-adversarial. Fetching items from the Twitter API is not scraping: it may be true that the owner of the website doesn’t mind you scraping them (which is why we check robots.txt), but they will go to absolutely no effort to avoid breaking your scrapers.

There are many ways to scrape. Sometimes you can just enumerate a list of URLs: more commonly, each fetch gives you both a list of new URLs to fetch and a list of results. This allows a recursive approach to fetching, and also requires some kind of store so you don’t fetch the same URL twice and end up in an endless loop.

Robots

While you can get away with quick and dirty scraping without checking robots.txt (and let’s face it, we’ve all run curl in a loop before), it’s pretty rude to do large scrapes without checking the original provider is OK with it. Thankfully some uncommonly handsome and benevolent developer has done the heavy lifting of parsing and processing robots.txt files - any broad spectrum scraping solution ought to incorporate https://hackage.haskell.org/package/robots-txt .

Caching

Scraping is an inherently trial-and-error driven process. The web is messy and inconsistent, and it isn’t uncommon to discover your XPath query (or equivalent) is not quite right hundreds of thousands of pages in. Proper caching of both raw fetches and processed results means you can fix small errors in code and selectively redo failing parses - it also means that if the site under scrape has intermittent failures, you can refetch missing pages without starting the scrape from scratch.

More or less anything can be used as a persistent store, so long as it’s indexed on the URL. I used multiple sqlite databases as a backend when I was scraping 160 million sites a day: for smaller scrapes, a single PostgreSQL instance should be fine. Your bottleneck is going to be the fetching, not local writes.

Input

A good scraping library should provide a way of querying common data formats like XML, HTML, JSON, CSV and plain text. The standard way of dealing with these formats in Haskell is to use Aeson to parse JSON to a datatype representing what you expect to see: I like this method for normal HTTP interactions, but scraping is a process of extracting a relatively tiny amount of information from a large chunk of text. Something like lens-aeson lets you dig into a json structure in an ad-hoc way

foo ^? key "hah" . nth 1 . key "hfasd"

Because you typically add scraped fields one at a time, this avoids the necessity of adding both a record field and an aeson parser every time you add a field.

Similar reasoning applies to the other formats.

Output

It should provide a way of serialising results to CSV, XML or line-oriented JSON (or standard JSON if you’re feeling masochistic).

This implies that we should store our results in an output-neutral format - in keeping with Scrapy’s nomenclature, I’ll call each result an Item, and think of it as a map of column names to values.

Metadata

It’s often interesting to know metadata about the fetch process itself: a scraping library ought to allow access to the path of URLs that got you to the current one as well as the robots.txt that allowed access and any headers from the server response.

Rate limiting

Rate limiting is important for a few reasons. It’s not at all uncommon to accidentally overload a site by scraping too aggressively - I’ve even accidentally taken down DNS servers in my time. A good scraping library should let you limit global requests/second as well as reqs/second to a particular domain.

It’s also important to have configurable quotas for individual fetches. I’ve had a PHP site on a fast server spew hundreds of megabytes of error messages a second at me, all of which got dutifully loaded into the database. Don’t be like me, be smart.

Logging

Scrapes often go wrong hours in. It’s really useful to have something like scrapinghub.com that gives you a visualisation of how many requests/s you’re getting, and items/s is pretty useful too, but scrapinghub only works with python+scrapy, so that’s no help.

We’re getting a little out of scope here, logging is its own thing that can get almost arbitrarily complicated, but at the very least, you’d want something that can spit structured info to stderr, and to be able to set a constraint that if items/s, requests/s or bytes/s fall below a threshold over the last minute or so.

Testing

You need at least two kinds of tests: individual offline tests for each kind of page you want to fetch, as well as a rougher online test that given a starting point, fetching provides about the number of results you expect. This is necessarily hazy, but if you usually do 2 fetches for each full item, and that suddenly blows out to 100, it usually indicates that the formatting has changed on some page you rely on, and you need to fix it.

Security

You might think there isn’t much to say about security in a scraper - after all, it’s the one calling all the shots. However, it’s important to think about what implicit (eg. IP range) and explicit (password, token, etc) credentials you’re using when scraping. It’s entirely possible for a site that knows it’s going to be scraped to redirect you to a resource for which you have privileged access: if the result of your scrape will become public later, you’ve just exposed private data.

The library should have a default configuration under which it won’t scrape private IPs, use creds, or even go off-domain (though domain.com -> www.domain.com should be fine) - that way you only get potentially dangerous behaviour if you explicitly ask for it.

(Redirects are a thorny problem in general: a library should catch redirect loops and allow the user to set policies on off-site redirects, maximum redirect chain lengths, etc)

Distribution

I’ll be a heretic here and say you probably don’t need distribution. I scraped the front page of 160 million domains every day with 13 machines: if you’re not working at that scale, it really doesn’t matter.

If you really needed to extend the design, you’d want to set it up so that you had a central blocking queue that started out with just the initial URLs. Scrapers would connect to it to get URLs, and at each step acknowledge that they’ve fetched that URL, along with the results of the fetch - they’d never do more than one level of fetching. After n minutes without an acknowledgment, the URL can be sent out again, though you might want to return the history in the response, to avoid redundant fetching.

I implemented this model over ZMQ, but HTTP would be fine and probably simpler - it’s going to be chatty no matter what, so avoid it if you can.

Queue control

One of the problems I faced with Scrapy was that as it used an evented framework rather than an explicit queue, it was extremely easy to blow out memory. In one example, my source data was a set of gzipped files that listed other gzipped files that then contained links to the data I needed. Because I didn’t have control of the queue, it ended up loading every single one into memory before it ever got to the juicy part with the final data.

Putting it into code

So, notionally, our scraper takes data in some known format and extracts some of that information into two things: a (possibly empty) list of new URLs to look at (along with a tag to indicate what kind of URL it is) and a list of dictionaries that represents the actual information you’re interested in from that page. (Commonly this will actually be a single element, but sometimes if you’re scraping search results, you’ll get many notionally separate chunks of data on a single page.)

What might this look like?

-- user code
data PageType
  = InitialIndex
  | IntermediateIndex
  | ActualData
  deriving (Eq,Ord)

This is what defines the shape of the scrape, as seen by the user. Separating out the different kinds of pages we see means that we can parse them differently, without relying on fragile regular expressions on the URL - it also means we can prioritise some fetches above others, as well as restrict download slots on a page-type basis.

-- library code - this needs to be fleshed out.
-- something slightly less general than Aeson's Value type: just scalars.
data Column = ColText Text
            | ColInt  Integer
            | ...

type Item = [(Text,Column)]

-- Quotas, number of concurrent threads, logging, robots, etc
data Config a =
  Config
  { threads :: Int
  , downloadSlots :: a -> Int
  ...
  }

data Metadata = ...

-- here a is instantiated to PageType
scrape :: Config a
       -> [(a, URL)]
       -> (a -> Metadata -> ([(a,URL)], [Item]))
       -> ([Item] -> IO ())
       -> IO ()

(This ought to be in something like Conduit or Pipes for real code: it’s presented in this simpler form for clarity only.)

Dissecting the type of scrape, we give it some configuration information, an initial list of URLs (each with an associated tag), an extractor function, and a way to do something with each row.

Notice the downloadSlots tag in the Config record? That’s there so that we can control intermediate fetches, and avoid the memory blowout I described earlier. You might think it’s enough to have an Ord instance for PageType and use that: unfortunately, what typically happens is that the first few hundred pages will all be InitialIndex and IntermediateIndex pages; while ActualData requests will be prioritised, by the time any are actually processed, we might have already blown out memory. The downloadSlots refers to how many of that kind of download can be downloading at once: that combined with the Ord instance on PageTypes means we can keep a queue of fetchable things ordered in a sensible way that minimises the amount of memory required.

Conclusion

One reviewer quite reasonably raised the concern that this is an extremely un-Haskelly library. Our platonic ideal of a Haskell library is something that exports a single, coherent concept in such a way that it never needs to be reimplemented. This is not that: scraping is a dirty, error-prone, highly contingent endeavour. The goal here is to package up a pile of hard-won knowledge about where the facerakes are and make them easier to avoid.

(In other news, if you actually need this or other Haskell work done, I am available for hire: mwotton@gmail.com.)

(Thanks to @tureus, @jfischoff, @mxavier and @thumphriees for their detailed feedback, and thanks to everybody else who read it, even if you couldn’t think of improvements.)

(discussion here)

a fallible guide to persistent-template

2017-05-18T00:00:00+00:00

Template Haskell is often considered a smell in Haskell, responsible for everything from slow compile times to impenetrable error messages to the cow’s milk turning sour. Some of this is warranted, but if you need to check anything based on compile-time information, it is also the only game in town.

Persistent is a good example of where TH is arguably justified. Defining our database tables and our types in the same format means that there is a single source of truth, with no way for the code to fall out of touch with the database.

Nonetheless, sometimes you do want to break this model; in this case, I have a type (Status from twitter-types) that I’d like to persist in the database. I can’t include this definition in the standard quasiquoted file: the type already exists, I can’t define it again. This leaves me with two options:

use a type with the same structure as the Status I get from twitter-types, and write some conversion functions for converting between persisted and original statuses. Probably about twenty lines and ten minutes’ work.
read the source of persistent-template and work out a way of controlling the declaration of datatypes on a table-by-table basis.

So! Let’s dig into how it’s currently set up.

Our top level entry point is “share”.

share :: [[EntityDef] -> Q [Dec]]
      -> [EntityDef]
      -> Q [Dec]

In the first argument, we have a list of “builders”, in a sense - we’ll pass a list of EntityDefs to each of them, and each will create some declarations (Q [Dec]). In the normal case, we’ll pass ‘mkPersist sqlSettings’ and ‘mkMigrate “migrateAll”’.

In the second, we have a list of EntityDefs, typically (but not necessarily) coming from the ‘persistLowerCase’ quasiquoter.

Finally, our result type is a Q [Dec]. This is a little bit strange: ‘share’ is called at the top level, and it looks like a bare top-level expression, not a binding declaration. Digging through the docs (and https://stackoverflow.com/documentation/haskell/5216/template-haskell-quasiquotes#t=201705181813575250883) , it turns out that when we have a Q [Dec] at the top level, we can omit the standard $( … ) syntax that we’d usually use for introducing TH to normal Haskell code. Almost a little too convenient, but ok.

This tells us that we are going to have to make at least two changes. Our EntityDef will need to carry the information of whether a datatype needs to be created for the Entity or not. We can then monkey with mkPersist to look at that information and only generate a datatype if we have requested one. (If all goes well, mkMigrate ought to work unchanged.)

(For the moment, we’ll change this in-place: the question of whether this is to be a mergeable fork or an add-on can be dodged for now.)

So: first order of business is EntityDef. ag 'data EntityDef' shows us that it is defined in Database.Persist.Types.Base, meaning we’ll have to edit persistent as well as persistent-template. We’ll add an ‘isStandalone’ field to the EntityDef declaration that will default to False. When we do that, we try compiling to see what broke:

    /home/mark/projects/persistent/persistent/Database/Persist/Quasi.hs:299:5: error:
        • Couldn't match expected type ‘EntityDef’
                      with actual type ‘Bool -> EntityDef’
        • Probable cause: ‘EntityDef’ is applied to too few arguments
          In the second argument of ‘($)’, namely
            ‘EntityDef ...

which makes sense. Looking at mkEntityDef, we can define a dummy value for isStandalone, just to check everything else is hunky-dory, and it compiles. Great. (Parenthetically, notice that we are never too far from a type-correct system, even if it doesn’t quite do what we want yet. Get too far into the weeds and it is very difficult to extract yourself.)

Now we want to see where we can stash a declaration that a particular entity is standalone.

mkEntityDef :: PersistSettings
            -> Text -- ^ name
            -> [Attr] -- ^ entity attributes
            -> [Line] -- ^ indented lines

It’s not the PersistSettings, they’re global. You could cram it into the name of the model as some kind of godawful in-band signalling, but this path always leads to pain. Let’s keep looking. [Line] doesn’t seem right, they’re the individual fields of the table, and we’re trying to find where to specify a feature of the entity as a whole. By a process of elimination, it has to be [Attr]. Let’s go grepping again. (I should probably set tags up instead of dumb text search, but grep/ag/ack are simple and dependable.)

Oh, there’s a surprise - ‘ag “data Attr”’ and ‘ag “newtype Attr”’ yield nothing. Turns out it’s just a type synonym for Text. That’s unexpected - it’ll mean we have to take a bit of attribute namespace, but I suppose that was inevitable anyway. This means we can actually back out our earlier changes and just pass “standalone” as a textual attribute.

With that, we just need to monkey with mkEntity a little. ‘dataTypeDec’ looks promising: if we check for “standalone” in the attributes, we should be just about home free.

+    let inclDtd = if "standalone" `elem` (entityAttrs t)
+                  then []
+                  else [dtd]
     return $ addSyn $
-       dtd : mconcat fkc `mappend`
+       (mconcat (inclDtd:fkc)) `mappend`

Compiles, shipit!

We should probably check that this works the way we expect.

share [mkPersist sqlSettings] [persistLowerCase|
Bar standalone json
    name Text
|]

Gratifyingly and hopefully unsurprisingly, this fails in an obvious way:

/home/mark/projects/persistent/persistent-template/test/main.hs:62:1: error:
    Not in scope: type constructor or class ‘Bar’
    Perhaps you meant ‘Baz’ (line 53)

Let’s define it then!

data Bar = Bar { barName :: Text }

Clean compile, and all existing tests green. We haven’t actually tested this functionality yet, but inside the persistent lib is a difficult place to do it - we’ll add a more thorough test in our app, given that we had to write that code anyway.

Checking against my app, I realise that I want to store a Status (which fortuitously is named according to Persistent conventions - if it weren’t, I’d need to turn mpsPrefixFields off in sqlSettings.)

I define Status in the model quasiquoter, using all the same fields - we want to get back to it working, and then turn on standalone.

Now I get a bunch of errors about not having PersistField defined for a range of types. Excellent! These are easy to define with StandaloneDeriving (as Show is already defined by derivation)

derivePersistField "Entities"
deriving instance Read Entities
deriving instance Read HashTagEntity
deriving instance Read UserEntity
deriving instance Read x => Read (Entity x)
...

We continue doing this more or less mechanically until we get to this odd complaint:

    • No instance for (persistent-2.7.0:Database.Persist.Sql.Class.PersistFieldSql
                         TW.Status)

We don’t particularly want to persist a Status as a field: the whole point was to model it explicitly in a table. Why are we getting this? Ah! The Status type has a self-reference in it! quotedStatus is a Maybe Status!

It’s at this point I’m inclined to say “ok, this was a bad idea, a single isomorphism isn’t so bad”, but I want to finish this post, so let’s follow the rabbit down the hole and define derivePersistField "Status" too. Might as well be hung for a sheep as a lamb.

Now we run into an odd problem:

/home/mark/projects/owlstacks.com/Model.hs:23:1: error:
    Duplicate instance declarations:
      instance PersistFieldSql Status -- Defined at Model.hs:23:1
      instance PersistFieldSql Status -- Defined at Orphans.hs:68:1
   |
23 | share
   | ^^^^^...

/home/mark/projects/owlstacks.com/Model.hs:23:1: error:
    Duplicate instance declarations:
      instance PersistField Status -- Defined at Model.hs:23:1
      instance PersistField Status -- Defined at Orphans.hs:68:1
   |
23 | share

Now it’s starting to become a little clearer: we need the PersistField instance in the Orphans file, which means we also have to be able to disable instance generation in the quasiquoting code. One more time unto the breach, dear friends: we’re going back into TH.hs

inside mkPersist, we need to monkey with the defined instances

standalone t =  "standalone" `elem` (entityAttrs t)

and

filter (not . standalone)

inside mkPersist. Ok! Now we have “multiple declarations of StatusId”, and here we come to a bit of a grinding halt. Status is one type from twitter-types, StatusId is another. Persistent expects to be able to nab that piece of the namespace to describe the primary key of the table in Haskell, and it isn’t configurable.

I could write more code to get around this, but we have proceeded well past the point where this exercise could be expected to yield useful code, and I think I’m going to call it here. Persistent is a pretty opinionated framework: if you have significantly different needs, or want to serialise types that come from elsewhere to the database without introducing an intermediary type, it looks like a better idea to just use something else. Apologies for the abrupt end, it’s as much a surprise to me as anyone else.

haskell profiling without pain

2017-04-27T00:00:00+00:00

In general, stack is pretty good at caching build artifacts - the first build might be glacial, but incremental builds are snappy. This all goes out the window when you start playing with flags like –profile, though - stack sees that something has changed and dutifully rebuilds every-bloody-thing. This is a huge disincentive to casually run profiling builds.

thankfully, stack keeps all its bits and bobs inside ./.stack-work, so all you need to do is create a shadow directory like so:

mkdir foo-prof
cd foo-prof
lndir ../foo
cd foo-prof
rm -rf .stack-work
stack build --profile

(on ubuntu, you can get lndir from xutils-dev.)

now, you have a directory with a separate .stack-work, and can mess around in that directory for all your profiling needs without slowing everything else down to a shuddering halt.

(nb: this will not cope terribly gracefully with new files being added. a workaround is to only symlink the top level files and directory instead of a full shadow tree: that way, only top level changes will break things.)

(another possibility would be to set STACK_WORK=.stack-work-profiling whenever you run profiling commands, but you’re going to screw it up eventually - probably better to have totally separate implicit contexts.)

cabal flags for dev flow

2017-03-10T00:00:00+00:00

It’s good practice to make sure your code is warning-free before committing or releasing. In the throes of editing, though, it can be extremely annoying to have to fix a swathe of trivial warnings when you really just want to see if things basically make sense (especially if you’re fond of undefined-driven development, and are using a prelude which marks undefined as deprecated).

I used to just edit the cabal file to add and delete “-Werror” according to what mode I was in, but this was both annoying to do, and potentially incorrect: if you add -Werror in when your code is already built, cabal will not automatically rebuild local modules.

Thankfully, cabal flags do force a rebuild. This means you just need this at the top level in mycoolproject.cabal:

flag lib-Werror
  default: True
  manual: True

and this in your executable & library stanzas:

  if flag(lib-Werror)
    ghc-options: -Werror

Now, when you want to run cajun style, you can edit your command line to something like

stack build --flag mycoolproject:-lib-Werror

turning this into a convenient makefile action is left as an exercise to the reader - extending to hpack-style is similar.

(This tiny but pleasant tip is due to Michael Xavier.)

Haskell Testing Desiderata

2016-10-27T00:00:00+00:00

A few days ago I wrote about testing in Haskell. It’s not quite the whole story, though, and I wanted to dig into what I’d like to see.

Current solutions

I currently use Intero in Emacs, and switch between ghcid and stack build --file-watch --test --fast. None of them is quite perfect, as I’ll describe.

Desiderata

Interpreter-based

We are aiming for subsecond response times. In a big project (such as my current betterteam work project, we have about 21000 lines of Haskell, we just don’t have time to wait for the compiler to reload almost everything. Intero and ghcid work well here, stack build does not.

Responsivity

By responsivity, I mean that when anything that could affect the semantics of the code changes, the relevant tests are fired. This is really important, because you need to be able to trust that your test-runner is telling you the full, up-to-date truth. Without this, you need to keep the possibility in your mind that your tools have started to lag the truth of the situation.

Unfortunately, the ghcid solution I mentioned yesterday doesn’t handle this case quite perfectly: if I have a large project with sub-projects set up through stack and change something in a subproject, ghcid will not notice the problem. (It has improved recently, in that it watches stack.yaml and the appropriate cabal file, but widespread damage detection is a little outside its remit.)

Intero has a similar problem, and while I find it really useful, I frequently have to restart it so it will notice added packages in the stack.yaml and cabal file, or subproject changes.

In contrast, stack build --file-watch --test --fast, my other go-to, handles this flawlessly.

Parsimony

We shouldn’t do more work than is necessary. This also sounds pretty obvious, but the file-based nature of Haskell code militates against us here: if module A depends on module B, and we changed a line of code in module B that module A does not in fact use, it’s difficult to tell that not only does module A not need to be recompiled, but its tests do not need to be rerun. Barring the introduction of something like Unison’s addressing-by-hash-of-toplevel-value approach, we’re going to have to cop this slowdown, and just try to keep our files small & coherent enough that a minimum of pointless work needs to be done.

As far as just loading the code goes, the interpreter approaches do ok (unless you have significant Template Haskell, in which case heaven help you.) The compilation approach is usually pretty slow, at least if we keep the subsecond goal in mind.

I haven’t seen anything capable of only running the tests that have been affected. The closest is Hspec’s –rerun flag, which keeps a running tally of failing tests and only runs them until they all pass. This helps a lot, especially if you haven’t put the work in to make your tests fast (we have a fair number of integration tests that interact with the database, and currently it takes 180s to run everything).

(Quick plug: if you use the stack build --file-watch --test --fast approach, –rerun doesn’t work out of the box, as it relies on environment variables. My little hspec-stack-rerun project works around this: with any luck hspec’s upcoming support for config files will make it unnecessary soon.)

Fantasyland project

In the world I see when I shut my eyes, I have a process watching for file changes that can compute a parsimonious set of code to be reloaded and tests to be run. That would talk to a running ghci process (probably something similar to ghcid or intero), instructing it first which files to reload, and then which tests to run. It might even be aware of subprojects to the extent that it would only run failing tests in the lowest level subproject until it passes, to let you stay drilled in, then run whatever tests have been possibly compromised at higher levels until everything passes. I want my testrunner to be Tenzing Norgay, helping me through the rough bits and making up for my laziness, sloppiness and impatience.

Then I open my eyes, sigh, and hit M-x intero-restart.

Fast tests and static languages

2016-10-24T00:00:00+00:00

This is to correct a small error in David R. Maciver’s piece on static typing

Most of the piece is sensible - I just wanted to dispel an apparently widespread misunderstanding about build times.

There are absolutely statically typed languages where build times
are reasonable but this tends to be well correlated with them having
bad type systems. e.g. Go is obsessed with good build times, but Go is
also obsessed with having a type system straight out of the 70s which
fights against you at every step of the way. Java’s compile times are
sorta reasonable but the Java type system is also not particularly
powerful. Haskell, Scala or Rust all have interesting and powerful
type systems and horrible build times. There are counter-examples –
OCaml build times are reportedly pretty good – but by and large the
more advanced the type system the longer the build times.

Clearly David is setting up a tradeoff, but the tradeoff is not a real one. Yes, if you run your tests by compiling your code then running it, it can be slow: my robots-txt library takes about 8 seconds on my machine to build and run the 229 tests. However, every modern static language worth its salt has an interpreter, same as every dynamic language, and it is generally much faster to run your tests through that and skip all the time taken by codegen and linking (unless the tests themselves take a long time, which is something I can’t help you with.)

If i instead run ghcid, using the invocation from the makefile:

	ghcid --warnings --test=:main

it will watch my files, and run them whenever something changes. If I change the actual code, it takes 0.18s or so: just changing the test takes 0.04s. If I change robots.cabal or stack.yaml, it does take a few seconds to reload everything: fiddling with those is rare enough that I’m not too worried. This is if anything much faster than my experience of running similar systems under Ruby.

None of this is intended to be dumping on David. There’s no reason he’d know about what is very much common practice in the community: contrary to the common perception of Haskellers being unable to shut up about the language, I think we have a bad habit of only discussing the high-tech stuff (Free Monads! Phantom Types! Template Haskell!) and ignoring the normal practical things that make developing pleasant.

If this has been at all illuminating to you, please consider blogging about your praxis too. Maybe I’ll learn something.