<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://shimweasel.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://shimweasel.com/" rel="alternate" type="text/html" /><updated>2026-02-09T04:14:31+00:00</updated><id>http://shimweasel.com/feed.xml</id><title type="html">shimweasel.com</title><subtitle>blog</subtitle><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><entry><title type="html">Agent posts are moving to lambdamechanic.com</title><link href="http://shimweasel.com/2026/02/09/agent-posts-moving-to-lambdamechanic" rel="alternate" type="text/html" title="Agent posts are moving to lambdamechanic.com" /><published>2026-02-09T00:00:00+00:00</published><updated>2026-02-09T00:00:00+00:00</updated><id>http://shimweasel.com/2026/02/09/agent-posts-moving-to-lambdamechanic</id><content type="html" xml:base="http://shimweasel.com/2026/02/09/agent-posts-moving-to-lambdamechanic"><![CDATA[<p>I’m going to be moving my technical blogging about agents to <a href="https://lambdamechanic.com/blog">lambdamechanic.com/blog</a>.</p>

<p>If you want to follow along, there’s an RSS/Atom feed here:</p>

<p><a href="https://lambdamechanic.com/feed.xml">https://lambdamechanic.com/feed.xml</a></p>

<p>I’ll be writing posts on agentic techniques, and the first one is up now: <strong>Tooltest</strong>, my new MCP fuzzing tool:</p>

<p><a href="https://lambdamechanic.com/blog/tooltest-a-fuzz-checker-for-your-mcps/">Tooltest: a fuzz-checker for your MCPs</a></p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="agents" /><category term="llm" /><category term="mcp" /><summary type="html"><![CDATA[I’m going to be moving my technical blogging about agents to lambdamechanic.com/blog.]]></summary></entry><entry><title type="html">Against Vibecoding (Sorta)</title><link href="http://shimweasel.com/2025/03/24/against-vibecoding" rel="alternate" type="text/html" title="Against Vibecoding (Sorta)" /><published>2025-03-24T00:00:00+00:00</published><updated>2025-03-24T00:00:00+00:00</updated><id>http://shimweasel.com/2025/03/24/against-vibecoding</id><content type="html" xml:base="http://shimweasel.com/2025/03/24/against-vibecoding"><![CDATA[<p>I’ve been writing code professionally for something like 25 years, which I think classifies me as a greybeard if not an outright grognard at this point. As such, it’s almost expected that I have a grumpy rant about Kids These Days and their Gee Pee Tees and Claudes. And I do, sorta.</p>

<p>The thing is, a hackable framework for using LLMs for development is easily the most transformative change for me since I discovered real type systems, and may turn out to be even better. I am, as they say, forced to stan.</p>

<h2 id="so-whats-wrong-with-vibecoding-anyway">So what’s wrong with vibecoding, anyway?</h2>

<p>The problem with a capable tool is that it lets you make bigger messes. Once you vibecode yourself up an app that’s bigger than the model can handle, you will hit the same wall that every dev hits eventually: that you don’t understand what’s going on, and you’re afraid of making changes because you might break it. You’ll just hit it way later, at a point where the app is beyond the model, by definition, and certainly beyond its putative author, who has been lounging around eating grapes and barking occasional complaints.</p>

<h2 id="whats-it-good-for">What’s it good for?</h2>

<p>Prototyping. Personal apps. Go nuts, the upside you get from these things is almost all gravy and the downside is minimal. If a personal app goes down there’s exactly one person who cares. Vibecoding is an absolutely wonderful tool for exploring a problem space, but you really do have to throw the prototype away afterwards.</p>

<h2 id="whats-the-alternative">What’s the alternative?</h2>

<p>This will sound terribly boring and square, but it’s using LLMs and agent frameworks alongside standard software development practices.</p>

<h3 id="types">Types!</h3>

<p>I expect typed languages to become even more popular with the advent of LLMs. The tradeoff has always been about how hard it can be to get a type-correct program together, but having a strict judge to check the output of a tireless LLM is all carrot, no stick. Whatever environment you’re in, you can probably add some basic types: Python has type hints, JavaScript has throwing your app out and rewriting it in TypeScript. I’m a fan of more advanced type systems like Haskell’s but there’s a lot of bang-for-buck in the basic stuff.</p>
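<p>To make that concrete, here’s a minimal sketch of what even the basic stuff buys you in Python (the function and names here are made up purely for illustration): with type hints in place, a checker like mypy can flag callers that forget the <code>None</code> case before an LLM-generated change ever runs.</p>

```python
from typing import Optional

def parse_port(value: str) -> Optional[int]:
    """Parse a TCP port, returning None for anything non-numeric or out of range."""
    try:
        port = int(value)
    except ValueError:
        return None
    return port if 0 < port < 65536 else None

print(parse_port("8080"))    # 8080
print(parse_port("banana"))  # None
```

<p>A caller that treats the result as a bare <code>int</code> gets caught by the checker, not by a runtime crash three functions later.</p>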

<h3 id="tests">Tests!</h3>

<p>Test-first development is one of those orthodox practices more honoured in the breach than the observance, but it doesn’t have to be that way. It’s easier than ever to do: just ask the agent to write the tests and a stub, and see them fail before you ask it to write the code that makes them pass.</p>

<p>The great thing about having tests with descriptive error messages is that the agent can often solve the problem entirely by itself. The payoff isn’t some wispy idea of software virtue or eventual speed: the test you write now will help you in two minutes’ time, and forever after.</p>

<p>In that vein, let me heavily recommend generative testing, a la <a href="https://hypothesis.works/">Hypothesis</a>. I use (and ported) <a href="https://github.com/lambdamechanic/miniTSis">MiniTSis</a>, but only because I was working in TypeScript at the time; if you’re using Python, just use the progenitor. (Quick precis for the uninitiated: if you can state some properties and provide a generator for inputs, a generative testing tool can not only try thousands or millions of test cases against your code, it can also reduce a failing case to something you (or an LLM) might actually get some insight out of reading. Avoid QuickCheck-derived tools if you can: having an integrated shrinker and a test database, as Hypothesis does, is a massive quality-of-life improvement.)</p>
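<p>The generate-then-shrink idea is simple enough to sketch by hand. This is a toy illustration, not Hypothesis’s actual algorithm, with a deliberately buggy <code>my_max</code> and all names invented for the example:</p>

```python
import random

def my_max(xs):
    # Deliberately buggy "max": returns the first element.
    return xs[0]

def prop_holds(xs):
    # Property under test: my_max agrees with the real max.
    return not xs or my_max(xs) == max(xs)

def shrink(xs):
    # Greedy shrinker: repeatedly try smaller variants that still fail.
    def candidates(ys):
        for i in range(len(ys)):              # try dropping one element
            yield ys[:i] + ys[i + 1:]
        for i, v in enumerate(ys):            # try nudging one value toward zero
            for w in (0, v // 2):
                if w != v:
                    yield ys[:i] + [w] + ys[i + 1:]

    changed = True
    while changed:
        changed = False
        for cand in candidates(xs):
            if not prop_holds(cand):          # still a counterexample: keep it
                xs, changed = cand, True
                break
    return xs

random.seed(0)
shrunk = None
for _ in range(1000):
    xs = [random.randint(-100, 100) for _ in range(random.randint(0, 10))]
    if not prop_holds(xs):
        shrunk = shrink(xs)
        break
print("shrunk counterexample:", shrunk)
```

<p>Instead of a noisy ten-element list, you end up staring at a two-element counterexample. A real tool like Hypothesis does this over a byte-stream representation and adds a test database, so a shrunk failure is replayed on every subsequent run until it passes.</p>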

<h3 id="packaging">Packaging!</h3>

<p>Whatever environment you’re in, set your project up as a proper packaged project from the start. You’re going to add package scripts and dependencies and tests as you go along, so you might as well have them in place rather than moving them around later. (Aider-driven development tends to have trouble moving files around - there’s no primitive for “rename this file”; all the LLM can do is SEARCH/REPLACE the content into a new file and SEARCH/REPLACE the old one with the empty string. Start files in the right place if you can.)</p>
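<p>In Python, for instance, that can be as little as a minimal <code>pyproject.toml</code> at the project root (the names here are placeholders, not a prescription):</p>

```toml
[project]
name = "myapp"                 # placeholder name
version = "0.1.0"
dependencies = []              # add runtime deps here as you go

[project.optional-dependencies]
dev = ["pytest"]               # so `pip install -e .[dev]` sets up testing

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"
```

<p>Five minutes of this up front means the agent has an obvious place to register every new dependency and test it adds.</p>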

<h3 id="for-the-love-of-god-source-control">For the love of God, source control.</h3>

<p>Don’t lose your code to a bad prompt. The tool I’m about to discuss will by default make each edit into a commit. If you’re collaborating with others you may want to squash those commits but that’s a detail.</p>

<h2 id="how-i-use-llms-for-coding">How I use LLMs for coding</h2>

<p>The nitty gritty! My favourite tool at the moment is <a href="https://aider.chat">Aider</a>. It’s not as flashy as Cursor, but I’ve found a terminal-based tool far more hackable &amp; usable, and it is completely open: you can use whatever model you want and the source is there if you want deep customisations.</p>

<h3 id="use-architect-mode">Use /architect mode</h3>

<p>The only* model that seems any good at editing using Aider’s SEARCH/REPLACE model is Sonnet 3.5, but some of the other models (Deepseek R1, grok-beta, o3-mini) are a bit smarter at actual code. If you’re writing simple code then probably Sonnet 3.5 by itself is fine. You can do this by running <code class="language-plaintext highlighter-rouge">aider --architect --model openrouter/x-ai/grok-beta --editor-model openrouter/anthropic/claude-3.5-sonnet</code>, modify as appropriate.</p>

<ul>
  <li>edit: several hours after I published this, deepseek v3 0324 came out, and it is about as good as Sonnet at editing, and much cheaper. Life moves pretty fast.</li>
</ul>

<h3 id="use-openrouterai">Use openrouter.ai</h3>

<p>You might have noticed that in the previous tip, my models have a suspicious prefix. Everyone and their dog has a model and a billing department: save yourself a little hassle and use <a href="https://openrouter.ai">OpenRouter</a>. It’s the same price, and for certain models there are multiple providers: you can choose on the OpenRouter site whether you care more about latency or cost.</p>

<p>Getting a single bill rather than four and being able to try models as soon as they come out is also great.</p>

<h3 id="keep-your-context-slim">Keep your context slim</h3>

<p>Aider will only send the files you add, though it will occasionally use repository context to ask for more files to be added. You want to keep the chat context and the files in scope as minimal as you can: this keeps responses snappy and bills down.</p>

<p>In aid of that, I’ll frequently run <code class="language-plaintext highlighter-rouge">/reset</code>, <code class="language-plaintext highlighter-rouge">/run exe command_to_run_my_tests</code>, and <code class="language-plaintext highlighter-rouge">/load .LOADCOMMANDS</code>. This drops the context and calls a little shell script called <code class="language-plaintext highlighter-rouge">exe</code> you can grab <a href="https://gist.github.com/mwotton/bbd198f8ed76c231e0fb0eb644d07fb2">here</a> - all it does is look for editable files in the output of your command, and plunk them into <code class="language-plaintext highlighter-rouge">.LOADCOMMANDS</code> in the root folder of your project. <code class="language-plaintext highlighter-rouge">/load .LOADCOMMANDS</code> will then load any mentioned files into your context (so make sure your tests &amp; compile outputs are explicit about what happened and where!)</p>

<p>This workflow is a little janky, and I think eventually if Aider doesn’t offer some internal scripting, I’ll switch to writing my own REPL using <code class="language-plaintext highlighter-rouge">aider --message</code> as the engine. It works for now, though.</p>

<h3 id="dont-spam-whats-wrong-fix-too-often">Don’t spam “What’s wrong? Fix” <em>too</em> often</h3>

<p>When your test run ends with a failure, Aider will frequently fill out “What’s wrong? Fix” as a sample response.</p>

<p>It can be pretty tempting to just sit and pull the handle over and over, hoping you get a clean compile and test run. It even works often enough to be attractive! A big part of using these tools effectively, however, is noticing when they’re stuck and jumping in to give some context. Often it’s as easy as running the appropriate <code class="language-plaintext highlighter-rouge">/web</code> command, to give it access to a particular page in the documentation.</p>

<h3 id="use-undo-like-its-going-out-of-fashion">Use /undo like it’s going out of fashion</h3>

<p>Frequently, you’ll see the model merrily going down the wrong path. Don’t try to repair the damage, just run <code class="language-plaintext highlighter-rouge">/undo</code> and provide a better prompt: the first rule of finding yourself in a hole is to stop digging.</p>

<h3 id="keep-an-eye-on-the-leaderboards--try-out-models">Keep an eye on the leaderboards &amp; try out models</h3>

<p>The <a href="https://aider.chat/docs/leaderboards/">Aider</a> leaderboards are a really helpful tool for model selection. You probably won’t change your models daily but they do have strengths and weaknesses. For simple code, Sonnet 3.5 is absolutely rock-solid. It’s still good at the more complex stuff but I sometimes find grok-beta and R1 better. It can be quite personal, because it’s going to respond to your prompting style, so play around.</p>

<h3 id="voice-maybe">Voice, maybe?</h3>

<p>I haven’t used this as heavily as the rest of it, but voice coding is looking very promising. It doesn’t really matter any more that you can’t put in every weird character under the sun if you can just get the gist across to the model. This one’s an area of active experimentation for me; play with <code class="language-plaintext highlighter-rouge">/voice</code> if you’d like to try it yourself.</p>

<h2 id="so-whats-your-point">So what’s your point?</h2>

<p>I don’t actually expect the juniors to read this, and they don’t need to anyway. There are a million articles and books on how to develop software systematically, and nobody reads them until they have a project blow up: Icarus has to lose his wings before he takes glue composition seriously.</p>

<p>I’m talking to my fellow grognards: this is not something you can safely ignore, like scrum or crypto. Coding agents are good today, and will only get better. The good news is that while some of your skills are obsolete, many are not: it’s still critical to notice when it’s going off track and to have the depth in the field to nudge it into a more promising path. All the software development principles and techniques you know like testing, modularity, interfaces and types still pay huge dividends, and multiplicatively so. Install aider, get an openrouter account, and go to town on a hobby project at the very least. Apart from anything else, it’s just a huge amount of fun.</p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="llm" /><summary type="html"><![CDATA[I’ve been writing code professionally for something like 25 years, which I think classifies me as a greybeard if not an outright grognard at this point. As such, it’s almost expected that I have a grumpy rant about Kids These Days and their Gee Pee Tees and Claudes. And I do, sorta.]]></summary></entry><entry><title type="html">MiniTSis 6.0.1 release</title><link href="http://shimweasel.com/tech/2024/07/03/minitsis-601-release" rel="alternate" type="text/html" title="MiniTSis 6.0.1 release" /><published>2024-07-03T00:00:00+00:00</published><updated>2024-07-03T00:00:00+00:00</updated><id>http://shimweasel.com/tech/2024/07/03/minitsis-601-release</id><content type="html" xml:base="http://shimweasel.com/tech/2024/07/03/minitsis-601-release"><![CDATA[<h1 id="minitsis-601-release">MiniTSis 6.0.1 release</h1>

<p>I’m happy to announce the first public release of <a href="https://www.npmjs.com/package/minitsis">MiniTSis</a>, a property testing library for TypeScript with integrated shrinking and a test database.</p>

<h2 id="what-is-minitsis">What is MiniTSis?</h2>

<p>I recently worked on a large, complex client project and had to replicate some functionality in a different environment. Given that I had to match the existing implementation, a property test suite with an oracle seemed the right approach, so I built the first version using <a href="https://fast-check.dev/">fast-check</a>. Unfortunately, while it did work for finding bugs, the shrinking in fast-check is not integrated, which means you end up in the situation of knowing you have a problem in your implementation, but not being able to do anything about it without reading hundreds of lines of JSON.</p>

<p>After complaining vociferously, <a href="https://www.drmaciver.com">David</a> gently suggested that I stop whinging and just port <a href="https://github.com/DRMacIver/minithesis">Minithesis</a>. This is the result of that nudge, and I don’t think I’d have been able to finish the client project without it. Getting a fully-shrunk failing test case makes tracking down bugs far easier, and having a test database means that once MiniTSis has finished shrinking a failing test case, it will be automatically saved and tested until it passes again.</p>

<p>The shrinking algorithm is bitstream-based, which means that it can’t directly take advantage of the structure of generators. I may implement boundary tracking at some point: this would be for performance rather than correctness, though.</p>

<p>Bug reports and PRs at the <a href="https://github.com/lambdamechanic/miniTSis">GitHub repo</a></p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="tech" /><category term="Typescript" /><category term="MiniTSis" /><category term="web" /><summary type="html"><![CDATA[first public release of MiniTSis]]></summary></entry><entry><title type="html">You need a novelty budget</title><link href="http://shimweasel.com/2018/08/25/novelty-budgets" rel="alternate" type="text/html" title="You need a novelty budget" /><published>2018-08-25T00:00:00+00:00</published><updated>2018-08-25T00:00:00+00:00</updated><id>http://shimweasel.com/2018/08/25/novelty-budgets</id><content type="html" xml:base="http://shimweasel.com/2018/08/25/novelty-budgets"><![CDATA[<p>We measure a lot of things in software engineering these days. Test
coverage, time-to-deploy, bugs per line - they’re all good things to
keep an eye on. They’re all proxies for risk of failure, either from
moving too slowly, or from bugs and downtime destroying the business.</p>

<p>Something that’s not often explicitly controlled, however, is
<em>Novelty</em>. One of the dirty secrets of programming is that almost
every production codebase contains some dependency that the developers
have never used before. Perhaps they’ve written a trivial project to
play with it, but mostly they’re relying on community feedback, or if
you want to be dismissive, fashion. Civil engineers have solid rules
for the way a bridge ought to be constructed: we switch web
application frameworks on an annual basis.</p>

<h3 id="why-are-we-indulging-in-so-much-novelty-anyway">Why are we indulging in so much novelty anyway?</h3>

<p>There are reasons for this churn. The widespread availability of
libraries and frameworks is one of the reasons creating a startup is
cheaper and easier than ever before. Your app sits on top of a vast
pyramid of code you’ve never seen, and if you want to keep up, you
can’t just ignore better options when they come along: your company
will suffer, and so will your career (unless you enjoy maintaining
pre-millennial line-of-business COBOL apps).</p>

<h3 id="avoiding-the-scylla-of-endless-novelty-and-the-charybdis-of-stasis">Avoiding the Scylla of endless novelty and the Charybdis of stasis</h3>

<p>My technique is to use something called a “novelty budget”.
It’s a rough, informal thing: in essence, you decide how much
tolerance you have for library code you’ve never deployed on a new
project before, and apportion it out in the ways that you think will
give you the biggest bang-for-buck. This means that you accept that
you’ll pay a cost in missing documentation, unpolished tooling and
ungooglable errors for that part of the project, and you make up for
it by using the most solid, boring, dependable pieces you can for the
rest of it. The name of the game is reducing risk, not optimising for
the best possible case you can imagine.</p>

<p>The essence of this approach is that you are trying to balance average
and worst case time-to-completion.  Any novel component is going to
increase your worst case time: the hope is that it improves the
average, both now and over the evolution of your app in the coming
months and years.</p>

<h3 id="the-default-tech-stack">The default tech stack</h3>

<p>Let’s assume you’ve decided where you’re spending the biggest chunk of
your budget and put that piece aside: I can’t comment on why that’s a
good expenditure for you. Whatever it is, every other piece of the project
has to tighten its belt to compensate. (Obviously, if one of these is
your primary expenditure, feel free to ignore me - this is just a set
of defaults.)</p>

<h4 id="suggestions">Suggestions</h4>

<ul>
  <li>
    <p>choose something boring like Ubuntu or Debian, rather than a fancy
hardened BSD variant or exotic unikernel.</p>
  </li>
  <li>
    <p>Don’t adopt a graph database or fancy distributed database until
you’ve satisfied yourself that PostgreSQL or SQLite are really not
sufficient. (My defunct startup MeanPath scraped 160 million
websites a day using SQLite. The limits are probably higher than you
think.)</p>
  </li>
  <li>
    <p>Try a serverside-rendered site before adopting a SPA framework - you
might find that a limited amount of scripting is all you need.</p>
  </li>
  <li>
    <p>Devops automation is a real timesaver, but don’t feel that you have
to go full Docker/Kubernetes right off the bat.  While you
definitely need a process to build a deployable artifact, it’s
entirely possible that artifact is something more akin to a git repo
that gets pushed to a Heroku endpoint than full-on enterprise-grade
automation.</p>
  </li>
</ul>

<h3 id="process">Process</h3>

<p>When I can, I run my development processes on GitHub via pull reviews,
issues, and the rest of that stack: possibly there is a better
solution for parts of it, but the more you can stay in the boring
zone, the lower your chances of a catastrophic failure. Writing
your own process management software is unlikely to be a good idea
unless that’s your company’s reason to exist.</p>

<h3 id="startups-are-different-though-right">Startups are different though, right?</h3>

<p>There’s a seductive argument that since the chances of any given
startup’s success are small anyway, you might as well crank all the
knobs to 11 just to increase the variance. I think this is a bad idea:
contrary to the founder legend, most startups are not solving
difficult technical problems. The job of the founding engineers is to
get something out there that works well enough, and to do so quickly
enough to test the hypothesis on which the startup was founded. Once
you get traction, you might hit scaling problems: at that point,
you’ll have either revenue or funding to fix them.</p>

<h3 id="haskell-case-study">Haskell case study</h3>

<p>If you don’t care about Haskell, skip this section.</p>

<p>Concretely, my biggest novelty budget expenditure for a project is
frequently Haskell. Haskell has some obvious benefits for me: I know
it well, and I can bake out a lot of potential flaws just by leaning
hard on the type system. It has a cost, though: IDE tooling is not as
seamless as something like Java or Ruby, sometimes libraries are
missing, and sometimes you hit baffling type errors (especially with
more advanced libraries).</p>

<p>The principle of a novelty budget applies within Haskell too, both at
a language level and in your choice of libraries.</p>

<h4 id="language">Language</h4>

<p>Haskell gives you a lot of rope to hang yourself, if you’re so
inclined: there’s a whole zoo of extensions, and the emphasis in
community blog posts on sexy types means that you can easily get the
impression that if your app doesn’t have a type-indexed generic free
monad at its core, you might as well be writing Perl. It just isn’t
true. Sum types, parametric polymorphism and typeclasses already put
the language way ahead of almost everything out there, and you can
build very solid apps without using any extensions. Treat it as an a
la carte menu, not a buffet. (I don’t include things like
OverloadedStrings or LambdaCase here - they are minor syntactic
conveniences that don’t add significantly to cognitive load.)</p>

<h4 id="libraries">Libraries</h4>

<p>I tend to use the Yesod/Persistent+Esqueleto stack. It has problems
but almost any reasonable thing someone might do with a website has
been attempted in Yesod, and there’s usually a solution. There are
fascinating database experiments out there with Ferry and Opaleye, but
they don’t have the volume of use. Similarly, Servant is a brilliant
piece of work, but in my test projects, I inevitably hit some small
but critical thing that will not get fixed for days, or weeks, or
months, and I can’t afford that time on a commercial project. (Yes,
one option is to fix it yourself or sponsor the author to do it: for
whatever reason, I’ve found that it’s usually far harder to get
payments authorised than it is to spend the equivalent amount of dev
time. That’s a company management problem I have no idea how to
solve.)</p>

<p>Another approach I’ve seen people have success with is to use WAI
directly or very minimal frameworks like Scotty. While they don’t have
all the bells and whistles of Yesod, you are very unlikely to end up
hitting a hard stop.</p>

<h3 id="conclusion">Conclusion</h3>

<p>I hope I’ve convinced you that this is at least a concept worth
thinking about. I should mention that I’ve mostly worked at smallish
startups, and it’s entirely possible it doesn’t generalise to
AmaGooBookSoft - would love to hear from anyone in that environment.</p>

<p>Feel free to tell <a href="https://twitter.com/mwotton">me</a> I’m completely off base.</p>

<p>PS. I’m not claiming to have invented this concept - it’s been a term of
art in my development conversations for years. I was frankly surprised
to find nothing about it written down and would welcome hearing about
anything I’ve missed.</p>

<h4 id="references-and-apologies">References and apologies</h4>

<ul>
  <li><a href="https://mcfunley.com/choose-boring-technology">Similar</a></li>
  <li>A <a href="https://mattjolson.github.io/2016/12/04/new-toy-syndrome.html">contrary view</a> from the enterprise</li>
</ul>

<p>Thanks to <a href="https://twitter.com/carnivorous8008">Matt Olson</a>, <a href="https://twitter.com/DRMacIver">David Maciver</a> and
Alec Heller for thoughtful comments on a draft.</p>

<p>Thanks also to my other reviewers who I currently can’t find because
Twitter search is terrible, and apologies that this took almost a year
to actually find time to polish and publish.</p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="software" /><summary type="html"><![CDATA[We measure a lot of things in software engineering these days. Test coverage, time-to-deploy, bugs per line - they’re all good things to keep an eye on. They’re all proxies for risk of failure, either from moving too slowly, or from bugs and downtime destroying the business.]]></summary></entry><entry><title type="html">The ThinkPad: an elegant weapon</title><link href="http://shimweasel.com/2018/08/19/the-thinkpad-an-elegant-weapon" rel="alternate" type="text/html" title="The ThinkPad: an elegant weapon" /><published>2018-08-19T00:00:00+00:00</published><updated>2018-08-19T00:00:00+00:00</updated><id>http://shimweasel.com/2018/08/19/the-thinkpad-an-elegant-weapon</id><content type="html" xml:base="http://shimweasel.com/2018/08/19/the-thinkpad-an-elegant-weapon"><![CDATA[<p>I like ThinkPads.</p>

<p>They’ve never been cool. Ownership marked you as one of two equally
unhip tribes - either a drone in a suit, bashing out spreadsheets and
TPS reports kilometres up while some entitled bastard makes hamburger
mince out of your knees, or a 400lb hacker well-actuallying usenet
threads and prefixing “GNU/” on every project in sight like a crazed
graffiti artist.  Nobody’s ever headlined Lollapalooza on a ThinkPad.</p>

<p>And yet. If laptops were violins, the ThinkPad is what would happen if
Stradivarius decided to make his next project bulletproof for
funsies: Linux-friendly at a time when debugging X11 configs
and sound drivers was a mandatory rite of passage, rugged enough not
to be precious about, and sporting a tall screen for extra code
space. A ThinkPad was a reliable comrade in the war on complexity.</p>

<p>And the keyboards! It’s as if they’d actually talked to users who
typed for a living - hardware buttons for sound control and
brightness, a proper function key row, and a bouncy, kinetic key
action that let you know the machine was listening.</p>

<p>The last usable keyboard on a laptop was on the T420, in 2011. Apple lured every
other manufacturer into a doomed arms race of making laptops thinner
and lighter, far past any possible utility. I fully expect to see a
monomolecular laptop running MacOS X in the near future, and its
keyboard will fail every time Schrödinger’s cat meows.</p>

<p>I don’t mean to discount other brands. My tankish HP, my cute little
ASUS netbook, and even my MacBook Pro were instruments of
creation. They had keyboards with satisfying, chunky actions, and they
invited you to do something cool, not just passively imbibe the work
of others.</p>

<p>But somehow, despite the proliferation of tablets, phablets, and
voice-activated TVs, the market has decided that the one portable
device which was useful for producing things is now primarily for
Netflix and Youtube. Who cares if the keyboard causes you to mistype a
few words in the comments section? Nobody’s going to be able to
distinguish it from the background roar of illiteracy.</p>

<p>The ThinkPad fought the good fight for a long time, longer than almost
every other, but even Lenovo folded in the end. The 25th anniversary
edition is almost mockery, planting a reasonable keyboard on an
otherwise painfully mediocre machine.</p>

<p>I haven’t given up, though. While I write this on a sticker-covered
T410 and my travel T420s sits at home charging, my mailbox awaits a
T70. I paid $1500 five months ago for this machine: an unofficial,
homebrew chimera, modern innards transplanted into a T60 chassis by
the incommunicative wizards of LCDFans, a fever dream of a laptop that
might never arrive. This is how deep the yearning goes.</p>

<p>(Thanks to the stubbornly unwebby Alec Heller for edits.)</p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><summary type="html"><![CDATA[I like ThinkPads.]]></summary></entry><entry><title type="html">notes on a better migration system for Persistent</title><link href="http://shimweasel.com/programming/2018/03/18/notes-on-a-better-migration-system-for-persistent" rel="alternate" type="text/html" title="notes on a better migration system for Persistent" /><published>2018-03-18T00:00:00+00:00</published><updated>2018-03-18T00:00:00+00:00</updated><id>http://shimweasel.com/programming/2018/03/18/notes-on-a-better-migration-system-for-persistent</id><content type="html" xml:base="http://shimweasel.com/programming/2018/03/18/notes-on-a-better-migration-system-for-persistent"><![CDATA[<h2 id="problems-with-persistent">Problems with Persistent</h2>

<p>I use Persistent for a lot of my Haskell database work. It has some
great qualities - I really appreciate the type-based linkage between
the database types and my Haskell types, and while its query language
is anemic, Esqueleto does a good job of modelling most of SQL (or at
least enough that I can get useful work done without constant
frustration.)</p>

<p>The biggest problem I hit is with the migration system. It’s computed
by comparing the current structure of tables with what the Haskell
model thinks they should be, and has some limited support for
detecting what changes need to be made: however, it doesn’t allow you
to consider two versions of the same database in one program, which
makes using Haskell functions to populate new or changed fields &amp;
tables impossible.</p>

<p>In previous codebases I’ve ended up serialising the SQL at development
time and creating ActiveRecord migrations (hat-tip to Chris Allen for
that trick), allowing me to at least use SQL to populate new columns,
but I frequently ended up needing to write functionality again in SQL
just for the migration.</p>

<p>(It’s also possible that you can solve these problems by not using
Persistent, but so far every other database access library I’ve tried
either takes a strings-in/strings-out approach which leaves me worried
that the code and the database have fallen out of sync, or have used a
plethora of fancy types that I find hard to work with.)</p>

<h2 id="desiderata-for-a-migration-library">Desiderata for a migration library</h2>

<ol>
  <li>
    <p>Should work with Persistent’s datatypes and migrations. It needs to
know what DDL changes Persistent would try to make. This
doesn’t mean it has to do the same thing (and often won’t - can’t
do triggers, constraints, etc): it just means that when complete,
Persistent should agree that there are no migrations to be run.</p>
  </li>
  <li>
    <p>It should provide a predicate to that effect, which should be easy
to run in a test suite as a sanity check, as well as on startup.</p>
  </li>
  <li>
    <p>Migrations should be easy to run on startup, in the context of
Persistent. This one is probably controversial, but fits my use
case. If the migration fails, we should have a way of signalling to
the environment that we are a failed deploy, and that the migration
should be aborted and the last version of the code substituted back
in. (Keter has health checks, though not terribly exhaustive ones -
I <em>think</em> it just checks that the given HTTP port is open, so
killing that listener might be enough.)</p>

    <p>We also need to make sure that if multiple web backends try to run
the same migration concurrently that only one makes it through, and
that the others don’t interpret their inability to run the
already-run migration as evidence that they’re broken.</p>
  </li>
  <li>
    <p>DDL migrations should be in plain SQL. Data-changing migrations can
be in Haskell or SQL: this allows us to back-populate columns using
Haskell functions with all the context we’re used to. Because we
aren’t trying to talk about different structural states of the
database at the level of types, this will usually require a two-step
migration process: add the fields in a nullable way and populate
them using Haskell functions, then in the second step, make them
non-nullable or foreign-keyed or whatever other database-level
constraint needs to apply.</p>

    <p>This also implies that you need nested transactions: if the first
step succeeds, but the populated fields do not satisfy the
constraints you expected them to, the database needs to be rolled
back to before the migration started. (There can actually be a
third step, where you update your Haskell types to reflect the new
constraints: this might be as simple as switching <code class="language-plaintext highlighter-rouge">Maybe a</code> to
<code class="language-plaintext highlighter-rouge">a</code> - this can fail too, if your SQL migration wasn’t correct, but
it has no database effects and can be rolled back without having to
worry about the database.)</p>
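<p>As a sketch only (a toy model in plain Haskell, not any real library
API - a Map stands in for the database), the two-step process and its
all-or-nothing rollback might look like:</p>

```haskell
import qualified Data.Map.Strict as M

-- Toy "table": rows keyed by id, with an existing field and a new
-- field that starts out nullable (Maybe).
type Db = M.Map Int (String, Maybe String)

data Step
  = Populate (String -> String) -- step 1: fill the new column from Haskell
  | MakeNonNullable             -- step 2: apply the constraint; may fail

runStep :: Db -> Step -> Either String Db
runStep db (Populate f) =
  Right (M.map (\(old, _) -> (old, Just (f old))) db)
runStep db MakeNonNullable
  | all ((/= Nothing) . snd) (M.elems db) = Right db
  | otherwise = Left "new column still has NULLs"

-- Run the steps as one transaction: on any Left, the caller keeps the
-- original Db, i.e. the whole migration rolls back.
migrate :: [Step] -> Db -> Either String Db
migrate steps db0 = foldl (\acc s -> acc >>= flip runStep s) (Right db0) steps
```

<p>Applying the constraint without the populate step fails, while
running both steps in order succeeds - which is exactly the
nested-transaction behaviour described above.</p>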
  </li>
</ol>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="programming" /><category term="haskell databases" /><summary type="html"><![CDATA[Problems with Persistent]]></summary></entry><entry><title type="html">Resolutions, Plans and Systems</title><link href="http://shimweasel.com/2017/12/31/resolutions-plans-and-systems" rel="alternate" type="text/html" title="Resolutions, Plans and Systems" /><published>2017-12-31T00:00:00+00:00</published><updated>2017-12-31T00:00:00+00:00</updated><id>http://shimweasel.com/2017/12/31/resolutions-plans-and-systems</id><content type="html" xml:base="http://shimweasel.com/2017/12/31/resolutions-plans-and-systems"><![CDATA[<p>Traditionally, the new year is when we all make public proclamations
of how we’re going to be better humans: this year, we will read more,
run more, listen more, eat healthier, and generally be our best selves.</p>

<p>It’s even more traditional to abandon these resolutions about
mid-January, so last year, I decided to try something a bit different.
First, I was going to try a new thing each month and stick with it for
the month. If it worked, I’d continue it. Second, I wasn’t going to
talk about it, for the reasons alluded to
<a href="https://sivers.org/zipit">here</a> <a href="#proviso">*</a>.</p>

<h2 id="results">Results</h2>

<p>So, what happened last year? What worked, what didn’t?</p>

<h3 id="january-lifting-">January: Lifting. ✘</h3>

<p>I wanted to get back to the steady cadence of lifting weights I had in
Vietnam, with a concrete goal of getting back to 350lb squats.</p>

<p>Illness, sleep deprivation and generally being responsible for Arthur
in the mornings foiled this, and my current 3x5 squat sets are about
285 lb. He’s now pretty happy to play in the pen down in my
basement gym, so this one might get another go this year.</p>

<h3 id="february-keto-">February: Keto. ✔</h3>

<p>My February resolution was to lose my gut and to keep it off in a
sustainable way. My weight has varied enormously over the last twenty
years - from 96kg (after finishing a 1000km bicycle ride through
Thailand, Cambodia and Vietnam) to 145kg (in Vietnam, where I was
lifting heavy weights regularly but also drinking beer and eating
delicious food perhaps a little more regularly than necessary). In
2016, I had a son, who turned out to be both enormous and possessed of
approximately three Tasmanian devils’ worth of energy: the idea of
being a puffing Fat Dad to such a whirlwind was too horrible to be
contemplated.</p>

<p>After a bit of research, I ended up deciding my experiment was going
to be with the ketogenic diet. I’ve done paleo before and had some
moderate success, but it always buckled under work or life stress.
Keto had some very appealing properties: unlimited bacon and butter,
for one thing, but also, clearcut boundaries. For some perverse
reason, it’s easier for me to regard certain foods as not-food than
it is to consume them in a moderate, rationed fashion.</p>

<p>At the beginning of the year, I was 135kg. The software I was using
to track it is sadly gone with an old phone, along with the data, but
I had lost about 5kg in the first month, and I felt a lot better, so
I adopted it as a regular habit. Today I’m orbiting around 120, give
or take a kilo, and am more or less comfortable there - I don’t have a
sixpack, but my shoulders are wider than my waist. Good enough. (The
fact that I appear to be stable is cause for at least as much
celebration as the original weight loss - I needed an approach that
would work even under stress.)</p>

<p>As an aside, the transition back to carbohydrate-heavy foods for
cheat days turned out to be so unpleasant that I cut them out
entirely. I’ll occasionally indulge in a beer on a very special
occasion, but I always know I’ll pay for it later.</p>

<h3 id="march-no-internet-arguments-">March: No internet arguments. ✔</h3>

<p>This one needs to be unpacked a little bit - the intent wasn’t to
conduct my online life as a Kumbaya circle. I had noticed, however,
that I had been allowing myself to engage emotionally in online
arguments (not discussions) with strangers. Whether it was Trump
supporters who thought I didn’t have the right to comment on gun
control until I’d served a tour of duty in the US armed forces, or
Clojure enthusiasts who’d downloaded GHC one time then decided that
Haskell’s syntax was icky, I was wasting time and emotional energy
engaging with people who were not interested in improving their
understanding, and who had no intention of changing their minds if
their arguments turned out to be wrong. My time is better invested in
my wife, my son, and my friends (both online and off).</p>

<p>This worked, mostly. Every now and then I slipped and let an
intemperate comment fly, or misread someone as wanting a discussion of
policy when they really just wanted a cheer squad, but I didn’t get
sucked into the spiraling, angry threads I used to disappear into.
Removing Reddit and Hacker News from regular reading helped there too.</p>

<h3 id="april-language-learning-">April: language learning. ✘</h3>

<p>This one was a flop. I started learning Mandarin with
<a href="https://www.memrise.com">Memrise</a>, an excellent tool that helped me
get my Vietnamese to haltingly conversational (provided my
interlocutor was happy to talk about the weather and how beautiful Da
Nang was.) The intent was to try to replace time spent on Twitter,
Facebook, online chess and various mobile games (curse you,
<a href="https://polytopia.wikia.com/wiki/The_Battle_of_Polytopia_Wikia">Polytopia</a>,
why must you be so fascinatingly tactical?) with something useful that
could be done in isolated pockets of time.</p>

<p>I’m not exactly sure why this failed to become a habit. It’s possible
that without the anticipated payoff of actually going to China and
being able to talk to locals, it just wasn’t exciting enough to
overcome the difficulty of learning all the characters. If I were going
to try this again, I think I’d focus on conversational Mandarin and
stick to the pinyin script, and perhaps book a trip up front to act as
a forcing function.</p>

<h3 id="june-harmonica-partial-credit">June: Harmonica. Partial credit.</h3>

<p>I wanted something musical I could do with Arthur. The goal was to
practice 10 minutes a day, and to be able to play three recognisable
songs.</p>

<p>I didn’t come close to the practice requirements past that first month
(and I’m pretty sure I annoyed my in-laws by wandering off and
practicing while on holiday), and I didn’t get 3 songs, but I can do a
pretty reasonable ABC song and improvise something that sounds less
squawkily mournful than my initial attempts. Arthur has also started
squeaking away on it, which is honestly the cutest thing I’ve ever seen.</p>

<h3 id="july-expand-career-skills-">July: expand career skills. ✘</h3>

<p>Unambiguous failure. I started an online course on deep learning, but
failed to budget enough time to make progress. Can’t see how this
one’s going to work in the future without completely ignoring my
family in the evenings, either.</p>

<h3 id="rest-of-year">Rest of year</h3>

<p>Past July, I didn’t start any new experiments, though I managed to
maintain some of those I had started.</p>

<h2 id="conclusion">Conclusion</h2>

<p>So, 2017 was a bit of a mixed bag, but I’m basically happy with it. A
healthier diet and emotional equanimity have done wonders for my state
of mind, my relationships, and my general happiness with the world,
and the failures have helped me scope out my current surrounding
terrain for self-improvement. The things that worked seem to have
worked because contravening them was clear: it’s pretty obvious when
you’re biting down on a slice of pizza or a bad-faith political
argument, but less immediately obvious that you’re playing a mobile
game rather than pulling out the harmonica or practicing your flash
cards.</p>

<p>As you might expect, I’m not going to tell you the content of this
year’s goals, but the overriding theme will be to have systems in
place to make it as obvious as possible in the moment that I’m
contravening my stated goals. I’ll also be trying to make them about
processes rather than results: with the best will in the world, I may
never lift 500 lb, but I can make sure I lift three times a week.</p>

<p>If you got this far, thanks for reading, and I hope you have a
wonderful 2018.</p>

<p><a name="proviso">*</a>
as ever, it’s <a href="https://www.bassam.com/single-post/CSI-TED-Talks-What-Derek-Sivers-Was-Really-Saying">more complicated than that</a></p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><summary type="html"><![CDATA[Traditionally, the new year is when we all make public proclamations of how we’re going to be better humans: this year, we will read more, run more, listen more, eat healthier, and generally be our best selves.]]></summary></entry><entry><title type="html">A Modest Scraping Proposal</title><link href="http://shimweasel.com/2017/07/13/a-modest-scraping-proposal" rel="alternate" type="text/html" title="A Modest Scraping Proposal" /><published>2017-07-13T00:00:00+00:00</published><updated>2017-07-13T00:00:00+00:00</updated><id>http://shimweasel.com/2017/07/13/a-modest-scraping-proposal</id><content type="html" xml:base="http://shimweasel.com/2017/07/13/a-modest-scraping-proposal"><![CDATA[<h1 id="why-scraping-libraries-in-haskell-arent-good-enough">Why scraping libraries in Haskell aren’t good enough</h1>

<p>Every time I mention that Haskell webscraping libraries are a bit
lacking, somebody points to (http-conduit|wreq) and (tagsoup|taggy)
and suggests that it’s just a matter of gluing them together. This is
akin to saying that your home-made car is just as good as any other
car, and pointing to a brake pad and a steering column.</p>

<p>A robust scraping pipeline needs to handle caching, respect
robots.txt, accept input and process output in a variety of formats,
gather metadata about the fetched data, handle quotas &amp; rate limits,
and log sites/second, error rates and unexpected failures in real time
to a dashboard or similar. (Some of these are derived from scrapy, the
python scraping framework - others are things I’ve needed along the
way. As to why you shouldn’t just use scrapy: it’s slow, untyped, and
prone to running out of memory.)</p>

<p>I may actually write this library at some point, but would be just as
happy if someone read this and beat me to it.</p>

<h2 id="the-shape-of-the-problem">The shape of the problem</h2>

<p>What I mean by scraping is something at least semi-adversarial.
Fetching items from the Twitter API is not scraping: it may be true
that the owner of the website doesn’t mind you scraping them (which is
why we check robots.txt), but they will go to absolutely no effort to
avoid breaking your scrapers.</p>

<p>There are many ways to scrape. Sometimes you can just enumerate a list
of URLs: more commonly, each fetch gives you both a list of new URLs
to fetch and a list of results. This allows a recursive approach to
fetching, and also requires some kind of store so you don’t fetch the
same URL twice and end up in an endless loop.</p>

<h2 id="robots">Robots</h2>

<p>While you can get away with quick and dirty scraping without checking
robots.txt (and let’s face it, we’ve all run curl in a loop before),
it’s pretty rude to do large scrapes without checking that the original
provider is OK with it. Thankfully, some uncommonly handsome and
benevolent developer has done the heavy lifting of parsing and
processing robots.txt files - any broad spectrum scraping solution
ought to incorporate https://hackage.haskell.org/package/robots-txt .</p>

<h2 id="caching">Caching</h2>

<p>Scraping is an inherently trial-and-error driven process. The web is
messy and inconsistent, and it isn’t uncommon to discover your XPath
query (or equivalent) is not quite right hundreds of thousands of
pages in. Proper caching of both raw fetches and processed results
means you can fix small errors in code and selectively redo failing
parses - it also means that if the site under scrape has intermittent
failures, you can refetch missing pages without starting the scrape
from scratch.</p>

<p>More or less anything can be used as a persistent store, so long as
it’s indexed on the URL. I used multiple sqlite databases as a backend
when I was scraping 160 million sites a day: for smaller scrapes, a
single PostgreSQL instance should be fine. Your bottleneck is going to
be the fetching, not local writes.</p>
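<p>A minimal sketch of such a fetch-through cache (hypothetical helper,
with an in-memory Map standing in for the sqlite or PostgreSQL store):</p>

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import qualified Data.Map.Strict as M

type URL = String

-- Look up the raw page in the cache; on a miss, run the real fetch
-- and remember the result, keyed on the URL.
cachedFetch :: IORef (M.Map URL String) -> (URL -> IO String) -> URL -> IO String
cachedFetch cacheRef fetch url = do
  cache <- readIORef cacheRef
  case M.lookup url cache of
    Just body -> pure body  -- cache hit: no network traffic
    Nothing -> do
      body <- fetch url     -- cache miss: really fetch, then remember
      modifyIORef' cacheRef (M.insert url body)
      pure body
```

<p>A real version would key the processed results the same way, so a
bad parse can be re-run against the cached raw fetch without touching
the network.</p>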

<h2 id="input">Input</h2>

<p>A good scraping library should
provide a way of querying common data formats like XML, HTML, JSON,
CSV and plain text. The standard way of dealing with these formats in
Haskell is to use Aeson to parse JSON to a datatype representing what
you expect to see: I like this method for normal HTTP interactions,
but scraping is a process of extracting a relatively tiny
amount of information from a large chunk of text. Something like
lens-aeson lets you dig into a JSON structure in an ad-hoc way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>foo ^? key "hah" . nth 1 . key "hfasd"
</code></pre></div></div>

<p>Because you typically add scraped fields one at a time, this avoids
the necessity of adding both a record field and an aeson parser every
time you add a field.</p>

<p>Similar reasoning applies to the other formats.</p>

<h2 id="output">Output</h2>

<p>It should provide a way of serialising results to CSV, XML or
line-oriented JSON (or standard JSON if you’re feeling masochistic).</p>

<p>This implies that we should store our results in an output-neutral
format - in keeping with Scrapy’s nomenclature, I’ll call each result an
Item, and think of it as a map of column names to values.</p>
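<p>To make that concrete (a sketch with hypothetical types - the
Column/Item definitions later in this post are the same idea),
serialising an Item is then mechanical:</p>

```haskell
import Data.List (intercalate)

-- scalar values only, per the output-neutral Item format
data Column = ColText String | ColInt Integer

type Item = [(String, Column)]

renderCol :: Column -> String
renderCol (ColText t) = t
renderCol (ColInt i)  = show i

-- one Item per CSV line (quoting elided for brevity)
toCsvLine :: Item -> String
toCsvLine = intercalate "," . map (renderCol . snd)

-- one Item per line of line-oriented JSON
toJsonLine :: Item -> String
toJsonLine item = "{" ++ intercalate "," (map pair item) ++ "}"
  where
    pair (k, ColText t) = show k ++ ":" ++ show t
    pair (k, ColInt i)  = show k ++ ":" ++ show i
```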

<h2 id="metadata">Metadata</h2>

<p>It’s often interesting to know metadata about the fetch process
itself: a scraping library ought to allow access to the path of URLs
that got you to the current one as well as the robots.txt that allowed
access and any headers from the server response.</p>

<h2 id="rate-limiting">Rate limiting</h2>

<p>Rate limiting is important for a few reasons. It’s not at all uncommon
to accidentally overload a site by scraping too aggressively - I’ve
even accidentally taken down DNS servers in my time. A good scraping
library should let you limit global requests/second as well as
reqs/second to a particular domain.</p>

<p>It’s also important to have configurable quotas for individual fetches.
I’ve had a PHP site on a fast server spew hundreds of megabytes of
error messages a second at me, all of which got dutifully loaded into
the database. Don’t be like me, be smart.</p>
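<p>The decision itself is small enough to sketch purely (hypothetical
helper, not from any library): given the recorded timestamps of recent
requests, may we fetch now?</p>

```haskell
import qualified Data.Map.Strict as M

type Domain = String

-- Sliding-window limiter: check a global requests/second cap and a
-- per-domain cap against the recorded request times (in seconds).
allowFetch :: Int -> Int -> Double -> M.Map Domain [Double] -> Domain -> Bool
allowFetch globalPerSec domainPerSec now history dom =
     length (filter recent (concat (M.elems history))) < globalPerSec
  && length (filter recent (M.findWithDefault [] dom history)) < domainPerSec
  where recent t = t > now - 1.0
```

<p>An actual fetcher would also prune old timestamps and wrap this in
an MVar or STM variable shared between worker threads.</p>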

<h2 id="logging">Logging</h2>

<p>Scrapes often go wrong hours in. It’s really useful to have something
like scrapinghub.com that gives you a visualisation of how many
requests/s you’re getting, and items/s is pretty useful too, but
scrapinghub only works with python+scrapy, so that’s no help.</p>

<p>We’re getting a little out of scope here - logging is its own thing
that can get almost arbitrarily complicated - but at the very least,
you’d want something that can spit structured info to stderr, and a
way to raise an alert when items/s, requests/s or bytes/s fall below a
threshold over the last minute or so.</p>

<h2 id="testing">Testing</h2>

<p>You need at least two kinds of tests: individual offline tests for
each kind of page you want to fetch, as well as a rougher online test
that given a starting point, fetching provides about the number of
results you expect. This is necessarily hazy, but if you usually do 2
fetches for each full item, and that suddenly blows out to 100, it
usually indicates that the formatting has changed on some page you
rely on, and you need to fix it.</p>
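<p>The online check boils down to a single predicate (a hypothetical
helper, sketched under the assumptions above): compare the observed
fetches-per-item ratio against the expected one, with a generous
tolerance.</p>

```haskell
-- If we normally make about `expected` fetches per full item, flag the
-- run when the observed ratio drifts past `tolerance` times that - a
-- common sign that a page format we rely on has changed.
ratioLooksSane :: Double -> Double -> Int -> Int -> Bool
ratioLooksSane expected tolerance fetches items =
  items > 0 && observed <= expected * tolerance
  where observed = fromIntegral fetches / fromIntegral items
```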

<h2 id="security">Security</h2>

<p>You might think there isn’t much to say about security in a scraper -
after all, it’s the one calling all the shots. However, it’s important
to think about what implicit (eg. IP range) and explicit (password,
token, etc) credentials you’re using when scraping. It’s entirely
possible for a site that knows it’s going to be scraped to redirect
you to a resource for which you have privileged access: if the result
of your scrape will become public later, you’ve just exposed private data.</p>

<p>The library should have a default configuration under which it won’t
scrape private IPs, use creds, or even go off-domain (though
domain.com -&gt; www.domain.com should be fine) - that way you only get
potentially dangerous behaviour if you explicitly ask for it.</p>

<p>(Redirects are a thorny problem in general: a library should catch
redirect loops and allow the user to set policies on off-site
redirects, maximum redirect chain lengths, etc)</p>

<h2 id="distribution">Distribution</h2>

<p>I’ll be a heretic here and say you probably don’t need distribution.
I scraped the front page of 160 million domains every day with 13
machines: if you’re not working at that scale, it really doesn’t
matter.</p>

<p>If you really needed to extend the design, you’d want to set it up so
that you had a central blocking queue that started out with just the
initial URLs. Scrapers would connect to it to get URLs, and at each
step acknowledge that they’ve fetched that URL, along with the results
of the fetch - they’d never do more than one level of fetching. After
n minutes without an acknowledgment, the URL can be sent out again,
though you might want to return the history in the response, to avoid
redundant fetching.</p>

<p>I implemented this model over ZMQ, but HTTP would be fine and probably
simpler - it’s going to be chatty no matter what, so avoid it if you can.</p>
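<p>The queue logic itself is independent of the transport, so here is a
pure sketch of it (hypothetical names): lease a URL to a worker, accept
an acknowledgment, and requeue anything whose lease has expired.</p>

```haskell
import qualified Data.Map.Strict as M

type URL = String

data QueueState = QueueState
  { pending :: [URL]             -- not yet handed out
  , leased  :: M.Map URL Double  -- awaiting ack, with time of lease
  }

-- hand a URL to a worker, recording when it was leased
lease :: Double -> QueueState -> Maybe (URL, QueueState)
lease _   (QueueState [] _)     = Nothing
lease now (QueueState (u:us) l) = Just (u, QueueState us (M.insert u now l))

-- worker acknowledges a completed fetch
ack :: URL -> QueueState -> QueueState
ack u q = q { leased = M.delete u (leased q) }

-- requeue anything leased more than `timeout` ago without an ack
reapStale :: Double -> Double -> QueueState -> QueueState
reapStale now timeout (QueueState p l) =
  QueueState (p ++ M.keys stale) fresh
  where (stale, fresh) = M.partition (\t -> now - t > timeout) l
```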

<h2 id="queue-control">Queue control</h2>

<p>One of the problems I faced with Scrapy was that as it used an evented
framework rather than an explicit queue, it was extremely easy to blow out memory. In one example,
my source data was a set of gzipped files that listed other gzipped
files that <em>then</em> contained links to the data I needed. Because I
didn’t have control of the queue, it ended up loading every
single one into memory before it ever got to the juicy part with the
final data.</p>

<h2 id="putting-it-into-code">Putting it into code</h2>

<p>So, notionally, our scraper takes data in some known format and
extracts some of that information into two things: a (possibly empty)
list of new URLs to look at (along with a tag to indicate what kind of
URL it is) and a list of dictionaries that represents the actual
information you’re interested in from that page. (Commonly this will
actually be a single element, but sometimes if you’re scraping search
results, you’ll get many notionally separate chunks of data on a single page.)</p>

<p>What might this look like?</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- user code
data PageType
  = InitialIndex
  | IntermediateIndex
  | ActualData
  deriving (Eq,Ord)
</code></pre></div></div>

<p>This is what defines the shape of the scrape, as seen by the user.
Separating out the different kinds of pages we see means that we can
parse them differently, without relying on fragile regular expressions
on the URL - it also means we can prioritise some fetches above
others, as well as restrict download slots on a page-type basis.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- library code - this needs to be fleshed out.
-- something slightly less general than Aeson's Value type: just scalars.
data Column = ColText Text
            | ColInt  Integer
            | ...

type Item = [(Text,Column)]

-- Quotas, number of concurrent threads, logging, robots, etc
data Config a =
  Config
  { threads :: Int
  , downloadSlots :: a -&gt; Int
  ...
  }

data Metadata = ...

-- here a is instantiated to PageType
scrape :: Config a
       -&gt; [(a, URL)]
       -&gt; (a -&gt; Metadata -&gt; ([(a,URL)], [Item]))
       -&gt; ([Item] -&gt; IO ())
       -&gt; IO ()
</code></pre></div></div>

<p>(This ought to be in something like Conduit or Pipes for real code: it’s
presented in this simpler form for clarity only.)</p>

<p>Dissecting the type of scrape, we give it some configuration
information, an initial list of URLs (each with an associated tag), an
extractor function, and a way to do something with each row.</p>

<p>Notice the downloadSlots field in the Config record? That’s there so
that we can control intermediate fetches, and avoid the memory blowout
I described earlier. You might think it’s enough to have an Ord
instance for PageType and use that: unfortunately, what typically
happens is that the first few hundred pages will all be InitialIndex
and IntermediateIndex pages; while ActualData requests will be
prioritised, by the time any are actually processed, we might have
already blown out memory. The downloadSlots value determines how many of that
kind of download can be downloading at once: that combined with the
Ord instance on PageTypes means we can keep a queue of fetchable
things ordered in a sensible way that minimises the amount of memory required.</p>
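<p>As a sketch of that scheduling rule (a hypothetical helper; the real
library would also update the in-flight counts as downloads start and
finish):</p>

```haskell
import qualified Data.Map.Strict as M

-- Pick the next fetchable (tag, url) from the priority-ordered queue,
-- skipping any page type whose download slots are all in use.
nextFetch :: Ord a
          => (a -> Int)   -- downloadSlots, as in Config
          -> M.Map a Int  -- downloads currently in flight, per page type
          -> [(a, url)]   -- the queue, already sorted by the Ord instance
          -> Maybe (a, url)
nextFetch slots inFlight queue =
  case filter free queue of
    []       -> Nothing
    (next:_) -> Just next
  where free (tag, _) = M.findWithDefault 0 tag inFlight < slots tag
```

<p>With index pages limited to a handful of slots, data pages get
fetched and flushed long before the indexes can pile up in memory.</p>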

<h1 id="conclusion">Conclusion</h1>

<p>One reviewer quite reasonably raised the concern that this is an
extremely un-Haskelly library. Our platonic ideal of a Haskell library
is something that exports a single, coherent concept in such a way
that it never needs to be reimplemented. This is not that: scraping is
a dirty, error-prone, highly contingent endeavour. The goal here is to
package up a pile of hard-won knowledge about where the facerakes are
and make them easier to avoid.</p>

<p>(In other news, if you actually need this or other Haskell work done,
I am available for hire: mwotton@gmail.com.)</p>

<p>(Thanks to @tureus, @jfischoff, @mxavier and @thumphriees for their
detailed feedback, and thanks to everybody else who read it, even if
you couldn’t think of improvements.)</p>

<p>(discussion <a href="https://www.reddit.com/r/haskell/comments/6nazqm/notes_on_design_of_webscrapers_in_haskell/">here</a>)</p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="haskell scraping" /><summary type="html"><![CDATA[Why scraping libraries in Haskell aren’t good enough]]></summary></entry><entry><title type="html">a fallible guide to persistent-template</title><link href="http://shimweasel.com/2017/05/18/a-field-guide-to-persistent-template" rel="alternate" type="text/html" title="a fallible guide to persistent-template" /><published>2017-05-18T00:00:00+00:00</published><updated>2017-05-18T00:00:00+00:00</updated><id>http://shimweasel.com/2017/05/18/a-field-guide-to-persistent-template</id><content type="html" xml:base="http://shimweasel.com/2017/05/18/a-field-guide-to-persistent-template"><![CDATA[<p>Template Haskell is often considered a smell in Haskell, responsible
for everything from slow compile times to impenetrable error messages
to the cow’s milk turning sour. Some of this is warranted, but if you
need to check anything based on compile-time information, it is
also the only game in town.</p>

<p>Persistent is a good example of where TH is arguably justified.
Defining our database tables and our types in the same format means
that there is a single source of truth, with no way for the code to
fall out of touch with the database.</p>

<p>Nonetheless, sometimes you do want to break this model; in this case,
I have a type (Status from twitter-types) that I’d like to persist in
the database. I can’t include this definition in the standard
quasiquoted file: the type already exists, I can’t define it again.
This leaves me with two options:</p>

<ol>
  <li>
    <p>use a type with the same structure as the Status I get from
twitter-types, and write some conversion functions for converting
between persisted and original statuses. Probably about twenty
lines and ten minutes’ work.</p>
  </li>
  <li>
    <p>read the source of persistent-template and work out a way of
controlling the declaration of datatypes on a table-by-table basis.</p>
  </li>
</ol>

<p>So! Let’s dig into how it’s currently set up.</p>

<p>Our top level entry point is “share”.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>share :: [[EntityDef] -&gt; Q [Dec]]
      -&gt; [EntityDef]
      -&gt; Q [Dec]
</code></pre></div></div>

<p>In the first argument, we have a list of “builders”, in a sense -
we’ll pass a list of EntityDefs to each of them, and each will create
some declarations (Q [Dec]). In the normal case, we’ll pass ‘mkPersist
sqlSettings’ and ‘mkMigrate “migrateAll”’.</p>

<p>In the second, we have a list of EntityDefs, typically (but not necessarily) coming
from the ‘persistLowerCase’ quasiquoter.</p>

<p>Finally, our result type is a Q [Dec]. This is a little bit strange:
‘share’ is called at the top level, and it looks like a bare top-level
expression, not a binding declaration. Digging through the docs (and
https://stackoverflow.com/documentation/haskell/5216/template-haskell-quasiquotes#t=201705181813575250883)
, it turns out that when we have a Q [Dec] at the top level, we can
omit the standard $( … ) syntax that we’d usually use for
introducing TH to normal Haskell code. Almost a little too convenient,
but ok.</p>

<p>This tells us that we are going to have to make at least two changes.
Our EntityDef will need to carry information about whether a
datatype needs to be created for the
Entity or not. We can then monkey with mkPersist to look at that
information and only generate a datatype if we have requested one. (If all goes well,
mkMigrate ought to work unchanged.)</p>

<p>(For the moment, we’ll change this in-place: the question of whether
this is to be a mergeable fork or an add-on can be dodged for now.)</p>

<p>So: first order of business is EntityDef. <code class="language-plaintext highlighter-rouge">ag 'data EntityDef'</code>
shows us that it is defined in Database.Persist.Types.Base, meaning we’ll have to edit persistent as
well as persistent-template. We’ll add an ‘isStandalone’ field
to the EntityDef declaration that will default to False. When we do that, we try compiling to see
what broke:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    /home/mark/projects/persistent/persistent/Database/Persist/Quasi.hs:299:5: error:
        • Couldn't match expected type ‘EntityDef’
                      with actual type ‘Bool -&gt; EntityDef’
        • Probable cause: ‘EntityDef’ is applied to too few arguments
          In the second argument of ‘($)’, namely
            ‘EntityDef ...
</code></pre></div></div>

<p>which makes sense. Looking at mkEntityDef, we can define a dummy value
for isStandalone, just to check everything else is hunky-dory, and it
compiles. Great. (Parenthetically, notice that we are never too far
from a type-correct system, even if it doesn’t quite do what we want
yet. Get too far into the weeds and it is very difficult to extract
yourself.)</p>

<p>Now we want to see where we can stash a declaration that a particular
entity is standalone.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkEntityDef :: PersistSettings
            -&gt; Text -- ^ name
            -&gt; [Attr] -- ^ entity attributes
            -&gt; [Line] -- ^ indented lines
</code></pre></div></div>

<p>It’s not the PersistSettings, they’re global. You <em>could</em> cram it into
the name of the model as some kind of godawful in-band signalling, but
this path always leads to pain. Let’s keep looking. [Line] doesn’t
seem right, they’re the individual fields of the table, and we’re
trying to find where to specify a feature of the entity as a whole. By
a process of elimination, it has to be [Attr]. Let’s go grepping
again. (I should probably set tags up instead of dumb text search, but
grep/ag/ack are simple and dependable.)</p>

<p>Oh, there’s a surprise - ‘ag “data Attr”’ and ‘ag “newtype Attr”’
yield nothing. Turns out it’s just a type synonym for Text. That’s
unexpected - it’ll mean we have to take a bit of attribute namespace,
but I suppose that was inevitable anyway. This means we can actually
back out our earlier changes and just pass “standalone” as a textual attribute.</p>

<p>With that, we just need to monkey with mkEntity a little.
‘dataTypeDec’ looks promising: if we check for “standalone” in the
attributes, we should be just about home free.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+    let inclDtd = if "standalone" `elem` (entityAttrs t)
+                  then []
+                  else [dtd]
     return $ addSyn $
-       dtd : mconcat fkc `mappend`
+       (mconcat (inclDtd:fkc)) `mappend`
</code></pre></div></div>

<p>Compiles, shipit!</p>

<p>We should probably check that this works the way we expect.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>share [mkPersist sqlSettings] [persistLowerCase|
Bar standalone json
    name Text
|]
</code></pre></div></div>

<p>Gratifyingly and hopefully unsurprisingly, this fails in an obvious
way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
/home/mark/projects/persistent/persistent-template/test/main.hs:62:1: error:
    Not in scope: type constructor or class ‘Bar’
    Perhaps you meant ‘Baz’ (line 53)
</code></pre></div></div>

<p>Let’s define it then!</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Bar = Bar { barName :: Text }
</code></pre></div></div>

<p>Clean compile, and all existing tests green.
We haven’t actually tested this functionality yet, but inside the
persistent lib is a difficult place to do it - we’ll add a more
thorough test in our app, given that we had to write that code anyway.</p>

<p>Checking against my app, I realise that I want to store a Status
(which fortuitously is named according to Persistent conventions - if
it weren’t, I’d need to turn mpsPrefixFields off in sqlSettings.)</p>

<p>I define Status in the model quasiquoter, using all the same fields -
we want to get back to it working, and <em>then</em> turn on standalone.</p>

<p>Now I get a bunch of errors about PersistField not being defined for a
range of types. Excellent! These are easy to define with
StandaloneDeriving (Show is already derived).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>derivePersistField "Entities"
deriving instance Read Entities
deriving instance Read HashTagEntity
deriving instance Read UserEntity
deriving instance Read x =&gt; Read (Entity x)
...
</code></pre></div></div>
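<p>For anyone who hasn’t used StandaloneDeriving before, the pattern is just this (a toy type here; the real ones come from twitter-types):</p>

```haskell
{-# LANGUAGE StandaloneDeriving #-}
-- Toy illustration of StandaloneDeriving: the type derives Show up front,
-- and Read/Eq get bolted on afterwards as standalone instances.
data UserEntity = UserEntity { userName :: String }
  deriving Show

deriving instance Read UserEntity
deriving instance Eq UserEntity

main :: IO ()
main =
  -- a show/read round-trip, the same property derivePersistField leans on
  print (read (show (UserEntity "mark")) == UserEntity "mark")
```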

<p>We continue doing this more or less mechanically until we get to this
odd complaint:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    • No instance for (persistent-2.7.0:Database.Persist.Sql.Class.PersistFieldSql
                         TW.Status)
</code></pre></div></div>

<p>We don’t <em>particularly</em> want to persist a Status as a field: the whole
point was to model it explicitly in a table. Why are we getting this?
Ah! The Status type has a self-reference in it! quotedStatus is a
Maybe Status!</p>

<p>It’s at this point I’m inclined to say “ok, this was a bad idea, a
single isomorphism isn’t so bad”, but I want to finish this post, so
let’s follow the rabbit down the hole and define <code class="language-plaintext highlighter-rouge">derivePersistField
"Status"</code> too. Might as well be hung for a sheep as a lamb.</p>
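<p>Under the hood, derivePersistField marshals through Show and Read. Sketched by hand, with a toy Status and a plain String where persistent really stores PersistValue text (all names here are stand-ins, not persistent’s API):</p>

```haskell
-- Hand-written sketch of what derivePersistField generates: store the
-- show'd text, and read it back with a full-parse check.
data Status = Status { statusId :: Int, statusBody :: String }
  deriving (Show, Read, Eq)

toPersistText :: Status -> String
toPersistText = show

fromPersistText :: String -> Either String Status
fromPersistText s = case reads s of
  [(x, "")] -> Right x
  _         -> Left ("could not read Status from: " ++ s)

main :: IO ()
main = print (fromPersistText (toPersistText (Status 1 "hello"))
              == Right (Status 1 "hello"))
```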

<p>Now we run into an odd problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/home/mark/projects/owlstacks.com/Model.hs:23:1: error:
    Duplicate instance declarations:
      instance PersistFieldSql Status -- Defined at Model.hs:23:1
      instance PersistFieldSql Status -- Defined at Orphans.hs:68:1
   |
23 | share
   | ^^^^^...

/home/mark/projects/owlstacks.com/Model.hs:23:1: error:
    Duplicate instance declarations:
      instance PersistField Status -- Defined at Model.hs:23:1
      instance PersistField Status -- Defined at Orphans.hs:68:1
   |
23 | share
</code></pre></div></div>

<p>Now it’s starting to become a little clearer: we need the PersistField
instance in the Orphans file, which means we also have to be able to
disable instance generation in the quasiquoting code. Once more unto the
breach, dear friends: we’re going back into TH.hs.</p>

<p>Inside mkPersist, we need to monkey with the defined instances:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>standalone t = "standalone" `elem` entityAttrs t
</code></pre></div></div>
<p>and</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>filter (not . standalone)
</code></pre></div></div>

<p>Ok! Now we have “multiple declarations of StatusId”,
and here we come to a bit of a grinding halt. Status is one type from
twitter-types, StatusId is another. Persistent expects to be able to
nab that piece of the namespace to describe the primary key of the
table in Haskell, and it isn’t configurable.</p>

<p>I could write more code to get around this, but we have proceeded well
past the point where this exercise could be expected to yield useful
code, and I think I’m going to call it here. Persistent is a pretty
opinionated framework: if you have significantly different needs, or
want to serialise types that come from elsewhere to the database
without introducing an intermediary type, it
looks like a better idea to just use something else. Apologies for the
abrupt end, it’s as much a surprise to me as anyone else.</p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="haskell negativeresult streamofconsciousness" /><summary type="html"><![CDATA[Template Haskell is often considered a smell in Haskell, responsible for everything from slow compile times to impenetrable error messages to the cow’s milk turning sour. Some of this is warranted, but if you need to check anything based on compile-time information, it is also the only game in town.]]></summary></entry><entry><title type="html">haskell profiling without pain</title><link href="http://shimweasel.com/2017/04/27/haskell-profiling-without-pain" rel="alternate" type="text/html" title="haskell profiling without pain" /><published>2017-04-27T00:00:00+00:00</published><updated>2017-04-27T00:00:00+00:00</updated><id>http://shimweasel.com/2017/04/27/haskell-profiling-without-pain</id><content type="html" xml:base="http://shimweasel.com/2017/04/27/haskell-profiling-without-pain"><![CDATA[<p>In general, stack is pretty good at caching build artifacts - the
first build might be glacial, but incremental builds are snappy. This
all goes out the window when you start playing with flags like
<code class="language-plaintext highlighter-rouge">--profile</code>, though - stack sees that something has changed and
dutifully rebuilds every-bloody-thing. This is a huge disincentive to
casually run profiling builds.</p>

<p>thankfully, stack keeps all its bits and bobs inside ./.stack-work, so
all you need to do is create a shadow directory like so:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir foo-prof
cd foo-prof
lndir ../foo
rm -rf .stack-work
stack build --profile
</code></pre></div></div>

<p>(on ubuntu, you can get lndir from xutils-dev.)</p>

<p>now, you have a directory with a separate .stack-work, and can mess around
in that directory for all your profiling needs without slowing
everything else down to a shuddering halt.</p>

<p>(nb: this will not cope terribly gracefully with new files being
added. a workaround is to symlink only the top-level files and
directories instead of a full shadow tree: that way, only top-level
changes will break things.)</p>

<p>(another possibility would be to set STACK_WORK=.stack-work-profiling
whenever you run profiling commands, but you’re going to screw it up
eventually - probably better to have totally separate implicit contexts.)</p>]]></content><author><name>Mark Wotton</name><email>mwotton@shimweasel.com</email></author><category term="haskell" /><summary type="html"><![CDATA[In general, stack is pretty good at caching build artifacts - the first build might be glacial, but incremental builds are snappy. This all goes out the window when you start playing with flags like –profile, though - stack sees that something has changed and dutifully rebuilds every-bloody-thing. This is a huge disincentive to casually run profiling builds.]]></summary></entry></feed>