Minor Gripe

2019-08-21 -- Little projects for programmers

Chris Ertel

Introduction

I keep a list of little neat projects that I think some programmers might find fun to write for themselves, if not actually use all the time. I figure I’ll list out what they are and give maybe quick blurbs about why I think they’re neat projects and some ideas for how to tackle them. I’ll do a few more write-ups with more stuff from that list from time to time.

Some projects (all web-based):

Personal wiki
Blog generator
Little web crawler

Personal wiki

One of the best tools I used to have a copy of was Microsoft OneNote, but for the game development my friends and I used to do, we were big fans of DokuWiki.

Anyways, ignoring the OG WikiWikiWeb, the feature list of a modern wiki (say, MediaWiki) looks something like:

Non-HTML markup for writing pages
Easy linking of pages to and from each other
Media file uploads for images, sound, video, and so on
Ability to write references and footnotes on pages
Ability to create tables of contents for pages
Multiple user accounts
Permissions system for users
Ability to view different previous versions of a page
Ability to diff two page versions against each other
Some form of comment system, either through talk pages (MediaWiki style) or on-page comments (Confluence)
Some form of folder structure for organizing spaces (may also tie into permissions)
Some form of search

This seems like a lot, but developing a wiki for personal use doesn’t require nearly that much stuff. A minimal feature list for one might look like:

Markdown document markup
Page linking (elaborated below)

Now, with those two things, you can imitate most of what you’ll use on a daily basis. Since we’re not supporting multiple users, if we’re a little simple in our usage we can get away with basic HTML forms–no need for websockets or long-polling or whatever.

For linking, the functionality we want is the ability to define a link inside the wiki, and then have it automatically link to a “create a page” form if the page doesn’t exist, or to the page if it does.

The way to do that is to generate the page HTML for everything, crawl and collect metadata on all the pages, and then rewrite all the HTML links to point at the correct thing (external link, internal stub link, or internal page) before finally writing everything to disk (or database or wherever you think it should live).
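Here's a minimal sketch of that link-rewriting pass in Python, using only the standard library. The URL layout (`/wiki/<name>` for pages, `/wiki/_new?title=<name>` for the create form), the `stub` class, and the function names are all assumptions made up for illustration. A regex over `href` attributes is good enough for HTML you generated yourself, though a real HTML parser would be sturdier.

```python
import re
from urllib.parse import quote, urlparse

HREF = re.compile(r'href="([^"]*)"')


def rewrite_links(html: str, existing_pages: set[str]) -> str:
    """Point each href at the page, at a create-page form, or leave it alone."""

    def resolve(match: re.Match) -> str:
        target = match.group(1)
        parsed = urlparse(target)
        # Anything with a scheme or host (https://..., mailto:...) is external.
        if parsed.scheme or parsed.netloc:
            return match.group(0)
        name = target.strip("/")
        if name in existing_pages:
            # Internal link to a page that exists.
            return f'href="/wiki/{quote(name)}"'
        # Internal link to a page nobody has written yet: send to the form.
        return f'href="/wiki/_new?title={quote(name)}" class="stub"'

    return HREF.sub(resolve, html)


def build_site(rendered: dict[str, str]) -> dict[str, str]:
    """Two passes: collect the page names, then rewrite every page's links."""
    existing = set(rendered)
    return {name: rewrite_links(html, existing) for name, html in rendered.items()}
```

Feeding `build_site` a dict of page name to rendered HTML gives back the same pages with their links resolved, ready to write wherever you keep them.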

Similarly, we can use the HTML transform trick to handle things like downloading and inlining images for use in pages (done in my blog generator), discovering footnotes and appending them, and all sorts of other things.

The humble personal wiki itself, though, starts with those two basic features.

Blog generator

WordPress is the 800-lb gorilla of the blogging realm, closely followed by Medium. If you like hosting your own files though (and you should, because the web is meant to be federated, but that’s another story entirely), you’ve probably looked at Octopress or Jekyll or something. These platforms offer features like:

WYSIWYG or non-HTML authoring
Version control via git or whatever
Comment systems (hosted or embedded)
Analytics
Slug generation for page URLs
SEO support
Mobile support
Archive indices for finding posts
Tag systems for grouping or organizing posts
Multiple user support
Draft support
Embedded images, sound, and video content
Post versioning

Again, you can throw away almost all of that in order to get a small feature set that might take a day or two to make:

Markdown authoring
Slug generation for page URLs
Index page generation

Most of that can be handled really easily, and in fact you can get away with just bash and pandoc for the Markdown conversion.
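If you'd rather not do the slug part in bash, here's one possible sketch in Python. The rules it applies (fold accents to ASCII, lowercase, squash everything else into hyphens) are my assumption of a reasonable default, not any platform's official behavior.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a post title into a URL-friendly slug."""
    # Split accented characters into base letter + accent, then drop the accents.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower())
    return slug.strip("-")
```

So a title like "Little projects for programmers" would land at `little-projects-for-programmers`.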

Something that will trip you up is that, once you want to create any sort of interesting indices, page headers or footers linking to other posts, tag indices, or whatever else, you’ll discover that you need to actually write code to collect the metadata for each post and then use that to generate those other resources. That will probably mean you need to figure out how you want to attach metadata to a post. I use something similar to Octopress where I store a YAML-like blob at the top of my file, but you could use a separate JSON or YAML document with the same filename as the post and a different extension, or whatever else.
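As a rough sketch of the blob-at-the-top approach: the code below assumes the metadata sits between two `---` lines as plain `key: value` pairs (the Jekyll-style convention, which may not match my own format exactly), and that every post carries `title`, `date`, and `slug` keys. Those key names and both function names are hypothetical.

```python
from html import escape


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a post into (metadata, body). No leading '---' means no metadata."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Closing delimiter: everything after it is the post body.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    # Never found the closing delimiter, so treat the whole file as body.
    return {}, text


def build_index(posts: list[dict[str, str]]) -> str:
    """Render an HTML list of posts, newest first (ISO dates sort as text)."""
    newest_first = sorted(posts, key=lambda meta: meta["date"], reverse=True)
    items = [
        f'<li><a href="{meta["slug"]}.html">'
        f'{escape(meta["date"])} -- {escape(meta["title"])}</a></li>'
        for meta in newest_first
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The same collected metadata can feed tag indices or previous/next links; the index page is just the simplest consumer.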

For comments, you might want to embed something like Disqus, but I’ve found that if you really need feedback then an email address is probably better.

Little web crawler

So, everybody needs a pet spider. Big commercial search engines like Google or crawlers like 80legs have a whole bunch of features, obviously:

Distributed crawling (because
Archiving of crawls
Robots.txt handling
Throttling and polite re-crawling
Web app handling (those pesky JS pages)
Cycle detection
Redirect following
Malformed HTML support
Web interface for managing spiders
Multiple user support for managing spiders
Access control for users
Indexing of crawls to make searching possible
Searching crawls by relevance to a search term
Searching crawls by date
Error reporting when your spiders crawl onto something they shouldn’t and explode

And that’s all like table-stakes. As you can probably guess, a simple little pet spider doesn’t need to do nearly that much.

A pet spider probably wants to know how to:

Use a queue to manage crawls
Crawl (HTTP GET) a document
Perform some inspection on that document (extracting a field via CSS selector, adding to a full-text search engine like Elasticsearch, whatever)
Discover new and existing links from that document to other documents and queue them for crawling
Ignore documents that have already been crawled, or have been crawled too recently
Throttle itself so its excited pursuit of its work doesn’t anger site admins
Handle basic failures like 404s or malformed documents (usually by exploding the spider and making a new one after flagging the crawl as troublesome)

This project probably wants a little database (say, SQLite or Postgres) for storing the queue and skipping redundant crawls, but you could conceivably just use a flat text file and some bash trickery to check for duplicates. An MVP for this might well look like a bash script with curl and pup and a bit of cleverness.
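For comparison, here's roughly what the same MVP could look like in Python with just the standard library. It's a sketch under some loud assumptions: the queue and the seen-set live in memory rather than a database, there's no robots.txt handling, the "inspection" step is just keeping the raw HTML, and names like `crawl`, `fetch_page`, `delay_seconds`, and `max_pages` are made up for this example.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag; tolerant of sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_page(url: str) -> str:
    """HTTP GET a document. urlopen raises an OSError subclass on a 404."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, delay_seconds=1.0, max_pages=50):
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so nothing is crawled twice
    pages, failed = {}, []
    while queue and len(pages) + len(failed) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            failed.append(url)  # flag the troublesome crawl and move on
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(delay_seconds)  # throttle so site admins stay happy
    return pages, failed
```

Passing the fetcher in as a parameter makes it easy to point the spider at a fake site while you're tinkering, and swapping the `seen` set for a table with a last-crawled timestamp is the natural next step.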

Obviously, as time goes on, you’ll want to festoon your little spider with gear and features so it can have better luck on the World Wide Web, but making one to start with isn’t too gnarly.

Why do this?

I believe that almost every developer is skilled enough to tackle these sorts of things, if they only tried. We look at the big players online today and wonder how we can ever live in a world not dictated by their systems. Google doesn’t have to own search. Wikipedia doesn’t have to own knowledge. Medium doesn’t have to own self-indulgent screeds.

To make a better world, we need to empower users. To empower users, we need to educate them to be autonomous. To be autonomous, they have to seize the means of production.

My hope is that, by doing little projects like these, you can realize your potential to be autonomous to one degree or another from those orgs, and feel capable of bringing the fire down to the masses and giving them a chance to be autonomous too.

Tags: projects