Little projects for programmers

Introduction

I keep a list of neat little projects that I think some programmers might find fun to write for themselves, if not actually use all the time. I figure I'll list a few of them here, with quick blurbs about why I think they're neat and some ideas for how to tackle them. I'll do more write-ups with more stuff from that list from time to time.

Some projects (all web-based):

  • Personal wiki
  • Blog generator
  • Little web crawler

Personal wiki

One of the best tools I used to have a copy of was Microsoft OneNote, but for the game development my friends and I used to do, we were big fans of DokuWiki.

Anyways, ignoring the OG WikiWikiWeb, the feature list of a modern wiki (say, MediaWiki) looks something like:

  • Non-HTML markup for writing pages
  • Easy linking of pages to and from each other
  • Media file uploads for images, sound, video, and so on
  • Ability to write references and footnotes on pages
  • Ability to create tables of contents for pages
  • Multiple user accounts
  • Permissions system for users
  • Ability to view different previous versions of a page
  • Ability to diff two page versions against each other
  • Some form of comment system, either through talk pages (MediaWiki style) or on-page comments (Confluence)
  • Some form of folder structure for organizing spaces (may also tie into permissions)
  • Some form of search

This seems like a lot, but developing a wiki for personal use doesn’t require nearly that much stuff. A minimal feature list for one might look like:

  • Markdown document markup
  • Page linking (elaborated below)

Now, with those two things, you can imitate most of what you'll use on a daily basis. Since we're not supporting multiple users, and if we keep our usage simple, we can get away with basic HTML forms: no need for WebSockets or long polling or whatever.
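
To give a sense of how little that needs, here's a sketch of a single-user edit form built on nothing but the standard library's http.server. The /edit route, the form layout, and the pages/ directory are all made up for illustration:

    # A minimal single-user "edit page" flow using only the standard library.
    # The page directory, route, and form layout are illustrative assumptions.
    import html
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from pathlib import Path
    from urllib.parse import parse_qs, urlparse

    PAGES_DIR = Path("pages")
    PAGES_DIR.mkdir(exist_ok=True)

    FORM = """<form method="POST" action="/edit?page={name}">
    <textarea name="body">{body}</textarea>
    <button type="submit">Save</button>
    </form>"""

    class WikiHandler(BaseHTTPRequestHandler):
        def page_name(self):
            return parse_qs(urlparse(self.path).query).get("page", ["home"])[0]

        def do_GET(self):
            # Show the edit form, pre-filled if the page already exists.
            path = PAGES_DIR / (self.page_name() + ".md")
            body = path.read_text() if path.exists() else ""
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(FORM.format(name=self.page_name(), body=html.escape(body)).encode())

        def do_POST(self):
            # Save the submitted Markdown straight to disk; no auth, one user.
            length = int(self.headers.get("Content-Length", 0))
            fields = parse_qs(self.rfile.read(length).decode())
            (PAGES_DIR / (self.page_name() + ".md")).write_text(fields.get("body", [""])[0])
            self.send_response(303)
            self.send_header("Location", self.path)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), WikiHandler).serve_forever()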

For linking, the functionality we want is the ability to define a link inside the wiki, and then have it automatically link to a “create a page” form if the page doesn’t exist, or to the page if it does.

The way to do that is to generate the page HTML for everything, crawl and collect metadata on all the pages, and then rewrite all the HTML links to point at the correct thing (external link, internal stub link, or internal page) before finally writing everything to disk (or database or wherever you think it should live).
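
To make that concrete, here's a sketch of the rewrite pass. It assumes a first pass already rendered every page to build/<name>.html and that internal links are written as bare page names; the /new?page=... stub route is made up for illustration:

    # Second pass: collect which pages exist, then rewrite every link to point
    # at the right place. The build/ layout and the stub route are assumptions.
    import re
    from pathlib import Path

    BUILD_DIR = Path("build")
    existing = {p.stem for p in BUILD_DIR.glob("*.html")}  # the metadata pass

    LINK = re.compile(r'href="([^"]+)"')

    def rewrite(match):
        target = match.group(1)
        if target.startswith(("http://", "https://")):
            return match.group(0)                  # external link: leave it alone
        if target in existing:
            return f'href="/{target}.html"'        # internal page that exists
        return f'href="/new?page={target}"'        # stub: link to a "create page" form

    for page in BUILD_DIR.glob("*.html"):
        page.write_text(LINK.sub(rewrite, page.read_text()))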

Similarly, we can use the HTML transform trick to handle things like downloading and inlining images for use in pages (done in my blog generator), discovering footnotes and appending them, and all sorts of other things.
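
For instance, inlining remote images can be another transform over the generated HTML, swapping each image URL for a data: URI so the page carries its own images. This uses only the standard library, and the deliberately naive <img> matching is just for illustration:

    # Find <img src="http..."> tags and inline the image bytes as a data: URI.
    # Error handling and content-type sniffing are kept deliberately naive.
    import base64
    import mimetypes
    import re
    from urllib.request import urlopen

    IMG = re.compile(r'<img src="(https?://[^"]+)"')

    def inline(match):
        url = match.group(1)
        data = urlopen(url, timeout=10).read()
        mime = mimetypes.guess_type(url)[0] or "application/octet-stream"
        b64 = base64.b64encode(data).decode("ascii")
        return f'<img src="data:{mime};base64,{b64}"'

    def inline_images(html_text):
        return IMG.sub(inline, html_text)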

The humble personal wiki itself, though, starts with those two basic features.

Blog generator

WordPress is the 800-lb gorilla of the blogging realm, closely followed by Medium. If you like hosting your own files, though (and you should, because the web is meant to be federated, but that's another story entirely), you've probably looked at Octopress or Jekyll or something. These platforms offer features like:

  • WYSIWYG or non-HTML authoring
  • Version control via git or whatever
  • Comment systems (hosted or embedded)
  • Analytics
  • Slug generation for page URLs
  • SEO support
  • Mobile support
  • Archive indices for finding posts
  • Tag systems for grouping or organizing posts
  • Multiple user support
  • Draft support
  • Embedded images, sound, and video content
  • Post versioning

Again, you can throw away almost all of that in order to get a small feature set that might take a day or two to make:

  • Markdown authoring
  • Slug generation for page URLs
  • Index page generation

Most of that can be handled really easily, and in fact you can get away with just bash and pandoc for the Markdown conversion.
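
If you'd rather sketch it in something you can grow later, the same pipeline fits in a screenful of Python that still shells out to pandoc for the conversion; the directory layout and slug rule here are assumptions:

    # Convert every post with pandoc, give each one a slug-based filename, and
    # build an index page linking to them all. Paths and the slug rule are
    # illustrative assumptions.
    import re
    import subprocess
    from pathlib import Path

    POSTS = Path("posts")    # one .md file per post
    SITE = Path("site")      # generated output
    SITE.mkdir(exist_ok=True)

    def slugify(title):
        # "Hello, World!" -> "hello-world"
        return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

    entries = []
    for post in sorted(POSTS.glob("*.md")):
        slug = slugify(post.stem)
        out = SITE / (slug + ".html")
        subprocess.run(["pandoc", str(post), "-o", str(out)], check=True)
        entries.append(f'<li><a href="{slug}.html">{post.stem}</a></li>')

    # The index page is just a list of links to every post.
    (SITE / "index.html").write_text("<ul>\n" + "\n".join(entries) + "\n</ul>\n")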

Something that will trip you up: once you want to create any sort of interesting indices, page headers or footers linking to other posts, tag indices, or whatever else, you'll discover that you need to actually write code to collect the metadata for each post and then use that to generate those other resources. That probably means figuring out how you want to attach metadata to a post. I use something similar to Octopress, where I store a YAML-like blob at the top of my file, but you could use a dedicated separate JSON or YAML document with the same filename as the post and a different extension, or whatever else.
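
Pulling that blob off the top of a post doesn't need a real YAML parser if you keep it to simple key: value lines. Here's a sketch, assuming a "---"-delimited header (the delimiter and key names are assumptions):

    # Split a post into its metadata blob and its body, assuming the post
    # starts with a "---"-delimited block of simple "key: value" lines.
    def read_post(text):
        meta, body = {}, text
        if text.startswith("---\n"):
            header, _, body = text[4:].partition("\n---\n")
            for line in header.splitlines():
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip()] = value.strip()
        return meta, body

    # Usage: meta, body = read_post(Path("posts/first-post.md").read_text())
    # and then meta.get("title"), meta.get("tags"), and so on.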

For comments, you might want to embed something like Disqus, but I've found that if you really need feedback, an email address is probably better.

Little web crawler

So, everybody needs a pet spider. Big commercial search engines like Google or crawlers like 80legs have a whole bunch of features, obviously:

  • Distributed crawling (because the web is enormous)
  • Archiving of crawls
  • Robots.txt handling
  • Throttling and polite re-crawling
  • Web app handling (those pesky JS pages)
  • Cycle detection
  • Redirect following
  • Malformed HTML support
  • Web interface for managing spiders
  • Multiple user support for managing spiders
  • Access control for users
  • Indexing of crawls to make searching possible
  • Searching crawls by relevance to a search term
  • Searching crawls by date
  • Error reporting when your spiders crawl onto something they shouldn’t and explode

And that’s all like table-stakes. As you can probably guess, a simple little pet spider doesn’t need to do nearly that much.

A pet spider probably wants to know how to:

  • Use a queue to manage crawls
  • Crawl (HTTP GET) a document
  • Perform some inspection on that document (extracting a field via CSS selector, adding to a full-text search engine like Elasticsearch, whatever)
  • Discover new and exciting links from that document to other documents and queue them for crawling
  • Ignore documents that have already been crawled, or have been crawled too recently
  • Throttle itself so its excited pursuit of its work doesn’t anger site admins
  • Handle basic failures like 404s or malformed documents (usually by exploding the spider and making a new one after flagging the crawl as troublesome)

This project probably wants a little database (say, SQLite or Postgres) for storing the queue and skipping redundant crawls, but you could conceivably just use a flat text file and some bash trickery to check for duplicates. An MVP for this might well look like a bash script with curl and pup and a bit of cleverness.
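
If you go the database route, here's a sketch of the crawl loop with SQLite holding the queue and nothing but the standard library doing the fetching. The schema, the seed URL, the one-second throttle, and the naive href regex are all assumptions; a real spider would want a proper HTML parser and robots.txt handling:

    # SQLite holds both the queue and the "already crawled" set; urllib fetches
    # pages; a naive regex discovers new links. All names here are assumptions.
    import re
    import sqlite3
    import time
    from urllib.parse import urljoin
    from urllib.request import urlopen

    db = sqlite3.connect("spider.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, crawled INTEGER DEFAULT 0)")
    db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", ("https://example.com/",))
    db.commit()

    HREF = re.compile(r'href="([^"#]+)"')

    while True:
        row = db.execute("SELECT url FROM pages WHERE crawled = 0 LIMIT 1").fetchone()
        if row is None:
            break                                    # queue is empty, we're done
        url = row[0]
        db.execute("UPDATE pages SET crawled = 1 WHERE url = ?", (url,))
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception as err:
            print(f"spider exploded on {url}: {err}")  # flag the crawl, move on
            db.commit()
            continue
        # Inspect the document here (extract fields, feed a search index, etc.),
        # then queue any newly discovered links we haven't seen before.
        for link in HREF.findall(page):
            absolute = urljoin(url, link)
            if absolute.startswith("http"):
                db.execute("INSERT OR IGNORE INTO pages (url) VALUES (?)", (absolute,))
        db.commit()
        time.sleep(1)                                # throttle: be polite to site admins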

Obviously, as time goes on, you’ll want to festoon your little spider with gear and features so it can have better luck on the World Wide Web, but making one to start with isn’t too gnarly.

Why do this?

I believe that almost every developer is skilled enough to tackle these sorts of things, if they only tried. We look at the big players online today and wonder how we can ever live in a world not dictated by their systems. Google doesn’t have to own search. Wikipedia doesn’t have to own knowledge. Medium doesn’t have to own self-indulgent screeds.

To make a better world, we need to empower users. To empower users, we need to educate them to be autonomous. To be autonomous, they have to seize the means of production.

My hope is that, by doing little projects like these, you can realize your potential to be autonomous, to one degree or another, from those orgs and feel capable of bringing the fire down to the masses and giving them a chance to be autonomous too.