
Minor Gripe

2021-04-14 -- Adventures in bookmark parsing

Chris Ertel

Introduction

I’m working on a bookmark manager, and one of the minor features I want is to import all my old Firefox bookmarks.

This adventure has been…an adventure.

At least one other person has tackled this problem over in Node, but I’m taking stabs at it in Elixir because that’s kind of my schtick.

On Mozilla exports

From Firefox you can have it export all of your bookmarks as an HTML file.

If you’re like me, doing this you might make some assumptions:

  1. This is a normal HTML file.
  2. Bookmarks are probably anchor elements.
  3. This will be well-formed.

If you’re like me, you’d be assuming wrong. :(

Let’s look at a somewhat chopped down file:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks Menu</H1>

<DL><p>
    <dt>...</dt>
    <dt>...</dt>
    <dt>...</dt>
    <dt>
        <h3>Bookmarks Toolbar</h3>
        <dl>
            <p>
            <dt>
                <h3>folder 1</h3>
                <dl>                    
                    <dt> <a href="..."> link 1</a>
                    <dt> <a href="..."> link 2</a>
                    <dt> 
                        <h3>sub folder 1</h3>
                        <dl>
                            <dt> <a href="..."> link 1</a>
                            <dt> <a href="..."> link 2</a>
                            <dt> ...
                        </dl>
                    <p>
                </dl>
                <p>
            </dt>
            <dt> ...
            <dt> <a href="..."> a root-level link</a>
        </dl>            
        <p>
    </dt>
</DL>

I apologize in advance if that isn’t 100% accurate–I chopped out things like ICON attributes that have giant base64 image chunks in them, and <a> attributes like ADD_DATE and LAST_MODIFIED. A file with a thousand-odd bookmarks comes out to around 13 MB for me, which is a bunch.

With that snippet in hand, let’s revisit those assumptions:

This is a normal HTML file.

If you’re like me and you reach for the normal parsing tool of Floki, you might then try something like:

File.read!(path) |> Floki.parse_document!() |> Floki.find("body > dt")

Unfortunately, that’ll return an empty list []. If you then show up on GitHub and ask questions without checking your work, you’ll look like a doofus.

Now, why doesn’t it work? Well, look at that DOCTYPE again:

<!DOCTYPE NETSCAPE-Bookmark-file-1>

In nearly a decade of experience–and maybe this is just me being provincial–I had basically never seen such a strange beastie.

The important thing about it is that, well…there’s no head tag and no body tag, and that breaks our attempt to find top-level dt elements.

Bookmarks are probably anchor elements.

This one panned out, thank god.

This will be well-formed.

See those weird little <p> tags scattered around? And how the <DT> tags don’t have closing counterparts? Argh.

On to parsing

So, links in a folder look like:

<dt>
    <a>some link</a>
<dt>

And folders look like:

<dt>
    <h3> Folder name</h3>
    <dl>
        <link>
        <folder>
    </dl>
    <p>
</dt>

Note that we’re kinda hand-waving the whole </dt> part here–both Floki and the browser seem to see what’s up.

Anyways, our first cut looks like:

def import_from_file(path) do
    file = File.read!(path) |> String.replace("<p>","") |> String.replace("</p>", "")
    
    root_nodes = Floki.parse_document!(file)
    [list_root] = root_nodes |> Enum.filter( fn
      {"dl", _attrs, _kids} -> true
      _ -> false
    end)

    tags = Floki.find(list_root, "h3") |> Enum.map( fn {_,_, [tag]} -> tag end )
    urls = Floki.find(list_root, "a")
    {:ok, tags, urls}
end

This pulls out the tags and URLs, but it doesn’t actually map the tags onto the URLs to form bookmarks. For our purposes, the tags for a bookmark are the names of its containing list and of every containing list up to the root–so a link in sub folder 1 under folder 1 gets both names as tags.

I’m still not happy about this, what else can we do?

Parsing, the next generation

Okay, so, my poor brain is getting really hung up on the dt tags and the p tags. What does our tree look like if we just, you know, nuke them?


<DL>
    ...
    ...
    ...
    <h3>Bookmarks Toolbar</h3>
    <dl>
        <h3>folder 1</h3>
        <dl>                    
            <a href="..."> link 1</a>
            <a href="..."> link 2</a>
            <h3>sub folder 1</h3>
            <dl>
                <a href="..."> link 1</a>
                <a href="..."> link 2</a>
                ...
            </dl>
        </dl>            
        ...
        <a href="..."> a root-level link</a>
    </dl>
</DL>

That looks a little easier to wrap our arms around, yeah? We can kinda induce these rules now:

  1. Every <dl> has only <h3>, <dl>, or <a> children.
  2. A <dl> (bookmark folder) is named after the sibling <h3> that precedes it.
  3. An <a> (bookmark) has as its tags the names of the chain of <dl> folders containing it.

Now, you might think we could just throw some CSS selectors at this problem and call it a day–I thought that too, and was wrong. Unfortunately, the CSS approach ends up out at the leaves of the document tree, and it also makes it hard to gather the tags.

So, instead, we need to write some actual traversal code. Our input is going to be the root <dl> document tree, and our output wants to be something like a list of bookmark tuples consisting of the active tags when we encountered them, the URL, and the display name.

We’re also going to want to clean out attributes we don’t care about, because in my case the icon attribute ends up with a bunch of base64 encoded noise that bloats memory and makes debugging annoying. Luckily, with some regexes we can sort all of that out prior to actually doing real parsing on the HTML:

def clean_html(html) do
    html
    |> String.replace(~r/<p>/i,"")
    |> String.replace(~r/<\/p>/i, "")
    |> String.replace(~r/<dt>/i,"")
    |> String.replace(~r/<hr>/i,"")
    |> String.replace(~r/icon=".*?"/i,"")
    |> String.replace(~r/icon_uri=".*?"/i,"")
    |> String.replace(~r/add_date=".*?"/i,"")
    |> String.replace(~r/last_modified=".*?"/i,"")
end

There’s probably some benefit to be had in collapsing the regexes into a single compound regex, but for now this is obvious and easy to hack on. It’s possible that I may take this opportunity to merge h3 tags into an attribute on dl tags, but I’m not sure I’m sold on that yet.
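
For the curious, a compound version might look something like this–a sketch, not something I’ve run against a full export, that folds the tag and attribute patterns into one alternation (with icon_uri tried before icon so the latter doesn’t half-match the former):

```elixir
# Single-pass variant of clean_html/1: one alternation instead of eight passes.
# The attribute names (ICON, ICON_URI, ADD_DATE, LAST_MODIFIED) are the ones
# called out above; anything else would need its own alternative.
defmodule CleanHtml do
  @noise ~r/<\/?p>|<dt>|<hr>|(?:icon_uri|icon|add_date|last_modified)=".*?"/i

  def clean_html(html), do: String.replace(html, @noise, "")
end

# Made-up one-line fragment in the export's style:
input =
  ~s(<DT><A HREF="https://example.com/" ADD_DATE="1618000000" ICON="data:image/png;base64,AAAA">Example</A>)

IO.puts(CleanHtml.clean_html(input)) # the tag and attribute noise is stripped
```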

Let’s go ahead and sketch in our file import function:

def import_from_file(path) do
    file = File.read!(path)
    clean_html = clean_html(file)
    root_nodes = Floki.parse_document!(clean_html)
                 |> Enum.filter( fn
                    {"dl",_,_} -> true
                    _ -> false
                  end)

    bookmarks = root_nodes |> Enum.map( &parse_node(&1,[])) |> List.flatten
    tags = bookmarks |> Enum.reduce(MapSet.new(), fn {tags, _url, _title}, tag_set ->
      MapSet.new(tags)
      |> MapSet.union(tag_set)
    end)
    {:ok, tags, bookmarks}
end

This function takes a file, and returns the tags and bookmarks in it. The procedure is:

  1. Load the file in
  2. Scrub out the HTML entities and attributes we don’t care about
  3. Throw away everything at the root level but dl nodes.
  4. Parse out the bookmarks from the dl nodes and their children
  5. Extract all unique tags (derived from the bookmark folders) from each bookmark.
  6. Return the tags and bookmarks.
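
Step 5 can be exercised on its own. Given bookmark tuples of the {tags, url, title} shape the parser hands back (the data here is made up), the unique-tag extraction is just a MapSet union fold:

```elixir
# Collect the unique tags across all bookmarks (step 5 above).
# The bookmark tuples here are hand-built examples with made-up URLs.
bookmarks = [
  {["folder 1"], "https://example.com/1", "link 1"},
  {["sub folder 1", "folder 1"], "https://example.com/2", "link 2"}
]

tags =
  Enum.reduce(bookmarks, MapSet.new(), fn {tags, _url, _title}, tag_set ->
    MapSet.new(tags)
    |> MapSet.union(tag_set)
  end)

IO.inspect(MapSet.to_list(tags)) # each folder name shows up exactly once
```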

The most interesting (and trickiest to develop) part of this probably is the mysterious parse_node/2 function. That is implemented as:

# goal here is to return a list of { [tag1, tag2, tag3...], url, label} tuples
def parse_node( {_,_,kids}, governing_tags) do
    %{bookmarks: bookmarks} =
        Enum.reduce(kids,
          %{tags: governing_tags, bookmarks: []},
          fn
            {"dl", _attrs, _kids} = knode, %{tags: tags, bookmarks: bookmarks} = state ->
              newbookmarks = parse_node(knode, tags)
              [_newtag | oldtags] = tags # we pop off the head tag added by the `h3` case
              %{state | bookmarks: newbookmarks ++ bookmarks, tags: oldtags}

            {"a", attrs, [title]}, %{tags: tags, bookmarks: bookmarks} = state ->
              {"href", url} = List.keyfind(attrs, "href", 0)
              %{state | bookmarks: [{tags, url, title} | bookmarks]}

            {"h3", _attrs, [label]}, %{tags: tags} = state ->
              %{state | tags: [label | tags]}
          end)

    bookmarks
end

(It probably helps to know that a node to Floki is a tuple of {element tag, element attributes, element child nodes or text nodes})

What this function does is, when given a dl node, traverses its children in order and then:

  1. If the kid is an a tag, add that as a bookmark using its display name, url, and whatever tags are active.
  2. If the kid is an h3 tag, add that to the front of our active tags list.
  3. If the kid is a dl tag, we have a few things to do:
     i. Recurse into its children using the tags we already have active (which will include the one added by the h3 that preceded us).
     ii. Pop off the top-level tag from our current parse state, because that was added by our h3 branch.
     iii. Return as the parser state the old state, but with the bookmarks from (i) added to our running collection of bookmarks.

And best of all, this little mechanism actually works! It’s just this weird recursive reductive state machine thingy.
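
If you want to poke at it without the rest of the project, here’s the traversal as a standalone sketch–the function body reproduced in its own module (with the multi-clause fn written out), run against a hand-built Floki-style tree of {tag, attributes, children} tuples with made-up URLs:

```elixir
# Standalone check of parse_node/2 on hand-built Floki-style nodes.
defmodule BookmarkDemo do
  def parse_node({_, _, kids}, governing_tags) do
    %{bookmarks: bookmarks} =
      Enum.reduce(kids, %{tags: governing_tags, bookmarks: []}, fn
        {"dl", _attrs, _kids} = knode, %{tags: tags, bookmarks: bookmarks} = state ->
          newbookmarks = parse_node(knode, tags)
          [_newtag | oldtags] = tags # pop the tag the `h3` clause pushed
          %{state | bookmarks: newbookmarks ++ bookmarks, tags: oldtags}

        {"a", attrs, [title]}, %{tags: tags, bookmarks: bookmarks} = state ->
          {"href", url} = List.keyfind(attrs, "href", 0)
          %{state | bookmarks: [{tags, url, title} | bookmarks]}

        {"h3", _attrs, [label]}, %{tags: tags} = state ->
          %{state | tags: [label | tags]}
      end)

    bookmarks
  end
end

# A tiny folder/sub-folder tree, URLs made up for the example:
tree =
  {"dl", [],
   [
     {"h3", [], ["folder 1"]},
     {"dl", [],
      [
        {"a", [{"href", "https://example.com/1"}], ["link 1"]},
        {"h3", [], ["sub folder 1"]},
        {"dl", [], [{"a", [{"href", "https://example.com/2"}], ["link 2"]}]}
      ]}
   ]}

IO.inspect(BookmarkDemo.parse_node(tree, []))
```

The sub-folder link comes back tagged with both folder names, which is exactly the chain-of-containing-folders behavior we wanted.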

Reflections

So, first, I’m not very happy with the file format. I had to throw out way too much stuff to a) see the form of the problem and b) make performance/memory usage better. A JSON blob would’ve been nicer, but unfortunately this is the format that both Chrome and Firefox seem to agree on.

Second, I don’t really like the parse_node contraption. I’m sure there’s a cleaner way of writing it, and it has a few assumptions deeply embedded in its structure–for example, if you feed it a dl without an immediately preceding h3, Christ only knows what will happen. I think my problem was that I started with a reduce and just kinda kept leaning into it. Then again, it was kinda their fault for giving me a format that requires information from adjacent trees to parse subtrees.

Third, and maybe this is the biggest thing for me:

This was really an exercise in the Pareto principle: I only cared about extracting the tags, display names, and URLs for my bookmarks. The code to actually fetch the names and URLs is teensy–you just pull all the anchor tags from the document.

Getting the tags, though, meant fighting both the spec and the parsing library. That’s what took a few dozen lines of code and many, many revisions. If I were shipping this as a project, I’d call it good for version 1.0 without tags, and then see if customers cared enough to add that.

Anyways, that was fun, and now my bookmarks manager is coming along nicely. :)

