I’m working on a bookmark manager, and one of the minor features I want is to import all my old Firefox bookmarks.
This adventure has been…an adventure.
At least one other person has tackled this problem over in Node, but I’m taking stabs at it in Elixir because that’s kind of my schtick.
From Firefox you can have it export all of your bookmarks as an HTML file.
If you’re like me, doing this you might make some assumptions:

- This is a normal HTML file.
- Bookmarks are probably anchor elements.
- This will be well-formed.

If you’re like me, you’d be assuming wrong. :(
Let’s look at a somewhat chopped down file:
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks Menu</H1>
<DL><p>
<dt>...</dt>
<dt>...</dt>
<dt>...</dt>
<dt>
<h3>Bookmarks Toolbar</h3>
<dl>
<p>
<dt>
<h3>folder 1</h3>
<dl>
<dt> <a href="..."> link 1</a>
<dt> <a href="..."> link 2</a>
<dt>
<h3>sub folder 1</h3>
<dl>
<dt> <a href="..."> link 1</a>
<dt> <a href="..."> link 2</a>
<dt> ...
</dl>
<p>
</dl>
<p>
</dt>
<dt> ...
<dt> <a href="..."> a root-level link</a>
</dl>
<p>
</dt>
</DL>
I apologize in advance if that isn’t 100% accurate: I chopped out things like ICON attributes (which hold giant base64 image chunks) and <a> attributes like ADD_DATE and LAST_MODIFIED. For me, a file with around a thousand bookmarks is something like 13 MB, which is a bunch.
With that snippet in hand, let’s revisit those assumptions:
This is a normal HTML file.
If you’re like me and you reach for the usual parsing tool, Floki, you might then try something like:
File.read!(path) |> Floki.parse_document!() |> Floki.find("body > dt")
Unfortunately, that’ll return an empty list ([]). And if you then show up on GitHub and ask questions without checking your work, you’ll look like a doofus.
Now, why doesn’t it work? Well, look at that DOCTYPE again:
<!DOCTYPE NETSCAPE-Bookmark-file-1>
In nearly a decade of experience (and maybe this is me just being provincial), I have basically never seen such a strange beastie.
The important thing about it is that, well…we don’t have a head tag or a body tag, and that breaks our attempt to find top-level dt elements.
Bookmarks are probably anchor elements.
This one panned out, thank god.
This will be well-formed.
See those weird little <p> tags scattered around? And how the <DT> tags don’t have closing counterparts? Argh.
So, links in a folder look like:
<dt>
<a>some link</a>
<dt>
And folders look like:
<dt>
<h3> Folder name</h3>
<dl>
<link>
<folder>
</dl>
<p>
</dt>
Note that we’re kinda hand-waving the whole </dt> part here; both Floki and the browser seem to see what’s up.
Anyways, our first cut looks like:
def import_from_file(path) do
  file = File.read!(path) |> String.replace("<p>", "") |> String.replace("</p>", "")
  root_nodes = Floki.parse_document!(file)

  [list_root] =
    Enum.filter(root_nodes, fn
      {"dl", _attrs, _kids} -> true
      _ -> false
    end)

  tags = Floki.find(list_root, "h3") |> Enum.map(fn {_, _, [tag]} -> tag end)
  urls = Floki.find(list_root, "a")
  {:ok, tags, urls}
end
This pulls the tags and urls, but it doesn’t really address actually mapping the tags onto the urls for the bookmarks. For our purposes, the tags for a bookmark are the names of its containing list and the names of all containing lists up to the root.
I’m still not happy about this, what else can we do?
Okay, so, my poor brain is getting really hung up on the dt tags and the p tags. What does our tree look like if we just, you know, nuke them?
<DL>
...
...
...
<h3>Bookmarks Toolbar</h3>
<dl>
<h3>folder 1</h3>
<dl>
<a href="..."> link 1</a>
<a href="..."> link 2</a>
<h3>sub folder 1</h3>
<dl>
<a href="..."> link 1</a>
<a href="..."> link 2</a>
...
</dl>
</dl>
...
<a href="..."> a root-level link</a>
</dl>
</DL>
That looks a little easier to wrap our arms around, yeah? We kinda induct these rules now:

- A <dl> has either <h3>, <dl>, or <a> children.
- A <dl> (a bookmark folder) is named after the sibling <h3> that precedes it.
- An <a> (a bookmark) has the tags of the chain of <dl>s containing it.

Now, you might think we could just throw some CSS selectors at this problem and call it a day. I thought that too, and was wrong. Unfortunately, the CSS approach ends up out at the leaves of the document tree, and it also makes it hard to gather the tags.
So, instead, we need to write some actual traversal code. Our input is going to be the root <dl> of the document tree, and our output wants to be something like a list of bookmark tuples consisting of the tags that were active when we encountered the bookmark, the URL, and the display name.
We’re also going to want to clean out attributes we don’t care about, because in my case the icon
attribute ends up with a bunch of base64 encoded noise that bloats memory and makes debugging annoying. Luckily, with some regexes we can sort all of that out prior to actually doing real parsing on the HTML:
def clean_html(html) do
  html
  |> String.replace(~r/<p>/i, "")
  |> String.replace(~r/<\/p>/i, "")
  |> String.replace(~r/<dt>/i, "")
  |> String.replace(~r/<hr>/i, "")
  |> String.replace(~r/icon=".*?"/i, "")
  |> String.replace(~r/icon_uri=".*?"/i, "")
  |> String.replace(~r/add_date=".*?"/i, "")
  |> String.replace(~r/last_modified=".*?"/i, "")
end
There’s probably some benefit to be had in changing the regexes into a compound regex, but for now this is obvious and easy to hack on.
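For what it’s worth, a compound version might look something like this. This is a sketch with a made-up input string; the alternation just mirrors the patterns above, and leans on the regex engine to pick the right `icon`/`icon_uri` branch:

```elixir
# One alternation instead of eight passes; the patterns mirror clean_html/1.
noise = ~r/<\/?p>|<dt>|<hr>|(?:icon_uri|icon|add_date|last_modified)=".*?"/i

clean = fn html -> String.replace(html, noise, "") end

clean.(~s(<DT><A HREF="http://example.com" ADD_DATE="123">hi</A><p>))
# → ~s(<A HREF="http://example.com" >hi</A>)
```

Whether the single pass is actually faster on a 13 MB file is something you’d want to measure before committing to the less readable form.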
It’s possible that I may take this opportunity to merge h3 tags into an attribute on dl tags, but I’m not sure I’m sold on that yet.
Let’s go ahead and sketch in our file import function:
def import_from_file(path) do
  file = File.read!(path)
  clean_html = clean_html(file)

  root_nodes =
    Floki.parse_document!(clean_html)
    |> Enum.filter(fn
      {"dl", _, _} -> true
      _ -> false
    end)

  # recursively walk each root <dl>, collecting {tags, url, title} tuples
  bookmarks = root_nodes |> Enum.map(&parse_node(&1, [])) |> List.flatten()

  # gather the distinct set of tags seen across all bookmarks
  tags =
    Enum.reduce(bookmarks, MapSet.new(), fn {tags, _url, _title}, tag_set ->
      MapSet.new(tags) |> MapSet.union(tag_set)
    end)

  {:ok, tags, bookmarks}
end
This function takes a file, and returns the tags and bookmarks in it. The procedure is:

1. Clean the raw HTML of the tags and attributes we don’t care about.
2. Parse it and keep only the root-level dl nodes.
3. Recursively parse those dl nodes and their children into bookmark tuples.
4. Reduce the bookmarks down to the distinct set of tags seen.
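The tag-gathering reduce at the end of import_from_file can be poked at in isolation. Here’s a sketch with made-up bookmark tuples, shaped the way parse_node/2 produces them:

```elixir
# Made-up {tags, url, title} tuples, shaped like parse_node/2's output.
bookmarks = [
  {["elixir", "dev"], "http://a.example", "link 1"},
  {["dev"], "http://b.example", "link 2"},
  {[], "http://c.example", "link 3"}
]

# Union each bookmark's tag list into one distinct set.
tags =
  Enum.reduce(bookmarks, MapSet.new(), fn {tags, _url, _title}, tag_set ->
    MapSet.new(tags) |> MapSet.union(tag_set)
  end)

Enum.sort(MapSet.to_list(tags))
# → ["dev", "elixir"]
```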
The most interesting (and trickiest to develop) part of this is probably the mysterious parse_node/2 function. That is implemented as:
# goal here is to return a list of {[tag1, tag2, tag3...], url, label} tuples
def parse_node({_, _, kids}, governing_tags) do
  %{bookmarks: bookmarks} =
    Enum.reduce(kids, %{tags: governing_tags, bookmarks: []}, fn
      {"dl", _attrs, _kids} = knode, %{tags: tags, bookmarks: bookmarks} = state ->
        newbookmarks = parse_node(knode, tags)
        # pop off the head tag added by the `h3` case
        [_newtag | oldtags] = tags
        %{state | bookmarks: newbookmarks ++ bookmarks, tags: oldtags}

      {"a", attrs, [title]}, %{tags: tags, bookmarks: bookmarks} = state ->
        {"href", url} = List.keyfind(attrs, "href", 0)
        %{state | bookmarks: [{tags, url, title} | bookmarks]}

      {"h3", _attrs, [label]}, %{tags: tags} = state ->
        %{state | tags: [label | tags]}
    end)

  bookmarks
end
(It probably helps to know that a node, to Floki, is a tuple of {element tag, element attributes, element child nodes or text nodes}.)
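Concretely, a bookmark anchor comes back shaped like this (the values here are made up):

```elixir
# {tag name, attribute list, child nodes or text}
node = {"a", [{"href", "http://example.com"}], ["Example"]}

# Destructuring it gives us everything a bookmark needs:
{tag, attrs, [title]} = node
{"href", url} = List.keyfind(attrs, "href", 0)
# tag == "a", url == "http://example.com", title == "Example"
```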
What this function does is, when given a dl node, traverse its children in order and then:

- If the child is an a tag, add it as a bookmark using its display name, url, and whatever tags are active.
- If the child is an h3 tag, push its label onto the front of our active tags list.
- If the child is a dl tag, we have a few things to do:
  i. Recurse into its children using the tags we already have active (which will include the one activated by the h3 that preceded it).
  ii. Pop the top tag off our current parse state, because that was added by our h3 branch.
  iii. Return as the parser state the old state, but with the bookmarks from (i) added to our running collection of bookmarks.

And best of all, this little mechanism actually works! It’s just this weird recursive reductive state machine thingy.
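To convince ourselves the traversal behaves, we can feed parse_node/2 a hand-built tree of Floki-style tuples; no HTML parsing library needed. The folder names and URLs below are made up, and the function is repeated inside a module so the sketch is self-contained:

```elixir
defmodule BookmarkWalk do
  # Same traversal as parse_node/2 above: walk a cleaned <dl> tree and
  # return {[tag, ...], url, label} tuples.
  def parse_node({_, _, kids}, governing_tags) do
    %{bookmarks: bookmarks} =
      Enum.reduce(kids, %{tags: governing_tags, bookmarks: []}, fn
        {"dl", _attrs, _kids} = knode, %{tags: tags, bookmarks: bookmarks} = state ->
          newbookmarks = parse_node(knode, tags)
          # pop off the head tag added by the `h3` case
          [_newtag | oldtags] = tags
          %{state | bookmarks: newbookmarks ++ bookmarks, tags: oldtags}

        {"a", attrs, [title]}, %{tags: tags, bookmarks: bookmarks} = state ->
          {"href", url} = List.keyfind(attrs, "href", 0)
          %{state | bookmarks: [{tags, url, title} | bookmarks]}

        {"h3", _attrs, [label]}, %{tags: tags} = state ->
          %{state | tags: [label | tags]}
      end)

    bookmarks
  end
end

# A cleaned tree: <dt>/<p> already stripped, so only h3/dl/a remain.
tree =
  {"dl", [],
   [
     {"h3", [], ["folder 1"]},
     {"dl", [],
      [
        {"a", [{"href", "http://a.example"}], ["link 1"]},
        {"h3", [], ["sub folder 1"]},
        {"dl", [], [{"a", [{"href", "http://b.example"}], ["link 2"]}]}
      ]},
     {"a", [{"href", "http://c.example"}], ["a root-level link"]}
   ]}

bookmarks = BookmarkWalk.parse_node(tree, [])
# link 2 picks up both folder tags; the root-level link gets none
```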
So, first, I’m not very happy with the file format. I had to throw out way too much stuff to a) see the form of the problem and b) make performance/memory usage better. A JSON blob would’ve been nicer, but unfortunately this is the format that both Chrome and Firefox seem to agree on.
Second, I don’t really like the parse_node contraption. I’m sure there is a cleaner way of writing it, and it has a few assumptions deeply embedded in its structure; for example, if you feed it a dl without an immediately preceding h3, Christ only knows what will happen. I think my problem was that I started with a reduce and just kinda kept leaning into it. Then again, it was kinda their fault for giving me something to parse that requires information in adjacent trees to parse subtrees.
Third, and maybe this is the biggest thing for me:
This was really an exercise in the Pareto principle: I only cared about extracting the tags, display names, and URLs for my bookmarks. The code to actually fetch the names and URLs is teensy: you just pull all the anchor tags from the document.
Getting the tags, though, meant fighting both the spec and the parsing library. That’s what took a few dozen lines of code and many, many revisions. If I were shipping this as a project, I’d call it good for version 1.0 without tags, and then see if customers cared enough to add that.
Anyways, that was fun, and now my bookmarks manager is coming along nicely. :)