Why HTML
June 7th, 2008 by DeWitt Clinton

Short post here …

The thread started by Elliotte Rusty Harold (super smart guy, and a colleague of mine) called Why XHTML is provoking a number of intelligent and articulate responses. Here’s my take:

I used to be firmly in the XHTML camp, but now I’m not. I’m not against people outputting valid XHTML instead of HTML, probably they should. Unless it is too hard. I just don’t think that it matters all that much in practice. My rationalization for this is pretty straightforward, and goes like this:

The web in made up of a nearly infinite variety of html-ish documents, some valid xhtml, some valid html, but mostly just almost-valid almost-html. So if you are writing a client for the web as it is, you will need to accept the most liberal input possible because a client that throws errors when it sees a missing '/>' symbol isn’t a very useful client at all.

So take the best possible case: A client sees a document, and the server declares that document’s mime type to be application/xhtml+xml or the document declares one of the xhtml doctypes, so the client fires up a validating XML parser, finds the document to be valid XML, outputs a XML tree (nothing html specific here), then walks that tree according to the most state-of-the art html heuristics (e.g., the html5 grammar), and renders the document accordingly (the hard and time-consuming part).

That’s the best case.

The worse case is: A client sees a document, and the server doesn’t bother to set a mime-type at all (or sets the wrong one) and/or doesn’t declare an xhtml doctype, and the client starts walking the characters in the document according to the most state-of-the art html heuristics (e.g., the html5 grammar), and renders the document accordingly (the hard and time-consuming part).

The difference between the best and the worst case is slight enough, and I’m not particularly convinced that the cost of running the first pass through a validating XML parser and creating a XML parse tree buys you much benefit when starting to parse the interesting stuff, the html itself.

But the real clincher is that the best case scenario is fleetingly unlikely. There are three possible outcomes here: 1) either the document isn’t declared to be xhtml at all so you fall through to the worse case, 2) the document is declared to be valid xml so you start parsing it only to find out it was invalid, so you start over again with the worst case, or 3) the document is valid xml and you have the best case scenario.

In practice on the real web, cases 1 and 2 are so much more likely that if you’re a pragmatic client you may as well start with the worst case and never bother parsing the document for the best case in the first place.

Now if you’re a document author, and you know that pragmatic clients are going to interpret your document according to the worst case rules no matter what you do, your only real incentive for writing xhtml over html is one of personal preference.

I can see some cases with machine-generated documents that emitting xhtml is actually just as easy as emitting html (say, if you are building up a DOM tree to represent your content anyway), or if you need to round-trip within a closed ecosystem of clients. But as soon as humans are involved in the production of the document, given the axiom that no human should ever have to write xml by hand, you are almost certainly going to be producing html, not valid xhtml.

Ultimately it comes down to the human element. People write the content and people will always write it using non-validating formats, if for no other reason that writing compliant documents is hard, and people have better things to do with their time than check for missing '/>' symbols. (And moreover, people also write the software that generates xhtml, and that software is often buggy and produces invalid documents anyway.)

My conclusion is that if you can produce valid xhtml, go for it. But in the end it doesn’t make much of a difference. In fact, I’m pretty certain that the web itself wouldn’t have succeeded if xhtml was required from the beginning, because a web that renders and displays documents is a much better web than one that throws validation errors all over the place.

(And all this, coming from a former xhtml guy…)

Footnote, a few of the other axioms that inform my thinking on this:

  1. HTML is intended to be written by humans (by hand).
  2. HTML is intended to be displayed to humans.
  3. Machines that want to interpret HTML need to act like humans, and not the other way around.
Fibonacci Functions
May 20th, 2008 by DeWitt Clinton

Tim Bray recently wrote in a short post entitled Language-Book Principles:

I am getting really, really bored with factorial and Fibonacci algorithms. It is really remarkably infrequent that I implement any code that looks much like either.

To which I responded:

Ah, but they do matter! You can tell a *lot* about how a language works by looking at those simple examples. Does it encourage and/or optimize for tail-recursion? Does the language use strong typing / type annotations / type inference? Parameter pattern matching? Does it feel functional vs. imperative? Compare a Fibonacci implementation in Lisp to ML to C to Java to Scala and the differences are immediately apparent.

Here are some examples, courtesy of LiteratePrograms, slightly modified by me in a few cases. Any bugs introduced are my fault entirely, I’m sure.

A trivial, elegant, and painfully overflowing recursive implementation in C:

unsigned int fib(unsigned int n)
{
  return n < 2 ? n : fib(n-1) + fib(n-2);
}

Again in C, but using iteration instead of recursion:

unsigned long fib(unsigned long n)
{
  unsigned long f0 = 0;
  unsigned long f1 = 1;
  unsigned long result = 0;
  if (n == 0) return 0;
  while (n--) {
    result = f1;
    f1 = f0 + f1;
    f0 = result;
  }
  return result;
}

A tail recursive Fibonacci generator in Lisp:

(defun fib-trec (n)
  (labels ((calc-fib (n a b)
                     (if (= n 0)
                         a
                       (calc-fib (- n 1) b (+ a b)))))
          (calc-fib n 0 1)))

(GNU Clisp 2.43 on my OSX box evals (fib-trec 5000) instantly, so I guess it must be optimizing the tail call.)

A marginally cleaner but otherwise identical version in Scheme:

(define (fib n)
  (letrec ((fib-aux (lambda (n a b)
    (if (= n 0)
        a
        (fib-aux (- n 1) b (+ a b))))))
  (fib-aux n 0 1)))

A succinct and iterative approach in Python:

def fib(n):
  a, b = 0, 1
  for i in xrange(n):
    a, b = b, a + b
  return a

The same idea, but using generators:

def fib():
  a, b = 0, 1
  while True:
    a, b = b, a + b
    yield a

And again in Python, this time using memoization:

def fib(n):
  memo = {0:0, 1:1}
  if not n in memo:
     memo[n] = fib(n-1) + fib(n-2)
  return memo[n]

(You can move the declaration of memo to speed it up on successive calls.)

A limited precision (32 or 64 bit) implementation in OCaml:

let rec i_fib a b = function
| 0 -> a
| 1 -> b
| k when k > 1 -> i_fib b (a+b) (k-1)
| _  -> raise (Invalid_argument)

(If you don’t like the ML then you have no soul.)

Same thing, this time in Haskell:

fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n-1) + fib (n-2)

(Can you tell I’m a fan of parameter pattern matching and type annotation/inference?)

A very pretty tail recursive implementation in Scala:

def fib(n:Int) = fib_tr(n, 1, 0) 
 
def fib_tr(n: Int, b: Int, a: Int): Int = n match {
  case 0 => a 
  case _ => fib_tr(n-1, a + b, b)
}

(Okay, I may have just fallen in love with Scala.)

And lastly, for Tim, a fast implementation in Erlang:

fibo3(N) ->
    {Fib, _} = fibo3(N, {1, 1}, {0, 1}),
    Fib.
 
fibo3(0, _, Pair) -> Pair;
fibo3(N, {Fib1, Fib2}, Pair) when N rem 2 == 0 ->
    SquareFib1 = Fib1*Fib1,
    fibo3(N div 2, {2*Fib1*Fib2 - SquareFib1, SquareFib1 + Fib2*Fib2}, Pair);
fibo3(N, {FibA1, FibA2}=Pair, {FibB1, FibB2}) ->
    fibo3(N-1, Pair, {FibA1*FibB2 + FibB1*(FibA2 - FibA1), FibA1*FibB1 + FibA2*FibB2}).

I see great beauty in all* of these.

*Well, except for the Erlang, which I find to be pathological.

Creating a HTML “friends” page from a Google Reader subscription list
March 23rd, 2008 by DeWitt Clinton

Google’s Social Graph API crawls the web and extracts publicly available relationship data (edges) for people on various public pages marked up with XFN or FOAF metadata (nodes). Like many others, I have accounts on Twitter, Flickr, FriendFeed, etc., which are all great sources to crawl for social graph edge relationships, but a number of interesting relationships were still hidden inside my private Google Reader subscription list.

This short guide demonstrates how I extract that information and publish it publicly via plain HTML decorated with XFN markup. (See an example.)

The process involves using Google Reader to manage your blogroll to 1) give a common label to the people whose blogs you read, 2) name those subscriptions appropriately, 3) share those subscriptions publicly, 4) find your public user id, and finally, 5) write a script that reads the subscription list and republishes it as HTML on your own site.

Step One: Using the “Manage Subscriptions »” link at the bottom of the left-hand panel in Google Reader, create a new folder called “people”. You can call this “friends”, “blogroll”, whatever — the important thing is that this label is used to tag every personal blog you want in your public blogroll. Note that these are blogs written by individual people, not collections of people.

Step Two: Next, go back over each of these subscriptions and use the “Rename” button to set the name of the blog to be the name of the author. Some blog feeds are already published this way by their author. Others, such as John Gruber’s Daring Fireball, set the blog title to the name of the site, not the person. In those cases, go back and rename the feed “John Gruber”, or whatever.

Step Three: Using the “Settings > Tags” menu, find your new “people” label and toggle the “private” sharing status to “public”. Doing so will reveal a number of new features, such as “view public page”, “email a link”, “add a clip to your site”, and “add a blogroll to your site”. These are all interesting features, and something you might want to explore more fully later, but in our case we’re doing something a little different.

Step Four: On the same “Settings > Tag” screen, look at the URL for the “view public page” link. It should read something like:

http://www.google.com/reader/shared/user/16671002588179970970/label/people

That 20 digit number is your public id. Jot this down — you’ll need this for the next step.

Step Five: This is the hardest part, in that it involves a bit of programming (or the ability to cut and paste and modify python code), and a server that you can run scripts on.

In this example, we’ll be using a python script to take the JSON-formatted output provided by Google Reader to build a data structure composed of dicts and sequences with the help of the simplejson library, and subsequently using Django templates to render this data as HTML.

You could do the same with PHP (perhaps directly inside Wordpress or the like), or Perl, or any language, but this simple example does it as a stand-alone Python script.

The following python code snippets (line breaks added for clarity) detail how to poll Google Reader and generate the XFN markup.

First, the Google Reader JSON output for your public subscription list is available at:

SUBSCRIPTION_URL = 
    'http://www.google.com/reader/public/javascript-sub/user/%s/label/%s'

Reading and parsing the JSON data is done with the following snippet:

  url = SUBSCRIPTION_URL % (user_id, label)
  json = urllib2.urlopen(url)
  data = simplejson.load(json)

The HTML is generated using Django templates. The template itself is simply a multiline string:

TEMPLATE = '''<!DOCTYPE html
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Jane Smith's Friends</title>
    <link rel="me" href="http://example.com/"/>
  </head>
  <body>
    <p class="vcard">
      <a href="http://example.com/" class="url" rel="me">
	<span class="fn">Jane Smith</span>
      </a>\'s friends and reading list:</p>
    <ul>
      {% for item in items %}
        <li class="vcard">
          <a href="{{ item.alternate.href }}" class="url" rel="friend">
            <span class="fn">{{ item.title }}</span>
          </a>
        </li>
      {% endfor %}
    </ul>
  </body>
</html>
'''

That template is parsed and interpreted by Django using the following code:

  template = django.template.Template(TEMPLATE)
  context = django.template.Context(data)
  html = template.render(context)

And then saved to disk with:

  out = open(output, mode='w')
  out.write(html)
  out.close()

Putting all of those snippets together, with the proper error handling, imports, main method, etc., you end up with a very short (96 line) script that can read your subscription list and write it out as XFN decorated XHTML.

The full script that I run on unto.net is available at:

http://static.unto.net/reader-subscriptions-to-html.py

I saved this to my ~/bin/ directory and run it every 15 minutes with via a cron job:

*/15 * * * * /home/dewitt/bin/reader-subscriptions-to-html.py 
    16671002588179970970 people 
    --output /var/www/dewitt/friends/index.html

So how does it look as HTML?

The page at dewitt.unto.net/friends is generated once every 15 minutes with the script. As you can see, it includes XFN markup that signal relationship edges between nodes that I control (dewitt.unto.net, blog.unto.net, twitter.com/dewitt, etc.) and the nodes that are controlled by people I subscribe to using Google Reader (daringfireball.net, brad.livejournal.com, etc.).

And how does it look to the social graph api?

According to one of the social graph api sample applications that explores site connectivity, it looks pretty darn interesting.

Are you doing anything interesting with the social graph API? Have any ideas of more ways it should be crawling for public relationship data? Let me know in the comments, and be sure to join the mailing list.

Adding OpenSearch to FriendFeed
March 21st, 2008 by DeWitt Clinton

A quick 30 second post because it’s hard to add markup to FriendFeed comments…

In response to this guide on adding FriendFeed search to your browser, here’s the 30 second guide for the FriendFeed guys on how to do it on their own site (not that they need it).

First, download this file and serve it as http://friendfeed.com/opensearch.xml.

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>FriendFeed</ShortName>
  <Description>Search FriendFeed.</Description>
  <Tags>friendfeed</Tags>
  <Contact>admin@friendfeed.com</Contact>
  <Url type="text/html"
       template="http://friendfeed.com/search?q={searchTerms}"/>
  <Url type="application/atom+xml"
       template="http://friendfeed.com/search?q={searchTerms}&amp;format=atom"/>
  <LongName>FriendFeed Search</LongName>
  <Image height="16" width="16" type="image/vnd.microsoft.icon">http://friendfeed.com/favicon.ico</Image>
  <Query role="example" searchTerms="friendfeed" />
  <Developer>FriendFeed</Developer>
  <Attribution>
    Copyright 2008, FriendFeed, Inc., All Rights Reserved
  </Attribution>
  <SyndicationRight>open</SyndicationRight>
  <AdultContent>false</AdultContent>
  <Language>en-us</Language>
  <OutputEncoding>UTF-8</OutputEncoding>
  <InputEncoding>UTF-8</InputEncoding>
</OpenSearchDescription>

Next, add this snippet to the <head/> section of all pages on FriendFeed:

<link rel="search"
           type="application/opensearchdescription+xml"
           href="http://friendfeed.com/opensearch.xml"
           title="FriendFeed Search" />

And you’re done. But let’s talk again once you syndicate search over Atom or RSS and we’ll add that, too.

Update: Added the <Url/> element for Atom based search results per Bret’s comment below.

Server migration
February 11th, 2008 by DeWitt Clinton

Users of either the OpenSearch interface to the Amazon catalog at aws.unto.net or the click-tracking bookmark manager at delancey.unto.net may experience a bit of inconsistent state while DNS changes propagate this morning. I moved those services from the virtual machine (hosted by the superstars at John Companies), to the newer instance that also hosts this blog (blog.unto.net), a wiki (wiki.unto.net), and a simple identity page (dewitt.unto.net). I’ll be moving the opensearch.org wiki server over to the new instance shortly as well. In the process I also upgraded the new machine to Fedora 7 via a somewhat risky in-place yum-based upgrade. Please let me know in the comments if you have any problems.

On the coding front I spent the weekend working on a new blog engine for unto.net, similar in spirit to Blosxom/PyBlosxom. (Oh the irony.) While I still appreciate the convenience that WordPress offers, I’m not a fan at all of the new admin console design in WP trunk, and would really prefer a backend that stores posts in a transparent version control system. Opaque CMS datastores, such as the MySQL backend for Wordpress, are my nemesis these days…

43 mini-reviews of drm-free and (mostly) independent albums
January 1st, 2008 by DeWitt Clinton

I purchased copies of, and unless the RIAA succeeds in their creative but wrongful re-imagining of copyright law, the right to fairly use, 43 music albums in 2007. Needless to say every single one was purchased on DRM-free media — high-bitrate digital MP3s when available, CD when not.

While buying 43 albums/year is well off my ten-year average of 2.44 albums a week, one of the things that caught my attention was that over the past year 84% of the albums were released artists on independent labels, i.e. labels that are not affiliated with the RIAA. Perhaps even the RIAA might start taking note when customers who spend hundreds (or thousands) of dollars a year on music start giving the vast majority of that to independent artists.

The following is a list of those 43 albums, complete with links to places you can buy them (DRM-free, of course) and links to some external reviews. (The Amazon links have my referral code; if that bothers you then please go to Amazon directly.)

I’ve also included my own one-sentence review of each album, as well as a restaurant review-style “star” rating. Note that these ratings are not done on a linear 1 to 5 scale. Rather here each star representing something an album needs to earn. To put it in perspective, I can only think of a few albums ever released that would get 5 stars, and since I previewed everything before I bought it, each should have at least 1 star (in theory).

An aside before I start: the state of music search is abysmal. I spent hours upon hours writing Python scripts to help automate the generation of this list. Amazon was the only site that provided a search API at all, and even that needs some work. In the end I had the best luck using Google search, even to find things on Emusic or Bleep or AllMusic.

If anyone ever wants to fix the situation with music search please drop me an email.

But a big thank you to my friends at RIAA Radar — keep up the good work!

And with that, here are the 43 albums I bought in 2007: