FeedSee

Ned Batchelder's blog Web Feed

Sun May 9 22:41:58 EDT 2010
Home: http://nedbatchelder.com/blog
Feed: http://www.nedbatchelder.com/blog/rss.xml

Fossilized hack-arounds -

A few weeks ago, we had a baffling problem with our web application: some JSON responses were being gzipped incorrectly. I asked about it on Server Fault: Incorrect gzipping of http requests, can’t find who’s doing it.

The final resolution was that Akamai was gzipping the response and adding a "Content-Encoding: gzip" header. But we'd already put in a "Content-Encoding: identity" header. The browser saw both, only attended to the first (identity), and couldn't interpret the gzipped gibberish it saw in the content.

It turns out we aren't supposed to use "Content-Encoding: identity" on responses, and removing that from our JSON code solved the problem.
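The fix amounts to never emitting an explicit identity encoding. Here is a minimal Python 3 sketch of that rule, assuming response headers are held as a list of (name, value) tuples. The helper name drop_identity_encoding is hypothetical; it is not part of our application or of any framework.

```python
def drop_identity_encoding(headers):
    """Return a copy of headers without any 'Content-Encoding: identity' entry.

    `headers` is assumed to be a list of (name, value) string tuples.
    """
    cleaned = []
    for name, value in headers:
        is_identity = (
            name.lower() == "content-encoding"
            and value.strip().lower() == "identity"
        )
        if not is_identity:
            cleaned.append((name, value))
    return cleaned
```

With the identity header gone, an intermediary that compresses the body and adds its own "Content-Encoding: gzip" header leaves the browser with only one encoding to read.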

But there was a mystery remaining: Akamai also adds an "X-N: S" header to the response. What the heck is that?

A friend has friends at Akamai, and sent them the question. Back came the answer:

A long time ago, when there was a browser called "Netscape", :-) there was a bug that prevented embedded images from rendering if the HTTP headers were exactly some length. (If the terminating \r\n begins on character 256, 257, or 258.) So if the header size is in this range the Akamai server adds that header...

Wow, talk about bug workarounds encased in amber. That's a really old bug and code is still trying to sidestep it. Looking on Google, it looks like other web intermediaries are also adding headers to fix it: Apache used to send X-Pad, and WebSTAR sent X-BrowserAlignment.
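The rule in Akamai's answer can be sketched in a few lines of Python 3. This is only an illustration of the idea, not Akamai's code. It assumes "character 256" is counted from 1 starting at the first byte of the status line, and the function name build_header_block is made up for the example.

```python
# Assumption: positions are 1-based, counted from the start of the status line.
BAD_POSITIONS = (256, 257, 258)
PAD_HEADER = b"X-N: S\r\n"


def build_header_block(lines):
    """Join header lines (bytes, without line endings) into a full header block."""
    block = b"".join(line + b"\r\n" for line in lines)
    # The terminating blank line's "\r\n" would start right after the block.
    terminator_position = len(block) + 1
    if terminator_position in BAD_POSITIONS:
        # Eight extra bytes push the terminator past all three bad positions.
        block += PAD_HEADER
    return block + b"\r\n"
```

Because the pad header is eight bytes long and the bad range is only three positions wide, adding it once is always enough to move the terminator out of the danger zone.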

I doubt the affected browser is even out there in the wild anymore, but Akamai is still adding this header to responses, plugging away a decade later. It's astounding to think of the labyrinth of special checks and bug adaptations in software like this, the extra cycles expended in the name of obsolete components that are no longer even listening on the other end.

The problem of course is that once you've added code like this, how can you be sure it's safe to remove? Who's even checking over the code to consider that it might be safe? Accommodations like this get in the code and generally never come out, though Apache removed theirs.

One last micro-mystery: what do the N and S mean in "X-N: S"? I'm betting on "Netscape Sucks"!



The case of the secured server -

When Tabblo was being acquired by HP, we had a bunch of different HP people talking to us about all different aspects of the acquisition. The security guys were a special treat.

We were in a small rented office in Cambridge, and were going to move to a large existing HP facility in Marlborough. But we still had a month or so of being in the Cambridge office, and the security guys wanted to make sure everything was locked down. Our founder Antonio spent a couple of hours on the phone with an HP guy from Australia who wanted all sorts of details about the physical security of the office: when entering from the street, how many different doors are there? How many are secured by locks? What kind of locks? When you get off the elevator, are there ceiling tiles above you? Could you lift the tiles and climb into the offices, etc, etc, etc. We joked that this guy was going to pop out of an air duct as a surprise visit some day.

Ultimately, though, the thing that concerned the security guys the most was our Subversion server. We were a tight-fisted startup, conserving money. That meant our office space was the cheapest dump we could find, and our furniture was from Ikea. Our Subversion server was some junky Dell desktop machine stuck in the phone closet.

The security guys were very worried about the safety of this server, more than anything else we had. They asked us how we were going to move the server to the new office.

Us: "We'll put it on the van."

Them: "You can't do that."

Us: "We'll take it in one of our cars?"

Them: "Nope."

They insisted that we use a bonded mover (whatever that is) to move the server. Everything else in the office would be loaded up by regular movers onto vans and driven the 30 minutes or so to the new offices. But the server had to be moved by a bonded mover. You know, for security.

For some reason, we couldn't get a bonded mover for the day of the big move. So the regular movers came and took everything else, and left the server for later.

The next day was our first in the Marlborough office, and much of the morning was taken up with orientation, tours, unpacking and so on. I got there late, and when I did, everyone was all abuzz: the Subversion server wasn't reachable, did I know anything about it? I didn't, and calls were placed to the Cambridge landlord. Security tapes were inspected, theories abounded.

Turns out, the building janitor had taken the server.

Of course he did: put yourself in his place. He comes to the office to clean, sees that the tenants have moved out and taken absolutely everything with them, except for an old crappy computer in the phone closet. They must not have wanted it.

As it happens, we knew the janitor closet in the old building had a child's pink bike in it, so we figured it was where the janitor stashed "found" stuff. Sure enough, our server was in there with the bike. Antonio went back and got the server, put it in his car, and drove it to Marlborough.

The security guys had fits about us unilaterally executing plan B, but what could they do about it?



Stack Exchange 2.0 -

In the beginning there was Stack Overflow, the programmer's Q&A site. It's been very successful, easily overtaking its competition. It's now the clear best choice for a place to ask questions and look for answers about programming.

After a few months, Stack Overflow spawned Server Fault (for system administration topics) and Super User (for computer user topics). They've been moderately successful, nothing like Stack Overflow, though: they've each accumulated 36,000 questions, while Stack Overflow has 640,000 so far.

Then they figured, why can't we handle any topics at all, and let anyone create their own site? And so Stack Exchange was born, a site where anyone could create a Q&A arena on whatever topic they wanted.

But it wasn't free. In fact, it seemed kind of expensive: $129/month. And the sites weren't taking off. It seemed that each step removed from programming questions meant a 20x drop-off in traffic. The Stack Overflow team (Joel Spolsky, Jeff Atwood and a bunch of others) are now looking for ways to extend their success, including getting some investment.

As part of their new plan, they've announced changes to Stack Exchange: Stack Exchange 2.0. Everything's free now, but the process for creating new sites has become as convoluted as a Politburo meeting. Interestingly, the comments on the announcement are mostly mad about the loss of the paid option, because a paid site is owned by its creator, while the free sites are not.

I think the new community creation process is way too heavyweight, especially where they require a certain number of users with a certain number of karma points on existing sites to commit to a community before it will be created.

Overall, it's a familiar internet story: a startup creates something, people start using it, but then the business plan shifts, and users are left feeling abandoned. Small startups have to adapt to survive, but they don't want to piss off too many people along the way.

One interesting point in this whole thing: Joel has been very direct with people, telling them if they think they can do a better job building a community, they're welcome to use one of the Stack Overflow clones to do it. And there are a bunch of clones: Array Shift is built in Drupal, OSQA is a Django app, and Shapado looks pretty full-featured. There are probably more.

It'll be interesting to watch the continued evolution of the Stack Overflow ecosystem. I'm not sure any community will get the critical mass that the original Stack Overflow did, but it's worth trying a few ways to make it happen.



Converting Blogger to Wordpress -

Until last weekend, Susan's blog had been done with Blogger. We made use of the FTP feature to push all the content to static HTML files on her server. But Blogger is discontinuing FTP support, so we had to do something.

I'm a huge believer in keeping old URLs working, so I didn't want to switch to a blogspot.com blog, or even move to blog.susansenator.com. Besides, Blogger had been seeming pretty creaky for a while, so I took the opportunity to try something better, namely Wordpress.

Creating the Wordpress blog was pretty simple. Our hosting provider offers one-click installation which worked great. Making a Wordpress theme can be a big undertaking, but not if you're just trying to mimic an existing simple blog layout. I downloaded a simple theme and started hacking away on it. The Wordpress docs are pretty good, definitely better than Blogger's. That's a recurring theme here.

Migrating all the content over was a bigger deal. Blogger offers a backup facility that gives you your entire blog as a giant XML file. Converting that to a Wordpress format was simple with blog converters, which include blogger2wordpress. It turned my 16MB Blogger XML file into a 12MB Wordpress XML file.

Wordpress can then import the XML file, but only up to a maximum size of 2MB (why?). So I manually split the big XML file into 8 smaller XML files, which was tedious but not difficult. Importing each of them brought in all the old blog posts and comments. Nice. (For some reason, embedded YouTube videos are now just a URL in text. If I had noticed that earlier I might have been able to do something about it.)
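I did the split by hand, but it could be scripted. Here is a rough Python 3 sketch, assuming the export is an RSS-style document with a single channel element holding the item elements. It splits by item count rather than by byte size, so the count would need tuning to stay under the import limit. The function name split_items is hypothetical, and ElementTree may rewrite namespace prefixes on output unless they are registered first, so treat the result as something to check before importing.

```python
import copy
import xml.etree.ElementTree as ET


def split_items(xml_text, items_per_file):
    """Split an RSS-style export into several documents.

    Each returned string keeps the channel metadata and holds at most
    `items_per_file` of the original <item> elements, in order.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = channel.findall("item")
    # Strip the items so the remaining tree is a reusable shell.
    for item in items:
        channel.remove(item)
    chunks = []
    for start in range(0, len(items), items_per_file):
        part = copy.deepcopy(root)
        part.find("channel").extend(items[start:start + items_per_file])
        chunks.append(ET.tostring(part, encoding="unicode"))
    return chunks
```

Each chunk can then be written to its own file and imported separately.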

Now we have a Wordpress blog that works just like the Blogger blog did, except that everything has a different permalink than it did before. The first step to fix that is to change the permalink style Wordpress uses. It defaults to something horrendous like:

http://susansenator.com/blog/?p=123

Select "Month and name" under Permalink settings in the Wordpress installation. This makes Wordpress use nice URLs like:

http://susansenator.com/blog/2010/04/here-be-dragons/

Changing this setting will either add or require you to add a chunk of mod_rewrite rules to your Apache .htaccess file:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /blog/
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /blog/index.php [L]
</IfModule>

But lots of other things are subtly different. Archive pages are named differently, Blogger had an index.html page for the blog, and so on. I manually added these rewrites to fix these issues:

# Blogger slugs have .html, wordpress does not.
RewriteRule ^blog/([0-9]{4})/([0-9]{2})/(.*)\.html?$ /blog/$1/$2/$3/ [R=301,L]

# Blogger archives are different.
RewriteRule ^blog/([0-9]{4})_([0-9]{2})_01_archive\.html /blog/$1/$2/ [R=301,L]

# Blogger feeds are now found at the wordpress feed
RewriteRule ^blog/atom\.xml /blog/feed/ [R=301,L]
RewriteRule ^blog/rss\.xml /blog/feed/ [R=301,L]

# Blogger had the old-style index.html.
RewriteRule ^blog/index\.html /blog/ [R=301,L]
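A quick way to sanity-check patterns like these before touching the server is to try them in Python, whose regex syntax is close enough to Apache's for these simple cases. This Python 3 sketch mimics only the first two rules. The function name redirect_target and the sample paths are made up for illustration, and it doesn't model anything else mod_rewrite does.

```python
import re

# Apache writes backreferences as $1; the formatting below plays that role.
BLOGGER_POST = re.compile(r"^blog/([0-9]{4})/([0-9]{2})/(.*)\.html?$")
BLOGGER_ARCHIVE = re.compile(r"^blog/([0-9]{4})_([0-9]{2})_01_archive\.html")


def redirect_target(path):
    """Return the new path for an old Blogger path, or None if no rule matches."""
    m = BLOGGER_POST.match(path)
    if m:
        return "/blog/%s/%s/%s/" % m.groups()
    m = BLOGGER_ARCHIVE.match(path)
    if m:
        return "/blog/%s/%s/" % m.groups()
    return None


print(redirect_target("blog/2010/04/here-be-dragons.html"))
print(redirect_target("blog/2010_04_01_archive.html"))
```

Paths that already use the Wordpress form, such as "blog/feed/", match neither pattern and fall through to Wordpress untouched.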

The thorniest problem, though, is that Blogger and Wordpress don't agree on how to turn a post title into a slug. Both lowercase the text and change spaces to dashes, but Wordpress includes every word, while Blogger leaves out "a" and "the", and maybe others.

The simplest way to solve the differing slug problem was to examine the wordpress.xml file. It had the title of the posts, and the Blogger slug, in the form of the post's permalink. I could determine which posts would have a new slug under Wordpress, and create a redirect for them.

A quick Python program did the work:

from lxml import etree
import re, sys

def items(f):
    doc = etree.parse(open(f))    
    items = doc.xpath('.//item')
    for item in items:
        title = item.xpath('title/text()')
        link = item.xpath('link/text()')
        if title and link:
            yield (title[0], link[0])

# Regexes for turning a title into a Wordpress slug
slugify = [
    # Drop everything but nice word characters
    (r"[^-a-z0-9 ]", ""),
    # All spaces become dashes
    (r" ", "-"),
    # Multiple dashes become one
    (r"-+", "-"),
    ]

def do_file(f):
    for title, link in items(f):
        if "susansenator.com" not in link:
            continue
        slug = link.split('/')[-1].split('.')[0]
        wpslug = title.lower()
        for pat, rep in slugify:
            wpslug = re.sub(pat, rep, wpslug)
        if wpslug != slug:
            old_path = link.replace("http://susansenator.com/", "")
            new_path = old_path.rsplit('/', 1)[0] + "/" + wpslug
            
            print "RewriteRule ^%s /%s [R=301,L]" % (
                old_path.replace(".", r"\."),
                new_path
            )
        
do_file(sys.argv[1])

This just looks at every post, extracts the Blogger slug from the post's link, and computes the Wordpress slug. Where the two slugs differ, a rewrite rule is written. On Susan's blog, this produced 446 rewrite rules, which went into .htaccess:

### These are posts that slugify differently under blogger and wordpress, to keep old permalinks working:
RewriteRule ^blog/2010/04/cheerful-feelings-upon-awakening-in\.html /blog/2010/04/cheerful-feelings-upon-awakening-in-the-country [R=301,L]
RewriteRule ^blog/2010/03/here-is-my-passover-album-on-facebook-i\.html /blog/2010/03/passover-pics [R=301,L]
RewriteRule ^blog/2010/03/reality-of-autism-rifts-and-what-obama\.html /blog/2010/03/the-reality-of-the-autism-rifts-and-what-obama-should-do [R=301,L]
# ... 440 skipped ...
RewriteRule ^blog/2005/10/autism-and-school-board\.html /blog/2005/10/autism-and-the-school-board [R=301,L]
RewriteRule ^blog/2005/10/speed-of-dark\.html /blog/2005/10/the-speed-of-dark [R=301,L]
RewriteRule ^blog/2005/10/adolescence-without-roadmap\.html /blog/2005/10/adolescence-without-a-roadmap [R=301,L]
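To see how a single rule comes about, here is the slug comparison for one post as a Python 3 sketch (the program above is Python 2). The title "The Speed of Dark" is an assumption inferred from the Wordpress slug in the rule above, and the function names are made up for the example.

```python
import re

# Same three steps as the slugify table in the program above.
SLUGIFY = [
    (r"[^-a-z0-9 ]", ""),  # drop everything but nice word characters
    (r" ", "-"),           # all spaces become dashes
    (r"-+", "-"),          # multiple dashes become one
]


def wordpress_slug(title):
    slug = title.lower()
    for pattern, replacement in SLUGIFY:
        slug = re.sub(pattern, replacement, slug)
    return slug


def rewrite_rule(title, blogger_link, site="http://susansenator.com/"):
    """Return a RewriteRule line, or None when the two slugs already agree."""
    blogger_slug = blogger_link.split("/")[-1].split(".")[0]
    new_slug = wordpress_slug(title)
    if new_slug == blogger_slug:
        return None
    old_path = blogger_link.replace(site, "")
    new_path = old_path.rsplit("/", 1)[0] + "/" + new_slug
    return "RewriteRule ^%s /%s [R=301,L]" % (old_path.replace(".", r"\."), new_path)


print(rewrite_rule(
    "The Speed of Dark",
    "http://susansenator.com/blog/2005/10/speed-of-dark.html",
))
```

Blogger dropped the leading "The" and Wordpress keeps it, so the two slugs differ and a rule is emitted. A post whose slugs agree returns None and is left to the generic .html rule.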

With the new super-sized .htaccess in place, the new blog is ready to go. All existing links work well, and no one misses a beat.



Organic metaclasses -

The way I learn things, I can read about something a number of times, and intellectually understand it, but it won't really sink in until I have a real reason to try it out myself. Toy examples don't do it for me, I have to have an actual problem in hand before the solution becomes part of my repertoire. Recently I finally had a use for metaclasses.

I wanted to create an in-memory list of items that I could reference by key. It was a micro-database of languages:

class Language(object):

    # The class attribute of all languages, mapped by id.    
    _db = {}
    
    def __init__(self, **kwargs):
        for k, v in kwargs.iteritems():
            setattr(self, k, v)
        self._db[self.id] = self
        
    @classmethod
    def get(cls, key):
        return cls._db.get(key)

Language(
    id = 'en',
    name = _('English'),
    native = u'English',
    )

Language(
    id = 'fr',
    name = _('French'),
    native = u'Fran\u00E7ais',
    )

Language(
    id = 'nl',
    name = _('Dutch'),
    native = u'Nederlands',
    )

# Some time later:
lang = Language.get(langcode)
lang.native # blah blah

This worked well, it gave me a simple schema-less set of constant items that I could look up by id. And the class attribute _db is used implicitly in the constructor, so I get a clean declarative syntax for building my list of languages.

But then I wanted another set, for countries, so I made a MiniDbItem class to derive both Language and Country from:

class MiniDbItem(object):
    def __init__(self, **kwargs):
        for k, v in kwargs.iteritems():
            setattr(self, k, v)
        self._db[self.id] = self
        
    @classmethod
    def get(cls, key):
        return cls._db.get(key)

class Language(MiniDbItem):
    _db = {}

Language(id='en', ...)
Language(id='fr', ...)

class Country(MiniDbItem):
    _db = {}
    
Country(id='US', ...)
Country(id='FR', ...)

This works, but the unfortunate part is that each derived class has to define its own _db class attribute to keep the Languages separate from the Countries. Each derived class is obligated to do that little bit of redundant work, or the MiniDbItem base class isn't used properly.

The way to avoid that is to use a metaclass. The metaclass provides an __init__ method. In a class, __init__ is called when new class instances are created, but in a metaclass, __init__ is called when new classes are created.

class MetaMiniDbItem(type):
    """ A metaclass to give every class derived from MiniDbItem
        a _db attribute.
    """
    def __init__(cls, name, bases, dict):
        super(MetaMiniDbItem, cls).__init__(name, bases, dict)
        # Each class has its own _db, a dict of its items
        cls._db = {}

class MiniDbItem(object):
    
    __metaclass__ = MetaMiniDbItem

    def __init__(self, **kwargs):
        for k, v in kwargs.iteritems():
            setattr(self, k, v)
        self._db[self.id] = self
        
    @classmethod
    def get(cls, key):
        return cls._db.get(key)

class Language(MiniDbItem): pass

Language(id='en', ...)
Language(id='fr', ...)

class Country(MiniDbItem): pass
    
Country(id='US', ...)
Country(id='FR', ...)

Now MetaMiniDbItem.__init__ is invoked each time a class using the metaclass is defined: once for MiniDbItem itself, again when class Language is defined, and again when class Country is defined. The class being constructed is passed in as the cls parameter. We use super to invoke the regular class creation machinery, then simply set the _db attribute on the class like we want.
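The code above is Python 2. For comparison, here is the same idea as a Python 3 sketch, where the metaclass is named in the class statement instead of through a __metaclass__ attribute. The field values are only illustrative.

```python
class MetaMiniDbItem(type):
    """A metaclass that gives every class it creates its own _db dict."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Runs once per class definition, so nothing is shared between classes.
        cls._db = {}


class MiniDbItem(metaclass=MetaMiniDbItem):
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)
        self._db[self.id] = self

    @classmethod
    def get(cls, key):
        return cls._db.get(key)


class Language(MiniDbItem):
    pass


class Country(MiniDbItem):
    pass


Language(id="fr", native="Fran\u00e7ais")
Country(id="FR", name="France")

print(Language.get("fr").native)
print(Country.get("fr"))  # not found: countries have their own _db
```

The two lookups show that the tables really are separate: the language is found under "fr", while the same key in Country comes back as None.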

Of course, metaclasses can be used to do many more things than simply setting a class attribute, but this example was the first time in my work that metaclasses seemed like a natural solution to a problem rather than an advanced-magic Stupid Python Trick.



Web development peeve -

OK, in the scheme of things, this is really minor, but it irks me. Wouldn't it have been great if the query component of a URL started with an ampersand instead of a question mark?

How many times do we have to write something like this:

// Add foo onto the URL params
params += params ? "&" : "?";
params += "foo=1723";

If the query component started with ampersand, we could just tack on "&foo=1723" and be done with it. From a whole-URL view, there's some sense to separating the query and the path with a distinct character like question mark, but it's not like it would have been unparseable to say the query component starts with the first ampersand.

Next time we'll get it right... :)

And while we're on the subject, why has the Python library got the tools to deal with URLs as structured data spread across three different modules? Turns out it doesn't take much to pull them all together into a Url class that can help with URL construction and parsing tasks:

import cgi, urllib, urlparse

class Url(object):
    """A structured URL.
    
    Create from a string or Django request, then read or write the components
    through attributes `scheme`, `netloc`, `path`, `params`, `query`, and
    `fragment`.
    
    The query is more usefully available as the dictionary `args`.
    
    """
    def __init__(self, url):
        """Construct from a string or Django request."""
        if hasattr(url, 'get_full_path'):
            url = url.get_full_path()
        
        self.scheme, self.netloc, self.path, self.params, \
            self.query, self.fragment = urlparse.urlparse(url)
        self.args = dict(cgi.parse_qsl(self.query))

    def __str__(self):
        """Turn back into a URL."""
        self.query = urllib.urlencode(self.args)
        return urlparse.urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment
            ))

Now I can do stuff like:

# Redirect to one of our canonical hosts, with an extra arg.
url = Url(request)
url.netloc = THE_SECURE_HOST if request.is_secure() else THE_HOST
url.args['from'] = request.get_host()
return http.HttpResponseRedirect(str(url))

This takes care of all the Url syntax logic for me, so I don't have to think about question marks and ampersands ever again.
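In Python 3 the three modules were folded into urllib.parse, so the same class needs only one import. Here is a sketch of that port. It keeps the duck-typed check for a Django-style request, and the example URL and host name are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


class Url:
    """A structured URL, with the query available as the dictionary `args`."""

    def __init__(self, url):
        # Accept either a string or anything with a get_full_path() method.
        if hasattr(url, "get_full_path"):
            url = url.get_full_path()
        (self.scheme, self.netloc, self.path, self.params,
         self.query, self.fragment) = urlparse(url)
        self.args = dict(parse_qsl(self.query))

    def __str__(self):
        """Turn back into a URL, rebuilding the query from `args`."""
        self.query = urlencode(self.args)
        return urlunparse((
            self.scheme, self.netloc, self.path, self.params,
            self.query, self.fragment,
        ))


url = Url("http://example.com/search?q=python")
url.args["from"] = "www.example.com"
print(str(url))
```

As in the original, keeping args in a plain dict means a repeated query parameter collapses to its last value, which is fine for the redirect case but worth knowing about.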



An Apache break in -

Apache.org had an incident last week which started as a cross-site scripting attack and ended with the attackers gaining root access to their servers. The full story is worth a read because it's instructional to see how the mistakes compounded and the attackers used each new foothold to gain access to another deeper level in the system. It reads like a laundry list of simple security mistakes, but strung together in a real world scenario that resulted in a serious breach of security.

And it ends with a great honest example of the open source philosophy:

We hope our disclosure has been as open as possible and true to the ASF spirit. Hopefully others can learn from our mistakes.



Autism Mom's Survival Guide -

My wife Susan's second book is out today: The Autism Mom's Survival Guide. It's about taking care of yourself while taking care of your disabled child, and it includes stories and voices from dozens of parents. Where her first book Making Peace With Autism was a personal memoir, this is more of a support group conversation about how to structure your life around (but not exclusively for) a challenging child.

Autism Mom's Survival Guide



Criminal exceptions -

We're using socks.py to provide SOCKS proxying in some code at work, and it works great until it doesn't. Then, unfortunately, the author didn't try very hard to help us out.

Recently we got this exception:

GeneralProxyError: (5, "bad input")

Looking into the code, here's where it raises that error:

if (type(destpair) in (list,tuple)==False) or (len(destpair)<2) or (type(destpair[0])!=str) or (type(destpair[1])!=int):
    raise GeneralProxyError((5,_generalerrors[5]))

This is criminal: here is input validation, all of which focuses on a single variable, and when raising the exception, it doesn't include the value of the variable! Ugh.

Error handling is no different than the rest of your product: you need to put yourself in your customer's shoes and think about what they'll need. Then give it to them. Simple.
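For contrast, here is what a friendlier version of that check might look like, as a Python 3 sketch. This is not socks.py's code or API: the exception class and function name are invented for the example. The only point is that the message carries the value that failed validation.

```python
class BadDestinationError(ValueError):
    """Raised when a destination is not a (host, port) pair."""


def check_destpair(destpair):
    """Validate a (host, port) pair and say what was wrong if it isn't one."""
    ok = (
        isinstance(destpair, (list, tuple))
        and len(destpair) >= 2
        and isinstance(destpair[0], str)
        and isinstance(destpair[1], int)
    )
    if not ok:
        # Include the offending value so the caller can see what was passed.
        raise BadDestinationError(
            "bad input: expected (host, port), got %r" % (destpair,)
        )
    return destpair[0], destpair[1]
```

Someone who passes the string "example.com:80" instead of a tuple now sees that string in the traceback, rather than having to go read the library source to guess what went wrong.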



Dive into HTML5 -

Mark Pilgrim is good at writing introductions to technical topics, and his latest is no exception: Dive Into HTML5. It's a detailed exposition of the latest additions to HTML.

Some of it seems strangely circuitous, as in the first chapter when he recounts in detail how the img tag got added to HTML in the first place. He has a reason (which is to underscore that HTML was never pure in the first place), but it can seem a long way to go for the lesson.

I've added a few simple tweaks to this site as a result of perusing the changes: input fields of type email and url, and links with rel of author, tag, and license. I haven't gotten into the deeper changes like the new semantic tags <article>, <header> and <time>. Native video is even farther out.

I also learned about OpenSearch, which doesn't seem worth it for this site. I can't imagine many people want a special search just for this blog.

And I learned something new about Internet Explorer in my tangential browsing along the way: not only does IE interpret these odd conditional HTML comments:

<!--[if IE 6]>
    Special instructions for IE 6 here
<![endif]-->

but it also has a Javascript conditional comment syntax:

/*@cc_on
    alert("Hello IE user (please, please switch)!");
@*/

This leads to this remarkably terse snippet to detect if your code is running in IE:

var ie = /*@cc_on!@*/false;

Will the wonders never cease?


