How to create a simple but powerful CDN with Google App Engine (GAE)

The main reason I started looking at Google App Engine (3 days ago) was to use it as a “CDN for the rest of us”: a way to cache static content (initially) and have that content distributed across all of Google's infrastructure (maybe the most powerful cloud right now).

What do we want?

  • Create a CDN that is easy to update and free of charge for static resources (images, css, js)
  • Consume as little bandwidth as possible by leveraging the If-Modified-Since/Last-Modified/304 Not Modified model

Hands-on:

My first step, of course, was to search Google for some help; the post by Andreas Krohn helped a lot to get started.

But I wanted to go further and handle the If-Modified-Since requests sent by modern browsers, and that is where the Google framework and a little Python come to the rescue.

Note: I’m assuming you’ve already installed the Python environment and the Google App Engine SDK

First of all let me give you two little .bat files that are useful:

Start the test webserver (test.bat):
dev_appserver.py c:\ipsojobscloud

Upload your application to the cloud (update.bat):
appcfg.py update c:\ipsojobscloud

Note: simply replace c:\ipsojobscloud with the folder you are working in, the one that contains your app.yaml.

Then I set up the app.yaml; it's very simple (16 lines):

application: ipsojobscloud
version: 1
runtime: python
api_version: 1

handlers:
- url: /favicon.ico
  static_files: favicon.ico
  upload: favicon.ico

- url: /images/favicon.ico
  static_files: favicon.ico
  upload: favicon.ico

- url: /.*
  script: cacheheaders.py

This app.yaml simply tells GAE the name of the application (ipsojobscloud) and the version we’re working on (use only the major release number; GAE automatically takes care of the .x when you upload).

Then we specify two handlers for the favicon.ico static file and a catch-all handler that routes every other request to the Python script cacheheaders.py.
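To make the handler order more concrete, here is a minimal sketch in plain Python (not the GAE dispatcher itself). It assumes that url patterns are regular expressions matched against the whole request path, and that the first matching handler wins; the HANDLERS list and the resolve function are hypothetical names used only for this illustration.

```python
import re

# (url pattern, target) pairs copied from the app.yaml above, in file order.
# Assumption: the first pattern that matches the whole path handles the request.
HANDLERS = [
    (r'/favicon.ico', 'static: favicon.ico'),
    (r'/images/favicon.ico', 'static: favicon.ico'),
    (r'/.*', 'script: cacheheaders.py'),
]

def resolve(path):
    """Return the target of the first handler whose pattern matches the path."""
    for pattern, target in HANDLERS:
        if re.fullmatch(pattern, path):
            return target
    return None

if __name__ == '__main__':
    for path in ('/favicon.ico', '/images/favicon.ico', '/images/logo.png'):
        print(path, '->', resolve(path))
```

Because the catch-all /.* comes last, the two favicon entries are served as static files and everything else reaches the script.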

With that environment set, we simply code the cacheheaders.py file, let’s see it in detail:

The skeleton of the file is:

import wsgiref.handlers
from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):

  def get(self, dir, file, extension):
...

def main():
  application = webapp.WSGIApplication([(r'/(.*)/([^.]*).(.*)', MainPage)], debug=False)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

Here we import the webapp framework and define the class MainPage. In the main section, the only change from the GAE sample is the regular expression used to match our requests. The r prefix marks a Python raw string, so backslashes inside the pattern are not treated as string escapes. The first part, /(.*)/, matches a slash, an arbitrary number of characters and another slash; the parentheses tell the regular expression to keep the string between the two slashes as a variable. The next part, ([^.]*), captures a run of characters that contains no dot into the second variable, and the dot that follows it is the separator (strictly it should be written \. because an unescaped dot matches any character). Finally, (.*) captures the rest of the input as the third variable.

This regular expression is designed to capture only paths like /images/helloworld.gif, where the variables are images, helloworld and gif respectively.

Note: Of course that’s not a complete solution, since we can only have one folder depth, but improving that is a good exercise for the reader :)

The part you need to know is that when a request arrives, it is mapped to the get function with the parameters dir, file and extension (and don’t forget the first “self” parameter).
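The capture groups are easy to try out with the standard re module. The sketch below is not part of cacheheaders.py: it uses a variant of the pattern with the separator dot escaped, and it assumes the framework matches the pattern against the whole path (re.fullmatch stands in for that here). The split_path name is hypothetical.

```python
import re

# Variant of the route pattern with the separator dot escaped (\.),
# so it only matches a literal dot.
ROUTE = re.compile(r'/(.*)/([^.]*)\.(.*)')

def split_path(path):
    """Return (dir, file, extension) for a matching path, or None."""
    match = ROUTE.fullmatch(path)
    return match.groups() if match else None

if __name__ == '__main__':
    print(split_path('/images/helloworld.gif'))
    print(split_path('/images/icons/ok.png'))
    print(split_path('/images/noextension'))
```

Note that a nested path still matches, with everything before the last slash landing in the first group; it is the check on dir inside get that limits requests to a single folder level.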

Let’s see the code of the get function in detail:

First, check the validity of the parameters received and set the correct content-type based on the extension:

  def get(self, dir, file, extension):
    if (dir!='js' and dir!='css' and dir!='images'):
      self.error(404)
      return

    if (extension!='js' and extension!='css' and extension!='jpg' and extension!='png' and extension!='gif'):
      self.error(404)
      return

    if extension=='js':
      self.response.headers['Content-Type'] = 'application/x-javascript'
    elif extension=='css':
      self.response.headers['Content-Type'] = 'text/css'
    elif extension=='jpg':
      self.response.headers['Content-Type'] = 'image/jpeg'
    elif extension=='gif':
      self.response.headers['Content-Type'] = 'image/gif'
    elif extension=='png':
      self.response.headers['Content-Type'] = 'image/png'

Note: the first two ifs are completely optional. They check that the dir variable is in our list of valid dirs (js, css, images) and that the file extension is in our allowed list (js, css, jpg, png, gif); change or remove these checks as you see fit.
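As a side note, the same validation and Content-Type selection can be expressed as a table lookup. This is only an alternative sketch, not the code used in this post; ALLOWED_DIRS, CONTENT_TYPES and content_type_for are hypothetical names.

```python
# Folders and extensions accepted by the handler, taken from the checks above.
ALLOWED_DIRS = ('js', 'css', 'images')
CONTENT_TYPES = {
    'js': 'application/x-javascript',
    'css': 'text/css',
    'jpg': 'image/jpeg',
    'gif': 'image/gif',
    'png': 'image/png',
}

def content_type_for(directory, extension):
    """Return the Content-Type to send, or None when the request should get a 404."""
    if directory not in ALLOWED_DIRS:
        return None
    return CONTENT_TYPES.get(extension)

if __name__ == '__main__':
    print(content_type_for('images', 'png'))
    print(content_type_for('uploads', 'png'))
```

With a table like this, supporting a new file type means adding one dictionary entry instead of touching two if chains.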

And now the tricky part:

    try:
      import os
      import datetime
      path = dir+'/'+file+"."+extension
      info = os.stat(path)
      # info[8] is st_mtime; HTTP dates are in GMT, so convert the timestamp as UTC
      lastmod = datetime.datetime.utcfromtimestamp(info[8])
      if self.request.headers.has_key('If-Modified-Since'):
        # Keep only the date, dropping any ";"-separated suffix after it
        dt = self.request.headers.get('If-Modified-Since').split(';')[0]
        modsince = datetime.datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z")
        if modsince >= lastmod:
          # The file is older than the cached copy (or exactly the same)
          self.error(304)
          return
        else:
          # The file is newer
          self.output_file(path, lastmod)
      else:
        # No conditional header: always send the file
        self.output_file(path, lastmod)
    except:
      # Missing file or unparseable date: answer 404
      self.error(404)
      return

First we import some packages (os, datetime), then build a variable “path” with the path of the file we want to retrieve:

path = dir+'/'+file+"."+extension

Then we take the file's info from the operating system and keep its last modified date in the lastmod variable. Note that if an error occurs (a nonexistent file, for example), the except part will be executed, returning a 404 Not Found response to the browser.

In the following lines we scan the request headers looking for an If-Modified-Since header; if we find it, we take the date part:

      if self.request.headers.has_key('If-Modified-Since'):
        dt = self.request.headers.get('If-Modified-Since').split(';')[0]
        modsince = datetime.datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z")

Then we compare the file's last modification date against the If-Modified-Since date and act accordingly. Note that self.error(304) returns a 304 (Not Modified) response code to the browser:

        if modsince >= lastmod:
          # The file is older than the cached copy or the same
          self.error(304)
          return
        else:
          # The file is newer
          self.output_file(path, lastmod)
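The same decision can be tested in isolation from the handler. The following is a minimal sketch for a current Python 3 interpreter, not the code deployed on GAE: it swaps strptime for email.utils.parsedate_to_datetime from the standard library, and the not_modified function is a hypothetical helper.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def not_modified(if_modified_since, mtime_seconds):
    """Return True when a 304 response is appropriate for this request."""
    if not if_modified_since:
        return False
    # Keep only the date part, as the handler does with split(';')[0].
    date_part = if_modified_since.split(';')[0].strip()
    try:
        since = parsedate_to_datetime(date_part)
    except (TypeError, ValueError):
        return False  # unparseable header: fall back to sending the file
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop the fractional part.
    last_modified = datetime.fromtimestamp(int(mtime_seconds), tz=timezone.utc)
    return since >= last_modified

if __name__ == '__main__':
    mtime = datetime(2008, 6, 17, 9, 0, 0, tzinfo=timezone.utc).timestamp()
    print(not_modified('Tue, 17 Jun 2008 09:00:00 GMT', mtime))
    print(not_modified('Tue, 17 Jun 2008 08:59:59 GMT', mtime))
```

One design difference from the handler: here a header that cannot be parsed leads to sending the file again rather than to a 404, which seemed the safer default for this sketch.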

self.output_file(path, lastmod) is a function we have defined to avoid code duplication:

  def output_file(self, path, lastmod):
    import datetime
    try:
      self.response.headers['Cache-Control']='public, max-age=31536000'
      self.response.headers['Last-Modified'] = lastmod.strftime("%a, %d %b %Y %H:%M:%S GMT")
      expires=lastmod+datetime.timedelta(days=365)
      self.response.headers['Expires'] = expires.strftime("%a, %d %b %Y %H:%M:%S GMT")
      # Binary mode, so images are not altered by newline translation
      fh=open(path, 'rb')
      self.response.out.write(fh.read())
      # close needs the parentheses, otherwise the file is never closed
      fh.close()
      return
    except IOError:
      self.error(404)
      return

As you can see, we import datetime to manipulate dates and then try to do the following:

  • Set the Cache-Control header to make the response as cacheable as possible
  • Set the Last-Modified header (IMPORTANT! The first time we send the file, the browser stores its Last-Modified date; that is the value it will send back in later If-Modified-Since requests, to which we will usually respond 304 Not Modified)
  • Calculate an expiry date in the future (we’ve used 365 days)
  • Set the Expires header to this value (last-modified + 365 days); for a file last modified more than a year ago this date is already in the past, but HTTP/1.1 caches give Cache-Control max-age precedence over Expires
  • Open the file, send it to the output and finally close the file
  • Return, because once we have output the file we’re done

Note: If something goes wrong we return a standard Not Found (404) response.
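The header values themselves can be generated with the standard library instead of hand-written strftime patterns. The sketch below is a Python 3 variation, not the code from this post: it computes Expires from the current time rather than from the last-modified date, and cache_headers and ONE_YEAR are hypothetical names.

```python
import time
from email.utils import formatdate

ONE_YEAR = 365 * 24 * 60 * 60  # 31536000 seconds, the max-age used above

def cache_headers(mtime, now=None, max_age=ONE_YEAR):
    """Build the caching headers for a file last modified at mtime (epoch seconds)."""
    if now is None:
        now = time.time()
    return {
        'Cache-Control': 'public, max-age=%d' % max_age,
        # formatdate(..., usegmt=True) produces an HTTP date ending in GMT.
        'Last-Modified': formatdate(mtime, usegmt=True),
        # Assumption of this sketch: expire relative to now, not to mtime.
        'Expires': formatdate(now + max_age, usegmt=True),
    }

if __name__ == '__main__':
    for name, value in cache_headers(mtime=0, now=0).items():
        print('%s: %s' % (name, value))
```

Passing now explicitly keeps the function deterministic, which makes it easy to check the generated dates.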

Conclusions:

We’ve reduced the latency of requests for static files by putting them into the cloud, and we keep the bandwidth used in the cloud to a minimum by answering If-Modified-Since requests correctly, all in only about 70 lines of code.

One of the advantages of Google App Engine over Amazon S3 is that GAE is free up to 5 million page views a month, which gives us a good chance to try this kind of feature without spending cash.

You can see the speed improvement online in all the ipsojobs.com pages right now!

Some screenshots taken from Firebug:

[Screenshot: first request (not cached)]

[Screenshot: second request, cached; note the 304 responses]

[Screenshot: sample cached response, details]
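The behaviour captured with Firebug, a full download on the first request and a 304 on the second, can also be checked from a script. This is a minimal Python 3 sketch using urllib; the function names are hypothetical, and the URL in the example is an assumption (a local test server on port 8080 serving /images/helloworld.gif).

```python
import urllib.error
import urllib.request

def conditional_request(url, last_modified=None):
    """Build a GET request, adding If-Modified-Since when a date is known."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header('If-Modified-Since', last_modified)
    return request

def fetch_status(url, last_modified=None):
    """Return (status code, Last-Modified header) for the given URL."""
    try:
        with urllib.request.urlopen(conditional_request(url, last_modified)) as response:
            return response.status, response.headers.get('Last-Modified')
    except urllib.error.HTTPError as error:
        # urllib reports non-2xx answers, including 304, as HTTPError.
        return error.code, error.headers.get('Last-Modified')

if __name__ == '__main__':
    # Assumed URL: adjust host, port and path to your own setup.
    url = 'http://localhost:8080/images/helloworld.gif'
    status, last_modified = fetch_status(url)
    print('first request:', status, last_modified)
    print('second request:', fetch_status(url, last_modified)[0])
```

The second call sends back the Last-Modified value received from the first one, which is what a browser does with its cached copy.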

Full source of cacheheaders.py:

import wsgiref.handlers
from google.appengine.ext import webapp

class MainPage(webapp.RequestHandler):

  def output_file(self, path, lastmod):
    import datetime
    try:
      self.response.headers['Cache-Control']='public, max-age=31536000'
      self.response.headers['Last-Modified'] = lastmod.strftime("%a, %d %b %Y %H:%M:%S GMT")
      expires=lastmod+datetime.timedelta(days=365)
      self.response.headers['Expires'] = expires.strftime("%a, %d %b %Y %H:%M:%S GMT")
      # Binary mode, so images are not altered by newline translation
      fh=open(path, 'rb')
      self.response.out.write(fh.read())
      # close needs the parentheses, otherwise the file is never closed
      fh.close()
      return
    except IOError:
      self.error(404)
      return

  def get(self, dir, file, extension):
    if (dir!='js' and dir!='css' and dir!='images'):
      self.error(404)
      return

    if (extension!='js' and extension!='css' and extension!='jpg' and extension!='png' and extension!='gif'):
      self.error(404)
      return

    if extension=='js':
      self.response.headers['Content-Type'] = 'application/x-javascript'
    elif extension=='css':
      self.response.headers['Content-Type'] = 'text/css'
    elif extension=='jpg':
      self.response.headers['Content-Type'] = 'image/jpeg'
    elif extension=='gif':
      self.response.headers['Content-Type'] = 'image/gif'
    elif extension=='png':
      self.response.headers['Content-Type'] = 'image/png'

    try:
      import os
      import datetime
      path = dir+'/'+file+"."+extension
      info = os.stat(path)
      # info[8] is st_mtime; HTTP dates are in GMT, so convert the timestamp as UTC
      lastmod = datetime.datetime.utcfromtimestamp(info[8])
      if self.request.headers.has_key('If-Modified-Since'):
        # Keep only the date, dropping any ";"-separated suffix after it
        dt = self.request.headers.get('If-Modified-Since').split(';')[0]
        modsince = datetime.datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z")
        if modsince >= lastmod:
          # The file is older than the cached copy (or exactly the same)
          self.error(304)
          return
        else:
          # The file is newer
          self.output_file(path, lastmod)
      else:
        # No conditional header: always send the file
        self.output_file(path, lastmod)
    except:
      # Missing file or unparseable date: answer 404
      self.error(404)
      return

def main():
  application = webapp.WSGIApplication([(r'/(.*)/([^.]*).(.*)', MainPage)], debug=False)
  wsgiref.handlers.CGIHandler().run(application)

if __name__ == "__main__":
  main()

8 Comments »

  1. Andreas Krohn said,

    June 17, 2008 @ 11:33 am

    Very good and interesting post, thanks for that one!

  2. links for 2008-06-17 | Digitalistic - Mashup or die trying said,

    June 17, 2008 @ 4:35 pm

    [...] How to create a simple but powerful CDN with Google App Engine (GAE) The main purpose when I started to look at Google App Engine (3 days ago) was to use it as a “CDN for the rest of us”, a way to cache static content (initially) and have this content distributed along all the infrastructure of Google (maybe the most p (tags: google googleappengine gae python cdn) [...]

  3. free g code said,

    June 21, 2008 @ 11:46 pm

    [...] [...]

  4. Optimizar la carga de una web (y II) | El Blog de Topilloman said,

    July 21, 2008 @ 1:11 am

    [...] this other post continues the work of the previous one, adding control of the browser cache and thus [...]

  5. Utilizando Google App Engine como tu CDN said,

    September 1, 2008 @ 7:45 pm

    [...] content distribution, and now Agusti Pons has written to tell us that he has made some modifications to the original tutorial that reduce the number of requests browsers make for the files, which improves the [...]

  6. Utilizando Google App Engine como tu CDN | Andrebills said,

    September 2, 2008 @ 10:29 pm

    [...] content distribution, and now Agusti Pons has written to tell us that he has made some modifications to the original tutorial that reduce the number of requests browsers make for the files, which improves the [...]

  7. Sofia said,

    January 22, 2009 @ 7:38 pm

    Hi,

    I see you’re no longer using this solution on ipsojobs.com. I was interested to know why since I want to reduce bandwidth costs right now and was looking at this as one of the options.

    So why do you no longer use it? What problems did you face?

    Your answer could be really helpful so please give feedback :) please :)

    Thanks,

    Sofia

  8. david said,

    May 16, 2009 @ 8:44 pm

    I second what Sofia said. Why did you drop it?

© Omatech