
Memento Tools:
Proxy Scripts

This document describes Python base handlers that we developed to implement by-proxy Memento support for third-party servers such as web archives and content management systems. It would of course be preferable for those servers to implement Memento natively, but proxies may help to bootstrap that process. The base handlers can also be used to develop Memento aggregators or other Memento-compliant services.

Python Proxies

Proxy handlers are currently available for third-party servers such as the Internet Archive and Wikipedia (both described under Documentation).

The following are required to use the handlers:
  • Foresite Toolkit: Foresite handles content negotiation parsing and TimeMap creation.
  • SimpleJSON: SimpleJSON comes included (as json) in Python 2.6 and later.
  • DateUtil: A date parsing implementation built on top of the datetime standard library.
  • Mod Python: These scripts use the mod_python Apache framework, but porting them to mod_wsgi would not take much effort.

Documentation

BaseHandler.py implements a class called BaseProxyHandler. This processes the HTTP requests and returns the responses generated. It requires some minor configuration to work:
  1. Change the host name in the constructor to where your proxy will live. Currently this is set to mementoproxy.lanl.gov.
  2. The constructor takes as an argument the path on which the proxy will listen. You set this in your own implementation, so there is no need to change the file for it.
The class defines the following methods:
  • send(data, req, status, ct):
    This function sends the data back to the client. 'req' is the mod_python request object. 'status' is the HTTP status code to use, which defaults to 302. 'ct' is the MIME type to send in the Content-Type header.
  • error(data, req, status, ct):
    A wrapper function for send() which changes the defaults to an error condition.
  • fetch_changes(req, requri, dt):
    This is the function that needs to be implemented for each proxy. It returns a list of (time, url) tuples to choose from for the requested URI. 'req' is the mod_python request object. 'requri' is the URI of the proxied resource. 'dt' is the datetime object for the time at which the client wants the resource, as used by the handle_dt() function.
  • handle_event(req):
    This function generates a Simile Timeline event stream in either XML or JSON, as requested in the URI.
  • handle_aggr(req):
    Here, we process the TimeBundle URI and redirect to an appropriate TimeMap based on the content negotiation headers.
  • handle_rem(req):
    And here we generate the TimeMap from the set of times and resources generated by fetch_changes().
  • handle_dt(req):
    This is the main Memento function, and processes the redirects based on the requested URI.
  • handle(req):
    A dispatcher which calls the other handle_* functions based on the URI pattern.
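To illustrate how handle_dt() could choose from the list that fetch_changes() returns, here is a minimal sketch that picks the entry nearest in time to the requested datetime. The helper name closest_memento, the example URLs, and the nearest-in-time rule are all assumptions made for illustration; the selection rule actually used in BaseHandler.py is not described on this page and may differ.

```python
from datetime import datetime


def closest_memento(changes, dt):
    """Return the (time, url) tuple whose time is nearest to dt, or None.

    Hypothetical helper, not part of BaseHandler.py. 'changes' has the shape
    fetch_changes() is described as returning: a list of (datetime, url) tuples.
    """
    if not changes:
        return None
    # abs() of a timedelta gives the distance in time regardless of direction.
    return min(changes, key=lambda change: abs(change[0] - dt))


# Example data: three made-up Mementos of one resource.
changes = [
    (datetime(2009, 1, 1), "http://archive.example.org/2009/page"),
    (datetime(2010, 6, 15), "http://archive.example.org/2010/page"),
    (datetime(2011, 3, 1), "http://archive.example.org/2011/page"),
]
best = closest_memento(changes, datetime(2010, 5, 1))
print(best[1])  # prints http://archive.example.org/2010/page
```

An empty change list yields None here, which a caller could turn into an error response.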
Individual handlers define subclasses of BaseProxyHandler, and need only implement the fetch_changes() function.
  • The Internet Archive handler makes use of the regular URL pattern that produces a page listing dates and URLs of archived resources for a particular URI. It parses the HTML using lxml's HTML parser and generates the change list from that. The format to return is a list of tuples, each holding a datetime instance followed by a string with the URI of the Memento.
  • The Wikipedia handler is slightly more complex. It uses the Wikipedia API to extract the history information for the given article. After generating the appropriate URIs for the API, it uses lxml to process the XML responses. It sends a fake User-Agent header to ensure that Wikipedia does not reject the request. In fetch_changes(), it loops, requesting 500 history items per call, and builds the full change list across the multiple calls.
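The paging loop described for the Wikipedia handler can be sketched as follows. This is not the actual Wikipedia handler: BaseProxyHandler here is a stand-in so the example runs on its own, PagedHistoryHandler and fake_fetch_page are hypothetical names, the dt=None default is an assumption, and the fake history service (with its timestamp format) replaces the real API calls and lxml parsing.

```python
from datetime import datetime, timedelta


class BaseProxyHandler(object):
    """Stand-in for the real base class, just enough for this sketch to run."""

    def __init__(self, path):
        self.path = path

    def fetch_changes(self, req, requri, dt=None):
        raise NotImplementedError


class PagedHistoryHandler(BaseProxyHandler):
    """Hypothetical handler that pages through a history service."""

    page_size = 500  # history items requested per call

    def __init__(self, path, fetch_page):
        BaseProxyHandler.__init__(self, path)
        # fetch_page(requri, offset, limit) -> list of (timestamp, uri) strings
        self.fetch_page = fetch_page

    def fetch_changes(self, req, requri, dt=None):
        changes = []
        offset = 0
        while True:
            page = self.fetch_page(requri, offset, self.page_size)
            for stamp, uri in page:
                when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
                changes.append((when, uri))
            if len(page) < self.page_size:
                break  # a short page means the history is exhausted
            offset += len(page)
        changes.sort()  # oldest first
        return changes


# Fake history service with 1200 revisions, one per hour (made-up data).
START = datetime(2010, 1, 1)
calls = []


def fake_fetch_page(requri, offset, limit):
    calls.append((offset, limit))
    stop = min(offset + limit, 1200)
    return [
        ((START + timedelta(hours=i)).strftime("%Y-%m-%dT%H:%M:%SZ"),
         "%s?oldid=%d" % (requri, i))
        for i in range(offset, stop)
    ]


handler = PagedHistoryHandler("/demo/", fake_fetch_page)
changes = handler.fetch_changes(None, "http://wiki.example.org/Page")
```

With 1200 fake revisions and 500 items per call, the loop makes three requests and stops when the third page comes back short.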
Handlers in mod_python must define a 'handler' function that takes the request object as its argument. For proxies, this function constructs the ProxyHandler object, giving it the path in the website where the proxy will be installed. It then calls basehandler() from the baseHandler.py script with the newly constructed object and the request. This function processes the request as described above.
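A minimal sketch of such an entry point is shown below. To keep it runnable without Apache, it defines its own stand-ins: FakeRequest imitates only a few attributes of the mod_python request object, and BaseProxyHandler and basehandler() are simplified placeholders for the real ones. The class name ExampleProxyHandler, the path "/example/", the request URI, and the argument order of basehandler() are assumptions.

```python
class FakeRequest(object):
    """Minimal stand-in for the mod_python request object."""

    def __init__(self, uri):
        self.uri = uri
        self.status = None
        self.content_type = None
        self.body = []

    def write(self, data):
        self.body.append(data)


class BaseProxyHandler(object):
    """Stand-in for the real base class; only what this sketch needs."""

    def __init__(self, path):
        self.path = path

    def handle(self, req):
        # The real dispatcher calls the handle_* methods; this one only
        # reports which path and URI it saw.
        req.status = 200
        req.content_type = "text/plain"
        req.write("%s handled %s" % (self.path, req.uri))
        return 0  # same value as mod_python's apache.OK


def basehandler(req, hdlr):
    """Stand-in for basehandler(); the argument order is an assumption."""
    return hdlr.handle(req)


class ExampleProxyHandler(BaseProxyHandler):
    """Hypothetical subclass; a real one would implement fetch_changes()."""


def handler(req):
    # Entry point that mod_python looks up by name.
    hdlr = ExampleProxyHandler("/example/")
    return basehandler(req, hdlr)


req = FakeRequest("/example/timegate/http://example.org/")
result = handler(req)
print(req.body[0])
```

In a real deployment the module would import the base class and basehandler() from the base handler script instead of defining them, and Apache would supply the request object.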