WWW94: A Caching Relay for the World Wide Web

presented by Steven Glassman, Digital Equipment Corporation, Palo Alto


Steven Glassman presented the design and performance of a caching relay for the web. DEC runs such a relay in Palo Alto because a security firewall separates DEC's internal computers from machines outside its network. Caching was added to improve performance and reduce network traffic. In addition, caching can reduce the latency of page requests, and cached pages remain available even when the server storing the original documents is temporarily unreachable.
On the other hand, caching introduces some problems: it may return a stale version of a page if the original has changed since it was last fetched, it increases latency when a requested page is not yet cached, and it requires additional local resources.

Cache design

Cached pages are stored as UNIX files, with the page's URL as the filename. The files are hashed into 4096 subdirectories, organized in a directory hierarchy three levels deep.
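The hashing scheme can be sketched as follows. Note that the talk did not specify the hash function or the fan-out per level; this sketch assumes an MD5 digest and one hex digit per level (16 x 16 x 16 = 4096 leaf directories), which matches the stated numbers.

```python
import hashlib
import os

def cache_path(url, cache_root="/cache"):
    """Map a URL to a file path inside 4096 subdirectories arranged as a
    three-level hierarchy.  The hash function (MD5) and the one-hex-digit
    fan-out per level are assumptions; the talk only gave the totals."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # one hex digit per level -> 16 * 16 * 16 = 4096 leaf directories
    d1, d2, d3 = digest[0], digest[1], digest[2]
    # a sanitized form of the URL serves as the filename
    filename = url.replace("/", "_").replace(":", "_")
    return os.path.join(cache_root, d1, d2, d3, filename)
```

The same URL always hashes to the same path, so a lookup never has to scan directories; spreading files over 4096 leaves keeps each directory small, which mattered for UNIX filesystem performance at the time.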

The main problem with caching is deciding whether the cached information is still valid or the original document has changed since it was copied.
Unfortunately, there is no mechanism to learn whether a document has changed, and documents carry no reliable expiration date, so the time until a page must be re-fetched has to be estimated. DEC uses a heuristic based on the last modification date: if a document has not changed recently, it will probably not change during the next few days. Each cached document is therefore marked with an estimated expiration date. When the relay receives a request for a page that is in the cache, it checks this expiration date. If the page has not yet expired, it is sent to the client without any further tests. If it has expired, the relay asks the remote server whether the page has changed and, if so, re-fetches it.
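The lookup logic described above can be sketched like this. The expiry factor (a fixed fraction of the document's age) and the minimum lifetime are assumptions; the talk described the idea, not an exact formula.

```python
import time

def estimate_expiry(last_modified, fetched_at, factor=0.1):
    """Heuristic from the talk: a document that has been stable for a
    long time will probably stay unchanged a while longer.  Expire
    after a fraction of its current age (factor is an assumption)."""
    age = fetched_at - last_modified
    return fetched_at + max(age * factor, 60)  # at least one minute

def serve(url, cache, fetch_from_origin, now=None):
    """Relay logic: unexpired cached pages are returned without any
    network traffic; expired or missing pages are fetched again
    (a real relay would use an HTTP conditional request here)."""
    now = now if now is not None else time.time()
    entry = cache.get(url)
    if entry is not None and now < entry["expires"]:
        return entry["body"]  # fresh enough: no further tests
    body, last_modified = fetch_from_origin(url)
    cache[url] = {"body": body,
                  "expires": estimate_expiry(last_modified, now)}
    return body
```

The key property is that a popular, rarely changing page costs one fetch and then serves many requests locally until its estimated expiry passes.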

Cache performance

The log shows that 30 to 50% of all requests were satisfied from the cache, and that pages served by the relay took only about 15 to 25% of the time needed for transmission over the net. The cache currently has two gigabytes of space, which should be enough to hold about 80,000 documents; it currently holds about 630 megabytes of data and is constantly growing.

The statistics show that the popularity of pages follows a Zipf distribution. In other words, on a log-log scale, plotting the number of requests per page against the page's popularity rank yields almost a straight line. This gives a good indication of how many requests to expect for a given number of pages, which in turn gives some hints for a reasonable cache size.
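The cache-sizing argument can be made concrete. Under a Zipf distribution, page i receives requests proportional to 1/i**s, so caching the k most popular of N pages captures H(k)/H(N) of all requests. A minimal sketch, assuming exponent s = 1 (the talk only reported the straight line on the log-log plot, not a fitted exponent):

```python
def zipf_hit_rate(cache_pages, total_pages, s=1.0):
    """Fraction of requests captured by caching the `cache_pages` most
    popular of `total_pages` pages under a Zipf law with exponent s.
    The exponent s = 1 is an assumption, not a figure from the talk."""
    def H(n):
        # generalized harmonic number: sum of 1/i**s for i = 1..n
        return sum(1.0 / i**s for i in range(1, n + 1))
    return H(cache_pages) / H(total_pages)

# e.g. with a million distinct pages, caching the top 80,000 already
# captures a large share of requests because popularity is so skewed
print(round(zipf_hit_rate(80_000, 1_000_000), 2))  # → 0.82
```

Because the distribution is so heavy-tailed, hit rate grows only logarithmically with cache size, which is consistent with the observed 30 to 50% hit rate from a cache far smaller than the whole web.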

I found this an extremely interesting talk. The paper is available on the web.
1st_day_relay / 13-jun-94 (ra) / Reto Ambühler