Assignment 5: HTTP Web Proxy and Cache

This handout was adapted from Jerry Cain’s Spring 2018 offering.

Your penultimate CS110 assignment has you implement a multithreaded HTTP proxy and cache. An HTTP proxy is an intermediary that intercepts each and every HTTP request and (generally) forwards it on to the intended recipient. The servers direct their HTTP responses back to the proxy, which in turn passes them on to the client. In its purest form, the HTTP proxy is little more than a nosy network middleman that reads and propagates all incoming and outgoing HTTP activity.

Here’s the neat part, though. When HTTP requests and responses travel through a proxy, the proxy can control what gets passed along. The proxy might, for instance, block access to certain web sites, or it might cache copies of static resources so that repeat requests can be served up more quickly without bothering the origin servers.

Due: Wednesday, August 8th at 11:59 p.m.

Getting started

Go ahead and clone the git repository we’ve set up for you by typing:

git clone /usr/class/cs110/repos/assign5/$USER assign5

Compile often, test incrementally and almost as often as you compile, run ./tools/sanitycheck, and run ./tools/submit when you’re done.

If you descend into your assign5 directory, you’ll notice a subfolder called samples, which itself contains a symlink to a fully operational version called proxy_soln. You can invoke the sample executable without any arguments, as with:

$ ./samples/proxy_soln
Listening for all incoming traffic on port <port number>.

The port number issued depends on your SUNet ID, and with very high probability, you’ll be the only one ever assigned it. If for some reason proxy says the port number is in use, you can select any other port number between 2000 and 65535 that isn’t in use (I’ll choose 12345 here) by typing:

$ ./samples/proxy_soln --port 12345
Listening for all incoming traffic on port 12345.

In isolation, proxy_soln doesn’t do very much. In order to see it work its magic, you should download and launch a web browser that allows you to appoint a proxy for HTTP traffic. I’m recommending you use Firefox, because I’ve used it for five years now specifically to exercise proxy, and it has worked very well for me. Once you download and launch Firefox, you can configure it (on Macs) to connect to the Internet through proxy by opening Firefox’s Preferences, scrolling all the way down to Network Proxy, clicking the Settings button, and then activating a manual proxy as I have in this screenshot. (On Windows, proxy settings can be configured similarly.)

You should enter the full hostname of the myth machine you’re working on (and you should get in the habit of ssh’ing into the same exact myth machine for the next week so you don’t have to continually change these settings), and you should enter the port number your proxy is listening on.

You can, of course, use whatever web browser you want to, but I’m recommending Firefox, not least because it’s the browser I’ve relied on to exercise proxy for years and it has handled this setup without complaint.

If you’d like to start small and avoid the browser, you can use curl from your own machine (or from another myth) to exercise your proxy. An example command might be the following:

curl --proxy http://myth55.stanford.edu:9979 http://icanhazip.com

(This assumes your proxy is listening on port 9979, which is probably not the case – it depends on your SUNet ID.)

Instructions for off-campus students

As was the case with Assignment 4, if you want to use your browser with the proxy and you’re located off campus, you may need to do some extra work.

Your options are the same as they were for Assignment 4; alternatively, you can sidestep the browser entirely and exercise your proxy with curl from another myth machine, as described above.

Implementing v1: Sequential proxy

Your final product should be a multithreaded HTTP proxy and cache that blocks access to certain domains. As with all nontrivial programs, we’re encouraging you to work through a series of milestones instead of implementing everything in one extended, daredevil coding binge. You’ll want to read and reread Sections 11.5 and 11.6 of your B&O textbook to ensure a basic understanding of the HTTP protocol.

For the v1 milestone, you shouldn’t worry about threads or caching. You should transform the initial code base into a sequential but otherwise legitimate proxy. The code you’re starting with responds to all HTTP requests with a placeholder status line consisting of an HTTP/1.0 version string, a status code of 200, and a curt OK reason message. The response includes an equally curt payload announcing the client’s IP address. Once you’ve configured your browser so that all HTTP traffic is directed toward the relevant port of the myth machine you’re working on, go ahead and launch proxy and start visiting any and all web sites. Your proxy should at this point intercept every HTTP request and respond with this (with a different IP address, of course):
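To make that concrete, here’s roughly what that placeholder response looks like on the wire. The payload wording below is illustrative (your starter code’s exact text, and certainly the IP address, will differ):

HTTP/1.0 200 OK

You are writing to us from IP address 171.64.15.30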

For the v1 milestone, you should upgrade the starter application to be a true proxy – an intermediary that ingests HTTP requests from the client, establishes connections to the origin servers (which are the machines for which the requests are actually intended), passes the HTTP requests on to the origin servers, waits for HTTP responses from these origin servers, and then passes those responses back to the clients. Once the v1 checkpoint has been implemented, your proxy application should basically be a busybody that intercepts HTTP requests and responses and passes them on to their intended recipients.

Each intercepted HTTP request is passed along to the origin server pretty much as is, save for three small changes.

Most of the code you write for your v1 milestone will be confined to request-handler.h and request-handler.cc files (although you’ll want to make a few changes to request.h/cc as well). The HTTPRequestHandler class you’re starting with has just one public method, with a placeholder implementation.

You need to familiarize yourself with all of the various classes at your disposal to determine which ones should contribute to the v1 implementation. I repeat: You need to familiarize yourself with all of the various classes at your disposal to determine which ones should contribute to the v1 implementation. Of course, you’ll want to leverage the client socket code presented in lecture to open up a connection to the origin server. Your implementation of the one public method will evolve into a substantial amount of code – substantial enough that you’ll want to decompose and add a good number of private methods.
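As a sketch of that client-side setup, here’s one way the connection to the origin server might be established, assuming the createClientSocket routine from lecture’s client-socket.h and the sockbuf/iosockstream types from the socket++ library; the helper name and the getServer/getPort accessors are placeholders for whatever your HTTPRequest actually exposes:

#include <socket++/sockstream.h>  // for sockbuf, iosockstream
#include "client-socket.h"        // for createClientSocket, kClientSocketError

// hypothetical private helper inside HTTPRequestHandler
void HTTPRequestHandler::forwardRequest(const HTTPRequest& request, HTTPResponse& response) {
  int serverSocket = createClientSocket(request.getServer(), request.getPort());
  if (serverSocket == kClientSocketError) {
    // couldn't reach the origin server, so manufacture an error response instead
    return;
  }
  sockbuf sb(serverSocket);   // adopts the descriptor and closes it on destruction
  iosockstream ss(&sb);
  // write the (lightly modified) request to ss, flush, and then ingest the origin
  // server's response from ss so it can be relayed back to the client
}

The key observation is that your proxy is itself a client of the origin server, so the client socket code from lecture applies essentially verbatim.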

Once you’ve reached your v1 milestone, you’ll be the proud owner of a sequential (but otherwise fully functional) proxy. You should visit every popular web site imaginable to ensure the round-trip transactions pass through your proxy without impacting the functionality of the site (caveat: see the note below on sites that require login or are served up via HTTPS). Of course, you can expect the sites to load very slowly, since your proxy has this much parallelism: zero. For the moment, however, concern yourself with the networking and the proxy’s core functionality, and worry about improving application throughput in later milestones.

Important note: Your proxy doesn’t need to work for HTTPS websites; speaking HTTPS is more complex than what we have presented so far. (Part of the goal of HTTPS is to prevent tampering from middlemen, which is exactly what your proxy tries to do.) Plain-HTTP websites are becoming rarer (a good thing for web security, but bad for debugging purposes). However, many top websites still don’t use HTTPS. See the “Other top sites” section from this site list, and look for sites that are marked as not working on HTTPS or not defaulting to HTTPS.

If, once you get the entire proxy working for submissions purposes, you’re interested in HTTPS proxying, drop me an email and I’ll send you a description of how to upgrade your proxy to be able to intercept some HTTPS traffic.

Implementing v2: Sequential proxy with blacklisting, caching

Once you’ve built v1, you’ll have constructed a genuine HTTP proxy. In practice, proxies are used to block access to certain web sites, to cache static resources that rarely change so they can be served up more quickly, or to do both.

Why block access to certain web sites? There are several reasons; companies and schools, for instance, often rely on proxies to restrict access to sites they consider off-limits or distracting.

Why should the proxy maintain copies of static resources (like images and JavaScript files)? For one, a cached copy can be served up to the client much more quickly than a fresh copy fetched from the origin server, and for another, every cache hit spares the larger Internet a round of superfluous network activity.

In spite of the long-winded defense of why caching and blacklisting are reasonable features, incorporating support for each is relatively straightforward, provided you confine your changes to the request-handler.h and .cc files. In particular, you should just add two private instance variables to HTTPRequestHandler – one of type HTTPBlacklist and a second of type HTTPCache. Once you’ve done that, you should consult the blacklist before forwarding each request (responding with a 403 if the requested domain matches one of the blocked-domain regexes), and you should consult the cache to see whether a previously stored response can be served without contacting the origin server at all (adding cacheable responses to the cache as they come back).
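Here’s an illustrative skeleton of how those two pieces might slot into request handling. The method names below (serverIsAllowed, containsCacheEntry, shouldCache, cacheEntry) are guesses meant to convey the flow, so consult the blacklist and cache headers in your repo for the real interface:

// inside HTTPRequestHandler::serviceRequest, once the client's HTTPRequest has been ingested
HTTPResponse response;
if (!blacklist.serverIsAllowed(request.getServer())) {
  // short-circuit with a 403 Forbidden response of your own construction
} else if (cache.containsCacheEntry(request, response)) {
  // serve the previously cached response without contacting the origin server
} else {
  // forward the request to the origin server and ingest its response; if
  // cache.shouldCache(request, response) approves, add it via cache.cacheEntry(request, response)
}
// in all three cases, finish by passing response back to the client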

Once you’ve hit v2, you should once again pelt your proxy with oodles of requests to ensure it still works as before, save for some obvious differences. Web sites matching domain regexes listed in blocked-domains.txt should be shot down with a 403, and you should confirm your proxy’s cache grows to store a good number of documents, sparing the larger Internet from a good amount of superfluous network activity. (Again, to test the caching part, make sure you clear your browser’s cache a whole bunch.)
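One quick, browser-free way to spot-check the blacklisting half is to reuse the curl recipe from earlier. The hostname, port, and blocked domain below are placeholders, and the -i flag simply asks curl to print the response’s status line and headers:

curl -i --proxy http://myth55.stanford.edu:9979 http://<some-blocked-domain>/

If the domain matches one of your blocked-domain regexes, the output should lead with a 403 status line instead of the origin server’s normal response.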

Implementing v3: Concurrent proxy with blacklisting and caching

You’ve implemented your HTTPRequestHandler class to proxy, block, and cache, but you have yet to work in any multithreading magic. For precisely the same reasons threading worked out so well with your Internet Archive program, threading will work miracles when implanted into your proxy. Virtually all of the multithreading you add will be confined to the scheduler.h and scheduler.cc files. These two files will ultimately define and implement an über-sophisticated HTTPProxyScheduler class, which is responsible for maintaining a list of socket/IP-address pairs to be handled in FIFO fashion by a limited number of threads.

The initial version of scheduler.h/.cc provides the lamest scheduler ever: It just passes the buck on to the HTTPRequestHandler, which proxies, blocks, and caches on the main thread. Calling it a scheduler is an insult to all other schedulers, because it doesn’t really schedule anything at all. It just passes each socket/IP-address pair on to its HTTPRequestHandler underling and blocks until the underling’s serviceRequest method sees the full HTTP transaction through to the last byte transfer.

One extreme solution might just spawn a separate thread within every single call to scheduleRequest, so that its implementation would go from this:

void HTTPProxyScheduler::scheduleRequest(int connectionfd,
                                         const string& clientIPAddress) {
  handler.serviceRequest(make_pair(connectionfd, clientIPAddress));
}

to this:

void HTTPProxyScheduler::scheduleRequest(int connectionfd,
                                         const string& clientIPAddress) {
  thread t([this](const pair<int, string>& connection) {
    handler.serviceRequest(connection);
  }, make_pair(connectionfd, clientIPAddress));
  t.detach();
}

While the above approach succeeds in getting the request off of the main thread, it doesn’t limit the number of threads that can be running at any one time. If your proxy were to receive hundreds of requests in the course of a few seconds – in practice, a very real possibility – the above would create hundreds of threads in the course of those few seconds, and that would be bad. Should the proxy endure an extended burst of incoming traffic – scores of requests per second, sustained over several minutes or even hours – the above would create so many threads that the thread count would quickly exceed a thread-manager-defined maximum.

Fortunately, you built a ThreadPool class for Assignment 4, which is exactly what you want here. I’ve included the thread-pool.h file in the assign5 repositories, and I’ve updated the Makefile to link against my working solution of the ThreadPool class. (If your Assignment 4 ThreadPool had problems, that shouldn’t hurt you here.) You should leverage a single ThreadPool with 64 worker threads, and use that to elevate your sequential proxy to a multithreaded one. Given a properly working ThreadPool, going from sequential to concurrent is actually not very much work at all.
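Concretely, the upgrade to scheduleRequest is small. Here’s a sketch that assumes the Assignment 4 ThreadPool interface (a constructor accepting a worker count and a schedule method accepting a zero-argument thunk), with a ThreadPool pool member – constructed with 64 workers – added alongside the existing handler:

void HTTPProxyScheduler::scheduleRequest(int connectionfd,
                                         const string& clientIPAddress) {
  pool.schedule([this, connectionfd, clientIPAddress] {
    handler.serviceRequest(make_pair(connectionfd, clientIPAddress));
  });
}

Note that connectionfd and clientIPAddress are captured by value, since the thunk typically runs well after scheduleRequest has returned.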

Your HTTPProxyScheduler class should encapsulate just a single HTTPRequestHandler, which itself already encapsulates exactly one HTTPBlacklist and one HTTPCache. You should stick with just one scheduler, request handler, blacklist, and cache, but because you’re now using a ThreadPool and introducing parallelism, you’ll need to implant more synchronization directives to avoid any and all data races. Truth be told, you shouldn’t need to protect the blacklist operations, since the blacklist, once constructed, never changes. But you need to ensure concurrent changes to the cache don’t actually introduce any races that might threaten the integrity of the cached HTTP responses. In particular, if your proxy gets two competing requests for the same exact resource and you don’t protect against race conditions, you may see problems.

Here are some basic requirements:

You should not lock down the entire cache with a single mutex for all requests, as that introduces a huge bottleneck into the mix, allows at most one open network connection at a time, and renders your multithreaded application essentially sequential. You could take the map<string, unique_ptr<mutex>> approach that the implementation of oslock and osunlock takes, but that solution doesn’t scale for real proxies, which run uninterrupted for months at a time and cache millions of documents.

Instead, your HTTPCache implementation should maintain an array of 997 mutexes, and before you do anything on behalf of a particular request, you should hash it and acquire the mutex at the index equal to the hash code modulo 997. You should be able to inspect the initial implementation of the HTTPCache and figure out how to surface a hash code and use that to decide which mutex guards any particular request. A specific HTTPRequest will always map to the same mutex, which guarantees safety; different HTTPRequests may very, very occasionally map to the same mutex, but we’re willing to live with that, since it happens so infrequently.
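In code, the idea looks something like the following; hashRequest is a stand-in name for however your HTTPCache surfaces a request’s hash code:

// needs <array> and <mutex>; lives alongside (or inside) your HTTPCache
static const size_t kNumBuckets = 997;
array<mutex, kNumBuckets> mutexes;

// before doing anything on behalf of a particular request:
size_t bucket = hashRequest(request) % kNumBuckets;  // hashRequest is hypothetical
lock_guard<mutex> guard(mutexes[bucket]);
// while guard is in scope, it's safe to probe, read, and update this request's cache entry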

I’ve ensured that the starting code base relies on thread-safe versions of functions (gethostbyname_r instead of gethostbyname, readdir_r instead of readdir), so you don’t have to worry about any of that. (Note that your assign5 repo includes client-socket.[h/cc], updated to use gethostbyname_r.)

Additional Tidbits

I hope you enjoy the assignment as much as I’ve enjoyed developing it. It’s genuinely thrilling to know that all of you can implement something as sophisticated as an industrial-strength proxy, particularly in light of the fact that many of you took CS106B and CS106X less than a year ago.