Assignment 5: Internet Archive

The Internet Archive maintains one of the largest open archives of websites, books, movies, and more. Their Wayback Machine allows you to enter a website’s address and see how it looked at various points in the past. (See how Stanford’s website looked in 1996.)

In this assignment, you’ll build a simplified version of the Internet Archive. Given a website to archive, your program will crawl through the network of pages it links to and resources it uses, archiving them to disk. You will build this in several phases. The first version, a sequential version with no multithreading, will take quite some time to finish archiving a website, due to the amount of time each network request takes. You’ll then speed it up using your understanding of multithreading and synchronization, and then you’ll build a thread pool to optimize your thread usage.

Due date: Wednesday, August 11th, 2021 at 11:59pm

What the finished product will do

You’ll be building an archive executable. Given a seed website, this will begin downloading that website, any websites it links to, any websites those link to, and so on. Because downloading this network of interconnected websites could end up downloading a fair portion of the internet (depending on your seed website), you are restricting your downloads to a whitelist, so we only download from particular websites of interest.

To download Stanford’s website, we can run the following. (The seed website, specified by -d, is https://www.stanford.edu, and we are restricting to sites on the domain www.stanford.edu using -w. You could use a wildcard and whitelist *.stanford.edu, but this ends up indexing a huge number of pages, and it takes a very long time to download. You can whitelist multiple different domains by using multiple w flags.)

./archive -w www.stanford.edu -d https://www.stanford.edu

This produces the following output:

[03-08-2021 08:07:00] Beginning download of https://www.stanford.edu
[03-08-2021 08:07:01] End download of https://www.stanford.edu (1.104319 seconds)
[03-08-2021 08:07:01] Skipping download of https://fonts.googleapis.com (not whitelisted, or blocked by robots.txt)
[03-08-2021 08:07:01] Skipping download of https://s.w.org (not whitelisted, or blocked by robots.txt)
[03-08-2021 08:07:01] Beginning download of https://www.stanford.edu/wp-json/
[03-08-2021 08:07:01] Beginning download of https://www.stanford.edu/wp-includes/wlwmanifest.xml
[03-08-2021 08:07:01] Beginning download of https://www.stanford.edu/xmlrpc.php?rsd
[03-08-2021 08:07:01] Beginning download of https://www.stanford.edu/wp-content/plugins/awesome-weather-pro/awesome-weather.css?ver=4.9.7
[03-08-2021 08:07:01] Skipping download of https://fonts.googleapis.com/css?family=Open+Sans%3A400%2C300&ver=4.9.7 (not whitelisted, or blocked by robots.txt)
[03-08-2021 08:07:13] Skipping download of https://www.stanford.edu/wp-content/plugins/awesome-weather-pro/js/js-cookie.js?ver=1.1 (already downloaded)
<03ny8l2021 omitted...>
[03-08-2021 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-finance (0.323541 seconds)
[03-08-2021 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-research (0.318452 seconds)
[03-08-2021 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-staff (0.314381 seconds)
[03-08-2021 08:07:13] End download of https://www.stanford.edu/list/admin/#admin-students (0.317646 seconds)
myth66.stanford.edu listening on port 9979...

After crawling a ton of www.stanford.edu pages, this launches a server; the last line tells you where to connect. (It will be different for you.) If I connect to http://myth66.stanford.edu:9979 while leaving archive running, I see the following:

I can enter https://www.stanford.edu, click “Go!", and see Stanford’s homepage, in all its glory:

I can click around on the links, and everything works, as long as I only click links pointing to www.stanford.edu sites (the domain I whitelisted). Even if Stanford’s website goes offline, this archive can still continue serving it as if nothing had ever happened.

Program usage

You may want to play around with ./samples/archive_soln before you begin, just to get a sense of what you’re trying to do here.

The -d specifies the seed page that you’d like to begin crawling from. You should specify a full URL, including http/https. For example:
```
./archive -d https://web.stanford.edu/class/cs110/summer-2021/
```
You can only specify one -d flag, but if you want to index several pages, you can run archive several times. (The files it downloads are persisted in the indexed-documents/ directory.)
Add -w flags to whitelist domains for our crawl. You can specify multiple -w flags, and you can use wildcards if you’d like (though that may include many more pages than you want). For example:
```
./archive -w "*.stanford.edu" -w fonts.googleapis.com -d https://www.stanford.edu
```
While you’re building archive, you may just want to see if it runs correctly, and you may be less interested in browsing the pages it downloads. You can add the -n or --no-serve flag to disable the server. (Your program will exit after downloading the pages.)
archive starts the server with a port number generated from your sunet ID. It’s unlikely anyone else will conflict with your port. However, if you want to run the server on a different port, add the -p or --port flag:
```
./archive -p 12345
```

Important note: archive saves downloaded files in the indexed-documents/ directory. Running archive several times will add to this database without clearing it. This allows you to crawl several different websites and have them all be accessible from your archive web server, but it might not be what you want in testing.

If you want to start fresh, run make filefree to clear the indexed-documents directory. You can also run archive with the -m flag to make it memory-only (it won’t read from disk or persist downloads on disk).

Instructions for off-campus students

As shown above, this assignment features a web server that shows the pages your program has archived. We’re running this server on ports 2000-65535, but unfortunately, for security reasons, the myth machines only allow access to these ports from on-campus computers.

You have several options:

Connect to the campus network using a VPN; instructions are here. I think this is probably the most convenient option; the VPN doesn’t take long to set up, and will be the easiest to work with.
Use an SSH proxy. SSH has a feature that allows us to send traffic to an SSH server, and it will forward that traffic to a web server. If we SSH into a Stanford computer, we can then use that computer to forward web requests to your archive server.
- Launch archive. Let’s say my server is listening on myth55 port 9979.
- Open an extra terminal window and ssh -L 9979:localhost:9979 [email protected] (replace myth55 and 9979 with whatever host/port you are using). Leave this running on the side. This SSH session will ferry traffic from your computer’s port 9979 to myth port 9979 while it’s open, but it will stop working as soon as the SSH connection is closed.
- In your browser, go to http://localhost:9979. (Replace 9979 with your port.)
This method might be annoying to set up each time you try to connect, but it works if you would prefer to avoid installing the VPN software.
Use the terminal-based lynx browser. Open up another terminal window, SSH into myth, and then run something like this:
```
lynx http://myth66.stanford.edu:9979/https://www.stanford.edu
```
If you don’t care much for the interactivity of lynx, you can ask your server for a single page by using curl:
```
curl http://myth66.stanford.edu:9979/https://www.stanford.edu
```
This simply downloads the page from myth and prints it out on your terminal window. If you get anything back, your archive is probably working correctly. (Errors in the page output probably come from the crawling framework that I’ve written for you.)
You can run archive with the web server disabled. When you run archive, add the -nim flag. archive won’t run the web server, but will spit out a list of downloaded content that you can check for correctness. This is the easiest option of this list, but you miss out on the cool factor of being able to use your archive from your browser.

A note on robots.txt

If you look at the sample output of archive that I posted above, you’ll notice that some downloads were skipped. A download from fonts.googleapis.com was skipped – this makes sense, because that was missing from our whitelist. A download from www.stanford.edu was also skipped – this makes less sense.

If you try downloading a website, and you notice that everything is being skipped even though you’ve whitelisted it, it’s possible this is happening because of robots.txt. This file is used by website administrators to tell crawlers (like us) not to download particular parts of the website. You can see it by going to /robots.txt on any website. For example, I tried using archive to download Linux man pages, but I ran into issues. When I checked https://linux.die.net/robots.txt, I saw that everything was disallowed for our crawler.

An opening plea

Please, please, please take the time to ensure you fully understand the concept checks for lectures 11-13, as well as the examples covered in labs 5 and 6. I took every common mutex/semaphore/CV bug I saw in past quarters and put them into the concept check. Please make sure you fully understand why each code example works/breaks so that you don’t fall into the same traps!

It may take some time to go through all of this material, but I promise you that it will save you double the time (or even more) from debugging if you go in with a better understanding of how everything is supposed to fit together!

Getting started

Clone the repository that we’ve set up for you by typing:

git clone /usr/class/cs110/repos/assign5/$USER assign5

Compile often, test incrementally and almost as often as you compile, run ./tools/sanitycheck a bunch to test your work, and run ./tools/submit when you’re done. As you’ve seen in past assignments, you can play with a sample application by invoking ./samples/archive_soln.

Files

archive.h/cc: This is the main (and possibly only) file that you’ll be working on for the first two milestones of the assignment. A skeleton is in place for you, but you need to implement Archive::download to crawl the internet starting from seed URL.

Feel free to add static functions, or to add private methods to Archive. My solution adds several.

document.h/cc: This contains the Document class, used to represent web pages that you encounter. You’ll create a Document for every URL you consider downloading. Call Document::download to download a web page and parse it for any links to other web pages, and call Document::getLinks to get those outgoing links.

You can modify this file if you’d like, but you won’t need to.

index.h/cc: This contains the Index class, used to store all the Documents you download, and used to implement the archive web server that you use to access downloaded web pages. You’ll need to call Index::addDocument.

You can modify this file if you’d like, but you won’t need to.

whitelist.h/cc: This class implements access control to the URLs you might try to access while crawling the web. Every whitelisted URL (specified by -w command-line arguments) should be added to the whitelist by calling addHost. Before you download a URL, you should call Whitelist::canAccess to check whether you’re allowed to access the URL. (This checks both the whitelist from the -w flags, and it does some much more complicated checking of robots.txt files to see if we’re allowed to access this specific URL – see “note on robots.txt” above.)

You can modify this file if you’d like, but you won’t need to.

log.h/cc: This contains a Log class that you can use to produce the logging messages you see in my sample output. You don’t need to use this class if you want to use your own logging – we aren’t grading you on your terminal output format – but it’s easy to use and may save you time, so you should read the interface in the log.h file.

You can modify this file if you’d like, but you won’t need to.

thread-pool.h/cc: This is where you’ll implement Milestone 3 of the assignment, which is a way to use multiple threads in a more efficient way.

tptest.cc: This is a trivial program that verifies basic ThreadPool functionality. We call this from sanitycheck to see if your ThreadPool is working. Don’t change this file.

tpcustomtest.cc: This contains a collection of functions you can use to test your ThreadPool yourself. This is for you to play with and add as many additional tests as you can think of to make sure your ThreadPool is working brilliantly.

Assignment roadmap

Milestone 1: Sequential version

First, implement the InternetArchive class constructor in archive.cc to finish setting up the InternetArchive class. You’ll need to add each whitelisted host to a Whitelist declared for you in archive.h.

Then, implement InternetArchive::download. Given a URL, create a Document from that URL, download() the Document, call getLinks to get a list of outbound links from the document, and then repeat the process with those links. Add each document to the Index declared for you in archive.h. We recommend implementing DFS with recursion, as this will be the easiest to add threads to in milestone 2.

Note that you are not required to use the log methods (see log.h), but we recommend calling them to produce output that may be helpful for debugging.

You must meet the following requirements:

You must never download the same URL twice in the same run of the archive program. (It’s fine to create two Document objects with the same URL at some point, but you should never actually call download() on a Document if you’ve already downloaded a document with the same URL.)
- When running archive a second time, it’s possible that previously-downloaded URLs already exist in the index that were loaded from disk. However, you should ignore and re-download these. This is helpful when we want to archive a site again, updating our files to reflect any recent changes.
You must call Whitelist::canAccess before calling download() on a document, and you shouldn’t download the document if canAccess returns false.
Sometimes downloads can fail. (The network might have issues, the destination server might be down, or there might be some other odd issue.) If download() throws a DownloadException, you should just skip downloading that document (move on, and don’t add it to the index). Your program should not crash. You don’t need to attempt to re-download any failed pages.

After this milestone, you should pass 11 of the 14 sanitycheck tests (everything except the ThreadPool tests).

Debugging failures

A failed sanitycheck output looks something like this:

+++ Test 12-SmallTest-03 on .ir.stanford.edu/users/r/e/rebs/cs110-21-sum/assign5
Descr:   test a single page with several links leading 2 levels deep
Command: make filefree >/dev/null && ./archive -w "web.stanford.edu" -d "http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/linked-multi.html" -niqm
NOT OK:  Waited 15 seconds, program did not complete

You can try running the Command: yourself, outside of sanitycheck, but remove the q flag to show logging output produced by the Log class. For example, you can run:

🍉 make filefree >/dev/null && ./archive -w "web.stanford.edu" -d "http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/linked-multi.html" -nim
[03-08-2021 07:46:24] Beginning download of http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/linked-multi.html
[03-08-2021 07:46:24] End download of http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/linked-multi.html (0.008881 seconds)
[03-08-2021 07:46:24] Beginning download of http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/basic2.html
[03-08-2021 07:46:24] End download of http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/basic2.html (0.007343 seconds)
... more output omitted ...

(Note the lack of q in -niqm.)

Assuming you have included log calls (see above), this should be helpful in identifying what is going wrong. You can also run the sample solution in the same way to see the logging emitted by our code.

Milestone 2: Multithreaded version

The sequential version of the program works fine, but is unacceptably slow. It takes me 27 seconds to download 145 pages from www.stanford.edu; by contrast, using 32 threads to perform 32 downloads simultaneously, this takes only 0.8 seconds. In this milestone, you will update your archive program to download and process pages in parallel using a ThreadPool of 32 threads.

Thread pools

A thread pool is a group of threads that work collaboratively to process a queue of work.

You’ve already seen something similar in Assignment 3, except with processes instead of threads. In farm, you created several worker processes, then distributed work to each worker as it finished its previous work. There were only eight child processes ever created, but each child factored several numbers when there was a lot of work to be done.

An easier approach might have been as follows: every time you had a number to factor, you could fork to launch a worker to factor that number. You could have implemented it such that a maximum of 8 workers were running at a time (on an 8-core machine), and some might argue that this would be just as good as your implementation, since this avoids contention for hardware resources just as well as your implementation did. However, this approach is definitely worse, because even if it has the same number of processes running at a time, it creates many more processes over the entire execution of the program. Creating processes is relatively expensive, so if we can reuse processes to do work, we should do so.

Similar considerations apply to threads. If you have a lot of things to do but want to do at most n things at a time, then it is often better to create n threads up front and distribute the work across those threads. Instead of creating a total of 1000 threads to download 1000 webpages, we would prefer to create only 32 threads and distribute the 1000 webpages amongst them.

When a thread pool is created with n threads, we can immediately start n threads, even before any work needs to be done. Work can be added to the pool and one of the threads will wake up to do the work. Concretely, we can add functions to the thread pool’s queue, and one of the threads will wake up, execute the function we added, and then go back to sleep. (Note: specifically, thunks are added to the queue. Thunks are just functions that take no parameters and return no values.)

The thread pool concept is practically inescapable in software engineering; you’ll see it in database libraries, web servers, some graphics rendering engines, and more. The code we’re implementing is quite general-purpose, and the concepts involved will serve you well.

The ThreadPool class has the following interface:

class ThreadPool {
public:
    ThreadPool(size_t numThreads);
    void schedule(const std::function<void(void)>& thunk);
    void wait();
    ~ThreadPool();
};

A simple program can use this pool to execute 10 function calls across 4 threads:

static const size_t kNumThreads = 4;
static const size_t kNumFunctions = 10;
int main(int argc, char *argv[]) {
    ThreadPool pool(kNumThreads);
    for (size_t id = 0; id < kNumFunctions; id++) {
        pool.schedule([id] {
            cout << oslock << "Thread (ID: " << id << ") has started." << endl << osunlock;
            size_t sleepTime = (id % 3) * 10;
            sleep_for(sleepTime);
            cout << oslock << "Thread (ID: " << id << ") has finished." << endl << osunlock;
        });
    }

    pool.wait();
    cout << "All done!" << endl;
    return 0;
}

Updating your `archive`

You’ll need to create a ThreadPool of 32 threads. For each URL you want to download, you’ll want to schedule a function that downloads that URL, adds it to the index, and schedules downloads of any linked pages. Ensure that all downloads have finished before returning from InternetArchive::download.

Take care to avoid any race conditions or deadlock! You can run ./archive_asan (AddressSanitizer finds memory errors, like Valgrind) and ./archive_tsan (ThreadSanitizer finds data races) to identify some mistakes, although note these tools cannot catch all errors.

After this milestone, you should continue passing the same 11 of the 14 sanitycheck tests as Milestone 1 (everything except the ThreadPool tests). Ensure you are still meeting the requirements from Milestone 1. In particular, make sure that two threads running at the same time can’t download the same URL twice.

Testing recommendations

When you are finished, you should see a significant speedup:

Milestone 1 timing (17.3s total):

🍉 make filefree && time ./archive -w www.stanford.edu -d https://www.stanford.edu -nmq
rm -rf indexed-documents
./archive -w www.stanford.edu -d https://www.stanford.edu -nmq  3.42s user 0.17s system 20% cpu 17.326 total

Milestone 2 timing (1.1s total):

🍉 time ./archive -w www.stanford.edu -d https://www.stanford.edu -nmq
./archive -w www.stanford.edu -d https://www.stanford.edu -nmq  1.65s user 0.14s system 163% cpu 1.090 total

Your times may vary dramatically depending on the condition of the network when you run this, but you should see a ~10-30x speedup.

Be sure to run ASan and TSan to identify any memory errors or data races. For example:

./archive_tsan -w "web.stanford.edu" -d "http://web.stanford.edu/class/cs110/summer-2021/assign5-samples/linked-multi.html" -nim

If you get errors, the output may be intimidating, but you should find the first line in the stack trace that is coming from archive.cc. What is that line doing? That’s the area you should look around when hunting for errors.

Milestone 3: Implementing a Thread Pool

Your Milestone 2 implementation is using our reference ThreadPool implementation. In this milestone, you will implement your very own ThreadPool!

You’ll want to implement this in thread-pool.cc and thread-pool.h. You can run ./tptest and ./tpcustomtest (appending _asan or _tsan to run ASan or TSan) to test your code. Finally, when you are happy with how it is working, modify archive.h to use develop::ThreadPool (your solution) instead of reference::ThreadPool (our solution) and test to ensure everything still works!

using develop::ThreadPool;

Implementing `ThreadPool`

Your constructor should do the following:

Launch a single dispatcher thread, which pulls work off the queue, wakes up a particular worker, and hands the function to be executed to that worker. For example, if dispatcherThread is a private ThreadPool member of type thread:
```
dispatcherThread = thread([this]() {
    dispatcher();
});
```
Launch a specific number of worker threads that each run the worker function. For example, if you have a vector<thread> workerThreads in your ThreadPool, you could write:
```
for workerID from 0 to numThreads:
    workerThreads.push_back(thread([this, workerID]() {
        worker(workerID);
    }));
```

Your schedule function should accept a thunk (a function that takes no parameters and returns no value – expressed as type function<void(void)>) and append it to the end of a queue of such functions. Each time a function is added, the dispatcher thread should be notified. Once the dispatcher is notified, schedule should return right away so that more functions can be scheduled. schedule should be thread-safe (i.e. if your program has more threads that are running outside of the ThreadPool, it should be possible to call schedule from multiple different threads and not have any chance of encountering race conditions).

The dispatcher thread should loop almost interminably. In each iteration, it should sleep until schedule tells it that something has been added to the queue. It should then wait for a worker to become available, select the worker, mark it as unavailable, dequeue a function, put a copy of that function in a place where the worker can access it, and then signal the worker to execute it.

The worker threads should also loop repeatedly, blocking within each iteration until the dispatcher thread signals it to execute an assigned function. Once signaled, the worker should invoke the function, wait for it to execute, then mark itself as available so that it can be discovered and selected again by the dispatcher.

The wait function should wait until the ThreadPool is completely idle. It should be possible to call wait multiple times, and wait should be thread-safe (it should be possible to call wait from multiple threads simultaneously).

The ThreadPool destructor should wait until all functions have been executed (it’s fine to call wait), then dispose of any ThreadPool resources. Our solution doesn’t use any dynamic memory allocation, but if you use it, then be sure to free resources.

Implementation suggestion

You should aim to implement as little code as possible before testing, ensuring that what you have written does what you think it does. This can be challenging, since there are many pieces here that all need to fit together perfectly in order for everything to work. However, here are some things you can try:

First, implement the constructor to spawn the dispatcher and worker threads.
If you try testing with only the constructor implemented, you will encounter sad times, because you aren’t joining the threads, so your code will crash with terminate called without active exception. Let’s put this problem off until later, so we can implement other parts without worrying about this! Temporarily implement the destructor by calling sleep(10);. You can worry about implementing it properly later; this will make your program wait for enough time (10 seconds) that you’ll be able to see whether your other code is working before it crashes due to terminate called without active exception.
Implement the schedule function, adding the thunk to the queue, and implement the dispatcher thread, which receives from the queue. In schedule, you can print something like “schedule: added to queue!” and in dispatcher, you can print something like “dispatcher: received from queue!". You might even try running the received thunk in the dispatcher. This is obviously not a good idea in the long term, because it means that all of the enqueued thunks will run (sequentially) in the dispatcher instead of in worker threads, but it will allow you to test that the handoff between schedule and the dispatcher thread is working correctly.
Implement the handoff from the dispatcher thread to a worker thread. Add plenty of print statements so that you can follow along in this process. In the dispatcher, you should print out which worker you’re trying to wake up; when a worker wakes up, you should print its worker ID, so that you can verify that the correct thread has awoken and is running the thunk.
Once you have thoroughly tested all of the above and are confident that things are working fine, implement the wait function to wait for the ThreadPool to become idle.
Finally, implement the destructor, which should wait for the pool to become idle, signal the threads to break out of their while loops, and join the threads.

Can I implement ThreadPool without a dispatcher thread?

Yes; it’s quite possible to implement a scheme where workers are notified of incoming work, and then they pull work off the queue without the dispatcher specifically handing the work to them. However, we want you to implement ThreadPool with a dispatcher thread. It’s better practice with thread communication/synchronization, and the dispatcher thread is essential to implementing more capable ThreadPools (such as a ThreadPool with lazy initialization, where the worker threads aren’t created unless they’re actually needed).

Testing suggestions

You can test your ThreadPool using tptest.c and tpcustomtest.c (which compile to tptest and tpcustomtest). If you use dynamic memory allocation, make sure that you do not leak any memory. (You shouldn’t need to.) You should also run ASan and TSan (append _asan and _tsan to the aforementioned executable names).

Finally, update archive.h to use your ThreadPool implementation by replacing reference::ThreadPool with develop::ThreadPool. Test to ensure everything is still working properly!

CS 110

Assignment 5: Internet Archive

What the finished product will do

Program usage

Instructions for off-campus students

A note on robots.txt

An opening plea

Getting started

Files

Assignment roadmap

Milestone 1: Sequential version

Debugging failures

Milestone 2: Multithreaded version

Thread pools

Updating your archive

Testing recommendations

Milestone 3: Implementing a Thread Pool

Implementing ThreadPool

Implementation suggestion

Can I implement ThreadPool without a dispatcher thread?

Testing suggestions

Updating your `archive`

Implementing `ThreadPool`