Lab 5 Solutions
These questions were written by Jerry Cain and Ryan Eberhardt.
Before the end of lab, be sure to fill out the lab checkoff sheet here!
Problem 1: Threading Short Answer Questions
Here are some short answer questions about the specifics of threads and the directives that help threads communicate.
- Threads are often called lightweight processes. In what sense are they
processes? And why the lightweight distinction?
  - Each thread of execution runs as if it’s a miniature program. Threads
    have their own stack, and they have access to globals, dynamic memory
    allocation, system calls, and so forth. They’re relatively lightweight
    compared to true processes, because you don’t need to create a new
    process when creating a thread. The thread manager cordons off a portion
    of the full stack segment for the new thread’s stack, but otherwise the
    thread piggybacks off existing text, heap, and data segments previously
    established by `fork` and `execvp`. (For more info, go ahead and read the
    man page for `clone`.)
- Threads running within the same process all share the same address space.
What are the advantages and disadvantages of allowing threads to access
pretty much all of virtual memory?
- It’s a double-edged sword. Because all threads have access to the same virtual address space, it’s relatively trivial to share information. However, because two or more threads might want to perform surgery on a shared piece of data, directives must be implanted into thread routines to ensure data and resources are shared without inspiring data corruption via race conditions, or deadlock via resource contention. That’s a very difficult thing to consistently get right.
- When might you choose multithreading over multiprocessing? When might you
choose multiprocessing over multithreading?
- Multithreading: easiest choice in most cases where you need to do multiple things concurrently. It’s easier to communicate between threads and they use less resources to execute.
- Multiprocessing: required if you want to run separate executables. (You can’t be running two executables in the same process, threads or no threads.) Also a good choice if you want to architect for security/robustness, since processes are self-contained and errors in one process won’t necessarily take down all other processes.
- Say two threads run `i--` on a single-core machine. Is that safe? Why or
  why not?
  - `i--` is not thread safe if it expands to two or more assembly code
    instructions, as it does on x86_64. For a local variable not already in
    a register, the generated code might look like this:

    ```
    mov (%rbp), %rax
    sub $1, %rax
    mov %rax, (%rbp)
    ```

    If a thread is swapped off the processor between these instructions,
    another thread can load the same original value of `i`, and one of the
    two decrements is lost.
- What’s the difference between a `mutex` and a `semaphore` with an initial
  value of 1? Can one be substituted for the other?
  - It would be functionally possible to replace a `mutex` with a
    `semaphore(1)`, although I would not advise doing that. Mutexes are
    simpler and faster, so if you want to have at most one thread doing
    something, you should use a mutex.
  - A `mutex` is not a replacement for a `semaphore`. First off, the thread
    that acquires the lock on a `mutex` must be the one to unlock it (by C++
    specification), whereas one thread can `signal` a `semaphore` while a
    different thread `wait`s on it. There’s also nothing to prevent a
    `semaphore` initialized to 1 from being signaled up to an even higher
    number (like 2, or 2 million), whereas a `mutex` can only be in one of
    two states (locked or unlocked).
- What is the `lock_guard` class used for, and why is it useful?
  - It’s used to layer over some kind of mutex (typically a `mutex`, but
    possibly some other mutex classes like `timed_mutex` and
    `recursive_mutex`) so that the mutex is locked when it needs to be locked
    (via the `lock_guard` constructor) and unlocked when the `lock_guard`
    goes out of scope (via its destructor). It’s useful because it guarantees
    that a mutex locked through a `lock_guard` will be unlocked no matter how
    the function, method, or enclosing scope ends. In particular, a function
    might have two or more exit paths (some early returns, some exceptions
    being thrown, and so forth).
Problem 2: Debugging Myth Busters
The `myth` cluster is composed of several individual computers, numbered
`myth51`-`myth66`. When you run `ssh myth.stanford.edu`, a load balancer
evaluates how busy each machine is, and it directs you to the machine with the
least load. In order to do this, the load balancer needs to be able to
efficiently survey how busy each myth machine is.
We have implemented some starter code that sequentially logs into each myth
machine and counts the number of running processes. You can get the starter
code by cloning the `lab5` folder:

```
$ git clone /usr/class/cs110/repos/lab5/shared lab5
$ cd lab5
$ make
```
We have provided a basic sequential (no threads) implementation in
`myth-buster-sequential.cc`. The `countMythProcesses` function builds a `map`
mapping myth numbers (e.g. 56 for `myth56`) to the number of processes running
on that machine:
```cpp
static const int kMinMythMachine = 51;
static const int kMaxMythMachine = 66;

static map<int, int> countMythProcesses() {
  map<int, int> processCounts;
  for (int mythNum = kMinMythMachine; mythNum <= kMaxMythMachine; mythNum++) {
    // SSH into the myth machine and count the processes it's running
    int numProcesses = getNumProcesses(mythNum);
    // numProcesses may be negative if a myth machine is currently down/dead.
    // Skip those.
    if (numProcesses >= 0) {
      cout << "myth" << mythNum << " has this many processes: " << numProcesses << endl;
      processCounts[mythNum] = numProcesses;
    }
  }
  return processCounts;
}
```
This code works, but is much too slow, taking 12.5 seconds to run:
```
🍉 time ./myth-buster-sequential
myth51 has this many processes: 368
myth52 has this many processes: 347
myth53 has this many processes: 318
myth54 has this many processes: 353
myth55 has this many processes: 366
myth56 has this many processes: 325
myth57 has this many processes: 401
myth58 has this many processes: 363
myth59 has this many processes: 420
myth60 has this many processes: 332
myth61 has this many processes: 367
myth62 has this many processes: 324
myth63 has this many processes: 355
myth64 has this many processes: 337
myth65 has this many processes: 355
myth66 has this many processes: 332
Machine with fewest processes: myth53
Number of processes on least loaded machine: 318
./myth-buster-sequential 0.30s user 0.17s system 3% cpu 12.445 total
```
We want to implement a parallel version of `countMythProcesses` that uses
that uses
threads to speed up this process. You should use multiple threads so that you
can connect to multiple myth machines in parallel, with the following
requirements:
- You should be connected to no more than 5 myth machines at a time. This
  limit is artificially low, since there aren’t many myth machines to begin
  with, but such restrictions are common in order to avoid overwhelming other
  computers (or yourself) with network requests, and you’ll see such a
  restriction when connecting to other websites in assignment 5. The final
  solution should take at least 1.5 seconds to run if this restriction is
  implemented correctly.
- Your code should not have any possible data races or synchronization bugs.
  We have included a tool called ThreadSanitizer (sometimes called TSan) that
  observes what your program does and can warn you if your code has a data
  race. Like Valgrind, TSan cannot guarantee that your code is free of bugs,
  but it can catch some mistakes, and it does not report false positives.
While Valgrind injects extra code into your program at runtime, TSan needs to
be compiled as part of your program. The Makefile builds an extra binary
`myth-buster-threaded_tsan` that has TSan compiled into your program:

```
🍉 time ./myth-buster-threaded_tsan
myth55 has this many processes: 371
myth54 has this many processes: 371
myth52 has this many processes: 346
myth53 has this many processes: 317
myth51 has this many processes: 360
myth62 has this many processes: 318
myth57 has this many processes: 403
myth58 has this many processes: 364
myth64 has this many processes: 325
myth65 has this many processes: 348
myth59 has this many processes: 410
myth60 has this many processes: 333
myth61 has this many processes: 366
myth56 has this many processes: 325
myth66 has this many processes: 346
myth63 has this many processes: 363
Machine with fewest processes: myth53
Number of processes on least loaded machine: 317
./myth-buster-threaded_tsan 0.29s user 0.15s system 16% cpu 2.695 total
```
If TSan detects a data race, it will print an error message with information about the memory being accessed by multiple threads, as well as the lines of code being executed by threads that are accessing the variable simultaneously.
We have implemented a buggy threaded `countMythProcesses` in
`myth-buster-threaded.cc`. Try running `./myth-buster-threaded`, observe the
, observe the
broken behavior, and then try to fix each bug. Try to avoid figuring out the
bugs from simply reading and staring at the code. Instead, see if you can use
print statements or gdb to figure out what the program is doing incorrectly,
and then update the code to fix the issue. We want you to practice figuring out
what a program is doing (incorrectly) when it’s not immediately obvious what’s
wrong with the code, since this is a skill that will help you the most in your
assignments and in your own work after this class.
As an additional, optional exercise after you are finished debugging, see if
you can implement your own threaded version from scratch! You can
`cp myth-buster-sequential.cc myth-buster-threaded.cc` to replace our threaded
implementation and get a blank canvas for your own implementation.
There were six bugs. Here they are, listed in the order I think you would have found them if you were debugging based on observed symptoms (instead of reading/analyzing the code):
- The thread prints using `oslock` but fails to use `osunlock` to let other
  threads print to `cout`.
- The thread `wait`s on `concurrentConnections` (taking a ball from the bucket
  so that it can connect to a myth machine), but it fails to `signal` when it
  is done. Eventually, the bucket is depleted of balls, and other threads’
  `wait` calls hang.
- The code `join`s each thread before moving on to create the next one, so
  this code executes sequentially.
- Speaking of the semaphore, the `concurrentConnections.wait()` call happens
  after the `getNumProcesses(mythNum)` call, which connects to the myth
  machines. This means the semaphore does nothing towards limiting the number
  of active connections to myth machines.
- The thread lambda function captures `mythNum` by reference instead of by
  value. As a result, we connect to the same myth machine several times, and
  don’t connect to other myth machines at all.
- The threads update `processCounts` without synchronizing using a mutex. Most
  of the time, this will be fine, but if two threads update the data structure
  at the same time, we could see anything happen, including segfaults, silent
  memory corruption, or even arbitrary code execution. TSan should catch this
  data race.
Here is our solution with all bugs fixed:
```cpp
static int countForMachine(int mythNum, semaphore &concurrentConnections) {
  semaphore_guard sg(concurrentConnections);
  return getNumProcesses(mythNum);
}

static map<int, int> countMythProcesses() {
  map<int, int> processCounts;
  vector<thread> threads;
  mutex processCountsMutex;
  semaphore concurrentConnections(5);
  for (int mythNum = kMinMythMachine; mythNum <= kMaxMythMachine; mythNum++) {
    threads.push_back(thread([mythNum, &processCounts, &processCountsMutex, &concurrentConnections]() {
      int numProcesses = countForMachine(mythNum, concurrentConnections);
      // numProcesses may be negative if a myth machine is currently down/dead.
      // Skip those.
      if (numProcesses >= 0) {
        cout << oslock << "myth" << mythNum << " has this many processes: " << numProcesses
             << endl << osunlock;
        lock_guard<mutex> lg(processCountsMutex);
        processCounts[mythNum] = numProcesses;
      }
    }));
  }
  for (thread &t : threads) {
    t.join();
  }
  return processCounts;
}
```
Questions for discussion, after the bugs are fixed
- How do you decide how many threads to have running at a time? Are more
  threads always better?
- It depends on the situation. If your threads are doing something that requires a lot of CPU time, then numThreads should equal numCpuCores. Having one thread per CPU core would be optimal, since each thread should be using 100% of the CPU core. Adding more threads would simply introduce more scheduling overhead. However, if your threads are I/O bound (doing very little on the CPU, spending most of the time waiting for user input or the filesystem or the network), then you can have many more threads than CPUs without using all of your CPU resources, and doing so might allow you to do much more work in parallel.
- When should a variable be passed by reference? When should it be passed by
value?
- A variable should be passed by reference when you want threads to share the same, single variable. (If the variable is being modified, accesses should be synchronized using a mutex.) It should be passed by value when you want threads to have separate copies, or when you want to “give” a variable to a thread. (If you pass a reference to a variable on your stack when invoking a thread, and then you return from the current function, that would delete the stack frame and free the variable’s memory, causing a use-after-free if the other thread continues using the variable.)
- When should the threads be joined?
- In this program, threads should be joined after all the threads are created. Joining too early serializes the code, causing the myth connections to run sequentially.
- When should the mutex be locked? When should it be unlocked? How would the
program behave if you locked/unlocked in different places?
  - The mutex should be locked immediately before updating `processCounts`,
    and unlocked immediately afterwards. Locking earlier is unnecessary but
    may not cause problems, as long as we don’t hold the lock while calling
    `getNumProcesses(mythNum)`, as doing so would serialize the connections.
- When should the semaphore be waited on? When should it be signaled? How would
the program behave if you waited/signaled in different places?
  - The semaphore should be waited on before calling `getNumProcesses`, and
    should be signaled immediately after `getNumProcesses(mythNum)` returns.
    It would be fine to wait on the semaphore before creating the thread, and
    this would actually be preferable if there were hundreds of myth machines
    to connect to; by doing this, we avoid creating lots of threads that sit
    around idly doing nothing for a long time until they can get a ball from
    the bucket. Signaling later wouldn’t break anything, but it’s not ideal,
    since it’s best to let other threads start connecting to other myth
    machines as early as possible.
- What debugging tricks did you find most helpful in the course of fixing this
code?
- Let us know!