Lab 4 Solutions
This handout was adapted from Jerry Cain’s Spring 2018 offering.
I’ve created a Slack channel for Lab 4 discussion (aptly named #lab4), and all students (but particularly remote students) are encouraged to share their ideas there.
Problem 1: Assignment 2 Revisited
Here is a collection of short-answer questions drilling your understanding of `subprocess` and `farm`. It’s not uncommon for students to get working solutions to assignments and still not be entirely clear why they work. These questions are here to force you to think big picture and understand the systems concepts I feel are important.
- Your Assignment 2 implementation of `subprocess` required two pipes – one to foster a parent-to-child communication channel, and a second to foster a child-to-parent communication channel. Clearly explain why a single pipe shouldn’t be used to support both communication channels.
  - Because both child and parent would need to write to `fds[1]`, and there’d be no generic way to codify whether material in the pipe is intended for the child or the parent, so one could accidentally read what’s intended for the other.
- You’ve seen `dprintf` in lecture and in the `assign3` handout, and it presumably contributed to many `farm` implementations. Explain why there’s a `dprintf` function, but there’s no analogous `dscanf` function. Hint: Think about why `dprintf(fd, "%s %d\n", str, i)` would be easy to manage whereas `dscanf(fd, "%s %d\n", str, &i)` wouldn’t be. Read the first few lines of the man pages for the traditional `fprintf` and `fscanf` functions to understand how they operate.
  - With `dprintf`, the entire string can be constructed (and its length computed) before the underlying `write` calls. `dscanf` might need to read an extra character (e.g. the space in `"1234 "`) to confirm a placeholder like `%d` has been matched, and there’s no way to revert that extra read.
- Consider the implementation of `spawnAllWorkers` below. Even though it rarely causes problems, the line marked with a comment technically contains a race condition. Briefly describe the race condition, and explain how to fix it.

  ```c++
  struct worker {
    worker() {}
    worker(char *argv[], const set<int>& openfds = set<int>())
      : sp(subprocess(argv, true, false, openfds)), available(false) {}
    subprocess_t sp;
    bool available;
  };

  static const size_t kNumCPUs = sysconf(_SC_NPROCESSORS_ONLN);
  static vector<worker> workers(kNumCPUs);
  static size_t numWorkersAvailable;
  static const char *kWorkerArguments[] = { "./factor.py", "--self-halting", NULL };

  static void spawnAllWorkers() {
    cout << "There are this many CPUs: " << kNumCPUs << ", numbered 0 through "
         << kNumCPUs - 1 << "." << endl;
    set<int> openfds;
    for (size_t i = 0; i < workers.size(); i++) {
      workers[i] = worker(const_cast<char **>(kWorkerArguments), openfds); // race condition here
      openfds.insert(workers[i].sp.supplyfd);
      assignToCPU(workers[i].sp.pid, i); // implementation omitted, irrelevant
    }
  }

  int main(int argc, char *argv[]) {
    signal(SIGCHLD, markWorkersAsAvailable); // markWorkersAsAvailable is correct
    spawnAllWorkers();
    // other functions, assume all correct
    return 0;
  }
  ```

  - The right-hand side constructs a temporary that launches a process that may self-halt before the contents of the temporary are copied into `workers[i]`. If that happens, the `SIGCHLD` handler is invoked and crawls over an array that doesn’t have the pid yet, and sadness ensues.
  - The solution is to block `SIGCHLD` before the marked line and then unblock it after.
- While implementing the `farm` program for `assign2`, you were expected to implement a `getAvailableWorker` function to effectively block `farm` until at least one worker was available. My own solution relied on a helper function I called `waitForAvailableWorker`, which I present below. After analyzing my own solution, answer the following questions:

  ```c++
  static sigset_t waitForAvailableWorker() {
    sigset_t existing, additions;
    sigemptyset(&additions);
    sigaddset(&additions, SIGCHLD);
    sigprocmask(SIG_BLOCK, &additions, &existing);
    while (numWorkersAvailable == 0) sigsuspend(&existing);
    return existing;
  }

  static size_t getAvailableWorker() {
    sigset_t existing = waitForAvailableWorker();
    size_t i;
    for (i = 0; !workers[i].available; i++);
    assert(i < workers.size());
    numWorkersAvailable--;          // this line and the next were originally in bold
    workers[i].available = false;
    sigprocmask(SIG_SETMASK, &existing, NULL); // restore original block set
    return i;
  }
  ```

  - Assuming no signals are blocked at the time `waitForAvailableWorker` is called, clearly identify when `SIGCHLD` is blocked and when it is not.
    - TL;DR: `SIGCHLD` isn’t blocked prior to the `sigprocmask` call or within the call to `sigsuspend`. It’s blocked everywhere else.
    - More detail: The first three lines create a singleton mask containing just `SIGCHLD`, and the fourth line blocks it. The `while` loop test is evaluated while the `SIGCHLD` block is in place, which is exactly what we want, because we don’t want its evaluation to be interrupted by the execution of `markWorkersAsAvailable`. If the test passes, the call to `sigsuspend` simultaneously lifts the block on `SIGCHLD` (remember: the `existing` mask is empty) and puts the process to sleep until some (or rather, any) signal is received, at which point any installed handler (presumably the `SIGCHLD` handler, since we expect `SIGCHLD` to be the signal that comes in) executes with high priority. After any handler exits, `sigsuspend` returns while simultaneously restoring the mask that was in place before `sigsuspend` was called, and the test is re-evaluated with the block on `SIGCHLD` back in place. That process repeats until the `while` loop test fails, at which point `waitForAvailableWorker` returns, leaving the block on `SIGCHLD` in place.
  - Had I accidentally passed in `&additions` to the `sigsuspend` call instead of `&existing`, the farm could have deadlocked. Explain why.
    - `numWorkersAvailable == 0` could pass, and then `sigsuspend` would force `farm` to deadlock, as only `SIGCHLD` signals are coming in and capable of changing `numWorkersAvailable`, and they’re blocked.
  - Had I accidentally omitted the `sigaddset` call and not blocked `SIGCHLD`, `farm` could have deadlocked. Explain how.
    - `numWorkersAvailable == 0` passes, `farm` is swapped off the CPU, all `kNumCPUs` workers self-halt, all `kNumCPUs` `SIGCHLD`s are handled by one `SIGCHLD` handler call, `farm` descends into `sigsuspend`, and no additional `SIGCHLD`s ever arrive to wake `farm` up.
  - In past quarters, I saw a bunch of students who lifted the block on `SIGCHLD` before the two marked lines instead of after. As it turns out, executing `numWorkersAvailable--` after the block is lifted can cause problems, but executing `workers[i].available = false` actually can’t. Explain why the placement of the `--` is more sensitive to race conditions than the Boolean assignment is.
    - Execution of the `--` could be interrupted and confused by execution of the `++` within the `SIGCHLD` handler.
    - When the Boolean assignment is executed, the relevant worker is halted, so an interruption by the `SIGCHLD` handler would hop over `workers[i]`, as its pid can’t possibly be returned by the handler’s `waitpid` call until `workers[i]` is continued.
Problem 2: Threading Short Answer Questions
Here are some more short answer questions about the specifics of threads and the directives that help threads communicate.
- Is `i--` thread safe on a single core machine? Why or why not?
  - `i--` is not thread safe if it expands to two or more assembly code instructions, as it does on x86_64. For a local variable not already in a register, the generated code might look like this:

    ```asm
    mov (%rbp), %rax
    sub $1, %rax
    mov %rax, (%rbp)
    ```

    If the thread gets swapped off the processor after the first or second of those three instructions and another thread is scheduled and changes `i`, the original thread will eventually resume and continue on with stale data.
  - Some architectures *are* capable of doing an in-memory decrement (or, more broadly, a memory-memory instruction) in one instruction, so `i--` could compile to one assembly code instruction on some architectures if `i` is a global variable, and in a single-core world only one thing can happen at a time. But in general, you write code to be portable and architecture-agnostic, so you should operate as if `i--`, even where `i` is a global, is thread-unsafe.
- What’s the difference between a `mutex` and a `semaphore` with an initial value of 1? Can one be substituted for the other?
  - The `semaphore` could be used in place of the `mutex`, but not vice versa. First off, the thread that acquires the lock on a `mutex` must be the one to unlock it (by C++ specification), whereas one thread can `signal` a `semaphore` while a second thread waits on it. There’s also nothing to prevent a `semaphore` initialized to 1 from being signaled up to an even higher value (like 2, or 2 million), whereas a `mutex` can really only be in one of two states.
- What is the `lock_guard` class used for, and why is it useful?
  - It’s used to layer over some kind of mutex (typically a `mutex`, but possibly some other mutex class like `timed_mutex` or `recursive_mutex`) so that it’s locked when it needs to be locked (via the `lock_guard` constructor) and unlocked when the `lock_guard` goes out of scope (via its destructor). It’s useful because it guarantees that a mutex locked through a `lock_guard` will be unlocked no matter how the function, method, or enclosing scope exits. In particular, a function might have two or more exit paths (some early returns, some exceptions being thrown, and so forth).
- What is busy waiting? Is it ever a good idea? Does your answer to the good-idea question depend on the number of CPUs your multithreaded application has access to?
  - Busy waiting is consuming processor time waiting for some condition to change, as with `while (numWorkersAvailable == 0) {;}`. On a single-CPU machine, it makes sense for a process or thread that would otherwise busy wait to yield the processor, since the condition can’t possibly change until some other thread gets the processor and makes changes to the variables that are part of the condition. That’s what `sigsuspend` (for processes) and `condition_variable_any::wait` (for threads) are for. In a few scenarios where multithreaded code is running on a machine with multiple CPUs, it’s okay to busy wait *if and only if* you know with high probability that another thread is running at the same time (on another CPU) and will quickly invert the condition that’s causing the first thread to busy wait.
- As it turns out, the semaphore’s constructor allows a negative number to be passed in, as with `semaphore s(-11)`. Identify a scenario where -11 might be a sensible initial value.
  - Say you have 24 threads, and you need to wait for just 12 of them to finish (any 12 of the 24). You could create a semaphore initialized to -11, signal that semaphore in each thread on completion, and then wait on the semaphore in the main thread. The `wait()` call will return once the semaphore value reaches 1, which happens when 12 of the threads signal it.
- What would the implementation of `semaphore::signal(size_t increase = 1)` need to look like if we wanted to allow a `semaphore`’s encapsulated value to be promoted by the `increase` amount? Note that `increase` defaults to 1, so this version could just replace the standard `semaphore::signal` that’s officially exported by the `semaphore` abstraction.
  - The key line in `signal` would need to replace the `++` with a `+= increase`. It would also need to decide whether the condition variable needs to be notified by checking whether the encapsulated value pre-increment was less than or equal to 0 and the value post-increment is positive. If so, then `notify_all` needs to be invoked.
- What’s the multiprocessing equivalent of the `mutex`?
  - It’s not a perfect parallel, but if I had to choose one, it’d be signal masks. We used signal masks to remove the possibility of interruption by signal, which is the only thing that can otherwise introduce a race condition into a sequential program. (Another answer is something I referenced in an earlier solution, even though I didn’t teach it in CS110. Investigate the `flock` function and think about how it could be used to prevent interprocess race conditions on shared memory mapped segments.)
- What’s the multiprocessing equivalent of the `condition_variable_any`?
  - `sigsuspend` is the equivalent of `condition_variable_any::wait`, and a signal outside the `sigsuspend` signal mask is the equivalent of `condition_variable_any::notify_one` or `notify_all`.
- The `semaphore` implementation we produced in lecture is repeated below. Many students have asked whether the very last line of `semaphore::signal` could have called `notify_one` instead of `notify_all`. Does it matter? Explain.

  ```c++
  semaphore::semaphore(int value) : value(value) {}

  void semaphore::wait() {
    lock_guard<mutex> lg(m);
    cv.wait(m, [this]{ return value > 0; });
    value--;
  }

  void semaphore::signal() {
    lock_guard<mutex> lg(m);
    value++;
    if (value == 1) cv.notify_all();
  }
  ```

  - It definitely matters. Assume `signal` is implemented in terms of `notify_one` instead of `notify_all`, and consider a four-thread test program with a single semaphore `s` (initialized to 0), threads 1 and 2 calling `s.wait()`, and threads 3 and 4 calling `s.signal()`. The following scheduling pattern – one that’s very possible – leads to thread 2 blocking forever:
    - Threads 1 and 2 each progress and block on `s.wait()`.
    - Thread 3 signals `s`, which notifies `cv` to wake thread 1 (but not thread 2) up.
    - Thread 4 signals `s`, without notifying `cv` (the value goes from 1 to 2, so the `value == 1` test fails).
    - Thread 1 passes through `s.wait()`, but thread 2 is still asleep under `cv`, even though the semaphore value is 1.
Problem 3: Threads vs Processes
To provide some answers about the pros and cons of threads and processes, we’ll lead you through a collection of short answer questions about multiprocessing and multithreading that focuses more on the big picture.
- What does it mean when we say that a process has a private address space?
  - It means that its range of addresses is, by default, private to the owning process, and that it’s impossible for another arbitrary, unprivileged process to access any of it.
- What are the advantages of a private address space?
- The fact that it’s private, of course. Most other programs can’t accidentally or intentionally examine another process’s data.
- What are the disadvantages?
- The fact that it’s private, of course. Address space privacy makes it exceptionally difficult to exchange data with other processes, and works against any efforts to breezily distribute work over multiple CPUs using multiprocessing.
- What programming directives have we used in prior discussion section handouts to circumvent address space privacy?
  - Signals and pipes can be used to communicate between processes despite address space privacy. We didn’t talk about it in class, but `mmap` can also be used to create segments of memory shared between multiple processes.
- When architecting a larger program like `farm` or `stsh` that relies on multiprocessing, what did we need to do to exchange information across process boundaries?
  - `farm` relied on pipes to forward data to each of the workers, and it relied on `SIGTSTP`, `SIGCHLD`, `SIGCONT`, `SIGKILL`, and a `SIGCHLD` handler to manage synchronization issues. And in some sense, `farm` relied on `execvp`’s string arguments to influence what each of the workers should be running.
  - `stsh` relied on pretty much the same things, albeit for different reasons. `stsh` relied on signals and signal handlers to manage job control, on terminal control transfer to be clear about who’s responding to keyboard events and any signals associated with them, and on pipes to foster communication between neighboring processes in a pipeline.
- Can a process be used to execute multiple executables? Restated, can it `execvp` twice to run multiple programs?
  - Nope, not in general. A program like `stsh` can fork off additional processes, each of which `execvp`s, but that’s not the same thing. The bottom line is that a process’s address space can’t be reused to run multiple executables. It can only be transformed once.
- Threads are often called lightweight processes. In what sense are they processes? And why the lightweight distinction?
  - Each thread of execution runs as if it’s a miniature program. Threads have their own stack, and they have access to globals, dynamic memory allocation, system calls, and so forth. They’re relatively lightweight compared to true processes, because you don’t need to create a new process when creating a thread. The thread manager cordons off a portion of the full stack segment for the new thread’s stack, but otherwise the thread piggybacks off the existing text, heap, and data segments previously established by `fork` and `execvp`. (For more info, go ahead and read the man page for `clone`.)
- Threads are often called virtual processes as well. In what sense are threads an example of virtualization?
- Virtualization is an abstraction mechanism used to make a single resource appear to be many or to make many resources appear to be one. In this case, the process is subdivided to house many lightweight processes, so that one process is made to look like many.
- Threads running within the same process all share the same address space. What are the advantages and disadvantages of allowing threads to access pretty much all of virtual memory?
- It’s a double-edged sword, as you’re all learning with Assignments 5 and 6. Because all threads have access to the same virtual address space, it’s relatively trivial to share information. However, because two or more threads might want to perform surgery on a shared piece of data, directives must be implanted into thread routines to ensure data and resources are shared without inspiring data corruption via race conditions, or deadlock via resource contention. That’s a very difficult thing to consistently get right.
Each thread within a larger process is given a thread id, also called a tid. In fact, the thread id concept is just an extension of the process id. For singly threaded processes, the pid and the main thread’s tid are precisely the same. If a process with pid 12345 creates three additional threads beyond the main thread (and no other processes or threads within other processes are created), then the tid of the main thread would be 12345, and the thread ids of the three other threads would be 12346, 12347, and 12348.
- What are the advantages of leveraging the pid abstraction for thread ids?
- To the extent that the OS can view threads as processes, the OS can provide services on a per-thread basis instead of just a per-process one.
- What happens if you pass a thread id that isn’t a process id to `waitpid`?
  - It’s poorly documented, but the takeaway here is that `waitpid` does not consider it an error if a thread id is passed to it. `waitpid`’s man page (https://linux.die.net/man/2/waitpid) includes some information about how `waitpid` and threads interact with one another.
- What happens if you pass a thread id to `sched_setaffinity`?
  - It actually works, and it informs the OS that the identified thread should only be run on those CPUs included in the provided `cpu_set_t`.
- What are the advantages of requiring that a thread always be assigned to the same CPU?
- Each CPU typically has a dedicated L1 cache that stores data fetched by instructions run on that CPU. CPU-specific L1 caches are more likely to retain content relevant to the thread if the same thread keeps coming back instead of being arbitrarily assigned to other CPUs.
In some situations, the decision to rely on multithreading instead of multiprocessing is dictated solely by whether the code to be run apart from the main thread is available in executable or in library form. But in other situations, we have a choice.
- Why might you prefer multithreading over multiprocessing if both are reasonably good options?
- Ease of data exchange and code sharing.
- Why might you prefer multiprocessing over multithreading if both are reasonably good options?
- Protection, privacy, insulation from other processes errors.
- What happens if a thread within a larger process calls `fork`?
  - On Linux, the process space is cloned, but the only surviving thread in the child is the one that called `fork`. In general, you don’t want a single thread of many to call `fork` unless you have an exquisite reason to do so.
- What happens if a thread within a larger process calls `execvp`?
  - It transforms the entire process to run a new executable. The fact that the pre-`execvp` process involved threading is irrelevant once `execvp` is invoked. It’s that drastic.
And some final questions about how farm and stsh could be implemented:
- Assume you know nothing about multiprocessing but needed to emulate the functionality of `farm`. How could you use multithreading to design an equally parallel solution without ever relying on multiprocessing?
  - You’ll see in Assignment 4 that a `ThreadPool` is a generalization of what we need here.
  - Spawn one thread for each CPU, and set the thread routine to the code that computes and prints a factorization (using `oslock` and `osunlock`, of course). Note that all workers are initially available by storing something on behalf of each thread in the parent (array indices, thread ids, whatever you want).
  - Implement the `main` thread to continue on and read all numbers to be factored, forwarding numbers on to available workers (I’m being intentionally vague here so as to not steal the thunder of the `ThreadPool`).
  - As workers are assigned numbers to be factored, have them note they’re occupied, and as each worker finishes printing its factorization, have it mark itself as available.
  - Use `semaphore`s to coordinate rendezvous communication between the orchestrator and each worker. Rely on `mutex`es to guard shared data structures (e.g. the set of available workers) from race conditions.
  - When all numbers have been factored and all threads are available, prompt each thread to gracefully exit, `join` on all of those threads, and exit the program.
- Could multithreading contribute to the implementation of `stsh` in any meaningful way? Why or why not?
  - Not really. A shell is all about process control – `fork`ing and `execvp`-ing – and there are no in-shell algorithms that are intensely CPU-, I/O-, or network-bound, so introducing parallelism within `stsh` itself is arguably unnecessary.