Lab 3: Sorting with Processes and Threads
This handout was adapted from Jerry Cain’s Spring 2018 offering.
I’ve created a Slack channel for Lab 3 discussion (aptly named #lab3), and all students (but particularly remote students) are encouraged to share their ideas there.
Getting started
Before starting, go ahead and clone the lab3 folder:
$ git clone /usr/class/cs110/repos/lab3/shared lab3
$ cd lab3
$ make
Problem 1: Analyzing parallel mergesort
Consider the architecturally interesting portion of the mergesort executable, which launches 128 peer processes to cooperatively sort an array of 128 randomly generated numbers. The implementations of createSharedArray and freeSharedArray are omitted for the time being.
static bool shouldKeepMerging(size_t start, size_t reach, size_t length) {
  return start % reach == 0 && reach <= length;
}

static void repeatedlyMerge(int numbers[], size_t length, size_t start) {
  int *base = numbers + start;
  for (size_t reach = 2; shouldKeepMerging(start, reach, length); reach *= 2) {
    raise(SIGTSTP); // self-halt until the orchestrator decides the neighbors are ready
    inplace_merge(base, base + reach/2, base + reach);
  }
  exit(0);
}
static void createMergers(int numbers[], pid_t workers[], size_t length) {
  for (size_t start = 0; start < length; start++) {
    workers[start] = fork();
    if (workers[start] == 0)
      repeatedlyMerge(numbers, length, start); // the child never returns from this call
  }
}
static void orchestrateMergers(int numbers[], pid_t workers[], size_t length) {
  size_t step = 1;
  while (step <= length) {
    // block until every stride-step worker has either exited or self-halted
    // (WUNTRACED asks waitpid to return for stopped children as well)
    for (size_t start = 0; start < length; start += step)
      waitpid(workers[start], NULL, WUNTRACED);
    step *= 2;
    // resume the surviving half so they can merge subarrays of twice the size
    for (size_t start = 0; start < length; start += step)
      kill(workers[start], SIGCONT);
  }
}
static void mergesort(int numbers[], size_t length) {
  pid_t workers[length];
  createMergers(numbers, workers, length);
  orchestrateMergers(numbers, workers, length);
}
static const size_t kNumElements = 128;
int main(int argc, char *argv[]) {
  for (size_t trial = 1; trial <= 10000; trial++) {
    int *numbers = createSharedArray(kNumElements);
    mergesort(numbers, kNumElements);
    assert(is_sorted(numbers, numbers + kNumElements));
    freeSharedArray(numbers, kNumElements);
  }
  return 0;
}
The program presented above is a nod to concurrent programming and to the question of whether parallelism can reduce the asymptotic running time of an algorithm. (I use the term “asymptotic running time” rather loosely; that generally refers to the behavior as the size of the input approaches infinity, and, usually, we don’t have infinite resources. But for the purposes of this problem, pretend like we do.) We’ll lead you through a series of short questions – some easy, some less easy – to test your multiprocessing and signal chops and to understand why the “asymptotic” running time of an algorithm can sometimes be improved in a parallel programming world.
For reasons I’ll discuss shortly, the above program works because the address in the numbers variable is cloned across the 128 fork calls, and this particular address maps to the same set of physical addresses in all 128 processes (and that’s different than what usually happens).
The program successfully sorts any array of length 128 by relying on 128 independent processes. (Again, assume infinite resources, so assume we have as many CPU cores as we do elements in the array.) In a nutshell, the above program works because:
- All even-numbered workers (e.g. workers[0], workers[2], etc.) self-halt, while all odd-numbered workers immediately terminate.
- Once all even-numbered workers have self-halted, each is instructed to continue on to call inplace_merge (a C++ standard library algorithm) to potentially update the sequence so that numbers[0] <= numbers[1], numbers[2] <= numbers[3], etc. In general, inplace_merge(first, mid, last) assumes the two ranges [first, mid) and [mid, last) are already sorted in non-decreasing order, and places the merged result in [first, last).
- Once all neighboring pairs have been merged into sorted sub-arrays of length 2, workers[0], workers[4], workers[8], etc. all self-halt while workers[2], workers[6], workers[10], etc. all exit.
- Once all remaining workers self-halt, each is instructed to continue to merge the 64 sorted sub-arrays of length 2 into 32 sorted sub-arrays of length 4.
- The algorithm continues as above, where half of the remaining workers terminate while the other half continue to repeatedly merge larger and larger sub-arrays until only workers[0] remains, at which point workers[0] does one final merge before exiting. The end product is a sorted array of length 128, and that’s pretty awesome. (A sequential simulation of these rounds appears just below.)
For this lab problem, we want to lead you through a series of short answer questions to verify a deeper understanding of how the entire mergesort system works. Truth be told, the mergesort algorithm we’ve implemented is more of theoretical interest than practical. But it’s still a novel example of parallel programming that feels much more relevant and real-world than the Disneyland example I presented in lecture.
Use the following short answer questions to guide the discussion. (Note: this entire problem is based on a final exam question from a prior quarter.)
- Why is the raise(SIGTSTP) line within the implementation of repeatedlyMerge necessary?
- When the implementation of orchestrateMergers executes the step *= 2; line the very first time, all worker processes have either terminated or self-halted. Explain why that’s guaranteed.
- The repeatedlyMerge function relies on a reach parameter, and the orchestrateMergers function relies on a step parameter. Each of the two parameters doubles with each iteration. What are the two parameters accomplishing?
- Had we replaced the one use of WUNTRACED with a 0, would the overall program still correctly sort an arbitrary array of length 128? Why or why not?
- Had we instead replaced the one use of WUNTRACED with WUNTRACED | WNOHANG, would the overall program still correctly sort an arbitrary array of length 128? Why or why not?
- Assume the following implementation of orchestrateMergers replaces the original version. Would the overall program always successfully sort an arbitrary array of length 128? Why or why not?

static void orchestrateMergers(int numbers[], pid_t workers[], size_t length) {
  for (size_t step = 1; step <= length; step *= 2) {
    for (size_t start = 0; start < length; start += step) {
      int status;
      waitpid(workers[start], &status, WUNTRACED);
      if (WIFSTOPPED(status)) kill(workers[start], SIGCONT);
    }
  }
}

- Now assume the following implementation of orchestrateMergers replaces the original version. Note the inner for loop counts down instead of up. Would the overall program always successfully sort an arbitrary array of length 128? Why or why not?

static void orchestrateMergers(int numbers[], pid_t workers[], size_t length) {
  for (size_t step = 1; step <= length; step *= 2) {
    for (ssize_t start = length - step; start >= 0; start -= step) {
      int status;
      waitpid(workers[start], &status, WUNTRACED);
      if (WIFSTOPPED(status)) kill(workers[start], SIGCONT);
    }
  }
}
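As an aside, the stop-and-resume choreography these questions probe can be distilled into a dozen lines. Here’s a minimal sketch of my own (not lab code) in which a lone child self-halts with SIGTSTP and its parent detects the stop via WUNTRACED before resuming it:

// Minimal sketch of the self-halt/resume handshake, with a single child.
#include <signal.h>    // raise, kill
#include <sys/wait.h>  // waitpid, WUNTRACED, WIFSTOPPED
#include <unistd.h>    // fork
#include <cstdio>
#include <cstdlib>

int main() {
  pid_t pid = fork();
  if (pid == 0) {
    raise(SIGTSTP);              // self-halt, exactly as each merger does
    printf("child resumed\n");   // only runs once the parent sends SIGCONT
    exit(0);
  }
  int status;
  waitpid(pid, &status, WUNTRACED);  // returns once the child has stopped
  if (WIFSTOPPED(status)) kill(pid, SIGCONT);
  waitpid(pid, &status, 0);          // reap the child after it exits
  return 0;
}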
The createSharedArray function (defined in memory.h and memory.cc in your lab3 repo) sets aside space for an array of 128 (or, more generally, length) integers and seeds it with random numbers. It does so using the mmap function you’ve seen in Assignments 1 and 2, and you also saw it a bunch of times while playing with strace during last week’s discussion section.
static int *createSharedArray(size_t length) {
  int *numbers =
    static_cast<int *>(mmap(NULL, length * sizeof(int), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0)); // shared, file-less mapping
  static RandomGenerator rgen;
  for (size_t i = 0; i < length; i++)
    numbers[i] = rgen.getNextInt(kMinValue, kMaxValue);
  return numbers;
}
The mmap function takes the place of malloc here, because it sets up space not in the heap, but in an undisclosed segment that other processes can see and touch (MAP_ANONYMOUS means the mapping isn’t backed by any file, and MAP_SHARED means writes to it are visible to every process sharing the mapping, in particular the workers forked later).
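To see the consequence of those two flags in isolation, consider this toy program of my own (not part of the lab): a write made by the child after fork is visible to the parent, which would never be true of heap memory.

// A write to a MAP_SHARED | MAP_ANONYMOUS page survives across fork.
#include <cassert>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  int *shared = static_cast<int *>(mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                                        MAP_SHARED | MAP_ANONYMOUS, -1, 0));
  *shared = 0;
  pid_t pid = fork();
  if (pid == 0) {
    *shared = 110;  // the child's write lands in the shared physical page
    _exit(0);
  }
  waitpid(pid, NULL, 0);
  assert(*shared == 110);  // the parent sees it; heap memory wouldn't behave this way
  munmap(shared, sizeof(int));
  return 0;
}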
- Normally virtual address spaces are private and inaccessible to other processes, but that’s clearly not the case here. Given what you learned about virtual-to-physical address mapping during this past Monday’s lecture, explain what the operating system must do so that the mergers share access to this array while arbitrary, unrelated processes don’t.
- Virtual memory is one form of virtualization used so that the above program works. Describe one other form of virtualization you see.
- Assuming the implementation of inplace_merge is O(n), explain why the running time of our parallel mergesort is O(n) instead of the O(n log n) normally ascribed to the sequential version. (Again, assume we have an unbounded number of CPU cores available to us.) Your explanation should be framed in terms of some simple math; it’s not enough to just say it’s parallel.
Problem 2: Multithreaded quicksort
quicksort is an efficient divide-and-conquer sorting algorithm whose traditional implementation looks like this:
static void quicksort(vector<int>& numbers, ssize_t start, ssize_t finish) {
  if (start >= finish) return;  // ranges of length 0 or 1 are already sorted
  ssize_t mid = partition(numbers, start, finish);
  quicksort(numbers, start, mid - 1);   // sort everything left of the pivot
  quicksort(numbers, mid + 1, finish);  // sort everything right of the pivot
}

static void quicksort(vector<int>& numbers) {
  quicksort(numbers, 0, numbers.size() - 1);
}
The details of how partition works aren’t important. All you need to know is that a call to partition(numbers, start, finish) reorders the elements between numbers[start] and numbers[finish], inclusive, so that the numbers residing within indices start through mid - 1, inclusive, are less than or equal to the number at index mid, and that all numbers residing in indices mid + 1 through finish, inclusive, are greater than or equal to the number at index mid. As a result of this reorganization, we know that, once partition returns, the number residing at index mid actually belongs there in the final sort.
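For the curious, here is one partition implementation consistent with that contract. It’s the classic Lomuto scheme, sketched by me; the version in quicksort.cc may well differ:

// One possible partition honoring the contract above (Lomuto scheme).
#include <sys/types.h>  // ssize_t
#include <utility>      // swap
#include <vector>
using namespace std;

static ssize_t partition(vector<int>& numbers, ssize_t start, ssize_t finish) {
  int pivot = numbers[finish];         // use the last element as the pivot
  ssize_t mid = start;                 // next slot for a value <= pivot
  for (ssize_t i = start; i < finish; i++)
    if (numbers[i] <= pivot) swap(numbers[i], numbers[mid++]);
  swap(numbers[mid], numbers[finish]); // the pivot lands at its final index
  return mid;
}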
What’s super neat is that the two recursive calls to quicksort can execute in parallel, since the sequences they operate on don’t overlap. In fact, to make sure you get some practice with C++ threads right away, you’re going to cannibalize the above implementation so that each call to quicksort spawns off threads to recursively sort the front and back portions at the same time.
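The pattern you’ll need is the usual spawn-and-join: hand each independent chunk of work to its own std::thread, and join both before returning. Here’s a toy, self-contained sketch of that pattern and nothing more; the fillRange helper is made up for illustration and isn’t part of the lab:

// Generic spawn-and-join over two non-overlapping ranges.
#include <cstddef>
#include <functional>  // ref
#include <numeric>     // iota
#include <thread>
#include <vector>
using namespace std;

static void fillRange(vector<int>& v, size_t start, size_t finish, int base) {
  iota(v.begin() + start, v.begin() + finish, base);  // stand-in for real work
}

int main() {
  vector<int> v(8);
  // The two ranges don't overlap, so the workers can't interfere with
  // each other, just like quicksort's two recursive calls.
  thread front(fillRange, ref(v), 0, 4, 0);    // ref avoids copying the vector
  thread back(fillRange, ref(v), 4, 8, 100);
  front.join();  // return only once both halves are done
  back.join();
  return 0;
}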
- Descend into your clone of the shared lab3 directory, and execute the sequential quicksort executable to confirm that it runs and passes with flying colors. Then examine the quicksort.cc file to confirm your understanding of quicksort. You can ignore the details of the partition routine and just trust that it works, but ensure you believe in the recursive substructure of the three-argument quicksort function.
- Now implement the aggressiveQuicksort function, which is more or less the same as the sequential quicksort, except that each of the two recursive calls runs in an independent, parallel thread. Create standalone threads without concern for any system thread count limits. Ensure that any call to aggressiveQuicksort returns only after its recursively guided threads finish. Test your implementation to verify it works as intended by typing ./quicksort --aggressive on the command line.
- Tinker with the value of kNumElements (initially set to 128) to see how high you can make it before you exceed the number of threads allowed to coexist in a single process. You don’t need to surface an exact number; a ballpark figure is just fine.
- Leveraging your aggressiveQuicksort implementation, implement the recursive conservativeQuicksort function so it’s just as parallel, except the second recursive call isn’t run within a new thread; instead, it’s run within the same thread of execution as the caller. Test your implementation to verify it works as intended by typing ./quicksort --conservative on the command line.
- Time each of the three versions using the time utility, as you may be doing in Assignment 3 while testing farm. Are the running times of the parallel versions lower or higher than that of the sequential version? Are the running times what you expect? Explain.
Mini problem 3: The /proc directory
Note: This problem isn’t important, but I think it’s fascinating. It demonstrates how the /proc virtual directory can be explored to view information about running processes on your machine.
- ssh into any myth machine, and type ps u at the command prompt to learn the process id of your terminal, as with:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
rebs     27139  1.4  0.0  41684  7440 pts/9    Ss   00:23   0:00 zsh
rebs     27465  0.0  0.0  30404  1528 pts/9    R+   00:23   0:00 ps u

The pid of my zsh terminal is 27139, but yours will almost certainly be different. Assuming a pid of 27139, type in the following:

$ cd /proc/27139
$ ls -l maps
-r--r--r-- 1 rebs operator 0 Jul 12 00:23 maps
$ cat maps
00400000-004b6000 r-xp 00000000 08:01 6815756   /bin/zsh5
006b5000-006b6000 r--p 000b5000 08:01 6815756   /bin/zsh5
006b6000-006bc000 rw-p 000b6000 08:01 6815756   /bin/zsh5
006bc000-006d0000 rw-p 00000000 00:00 0
01186000-01436000 rw-p 00000000 00:00 0         [heap]
7f1bae988000-7f1bae98a000 r-xp 00000000 08:01 7871449   /usr/lib/x86_64-linux-gnu/zsh/5.1.1/zsh/regex.so
7f1bae98a000-7f1baeb89000 ---p 00002000 08:01 7871449   /usr/lib/x86_64-linux-gnu/zsh/5.1.1/zsh/regex.so
7f1baeb89000-7f1baeb8a000 r--p 00001000 08:01 7871449   /usr/lib/x86_64-linux-gnu/zsh/5.1.1/zsh/regex.so
# Many more lines omitted for brevity.....
7f1bb0f69000-7f1bb0f6d000 rw-p 00000000 00:00 0
7f1bb0f6d000-7f1bb0f6e000 r--p 00025000 08:01 2495029   /lib/x86_64-linux-gnu/ld-2.23.so
7f1bb0f6e000-7f1bb0f6f000 rw-p 00026000 08:01 2495029   /lib/x86_64-linux-gnu/ld-2.23.so
7f1bb0f6f000-7f1bb0f70000 rw-p 00000000 00:00 0
7ffeb50c2000-7ffeb50e3000 rw-p 00000000 00:00 0         [stack]
7ffeb51db000-7ffeb51dd000 r--p 00000000 00:00 0         [vvar]
7ffeb51dd000-7ffeb51df000 r-xp 00000000 00:00 0         [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
From the man page for proc: “The proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc. Most of it is read-only, but some files allow kernel variables to be changed.” Within /proc is a subdirectory for every single process running on the machine, and within each of those there are sub-subdirectories that present information about various resources tapped by that process. In my case, the process subdirectory is named 27139, and the sub-subdirectory of interest is maps, which provides information about all of the contiguous regions of virtual memory the process relies on for execution.
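Nothing stops a program from reading these pseudo-files itself. The following sketch (mine, not part of the lab) prints its own memory map by reading /proc/self/maps, where self always names the process doing the reading:

// Print this process's own memory map, one line per contiguous mapping.
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
  ifstream maps("/proc/self/maps");  // "self" resolves to the reading process
  string line;
  while (getline(maps, line))
    cout << line << endl;            // same format as the cat maps dump above
  return 0;
}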
To find out what each row and column in the output means, consult this stackoverflow question and read through the accepted answer.

- Still in the /proc/<pid> directory, try the following:

$ ls -l fd
total 0
lrwx------ 1 rebs operator 64 Jul 12 00:23 0 -> /dev/pts/9
lrwx------ 1 rebs operator 64 Jul 12 00:23 1 -> /dev/pts/9
lrwx------ 1 rebs operator 64 Jul 12 00:23 10 -> /dev/pts/9
lr-x------ 1 rebs operator 64 Jul 12 00:23 12 -> /usr/share/zsh/functions/Completion.zwc
lrwx------ 1 rebs operator 64 Jul 12 00:23 2 -> /dev/pts/9
- This shows the process’s open file descriptors and the files they’re connected to. (Descriptors 0, 1, 2, and 10 are connected to my terminal. I’m not sure what zsh is using descriptor 10 for, but it’s there! Descriptor 12 seems to be used for a tab-completion file.)
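Incidentally, /proc/self is a symlink the kernel points at whichever process resolves it, so you can inspect a command’s descriptor table as it runs:

$ ls -l /proc/self/fd

Here the output describes the ls process itself, since ls is the one resolving self; run it twice and the underlying pid will differ, because each invocation is a brand new process.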