Lecture 17
Note: Reading these lecture notes is not a substitute for watching the lecture. I frequently go off script, and you are responsible for understanding everything I talk about in lecture unless I specify otherwise.
Systems classes following CS 110
- CS 140: Operating systems
- You’ve been using processes, threads, virtual memory, and more in this class. CS 140 goes a level deeper and implements these things within the operating system.
- CS 143: Compilers
- There are 4 assignments, in which you implement the 4 stages of a compiler. By the end of the class, you have a working compiler that can translate programs written in COOL (a language somewhat similar to C++ and Java) into assembly instructions.
- CS 144: Computer networking
- This class is about understanding (and sometimes implementing) the various network layers that allow us to establish functional connections between computers.
- CS 155: Computer security
- This is one of the coolest classes I’ve ever taken. It’s a very practical class that demonstrates common vulnerabilities in systems. Take it even if you don’t think you’re going to be a systems person :)
Principles of Computer Systems
You may already feel familiar with many of these concepts, but I want to put them into concrete terms and talk about how they have been relevant to our discussions throughout the quarter. We often take these ideas for granted, but it’s worth pausing to think about them explicitly.
Abstraction
Abstraction is about defining interfaces and focusing on the ideas behind a function instead of on the implementation details. We can define interfaces and use them without needing to know how everything works under the hood, and we can support multiple implementations that follow the same interface.
For example: Do you actually know how writing to `stdout` causes characters to appear on your terminal window? Probably not, but we have defined an interface where you can write to file descriptor 1 and those bytes will be printed to your terminal. You can use this abstraction without even understanding how it works; furthermore, different operating systems implement things differently, but we can use the same interface regardless of what operating system we’re using.
Other abstractions we’ve used in this class:
- Filesystems: In previous classes, you’ve probably worked with C `FILE *`s or C++ `fstream`s without knowing how they worked
- Processes: You know how to do multiprocessing, even though you don’t really know what’s happening at the assembly-instruction level in order to support that
- Signals: You understand how to send and receive signals, but you probably don’t know what the operating system is doing on your behalf in order to make it happen
- Threads: You know how to create threads, but you don’t really know how they’re implemented
- Network sockets: You know how to use network connections as pipes that connect two computers, but you don’t know what’s happening under the hood for the OS to provide this illusion
Modularity and Layering
Modularity: as soon as code starts getting complicated, we break it down into manageable pieces.
Layering is a special form of modularity in which we stack pieces on top of each other.
You’ve seen layering since your CS 106 days. For example, a stack is a data structure layered on top of a vector, and a vector is a data structure layered on top of an array.
Some layering we have seen in this class:
- Filesystems involve many layers, as you saw in Assignment 1:
- Block
- Inode
- File
- Directory
- Pathname
- Symbolic links
- In the past 2 weeks, we have layered `sockbuf`s on top of raw sockets and `sockstream`s on top of `sockbuf`s
- MapReduce allows us to build a distributed processing infrastructure, then layer simple mappers/reducers on top
Naming and name resolution
We need names to refer to system resources. (How else would you address a process? How else would you address an open file?) We also need name resolution systems to convert from human-friendly names to machine-friendly ones.
Caching
A cache is a component – sometimes implemented in hardware, sometimes in software – that stores data so that future requests can be handled more quickly.
We see caching all over in the storage hierarchy:
- Network-based storage is really slow
- We can use local disk space to cache data from network storage
- We can use RAM to cache data from disk
- The L3, L2, and L1 processor caches cache data from RAM
- Finally, information is stored in registers.
There are also TLB caches, DNS caches, and web caches (like what you did in `proxy`).
Virtualization
Virtualization is about making many hardware resources look like one, or making one hardware resource look like many.
Making many hardware resources look like one:
- RAID allows you to connect many disks to a machine and have them appear as one disk
- AFS does a similar thing with networked filesystems
- A web load balancer distributes load to many servers
Making one hardware resource look like many:
- Virtual memory makes every process think it owns all of memory
- Threads/processes provide the illusion that everything is running in parallel, even if there is only one CPU
Concurrency
This is about multiple threads or processes running at the same time. We’ve seen concurrency even across clusters of machines in MapReduce. Even signal and interrupt handlers are a form of concurrency. Some programming languages (e.g. Erlang) are designed so heavily around message-passing concurrency that data races on shared memory are impossible.
Client-server request-response
Request/response is a good way to organize functionality into modules with a clear set of responsibilities. You were already familiar with this pattern from functions and libraries of functions. In this class, we’ve seen the same pattern extended to system calls, to multiprocessing, and to network requests.