Lecture 2: Filesystem Recap, Intro to System Calls

Note: Reading these lecture notes is not a substitute for watching the lecture. I frequently go off script, and you are responsible for understanding everything I talk about in lecture unless I specify otherwise.

Directory details: `.` and `..`

In Unix, there are some features of pathnames that you may have seen before: a leading / refers to the root directory, a leading ~ refers to your home directory, . refers to the current directory, and .. refers to the parent directory. (As an example, ~/./Desktop/../ refers to your home directory.)

As it turns out, . and .. are actually implemented as features of the filesystem. Every directory has at least two entries: an entry mapping . to the directory’s own inumber, and an entry mapping .. to the parent directory’s inumber.

Even the root directory has such entries. In the root directory’s case, however, both . and .. resolve to inumber 1 (i.e. the parent of the root directory is still the root directory).

Links

Filesystems also generally support links, which are references to other files in the filesystem.

We have been talking about hard links throughout our entire discussion of directories, though you may not have known it. A hard link is simply an entry in a directory file. (We’ve been using the term “directory entry” and will continue to do so, but hard links are functionally the same thing.) Every file has at least one hard link pointing to it (i.e. the link in its parent directory). Every directory has at least two hard links pointing to it: the link in its parent directory, and its own . directory entry.

While hard links map filenames to inumbers, soft links map filenames to other filenames. (Soft links are also called symbolic links, because they resolve to symbolic names instead of numbers.) Just as directories are just files, symbolic links are also just files. When a symbolic link is created, a new file is created, but instead of having type “regular file” or “directory,” it has type “link.” The contents of this file are the path to the file it links to.

Layers of Abstraction in Filesystems

The Unix V6 filesystem comes from the 1970s, yet, as you can see, there is already a large amount of complexity. One common paradigm for dealing with complexity is layering. I explained filesystem layering in the Assignment 1 handout, but I think it is worth repeating here:

- The hardware layer involves the actual mechanical and electrical details of making a disk work. There is so much complicated circuitry and physics involved here, and, in fact, the hardware layer is broken into many layers by the people who work on it.
- The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
- The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
- The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
- The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
- Lastly, the pathname layer implements full absolute path lookups, starting from the root directory. If you hand /usr/class/cs110/hello.txt to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it finds hello.txt.

On top of these 6 layers sit many application layers that use the filesystem without having to think about how it works.

Not only does layering provide us with a means of breaking down complexity, but it also has some nice properties if we ever want to modify the system to do something new. Let’s say we want to create a networked filesystem. Instead of having to write it from scratch, we can keep everything except for the hardware and block layers, replacing those with some layers that deal with network communication.

Unix employs this principle everywhere. As you will see later in the course, many resources are made to look like files (even though they aren’t files) so that we can control them using the file abstractions we’ve developed. Your computer interacts with your terminal window, printer, Bluetooth radio, and even (to a certain extent) CPU as if they were files, even though that is certainly not the case. As we will see towards the end of the class, the layering principle is equally pervasive in the land of networking.

System Calls

There is a reason you have probably never written code that manages raw sectors on disk: you can’t. It would be dangerous if you could; you might unintentionally corrupt some critical sectors, rendering your entire filesystem unusable. Worse, malicious code could access or alter data that it isn’t supposed to have access to. For example, unprivileged code isn’t allowed to modify /etc/passwd, or even to read /etc/shadow (the files that store account and password information on your system), but if a program were allowed to access the raw filesystem, it could circumvent those permission checks.

We need the operating system to perform these privileged operations on our behalf, and to mediate access to the filesystem so that it can block malicious behavior. We make these requests through functions known as system calls (syscalls for short).

int open(const char *filename, int flags, ...)
- This tells the OS, “hey, I would like to work with this file.”
- flags tells open how we’d like to interact with the file. (Are we reading or writing? If writing, what do we do if the file already exists? Etc.)
- If we are writing to a file that doesn’t already exist and we wish to create it, a third argument can be used to specify the new file’s permissions.
- This function returns a number, which is a file descriptor (or -1 if the file could not be opened). This fd can be passed to other filesystem-related syscalls to work with the file we’ve just opened.
ssize_t read(int fd, void *buffer, size_t count)
- Given a file descriptor (returned by open), attempt to read count bytes from the file into buffer.
- Returns the number of bytes actually read from the file, which may be fewer than count. A return value of 0 means end of file, and -1 signals an error.
ssize_t write(int fd, const void *buffer, size_t count)
- Given a file descriptor, attempt to write count bytes from buffer to the file.
- Returns the number of bytes successfully written to the file, which may be fewer than count, or -1 on error.
int close(int fd)
- Tell the operating system we’re done using a particular file. This frees the resources that the operating system was using to keep track of the file.

printf might seem like some magical core function to you, but it’s actually built on top of syscalls in ways that you can easily understand now. I can write a “hello world” program without using printf:

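#include <unistd.h>  // write, STDOUT_FILENO

int main(int argc, char *argv[]) {
    const char *output = "Hello world\n";
    // STDOUT_FILENO is the file descriptor for standard output
    write(STDOUT_FILENO, output, 12);
    return 0;
}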

(Note: 12 is the length of the output string.)

Implementing `cp`

cp (which is used to copy files from the terminal) might seem like a magical command, but it is just a program, written using primitives that you can understand. The basic approach is to read some bytes from one file, write those bytes to the output file, and repeat.

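#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    assert(argc == 3);
    const char *infile = argv[1];
    const char *outfile = argv[2];
    int infd = open(infile, O_RDONLY);
    assert(infd != -1);
    // O_EXCL makes open fail rather than overwrite an existing output file
    int outfd = open(outfile, O_WRONLY | O_CREAT | O_EXCL, 0664);
    assert(outfd != -1);

    while (true) {
        char buffer[1024];
        ssize_t count = read(infd, buffer, sizeof(buffer));
        if (count == 0) break;  // end of file
        assert(count > 0);      // read returns -1 on error

        // write may accept fewer bytes than we asked for, so keep writing
        // until everything we read has made it to the output file
        size_t numWritten = 0;
        size_t numToWrite = count;
        while (numWritten < numToWrite) {
            ssize_t n = write(outfd, buffer + numWritten, numToWrite - numWritten);
            assert(n != -1);
            numWritten += n;
        }
    }

    close(infd);
    close(outfd);
    return 0;
}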

Implementing `find`

The find program searches a directory for files with a matching name. For example, to find instances of stdio.h:

find /usr/include -name stdio.h

To implement this, we need a new syscall:

int stat(const char *pathname, struct stat *buf);

This populates a struct stat. At minimum, a stat struct will have the following fields, populated almost directly from the target file’s inode:

dev_t     st_dev     ID of device containing file
ino_t     st_ino     file serial number
mode_t    st_mode    mode of file
nlink_t   st_nlink   number of links to the file
uid_t     st_uid     user ID of file
gid_t     st_gid     group ID of file
dev_t     st_rdev    device ID (if file is character or block special)
off_t     st_size    file size in bytes (if file is a regular file)
time_t    st_atime   time of last access
time_t    st_mtime   time of last data modification
time_t    st_ctime   time of last status change
blksize_t st_blksize a filesystem-specific preferred I/O block size for
                     this object.  In some filesystem types, this may
                     vary from file to file
blkcnt_t  st_blocks  number of blocks allocated for this object

The st_mode field is of particular interest to us; it is a bit set containing (among other things) information about whether this is a regular file, a directory, or a symbolic link. We can extract that information using the S_ISREG, S_ISDIR, and S_ISLNK macros.

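#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

enum { kMaxPath = 4096 };

// path holds a directory name of the given length; it doubles as scratch
// space for building the full path of each entry we visit
static void listMatches(char *path, size_t length, const char *pattern) {
    DIR *dir = opendir(path);
    if (dir == NULL) return;
    strcpy(path + length, "/");
    length++;
    while (true) {
        struct dirent *de = readdir(dir);
        if (de == NULL) break;
        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) continue;
        // skip entries whose full path (plus a later "/" and '\0') would not fit
        if (length + strlen(de->d_name) + 2 > kMaxPath) continue;

        strcpy(path + length, de->d_name);
        struct stat st;
        if (lstat(path, &st) != 0) continue;
        if (S_ISREG(st.st_mode)) {
            if (strcmp(de->d_name, pattern) == 0) printf("%s\n", path);
        } else if (S_ISDIR(st.st_mode)) {
            listMatches(path, length + strlen(de->d_name), pattern);
        }
    }
    closedir(dir);
}

int main(int argc, char *argv[]) {
    assert(argc == 3);
    const char *directory = argv[1];

    struct stat st;
    assert(stat(directory, &st) == 0);
    assert(S_ISDIR(st.st_mode));

    const char *pattern = argv[2];
    char path[kMaxPath];
    assert(strlen(directory) + 2 <= kMaxPath);
    strcpy(path, directory);
    listMatches(path, strlen(path), pattern);
    return 0;
}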

Implementing `ls`

In addition to telling us what type a file is, st_mode also stores permission information. We can use this to reconstruct the permission strings that appear in the leftmost column of ls -l output.

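#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static const char kFlags[] = {'r', 'w', 'x'};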
static const mode_t kMasks[] = {
    S_IRUSR, S_IWUSR, S_IXUSR,
    S_IRGRP, S_IWGRP, S_IXGRP,
    S_IROTH, S_IWOTH, S_IXOTH,
};

static void updatePermissionBit(char buffer[], int pos, char ch, bool flag) {
    if (!flag) return;
    buffer[pos] = ch;
}

static void printPermissions(mode_t m) {
    char buffer[11]; // 10 + 1 = 1 + 3 * 3 + 1
    memset(buffer, '-', 11);
    buffer[10] = '\0';
    updatePermissionBit(buffer, 0, 'd', S_ISDIR(m));
    updatePermissionBit(buffer, 0, 'l', S_ISLNK(m));
    for (size_t i = 0; i < 9; i++) {
        updatePermissionBit(buffer, i + 1, kFlags[i % 3],
                m & kMasks[i]);
    }
    printf("%s ", buffer);
}

static void printName(const char *name, const struct stat *st, bool link, const char *path) {
    printf("%s", name);
    if (!link) return;
    // For a symbolic link, st_size is the length of the path stored in it
    char target[st->st_size + 1];
    // readlink doesn't put down a '\0' char, so we drop it in ourselves
    ssize_t length = readlink(path, target, st->st_size);
    if (length == -1) return;
    target[length] = '\0';
    printf(" -> %s", target);
}

static void listEntry(const char *name, const struct stat *st, bool link, const char *path) {
    printPermissions(st->st_mode);
    printName(name, st, link, path);
    printf("\n");
}

static void listDirectory(const char *name, size_t length, const struct stat *st) {
    char path[2048];
    strcpy(path, name);
    DIR *dir = opendir(path);
    if (dir == NULL) return;  // e.g. we lack permission to read the directory
    strcpy(path + length, "/");
    while (true) {
        struct dirent *de = readdir(dir);
        if (de == NULL) break;
        if (de->d_name[0] == '.') continue;  // skip hidden entries, including . and ..
        strcpy(path + length + 1, de->d_name);
        struct stat st;
        if (lstat(path, &st) != 0) continue;
        listEntry(de->d_name, &st, S_ISLNK(st.st_mode), path);
    }
    closedir(dir);
}

int main(int argc, char *argv[]) {
    struct stat st;
    const char* dir = ".";
    lstat(dir, &st);
    if (S_ISREG(st.st_mode) || S_ISLNK(st.st_mode)) {
        listEntry(dir, &st, S_ISLNK(st.st_mode), dir);
    } else if (S_ISDIR(st.st_mode)) {
        listDirectory(dir, strlen(dir), &st);
    }
    return 0;
}

The vnode, file entry, and file descriptor tables

When we open a file and get a file descriptor back, how does the operating system manage that file descriptor in association with the file we’re trying to interact with? (This is more than incidental curiosity; this material will dictate much of our discussion on interprocess communication next week.)

At the lowest level of this three-table hierarchy sits the vnode table. The design rationale is this: disk accesses are generally expensive, and if we had to read a file’s inode from disk every time we wanted a piece of that file, every file operation would be slowed down. The vnode table serves as a global cache of inodes we’re currently using. Any open file on the computer (from any running process) has an entry in the vnode table, and no file has more than one entry in the vnode table.

The file entry table stores session information about how we are interacting with a file. Every time we open a file, an entry gets created in this table. Did we open the file for reading or writing (or both)? How many bytes have we read or written so far? Each entry also stores a pointer to an entry in the vnode table, so that we can remember which file we’re working with. The operating system only maintains one file entry table on behalf of all running processes, but unlike the vnode table, a file could be represented multiple times in this table. (If you open a particular file 5 times, it will have 5 entries in this table.)

The file descriptor table stores pointers to entries in the file entry table. When open returns a file descriptor, that is an index of an element in this table. Each process has its own, independent file descriptor table.
