Lecture 2: Filesystem Recap, Intro to System Calls
Note: Reading these lecture notes is not a substitute for watching the lecture. I frequently go off script, and you are responsible for understanding everything I talk about in lecture unless I specify otherwise.
Directory details: . and ..
In Unix, there are some features of pathnames that you may have seen before: a leading / refers to the root directory, a leading ~ refers to your home directory, . refers to the current directory, and .. refers to the parent directory. (As an example, ~/./Desktop/../ refers to your home directory.)
As it turns out, . and .. are actually implemented as features of the filesystem. Every directory has at least two entries: an entry mapping . to the directory’s own inumber, and an entry mapping .. to the parent directory’s inumber.
Even the root directory has such entries. In the root directory’s case, however, both . and .. resolve to inumber 1 (i.e. the parent of the root directory is still the root directory).
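You can observe this from a program by comparing inode numbers with the stat call that shows up later in these notes. This is only a sketch for illustration; the helper name sameFile is made up for this example.

```c
#include <stdbool.h>
#include <sys/stat.h>

// Hypothetical helper: returns true if both paths resolve to the same file,
// i.e. the same inumber on the same device. Returns false if either stat fails.
static bool sameFile(const char *a, const char *b) {
  struct stat sa, sb;
  if (stat(a, &sa) == -1 || stat(b, &sb) == -1) return false;
  return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Calling sameFile("/", "/..") should report true, because the root directory is its own parent, and sameFile(".", "./.") should be true from any working directory.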
Links
Filesystems also generally support links, which are references to other files in the filesystem.
We have been talking about hard links throughout our entire discussion of
directories, though you may not have known it. A hard link is simply an entry
in a directory file. (We’ve been using the term “directory entry” and will
continue to do so, but hard links are functionally the same thing.) Every file
has at least one hard link pointing to it (i.e. the link in its parent
directory). Every directory has at least two hard links pointing to it: the
link in its parent directory, and its own . directory entry.
While hard links map filenames to inumbers, soft links map filenames to other filenames. (Soft links are also called symbolic links, because they resolve to symbolic names instead of numbers.) Just as directories are just files, symbolic links are also just files. When a symbolic link is created, a new file is created, but instead of having type “regular file” or “directory,” it has type “link.” The contents of this file are the path to the file it links to.
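Since a symbolic link is a file whose contents are a path, we can read that path back with the readlink syscall. The sketch below assumes a made-up helper name, linkTarget; it is an illustration rather than anything from the assignment code.

```c
#include <unistd.h>

// Hypothetical helper: copies the path stored inside the symbolic link at
// linkPath into target, null-terminated. Returns 0 on success, or -1 on error
// or if the stored path does not fit in the buffer.
static int linkTarget(const char *linkPath, char *target, size_t size) {
  ssize_t n = readlink(linkPath, target, size);
  if (n == -1 || (size_t)n >= size) return -1;
  target[n] = '\0';  // readlink does not null-terminate, so we do it ourselves
  return 0;
}
```

If you create a link with symlink("notes.txt", "shortcut"), then linkTarget("shortcut", ...) should hand back "notes.txt". Notice that notes.txt does not have to exist: the link stores a name, not an inumber.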
Layers of Abstraction in Filesystems
The Unix V6 filesystem comes from the 1970s, yet, as you can see, there is already a large amount of complexity. One common paradigm for dealing with complexity is layering. I explained filesystem layering in the Assignment 1 handout, but I think it is worth repeating here:
- The hardware layer involves the actual mechanical and electrical details of making a disk work. There is so much complicated circuitry and physics involved here, and, in fact, the hardware layer is broken into many layers by the people who work on it.
- The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
- The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
- The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
- The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
- Lastly, the pathname layer implements full absolute path lookups, starting from the root directory. If you hand /usr/class/cs110/hello.txt to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it finds hello.txt.
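To show how the pathname layer can lean on the filename layer, here is a toy sketch. It is not the Assignment 1 code: the in-memory table, the inumbers other than 1 for the root, and the function names lookupInDirectory and lookupPath are all invented for illustration.

```c
#include <string.h>

// A toy stand-in for directory contents. Each row says: the directory with
// inumber dir contains an entry called name that maps to inumber.
struct entry { int dir; const char *name; int inumber; };
static const struct entry kEntries[] = {
  {1, "usr", 5}, {5, "class", 9}, {9, "cs110", 12}, {12, "hello.txt", 40},
};

// Filename layer (hypothetical): find a name within one directory.
// Returns the matching inumber, or -1 if there is no such entry.
static int lookupInDirectory(int dir, const char *name, size_t nameLength) {
  for (size_t i = 0; i < sizeof(kEntries) / sizeof(kEntries[0]); i++) {
    const struct entry *e = &kEntries[i];
    if (e->dir == dir && strlen(e->name) == nameLength &&
        strncmp(e->name, name, nameLength) == 0) return e->inumber;
  }
  return -1;
}

// Pathname layer (hypothetical): resolve an absolute path one component at a
// time, starting from the root directory (inumber 1).
static int lookupPath(const char *path) {
  if (path[0] != '/') return -1;   // only absolute paths are handled
  int inumber = 1;
  const char *p = path;
  while (*p != '\0') {
    while (*p == '/') p++;         // skip separators
    if (*p == '\0') break;         // trailing slash: nothing left to look up
    size_t length = strcspn(p, "/");  // length of the next component
    inumber = lookupInDirectory(inumber, p, length);
    if (inumber == -1) return -1;  // some component was not found
    p += length;
  }
  return inumber;
}
```

With this table, lookupPath("/usr/class/cs110/hello.txt") walks through inumbers 1, 5, 9, and 12 before returning 40. The pathname layer never looks inside a directory itself; it only asks the layer below.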
On top of these 6 layers sit many application layers that use the filesystem without having to think about how it works.
Not only does layering provide us with a means of breaking down complexity, but it also has some nice properties if we ever want to modify the system to do something new. Let’s say we want to create a networked filesystem. Instead of having to write it from scratch, we can keep everything except for the hardware and block layers, replacing those with some layers that deal with network communication.
Unix employs this principle everywhere. As you will see later in the course, many resources are made to look like files (even though they aren’t files) so that we can control them using the file abstractions we’ve developed. Your computer interacts with your terminal window, printer, Bluetooth radio, and even (to a certain extent) CPU as if they were files, even though that is certainly not the case. As we will see towards the end of the class, the layering principle is equally pervasive in the land of networking.
System Calls
There is a reason you have probably never written code that manages raw sectors
on disk: you can’t. It would be dangerous if you could; you might
unintentionally corrupt some critical sectors, rendering your entire filesystem
unusable. Worse, malicious code could access or alter data that it isn’t
supposed to have access to. For example, unprivileged code isn’t allowed to modify /etc/passwd, or even to read /etc/shadow (files that store account and password information on your system), but if a program were allowed to access the raw filesystem, it could circumvent those permission checks.
We need the operating system to perform these privileged operations on our behalf, and to mediate access to the filesystem so that it can block malicious behavior. We interact with the operating system, asking it to do privileged operations on our behalf, through functions known as system calls (syscalls for short).
Filesystem-related syscalls
int open(const char *filename, int flags, ...)
- This tells the OS, “hey, I would like to work with this file.”
- flags tells open how we’d like to interact with the file. (Are we reading or writing? If writing, what do we do if the file already exists? Etc.)
- If we are writing to a file that doesn’t already exist and we wish to create it, a third argument can be used to specify the new file’s permissions.
- This function returns a number, which is a file descriptor (or -1 if the file could not be opened). This fd can be passed to other filesystem-related syscalls to work with the file we’ve just opened.
ssize_t read(int fd, void *buffer, size_t count)
- Given a file descriptor (returned by open), attempt to read count bytes from the file into buffer.
- Returns the number of bytes actually read from the file, which may be fewer than count. A return value of 0 means we have reached the end of the file, and -1 means an error occurred.
ssize_t write(int descriptor, const void *buf, size_t count)
- Given a file descriptor, attempt to write count bytes from buf to the file.
- Returns the number of bytes successfully written to the file, which may be fewer than count, or -1 if an error occurred.
int close(int descriptor)
- Tell the operating system we’re done using a particular file. This frees the resources that the operating system was using to keep track of the file.
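Here is a rough sketch that puts all four syscalls together: one helper writes a string to a file, and another reads it back. The helper names writeText and readText are made up, and O_TRUNC (empty the file if it already exists) is just one reasonable choice of flags.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: creates (or empties) path and writes text into it.
// Returns 0 on success, -1 on any failure.
static int writeText(const char *path, const char *text) {
  int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd == -1) return -1;
  size_t length = strlen(text);
  size_t written = 0;
  while (written < length) {  // write may accept fewer bytes than we asked for
    ssize_t n = write(fd, text + written, length - written);
    if (n == -1) { close(fd); return -1; }
    written += n;
  }
  return close(fd);
}

// Hypothetical helper: reads up to size - 1 bytes of path into buffer and
// null-terminates it. Returns the number of bytes read, or -1 on failure.
static ssize_t readText(const char *path, char *buffer, size_t size) {
  if (size == 0) return -1;
  int fd = open(path, O_RDONLY);
  if (fd == -1) return -1;
  size_t total = 0;
  while (total < size - 1) {
    ssize_t n = read(fd, buffer + total, size - 1 - total);
    if (n == -1) { close(fd); return -1; }
    if (n == 0) break;  // end of file
    total += n;
  }
  buffer[total] = '\0';
  close(fd);
  return (ssize_t)total;
}
```

Both helpers loop because read and write are allowed to transfer fewer bytes than requested; the return value tells us how far we actually got.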
Many IO-related stdlib functions are layered on top of syscalls
printf might seem like some magical core function to you, but it’s actually built on top of syscalls in ways that you can easily understand now. I can write a “hello world” program without using printf:
#include <unistd.h>  // write, STDOUT_FILENO

int main(int argc, char *argv[]) {
  const char *output = "Hello world\n";
  write(STDOUT_FILENO, output, 12);
  return 0;
}
(Note: 12 is the length of the output string, not counting the terminating null character.)
Implementing cp
cp (which is used to copy files from the terminal) might seem like a magical command, but it is just a program, written using primitives that you can understand. The basic approach is to read some bytes from one file, write those bytes to the output file, and repeat.
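#include <assert.h>   // assert
#include <fcntl.h>    // open, O_RDONLY, O_WRONLY, O_CREAT, O_EXCL
#include <stdbool.h>  // true
#include <unistd.h>   // read, write, close

int main(int argc, char *argv[]) {
  assert(argc == 3);
  const char *infile = argv[1];
  const char *outfile = argv[2];
  int infd = open(infile, O_RDONLY);
  assert(infd != -1);
  // O_EXCL makes open fail if the output file already exists
  int outfd = open(outfile, O_WRONLY | O_CREAT | O_EXCL, 0664);
  assert(outfd != -1);
  while (true) {
    char buffer[1024];
    ssize_t count = read(infd, buffer, sizeof(buffer));
    assert(count != -1);
    if (count == 0) break;  // end of file
    // In a loop, try writing the bytes we read out to the output file,
    // until all of them get written
    size_t numWritten = 0;
    size_t numToWrite = count;
    while (numWritten < numToWrite) {
      ssize_t n = write(outfd, buffer + numWritten, numToWrite - numWritten);
      assert(n != -1);  // without this check, a failed write would corrupt numWritten
      numWritten += n;
    }
  }
  close(infd);
  close(outfd);
  return 0;
}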
Implementing find
The find program searches a directory for files with a matching name. For example, to find instances of stdio.h:
find /usr/include -name stdio.h
To implement this, we need a new syscall:
int stat(const char *pathname, struct stat *buf);
This populates a struct stat. At minimum, a stat struct will have the following fields, populated almost directly from the target file’s inode:
dev_t st_dev ID of device containing file
ino_t st_ino file serial number
mode_t st_mode mode of file
nlink_t st_nlink number of links to the file
uid_t st_uid user ID of file
gid_t st_gid group ID of file
dev_t st_rdev device ID (if file is character or block special)
off_t st_size file size in bytes (if file is a regular file)
time_t st_atime time of last access
time_t st_mtime time of last data modification
time_t st_ctime time of last status change
blksize_t st_blksize a filesystem-specific preferred I/O block size for
this object. In some filesystem types, this may
vary from file to file
blkcnt_t st_blocks number of blocks allocated for this object
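The st_mode field is of particular interest to us; it is a bit set containing (among other things) information about whether this is a regular file/directory/symbolic link. We can extract that information using the S_ISDIR, S_ISREG, and S_ISLNK macros. (stat follows symbolic links and describes the file they point to; the closely related lstat takes the same arguments but describes the link itself, so it is the call to use when we want S_ISLNK to detect links.)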
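#include <assert.h>    // assert
#include <dirent.h>    // opendir, readdir, closedir
#include <stdbool.h>   // true
#include <stdio.h>     // printf
#include <string.h>    // strcpy, strcmp, strlen
#include <sys/stat.h>  // stat, lstat, S_ISREG, S_ISDIR

// path is a buffer whose first length characters name the directory to search;
// the rest of the buffer is scratch space for building the paths of its children.
static void listMatches(char *path, size_t length, const char *pattern) {
  DIR *dir = opendir(path);
  if (dir == NULL) return;  // e.g. we lack permission to read this directory
  strcpy(path + length, "/");
  length++;
  while (true) {
    struct dirent *de = readdir(dir);
    if (de == NULL) break;
    if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0) continue;
    strcpy(path + length, de->d_name);
    struct stat st;
    // lstat does not follow symbolic links, so we never recurse through one
    if (lstat(path, &st) == -1) continue;
    if (S_ISREG(st.st_mode)) {
      if (strcmp(de->d_name, pattern) == 0) printf("%s\n", path);
    } else if (S_ISDIR(st.st_mode)) {
      listMatches(path, length + strlen(de->d_name), pattern);
    }
  }
  closedir(dir);
}

int main(int argc, char *argv[]) {
  assert(argc == 3);
  const char *directory = argv[1];
  struct stat st;
  if (stat(directory, &st) == -1) return 1;
  assert(S_ISDIR(st.st_mode));
  const char *pattern = argv[2];
  char path[4096];  // no bounds checking: assumes every path fits in this buffer
  strcpy(path, directory);
  listMatches(path, strlen(path), pattern);
  return 0;
}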
Implementing ls
In addition to telling us what type a file is, st_mode also stores permissioning information. We can use this to reconstruct the permission strings that appear in the leftmost column of ls -l output.
#include <dirent.h>    // opendir, readdir, closedir
#include <stdbool.h>   // bool, true
#include <stdio.h>     // printf
#include <string.h>    // memset, strcpy, strlen
#include <sys/stat.h>  // lstat, S_IS* macros, permission masks
#include <unistd.h>    // readlink

static const char kFlags[] = {'r', 'w', 'x'};
static const mode_t kMasks[] = {
  S_IRUSR, S_IWUSR, S_IXUSR,
  S_IRGRP, S_IWGRP, S_IXGRP,
  S_IROTH, S_IWOTH, S_IXOTH,
};

static void updatePermissionBit(char buffer[], int pos, char ch, bool flag) {
  if (!flag) return;
  buffer[pos] = ch;
}

// Prints a string such as "drwxr-xr-x": one type character, followed by
// rwx triples for the user, the group, and everyone else
static void printPermissions(mode_t m) {
  char buffer[11]; // 10 + 1 = 1 + 3 * 3 + 1
  memset(buffer, '-', 11);
  buffer[10] = '\0';
  updatePermissionBit(buffer, 0, 'd', S_ISDIR(m));
  updatePermissionBit(buffer, 0, 'l', S_ISLNK(m));
  for (size_t i = 0; i < 9; i++) {
    updatePermissionBit(buffer, i + 1, kFlags[i % 3], m & kMasks[i]);
  }
  printf("%s ", buffer);
}

static void printName(const char *name, const struct stat *st, bool link, const char *path) {
  printf("%s", name);
  if (!link) return;
  char target[st->st_size + 1];  // for a symbolic link, st_size is the length of the stored path
  ssize_t length = readlink(path, target, sizeof(target) - 1);
  if (length == -1) return;
  target[length] = '\0'; // readlink doesn't put down '\0' char, drop it in ourselves
  printf(" -> %s", target);
}

static void listEntry(const char *name, const struct stat *st, bool link, const char *path) {
  printPermissions(st->st_mode);
  printName(name, st, link, path);
  printf("\n");
}

static void listDirectory(const char *name, size_t length, const struct stat *st) {
  char path[2048];  // no bounds checking: assumes every path fits in this buffer
  strcpy(path, name);
  DIR *dir = opendir(path);
  if (dir == NULL) return;
  strcpy(path + length, "/");
  while (true) {
    struct dirent *de = readdir(dir);
    if (de == NULL) break;
    if (de->d_name[0] == '.') continue;  // like ls, skip hidden entries (including . and ..)
    strcpy(path + length + 1, de->d_name);
    struct stat entry;  // named entry so that it doesn't shadow the st parameter
    if (lstat(path, &entry) == -1) continue;
    listEntry(de->d_name, &entry, S_ISLNK(entry.st_mode), path);
  }
  closedir(dir);
}

int main(int argc, char *argv[]) {
  struct stat st;
  const char* dir = ".";
  if (lstat(dir, &st) == -1) return 1;
  if (S_ISREG(st.st_mode) || S_ISLNK(st.st_mode)) {
    listEntry(dir, &st, S_ISLNK(st.st_mode), dir);
  } else if (S_ISDIR(st.st_mode)) {
    listDirectory(dir, strlen(dir), &st);
  }
  return 0;
}
The vnode, file entry, and file descriptor tables
When we open a file and get a file descriptor back, how does the operating system manage that file descriptor in association with the file we’re trying to interact with? (This is more than incidental curiosity; this material will dictate much of our discussion on interprocess communication next week.)
- At the lowest level of this three-table hierarchy sits the vnode table. The design rationale is: Disk accesses are generally expensive, and if we need to read the inode for a file every time we want a piece of that file, we’re slowing down our disk access. The vnode table serves as a global cache of inodes we’re currently using. Any open file on the computer (from any running process) has an entry in the vnode table, and no file has more than one entry in the vnode table.
- The file entry table stores session information about how we are interacting with a file. Every time we open a file, an entry gets created in this table. Did we open the file for reading or writing (or both)? How many bytes have we read or written so far? Each entry also stores a pointer to an entry in the vnode table, so that we can remember which file we’re working with.
  The operating system only maintains one file entry table on behalf of all running processes, but unlike the vnode table, a file could be represented multiple times in this table. (If you open a particular file 5 times, it will have 5 entries in this table.)
- The file descriptor table stores pointers to entries in the file entry table. When open returns a file descriptor, that is an index of an element in this table. Each process has its own, independent file descriptor table.
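One way to see the file entry table at work is to watch read offsets. In the sketch below (helper names are invented), opening the same file twice creates two file entries with independent offsets. The dup syscall, which these notes have not covered, creates a second descriptor that points at an existing file entry, so it shares that entry's offset.

```c
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: reads one byte from fd and returns it,
// or -1 on error or end of file.
static int readByte(int fd) {
  unsigned char ch;
  return read(fd, &ch, 1) == 1 ? ch : -1;
}

// Hypothetical experiment: fills out[0..2] with one byte read, in order, from
//   fd1: a first open of path         (file entry A)
//   fd2: a second open of path        (file entry B, its own offset)
//   fd3: a dup of fd1                 (file entry A again, shared offset)
// Returns 0 on success, -1 if any descriptor could not be created.
static int readThroughThreeDescriptors(const char *path, int out[3]) {
  int fd1 = open(path, O_RDONLY);
  int fd2 = open(path, O_RDONLY);
  int fd3 = (fd1 == -1) ? -1 : dup(fd1);
  int ok = fd1 != -1 && fd2 != -1 && fd3 != -1;
  if (ok) {
    out[0] = readByte(fd1);
    out[1] = readByte(fd2);
    out[2] = readByte(fd3);
  }
  if (fd1 != -1) close(fd1);
  if (fd2 != -1) close(fd2);
  if (fd3 != -1) close(fd3);
  return ok ? 0 : -1;
}
```

On a file containing abc, the three reads should produce a, a, and b: fd2 starts from the beginning because it has its own file entry, while fd3 picks up where fd1 left off because both descriptors point at the same entry.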