Assignment 2: Reading UNIX Filesystems

Thanks to Mendel Rosenblum for this wonderful assignment. Writeup by Jerry Cain, with modifications by Nick Troccoli, Chris Gregg, and Ryan Eberhardt.

An archaeological expedition in Murray Hill, NJ has uncovered some magnetic disk drives from the mid-1970s. After considerable effort, the dig’s technical expert read the contents of the drives, and for each disk drive, she created a local file to be an exact bitwise copy of the disk’s data.

The archaeologists determined the data on the disks was stored using the Version 6 Unix filesystem. Sadly, none of the archaeologists are familiar with that filesystem, so they haven’t been able to read the image contents. Your task is to write a program that understands the Unix v6 filesystem to extract the file system data.

Due Date: Wednesday, July 7, 2021 at 11:59 p.m. pacific time

Overview of Unix v6 FS and this assignment

The key systems principle we are trying to drive home this week is that of layering: one way to manage complexity in systems is to break it down into layers that sit on top of each other. There are several layers involved in the Unix v6 filesystem:

The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
Lastly, the pathname layer implements full absolute path lookups, starting from the root directory. If you hand /usr/class/cs110/hello.txt to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it finds hello.txt.

Any software accessing the filesystem will sit on top of the file and pathname layers. If we’re trying to write an application that does meaningful work, we almost certainly don’t want to be bogged up thinking about the organization of bytes on disk, so we leave that to the aforementioned layers beneath us.

In this assignment, you will implement the latter 4 of these layers. By the end of the assignment, your program will be able to list and read files from a disk formatted with the Unix v6 filesystem.

Notes

Expect that it will take some time to figure out the starter code before you can actually start doing anything. We throw a lot of files at you, and it isn’t trivial to figure out what’s going on and what’s expected of you. This may be frustrating, but we want to push you to get better at reading code and finding your way around an existing system; such skills are critical if you want to work on systems comprised of hundreds of thousands or millions of lines of code. Take time to read through all the starter code and to figure out how the files fit together. Give yourself at least several hours before expecting yourself to be productive, and you’ll be less frustrated.

There are several files in the starter code that you will need to jump back and forth between, and future assignments will also involve work with many files. It may help you to sketch out how the files interact on pencil and paper, or to make some notes for yourself somewhere. If you are using VSCode, command/ctrl-click will be very helpful for jumping between definitions and uses of structs and functions. If you are using vim, the ctags plugin may be helpful.

Unix v6 file system supplement

If you would like an additional reference, Section 2.5 of the Salzer and Kaashoek book contains most of what you need to know about the Unix v6 file system in order to do this assignment.

Working with the assignment repo

Getting started

All coding should be done on the myth cluster machines (accessible by running ssh [email protected]), as that’s where we’ll be testing your submission. (If you’ve never used myth before, see essential resources from CS 107. Do you live very far from campus or have a spotty internet connection? We are prototyping a system that would let you work locally on your own machine, instead of needing to connect to myth. Let us know if you would be interested in helping to try this out.

To get your own copy of the starter code, you should clone the master git repository we’ve set up for you by typing:

git clone /usr/class/cs110/repos/assign2/$USER assign2

Doing so will create an assign2 directory within your own space, and you can descend into that directory to land on the collection of files you’ll be modifying to arrive at a working product.

Building, running, testing, and committing local changes

You can run the project even before you’ve written any code (though, of course, it won’t produce the correct output).

To build the project, cd into your local repository and type

make

To test your work, cd into your local repository and type

./diskimageaccess <options> <diskimagePath>

<diskimagePath> should be the path to one of the test disks located in ./samples/testdisks/. We currently have three test disks: basicDiskImage, depthFileDiskImage, and dirFnameSizeDiskImage.

diskimageaccess recognizes two flags as valid <options>:

-i: test the inode and file layers
-p: test the filename and pathname layers

For example, to run both the inode and filename tests on the basic disk, you could run

./diskimageaccess -ip ./samples/testdisks/basicDiskImage

You can visually compare your output to that generated by my own solution by typing

./samples/diskimageaccess_soln <options> <diskimagePath>

The expected output for running diskimageaccess on each disk image X is stored in the X.gold file inside the testdisks directory. We are leveraging the sanitycheck tool, which you may be familiar with from CS 107. In particular, if you type

./tools/sanitycheck

at the command prompt while in your assign2 directory, the sanitycheck script will exercise your implementation in precisely the same way our own grading scripts will. If you pass all sanitycheck tests, you should pass most of our functionality grading tests, although sanitycheck isn’t guaranteed to be comprehensive.

Submitting

When time comes for you to submit, simply type

./tools/submit

Submit early and submit often! You can resubmit as many times as you would like before the deadline. We will only grade your most recent submission.

Starter code

This section offers an explanation of each file in the starter code. See “Working on the Assignment” below for a plan of attack once you have oriented yourself with the codebase.

The files we provide you fall into three categories.

The Version 6 Unix header files (filsys.h, ino.h, direntv6.h)

These define structs that you will be using quite frequently. The structs are explained in the “Header Files and Structures” section below. You shouldn’t need to modify these files.

The test harness (diskimageaccess.c, chksumfile.[ch], unixfilesystems.[ch])

These files provide the infrastructure for building your code and running our tests against it. You are welcome to modify these files for testing, if you would like, but you shouldn’t need to.

File system module (diskimg.[ch], inode.[ch], file.[ch], pathname.[ch])

The test code we give you interfaces with the file system using a layered API that we’ve already designed. For each of the layers that you’ll need to implement for this assignment, there is a header file that defines the interface to the layer and a corresponding .c file that should contain the implementation.

The layers are:

Block layer (diskimg.[ch])

This defines and implements the interface for reading and writing sectors (note that the words block and sector are used interchangeably) on the disk image. We give you an implementation of this layer that should be sufficient for this assignment, so you don’t need to modify these files.

inode layer (inode.[ch])

This defines and implements the interface for reading the file system’s inodes. This includes the ability to look up inodes by inumber and to get the block/sector number of the nth block of the inode’s data. You will be modifying these files.

File layer (file.[ch])

This defines and implements the interface for reading blocks of data from a file by specifying its inumber. This is one of the two layers our tests explicitly exercise. You will be modifying these files.

Filename layer (directory.[ch])

This defines and implements the interface for implementing Unix directories on top of files. Its primary function is to get information about a single directory entry. You will be modifying these files.

Pathname layer (pathname.[ch])

This defines and implements the interface to look up a file by its absolute pathname. This is the second of the two layers we explicitly test. You will be modifying these files.

You are welcome to modify any of the files we give you if you would like. When making changes in the test harness, do so in a backward compatible way so that the testing scripts we give you continue to work. In other words, it’s fine to add testing and debugging code, but make sure that when we run your program with -i and -p, its output has the same format with or without your changes, so our automated tests won’t break.

Header files and structures

In the starter code we’ve provided C structs corresponding to the file system’s on-disk data structures. These structs have the same layout in memory as the structures have on disk. They include:

struct filsys (filsys.h)

Corresponds to the superblock of the file system. This is a slightly modified copy of the header file from Version 6 Unix.

struct inode (ino.h)

Corresponds to a single inode. Again this comes from Version 6 of Unix, with some small modifications.

struct direntv6 (direntv6.h)

Corresponds to a directory entry. This was copied from section 2.5 in the textbook.

In addition, unixfilesystem.h contains a description of the file system layout, including the sector address of the superblock and of the start of the inode region.

Working on the Assignment

Your job is to complete the implementation of 4 layers of the filesystem: the “inode layer”, “file layer”, “directory layer”, and “pathname layer”. Each of these layers has 2 corresponding files, and layers build on each other. You will need to modify the .c files for each layer. We also provide a fully-implemented “block layer” in diskimg.[ch] that provides the code for reading and writing sectors (note that the words block and sector are used interchangeably here) on the disk image.

As you implement them, consider what functions from lower layers you have already implemented and how they can help! We recommend implementing them in the following order.

Testing Recommendations: As you work, we highly recommend testing incrementally and ensuring each function is working before moving on. Sanity check can be helpful at certain points, but it only tests certain layers directly, and it also omits some output from the program in order to do an output comparison. To view error messages or other testing output you are printing, make sure to run the diskimageaccess program outside of sanity check. You can see more specific testing recommendations for each function below.

Inode Layer (`inode.[ch]`)

This defines and implements the interface for getting information about an inode and the blocks its data is stored in. There are two functions you must implement:

`inode_iget`

This function reads the data for the inode with the specified inumber from the filesystem and fills in an inode struct with that information.

Testing Recommendation: One way to test this function in isolation is to run the diskimageaccess test program with the -i flag in GDB, with a breakpoint on inode_iget. For a given call, inspect the inumber being passed in, and look at the corresponding .gold file for the disk image you are testing with to see what the mode and size for that inode should be (e.g. samples/testdisks/basicDiskImage.gold for the basic disk image). Then use gdb to confirm that the inode your function is reading in has these same values. Note that you may need to print out the inode’s mode in hex (try p/x myInode.i_mode in GDB) and you may need to call the inode_getsize function to print out the correct inode size (try call inode_getsize(&myInode) in GDB).

`inode_indexlookup`

This function returns the block number of the file data block at the specified index. What this means is that, if we take just the blocks for this file holding actual file data (e.g. ignoring indirect blocks) and index them sequentially, the fileBlockIndex parameter to this function specifies which of those blocks we must return the block/sector number for. In other words, the returned block number will always be for a block that stores actual file data. As a concrete example, let’s say a file is 1,536 (512 * 3) bytes big. This means it takes up 3 blocks of data. Let’s say furthermore that the block numbers it is using are 35, 65 and 21. This means in the inode it stores these three block numbers, and if you fetched blocks 35, 65 and 21 from disk you would have data for the entire file. But for inode_indexlookup, fileBlockIndex would be either 0, 1 or 2 - it’s specifying the index into the file’s data of the block to get, and it’s your job to return the disk block number it corresponds to. For instance, if fileBlockIndex was 0, you would return 35. If fileBlockIndex was 1, you would return 65, etc.

This implementation needs to manage direct, singly indirect, and doubly indirect block numbers. Translating for direct block numbers is straightforward (as we did in the example above), but translating for indirect block numbers is hard to get right.

Hint 1: Understand that indirect block numbers identify other blocks that contain 256 two-bytes integers (best managed via the uint16_t type).
Hint 2: Understand that doubly indirect block numbers lead to singly indirect block numbers, which lead to block numbers. Make sure whatever code you write to translate singly indirect block numbers also contributes (without cutting and pasting or otherwise copying it) to the process used to ultimately convert doubly indirect block numbers.

Testing Recommendation: One way to test this function in isolation is to write temporary testing code in file_getblock. Choose an inumber from the .gold output (e.g. samples/testdisks/basicDiskImage.gold for the basic disk image) that is either a small or large file depending on what format you’d like to test. Then fetch that inode and try fetching each of its file blocks (0, 1, 2,…) via inode_indexlookup and verify that each of the numbers is correct. For a small file, you can use GDB to examine that inode’s list of block numbers and ensure the right one is returned. For example, if you have an inode with blocks {45, 23, 14, …} then file block index 0 should return 45, 1 should return 23, 2 should return 14, etc. For a large file, you could write additional testing code to manually fetch the block(s) that should contain that file block to ensure you are getting the right answer. For example, if you have an inode with blocks {45, 23, 14, …} then for file block index 256 you would fetch block 23 in your test code so that you could verify the return value from inode_indexlookup matches its first number.

File Layer (`file.[ch]`)

The inode layer supports getting information about a file. This layer’s job is to actually read the data a file contains, in chunks. This is one of the two layers our tests explicitly exercise. After this step, your code should pass the inode level functionality tests. There is one function you must implement:

`file_getblock`

This function’s job is to read the file data block at the specified index. In other words, whereas inode_indexlookup finds which block number contains the data, this function actually reads in that data. This function returns the number of meaningful bytes in a block/sector that contribute to the payload of some file (be it a regular file or a directory). In other words, it returns the number of bytes in that block that are actually used to store data for the file. As an example, if the file size happens to be a multiple of 512, then this function always returns 512. Alternatively, if a file isn’t a multiple of 512 and we have to get the payload stored in the very last block, you should expect to return a number that’s less than 512 (because only a part of what’s stored in the block is relevant).

Testing Recommendation: This function is directly tested by sanity check. If any tests are failing, try running them outside of sanity check for more information. If your inode information isn’t matching, revisit your inode_iget function. If your checksums aren’t matching, identify the inode that is outputting incorrectly and check the .gold file for more information about it. Pay close attention to whether that inode represents a small or large file; that could give you a hint as to where your logic might be going wrong!

Directory Layer (`directory.[ch]`)

This layer implements looking up a file with a certain name in a directory. There is one function you must implement:

####directory_findname

This function’s job is to get the directory entry (defined as struct direntv6) from the given directory that matches the given name. As a hint, your implementation should make use of strncmp instead of strcmp, because some of the stored directory names aren’t '\0'-terminated. They are, however, guaranteed to be of length 14 or less. Check out the man page for strncmp and strcmp so you understand the difference.

Testing Recommendation: One way to test this function in isolation is to write temporary testing code in pathname_lookup. Open the .gold output for the test disk (e.g. samples/testdisks/basicDiskImage.gold and choose a file or folder to test with that is directly contained within the root directory (/). For example, for the basic disk image, these would include “bigfile”, “medfile”, “o”, “verybig”, “CVS” and “foo”. For any of these, add testing code that calls directory_findname to lookup that file name in directory inumber 1 (the root directory). You can inspect that the function fetches the correct struct direntv6 - specifically, the name should match, and the inumber should match the inumber listed in the .gold file (E.g. “Path /verybig 5 mode 0x91ff size 7340032 checksum” means verybig was inode 5). Additionally, you can take the inumber from that direntv6, fetch the inode with inode_iget, and verify its fields match the remainder of the listed .gold output, specifically the mode and size, like in the testing for inode_iget. You can further expand this testing method to test any of the file lookups by hand as long as you identify the correct dirinumber (e.g. in the basic disk image you could try looking up “Repository” within dirinumber 6 (/CVS), “Entries” within dirinumber 12 (/foo/CVS), etc.).

Pathname Layer (`pathname.[ch]`)

The directory layer supports looking up a file in a single directory. This layer’s job is to look up a file given its entire absolute path, potentially nested in multiple directories. This is the second of the two layers we explicitly test. Refer to Section 2.5.6 of the Salzer and Kaashoek book for this part. Afterward, all the pathname tests should pass. There is one function you must implement:

`pathname_lookup`

This function takes in an absolute path referring to a file or directory and returns the inumber of that specified file or directory. The function strsep will come in handy here!

Testing Recommendation: This function is directly tested by sanity check. If any tests are failing, try running them outside of sanity check for more information. If your inode information isn’t matching, try tracing the lookup by hand to make sure that the correct step is taken each time (e.g. / to /foo, /foo to /foo/CVS, etc.). The .gold output files show the inumbers for various path depths, so if you are looking up /a/b/c you can check that your lookup for /a, /a/b and /a/b/c are all correct.

Implementation details and tips

Legacy of an old machine

Back in the 1970s, storage space – both on disk and in main memory – was at a premium. As a result, the Unix v6 file system goes to lengths to reduce the size of data it stores. You’ll notice that many integer values in the structs we provided are stored using only 16 bits, rather than today’s more standard 32 or 64. (In our code we use the type int16_t from stdint.h to get a 16-bit integer, but back in the ’70s, the C int type was 16 bits wide.)

In another space-saving move, the designers of the file system stored the inode’s size field as a 24-bit integer. There’s no 24-bit integer type in C, so we represent this value using two fields in the inode struct: i_size1, which contains the least-significant 16 bits of the value, and i_size0, which contains the most-significant 8 bits of the value. We provide a function inode_getsize in inode.c that assembles these two fields into a normal C integer for you.

The first inode

Since there is no inode with an inumber of 0, the designers of the file system decided not to waste the 32 bytes of disk space to store it. The first inode in the first inode block has inumber 1; this inode corresponds to the root directory for the file system. (See unixfilesystem.h for details.)

Be careful not to assume that the first inode has an inumber of 0! Off-by-one errors are the worst.

inode’s `i_mode`

The 16-bit integer i_mode in the inode struct isn’t really a number; rather, the individual bits of the field indicate various properties of the inode. ino.h contains #defines which describe what each bit means.

For instance, we say an inode is allocated if it points to an existing file. The most-significant bit (i.e. bit 15) of i_mode indicates whether the inode is allocated or not. So the C expression (i_mode & IALLOC) == 0 is true if the inode is unallocated and false otherwise.

Similarly, bit 12 of i_mode indicates whether the file uses the large file mapping scheme. So if (i_mode & ILARG) != 0 , then the inode’s i_addr fields point at indirect and doubly-indirect blocks rather than directly at the data blocks.

Bits 14 and 13 form a 2-bit wide field specifying the type of file. This field is 0 for regular files and 2 (i.e. binary 10, or the constant IFDIR) for directories. So the expression (i_mode & IFMT) == IFDIR is true if the inode is a directory, and false otherwise.

Additional Tips and Tricks

Constants: make sure to use provided constants, and define your own constants, where appropriate. There are calculations throughout your code, and your goal should be to make them as easy to follow as possible. For example, use the constants in unixfilesystem.h where appropriate!
Declaring Buffers: When creating a buffer to store data, carefully consider the type of the buffer. In places like inode_iget and directory_findname, we know what type of data we are reading in, so you should not use void * and casting, and instead declare the type as specifically as you can. This is a very common style deduction, so consider this carefully!

Grading

We will grade your assignment based on a combination of test results and code quality.

Code test results: 60 points
Clean build and clean valgrind reports (no memory leaks or errors): 6 points
Code quality: graded on bucket system [exceptional, solid, minor-problems, major-problems, etc.]

First, we’ll run your code on a set of test disk images and check that your program gives the expected output. Then, we’ll read through your code and give you comments on style and overall quality. We’ll look for places where you inappropriately break the layering of the file system or make other mistakes that don’t necessarily cause your program to generate the wrong output. We’ll also look for common C programming errors such as memory leaks and use of the heap where stack allocation is more appropriate, or places when you used void* pointers when you could have used stronger types. Finally, we’ll check that your code is easy to read: It should be well-formatted, organized, and be easy to follow.

CS 110