Assignment 1: Reading UNIX Filesystems

Thanks to Mendel Rosenblum for this wonderful assignment. This is a real system, people!

This handout was adapted from Jerry Cain’s Spring 2018 offering.

An archaeological expedition in Murray Hill, NJ has uncovered some magnetic disk drives from the mid-1970s. After considerable effort, the dig’s technical expert read the contents of the drives, and for each disk drive, she created a local file to be an exact bitwise copy of the disk’s data.

The archaeologists determined the data on the disks was stored using the Version 6 Unix file system. Sadly, none of the archaeologists made it through CS110, so they haven’t been able to read the image contents. Your task is to write a program that understands the Unix v6 file system to extract the file system data.

Due Date: Friday, July 6, 2018 at 11:59 p.m.

Overview of Unix v6 FS and this assignment

The key systems principle we are trying to drive home this week is that of layering: one way to manage complexity in systems is to break it down into layers that sit on top of each other. There are several layers involved in the Unix v6 filesystem:

The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
Lastly, the pathname layer implements full absolute path lookups, starting from the root directory. If you hand /usr/class/cs110/hello.txt to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it finds hello.txt.

Any software accessing the filesystem will sit on top of the file and pathname layers. If we’re trying to write an application that does meaningful work, we almost certainly don’t want to be bogged up thinking about the organization of bytes on disk, so we leave that to the aforementioned layers beneath us.

In this assignment, you will implement the latter 4 of these layers. By the end of the assignment, your program will be able to list and read files from a disk formatted with the Unix v6 filesystem.

Notes

This assignment is usually given for 1 week, but I am giving you almost two weeks so that you have time to get accustomed to working with the development environment and to brush up on your C skills. Get help early if you need it, and don’t get too used to this pacing; future assignments will only be out for 1 week.

Also, expect that it will take some time to figure out the starter code before you can actually start doing anything. We throw a lot of files at you, and it isn’t trivial to figure out what’s going on and what’s expected of you. This may be frustrating, but we want to push you to get better at reading code and finding your way around an existing system; such skills are critical if you want to work on systems comprised of hundreds of thousands or millions of lines of code. Take time to read through all the starter code and to figure out how the files fit together. Give yourself at least an hour before expecting yourself to be productive, and you’ll be less frustrated.

There are several files in the starter code that you will need to jump back and forth between, and future assignments will also involve work with many files. It pays to spend time becoming more proficient with your tools (particularly your terminal emulator and editor) so that you can work efficiently and don’t need to restart vim every time you want to switch to a different file. I’m also working on writing a handout with some misc tips and resources for working with your tools; I will post a link on the course homepage, as well as here, when I’m finished.

Working with the assignment repo

Getting started

All coding should be done on the myth cluster machines (accessible by running ssh [email protected]), as that’s where we’ll be testing your submission. (If you’ve never used myth before, see essential resources from CS 107. You can also read tips I wrote a few years ago, some of which may be out of date.) To get your own copy of the starter code, you should clone the master git repository we’ve set up for you by typing:

git clone /usr/class/cs110/repos/assign1/$USER assign1

Doing so will create an assign1 directory within your own space, and you can descend into that directory to land on the collection of files you’ll be modifying to arrive at a working product.

Building, running, testing, and committing local changes

You can run the project even before you’ve written any code (though, of course, it won’t produce the correct output).

To build the project, cd into your local repository and type

make

To test your work, cd into your local repository and type

./diskimageaccess <options> <diskimagePath>

<diskimagePath> should be the path to one of the test disks located in ./samples/testdisks/. We currently have three test disks: basicDiskImage, depthFileDiskImage, and dirFnameSizeDiskImage.

diskimageaccess recognizes two flags as valid <options>:

-i: test the inode and file layers
-p: test the filename and pathname layers

For example, to run both the inode and filename tests on the basic disk, you could run

./diskimageaccess -ip ./samples/testdisks/basicDiskImage

You can visually compare your output to that generated by my own solution by typing

./samples/diskimageaccess_soln <options> <diskimagePath>

The expected output for running diskimageaccess on each disk image X is stored in the X.gold file inside the testdisks directory. We are leveraging the sanitycheck tool, which you may be familiar with from CS 107. In particular, if you type

./tools/sanitycheck

at the command prompt while in your assign1 directory, the sanitycheck script will exercise your implementation in precisely the same way our own grading scripts will. We won’t always expose all of the functionality tests, but we are for Assignment 1. (Read: if your sanitycheck passes all tests and runs clean under valgrind, you will have a 100% for the functionality component of this assignment. That won’t necessarily be the case in future assignments.)

Submitting

When time comes for you to submit, simply type

./tools/submit

Submit early and submit often! You can resubmit as many times as you would like before the deadline. We will only grade your most recent submission.

Extra credit

If you submitted the Week 1 survey by Monday, July 2, 11:59pm, run the following in the root of your assign1 directory:

echo "I completed the Week 1 survey." >> NOTE_TO_GRADER
git add NOTE_TO_GRADER

If you ls, you should see NOTE_TO_GRADER alongside inode.c, file.c, etc. Then, rerun ./tools/submit.

We aren’t verifying that you actually did submit the survey, but please uphold the Honor Code and be truthful in your response.

Starter code

This section offers an explanation of each file in the starter code. See “Suggested Implementation Order” below for a plan of attack once you have oriented yourself with the codebase.

The files we provide you fall into three categories.

The Version 6 Unix header files (filsys.h, ino.h, direntv6.h)

These define structs that you will be using quite frequently. The structs are explained in the “Header Files and Structures” section below. You shouldn’t need to modify these files.

The test harness (diskimageaccess.c, chksumfile.[ch], unixfilesystems.[ch])

These files provide the infrastructure for building your code and running our tests against it. You are welcome to modify these files for testing, if you would like, but you shouldn’t need to.

File system module (diskimg.[ch], inode.[ch], file.[ch], pathname.[ch])

The test code we give you interfaces with the file system using a layered API that we’ve already designed. For each of the layers that you’ll need to implement for this assignment, there is a header file that defines the interface to the layer and a corresponding .c file that should contain the implementation.

The layers are:

Block layer (diskimg.[ch])

This defines and implements the interface for reading and writing sectors (note that the words block and sector are used interchangeably) on the disk image. We give you an implementation of this layer that should be sufficient for this assignment, so you don’t need to modify these files.

inode layer (inode.[ch])

This defines and implements the interface for reading the file system’s inodes. This includes the ability to look up inodes by inumber and to get the block/sector number of the nth block of the inode’s data. You will be modifying these files.

File layer (file.[ch])

This defines and implements the interface for reading blocks of data from a file by specifying its inumber. This is one of the two layers our tests explicitly exercise. You will be modifying these files.

Filename layer (directory.[ch])

This defines and implements the interface for implementing Unix directories on top of files. Its primary function is to get information about a single directory entry. You will be modifying these files.

Pathname layer (pathname.[ch])

This defines and implements the interface to look up a file by its absolute pathname. This is the second of the two layers we explicitly test. You will be modifying these files.

You are welcome to modify any of the files we give you if you would like. When making changes in the test harness, do so in a backward compatible way so that the testing scripts we give you continue to work. In other words, it’s fine to add testing and debugging code, but make sure that when we run your program with -i and -p, its output has the same format with or without your changes, so our automated tests won’t break.

Before you try to write any code, you’ll want to familiarize yourself with the details of the Unix V6 file system by reading Section 2.5 of the Saltzer and Kaashoek textbook. You may also find it helpful to read and understand how the test script works.

Unix v6 file system supplement

Section 2.5 of the Salzer and Kaashoek book contains most of what you need to know about the Unix v6 file system in order to do this assignment. You may be able to complete this assignment with just the material I presented in lecture and by referencing my lecture notes, but S&K spells things out in more detail. The information below is supplementary to the textbook, so it assumes you’ve already read and understand the material there.

Header files and structures

In the starter code we’ve provided C structs corresponding to the file system’s on-disk data structures. These structs have the same layout in memory as the structures have on disk. They include:

struct filsys (filsys.h)

Corresponds to the superblock of the file system. This is a slightly modified copy of the header file from Version 6 Unix.

struct inode (ino.h)

Corresponds to a single inode. Again this comes from Version 6 of Unix, with some small modifications.

struct direntv6 (direntv6.h)

Corresponds to a directory entry. This was copied from section 2.5 in the textbook.

In addition, unixfilesystem.h contains a description of the file system layout, including the sector address of the superblock and of the start of the inode region.

Legacy of an old machine

Back in the 1970s, storage space – both on disk and in main memory – was at a premium. As a result, the Unix v6 file system goes to lengths to reduce the size of data it stores. You’ll notice that many integer values in the structs we provided are stored using only 16 bits, rather than today’s more standard 32 or 64. (In our code we use the type int16_t from stdint.h to get a 16-bit integer, but back in the ‘70s, the C int type was 16 bits wide.)

In another space-saving move, the designers of the file system stored the inode’s size field as a 24-bit integer. There’s no 24-bit integer type in C, so we represent this value using two fields in the inode struct: i_size1, which contains the least-significant 16 bits of the value, and i_size0, which contains the most-significant 8 bits of the value. We provide a function inode_getsize in inode.c that assembles these two fields into a normal C integer for you.

The first inode

Since there is no inode with an inumber of 0, the designers of the file system decided not to waste the 32 bytes of disk space to store it. The first inode in the first inode block has inumber 1; this inode corresponds to the root directory for the file system. (See unixfilesystem.h for details.)

Be careful not to assume that the first inode has an inumber of 0! Off-by-one errors are the worst.

inode’s `i_mode`

The 16-bit integer i_mode in the inode struct isn’t really a number; rather, the individual bits of the field indicate various properties of the inode. ino.h contains #defines which describe what each bit means.

For instance, we say an inode is allocated if it points to an existing file. The most-significant bit (i.e. bit 15) of i_mode indicates whether the inode is allocated or not. So the C expression (i_mode & IALLOC) == 0 is true if the inode is unallocated and false otherwise.

Similarly, bit 12 of i_mode indicates whether the file uses the large file mapping scheme. So if (i_mode & ILARG) != 0 , then the inode’s i_addr fields point at indirect and doubly-indirect blocks rather than directly at the data blocks.

Bits 14 and 13 form a 2-bit wide field specifying the type of file. This field is 0 for regular files and 2 (i.e. binary 10, or the constant IFDIR) for directories. So the expression (i_mode & IFMT) == IFDIR is true if the inode is a directory, and false otherwise.

Suggested Implementation Order

The starter code contains a number of unfinished functions. We suggest you attack them in the following order:

inode_iget and inode_indexlookup in inode.c.
file_getblock in file.c. After this step, your code should pass the inode level functionality tests.
directory_findname in directory.c.
pathname_lookup in pathname.c. Refer to Section 2.5.6 of the Salzer and Kaashoek book for this part. Afterward, all the pathname tests should pass.

Grading

We will grade your assignment based on a combination of test results and code quality.

Code test results: 60 points
Clean build and clean valgrind reports: 6 points
Code quality: graded on bucket system [exceptional, solid, minor-problems, major-problems, etc.]

First, we’ll run your code on a set of test disk images and check that your program gives the expected output. You have access to all of the images we’ll use for grading and the correct output for each of them.

If this were a real archaeological dig, you would have been given only the disk image and a computer. But to make things easier, we’ve provided you with starter code that tries to read all the files on the image and outputs checksums of the inode and file contents. We’ve also computed these checksums using a working implementation of the assignment. If your checksums match ours, then your code is correctly reading the data off the disk.

Finally, we’ll read through your code and give you comments on style and overall quality. We’ll look for places where you inappropriately break the layering of the file system or make other mistakes that don’t necessarily cause your program to generate the wrong output. We’ll also look for common C programming errors such as memory leaks and use of the heap where stack allocation is more appropriate, or places when you used void* pointers when you could have used stronger types. Finally, we’ll check that your code is easy to read: It should be well-formatted, organized, and be easy to follow.

CS 110