Assignment 2: Reading UNIX Filesystems
Thanks to Mendel Rosenblum for this wonderful assignment. Writeup by Jerry Cain, with modifications by Nick Troccoli, Chris Gregg, and Ryan Eberhardt.
An archaeological expedition in Murray Hill, NJ has uncovered some magnetic disk drives from the mid-1970s. After considerable effort, the dig’s technical expert read the contents of the drives, and for each disk drive, she created a local file to be an exact bitwise copy of the disk’s data.
The archaeologists determined the data on the disks was stored using the Version 6 Unix filesystem. Sadly, none of the archaeologists are familiar with that filesystem, so they haven’t been able to read the image contents. Your task is to write a program that understands the Unix v6 filesystem to extract the file system data.
Due Date: Wednesday, July 7, 2021 at 11:59 p.m. pacific time
Overview of Unix v6 FS and this assignment
The key systems principle we are trying to drive home this week is that of layering: one way to manage complexity in systems is to break it down into layers that sit on top of each other. There are several layers involved in the Unix v6 filesystem:
- The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
- The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
- The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
- The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
- Lastly, the pathname layer implements full absolute path lookups, starting
from the root directory. If you hand
/usr/class/cs110/hello.txt
to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it findshello.txt
.
Any software accessing the filesystem will sit on top of the file and pathname layers. If we’re trying to write an application that does meaningful work, we almost certainly don’t want to be bogged up thinking about the organization of bytes on disk, so we leave that to the aforementioned layers beneath us.
In this assignment, you will implement the latter 4 of these layers. By the end of the assignment, your program will be able to list and read files from a disk formatted with the Unix v6 filesystem.
Notes
Expect that it will take some time to figure out the starter code before you can actually start doing anything. We throw a lot of files at you, and it isn’t trivial to figure out what’s going on and what’s expected of you. This may be frustrating, but we want to push you to get better at reading code and finding your way around an existing system; such skills are critical if you want to work on systems comprised of hundreds of thousands or millions of lines of code. Take time to read through all the starter code and to figure out how the files fit together. Give yourself at least several hours before expecting yourself to be productive, and you’ll be less frustrated.
There are several files in the starter code that you will need to jump back and forth between, and future assignments will also involve work with many files. It may help you to sketch out how the files interact on pencil and paper, or to make some notes for yourself somewhere. If you are using VSCode, command/ctrl-click will be very helpful for jumping between definitions and uses of structs and functions. If you are using vim, the ctags plugin may be helpful.
Unix v6 file system supplement
If you would like an additional reference, Section 2.5 of the Salzer and Kaashoek book contains most of what you need to know about the Unix v6 file system in order to do this assignment.
Working with the assignment repo
Getting started
All coding should be done on the myth
cluster machines (accessible by running
ssh [email protected]
), as that’s where we’ll be testing your
submission. (If you’ve never used myth
before, see essential resources from
CS 107. Do you live very
far from campus or have a spotty internet connection? We are prototyping a
system that would let you work locally on your own machine, instead of needing
to connect to myth. Let us know if you would be interested in helping to try
this out.
To get your own copy of the starter code, you should clone the master git repository we’ve set up for you by typing:
git clone /usr/class/cs110/repos/assign2/$USER assign2
Doing so will create an assign2
directory within your own space, and you can
descend into that directory to land on the collection of files you’ll be
modifying to arrive at a working product.
Building, running, testing, and committing local changes
You can run the project even before you’ve written any code (though, of course, it won’t produce the correct output).
To build the project, cd
into your local repository and type
make
To test your work, cd
into your local repository and type
./diskimageaccess <options> <diskimagePath>
<diskimagePath>
should be the path to one of the test disks located in
./samples/testdisks/
. We currently have three test disks: basicDiskImage
,
depthFileDiskImage
, and dirFnameSizeDiskImage
.
diskimageaccess
recognizes two flags as valid <options>
:
-i
: test the inode and file layers-p
: test the filename and pathname layers
For example, to run both the inode and filename tests on the basic disk, you could run
./diskimageaccess -ip ./samples/testdisks/basicDiskImage
You can visually compare your output to that generated by my own solution by typing
./samples/diskimageaccess_soln <options> <diskimagePath>
The expected output for running diskimageaccess
on each disk image X
is
stored in the X.gold
file inside the testdisks
directory. We are leveraging
the sanitycheck
tool,
which you may be familiar with from CS 107. In particular, if you type
./tools/sanitycheck
at the command prompt while in your assign2
directory, the sanitycheck
script will exercise your implementation in precisely the same way our own
grading scripts will. If you pass all sanitycheck tests, you should pass most
of our functionality grading tests, although sanitycheck isn’t guaranteed to be
comprehensive.
Submitting
When time comes for you to submit, simply type
./tools/submit
Submit early and submit often! You can resubmit as many times as you would like before the deadline. We will only grade your most recent submission.
Starter code
This section offers an explanation of each file in the starter code. See “Working on the Assignment” below for a plan of attack once you have oriented yourself with the codebase.
The files we provide you fall into three categories.
The Version 6 Unix header files (filsys.h
, ino.h
, direntv6.h
)
These define structs that you will be using quite frequently. The structs are explained in the “Header Files and Structures” section below. You shouldn’t need to modify these files.
The test harness (diskimageaccess.c
, chksumfile.[ch]
, unixfilesystems.[ch]
)
These files provide the infrastructure for building your code and running our tests against it. You are welcome to modify these files for testing, if you would like, but you shouldn’t need to.
File system module (diskimg.[ch]
, inode.[ch]
, file.[ch]
, pathname.[ch]
)
The test code we give you interfaces with the file system using a layered API
that we’ve already designed. For each of the layers that you’ll need to
implement for this assignment, there is a header file that defines the
interface to the layer and a corresponding .c
file that should contain the
implementation.
The layers are:
Block layer (diskimg.[ch]
)
This defines and implements the interface for reading and writing sectors (note that the words block and sector are used interchangeably) on the disk image. We give you an implementation of this layer that should be sufficient for this assignment, so you don’t need to modify these files.
inode layer (inode.[ch]
)
This defines and implements the interface for reading the file system’s inodes. This includes the ability to look up inodes by inumber and to get the block/sector number of the nth block of the inode’s data. You will be modifying these files.
File layer (file.[ch]
)
This defines and implements the interface for reading blocks of data from a file by specifying its inumber. This is one of the two layers our tests explicitly exercise. You will be modifying these files.
Filename layer (directory.[ch]
)
This defines and implements the interface for implementing Unix directories on top of files. Its primary function is to get information about a single directory entry. You will be modifying these files.
Pathname layer (pathname.[ch]
)
This defines and implements the interface to look up a file by its absolute pathname. This is the second of the two layers we explicitly test. You will be modifying these files.
You are welcome to modify any of the files we give you if you would like. When
making changes in the test harness, do so in a backward compatible way so that
the testing scripts we give you continue to work. In other words, it’s fine to
add testing and debugging code, but make sure that when we run your program
with -i
and -p
, its output has the same format with or without your
changes, so our automated tests won’t break.
Header files and structures
In the starter code we’ve provided C struct
s corresponding to the file
system’s on-disk data structures. These struct
s have the same layout in
memory as the structures have on disk. They include:
struct filsys
(filsys.h
)
Corresponds to the superblock of the file system. This is a slightly modified copy of the header file from Version 6 Unix.
struct inode
(ino.h
)
Corresponds to a single inode. Again this comes from Version 6 of Unix, with some small modifications.
struct direntv6
(direntv6.h
)
Corresponds to a directory entry. This was copied from section 2.5 in the textbook.
In addition, unixfilesystem.h
contains a description of the file system
layout, including the sector address of the superblock and of the start of the
inode region.
Working on the Assignment
Your job is to complete the implementation of 4 layers of the filesystem:
the “inode layer”, “file layer”, “directory layer”, and “pathname layer”. Each
of these layers has 2 corresponding files, and layers build on each other. You
will need to modify the .c
files for each layer. We also provide a
fully-implemented “block layer” in diskimg.[ch]
that provides the code for
reading and writing sectors (note that the words block and sector are used
interchangeably here) on the disk image.
As you implement them, consider what functions from lower layers you have already implemented and how they can help! We recommend implementing them in the following order.
Testing Recommendations: As you work, we highly recommend testing
incrementally and ensuring each function is working before moving on. Sanity
check can be helpful at certain points, but it only tests certain layers
directly, and it also omits some output from the program in order to do an
output comparison. To view error messages or other testing output you are
printing, make sure to run the diskimageaccess
program outside of sanity
check. You can see more specific testing recommendations for each function
below.
Inode Layer (inode.[ch]
)
This defines and implements the interface for getting information about an inode and the blocks its data is stored in. There are two functions you must implement:
inode_iget
This function reads the data for the inode with the specified inumber from the filesystem and fills in an inode struct with that information.
Testing Recommendation: One way to test this function in isolation is to
run the diskimageaccess
test program with the -i
flag in GDB, with a
breakpoint on inode_iget
. For a given call, inspect the inumber being passed
in, and look at the corresponding .gold
file for the disk image you are
testing with to see what the mode and size for that inode should be (e.g.
samples/testdisks/basicDiskImage.gold
for the basic disk image). Then use
gdb to confirm that the inode your function is reading in has these same
values. Note that you may need to print out the inode’s mode in hex (try p/x myInode.i_mode
in GDB) and you may need to call the inode_getsize
function
to print out the correct inode size (try call inode_getsize(&myInode)
in
GDB).
inode_indexlookup
This function returns the block number of the file data block at the specified
index. What this means is that, if we take just the blocks for this file
holding actual file data (e.g. ignoring indirect blocks) and index them
sequentially, the fileBlockIndex
parameter to this function specifies which
of those blocks we must return the block/sector number for. In other words,
the returned block number will always be for a block that stores actual file
data. As a concrete example, let’s say a file is 1,536 (512 * 3) bytes big.
This means it takes up 3 blocks of data. Let’s say furthermore that the block
numbers it is using are 35, 65 and 21. This means in the inode it stores these
three block numbers, and if you fetched blocks 35, 65 and 21 from disk you
would have data for the entire file. But for inode_indexlookup
,
fileBlockIndex
would be either 0, 1 or 2 - it’s specifying the index into the
file’s data of the block to get, and it’s your job to return the disk block
number it corresponds to. For instance, if fileBlockIndex was 0, you would
return 35. If fileBlockIndex was 1, you would return 65, etc.
This implementation needs to manage direct, singly indirect, and doubly indirect block numbers. Translating for direct block numbers is straightforward (as we did in the example above), but translating for indirect block numbers is hard to get right.
- Hint 1: Understand that indirect block numbers identify other blocks that contain 256 two-bytes integers (best managed via the
uint16_t
type). - Hint 2: Understand that doubly indirect block numbers lead to singly indirect block numbers, which lead to block numbers. Make sure whatever code you write to translate singly indirect block numbers also contributes (without cutting and pasting or otherwise copying it) to the process used to ultimately convert doubly indirect block numbers.
Testing Recommendation: One way to test this function in isolation is to
write temporary testing code in file_getblock
. Choose an inumber from the
.gold
output (e.g. samples/testdisks/basicDiskImage.gold
for the basic
disk image) that is either a small or large file depending on what format you’d
like to test. Then fetch that inode and try fetching each of its file blocks
(0, 1, 2,…) via inode_indexlookup
and verify that each of the numbers is
correct. For a small file, you can use GDB to examine that inode’s list of
block numbers and ensure the right one is returned. For example, if you have
an inode with blocks {45, 23, 14, …} then file block index 0 should return
45, 1 should return 23, 2 should return 14, etc. For a large file, you could
write additional testing code to manually fetch the block(s) that should
contain that file block to ensure you are getting the right answer. For
example, if you have an inode with blocks {45, 23, 14, …} then for file block
index 256 you would fetch block 23 in your test code so that you could verify
the return value from inode_indexlookup
matches its first number.
File Layer (file.[ch]
)
The inode layer supports getting information about a file. This layer’s job is to actually read the data a file contains, in chunks. This is one of the two layers our tests explicitly exercise. After this step, your code should pass the inode level functionality tests. There is one function you must implement:
file_getblock
This function’s job is to read the file data block at the specified index. In
other words, whereas inode_indexlookup
finds which block number contains the
data, this function actually reads in that data. This function returns the
number of meaningful bytes in a block/sector that contribute to the payload of
some file (be it a regular file or a directory). In other words, it returns
the number of bytes in that block that are actually used to store data for the
file. As an example, if the file size happens to be a multiple of 512, then
this function always returns 512. Alternatively, if a file isn’t a multiple of
512 and we have to get the payload stored in the very last block, you should
expect to return a number that’s less than 512 (because only a part of what’s
stored in the block is relevant).
Testing Recommendation: This function is directly tested by sanity check. If
any tests are failing, try running them outside of sanity check for more
information. If your inode information isn’t matching, revisit your
inode_iget
function. If your checksums aren’t matching, identify the inode
that is outputting incorrectly and check the .gold
file for more information
about it. Pay close attention to whether that inode represents a small or
large file; that could give you a hint as to where your logic might be going
wrong!
Directory Layer (directory.[ch]
)
This layer implements looking up a file with a certain name in a directory. There is one function you must implement:
####directory_findname
This function’s job is to get the directory entry (defined as struct direntv6
) from the given directory that matches the given name. As a hint,
your implementation should make use of strncmp
instead of strcmp
, because
some of the stored directory names aren’t '\0'
-terminated. They are,
however, guaranteed to be of length 14 or less. Check out the man page for
strncmp
and strcmp
so you understand the difference.
Testing Recommendation: One way to test this function in isolation is to
write temporary testing code in pathname_lookup
. Open the .gold
output for
the test disk (e.g. samples/testdisks/basicDiskImage.gold
and choose a file
or folder to test with that is directly contained within the root directory
(/). For example, for the basic disk image, these would include “bigfile”,
“medfile”, “o”, “verybig”, “CVS” and “foo”. For any of these, add testing code
that calls directory_findname
to lookup that file name in directory inumber 1
(the root directory). You can inspect that the function fetches the correct
struct direntv6
- specifically, the name should match, and the inumber should
match the inumber listed in the .gold
file (E.g. “Path /verybig 5 mode 0x91ff
size 7340032 checksum” means verybig was inode 5). Additionally, you can take
the inumber from that direntv6
, fetch the inode
with inode_iget
, and
verify its fields match the remainder of the listed .gold
output,
specifically the mode and size, like in the testing for inode_iget
. You can
further expand this testing method to test any of the file lookups by hand as
long as you identify the correct dirinumber (e.g. in the basic disk image you
could try looking up “Repository” within dirinumber 6 (/CVS), “Entries” within
dirinumber 12 (/foo/CVS), etc.).
Pathname Layer (pathname.[ch]
)
The directory layer supports looking up a file in a single directory. This layer’s job is to look up a file given its entire absolute path, potentially nested in multiple directories. This is the second of the two layers we explicitly test. Refer to Section 2.5.6 of the Salzer and Kaashoek book for this part. Afterward, all the pathname tests should pass. There is one function you must implement:
pathname_lookup
This function takes in an absolute path referring to a file or directory and returns the inumber of that specified file or directory. The function strsep
will come in handy here!
Testing Recommendation: This function is directly tested by sanity check. If
any tests are failing, try running them outside of sanity check for more
information. If your inode information isn’t matching, try tracing the lookup
by hand to make sure that the correct step is taken each time (e.g. / to /foo,
/foo to /foo/CVS, etc.). The .gold
output files show the inumbers for
various path depths, so if you are looking up /a/b/c
you can check that your
lookup for /a
, /a/b
and /a/b/c
are all correct.
Implementation details and tips
Legacy of an old machine
Back in the 1970s, storage space – both on disk and in main memory – was at a
premium. As a result, the Unix v6 file system goes to lengths to reduce the
size of data it stores. You’ll notice that many integer values in the
struct
s we provided are stored using only 16 bits, rather than today’s more
standard 32 or 64. (In our code we use the type int16_t
from stdint.h
to
get a 16-bit integer, but back in the ’70s, the C int
type was 16 bits wide.)
In another space-saving move, the designers of the file system stored the
inode’s size field as a 24-bit integer. There’s no 24-bit integer type in C,
so we represent this value using two fields in the inode
struct: i_size1
,
which contains the least-significant 16 bits of the value, and i_size0
, which
contains the most-significant 8 bits of the value. We provide a function
inode_getsize
in inode.c
that assembles these two fields into a normal C
integer for you.
The first inode
Since there is no inode with an inumber of 0, the designers of the file system
decided not to waste the 32 bytes of disk space to store it. The first inode
in the first inode block has inumber 1; this inode corresponds to the root
directory for the file system. (See unixfilesystem.h
for details.)
Be careful not to assume that the first inode has an inumber of 0! Off-by-one errors are the worst.
inode’s i_mode
The 16-bit integer i_mode
in the inode
struct isn’t really a number;
rather, the individual bits of the field indicate various properties of the
inode. ino.h
contains #define
s which describe what each bit means.
For instance, we say an inode is allocated if it points to an existing file.
The most-significant bit (i.e. bit 15) of i_mode
indicates whether the inode
is allocated or not. So the C expression (i_mode & IALLOC) == 0
is true
if
the inode is unallocated and false
otherwise.
Similarly, bit 12 of i_mode
indicates whether the file uses the large file
mapping scheme. So if (i_mode & ILARG) != 0
, then the inode’s i_addr
fields point at indirect and doubly-indirect blocks rather than directly at the
data blocks.
Bits 14 and 13 form a 2-bit wide field specifying the type of file. This field
is 0 for regular files and 2 (i.e. binary 10, or the constant IFDIR
) for
directories. So the expression (i_mode & IFMT) == IFDIR
is true
if the
inode is a directory, and false
otherwise.
Additional Tips and Tricks
- Constants: make sure to use provided constants, and define your own
constants, where appropriate. There are calculations throughout your code,
and your goal should be to make them as easy to follow as possible. For
example, use the constants in
unixfilesystem.h
where appropriate! - Declaring Buffers: When creating a buffer to store data, carefully
consider the type of the buffer. In places like
inode_iget
anddirectory_findname
, we know what type of data we are reading in, so you should not usevoid *
and casting, and instead declare the type as specifically as you can. This is a very common style deduction, so consider this carefully!
Grading
We will grade your assignment based on a combination of test results and code quality.
- Code test results: 60 points
- Clean build and clean
valgrind
reports (no memory leaks or errors): 6 points - Code quality: graded on bucket system [exceptional, solid, minor-problems, major-problems, etc.]
First, we’ll run your code on a set of test disk images and check that your
program gives the expected output. Then, we’ll read through your code and give
you comments on style and overall quality. We’ll look for places where you
inappropriately break the layering of the file system or make other mistakes
that don’t necessarily cause your program to generate the wrong output. We’ll
also look for common C programming errors such as memory leaks and use of the
heap where stack allocation is more appropriate, or places when you used
void*
pointers when you could have used stronger types. Finally, we’ll check
that your code is easy to read: It should be well-formatted, organized, and be
easy to follow.