Assignment 1: Reading UNIX Filesystems
Thanks to Mendel Rosenblum for this wonderful assignment. This is a real system, people!
This handout was adapted from Jerry Cain’s Spring 2018 offering.
An archaeological expedition in Murray Hill, NJ has uncovered some magnetic disk drives from the mid-1970s. After considerable effort, the dig’s technical expert read the contents of the drives, and for each disk drive, she created a local file to be an exact bitwise copy of the disk’s data.
The archaeologists determined the data on the disks was stored using the Version 6 Unix file system. Sadly, none of the archaeologists made it through CS110, so they haven’t been able to read the image contents. Your task is to write a program that understands the Unix v6 file system to extract the file system data.
Due Date: Friday, July 6, 2018 at 11:59 p.m.
Overview of Unix v6 FS and this assignment
The key systems principle we are trying to drive home this week is that of layering: one way to manage complexity in systems is to break it down into layers that sit on top of each other. There are several layers involved in the Unix v6 filesystem:
- The block layer is among the lowest software layers. It manages the details of communication with the disk, enabling us to read or write sectors. This layer sits underneath almost every filesystem operation, and above this layer, we don’t want to have to think about the nitty-gritty of talking to the hardware.
- The inode layer supplies higher layers with metadata about files. When we need to know which block number is storing a particular portion of a file, the inode layer tells us that. Above this layer, we don’t want to think about the mechanics of retrieving or updating metadata (which isn’t simple, considering inodes are smaller than sectors).
- The file layer supplies higher layers with actual file contents. We request some number of bytes from a file at a particular offset, and the file layer populates a buffer with that information.
- The filename layer allows us to find a file within a directory. Given a filename and a directory presumably containing that file, this layer figures out what inode stores the information for that file.
- Lastly, the pathname layer implements full absolute path lookups, starting
from the root directory. If you hand
/usr/class/cs110/hello.txt
to the pathname layer, it will utilize the filename layer beneath it, looping through the components of that full path, until it findshello.txt
.
Any software accessing the filesystem will sit on top of the file and pathname layers. If we’re trying to write an application that does meaningful work, we almost certainly don’t want to be bogged up thinking about the organization of bytes on disk, so we leave that to the aforementioned layers beneath us.
In this assignment, you will implement the latter 4 of these layers. By the end of the assignment, your program will be able to list and read files from a disk formatted with the Unix v6 filesystem.
Notes
This assignment is usually given for 1 week, but I am giving you almost two weeks so that you have time to get accustomed to working with the development environment and to brush up on your C skills. Get help early if you need it, and don’t get too used to this pacing; future assignments will only be out for 1 week.
Also, expect that it will take some time to figure out the starter code before you can actually start doing anything. We throw a lot of files at you, and it isn’t trivial to figure out what’s going on and what’s expected of you. This may be frustrating, but we want to push you to get better at reading code and finding your way around an existing system; such skills are critical if you want to work on systems comprised of hundreds of thousands or millions of lines of code. Take time to read through all the starter code and to figure out how the files fit together. Give yourself at least an hour before expecting yourself to be productive, and you’ll be less frustrated.
There are several files in the starter code that you will need to jump back and forth between, and future assignments will also involve work with many files. It pays to spend time becoming more proficient with your tools (particularly your terminal emulator and editor) so that you can work efficiently and don’t need to restart vim every time you want to switch to a different file. I’m also working on writing a handout with some misc tips and resources for working with your tools; I will post a link on the course homepage, as well as here, when I’m finished.
Working with the assignment repo
Getting started
All coding should be done on the myth
cluster machines (accessible by running
ssh [email protected]
), as that’s where we’ll be testing your
submission. (If you’ve never used myth
before, see essential resources from
CS 107. You can also read
tips I wrote a few years
ago,
some of which may be out of date.) To get your own copy of the starter code,
you should clone the master git repository we’ve set up for you by typing:
git clone /usr/class/cs110/repos/assign1/$USER assign1
Doing so will create an assign1
directory within your own space, and you can
descend into that directory to land on the collection of files you’ll be
modifying to arrive at a working product.
Building, running, testing, and committing local changes
You can run the project even before you’ve written any code (though, of course, it won’t produce the correct output).
To build the project, cd
into your local repository and type
make
To test your work, cd
into your local repository and type
./diskimageaccess <options> <diskimagePath>
<diskimagePath>
should be the path to one of the test disks located in
./samples/testdisks/
. We currently have three test disks: basicDiskImage
,
depthFileDiskImage
, and dirFnameSizeDiskImage
.
diskimageaccess
recognizes two flags as valid <options>
:
-i
: test the inode and file layers-p
: test the filename and pathname layers
For example, to run both the inode and filename tests on the basic disk, you could run
./diskimageaccess -ip ./samples/testdisks/basicDiskImage
You can visually compare your output to that generated by my own solution by typing
./samples/diskimageaccess_soln <options> <diskimagePath>
The expected output for running diskimageaccess
on each disk image X
is
stored in the X.gold
file inside the testdisks
directory. We are leveraging
the sanitycheck
tool,
which you may be familiar with from CS 107. In particular, if you type
./tools/sanitycheck
at the command prompt while in your assign1
directory, the sanitycheck
script will exercise your implementation in precisely the same way our own
grading scripts will. We won’t always expose all of the functionality tests,
but we are for Assignment 1. (Read: if your sanitycheck passes all tests and
runs clean under valgrind
, you will have a 100% for the functionality
component of this assignment. That won’t necessarily be the case in future
assignments.)
Submitting
When time comes for you to submit, simply type
./tools/submit
Submit early and submit often! You can resubmit as many times as you would like before the deadline. We will only grade your most recent submission.
Extra credit
If you submitted the Week 1 survey by Monday, July 2, 11:59pm, run the following in the root of your assign1 directory:
echo "I completed the Week 1 survey." >> NOTE_TO_GRADER
git add NOTE_TO_GRADER
If you ls
, you should see NOTE_TO_GRADER
alongside inode.c
, file.c
,
etc. Then, rerun ./tools/submit
.
We aren’t verifying that you actually did submit the survey, but please uphold the Honor Code and be truthful in your response.
Starter code
This section offers an explanation of each file in the starter code. See “Suggested Implementation Order” below for a plan of attack once you have oriented yourself with the codebase.
The files we provide you fall into three categories.
The Version 6 Unix header files (filsys.h
, ino.h
, direntv6.h
)
These define structs that you will be using quite frequently. The structs are explained in the “Header Files and Structures” section below. You shouldn’t need to modify these files.
The test harness (diskimageaccess.c
, chksumfile.[ch]
, unixfilesystems.[ch]
)
These files provide the infrastructure for building your code and running our tests against it. You are welcome to modify these files for testing, if you would like, but you shouldn’t need to.
File system module (diskimg.[ch]
, inode.[ch]
, file.[ch]
, pathname.[ch]
)
The test code we give you interfaces with the file system using a layered API
that we’ve already designed. For each of the layers that you’ll need to
implement for this assignment, there is a header file that defines the
interface to the layer and a corresponding .c
file that should contain the
implementation.
The layers are:
Block layer (diskimg.[ch]
)
This defines and implements the interface for reading and writing sectors (note that the words block and sector are used interchangeably) on the disk image. We give you an implementation of this layer that should be sufficient for this assignment, so you don’t need to modify these files.
inode layer (inode.[ch]
)
This defines and implements the interface for reading the file system’s inodes. This includes the ability to look up inodes by inumber and to get the block/sector number of the nth block of the inode’s data. You will be modifying these files.
File layer (file.[ch]
)
This defines and implements the interface for reading blocks of data from a file by specifying its inumber. This is one of the two layers our tests explicitly exercise. You will be modifying these files.
Filename layer (directory.[ch]
)
This defines and implements the interface for implementing Unix directories on top of files. Its primary function is to get information about a single directory entry. You will be modifying these files.
Pathname layer (pathname.[ch]
)
This defines and implements the interface to look up a file by its absolute pathname. This is the second of the two layers we explicitly test. You will be modifying these files.
You are welcome to modify any of the files we give you if you would like. When
making changes in the test harness, do so in a backward compatible way so that
the testing scripts we give you continue to work. In other words, it’s fine to
add testing and debugging code, but make sure that when we run your program
with -i
and -p
, its output has the same format with or without your
changes, so our automated tests won’t break.
Before you try to write any code, you’ll want to familiarize yourself with the details of the Unix V6 file system by reading Section 2.5 of the Saltzer and Kaashoek textbook. You may also find it helpful to read and understand how the test script works.
Unix v6 file system supplement
Section 2.5 of the Salzer and Kaashoek book contains most of what you need to know about the Unix v6 file system in order to do this assignment. You may be able to complete this assignment with just the material I presented in lecture and by referencing my lecture notes, but S&K spells things out in more detail. The information below is supplementary to the textbook, so it assumes you’ve already read and understand the material there.
Header files and structures
In the starter code we’ve provided C struct
s corresponding to the file
system’s on-disk data structures. These struct
s have the same layout in
memory as the structures have on disk. They include:
struct filsys
(filsys.h
)
Corresponds to the superblock of the file system. This is a slightly modified copy of the header file from Version 6 Unix.
struct inode
(ino.h
)
Corresponds to a single inode. Again this comes from Version 6 of Unix, with some small modifications.
struct direntv6
(direntv6.h
)
Corresponds to a directory entry. This was copied from section 2.5 in the textbook.
In addition, unixfilesystem.h
contains a description of the file system
layout, including the sector address of the superblock and of the start of the
inode region.
Legacy of an old machine
Back in the 1970s, storage space – both on disk and in main memory – was at a
premium. As a result, the Unix v6 file system goes to lengths to reduce the
size of data it stores. You’ll notice that many integer values in the
struct
s we provided are stored using only 16 bits, rather than today’s more
standard 32 or 64. (In our code we use the type int16_t
from stdint.h
to
get a 16-bit integer, but back in the ‘70s, the C int
type was 16 bits wide.)
In another space-saving move, the designers of the file system stored the
inode’s size field as a 24-bit integer. There’s no 24-bit integer type in C,
so we represent this value using two fields in the inode
struct: i_size1
,
which contains the least-significant 16 bits of the value, and i_size0
, which
contains the most-significant 8 bits of the value. We provide a function
inode_getsize
in inode.c
that assembles these two fields into a normal C
integer for you.
The first inode
Since there is no inode with an inumber of 0, the designers of the file system
decided not to waste the 32 bytes of disk space to store it. The first inode
in the first inode block has inumber 1; this inode corresponds to the root
directory for the file system. (See unixfilesystem.h
for details.)
Be careful not to assume that the first inode has an inumber of 0! Off-by-one errors are the worst.
inode’s i_mode
The 16-bit integer i_mode
in the inode
struct isn’t really a number;
rather, the individual bits of the field indicate various properties of the
inode. ino.h
contains #define
s which describe what each bit means.
For instance, we say an inode is allocated if it points to an existing file.
The most-significant bit (i.e. bit 15) of i_mode
indicates whether the inode
is allocated or not. So the C expression (i_mode & IALLOC) == 0
is true
if
the inode is unallocated and false
otherwise.
Similarly, bit 12 of i_mode
indicates whether the file uses the large file
mapping scheme. So if (i_mode & ILARG) != 0
, then the inode’s i_addr
fields point at indirect and doubly-indirect blocks rather than directly at the
data blocks.
Bits 14 and 13 form a 2-bit wide field specifying the type of file. This field
is 0 for regular files and 2 (i.e. binary 10, or the constant IFDIR
) for
directories. So the expression (i_mode & IFMT) == IFDIR
is true
if the
inode is a directory, and false
otherwise.
Suggested Implementation Order
The starter code contains a number of unfinished functions. We suggest you attack them in the following order:
inode_iget
andinode_indexlookup
ininode.c
.file_getblock
infile.c
. After this step, your code should pass the inode level functionality tests.directory_findname
indirectory.c
.pathname_lookup
inpathname.c
. Refer to Section 2.5.6 of the Salzer and Kaashoek book for this part. Afterward, all the pathname tests should pass.
Grading
We will grade your assignment based on a combination of test results and code quality.
- Code test results: 60 points
- Clean build and clean
valgrind
reports: 6 points - Code quality: graded on bucket system [exceptional, solid, minor-problems, major-problems, etc.]
First, we’ll run your code on a set of test disk images and check that your program gives the expected output. You have access to all of the images we’ll use for grading and the correct output for each of them.
If this were a real archaeological dig, you would have been given only the disk image and a computer. But to make things easier, we’ve provided you with starter code that tries to read all the files on the image and outputs checksums of the inode and file contents. We’ve also computed these checksums using a working implementation of the assignment. If your checksums match ours, then your code is correctly reading the data off the disk.
Finally, we’ll read through your code and give you comments on style and
overall quality. We’ll look for places where you inappropriately break the
layering of the file system or make other mistakes that don’t necessarily cause
your program to generate the wrong output. We’ll also look for common C
programming errors such as memory leaks and use of the heap where stack
allocation is more appropriate, or places when you used void*
pointers when
you could have used stronger types. Finally, we’ll check that your code is easy
to read: It should be well-formatted, organized, and be easy to follow.