UWA Logo Computer Science & Software Engineering
C Programming (CITS1210) - 2nd project 2008
   Faculty Home  |  CSSE Home  |  csentry  |  CITS1210  |  help1210

Submission deadline: 12noon, Friday 31st October 2008

Also check the project clarifications webpage.

See also details of the tests used in the marking process.


Background:

In recent years, hard disk sizes and densities have increased dramatically, while their costs are currently approaching just 15 cents per gigabyte. As a consequence, we store files on our computers' disks in very different ways and, because we typically have a huge amount of free space available, we end up having multiple copies of many files on our disks. This presents few problems, until we eventually run out of disk space, or need to backup our files to another (smaller) disk, perhaps over a network with limited bandwidth. At that time we'd like to locate all duplicate files, and only make one copy of them to our backup destination.

  • Two or more files are defined to be duplicates iff their contents are identical - duplicate files will thus have the same size, but we're unconcerned about the files' names or their modification times.
  • A file is defined to be unique iff no other file has the same contents.

Aim:

The aim of this project is to design and develop a useful utility program, named duplicates, to locate and report on duplicate files in, and below, a named directory. Your implementation of duplicates will be invoked with zero or more valid command-line options, and one directory name.

With no command-line options (i.e. only a directory name is provided) duplicates will simply list 4 things (with just one integer per line):

  1. the number of files found,
  2. the total size of all files found (i.e. the count of all bytes occupied by all files),
  3. the number of unique files (i.e. any duplicate is only counted once), and
  4. the possible minimum total size of all files found (i.e. the sizes of duplicated files are only counted once).

Files and directories (other than the "starting" directory indicated on the command-line) which cannot be read should be silently ignored (no error messages should be printed).
For the standard project, the "starting" directory will contain only regular files and sub-directories. In particular, there will be no hard- or symbolic-links.

An explanation of each of the command-line options follows. Support for the command-line option marked with a chili is only required in the Advanced version of this project (see later):

-a By default, hidden and configuration files (conventionally those beginning with a '.' character) are not considered by duplicates. Providing the -a option requests that all files be considered.
-l duplicates lists the duplicate files found. Each line of output will consist of the names of two or more files that are duplicates of each other. The filenames of duplicate files (on each line) must be separated by the TAB character.
-m duplicates minimizes the total number of bytes required to store all files' data. As described in the "Advanced tasks" section
-t duplicates simply tests if the named directory contains any duplicate files. duplicates produces no output at all, simply terminating with EXIT_SUCCESS if there are no duplicates (i.e. storage is minimized) or with EXIT_FAILURE otherwise.
-v By default, duplicates performs its actions silently. No output, other than the requested output, appears on the stdout channel, although some errors may be reported on the stderr channel. Providing the -v option requests that duplicates be more verbose in its output, reporting to stderr its actions.

Detecting duplicate files:

To detect duplicate files we'll employ a cryptographic checksum function named SHA2 (pronounced 'shar-2'). SHA2 examines the contents of a file and produces a fixed-length summary of its contents. Cryptographic checksum functions are designed by mathematicians and those developing encryption and security software.

Here is an implementation of the function - strSHA2.c

Two or more files are considered identical if their cryptographic checksums are identical. To date, no two (different) files ever have been found with identical cryptographic checksums! For this project, we'll use a string to store this representation, and two files will be considered identical if their SHA2 string representations are identical. The function strSHA2, with the following prototype:

    char *strSHA2(char *filename);

will be provided for this project (it is not a standard C99 function). If strSHA2 can read the indicated file, it will returned a dynamically allocated string holding the SHA2 string representation of the file's contents. If the indicated file cannot be read, strSHA2 will return NULL. Note that you do not have to understand the SHA2 function or its implementation for this project.


Getting started:

A sample solution, named /cslinux/examples/CITS1210/WWW/project2/duplicates-sample, will be provided.
NOTE: your solution's output must be IDENTICAL to that of the sample solution.

There is no required sequence of steps to undertake the project, and no sequence of steps will guarantee success. However, the following sequence is strongly recommended (and this is how the sample solution was built). It is assumed (considered essential for success!) that each step:

  • is extensively tested before you proceed to the next step, and
  • reports any errors to stderr as they are found. Serious errors may require the whole program to terminate.

The sample solution uses the following C99 and Unix system functions. It is strongly recommended that you first read the online documentation for each of these:

getopt, malloc, realloc, strdup, free, opendir, readdir, closedir, stat.

and understand how to use the provided strSHA2 function.

The suggested steps:

  1. determine what activities are to be performed by the program. Determine what data needs to be stored during the execution of the program. Design one of more data types and data structures to store this long-term data, and create a new duplicates.h header file to define these constructs.
  2. break up the activities into fairly independent sets. Write a number of "empty" C files, each of which includes your duplicates.h header file, to implement the activities. Ensure that your file implementing the main() function is named duplicates.c.
  3. write a Makefile to compile and link your C files, each of which will be dependent on the header file.
  4. write the main() function, whose initial task is simply to receive and validate the command-line options and arguments. Write a usage() function to report the program's synopsis, should command-line processing detect any errors.
  5. ensure that the "real" command-line argument is the name of a directory that you can read it.
  6. open and read each directory, locating all regular files in the directory. At this stage (as a debugging aid) just print each filename as it is found. When directories are locate within directories, you should process those directories recursively.
  7. add support for the -a command-line option when finding files.
  8. store the name (location) and size of each file found in each directory.
  9. having finished reading the directories, identify all duplicate files.
  10. if no command-line options are provided, print out the required "statistics" about the files found and terminate.
  11. if the -t command-line option is provided, perform its role and terminate.
  12. if the -l command-line option is provided, perform its role and terminate.
  13. armed with a fully tested program, and overflowing with confidence, consider of the Advanced tasks (described below).

Advanced tasks:

If you would like a much greater challenge, you may like to attempt to an advanced version of this project.

Those who undertake and complete significant parts of the "Advanced tasks" will have the opportunity to "recover" any marks lost (deducted) in the first part of the project.

There are 3 additional tasks in the advanced section:

  1. support duplicate detection in two or more directories provided on the command-line .
  2. support the presence of hard-links (but not symbolic-links) in the directories being considered. When counting the total number of bytes occupied by all files, the "contents" of all files hard-linked together are counted only once.
  3. support the -m command-line option .

The first advanced task simply permits multiple directories to be searched. For example, if four directory names are provided, then all files in all four directories should be considered.

The second additional task requires you to identify the duplicate files and to then store only one instance of each. The Unix link system call (see man 2 link) provides this facility for us, by creating hard-links between two or more files. The actual file contents will only be stored only once, and multiple filenames will refer to that single copy.

WARNING: until you are very confident that your duplicates program is working correctly, you are strongly advised NOT to use your duplicates program to use the -m option to minimize the storage of important files and directories!


Assessment:

This project is worth 25% of your final mark for CITS1210: C Programming. It will be marked out of 25. You do not need to complete the tasks detailed in the "Advanced tasks" section to receive full marks (100%) for the project.

As always, you are expected to show professionalism in your approach by making appropriate use of the features of the C99 language, following the principles of good program design, and adhering to the programming conventions outlined in this unit.

Those who undertake and complete significant parts of the "Advanced tasks" will have the opportunity to "recover" any marks lost (deducted) in the first part of the project. As a rough estimate, completing the "Advanced tasks" of the project may allow a student to recover up to 10 (out of 25) lost marks. Note that a score of >100% is not possible.

During the marking, attention will obviously be given to the correctness and readability of your solution.

A correct and efficient solution should not be considered as the perfect nor only desirable form of solution. Preference will be given to well presented, well documented work. Do not expect to receive full marks for your program simply because it works correctly.

Marks will be allocated to each function according to:

  1. correctness: a function must implement the specification exactly and completely,
  2. clarity: a function must be easy to understand, and
  3. appropriate use of the features of the C99 language.

As a rough estimate, marks will be awarded as follows: 15 (out of 25) marks to program correctness (i.e., how well your functions match the specification of the questions asked), and 10 (out of 25) marks to coding style (i.e. how well you have written the functions in terms of clarity and use of the features of the C99 language).

Part marks to questions may be awarded, so if you are not able to get a function working correctly, you should still submit what you have done. However, you will be significantly penalised if your submitted program does not compile successfully (e.g. if it contains any compilation errors).


Working together:

This project may be completed in small teams of two students; you may choose to work individually, but you may not work in a team of three. The motivation for this is to develop communication skills amongst students, and to enable you to attempt a project considered of greater difficulty than would normally be reasonable for the time available. No allowance will be made in the marking if you choose to undertake the project individually.

You are expected to have read and understood the CSSE Policy on Plagiarism. In accordance with this policy, you may discuss with other teams the general principles required to understand this project, but the work you submit must be the sole result of your team's members.


Submission:

You must submit your project's files using cssubmit. No other method of submission will be accepted. If working in a team of two, only one student needs to submit the work using cssubmit. cssubmit will give you a receipt of your submission. You should print and retain this receipt in case of any dispute.

Please be fully aware of CSSE's Penalties for late submission. Submitting the wrong file(s) may result in a mark of zero for your project. Note also that the cssubmit facility does not archive submissions and will simply overwrite any previous submission with your latest submission.

Your project will be marked on CSSE's Apple Macs in Lab 2.01. Ensure that your project is compiling and working correctly on these machines before you submit your project. No allowance will be made for projects that "work at home" but do not work on CSSE's Apple Macs in Lab 2.01.

You should submit a Makefile and all C header (.h) and code (.c) files that you wish to be considered for assessment.

  • If you are attempting the project's standard requirements, ensure that your file implementing the main() function is named duplicates.c.
  • If you are attempting the project's advanced requirements, ensure that your file implementing the main() function is named duplicates-advanced.c.

Any submitted C file must begin with the following lines:

/*
   CITS1210 2nd Project 2008
   Name(s):               your name(s)
   Student number(s):     your student number(s)
 */

Clarifications:

Please post requests for clarification about any aspect of the project to help1210 so that all students may remain equally informed. Clarifications will be added to the project clarifications webpage.

Good luck!

Submission deadline: 12noon, Friday 31st October 2008

Top of Page
CRICOS Provider Code: 00126G