Your objective is to implement and examine the efficiency of alternative algorithms for finding duplicates in a collection of strings.
Here is a high-level description of what you are to do:
Read in a count, indicating the number of tests  to be run
For (I=1; I<=number_of_tests; I++)
    Read a text file called "TestDataI.txt" from disk
(same directory as code)
        Create a collection of the
words found (vector or array are equally good)
           
You will need to extract the words, treating punctuation as white space
               
white space is anything that is not alphanumeric
        Let N be number of words
extracted
        DO NOT look for duplicates
at this stage
    You now have a collection of strings with possible
duplicates
    Hash Table Test
        Enter these strings in a
hash table having N/2 buckets
           
Each entry in a bucket is a handle to a string and its repeat count
        Iterate across all entries
in the table
           
Print duplicated words and their repeat counts
               
Do not print out words that occur just once
    Tree Test
        Choose one of the following
"balanced" tree representations
           
AVL tree, splay tree, red-black tree
        Enter these strings in the
tree.
           
Each node is a handle to a string and its repeat count
        Walk the tree using an inorder
traversal
           
Print duplicated words and their repeat counts
    Produce a printed summary of the cost of the alternative
methods
       This needs to specify the techniques
compared (Hash vs ??)
       This needs to specify how many
words (N) were found
        The measures can be
           
Actual timing analyseis or
           
Counts of calls and iterations
        The choice of what you measure
is in your hands
    Printed output should be redirected to a file named
"TestOutI.txt"
end For
Turn in a disk with a directory whose name is the same as yours
    Use the convention <Last Name> <Initial of
First Name>
        So my folder would be named
"HughesC"
    The contents of the folder will be
        The source file(s) for your
program
        Executable files, as appropriate
to programming language
        Test Files named TestData1.txt,
TestData2.txt, ...
           
Include at least 3 files
               
One of these must have no duplicate words
               
One of these must have some words with at least 3 repeats
               
One of these must contain at least 100 distinct words
                   
Hint: Just use some paper you wrote for a class
                       
Be sure to save it as a text file
        Text files showing your
output for each test data file
           
These are named TestOut1.txt, TestOut2.txt, ...
        A report entitled Analysis.doc
that describes your observations
           
This must address the observed and theoretical performance of techniques
Due: By 4:00 PM on Monday, November 1, 1999
Lateness: 5% per day
Cutoff: No assignment will be accepted after Friday, November 5 at
4:00 PM
I/O code
Echoing a file in Java
import java.io.*;Echoing a file in C++
import java.util.*;
...
String fileName = "test1.txt";
StreamTokenizer tokens;
try {
tokens = new StreamTokenizer (new FileInputStream(fileName));
while (tokens.nextToken()!=StreamTokenizer.TT_EOF)
System.out.println(tokens.sval);
} catch (IOException e) {
System.err.println("Could not process file " + fileName);
System.exit(1);
}
#include <fstream>
#include <string>
#include <iostream>
using namespace std;
...
ifstream infile;
string s;
string fileName = "test1.txt";
infile.open(fileName);
if (infile.fail()) {
cout << "file error" << endl;
exit(1);
}
while (!infile.eof()) {
infile >> s;
cout << s << endl;
}