Your objective is to implement and examine the efficiency of alternative algorithms for finding duplicates in a collection of strings.
Here is a high-level description of what you are to do:
Read in a count, indicating the number of tests to be run
For (I=1; I<=number_of_tests; I++)
Read a text file called "TestDataI.txt" from disk
(same directory as code)
Create a collection of the
words found (vector or array are equally good)
You will need to extract the words, treating punctuation as white space
white space is anything that is not alphanumeric
Let N be number of words
extracted
DO NOT look for duplicates
at this stage
You now have a collection of strings with possible
duplicates
Hash Table Test
Enter these strings in a
hash table having N/2 buckets
Each entry in a bucket is a handle to a string and its repeat count
Iterate across all entries
in the table
Print duplicated words and their repeat counts
Do not print out words that occur just once
Tree Test
Choose one of the following
"balanced" tree representations
AVL tree, splay tree, red-black tree
Enter these strings in the
tree.
Each node is a handle to a string and its repeat count
Walk the tree using an inorder
traversal
Print duplicated words and their repeat counts
Produce a printed summary of the cost of the alternative
methods
This needs to specify the techniques
compared (Hash vs ??)
This needs to specify how many
words (N) were found
The measures can be
Actual timing analyseis or
Counts of calls and iterations
The choice of what you measure
is in your hands
Printed output should be redirected to a file named
"TestOutI.txt"
end For
Turn in a disk with a directory whose name is the same as yours
Use the convention <Last Name> <Initial of
First Name>
So my folder would be named
"HughesC"
The contents of the folder will be
The source file(s) for your
program
Executable files, as appropriate
to programming language
Test Files named TestData1.txt,
TestData2.txt, ...
Include at least 3 files
One of these must have no duplicate words
One of these must have some words with at least 3 repeats
One of these must contain at least 100 distinct words
Hint: Just use some paper you wrote for a class
Be sure to save it as a text file
Text files showing your
output for each test data file
These are named TestOut1.txt, TestOut2.txt, ...
A report entitled Analysis.doc
that describes your observations
This must address the observed and theoretical performance of techniques
Due: By 4:00 PM on Monday, November 1, 1999
Lateness: 5% per day
Cutoff: No assignment will be accepted after Friday, November 5 at
4:00 PM
I/O code
Echoing a file in Java
import java.io.*;Echoing a file in C++
import java.util.*;
...
String fileName = "test1.txt";
StreamTokenizer tokens;
try {
tokens = new StreamTokenizer (new FileInputStream(fileName));
while (tokens.nextToken()!=StreamTokenizer.TT_EOF)
System.out.println(tokens.sval);
} catch (IOException e) {
System.err.println("Could not process file " + fileName);
System.exit(1);
}
#include <fstream>
#include <string>
#include <iostream>
using namespace std;
...
ifstream infile;
string s;
string fileName = "test1.txt";
infile.open(fileName);
if (infile.fail()) {
cout << "file error" << endl;
exit(1);
}
while (!infile.eof()) {
infile >> s;
cout << s << endl;
}