COP 4610L (Spring 2004)
Operating Systems Lab (Distributed Applications in the Enterprise)
Dr. R. Lent

Programming Assignment 3

DUE: Monday, March 1, 8:00 PM

Assignment 3 will give you a better understanding of the Java-DBMS interaction via JDBC. You will have to extend the web crawler discussed in class to store the pages found in a MySQL database.

The web crawler works in the following way:

a. Receives an initial URL link from the command line
b. Fetches the page associated with the link(s)
c. Retrieves all URL links from the web page
d. Goes to (b) for every link. It repeats this process up to a predetermined depth.

Steps to complete:

1. Study the webcrawler code and fix any potential problem(s)

2.Create a batch file for MySQL to:

a. Create database cop4610se and user webs with password abc123. Grant permissions so this user be able to select/insert/update the database just from localhost.
b. Create the following three tables:
i. protocols with fields: protocolID and protocolName
ii. servers with fields: serverID and serverAddr
iii. pages with fields protocolID, serverID, pageName, pageContent, and timestamp

Select the most appropriate field type for each case and define a PRIMARY KEY for each table. For example, protocolID could be of type Integer with an AUTO_INCREMENT modifier. See below for a description of how each field is to be used.

A batch file is a plain text file containing a sequence of SQL commands. Example: see books.sql. MySQL can execute a batch file with the following command from the command prompt:mysql -u root -p < books.sql

3. Write java/jdbc/sql code to store each web page found by the crawler in the database:

a. Divide each URL into its three components (protocol, host, and path/page name). Store each component in protocols.protocolName, server.serverAddr, and pages.pageName
b. Assign and add a new identification number to new protocols or servers found by the crawler. Store this information in either protocols.protocolID or protocols.serverID. The protocol and server identification numbers indicate the corresponding protocol and server that is associated with a stored page, which use the fields pages.protocol and pages.serverID for this purpose.
c. Field pages.pageContent stores a web page. You will have to remove all single quote characters from the text before storing the page in the database because quotes are a reserve symbol in MySQL.
d. Update the DB with every html or text page fetched by the crawler, even if an entry for that page already exists. Each entry in the pages table needs to be tagged with a timestamp, which indicates the last fetching time.

4. (Extra credit, 10pt) Parse web pages to remove HTML tagging so as to store only plain text into pages.pageContent

Software

JDK 1.4.x, an IDE

MySQL

JDBC driver for MySQL (copy mysql-connector-java-3.0.8-stable-bin.jar to your %JAVA_HOME%/jre/lib/ext directory)

Instructions and comments

1. Insert appropriate comments to facilitate the grading of your work. You may attach an extra text file explaining your program.

2. Pack (zip, rar, tar) your source files and submit your assignment before the due time/date via WebCT