DUE: Monday, March 1, 8:00 PM
Assignment 3 will give you a better understanding of the Java-DBMS interaction via JDBC. You will have to extend the web crawler discussed in class to store the pages found in a MySQL database.
The web crawler works in the following way:
a. Receives an initial URL link from the command line
b. Fetches the page associated with the link(s)
c. Retrieves all URL links from the web page
d. Goes to (b) for every link. It repeats this process up to
a predetermined depth.
1. Study the webcrawler code and fix any potential problem(s)
2.Create a batch file for MySQL to:
a. Create database cop4610se and user webs with
password abc123. Grant permissions so this user be able to
select/insert/update the database just from localhost.
b. Create the following three tables:
i. protocols with fields: protocolID and protocolName
ii. servers with fields: serverID and serverAddr
iii. pages with fields protocolID, serverID, pageName, pageContent, and
timestamp
Select the most appropriate field type for each case and define a PRIMARY KEY for each table. For example, protocolID could be of type Integer with an AUTO_INCREMENT modifier. See below for a description of how each field is to be used.
A batch file is a plain text file containing a sequence of SQL commands. Example: see books.sql. MySQL can execute a batch file with the following command from the command prompt:mysql -u root -p < books.sql
3. Write java/jdbc/sql code to store each web page found by the crawler in the database:
a. Divide each URL into its three components (protocol, host, and
path/page name). Store each component in
protocols.protocolName,
server.serverAddr, and
pages.pageName
b. Assign and add a new identification number to new protocols
or servers found by the crawler. Store this information in either
protocols.protocolID or protocols.serverID.
The protocol and server identification numbers indicate the corresponding
protocol and server that is associated with a stored page, which use
the fields pages.protocol and pages.serverID for this
purpose.
c. Field pages.pageContent stores a web page.
You will have to remove all single quote characters from the text
before storing the page in the database because quotes are a reserve
symbol in MySQL.
d. Update the DB with every html or text page fetched by the crawler,
even if an entry for that page already exists. Each entry in the pages
table needs to be tagged with a timestamp, which indicates the last
fetching time.
4. (Extra credit, 10pt) Parse web pages to remove HTML tagging so as to store only plain text into pages.pageContent
JDK 1.4.x, an IDE
MySQL
JDBC driver for MySQL (copy mysql-connector-java-3.0.8-stable-bin.jar to your %JAVA_HOME%/jre/lib/ext directory)
1. Insert appropriate comments to facilitate the grading of your work. You may attach an extra text file explaining your program.
2. Pack (zip, rar, tar) your source files and submit your assignment before the due time/date via WebCT