Assignment 5

Due Monday April 14, 2008 6:00pm on-line

Supplemental pages

Site Scraping

Many web sites contain a plethora of useful information. However, separating the useful content from ads, graphics, layout tags, and other peripheral information can be challenging. Humans are quite adept at looking at a page and filtering items of interest as well as identifying appropriate relationships; computers are not.

A challenge arises when we wish to extract information of interest from these sites in a programmatic way - through programs rather than through our eyes. We yearn for easy-to-parse, information-only representations. Instead, we are confronted with JavaScript, multiply-nested tables, font and color tags, and a dissociation between attribute names and their values (or even an absence of attribute names).

One of the more popular messages preached by XML zealots is that you will be able to run a program that will connect to Barnes & Noble, Amazon, Borders, and other bookstores and tell you what books are in stock and who has the best price. Such a program will be easy to write because all bookstores will store their data in a simple-to-parse XML format, containing only tagged information. While this has happened, and is happening, in many areas, a major obstacle is that it is not necessarily in the interest of bookstores (or other vendors) to offer this sort of service. A program-friendly interface does not give a vendor the opportunity to sell you on the product, offer ads, or sell you on using their service over others. As such, we are stuck with these difficult-to-parse web sites.

Because there is interest in accessing the useful information on certain web sites programmatically, an industry has emerged that provides software for site scraping. Site scraping is a term for connecting to a web site and parsing out needed content, and it covers a number of application areas. For instance, transcoding systems focus on converting formats for different media (e.g. rendering a graphics-rich web page on a PDA); some systems concentrate on grabbing headlines from popular sites (e.g. the Linux asDrinks toolset). Another form of site scraping deals with content aggregation. For example, companies such as yodlee.com aggregate financial records via one interface. The site scraping software will connect to the various web sites for a user's banking and brokerage accounts (which will usually involve executing a script to log in and go through several web pages) and parse out the relevant information to provide the user with a consolidated view.

Your assignment

Your assignment is to implement a site scraping service that grabs weather information on request from accuweather.com. The interface to this service is via Java RMI, so that programmers who would like to provide users with weather information can obtain the necessary data via a remote procedure call and not have to deal with parsing HTTP data.

Your site scraping service functions as both a client and a server: it is a service that accepts requests for weather via Java RMI. It then acts as a client and converts the requested ZIP code into an HTTP request to the accuweather.com server. The response is then parsed and the appropriate components are returned to the requestor as the return value of the RMI call. You will also write a client to demonstrate and test this interface.

The following figure illustrates the process:

Details

You will be writing two programs for this assignment:

client

The client is designed to test your weather server. It accepts a simple command-line interface where the user can enter one or more zip codes on the command line:
java WeatherClient [-h rmiregistryserver] [-p port] zipcode [zipcode ...]
where
  • rmiregistryserver is the name of the server where you are running the rmiregistry that the server used for registering itself (typically the same system as the server). The -h rmiregistryserver is optional. If it is omitted, your program should use localhost as a default.
  • port is the number of the port on which the rmiregistry is listening. If -p port is omitted, your program will use the default rmiregistry port, 1099.
  • zipcode is a five-digit zip code (for example, 07974).
The client looks up the RMI service named Weather and iterates over each supplied zip code calling an RMI procedure with the zip code as a parameter. This RMI procedure returns some details about the weather. This information is printed on the standard output. For example:
$ java WeatherClient 07974 23451 99999 
zipcode: 07974
town: New Providence, NJ
temp: 63
weather: Mostly Sunny
humidity: 23
Wind direction: SW
Wind speed: 10

zipcode: 23451
town: Virginia Beach, VA
temp: 59
weather: Sunny
humidity: 33
Wind direction: E
Wind speed: 10

zipcode: 99999
town: NA
temp: 0
weather: NA
humidity: 0
Wind direction: NA
Wind speed: 0
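
The command-line handling described above can be sketched as follows. This is only one possible approach, not the required design: the class name ClientArgs and its fields are hypothetical, and the lookup name assumes the //host:port/name form accepted by java.rmi.Naming.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that holds the parsed command line of the test client. */
final class ClientArgs {
    String host = "localhost";   // default when -h is omitted
    int port = 1099;             // default rmiregistry port when -p is omitted
    final List<String> zips = new ArrayList<>();

    /** Parses [-h host] [-p port] zipcode [zipcode ...]; the flags may appear in either order. */
    static ClientArgs parse(String[] args) {
        ClientArgs parsed = new ClientArgs();
        int i = 0;
        // Consume flags until the first argument that is neither -h nor -p.
        while (i < args.length && (args[i].equals("-h") || args[i].equals("-p"))) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("missing value after " + args[i]);
            }
            if (args[i].equals("-h")) {
                parsed.host = args[i + 1];
            } else {
                // NumberFormatException is a subclass of IllegalArgumentException.
                parsed.port = Integer.parseInt(args[i + 1]);
            }
            i += 2;
        }
        // Everything that remains is treated as a zip code.
        for (; i < args.length; i++) {
            parsed.zips.add(args[i]);
        }
        if (parsed.zips.isEmpty()) {
            throw new IllegalArgumentException("at least one zip code is required");
        }
        return parsed;
    }

    /** Name that could be passed to java.rmi.Naming.lookup for the service called Weather. */
    String lookupName() {
        return "//" + host + ":" + port + "/Weather";
    }
}
```

A client's main method could call ClientArgs.parse(args), print a usage message and exit if it throws IllegalArgumentException, and otherwise look up lookupName() and loop over zips.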

scraping server

Your scraping server, named WeatherServer, accepts an optional argument identifying the port number on which the rmiregistry is listening:
java WeatherServer [port]
The rmiregistry process runs on the same machine as the server. The main program's (WeatherServer.java) only functions are to parse arguments, install a security manager, and register the Weather service with the RMI registry.
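
A rough sketch of such a main program is shown below, under several assumptions. PingInterface and Ping are placeholders standing in for WeatherInterface and your Weather class so that the sketch compiles on its own; ServerSketch, parsePort, and bindingName are hypothetical names. The security manager step is left as a comment because how it is installed depends on the Java version and policy file you use.

```java
import java.rmi.Naming;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.server.UnicastRemoteObject;

/** Placeholder remote interface; the real server would use WeatherInterface. */
interface PingInterface extends Remote {
    String ping() throws RemoteException;
}

/** Placeholder remote object standing in for the Weather class. */
class Ping extends UnicastRemoteObject implements PingInterface {
    private static final long serialVersionUID = 1L;

    Ping() throws RemoteException {
        super();
    }

    public String ping() throws RemoteException {
        return "pong";
    }
}

final class ServerSketch {
    static final int DEFAULT_PORT = 1099;   // default rmiregistry port

    /** Returns the port given on the command line, or the default if none was given. */
    static int parsePort(String[] args) {
        if (args.length == 0) {
            return DEFAULT_PORT;
        }
        if (args.length > 1) {
            throw new IllegalArgumentException("usage: java WeatherServer [port]");
        }
        int port = Integer.parseInt(args[0]);
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return port;
    }

    /** Name under which the service is registered with the local rmiregistry. */
    static String bindingName(int port) {
        return "//localhost:" + port + "/Weather";
    }

    public static void main(String[] args) throws Exception {
        int port = parsePort(args);
        // The assignment also asks for a security manager to be installed here;
        // that step is omitted from this sketch.
        Naming.rebind(bindingName(port), new Ping());
        System.out.println("Service bound as " + bindingName(port));
    }
}
```

Running main requires an rmiregistry already listening on the chosen port; parsePort and bindingName can be checked without one.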

The Weather service implements a single function - one that accepts a single zip code as input and returns a data structure containing:

  • zip code (String)
  • place name (String)
  • temperature (int)
  • weather conditions (String)
  • humidity (int)
  • wind direction (String)
  • wind speed (int)

The RMI interface for the weather server is the file WeatherInterface.java, which contains:

import java.rmi.*;
import java.io.Serializable;
/* import weatherinfo; */ // only for versions < 1.4 

public interface WeatherInterface extends Remote {
	public weatherinfo get(String zip) throws RemoteException;
}
The weatherinfo data structure is defined in weatherinfo.java and contains the following:
import java.io.*;
import java.io.Serializable;

public class weatherinfo implements Serializable {
	public String zip;
	public String place;
	public int temp;
	public String weather;
	public int humidity;
	public String wind_dir;
	public int wind_speed;
}
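
As an illustration of how this structure might be filled in and printed, here is a small sketch. It repeats weatherinfo (without the public modifier) only so that it compiles standalone; in your project you would use the supplied file unchanged. The helper names unknown and format are hypothetical, and the NA/0 defaults are simply taken from the sample output for zip code 99999 shown earlier.

```java
import java.io.Serializable;

/** Copy of the supplied data structure, repeated here only to keep the sketch self-contained. */
class weatherinfo implements Serializable {
    public String zip;
    public String place;
    public int temp;
    public String weather;
    public int humidity;
    public String wind_dir;
    public int wind_speed;
}

final class WeatherInfoSketch {
    /** Builds the record returned when nothing could be found for a zip code. */
    static weatherinfo unknown(String zip) {
        weatherinfo w = new weatherinfo();
        w.zip = zip;
        w.place = "NA";
        w.temp = 0;
        w.weather = "NA";
        w.humidity = 0;
        w.wind_dir = "NA";
        w.wind_speed = 0;
        return w;
    }

    /** Formats a record using the labels from the sample client output. */
    static String format(weatherinfo w) {
        return "zipcode: " + w.zip + "\n"
             + "town: " + w.place + "\n"
             + "temp: " + w.temp + "\n"
             + "weather: " + w.weather + "\n"
             + "humidity: " + w.humidity + "\n"
             + "Wind direction: " + w.wind_dir + "\n"
             + "Wind speed: " + w.wind_speed + "\n";
    }
}
```

Starting every lookup from unknown(zip) and overwriting only the fields that parse successfully is one way to keep blank or erroneous fields from causing failures.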

For the Weather service to do its work, it establishes a TCP connection to the server at wwwa.accuweather.com and sends an HTTP request with an encoded query for the zip code of interest. The response is read in, and the data of interest is extracted and returned to the requestor (via RMI).

The page you'll access will be the one containing more detailed information about current conditions, for example: http://wwwa.accuweather.com/forecast-current-conditions.asp?zipcode=23451.
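
Building the request URL from a zip code might look like the sketch below. The class and method names are hypothetical, the base URL is copied from the example above, and the five-digit check is an assumption based on the zip code format described for the client.

```java
/** Hypothetical helper that turns a zip code into the URL of the current-conditions page. */
final class WeatherUrl {
    static final String BASE =
        "http://wwwa.accuweather.com/forecast-current-conditions.asp?zipcode=";

    /** True if the argument consists of exactly five decimal digits. */
    static boolean isValidZip(String zip) {
        return zip != null && zip.matches("[0-9]{5}");
    }

    /** Returns the page URL as a string, suitable for passing to java.net.URL. */
    static String forZip(String zip) {
        if (!isValidZip(zip)) {
            throw new IllegalArgumentException("not a five-digit zip code: " + zip);
        }
        return BASE + zip;
    }
}
```

Rejecting malformed input before any connection is made means that only digits ever reach the query string, so no further URL encoding is needed in this sketch.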

N.B.: Do not modify the interface in WeatherInterface.java or the data structure in weatherinfo.java.

Implementation Advice

  1. Before starting this assignment, make sure that you can understand, compile, and run programs that use Java RMI. Read the recitation notes for programming Java RMI (see the RPC lecture). Then download the sample program (rmi-example.zip or rmi-example.tar). Extract it with:
    unzip rmi-example.zip
    or:
    tar xvf rmi-example.tar
    This will create a directory named rmi-example with a bunch of files in it, including a README file.

  2. Using the sample program as a guide, write your client and a dummy server that just returns hard-coded data using the weatherinfo data structure.

  3. Before going further, it's time to understand the interface to www.accuweather.com's web site:
    1. Go to www.accuweather.com.
    2. Type in a zip code in the Local Weather Forecast entry on the top-left.
    3. Note the URL that you were directed to. This will be the URL that you want to use, modifying the zip code in the string as needed. Note that you can omit the partner, traveler, and metric arguments in the URL.
    4. View the source. Search for the location of the town name, current temperature, and the detailed information in the table (humidity, wind speed, wind direction). For example, you will find that the current temperature is represented something like this:

      id="quicklook_current_temps">72&deg;F</div>

      In your program, you may search the returned text for the string Temperature:</div> and pick up the 72 to get the current local temperature.

  4. Next, write a stand-alone program that connects to a URL (with a specific zip code embedded in it) and retrieves the entire contents of the page (e.g. writes the page to the standard output). Save the output of a page and find where the information you need is stored and how it looks in the HTML output. A useful reference for help with using Java's URL class is How can I fetch files using HTTP? in the Java Network Programming FAQ.
  5. You will now have to figure out how to parse the returned HTML data and extract the fields you need. Do not assume that the data always starts at a specific byte offset! Instead, think about looking for specific tags in the file and use logic such as: "the datum I want is located after the first <b> following the fifth < and ends at the next <b>." In many cases, you can narrow it down to something like 'the next occurrence of "left"> after Humidity.' You can then use Java's indexOf function to look for such tokens iteratively.
    N.B.: Do not accumulate the HTTP data by appending to a Java String - that's woefully inefficient since Java strings are immutable and every append creates a new copy. Define a StringBuffer and read into that. When you're done reading, then convert the StringBuffer to a String.
  6. Once your parsing works on the stand-alone program, you can integrate the code from steps 4-5 into the RMI server. Name the client file WeatherClient.java. Name the server file WeatherServer.java. Name the file that implements the weather service Weather.java.

  7. Test your code. Make sure that you can handle nonexistent zip codes gracefully. Make sure that you can handle blank or erroneous fields. Make sure that the -h and -p flags are optional and may be present (or not) in any order before the zip codes at the client. Make sure you exit gracefully if the client cannot find the service.

  8. Clean up your code! Make sure that you remove all commented-out code. Make sure that your code is indented cleanly and consistently and that your comments are readable. Make sure your variable names are sensible and not too long. Make sure the names are appropriate. For example, just because the sample Hello server has a constructor that accepts a string does not mean that you need one.
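
The reading and parsing advice in steps 4 and 5 can be sketched roughly as follows. This is a minimal illustration, not the required solution: ScrapeSketch, readAll, between, and toIntOrZero are hypothetical names, and the marker strings are taken from the temperature example in step 3. The actual page layout may differ, so check the markers against a saved copy of the page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ScrapeSketch {
    /** Reads everything from the reader into one StringBuffer and converts it to a String once. */
    static String readAll(Reader in) throws IOException {
        StringBuffer page = new StringBuffer();
        char[] chunk = new char[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            page.append(chunk, 0, n);
        }
        return page.toString();
    }

    /**
     * Returns the trimmed text between the first startToken found at or after
     * index from and the next endToken, or null if either token is missing.
     */
    static String between(String page, String startToken, String endToken, int from) {
        int start = page.indexOf(startToken, from);
        if (start < 0) {
            return null;
        }
        start += startToken.length();
        int end = page.indexOf(endToken, start);
        if (end < 0) {
            return null;
        }
        return page.substring(start, end).trim();
    }

    /** Parses an int field, falling back to 0 when the field is missing or malformed. */
    static int toIntOrZero(String s) {
        if (s == null) {
            return 0;
        }
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the network stream in this demonstration.
        String html = "<div id=\"quicklook_current_temps\">72&deg;F</div>";
        String page = readAll(new StringReader(html));
        int temp = toIntOrZero(between(page, "quicklook_current_temps\">", "&deg;", 0));
        System.out.println("temp: " + temp);   // prints "temp: 72"
    }
}
```

In the server, the StringReader would be replaced by a reader over the HTTP response, for example an InputStreamReader wrapped around the stream returned by URL.openStream(). Returning null or 0 for missing tokens lets the caller fall back to the NA values instead of throwing.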

Submission

Make sure you name your client WeatherClient and your server WeatherServer so I won't have to figure out what's what. Name the program that RMI calls Weather.java so that I can run rmic Weather to generate the stubs.

Make sure all the files are in one directory and that you don't submit any extra files (e.g. server.java~). Do not include any class files.

Before submitting, create a file named id that contains your full name on the first line and your id number on the second line.

To submit the assignment, you will create a tar archive of all the files that are needed to compile the program and the id file. For example,

tar cvf a5.tar id WeatherClient.java Weather.java WeatherServer.java

Before submitting the file to me, make sure that all the components are there (minus the supplied files). If I can't compile it, you will lose virtually all credit:

mkdir test
cd test
tar xvf a5.tar
# copy in policy, weatherinfo.java, WeatherInterface.java
javac *.java
rmic Weather
Also, make sure that no stray files are present (e.g., .class files or emacs temporary files). You will lose points for sloppy submissions.

Hand the assignment in using the web-based handin procedure. Go to https://handin.rutgers.edu/. If you never used web-based handin, check out the instructions at https://handin.rutgers.edu/handin/manual/stud/stud.html . The project name for this assignment is Assignment 5.

Pre-submission notes

Important: please check my notes on submitting assignments before you hand in your assignment.