Get HTML From A URL (Part 2)

Last week we posted a tip on getting the HTML out of a web page. That tip was "limited" by the maximum size of a string (2GB in ND6). This tip updates the code to remove that limitation.

To remove the limitation, all the HTML is put into a Java Vector, one line of HTML per element. Each element of the vector is a string (with its own 2GB limit), but the overall size of the vector is limited only by the amount of memory on the computer running the agent. Vectors don't have a direct equivalent in LotusScript, so additional methods are needed to pull individual elements out of the vector. The updated Java here still goes into a Java library, just like the previous tip:

import java.io.*;
import java.net.*;
import java.util.*;

public class GetHTML {

   private Vector result = new Vector(); // An array of each line of HTML

   public void readHTML(String urlToRead) {
      URL url; // The URL to read
      HttpURLConnection conn; // The actual connection to the web page
      BufferedReader rd; // Used to read results from the web page
      String line; // An individual line of the web page HTML
      try {
         url = new URL(urlToRead);
         conn = (HttpURLConnection) url.openConnection();
         conn.setRequestMethod("GET");
         rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
         while ((line = rd.readLine()) != null) {
            result.addElement(line);
         }
         rd.close();
      } catch(Exception e) {
         e.printStackTrace();
      }
   }

   public int numLines() {
      return result.size();
   }

   public String getHTML(int pos) {
      if (pos >= 0 && pos < result.size()) { // Guard against negative and too-large positions
         return (String)result.elementAt(pos);
      } else {
         return "";
      }
   }
}

There are three methods in this class. The first, "readHTML", reads the web page and stores each line in a vector internal to the Java class (the variable is declared private). It does not return a value. The second, "numLines", returns the total number of lines in the vector. The third, "getHTML", returns the string at the requested position in the vector, or an empty string if the position is past the end.
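Because "readHTML" returns nothing and only prints a stack trace when something goes wrong, the calling code cannot tell a failed read from an empty page. One possible variation is sketched below: it returns a boolean and remembers the error text. The class name "GetHTMLChecked" and the method "getLastError" are made-up names for illustration, not part of the original tip, and the sketch has not been tried inside a Notes agent.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Vector;

// Hypothetical variant of GetHTML. It is not declared public so the sketch
// can be pasted into any source file; in a library of its own, declare it
// public the same way GetHTML is.
class GetHTMLChecked {

   private Vector result = new Vector(); // One element per line of HTML
   private String lastError = "";        // Empty when the last read succeeded

   // Reads the page; returns true on success and false on any failure
   public boolean readHTML(String urlToRead) {
      result.removeAllElements(); // Forget lines from any earlier read
      lastError = "";
      BufferedReader rd = null;
      try {
         URL url = new URL(urlToRead);
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         conn.setRequestMethod("GET");
         rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
         String line;
         while ((line = rd.readLine()) != null) {
            result.addElement(line);
         }
         return true;
      } catch (Exception e) {
         lastError = e.toString(); // Lines read before the failure stay in the vector
         return false;
      } finally {
         // Close the reader whether or not the read succeeded
         if (rd != null) {
            try { rd.close(); } catch (Exception ignored) { }
         }
      }
   }

   // Text of the last failure, or "" if the last read worked
   public String getLastError() {
      return lastError;
   }

   public int numLines() {
      return result.size();
   }

   public String getHTML(int pos) {
      if (pos >= 0 && pos < result.size()) {
         return (String) result.elementAt(pos);
      }
      return "";
   }
}
```

With something like this, the calling code could check the return value of "readHTML" and show "getLastError" instead of silently looping over zero lines.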

The LotusScript changes a little bit. Before, all the HTML was returned in one statement. Now we read the HTML in one step and then make an extra call for each individual line of HTML. Here's some updated LotusScript to access the Java class and read the web page. Refer to last week's tip for more details on calling Java from LotusScript.

Const myURL = "http://www.breakingpar.com"
Dim js As JAVASESSION
Dim getHTMLClass As JAVACLASS
Dim getHTMLObject As JavaObject
Dim size As Long ' Long, not Integer: a LotusScript Integer stops at 32,767 lines
Dim i As Long
Dim html As String

Set js = New JAVASESSION
Set getHTMLClass = js.GetClass("GetHTML")
Set getHTMLObject = getHTMLClass.CreateObject
Call getHTMLObject.readHTML(myURL)
size = getHTMLObject.numLines() ' Get the total vector size (elements 0 to size-1)
For i = 0 To size-1
   html = getHTMLObject.getHTML(i)
   Print html
Next

This LotusScript simply prints the HTML one line at a time, but you could do whatever you want with it (scan it for certain tags, for example). If you run into any limits with this script, chances are you need more memory on your machine. More generally, this example should give you an idea of how to work with Vectors in Java compared to arrays in LotusScript. Vectors are a little more powerful (they grow as needed, for one thing), so you could create a Java library that just acts as an interface to a Vector, then use that in LotusScript.
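As a rough sketch of that last idea, here is what a small Vector "interface" class might look like. Everything in it ("LineStore", "add", "get", "size", "clear", "find") is a hypothetical name chosen for this example, and the "find" method is just one way the tag scanning mentioned above could be done on the Java side.

```java
import java.util.Vector;

// Hypothetical wrapper that exposes a Vector through methods taking and
// returning only ints and Strings. Not declared public so the sketch can be
// pasted into any source file; declare it public in a library of its own.
class LineStore {

   private Vector lines = new Vector(); // Grows as elements are added

   // Appends a string to the end of the vector
   public void add(String line) {
      lines.addElement(line);
   }

   // Number of elements (valid positions are 0 to size()-1)
   public int size() {
      return lines.size();
   }

   // Element at the given position, or "" if the position is out of range
   public String get(int pos) {
      if (pos >= 0 && pos < lines.size()) {
         return (String) lines.elementAt(pos);
      }
      return "";
   }

   // Empties the vector
   public void clear() {
      lines.removeAllElements();
   }

   // Position of the first element at or after 'start' that contains 'text'
   // (ignoring case), or -1 if there is none
   public int find(String text, int start) {
      String needle = text.toLowerCase();
      for (int i = Math.max(start, 0); i < lines.size(); i++) {
         String line = (String) lines.elementAt(i);
         if (line.toLowerCase().indexOf(needle) >= 0) {
            return i;
         }
      }
      return -1;
   }
}
```

To scan for links, for example, the caller could ask for find("<a ", 0), process that line, then call find again starting one past the position returned, until it gets -1 back.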

Breaking Par Consulting
