Fetching HTML content of a Web Page

5 November 2007

Sometimes you are required to fetch and store data from web pages. If there are too many pages to go through, this cannot realistically be done by hand. Java's standard library includes classes for downloading a page's HTML, which is the first step in extracting text from it.

The approach is simple. You fetch all the HTML contents of a web page, and then you write your own parser to extract the required information. For example, you might be asked to store only the text inside the table data (<td>) tags of the table captioned Hobbies. In that case you store all the HTML contents of the web page in a buffer and then parse that buffer for the Hobbies table.
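To make the parsing step concrete, here is a minimal sketch of such a parser. It assumes the page marks up the table with a <caption>Hobbies</caption> element and has no nested tables. The class name HobbiesParser and the method extractHobbies are made up for illustration. Regular expressions are enough for a small, predictable page like this, but they are not a general HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: class and method names are illustrative only.
class HobbiesParser {

    // One <table>...</table> block; DOTALL lets '.' span line breaks.
    // Assumption: tables are not nested inside one another.
    private static final Pattern TABLE = Pattern.compile(
            "<table\\b[^>]*>(.*?)</table>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // A caption whose text is "Hobbies" (matched case-insensitively, whitespace allowed).
    private static final Pattern CAPTION = Pattern.compile(
            "<caption\\b[^>]*>\\s*Hobbies\\s*</caption>", Pattern.CASE_INSENSITIVE);

    // The content of one <td> cell.
    private static final Pattern CELL = Pattern.compile(
            "<td\\b[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of every <td> cell in tables captioned "Hobbies".
    static List<String> extractHobbies(String html) {
        List<String> cells = new ArrayList<>();
        Matcher table = TABLE.matcher(html);
        while (table.find()) {
            String body = table.group(1);
            if (!CAPTION.matcher(body).find()) {
                continue; // some other table, skip it
            }
            Matcher cell = CELL.matcher(body);
            while (cell.find()) {
                // Drop tags nested inside the cell and trim surrounding whitespace.
                cells.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<table><caption>Pets</caption><tr><td>Cat</td></tr></table>"
                + "<table><caption>Hobbies</caption>"
                + "<tr><td>Chess</td></tr><tr><td><b>Cycling</b></td></tr></table>"
                + "</body></html>";
        System.out.println(extractHobbies(html)); // prints [Chess, Cycling]
    }
}
```

The sample HTML in main is invented for the demonstration. With a real page, you would pass in the buffer that holds the fetched HTML instead.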

Now let's see how to get the HTML contents of a web page. The java.net package provides classes that serve our purpose. We will need the following two classes:
- URLConnection
- URL

First create a URL object and specify the address of the page whose HTML contents you want. Then call the openConnection method of the URL object to get a URLConnection object. The getInputStream method of the URLConnection object returns the page contents as a stream. Finally, wrap that stream in a BufferedReader object and fetch the contents line by line using the readLine method of the BufferedReader object.

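import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

URL url = new URL("http://www.java-forums.org/faq.php");
URLConnection conn = url.openConnection();
// The connection's input stream carries the page contents; no DataInputStream is needed to read text.
BufferedReader d = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
// readLine() returns null at end of stream; ready() only says whether a read would block, so it can stop the loop early.
while ((line = d.readLine()) != null)
{
    System.out.println(line);
}
d.close();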

The output will be the HTML code of faq.php. Even though the web page is a PHP page, our application never sees the PHP source. Server-side code runs on the server, so we can only request and receive the HTML it produces.
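For reference, a self-contained version of the fetch might look like the sketch below. The class name PageFetcher and the method fetch are made up for illustration, and the code assumes the page is UTF-8 encoded, which a real page may not be. It builds the URL with URI.create(...).toURL(), which serves the same purpose as new URL(...) here. That constructor is deprecated in recent Java releases.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper class; the name PageFetcher is illustrative only.
class PageFetcher {

    // Downloads the resource at the given address and returns its text,
    // with every line terminated by '\n'.
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        URLConnection conn = url.openConnection();
        StringBuilder buffer = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream,
        // even if reading fails part-way through.
        // Assumption: the content is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // readLine() returns null once the end of the stream is reached.
            while ((line = reader.readLine()) != null) {
                buffer.append(line).append('\n');
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

Running java PageFetcher http://www.java-forums.org/faq.php would print the page's HTML, provided the site is reachable. Because fetch returns the contents as a String, the result can be handed directly to whatever parser you write.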
