
Nutch…too much Nutch


I spent the whole day yesterday trying to work through the Nutch source code. Chris and Ashish helped me out, along with this link: Dissecting the Nutch Crawler. This showed me that:
The file Fetcher.java has a reference to the “content” variable (which is of type Content). I found that during the crawl only the URLs are stored at first, and then a request is sent. Based on the MIME type of the content returned, the ParserFactory class creates a parser (HTML parser, PDF parser, etc.). The code for these parsers can be found at nutch-0.6/src/plugin/. These plugins do the parsing and return the result as a “Parse” object. Using the Parse.getText() method (which we also found interesting) we can get the text content of any page!
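To make that flow a bit more concrete for myself, here is a rough sketch of the idea: pick a parser by MIME type, parse the raw bytes, then ask the result for its text. This is NOT the actual Nutch code. Every name in it (MimeDispatchSketch, PageParser, ParsedPage, ToyParserFactory, extractText) is made up for illustration, and the "HTML parser" just strips tags with a regex, which is surely far cruder than what the real plugins do.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: these types are hypothetical stand-ins, not Nutch's classes.
public class MimeDispatchSketch {

    /** Hypothetical stand-in for a parse result; it only exposes the extracted text. */
    public interface ParsedPage {
        String getText();
    }

    /** Hypothetical stand-in for a parser plugin. */
    public interface PageParser {
        ParsedPage parse(byte[] rawContent);
    }

    /** Toy HTML parser: replaces tags with spaces and collapses whitespace. */
    static class ToyHtmlParser implements PageParser {
        @Override
        public ParsedPage parse(byte[] rawContent) {
            String html = new String(rawContent, StandardCharsets.UTF_8);
            String text = html.replaceAll("<[^>]*>", " ")
                              .replaceAll("\\s+", " ")
                              .trim();
            return () -> text;
        }
    }

    /** Plain text needs no parsing beyond decoding the bytes. */
    static class PlainTextParser implements PageParser {
        @Override
        public ParsedPage parse(byte[] rawContent) {
            String text = new String(rawContent, StandardCharsets.UTF_8).trim();
            return () -> text;
        }
    }

    /** Toy factory: maps a MIME type to a parser (assumed registry, for illustration). */
    static class ToyParserFactory {
        private final Map<String, PageParser> parsers = new HashMap<>();

        ToyParserFactory() {
            parsers.put("text/html", new ToyHtmlParser());
            parsers.put("text/plain", new PlainTextParser());
        }

        PageParser getParser(String mimeType) {
            PageParser parser = parsers.get(mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for " + mimeType);
            }
            return parser;
        }
    }

    /** Fetched bytes + MIME type in, plain text out. */
    public static String extractText(String mimeType, byte[] rawContent) {
        PageParser parser = new ToyParserFactory().getParser(mimeType);
        return parser.parse(rawContent).getText();
    }

    public static void main(String[] args) {
        byte[] page = "<html><body><h1>Nutch</h1><p>too much Nutch</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText("text/html", page));
    }
}
```

The point of the sketch is only the shape of the dispatch: the caller never needs to know which parser ran, it just calls getText() on whatever comes back.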


Nutching Nutching Nutching


I spent the whole day today analyzing the Nutch source code with Anshul. It is almost 8:00 pm now and nothing has been done yet! I have received an e-mail from Chris Mattman and Ashish Vaidya giving some pointers. Hopefully, it’s gonna help!
I had problems compiling the code as well. Strangely, when I installed j2sdk from the Sun Java site, I did not get javac in the /usr/java/jre1.5.0_02/bin directory. Since I did not have enough time to look for the files, I simply downloaded NetBeans and, at Daddu’s suggestion, got the thing compiled!
Will blog later when I have had some success with Nutch.
-Rajat.
Rajat’s Abode