Friday, April 17, 2015

R: Webcrawler Parser with Try-Catch

Well, I needed to do some analysis over content hosted on an initial set of websites and then aggregate and plot the results. I wrote a very simple parser that crawls these sites and parses the data read from them. I used a try-catch block for fault tolerance and resilience against errors (website down, unavailable, trust/certificate error, etc.). Here is some sample code using the tryCatch and readLines functions:

  
> myUrlStats <- function(urlToCrawl) {
    statData <- tryCatch(
        {
            # Read the raw page content and hand it to the parser
            dataReadFromUrls <- readLines(con = urlToCrawl)
            myWebCrawlParser(dataReadFromUrls)
        },
        error = function(errorMessageStr) {
            message(paste("URL does not seem to exist:", urlToCrawl))
            message("Error message:")
            message(errorMessageStr)
            return(-1)
        },
        warning = function(warningMessageStr) {
            message(paste("URL caused a warning:", urlToCrawl))
            message("Warning message:")
            message(warningMessageStr)
            # Choose a return value in case of warning
            return(NULL)
        },
        finally = {
            ## Clean-up code goes here
        }
    )
    return(statData)
}
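For a quick sanity check, calling the function on a host that does not resolve should fall through to the error handler and return -1; the URL below is just a placeholder, not one of the sites I crawled:

> myUrlStats("http://nonexistent.invalid")  # messages the readLines error and returns -1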


> myWebCrawlParser <- function(dataReadFromURL) {
    # Do your parsing and analysis here;
    # you can also mine outlinks from the data read, for further traversal of web links
    return(1)
}
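As an illustration of the parsing step, here is a minimal sketch of what the body of myWebCrawlParser could look like; the href regular expression and the outlink count used as the returned statistic are assumptions for this example, not the parser I actually used:

> myWebCrawlParser <- function(dataReadFromURL) {
    # Collapse the vector of lines read from the URL into one string
    pageText <- paste(dataReadFromURL, collapse = " ")
    # Pull out href="..." targets as candidate outlinks for further traversal (assumed regex)
    rawLinks <- regmatches(pageText,
                           gregexpr("href=[\"'](http[^\"']+)[\"']", pageText))[[1]]
    outlinks <- gsub("^href=[\"']|[\"']$", "", rawLinks)
    # Return a simple per-page statistic: the number of outlinks found
    length(outlinks)
}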


> urlToCrawl <- c(
      "http://superdevresources.com",
      "http://superbloggingresources.com"
  )
> finalResult <- mapply(myUrlStats, urlToCrawl)
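Since the error handler returns -1 and the warning handler returns NULL, the failed URLs can be filtered out before aggregating and plotting. Here is a small sketch of that step; the barplot call is only an assumed example of the plotting mentioned above:

> # mapply may return a list when some handlers returned NULL, so flatten it first;
> # unlist() drops the NULL entries and keeps the URL names
> okResults <- unlist(finalResult)
> okResults <- okResults[okResults != -1]  # drop URLs whose read raised an error
> barplot(okResults, las = 2, main = "Per-URL crawl statistic")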
Happy programming!

