Friday, April 17, 2015

R : Webcrawler Parser with Try-Catch

I wanted to analyze content hosted on an initial set of websites, then aggregate and plot it. To do that I wrote a very simple parser that crawls these websites and parses the data read from them. I wrapped the work in a tryCatch block for fault tolerance, so that a single failure (website down or unavailable, a trust error, etc.) does not stop the whole run. Here is some sample code that uses tryCatch and readLines:

  
myUrlStats <- function(urlToCrawl) {
    statData <- tryCatch(
        {
            # Read the page line by line, then hand the lines to the parser
            dataReadFromUrls <- readLines(con = urlToCrawl)
            myWebCrawlParser(dataReadFromUrls)
        },
        error = function(errorCondition) {
            message(paste("URL does not seem to exist:", urlToCrawl))
            message("Error message:")
            message(conditionMessage(errorCondition))
            # Sentinel value returned when the URL could not be read
            return(-1)
        },
        warning = function(warningCondition) {
            message(paste("URL caused a warning:", urlToCrawl))
            message("Warning message:")
            message(conditionMessage(warningCondition))
            # Choose a return value in case of warning
            return(NULL)
        },
        finally = {
            # Clean-up code goes here; it runs whether or not the read succeeded
        }
    )
    return(statData)
}
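
To see how the handlers above decide the return value without touching the network, here is a minimal sketch. `demoTryCatch` is a hypothetical helper, not part of the crawler; it raises an error or a warning on request so that each branch can be observed, and it reuses the same return conventions (-1 for an error, NULL for a warning).

```r
# Hypothetical helper (not part of the crawler): triggers each tryCatch branch on demand.
demoTryCatch <- function(mode) {
    finallyRan <- FALSE
    value <- tryCatch(
        {
            if (mode == "error") stop("simulated failure")
            if (mode == "warning") warning("simulated warning")
            "parsed"   # only reached when nothing was signalled
        },
        error = function(e) -1,
        warning = function(w) NULL,
        finally = {
            # Evaluated in this function's frame, so the flag below is updated
            finallyRan <- TRUE
        }
    )
    list(value = value, finallyRan = finallyRan)
}

okResult   <- demoTryCatch("ok")
errResult  <- demoTryCatch("error")
warnResult <- demoTryCatch("warning")
```

One thing this makes visible: a warning abandons the rest of the block, just like an error does. In myUrlStats that means a warning from readLines (for example about an incomplete final line) returns NULL and the parser is never called, which may or may not be what you want.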


myWebCrawlParser <- function(dataReadFromURL) {
    # Do your analysis and parsing here.
    # You can also mine outlinks from the data read, to traverse further web links.
    return(1)
}
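
As one possible way to mine outlinks inside the parser, the sketch below pulls absolute http(s) links out of href attributes with a regular expression. `extractOutlinks` and the sample lines are hypothetical, and a regex is only a rough approximation of HTML parsing; a real HTML parser would be more robust.

```r
# Hypothetical helper: collect absolute http(s) links found in href="..." attributes.
extractOutlinks <- function(dataReadFromURL) {
    hrefPattern <- "href=[\"'](https?://[^\"']+)[\"']"
    # One character vector of matches per input line
    matches <- regmatches(dataReadFromURL, gregexpr(hrefPattern, dataReadFromURL))
    hrefs <- unlist(matches)
    # Keep only the captured URL, dropping the href= prefix and the quotes
    links <- sub(hrefPattern, "\\1", hrefs)
    unique(links)
}

# Made-up lines standing in for the output of readLines()
sampleLines <- c(
    "<a href=\"http://example.com/a\">A</a> <a href='https://example.org/b'>B</a>",
    "<p>no links here</p>",
    "<a href=\"http://example.com/a\">A again</a>"
)
outlinks <- extractOutlinks(sampleLines)
```

The resulting vector could be passed back to myUrlStats to crawl one level deeper.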


urlToCrawl <- c(
    "http://superdevresources.com",
    "http://superbloggingresources.com"
)
finalResult <- mapply(myUrlStats, urlToCrawl)
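
Once the crawl has run, the results can be sorted into successes and failures before aggregating. This is only a sketch: `classifyResult` is a hypothetical helper, and the results are simulated so that it runs without network access. It relies on the return conventions used above (-1 for an error, NULL for a warning), so it would misreport a parser that legitimately returns -1.

```r
# Hypothetical helper: label one crawl result using the return conventions above.
classifyResult <- function(result) {
    if (is.null(result)) {
        "warning"
    } else if (identical(result, -1)) {
        "error"
    } else {
        "ok"
    }
}

# Simulated results standing in for the output of mapply(myUrlStats, urlToCrawl)
simulatedResults <- list(
    "http://example.com" = 1,
    "http://example.org" = -1,
    "http://example.net" = NULL
)

# One status label per URL, then a count of each label
status <- vapply(simulatedResults, classifyResult, character(1))
statusCounts <- table(status)
```
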
Happy programming!
