Friday, April 17, 2015

R: Web Crawler Parser with tryCatch

Well, I needed to run some analysis over content hosted on an initial set of websites, then aggregate and plot the results. So I wrote a very simple parser that crawls and parses the data read from these websites. I used a tryCatch block for fault tolerance and resilience against errors (website down, unavailable, trust/certificate issues, etc.). Here is some sample code using the tryCatch and readLines functions:

> myUrlStats <- function(urlToCrawl) {
      statData <- tryCatch({
          # The value of this last expression is what tryCatch returns on success
          readLines(con = urlToCrawl)
      },
      error = function(errorMessageStr) {
          message(paste("URL does not seem to exist:", urlToCrawl))
          message(paste("Error message:", conditionMessage(errorMessageStr)))
          NA  # Choose a return value in case of error
      },
      warning = function(warningMessageStr) {
          message(paste("URL caused a warning:", urlToCrawl))
          message(paste("Warning message:", conditionMessage(warningMessageStr)))
          NA  # Choose a return value in case of warning
      },
      finally = {
          ## Clean-up code: runs whether the read worked or not
          message(paste("Finished processing URL:", urlToCrawl))
      })
      return(statData)
  }
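
For a quick sanity check, you can call it on a single URL first (the address below is just a placeholder, not one of the sites I actually crawled):

> statData <- myUrlStats("http://www.example.com")
> head(statData)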

> myWebCrawlParser <- function(dataReadFromURL) {
      # Do your analysis/parsing here
      # You can also mine other out-links from the data read, for further
      # traversal of web-links (see the rough sketch below)
  }
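
As a rough illustration of that out-link mining, the body could be filled in with a simple regular expression over the lines read. This is only a sketch and an assumption on my part: the pattern below catches plain absolute http/https links and is no substitute for a real HTML parser.

> myWebCrawlParser <- function(dataReadFromURL) {
      # Pull out anything that looks like an absolute http/https link
      unlist(regmatches(dataReadFromURL, gregexpr("https?://[^\"'<> ]+", dataReadFromURL)))
  }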

> # Start URLs: replace these placeholders with the sites you actually want to crawl
> urlToCrawl <- c("http://www.example.com", "http://www.example.org")
> finalResult <- mapply(myUrlStats, urlToCrawl)
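
Since the point of the exercise was to aggregate and plot, here is a minimal sketch of one possible follow-up, assuming all you want to chart is how many lines were read per site:

> linesPerUrl <- sapply(finalResult, function(x) if (all(is.na(x))) 0 else length(x))
> barplot(linesPerUrl, las = 2, main = "Lines read per URL")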
Happy programming!
