Well, I needed to run some analysis over content hosted on an initial set of websites and then aggregate and plot the results. I wrote a very simple parser that crawls these websites and parses the data read from them, and I used a try-catch block for fault tolerance and resilience against errors (website down, unavailable, trust/certificate errors, etc.). Here is some sample code using the tryCatch and readLines functions:
> myUrlStats <- function(urlToCrawl) {
    statData <- tryCatch(
      {
        dataReadFromUrls <- readLines(con = urlToCrawl)
        myWebCrawlParser(dataReadFromUrls)
      },
      error = function(errorMessageStr) {
        message(paste("URL does not seem to exist:", urlToCrawl))
        message("Error message:")
        message(errorMessageStr)
        return(-1)
      },
      warning = function(warningMessageStr) {
        message(paste("URL caused a warning:", urlToCrawl))
        message("Warning message:")
        message(warningMessageStr)
        # Choose a return value in case of warning
        return(NULL)
      },
      finally = {
        ## Clean up code
      }
    )
    return(statData)
  }
> myWebCrawlParser <- function(dataReadFromURL) {
    # Do your parsing and analysis here; you can also mine outlinks from the
    # data you have read to traverse further web links (see the sketch below)
    return(1)
  }
> urlsToCrawl <- c(
    "http://superdevresources.com",
    "http://superbloggingresources.com"
  )
> finalResult <- mapply(myUrlStats, urlsToCrawl)
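With the stub parser above, each entry of finalResult is 1 for a URL that was read and parsed, -1 where the error handler fired, and NULL where the warning handler fired, so the failed URLs can be filtered out before aggregating and plotting. The names okParse and successfulUrls below are just illustrative:

> okParse <- sapply(finalResult, function(x) !is.null(x) && !identical(x, -1))
> successfulUrls <- names(finalResult)[okParse]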
Happy programming!