Restyle Modification of Ralger package to suit rvest 1.0.2 #14

Merged 5 commits on Jun 18, 2022
12 changes: 6 additions & 6 deletions DESCRIPTION
@@ -4,17 +4,17 @@ Title: Easy Web Scraping
 Version: 2.2.4
 Authors@R: c(
     person("Mohamed El Fodil", "Ihaddaden", email = "[email protected]", role = c("aut", "cre")),
     person("Ezekiel", "Ogundepo", role = c("ctb")),
     person("Romain", "François", email = "[email protected]", role = c("ctb")))
 Maintainer: Mohamed El Fodil Ihaddaden <[email protected]>
 Description: The goal of 'ralger' is to facilitate web scraping in R.
 License: MIT + file LICENSE
 Encoding: UTF-8
 LazyData: true
 URL: https://github.com/feddelegrand7/ralger
 BugReports: https://github.com/feddelegrand7/ralger/issues
 VignetteBuilder: knitr
 Imports:
     rvest,
     xml2,
     tidyr,
@@ -24,9 +24,9 @@ Imports:
     crayon,
     curl,
     stringi
 Suggests:
     knitr,
     testthat,
     rmarkdown,
     covr
-RoxygenNote: 7.1.1
+RoxygenNote: 7.2.0
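
One thing this diff leaves untouched: `Imports:` lists `rvest` without a minimum version, even though the code now assumes the post-1.0.0 `html_table()` API. A hypothetical runtime guard, purely illustrative and not part of this PR:

```r
# Hypothetical check, not part of this PR: warn if the installed rvest
# predates the rewritten html_table() that table_scrap() now relies on.
if (utils::packageVersion("rvest") < "1.0.0") {
  warning("ralger 2.2.4 expects rvest >= 1.0.0 (html_table() without 'fill')")
}
```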
75 changes: 36 additions & 39 deletions R/table_scrap.R
@@ -8,7 +8,6 @@
 #' @param choose an integer indicating which table to scrape
 #' @param header do you want the first line to be the leader (default to TRUE)
 #' @param askRobot logical. Should the function ask the robots.txt if we're allowed or not to scrape the web page ? Default is FALSE.
-#' @param fill logical. Should be set to TRUE when the table has an inconsistent number of columns.
 #' @return a data frame object.
 #' @examples \donttest{
 #' # Extracting premier ligue 2019/2020 top scorers
@@ -30,60 +29,58 @@
 table_scrap <- function(link,
                         choose = 1,
                         header = TRUE,
-                        fill = FALSE,
                         askRobot = FALSE) {

   if (missing(link)) {
     stop("'link' is a mandatory parameter")
   }

   if (!is.character(link)) {
     stop("'link' parameter must be provided as a character string")
   }

   if (!is.numeric(choose)) {
     stop(paste0("the 'choose' parameter must be provided as numeric not as "),
          typeof(choose))
   }

   ############################## Ask robot part ###################################################

   if (askRobot) {
     if (paths_allowed(link) == TRUE) {
       message(green("the robot.txt doesn't prohibit scraping this web page"))
     } else {
       message(bgRed(
         "WARNING: the robot.txt doesn't allow scraping this web page"
       ))
     }
   }

   #################################################################################################

   tryCatch(
     expr = {
       table <- link %>%
         read_html() %>%
-        html_table(header, fill = fill)
+        html_table(header)

       chosen_table <- table[[choose]]

       return(chosen_table)
     },
     error = function(cond) {

       if (!has_internet()) {

@@ -93,18 +90,18 @@ error = function(cond){
       } else if (grepl("current working directory", cond) || grepl("HTTP error 404", cond)) {

         message(paste0("The URL doesn't seem to be a valid one: ", link))

         message(paste0("Here the original error message: ", cond))

         return(NA)

       } else if (grepl("subscript out of bounds", cond)) {

         message(
           "Are you sure that your web page contains more than one HTML table ?"
         )

         message(paste0("Here the original error message: ", cond))

@@ -117,6 +114,6 @@ error = function(cond){
         return(NA)

       }
     }
   )}
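
The one behavioural change in this diff comes from rvest's rewritten table parser: since rvest 1.0.0, `html_table()` fills ragged tables automatically and its `fill` argument is deprecated and ignored, which is why both the parameter and the argument are dropped here. A minimal before/after sketch, assuming rvest >= 1.0.0 is installed (the Wikipedia URL is just an illustration, not from the package):

```r
library(rvest)

# illustrative page containing HTML tables
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population")

# rvest < 1.0.0: ragged tables needed an explicit fill = TRUE
# tables <- html_table(page, header = TRUE, fill = TRUE)

# rvest >= 1.0.0: cells are filled automatically; passing fill only
# raises a deprecation warning, hence the argument is removed above
tables <- html_table(page, header = TRUE)

str(tables[[1]])
```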
34 changes: 17 additions & 17 deletions README.Rmd
@@ -51,20 +51,21 @@ devtools::install_github("feddelegrand7/ralger")
 ```
 ## `scrap()`

-This is an example which shows how to extract [top ranked universities' names](http://www.shanghairanking.com/ARWU2020.html) according to the ShanghaiRanking Consultancy:
+This is an example which shows how to extract [top ranked universities' names](http://www.shanghairanking.com/rankings/arwu/2021) according to the ShanghaiRanking Consultancy:


 ```{r example}
 library(ralger)

-my_link <- "http://www.shanghairanking.com/ARWU2020.html"
+my_link <- "http://www.shanghairanking.com/rankings/arwu/2021"

-my_node <- "#UniversityRanking a" # The element ID, I recommend SelectorGadget if you're not familiar with CSS selectors
+my_node <- "a span" # The element ID, I recommend SelectorGadget if you're not familiar with CSS selectors

-best_uni <- scrap(link = my_link, node = my_node)
+clean <- TRUE # Should the function clean the extracted vector or not? Default is FALSE

-head(best_uni, 10)
+best_uni <- scrap(link = my_link, node = my_node, clean = clean)
+
+head(best_uni, 10)
 ```
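
Beyond the URL swap, the chunk above now demonstrates the `clean` argument of `scrap()`. A hedged sketch of the before/after call, taking the semantics from the comment in the diff (whether the extracted character vector gets cleaned; the exact cleaning rules live in the package):

```r
library(ralger)

link <- "http://www.shanghairanking.com/rankings/arwu/2021"
node <- "a span"

raw_vec   <- scrap(link = link, node = node)               # clean defaults to FALSE
clean_vec <- scrap(link = link, node = node, clean = TRUE) # cleaned extraction

# comparing the two results shows what the cleaning step changed
setdiff(raw_vec, clean_vec)
```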

@@ -88,27 +89,27 @@ head(scrap(links, node), 10) # printing the first 10 speakers

 ## `attribute_scrap()`

 If you need to scrape some elements' attributes, you can use the `attribute_scrap()` function as in the following example:


 ```{r}
 # Getting all classes' names from the anchor elements
 # from the ropensci website

 attributes <- attribute_scrap(link = "https://ropensci.org/",
                               node = "a", # the a tag
                               attr = "class" # getting the class attribute
                               )

 head(attributes, 10) # NA values are a tags without a class attribute
 ```

 Another example: let's say we want to get all JavaScript dependencies within the same web page:

 ```{r}

 js_depend <- attribute_scrap(link = "https://ropensci.org/",
                              node = "script",
                              attr = "src")

 js_depend
@@ -282,21 +283,21 @@ images_scrap(link = "https://rstudio.com/",
 ```


 # Accessibility related functions


 ## `images_noalt_scrap()`


 `images_noalt_scrap()` can be used to get the images within a specific web page that don't have an `alt` attribute, which can be annoying for people using a screen reader:


 ```{r}

 images_noalt_scrap(link = "https://www.r-consortium.org/")

 ```
 If no images without `alt` attributes are found, the function returns `NULL` and displays an indication message:


 ```{r}
 images_noalt_scrap(link = "https://webaim.org/techniques/forms/controls")
 ```
@@ -310,4 +311,3 @@
 ## Code of Conduct

 Please note that the ralger project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
-
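
To round off, a hedged end-to-end sketch of the function this PR modifies, using the `choose` and `header` parameters documented in the diff above; the URL is an illustrative placeholder, not from the package docs:

```r
library(ralger)

top_films <- table_scrap(
  link = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
  choose = 1,    # which HTML table on the page to keep
  header = TRUE  # treat the first row as column names
)
# no fill argument anymore: rvest >= 1.0.0 fills ragged tables itself

head(top_films)
```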