diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 0000000..8b13789 --- /dev/null +++ b/.nojekyll @@ -0,0 +1 @@ + diff --git a/404.html b/404.html new file mode 100644 index 0000000..bd363e1 --- /dev/null +++ b/404.html @@ -0,0 +1,81 @@ + + +
+ + + + +.github/CODE_OF_CONDUCT.md
+ We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, caste, color, religion, or sexual identity and orientation.
+We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
+Examples of behavior that contributes to a positive environment for our community include:
+Examples of unacceptable behavior include:
+Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
+Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
+This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at mrowlan1@gmail.com. All complaints will be reviewed and investigated promptly and fairly.
+All community leaders are obligated to respect the privacy and security of the reporter of any incident.
+Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
+Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
+Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
+Community Impact: A violation through a single incident or series of actions.
+Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
+Community Impact: A serious violation of community standards, including sustained inappropriate behavior.
+Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
+Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
+Consequence: A permanent ban from any sort of public interaction within the community.
+This Code of Conduct is adapted from the Contributor Covenant, version 2.1, available at https://www.contributor-covenant.org/version/2/1/code_of_conduct.html.
+Community Impact Guidelines were inspired by Mozilla’s code of conduct enforcement ladder, available at https://github.com/mozilla/inclusion.
+For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
+vignettes/intro.Rmd
+ intro.Rmd
The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It includes both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest.
+Includes:
+gutenberg_download() downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
+gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.
+gutenberg_authors contains information about each author, such as aliases and birth/death year.
+gutenberg_subjects contains pairings of works with Library of Congress subjects and topics.
+This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.
+The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.:
+library(gutenbergr)
+library(dplyr)
+gutenberg_metadata
+#> # A tibble: 75,473 × 8
+#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
+#> <int> <chr> <chr> <int> <chr> <chr>
+#> 1 1 "The De… Jeffe… 1638 en "Politics/American…
+#> 2 2 "The Un… Unite… 1 en "Politics/American…
+#> 3 3 "John F… Kenne… 1666 en ""
+#> 4 4 "Lincol… Linco… 3 en "US Civil War"
+#> 5 5 "The Un… Unite… 1 en "United States/Pol…
+#> 6 6 "Give M… Henry… 4 en "American Revoluti…
+#> 7 7 "The Ma… NA NA en ""
+#> 8 8 "Abraha… Linco… 3 en "US Civil War"
+#> 9 9 "Abraha… Linco… 3 en "US Civil War"
+#> 10 10 "The Ki… NA NA en "Banned Books List…
+#> # ℹ 75,463 more rows
+#> # ℹ 2 more variables: rights <chr>, has_text <lgl>
For example, you could find the Gutenberg ID(s) of Jane Austen’s Persuasion with:
+
+
+gutenberg_metadata %>%
+ filter(title == "Persuasion")
+#> # A tibble: 3 × 8
+#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
+#> <int> <chr> <chr> <int> <chr> <chr>
+#> 1 105 Persuasi… Auste… 68 en ""
+#> 2 22963 Persuasi… Auste… 68 en ""
+#> 3 36777 Persuasi… Auste… 68 fr "FR Littérature"
+#> # ℹ 2 more variables: rights <chr>, has_text <lgl>
In many analyses, you may want to keep only English works, avoid duplicates, and include only books whose text can be downloaded. The gutenberg_works() function does this pre-filtering:
+gutenberg_works()
+#> # A tibble: 59,146 × 8
+#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
+#> <int> <chr> <chr> <int> <chr> <chr>
+#> 1 1 "The De… Jeffe… 1638 en "Politics/American…
+#> 2 2 "The Un… Unite… 1 en "Politics/American…
+#> 3 3 "John F… Kenne… 1666 en ""
+#> 4 4 "Lincol… Linco… 3 en "US Civil War"
+#> 5 5 "The Un… Unite… 1 en "United States/Pol…
+#> 6 6 "Give M… Henry… 4 en "American Revoluti…
+#> 7 7 "The Ma… NA NA en ""
+#> 8 8 "Abraha… Linco… 3 en "US Civil War"
+#> 9 9 "Abraha… Linco… 3 en "US Civil War"
+#> 10 10 "The Ki… NA NA en "Banned Books List…
+#> # ℹ 59,136 more rows
+#> # ℹ 2 more variables: rights <chr>, has_text <lgl>
It also accepts filtering conditions as arguments:
+
+gutenberg_works(author == "Austen, Jane")
+#> # A tibble: 12 × 8
+#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
+#> <int> <chr> <chr> <int> <chr> <chr>
+#> 1 105 "Persua… Auste… 68 en ""
+#> 2 121 "Northa… Auste… 68 en "Gothic Fiction"
+#> 3 141 "Mansfi… Auste… 68 en ""
+#> 4 158 "Emma" Auste… 68 en ""
+#> 5 161 "Sense … Auste… 68 en ""
+#> 6 946 "Lady S… Auste… 68 en ""
+#> 7 1212 "Love a… Auste… 68 en ""
+#> 8 1342 "Pride … Auste… 68 en "Best Books Ever L…
+#> 9 31100 "The Co… Auste… 68 en ""
+#> 10 37431 "Pride … Auste… 68 en ""
+#> 11 42078 "The Le… Auste… 68 en ""
+#> 12 63569 "The Wa… Auste… 68 en ""
+#> # ℹ 2 more variables: rights <chr>, has_text <lgl>
+
+# or with a regular expression
+
+library(stringr)
+gutenberg_works(str_detect(author, "Austen"))
+#> # A tibble: 22 × 8
+#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
+#> <int> <chr> <chr> <int> <chr> <chr>
+#> 1 105 Persuas… Auste… 68 en ""
+#> 2 121 Northan… Auste… 68 en "Gothic Fiction"
+#> 3 141 Mansfie… Auste… 68 en ""
+#> 4 158 Emma Auste… 68 en ""
+#> 5 161 Sense a… Auste… 68 en ""
+#> 6 946 Lady Su… Auste… 68 en ""
+#> 7 1212 Love an… Auste… 68 en ""
+#> 8 1342 Pride a… Auste… 68 en "Best Books Ever L…
+#> 9 17797 Memoir … Auste… 7603 en ""
+#> 10 22536 Jane Au… Auste… 25392 en ""
+#> # ℹ 12 more rows
+#> # ℹ 2 more variables: rights <chr>, has_text <lgl>
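The pre-filtering described above (English only, no duplicates, text available) can be approximated by hand with dplyr. The following is only a sketch on a toy tibble, not the package's actual implementation: the IDs, author ID, and languages mirror the Persuasion rows shown earlier, while the `has_text` values and the choice of `title` plus `gutenberg_author_id` as the duplicate key are assumptions made for illustration.

```r
library(dplyr)

# Toy stand-in for gutenberg_metadata. IDs, author ID, and languages mirror the
# "Persuasion" rows shown earlier; the has_text values are made up.
toy_metadata <- tibble(
  gutenberg_id = c(105L, 22963L, 36777L),
  title = c("Persuasion", "Persuasion", "Persuasion"),
  gutenberg_author_id = c(68L, 68L, 68L),
  language = c("en", "en", "fr"),
  has_text = c(TRUE, TRUE, TRUE)
)

toy_works <- toy_metadata %>%
  # keep English works whose text can be downloaded
  filter(language == "en", has_text) %>%
  # keep the first row for each title/author pair (assumed duplicate key)
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

toy_works
```

On this toy input only the row for ID 105 survives: the French edition is dropped by the language filter and the second English edition by the duplicate check.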
The metadata currently in the package was last updated on 29 November 2023.
+The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that one version of Persuasion has ID 105, so gutenberg_download(105) downloads this text.
+persuasion <- gutenberg_download(105)
+persuasion
+#> # A tibble: 8,328 × 2
+#> gutenberg_id text
+#> <int> <chr>
+#> 1 105 "Persuasion"
+#> 2 105 ""
+#> 3 105 ""
+#> 4 105 "by"
+#> 5 105 ""
+#> 6 105 "Jane Austen"
+#> 7 105 ""
+#> 8 105 "(1818)"
+#> 9 105 ""
+#> 10 105 ""
+#> # ℹ 8,318 more rows
Notice that the result is a tbl_df (a type of data frame) with two variables: gutenberg_id (useful if multiple books are returned) and text, a character vector with one row per line of the book.
You can also pass gutenberg_download() a vector of IDs to download multiple books. For example, to download Renascence, and Other Poems (book 109) along with Persuasion, do:
+books <- gutenberg_download(c(109, 105), meta_fields = "title")
+books
+#> # A tibble: 9,550 × 3
+#> gutenberg_id text title
+#> <int> <chr> <chr>
+#> 1 105 "Persuasion" Persuasion
+#> 2 105 "" Persuasion
+#> 3 105 "" Persuasion
+#> 4 105 "by" Persuasion
+#> 5 105 "" Persuasion
+#> 6 105 "Jane Austen" Persuasion
+#> 7 105 "" Persuasion
+#> 8 105 "(1818)" Persuasion
+#> 9 105 "" Persuasion
+#> 10 105 "" Persuasion
+#> # ℹ 9,540 more rows
Notice that the meta_fields argument allows us to add one or more additional fields from the gutenberg_metadata dataset to the downloaded text, such as title or author.
You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. “lcc” means Library of Congress Classification, while “lcsh” means Library of Congress subject headings:
+gutenberg_subjects
+#> # A tibble: 241,189 × 3
+#> gutenberg_id subject_type subject
+#> <int> <chr> <chr>
+#> 1 1 lcsh United States -- History -- Revolution, 1775-1783 …
+#> 2 1 lcsh United States. Declaration of Independence
+#> 3 1 lcc E201
+#> 4 1 lcc JK
+#> 5 2 lcsh Civil rights -- United States -- Sources
+#> 6 2 lcsh United States. Constitution. 1st-10th Amendments
+#> 7 2 lcc JK
+#> 8 2 lcc KF
+#> 9 3 lcsh United States -- Foreign relations -- 1961-1963
+#> 10 3 lcsh Presidents -- United States -- Inaugural addresses
+#> # ℹ 241,179 more rows
This is useful for extracting texts on a particular topic or genre, such as detective stories, or about a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.
+gutenberg_subjects %>%
+ filter(subject == "Detective and mystery stories")
+#> # A tibble: 843 × 3
+#> gutenberg_id subject_type subject
+#> <int> <chr> <chr>
+#> 1 170 lcsh Detective and mystery stories
+#> 2 173 lcsh Detective and mystery stories
+#> 3 244 lcsh Detective and mystery stories
+#> 4 305 lcsh Detective and mystery stories
+#> 5 330 lcsh Detective and mystery stories
+#> 6 481 lcsh Detective and mystery stories
+#> 7 547 lcsh Detective and mystery stories
+#> 8 863 lcsh Detective and mystery stories
+#> 9 905 lcsh Detective and mystery stories
+#> 10 1155 lcsh Detective and mystery stories
+#> # ℹ 833 more rows
+
+gutenberg_subjects %>%
+ filter(grepl("Holmes, Sherlock", subject))
+#> # A tibble: 55 × 3
+#> gutenberg_id subject_type subject
+#> <int> <chr> <chr>
+#> 1 108 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 2 221 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 3 244 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 4 834 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 5 1661 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 6 2097 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 7 2343 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 8 2344 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 9 2345 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> 10 2346 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
+#> # ℹ 45 more rows
gutenberg_authors contains information about each author, such as aliases and birth/death year:
+gutenberg_authors
+#> # A tibble: 24,901 × 7
+#> gutenberg_author_id author alias birthdate deathdate wikipedia aliases
+#> <int> <chr> <chr> <int> <int> <chr> <chr>
+#> 1 1 United States U.S.… NA NA https://… U.S.A.
+#> 2 3 Lincoln, Abr… NA 1809 1865 https://… United…
+#> 3 4 Henry, Patri… NA 1736 1799 https://… NA
+#> 4 5 Adam, Paul NA 1849 1931 https://… NA
+#> 5 7 Carroll, Lew… Dodg… 1832 1898 https://… Dodgso…
+#> 6 8 United State… NA NA NA https://… Agency…
+#> 7 9 Melville, He… Melv… 1819 1891 https://… Melvil…
+#> 8 10 Barrie, J. M… NA 1860 1937 https://… Barrie…
+#> 9 11 Church of Je… NA NA NA https://… NA
+#> 10 12 Smith, Josep… Smit… 1805 1844 https://… Smith,…
+#> # ℹ 24,891 more rows
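The author table can be combined with the works metadata through gutenberg_author_id. The sketch below, under the assumption that the birthdate column holds a birth year as the printed output suggests, finds works by authors born in an arbitrarily chosen decade; the name `works_1770s` and the year range are illustrative only.

```r
library(gutenbergr)
library(dplyr)

works_1770s <- gutenberg_authors %>%
  # authors born between 1770 and 1779 (rows with a missing birthdate are dropped)
  filter(birthdate >= 1770, birthdate <= 1779) %>%
  # drop the author name here to avoid a duplicate column after the join
  select(gutenberg_author_id, birthdate, deathdate) %>%
  inner_join(gutenberg_works(), by = "gutenberg_author_id") %>%
  select(gutenberg_id, title, author, birthdate, deathdate)

works_1770s
```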
What’s next after retrieving a book’s text? Having the book as a data frame is especially useful for text analysis with the tidytext package.
+
+library(tidytext)
+
+words <- books %>%
+ unnest_tokens(word, text)
+
+words
+#> # A tibble: 90,532 × 3
+#> gutenberg_id title word
+#> <int> <chr> <chr>
+#> 1 105 Persuasion persuasion
+#> 2 105 Persuasion by
+#> 3 105 Persuasion jane
+#> 4 105 Persuasion austen
+#> 5 105 Persuasion 1818
+#> 6 105 Persuasion chapter
+#> 7 105 Persuasion 1
+#> 8 105 Persuasion sir
+#> 9 105 Persuasion walter
+#> 10 105 Persuasion elliot
+#> # ℹ 90,522 more rows
+
+word_counts <- words %>%
+ anti_join(stop_words, by = "word") %>%
+ count(title, word, sort = TRUE)
+
+word_counts
+#> # A tibble: 6,620 × 3
+#> title word n
+#> <chr> <chr> <int>
+#> 1 Persuasion anne 447
+#> 2 Persuasion captain 303
+#> 3 Persuasion elliot 254
+#> 4 Persuasion lady 214
+#> 5 Persuasion wentworth 191
+#> 6 Persuasion charles 155
+#> 7 Persuasion time 152
+#> 8 Persuasion sir 149
+#> 9 Persuasion miss 125
+#> 10 Persuasion walter 123
+#> # ℹ 6,610 more rows
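One possible next step after counting words per title is to weight them by tf-idf, so that words shared by every book count for less. This is a sketch using tidytext's bind_tf_idf() on a toy tibble with the same columns as `books`; the titles and text lines are invented so that the example runs without a download.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for `books`: same columns, but the text lines are invented.
toy_books <- tibble(
  gutenberg_id = c(1L, 1L, 2L, 2L),
  title = c("Book A", "Book A", "Book B", "Book B"),
  text = c("the sea and the ship", "the ship sails",
           "the garden and the house", "the house stands")
)

toy_tf_idf <- toy_books %>%
  unnest_tokens(word, text) %>%          # one row per word
  count(title, word, sort = TRUE) %>%    # word counts per title
  bind_tf_idf(word, title, n) %>%        # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

toy_tf_idf
```

Words that occur in both toy books, such as "the", receive a tf-idf of zero, while words unique to one book rank highest.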
You may also find these resources useful:
+The wikipedia column in gutenberg_authors can be matched to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package
+format_reverse function for reversing “Last, First” names).