---
title: "Lauren's sandpit"
output: html_notebook
---

This notebook is just somewhere for Lauren to try stuff out.

```{r setup, include = FALSE}
library(tidyverse)
library(lubridate)
articles <- read_csv("data/master.csv")
subjects <- read_csv("data/subjects.csv")
```

Looking at the number of news mentions over time. First join `mentions` to `articles_since_publication`, then group by DOI and count the news mentions.

```{r news mentions with time}
# Join mentions onto articles (no `by` is given, so dplyr joins on all columns the two tables share)
article_mentions_since_publication <- left_join(articles_since_publication, mentions)

# Number of news mentions per article title
article_mentions_since_publication %>%
  filter(source == "news") %>%
  group_by(article_title) %>%
  summarise(total = n())

# News mentions for articles published after 2013-01-01,
# with each article's age in days as of 2018-07-13
news_mentions_since_publication <- article_mentions_since_publication %>%
  filter(source == "news") %>%
  group_by(doi) %>%
  filter(print_publication_date > "2013-01-01") %>%
  mutate(days_since_publication = ymd("2018-07-13") - as_date(print_publication_date))

# Total news mentions per DOI
news_mentions_aggregated <- news_mentions_since_publication %>%
  group_by(doi, days_since_publication) %>%
  summarise(total = n())

ggplot(data = news_mentions_aggregated) +
  geom_point(mapping = aes(x = days_since_publication, y = total))
```
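
As a quick check of the join-then-count logic above, here is a minimal sketch on toy data. Everything in it is invented for illustration: the `toy_*` tables, the DOIs, the dates, and the assumption that `doi` is the join key.

```{r}
library(tidyverse)
library(lubridate)

# Toy stand-ins for articles_since_publication and mentions (all values invented)
toy_articles <- tibble(
  doi = c("10.1000/a", "10.1000/b", "10.1000/c"),
  print_publication_date = as_date(c("2014-03-01", "2016-06-15", "2012-11-30"))
)
toy_mentions <- tibble(
  doi = c("10.1000/a", "10.1000/a", "10.1000/b", "10.1000/c", "10.1000/a"),
  source = c("news", "news", "news", "news", "blogs")
)

toy_news_totals <- toy_articles %>%
  left_join(toy_mentions, by = "doi") %>%  # explicit key; doi is an assumption
  filter(source == "news", print_publication_date > as_date("2013-01-01")) %>%
  mutate(
    # Article age in days on the same fixed reference date used above
    days_since_publication = as.numeric(as_date("2018-07-13") - print_publication_date)
  ) %>%
  count(doi, days_since_publication, name = "total")

toy_news_totals
```

Only the two articles published after 2013-01-01 should remain, and the blog mention should not be counted.
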
# average time to first blog mention in open versus closed articles

The code below keeps only the blog mentions, groups them by article, and takes `min(date_posted)` to find the earliest blog mention. It is a direct copy of the average time to first policy mention code, with 'blogs' replacing 'policy'.

It then subtracts the date of publication from the first blog mention date to give the number of days between publication and the first mention in blogs.

Finally, it takes the mean and standard deviation of those times for open and closed papers.


```{r Average time to first blogs mention - used in paper}

# Earliest blog mention per article, and days from publication to that mention
open_closed_with_first_blogs_mention <- open_closed_with_mentions %>%
  filter(source == "blogs") %>%
  group_by(article_title, journal_title, print_publication_date, known_to_be_open) %>%
  summarise(first_blogs_pub_date = min(date_posted)) %>%
  mutate(
    days_to_first_blogs_mention = as_date(first_blogs_pub_date) - as_date(print_publication_date)
  )

# Mean and standard deviation of days to first blog mention, open versus closed
open_closed_with_first_blogs_mention %>%
  group_by(known_to_be_open) %>%
  summarise(
    mean_average_days_to_first_blogs_mention = mean(days_to_first_blogs_mention, na.rm = TRUE),
    sd_average_days_to_first_blogs_mention = sd(days_to_first_blogs_mention, na.rm = TRUE)
  )
```
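
To sanity-check the steps described above (earliest mention, then day difference, then average by open or closed), here is a small sketch on invented data. The `toy_*` names and all dates are made up, and the column names simply mirror the ones used in the chunk above.

```{r}
library(tidyverse)
library(lubridate)

# Toy blog mentions: article A is open, B and C are closed (all values invented)
toy_blog_mentions <- tibble(
  article_title = c("A", "A", "B", "C", "C"),
  known_to_be_open = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  print_publication_date = as_date(c("2014-01-01", "2014-01-01", "2014-01-01",
                                     "2015-06-01", "2015-06-01")),
  date_posted = as_date(c("2014-01-11", "2014-02-01", "2014-01-21",
                          "2015-06-11", "2015-07-01"))
)

# Earliest blog mention per article and the gap in days from publication
toy_first_blog <- toy_blog_mentions %>%
  group_by(article_title, known_to_be_open, print_publication_date) %>%
  summarise(first_blogs_pub_date = min(date_posted), .groups = "drop") %>%
  mutate(
    days_to_first_blogs_mention = as.numeric(first_blogs_pub_date - print_publication_date)
  )

# Average gap for open versus closed articles
toy_summary <- toy_first_blog %>%
  group_by(known_to_be_open) %>%
  summarise(mean_days = mean(days_to_first_blogs_mention), n_articles = n())

toy_summary
```

With these toy values the gaps are 10 days for A, 20 for B and 10 for C, so the open mean is 10 and the closed mean is 15.
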
# plot this graphically
Plot as a bar chart with the days grouped into bins (60-day bins as a first guess).
```{r}
# Histogram of days to first blog mention, in 60-day bins
ggplot(data = open_closed_with_first_blogs_mention) +
  geom_histogram(mapping = aes(x = as.numeric(days_to_first_blogs_mention)), binwidth = 60)
```
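
The 60-day binning idea can also be done by hand, which makes it easy to compare open and closed articles side by side. This is only a sketch on invented numbers: `toy_days`, `toy_binned`, `bin_start` and `articles` are hypothetical names, and the bin width of 60 is the same first guess as above.

```{r}
library(tidyverse)

# Toy days-to-first-mention values (invented)
toy_days <- tibble(
  known_to_be_open = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  days_to_first_blogs_mention = c(5, 70, 59, 60, 130)
)

# Assign each article to the start of its 60-day bin, then count per bin
toy_binned <- toy_days %>%
  mutate(bin_start = floor(days_to_first_blogs_mention / 60) * 60) %>%
  count(known_to_be_open, bin_start, name = "articles")

# Side-by-side bars for open and closed articles
toy_plot <- ggplot(data = toy_binned) +
  geom_col(
    mapping = aes(x = bin_start, y = articles, fill = known_to_be_open),
    position = "dodge"
  )

toy_plot
```
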


# looking at subjects
Subjects are loaded in the setup chunk. Join them to articles, keep just the Scopus subjects, then count the articles in each subject.


```{r}
articles_with_Subjects <- left_join(articles, subjects)

# Count articles per Scopus subject, largest first
subject_totals <- articles_with_Subjects %>%
  filter(subject_scheme == "scopus") %>%
  group_by(subject_name) %>%
  summarise(total_articles = n()) %>%
  arrange(desc(total_articles))
```
# investigating the odd articles that Scopus returned as published between 2013 and 2015 but Altmetric thought were published pre-2013
Keep only the articles with a print_publication_date before 2013.
```{r articles published pre-2013}
articles_With_dodgy_dates <- filter(articles, print_publication_date < "2013-01-01")
```

There are 106 articles with a print_publication_date earlier than 1 January 2013. A random sample shows:

- 10.1002/jid.1820: WoS and Scopus both show a publication date of Oct 2013; Altmetric shows Aug 2011. The paper was first published **online** in 2011.
- 10.1097/ncq.0b013e3182902404: WoS shows a date of Oct-Dec 2013, Scopus shows Oct 2013, and Altmetric shows 1970. The article was published online in April 2013.
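
One way to triage the rest of these articles is to flag the 1970 dates separately and measure how far the remaining dates sit from the Scopus date. The sketch below uses invented data: the DOIs and dates are made up, `scopus_publication_date` is a hypothetical column that does not exist in `articles`, and treating a 1970 date as a missing value stored as the Unix epoch is an assumption, not something confirmed here.

```{r}
library(tidyverse)
library(lubridate)

# Toy articles with an Altmetric-style date and a hypothetical Scopus date (all invented)
toy_dates <- tibble(
  doi = c("10.1000/x", "10.1000/y", "10.1000/z"),
  print_publication_date = as_date(c("2011-08-15", "1970-01-01", "2014-05-01")),
  scopus_publication_date = as_date(c("2013-10-01", "2013-10-01", "2014-05-01"))
)

toy_flagged <- toy_dates %>%
  filter(print_publication_date < as_date("2013-01-01")) %>%
  mutate(
    # Assumption: a 1970 date is a placeholder rather than a real publication date
    likely_missing = year(print_publication_date) == 1970,
    # How many days earlier the recorded date is than the Scopus date
    days_earlier_than_scopus = as.numeric(scopus_publication_date - print_publication_date)
  )

toy_flagged
```

Rows flagged as `likely_missing` could then be set aside, leaving the ones where an early online publication date is the more plausible explanation.
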