Merge pull request #4 from MediaComem/Giovanni1085-patch-1
Giovanni1085 patch 1
gdozot2 authored Nov 15, 2023
2 parents 911f80d + 17853ad commit d545e8d
Showing 2 changed files with 21 additions and 17 deletions.
28 changes: 16 additions & 12 deletions README.md

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.10136232.svg)](https://doi.org/10.5281/zenodo.10136232)

PLOS recently published an innovative [dataset of Open Science Indicators (OSI)](https://doi.org/10.6084/m9.figshare.21687686.v4), covering its entire collection plus a comparison dataset from PubMed. Here we use OSI version 4, which contains approximately 82,000 PMC and PLOS articles (74,000 of which are from PLOS). The OSI focuses primarily on three indicators: sharing of research data (in particular, data shared in data repositories), sharing of code, and posting of preprints.

The [Media Engineering Institute (MEI)](https://heig-vd.ch/en/research/mei) has collected data from the PubMed Open Access collection to equip the OSI dataset with citation data (article level) and h-index data (author level), in preparation for further analysis. The data collection pipeline is adapted from the process used in the earlier work on Data Availability Statements, described below.
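As a purely illustrative example of that further analysis, the citation and h-index export produced by this repository (`dataset/exports/export_plos.csv`, see below) could be joined back onto the OSI dataset. The sketch below is hypothetical: the OSI file name, the join key, and the column names are assumptions, not the actual schema of either file.

```python
import pandas as pd

# Hypothetical sketch only: the OSI file name, the join key, and the column
# names are assumptions, not the actual schema of either dataset.
osi = pd.read_csv("Open_Science_Indicators.csv")           # the PLOS OSI dataset (v4)
metrics = pd.read_csv("dataset/exports/export_plos.csv")   # citation and h-index export

# Join on a shared article identifier (assumed here to be a DOI column).
combined = osi.merge(metrics, on="doi", how="left")
print(combined.head())
```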

## Code and data

* We start from the OSI dataset and the [PubMed Central Open Access collection](https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist). Our goal is to extract a CSV file containing citation data and h-index data for every article in OSI, calculated from PubMed OA.
* See the [dataset folder](dataset) for more details on the steps taken:
* Detect authors in the OSI dataset.
* Collect all citations given from any article in PubMed OA to any OSI article, using known identifiers contained in the lists of references.
* Calculate citation counts for 1, 2, and 3 years after the publication of all OSI articles, using month-level precision (e.g., for an article published in June 2019, a 2-year citation window comprises all citations from citing articles published up to and including June 2021). Furthermore, calculate the author-level h-index based on the same data (a short sketch of both calculations follows this list).
* Compute the h-index and timed citation indicators as a dataset that can be joined with the OSI dataset.
* Develop and run satisfactory tests to ensure the correctness of results. In `dataset/dev_set`, some articles are added to the previous ones to validate the citation and h_index calculations.
* The source code has been updated to the latest Python and package releases where necessary.
* To validate the code, please refer to the [testing procedure](test.md).
* The final result can be found in [dataset/exports/export_plos.csv](dataset/exports/export_plos.csv).
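The following is a minimal sketch of the month-precision citation window and the h-index calculation described in the list above. It is illustrative only: the function names and example values are invented for this README and are not taken from the repository's code.

```python
from datetime import date
from typing import List

def months_between(start: date, end: date) -> int:
    """Number of whole months from start to end (month-level precision)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def citation_count(pub_date: date, citing_dates: List[date], window_years: int) -> int:
    """Citations received within `window_years` of publication, at month precision.

    For an article published in June 2019, a 2-year window keeps citations
    from citing articles published up to and including June 2021.
    """
    return sum(
        1 for d in citing_dates
        if 0 <= months_between(pub_date, d) <= window_years * 12
    )

def h_index(citation_counts: List[int]) -> int:
    """Largest h such that the author has h papers with at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citation_counts, reverse=True), start=1):
        if c >= i:
            h = i
        else:
            break
    return h

# Invented example values, for illustration only.
cites = [date(2019, 11, 1), date(2020, 6, 1), date(2021, 6, 15), date(2022, 1, 1)]
print(citation_count(date(2019, 6, 10), cites, window_years=2))  # -> 3
print(h_index([10, 4, 3, 1, 0]))                                 # -> 3
```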

# Original work
This repository is a fork of previous work that can be found here:

[![DOI](https://zenodo.org/badge/180121200.svg)](https://zenodo.org/badge/latestdoi/180121200)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/alan-turing-institute/das-public/master?filepath=notebooks%2FDescriptiveFigures.ipynb)
The original code is mentioned in the following papers:

* 📃 Preprint: https://arxiv.org/abs/1907.02565.
* 📝 Peer-reviewed publication: https://doi.org/10.1371/journal.pone.0230416

Blogs and talks:
* "A selfish reason to share research data": https://www.turing.ac.uk/blog/selfish-reason-share-research-data

## Report issues

Please add an issue or notify the authors should you find any errors to correct or improvements to make.
Well-documented pull requests are particularly appreciated.

## How to cite
10 changes: 5 additions & 5 deletions dataset/README.md

## Instructions

1. Download the Pubmed OA collection, e.g. via their FTP service: https://www.ncbi.nlm.nih.gov/pmc/tools/ftp. For testing, you can use the data in the [dev set folder](dev_set).
2. Set up a MongoDB and update the [config file](config/config.conf), or run `docker compose up` with the current config (a quick connectivity check is sketched after these steps).
3. Uncompress `PLOS_Dataset_Classification.zip` in the config folder, then move the folder contents into the current folder.
4. Run the [parser_main.py](parser_main.py) script, which will create the first collection of articles in Mongo.
5. Run the [calculate_stats.py](calculate_stats.py) script, which will calculate citation counts for articles and authors and create the corresponding collections in Mongo.
6. Run the [calculate_h_index.py](calculate_h_index.py) script, which will update the `h_indexes` elements of each document with the result of the h_index calculation.
7. Run the [get_export.py](get_export.py) script, which will create the first export of the dataset in the [exports folder](exports).
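Before running the scripts in steps 4 to 7, it can help to confirm that the MongoDB instance from step 2 is reachable. The snippet below is a minimal sketch (not part of the repository); the connection URI and database name are placeholders, and the real values come from the [config file](config/config.conf).

```python
from pymongo import MongoClient

# Placeholder URI and database name; the actual values live in config/config.conf.
client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)
client.admin.command("ping")        # raises an exception if the server is unreachable
db = client["pubmed_oa"]            # hypothetical database name
print(db.list_collection_names())   # e.g. the article collection created by parser_main.py
```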

## Requirements

See [requirements](../requirements.txt).
