Merge pull request #387 from GateNLP/dev
Add django command for extracting annotations
twinkarma authored Oct 2, 2023
2 parents f9a1a99 + 9692df0 commit c05f692
Showing 36 changed files with 2,144 additions and 290 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/create-release.yml
@@ -35,7 +35,7 @@ jobs:
- name: Create release artifacts
run: |
sed "s/DEFAULT_IMAGE_TAG=latest/DEFAULT_IMAGE_TAG=${GITHUB_REF_NAME#v}/" install/get-teamware.sh > ./get-teamware.sh
tar cvzf install.tar.gz README.md docker-compose*.yml generate-docker-env.sh create-django-db.sh nginx custom-policies Caddyfile
tar cvzf install.tar.gz README.md docker-compose*.yml generate-docker-env.sh create-django-db.sh nginx custom-policies Caddyfile backup_manual.sh backup_restore.sh
- name: Create release
uses: softprops/action-gh-release@v1
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,28 @@

### Fixed

In versions 0.2.0 to 2.1.0 inclusive, the default `docker-compose.yml` file fails to back up the database, due to a mismatch between the version of the database server and the version of the backup client. This is now fixed, but in order to create a proper database backup before attempting to upgrade, you will need to manually edit your `docker-compose.yml` file and change

```yaml
pgbackups:
image: prodrigestivill/postgres-backup-local:12
```
to
```yaml
pgbackups:
image: prodrigestivill/postgres-backup-local:14
```
(change the "12" to "14"), then run `docker compose up -d` (or `docker-compose up -d`) again to upgrade just the backup tool. Once the correct backup tool is running you can start an immediate backup using

```
docker compose run --rm -it pgbackups /backup.sh
```
(or `docker-compose` if your version of Docker does not support compose v2).

## [2.1.0] 2023-05-03
### Added
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -26,7 +26,7 @@ identifiers:
- description: The collection of archived snapshots of all versions of GATE Teamware
type: doi
value: 10.5281/zenodo.7821718
value: 10.5281/zenodo.7899193
keywords:
- NLP
- machine learning
16 changes: 13 additions & 3 deletions README.md
@@ -2,13 +2,13 @@

![](/frontend/public/static/img/gate-teamware-logo.svg "GATE Teamware")

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7821718.svg)](https://doi.org/10.5281/zenodo.7821718)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7899193.svg)](https://doi.org/10.5281/zenodo.7899193)

A web application for collaborative document annotation.

Full documentation can be [found here][docs].

GATE teamware provides a flexible web app platform for managing classification of documents by human annotators.
GATE Teamware provides a flexible web app platform for managing classification of documents by human annotators.

## Key Features
* Configure annotation options using a highly flexible JSON config.
@@ -37,6 +37,16 @@ bash ./get-teamware.sh

[A Helm chart](https://github.com/GateNLP/charts/tree/main/gate-teamware) is also available to allow deployment on Kubernetes.

### Upgrading

**When upgrading GATE Teamware it is strongly recommended to ensure you have a recent backup of your database before starting the upgrade procedure.** Database schema changes should be applied automatically as part of the upgrade, but unexpected errors may cause data corruption. **Always** take a backup before starting any significant changes to your database, so that you can roll back in the event of failure.

Check the [changelog](CHANGELOG.md): any breaking changes and special considerations for upgrades to particular versions will be documented there.

To upgrade a GATE Teamware installation that was installed using `get-teamware.sh`, simply download and run the latest version of the script in the same folder. It will detect your existing configuration and prompt you for any new settings introduced in the new version. Note that any manual changes you have made to `docker-compose.yml` and other files will not be carried over automatically; you will need to port them to the new files by hand.

Upgrading a Kubernetes deployment generally consists simply of installing the new chart version with `helm upgrade`. As above, check the GATE Teamware changelog and the [chart readme](https://github.com/GateNLP/charts/tree/main/gate-teamware) for any special considerations, new or changed configuration values, etc., and ensure you have a recent database backup before starting the upgrade process.

## Building locally
Follow these steps to run the app on your local machine using `docker-compose`:
1. Clone this repository by running `git clone https://github.com/GateNLP/gate-teamware.git` and move into the `gate-teamware` directory.
@@ -63,7 +73,7 @@ Teamware is developed by the [GATE](https://gate.ac.uk) team, an academic research
## Citation
For published work that has used Teamware, please cite this repository. One way is to include a citation such as:

> Karmakharm, T., Wilby, D., Roberts, I., & Bontcheva, K. (2022). GATE Teamware (Version 0.1.4) [Computer software]. https://github.com/GateNLP/gate-teamware
> Karmakharm, T., Wilby, D., Roberts, I., & Bontcheva, K. (2022). GATE Teamware (Version 2.1.0) [Computer software]. https://github.com/GateNLP/gate-teamware
Please use the `Cite this repository` button at the top of the [project's GitHub repository](https://github.com/GATENLP/gate-teamware) to get an up-to-date citation.

58 changes: 58 additions & 0 deletions backend/management/commands/download_annotations.py
@@ -0,0 +1,58 @@
import argparse

from django.core.management.base import BaseCommand

from backend.views import DownloadAnnotationsView


class Command(BaseCommand):

    help = "Download annotation data"

    def add_arguments(self, parser):
        parser.add_argument("output_path", type=str, help="Path of the output file")
        parser.add_argument("project_id", type=str, help="ID of the project")
        parser.add_argument("doc_type", type=str, help="Document type: all, training, test or annotation")
        parser.add_argument("export_type", type=str, help="Type of export: json, jsonl or csv")
        parser.add_argument("anonymize", type=self.str2bool, help="Whether the data should be anonymized")
        parser.add_argument("-j", "--json_format", type=str, help="Type of JSON format: raw (default) or gate")
        parser.add_argument("-n", "--num_entries_per_file", type=int,
                            help="Number of entries to generate per file, default 500")

    def handle(self, *args, **options):
        annotations_downloader = DownloadAnnotationsView()

        output_path = options["output_path"]
        project_id = options["project_id"]
        doc_type = options["doc_type"]
        export_type = options["export_type"]
        anonymize = options["anonymize"]
        json_format = options["json_format"] if options["json_format"] else "raw"
        num_entries_per_file = options["num_entries_per_file"] if options["num_entries_per_file"] else 500

        print(f"Writing annotations to {output_path}\n"
              f" Project: {project_id}\n"
              f" Document type: {doc_type}\n"
              f" Export type: {export_type}\n"
              f" Anonymized: {anonymize}\n"
              f" JSON format: {json_format}\n"
              f" Num entries per file: {num_entries_per_file}\n")

        with open(output_path, "wb") as z:
            annotations_downloader.write_zip_to_file(file_stream=z,
                                                     project_id=project_id,
                                                     doc_type=doc_type,
                                                     export_type=export_type,
                                                     json_format=json_format,
                                                     anonymize=anonymize,
                                                     documents_per_file=num_entries_per_file)

    def str2bool(self, v):
        if isinstance(v, bool):
            return v
        if v.lower() in ('yes', 'true', 't', 'y', '1'):
            return True
        elif v.lower() in ('no', 'false', 'f', 'n', '0'):
            return False
        else:
            raise argparse.ArgumentTypeError('Boolean value expected.')
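For reference, the new command can be driven from the shell via `python manage.py download_annotations` or programmatically through Django's standard `call_command` API. The sketch below uses the latter; the file name, project ID and option values are illustrative, not taken from the commit:

```python
# A minimal sketch of invoking the new management command programmatically.
# Assumes a configured Django project; path, IDs and option values are illustrative.
from django.core.management import call_command

# Equivalent CLI: python manage.py download_annotations annotations.zip 1 all json true -j gate -n 200
call_command(
    "download_annotations",
    "annotations.zip",   # output_path: zip file to write
    "1",                 # project_id
    "all",               # doc_type: all, training, test or annotation
    "json",              # export_type: json, jsonl or csv
    "true",              # anonymize: parsed by str2bool above
    json_format="gate",
    num_entries_per_file=200,
)
```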


30 changes: 26 additions & 4 deletions backend/models.py
@@ -64,6 +64,13 @@ class ServiceUser(AbstractUser):
agreed_privacy_policy = models.BooleanField(default=False)
is_deleted = models.BooleanField(default=False)

def lock_user(self):
"""
Lock this user with a SELECT FOR UPDATE. This method must be called within a transaction;
the lock is released when the transaction commits or rolls back.
"""
return type(self).objects.filter(id=self.id).select_for_update().get()

@property
def has_active_project(self):
return self.annotatorproject_set.filter(status=AnnotatorProject.ACTIVE).count() > 0
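For context, here is a minimal sketch (not part of the commit) of the usage pattern the docstring describes; the `user` variable and the work done inside the critical section are assumed:

```python
from django.db import transaction

# Sketch only: lock_user() must run inside a transaction; the SELECT FOR UPDATE
# row lock is held until the enclosing atomic block commits or rolls back.
with transaction.atomic():
    locked_user = user.lock_user()  # re-fetches this user's row FOR UPDATE
    # ... critical section, e.g. checking and assigning an annotation task ...
# lock released here, when the transaction ends
```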
@@ -485,6 +492,9 @@ def reject_annotator(self, user, finished_time=timezone.now()):
annotator_project.status = AnnotatorProject.COMPLETED
annotator_project.rejected = True
annotator_project.save()

Annotation.clear_all_pending_user_annotations(user)

except ObjectDoesNotExist:
raise Exception(f"User {user.username} is not an annotator of the project.")

@@ -589,6 +599,9 @@ def get_annotator_task(self, user):
user from the annotator list if there are no more tasks or the user has reached their quota.
"""

# Lock required to prevent concurrent calls from assigning two different tasks
# to the same user
user = user.lock_user()
annotation = self.get_current_annotator_task(user)
if annotation:
# User has existing task
@@ -623,7 +636,7 @@ def get_current_annotator_task(self, user):

annotation = current_annotations.first()
if annotation.document.project != self:
return RuntimeError(
raise RuntimeError(
"The annotation doesn't belong to this project! Annotator should only work on one project at a time")

return annotation
@@ -724,9 +737,18 @@ def assign_annotator_task(self, user, doc_type=DocumentType.ANNOTATION):
Annotation tasks perform an extra check for remaining annotation tasks (num_annotation_tasks_remaining);
testing and training do not do this check, as the annotator must annotate all documents.
"""
if (DocumentType.ANNOTATION and self.num_annotation_tasks_remaining > 0) or \
DocumentType.TEST or DocumentType.TRAINING:
for doc in self.documents.filter(doc_type=doc_type).order_by('?'):
if (doc_type == DocumentType.ANNOTATION and self.num_annotation_tasks_remaining > 0) or \
doc_type == DocumentType.TEST or doc_type == DocumentType.TRAINING:
if doc_type == DocumentType.TEST or doc_type == DocumentType.TRAINING:
queryset = self.documents.filter(doc_type=doc_type).order_by('?')
else:
# Prefer documents which have fewer complete or pending annotations, in order to
# spread the annotators as evenly as possible across the available documents
queryset = self.documents.filter(doc_type=doc_type).alias(
occupied_annotations=Count("annotations", filter=Q(annotations__status=Annotation.COMPLETED)
| Q(annotations__status=Annotation.PENDING))
).order_by('occupied_annotations', '?')
for doc in queryset:
# Check that annotator hasn't annotated and that
# doc hasn't been fully annotated
if doc.user_can_annotate_document(user):
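Taken on its own, the new "spread annotators evenly" ordering amounts to the queryset below; this is a sketch reusing names from the diff above (`project` stands for the `self` of `assign_annotator_task`), and note that `QuerySet.alias()` requires Django 4.0 or later:

```python
from django.db.models import Count, Q

# Documents with the fewest completed-or-pending annotations sort first;
# the trailing '?' shuffles documents within each tie group.
queryset = (project.documents
            .filter(doc_type=DocumentType.ANNOTATION)
            .alias(occupied_annotations=Count(
                "annotations",
                filter=Q(annotations__status=Annotation.COMPLETED)
                       | Q(annotations__status=Annotation.PENDING)))
            .order_by("occupied_annotations", "?"))
```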
20 changes: 20 additions & 0 deletions backend/tests/test_models.py
@@ -411,6 +411,26 @@ def test_reject_annotator(self):
self.assertEqual(AnnotatorProject.COMPLETED, annotator_project.status)
self.assertEqual(True, annotator_project.rejected)

def test_remove_annotator_clears_pending(self):
annotator = self.annotators[0]
# Start a task - should be one pending annotation
self.project.get_annotator_task(annotator)
self.assertEqual(1, annotator.annotations.filter(status=Annotation.PENDING).count())

# remove annotator from project - pending annotations should be cleared
self.project.remove_annotator(annotator)
self.assertEqual(0, annotator.annotations.filter(status=Annotation.PENDING).count())

def test_reject_annotator_clears_pending(self):
annotator = self.annotators[0]
# Start a task - should be one pending annotation
self.project.get_annotator_task(annotator)
self.assertEqual(1, annotator.annotations.filter(status=Annotation.PENDING).count())

# reject annotator from project - pending annotations should be cleared
self.project.reject_annotator(annotator)
self.assertEqual(0, annotator.annotations.filter(status=Annotation.PENDING).count())

def test_num_documents(self):
self.assertEqual(self.project.num_documents, self.num_docs)

27 changes: 23 additions & 4 deletions backend/tests/test_rpc_endpoints.py
@@ -8,6 +8,7 @@

from django.utils import timezone
import json
import logging

from backend.models import Annotation, Document, DocumentType, Project, AnnotatorProject, UserDocumentFormatPreference
from backend.rpc import create_project, update_project, add_project_document, add_document_annotation, \
@@ -28,7 +29,7 @@
from backend.tests.test_rpc_server import TestEndpoint



LOGGER = logging.getLogger(__name__)

class TestUserAuth(TestCase):

@@ -1379,7 +1380,7 @@ def setUp(self):
self.num_training_docs = 5
self.training_docs = []
for i in range(self.num_training_docs):
self.docs.append(Document.objects.create(project=self.proj,
self.training_docs.append(Document.objects.create(project=self.proj,
doc_type=DocumentType.TRAINING,
data={
"text": f"Document {i}",
@@ -1396,7 +1397,7 @@ def setUp(self):
self.num_test_docs = 10
self.test_docs = []
for i in range(self.num_test_docs):
self.docs.append(Document.objects.create(project=self.proj,
self.test_docs.append(Document.objects.create(project=self.proj,
doc_type=DocumentType.TEST,
data={
"text": f"Document {i}",
@@ -1609,10 +1610,11 @@ def test_annotations_per_doc_not_enforced_for_training_or_test(self):
self.proj.save()

docs_annotated_per_user = []
for (i, (ann_user, _)) in enumerate(self.annotators):
for (ann_user, _) in self.annotators:
# Add to project
self.assertTrue(add_project_annotator(self.manager_request, self.proj.id, ann_user.username))

for (i, (ann_user, _)) in enumerate(self.annotators):
# Every annotator should be able to complete every training document, even though
# max annotations per document is less than the total number of annotators
self.assertEqual(self.num_training_docs,
@@ -1623,6 +1625,7 @@
self.assertEqual(self.num_training_docs,
self.proj.get_annotator_document_score(ann_user, DocumentType.TRAINING))

for (i, (ann_user, _)) in enumerate(self.annotators):
# Every annotator should be able to complete every test document, even though
# max annotations per document is less than the total number of annotators
self.assertEqual(self.num_test_docs,
@@ -1633,6 +1636,7 @@
self.assertEqual(self.num_training_docs,
self.proj.get_annotator_document_score(ann_user, DocumentType.TRAINING))

for (i, (ann_user, _)) in enumerate(self.annotators):
# Now attempt to complete task normally
num_annotated = self.complete_annotations(self.num_docs, "Annotation", annotator=i)
docs_annotated_per_user.append(num_annotated)
@@ -1662,15 +1666,30 @@ def complete_annotations(self, num_annotations_to_complete, expected_doc_type_str, annotator=0, answer="positive"):

# Expect to get self.num_training_docs tasks
num_completed_tasks = 0
if expected_doc_type_str == 'Annotation':
all_docs = self.docs
elif expected_doc_type_str == 'Training':
all_docs = self.training_docs
else:
all_docs = self.test_docs

annotated_docs = {doc.pk: ' ' for doc in all_docs}
for i in range(num_annotations_to_complete):
task_context = get_annotation_task(ann_req)
if task_context:
self.assertEqual(expected_doc_type_str, task_context.get("document_type"),
f"Document type does not match in task {task_context!r}, " +
"annotator {ann.username}, document {i}")
annotated_docs[task_context['document_id']] = "\u2714"
complete_annotation_task(ann_req, task_context["annotation_id"], {"sentiment": answer})
num_completed_tasks += 1

# Draw a nice markdown table of exactly which documents each annotator was given
if annotator == 0:
LOGGER.debug("Annotator | " + (" | ".join(str(i) for i in annotated_docs.keys())))
LOGGER.debug(" | ".join(["--"] * (len(annotated_docs)+1)))
LOGGER.debug(ann.username + " | " + (" | ".join(str(v) for v in annotated_docs.values())))

return num_completed_tasks

class TestAnnotationChange(TestEndpoint):
Expand Down