Merge pull request #387 from GateNLP/dev
Add django command for extracting annotations
twinkarma authored Oct 2, 2023
2 parents f9a1a99 + 9692df0 commit c05f692
Showing 36 changed files with 2,144 additions and 290 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/create-release.yml
@@ -35,7 +35,7 @@ jobs:
- name: Create release artifacts
run: |
sed "s/DEFAULT_IMAGE_TAG=latest/DEFAULT_IMAGE_TAG=${GITHUB_REF_NAME#v}/" install/get-teamware.sh > ./get-teamware.sh
tar cvzf install.tar.gz README.md docker-compose*.yml generate-docker-env.sh create-django-db.sh nginx custom-policies Caddyfile
tar cvzf install.tar.gz README.md docker-compose*.yml generate-docker-env.sh create-django-db.sh nginx custom-policies Caddyfile backup_manual.sh backup_restore.sh
- name: Create release
uses: softprops/action-gh-release@v1
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,28 @@

### Fixed

In versions 0.2.0 to 2.1.0 inclusive, the default `docker-compose.yml` file fails to back up the database, due to a mismatch between the version of the database server and the version of the backup client. This is now fixed, but in order to create a proper database backup before attempting to upgrade, you will need to manually edit your `docker-compose.yml` file and change

```yaml
pgbackups:
image: prodrigestivill/postgres-backup-local:12
```
to
```yaml
pgbackups:
image: prodrigestivill/postgres-backup-local:14
```
(change the "12" to "14"), then run `docker compose up -d` (or `docker-compose up -d`) again to upgrade just the backup tool. Once the correct backup tool is running you can start an immediate backup using

```
docker compose run --rm -it pgbackups /backup.sh
```
(or `docker-compose` if your version of Docker does not support compose v2).

## [2.1.0] 2023-05-03
### Added
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -26,7 +26,7 @@ identifiers:
- description: The collection of archived snapshots of all versions of GATE Teamware
type: doi
value: 10.5281/zenodo.7821718
value: 10.5281/zenodo.7899193
keywords:
- NLP
- machine learning
16 changes: 13 additions & 3 deletions README.md
@@ -2,13 +2,13 @@

![](/frontend/public/static/img/gate-teamware-logo.svg "GATE Teamware")

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7821718.svg)](https://doi.org/10.5281/zenodo.7821718)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7899193.svg)](https://doi.org/10.5281/zenodo.7899193)

A web application for collaborative document annotation.

Full documentation can be [found here][docs].

GATE teamware provides a flexible web app platform for managing classification of documents by human annotators.
GATE Teamware provides a flexible web app platform for managing classification of documents by human annotators.

## Key Features
* Configure annotation options using a highly flexible JSON config.
@@ -37,6 +37,16 @@ bash ./get-teamware.sh

[A Helm chart](https://github.com/GateNLP/charts/tree/main/gate-teamware) is also available to allow deployment on Kubernetes.

### Upgrading

**When upgrading GATE Teamware it is strongly recommended to ensure you have a recent backup of your database before starting the upgrade procedure.** Database schema changes should be applied automatically as part of the upgrade, but unexpected errors may cause data corruption. **Always** take a backup before starting any significant changes to your database, so that you can roll back in the event of failure.

Check the [changelog](CHANGELOG.md): any breaking changes and special considerations for upgrades to particular versions will be documented there.

To upgrade a GATE Teamware installation that was installed using `get-teamware.sh`, simply download and run the latest version of the script in the same folder. It will detect your existing configuration and prompt you for any new settings introduced in the new version. Note that any manual changes you have made to `docker-compose.yml` and other files will not be carried over automatically; you will need to port them to the new files by hand.

Upgrading a Kubernetes deployment generally consists simply of installing the new chart version with `helm upgrade`. As above, check the GATE Teamware changelog and the [chart readme](https://github.com/GateNLP/charts/tree/main/gate-teamware) for any special considerations, new or changed configuration values, etc., and ensure you have a recent database backup before starting the upgrade process.

## Building locally
Follow these steps to run the app on your local machine using `docker-compose`:
1. Clone this repository by running `git clone https://github.com/GateNLP/gate-teamware.git` and move into the `gate-teamware` directory.
@@ -63,7 +73,7 @@ Teamware is developed by the [GATE](https://gate.ac.uk) team, an academic research
## Citation
For published work that has used Teamware, please cite this repository. One way is to include a citation such as:

> Karmakharm, T., Wilby, D., Roberts, I., & Bontcheva, K. (2022). GATE Teamware (Version 0.1.4) [Computer software]. https://github.com/GateNLP/gate-teamware
> Karmakharm, T., Wilby, D., Roberts, I., & Bontcheva, K. (2022). GATE Teamware (Version 2.1.0) [Computer software]. https://github.com/GateNLP/gate-teamware
Please use the `Cite this repository` button at the top of the [project's GitHub repository](https://github.com/GATENLP/gate-teamware) to get an up-to-date citation.

58 changes: 58 additions & 0 deletions backend/management/commands/download_annotations.py
@@ -0,0 +1,58 @@
import argparse

from django.core.management.base import BaseCommand

from backend.views import DownloadAnnotationsView


class Command(BaseCommand):

    help = "Download annotation data"

    def add_arguments(self, parser):
        parser.add_argument("output_path", type=str, help="Path of the output file")
        parser.add_argument("project_id", type=str, help="ID of the project")
        parser.add_argument("doc_type", type=str, help="Document type: all, training, test or annotation")
        parser.add_argument("export_type", type=str, help="Type of export: json, jsonl or csv")
        parser.add_argument("anonymize", type=self.str2bool, help="Whether the data should be anonymized")
        parser.add_argument("-j", "--json_format", type=str, help="Type of JSON format: raw (default) or gate")
        parser.add_argument("-n", "--num_entries_per_file", type=int,
                            help="Number of entries to generate per file, default 500")

    def handle(self, *args, **options):
        annotations_downloader = DownloadAnnotationsView()

        output_path = options["output_path"]
        project_id = options["project_id"]
        doc_type = options["doc_type"]
        export_type = options["export_type"]
        anonymize = options["anonymize"]
        json_format = options["json_format"] if options["json_format"] else "raw"
        num_entries_per_file = options["num_entries_per_file"] if options["num_entries_per_file"] else 500

        print(f"Writing annotations to {output_path}\n"
              f" Project: {project_id}\n"
              f" Document type: {doc_type}\n"
              f" Export type: {export_type}\n"
              f" Anonymized: {anonymize}\n"
              f" JSON format: {json_format}\n"
              f" Num entries per file: {num_entries_per_file}\n")

        with open(output_path, "wb") as z:
            annotations_downloader.write_zip_to_file(file_stream=z,
                                                     project_id=project_id,
                                                     doc_type=doc_type,
                                                     export_type=export_type,
                                                     json_format=json_format,
                                                     anonymize=anonymize,
                                                     documents_per_file=num_entries_per_file)

    def str2bool(self, v):
        if isinstance(v, bool):
            return v
        if v.lower() in ('yes', 'true', 't', 'y', '1'):
            return True
        elif v.lower() in ('no', 'false', 'f', 'n', '0'):
            return False
        else:
            raise argparse.ArgumentTypeError('Boolean value expected.')
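For reference, the new command can be driven from the shell via `python manage.py download_annotations` or programmatically through Django's standard `call_command` API. The sketch below uses the latter; the file name, project ID and option values are illustrative, not taken from the commit:

```python
# A minimal sketch of invoking the new management command programmatically.
# Assumes a configured Django project; path, IDs and option values are illustrative.
from django.core.management import call_command

# Equivalent CLI: python manage.py download_annotations annotations.zip 1 all json true -j gate -n 200
call_command(
    "download_annotations",
    "annotations.zip",   # output_path: zip file to write
    "1",                 # project_id
    "all",               # doc_type: all, training, test or annotation
    "json",              # export_type: json, jsonl or csv
    "true",              # anonymize: parsed by str2bool above
    json_format="gate",
    num_entries_per_file=200,
)
```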


30 changes: 26 additions & 4 deletions backend/models.py
@@ -64,6 +64,13 @@ class ServiceUser(AbstractUser):
agreed_privacy_policy = models.BooleanField(default=False)
is_deleted = models.BooleanField(default=False)

def lock_user(self):
"""
Lock this user with a SELECT FOR UPDATE. This method must be called within a transaction;
the lock is released when the transaction commits or rolls back.
"""
return type(self).objects.filter(id=self.id).select_for_update().get()

@property
def has_active_project(self):
return self.annotatorproject_set.filter(status=AnnotatorProject.ACTIVE).count() > 0
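For context, here is a minimal sketch (not part of the commit) of the usage pattern the docstring describes; the `user` variable and the work done inside the critical section are assumed:

```python
from django.db import transaction

# Sketch only: lock_user() must run inside a transaction; the SELECT FOR UPDATE
# row lock is held until the enclosing atomic block commits or rolls back.
with transaction.atomic():
    locked_user = user.lock_user()  # re-fetches this user's row FOR UPDATE
    # ... critical section, e.g. checking and assigning an annotation task ...
# lock released here, when the transaction ends
```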
@@ -485,6 +492,9 @@ def reject_annotator(self, user, finished_time=timezone.now()):
annotator_project.status = AnnotatorProject.COMPLETED
annotator_project.rejected = True
annotator_project.save()

Annotation.clear_all_pending_user_annotations(user)

except ObjectDoesNotExist:
raise Exception(f"User {user.username} is not an annotator of the project.")

@@ -589,6 +599,9 @@ def get_annotator_task(self, user):
user from the annotator list if there are no more tasks or the user has reached their quota.
"""

# Lock required to prevent concurrent calls from assigning two different tasks
# to the same user
user = user.lock_user()
annotation = self.get_current_annotator_task(user)
if annotation:
# User has existing task
@@ -623,7 +636,7 @@ def get_current_annotator_task(self, user):

annotation = current_annotations.first()
if annotation.document.project != self:
return RuntimeError(
raise RuntimeError(
"The annotation doesn't belong to this project! Annotator should only work on one project at a time")

return annotation
@@ -724,9 +737,18 @@ def assign_annotator_task(self, user, doc_type=DocumentType.ANNOTATION):
Annotation tasks perform an extra check for remaining annotation tasks (num_annotation_tasks_remaining);
testing and training do not do this check, as the annotator must annotate all documents.
"""
if (DocumentType.ANNOTATION and self.num_annotation_tasks_remaining > 0) or \
DocumentType.TEST or DocumentType.TRAINING:
for doc in self.documents.filter(doc_type=doc_type).order_by('?'):
if (doc_type == DocumentType.ANNOTATION and self.num_annotation_tasks_remaining > 0) or \
doc_type == DocumentType.TEST or doc_type == DocumentType.TRAINING:
if doc_type == DocumentType.TEST or doc_type == DocumentType.TRAINING:
queryset = self.documents.filter(doc_type=doc_type).order_by('?')
else:
# Prefer documents which have fewer complete or pending annotations, in order to
# spread the annotators as evenly as possible across the available documents
queryset = self.documents.filter(doc_type=doc_type).alias(
occupied_annotations=Count("annotations", filter=Q(annotations__status=Annotation.COMPLETED)
| Q(annotations__status=Annotation.PENDING))
).order_by('occupied_annotations', '?')
for doc in queryset:
# Check that annotator hasn't annotated and that
# doc hasn't been fully annotated
if doc.user_can_annotate_document(user):
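Taken on its own, the new "spread annotators evenly" ordering amounts to the queryset below; this is a sketch reusing names from the diff above (`project` stands for the `self` of `assign_annotator_task`), and note that `QuerySet.alias()` requires Django 4.0 or later:

```python
from django.db.models import Count, Q

# Documents with the fewest completed-or-pending annotations sort first;
# the trailing '?' shuffles documents within each tie group.
queryset = (project.documents
            .filter(doc_type=DocumentType.ANNOTATION)
            .alias(occupied_annotations=Count(
                "annotations",
                filter=Q(annotations__status=Annotation.COMPLETED)
                       | Q(annotations__status=Annotation.PENDING)))
            .order_by("occupied_annotations", "?"))
```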
20 changes: 20 additions & 0 deletions backend/tests/test_models.py
@@ -411,6 +411,26 @@ def test_reject_annotator(self):
self.assertEqual(AnnotatorProject.COMPLETED, annotator_project.status)
self.assertEqual(True, annotator_project.rejected)

def test_remove_annotator_clears_pending(self):
annotator = self.annotators[0]
# Start a task - should be one pending annotation
self.project.get_annotator_task(annotator)
self.assertEqual(1, annotator.annotations.filter(status=Annotation.PENDING).count())

# remove annotator from project - pending annotations should be cleared
self.project.remove_annotator(annotator)
self.assertEqual(0, annotator.annotations.filter(status=Annotation.PENDING).count())

def test_reject_annotator_clears_pending(self):
annotator = self.annotators[0]
# Start a task - should be one pending annotation
self.project.get_annotator_task(annotator)
self.assertEqual(1, annotator.annotations.filter(status=Annotation.PENDING).count())

# reject annotator from project - pending annotations should be cleared
self.project.reject_annotator(annotator)
self.assertEqual(0, annotator.annotations.filter(status=Annotation.PENDING).count())

def test_num_documents(self):
self.assertEqual(self.project.num_documents, self.num_docs)

27 changes: 23 additions & 4 deletions backend/tests/test_rpc_endpoints.py
@@ -8,6 +8,7 @@

from django.utils import timezone
import json
import logging

from backend.models import Annotation, Document, DocumentType, Project, AnnotatorProject, UserDocumentFormatPreference
from backend.rpc import create_project, update_project, add_project_document, add_document_annotation, \
@@ -28,7 +29,7 @@
from backend.tests.test_rpc_server import TestEndpoint



LOGGER = logging.getLogger(__name__)

class TestUserAuth(TestCase):

@@ -1379,7 +1380,7 @@ def setUp(self):
self.num_training_docs = 5
self.training_docs = []
for i in range(self.num_training_docs):
self.docs.append(Document.objects.create(project=self.proj,
self.training_docs.append(Document.objects.create(project=self.proj,
doc_type=DocumentType.TRAINING,
data={
"text": f"Document {i}",
@@ -1396,7 +1397,7 @@ def setUp(self):
self.num_test_docs = 10
self.test_docs = []
for i in range(self.num_test_docs):
self.docs.append(Document.objects.create(project=self.proj,
self.test_docs.append(Document.objects.create(project=self.proj,
doc_type=DocumentType.TEST,
data={
"text": f"Document {i}",
@@ -1609,10 +1610,11 @@ def test_annotations_per_doc_not_enforced_for_training_or_test(self):
self.proj.save()

docs_annotated_per_user = []
for (i, (ann_user, _)) in enumerate(self.annotators):
for (ann_user, _) in self.annotators:
# Add to project
self.assertTrue(add_project_annotator(self.manager_request, self.proj.id, ann_user.username))

for (i, (ann_user, _)) in enumerate(self.annotators):
# Every annotator should be able to complete every training document, even though
# max annotations per document is less than the total number of annotators
self.assertEqual(self.num_training_docs,
@@ -1623,6 +1625,7 @@
self.assertEqual(self.num_training_docs,
self.proj.get_annotator_document_score(ann_user, DocumentType.TRAINING))

for (i, (ann_user, _)) in enumerate(self.annotators):
# Every annotator should be able to complete every test document, even though
# max annotations per document is less than the total number of annotators
self.assertEqual(self.num_test_docs,
@@ -1633,6 +1636,7 @@
self.assertEqual(self.num_training_docs,
self.proj.get_annotator_document_score(ann_user, DocumentType.TRAINING))

for (i, (ann_user, _)) in enumerate(self.annotators):
# Now attempt to complete task normally
num_annotated = self.complete_annotations(self.num_docs, "Annotation", annotator=i)
docs_annotated_per_user.append(num_annotated)
@@ -1662,15 +1666,30 @@ def complete_annotations(self, num_annotations_to_complete, expected_doc_type_str, annotator=0, answer="positive"):

# Expect to get self.num_training_docs tasks
num_completed_tasks = 0
if expected_doc_type_str == 'Annotation':
all_docs = self.docs
elif expected_doc_type_str == 'Training':
all_docs = self.training_docs
else:
all_docs = self.test_docs

annotated_docs = {doc.pk: ' ' for doc in all_docs}
for i in range(num_annotations_to_complete):
task_context = get_annotation_task(ann_req)
if task_context:
self.assertEqual(expected_doc_type_str, task_context.get("document_type"),
f"Document type does not match in task {task_context!r}, " +
"annotator {ann.username}, document {i}")
annotated_docs[task_context['document_id']] = "\u2714"
complete_annotation_task(ann_req, task_context["annotation_id"], {"sentiment": answer})
num_completed_tasks += 1

# Draw a nice markdown table of exactly which documents each annotator was given
if annotator == 0:
LOGGER.debug("Annotator | " + (" | ".join(str(i) for i in annotated_docs.keys())))
LOGGER.debug(" | ".join(["--"] * (len(annotated_docs)+1)))
LOGGER.debug(ann.username + " | " + (" | ".join(str(v) for v in annotated_docs.values())))

return num_completed_tasks

class TestAnnotationChange(TestEndpoint):
Expand Down