Commit

feat: Add option to enable/disable regex matching for expected fields in file blueprint settings.
Nico-AP committed Dec 3, 2023
1 parent 76ddd81 commit e5041b5
Showing 12 changed files with 96 additions and 102 deletions.
5 changes: 3 additions & 2 deletions ddm/forms.py
@@ -82,8 +82,9 @@ class BlueprintEditForm(forms.ModelForm):

class Meta:
model = DonationBlueprint
fields = ['name', 'description', 'regex_path', 'exp_file_format', 'csv_delimiter',
'file_uploader', 'json_extraction_root', 'expected_fields']
fields = ['name', 'description', 'regex_path', 'exp_file_format',
'csv_delimiter', 'file_uploader', 'json_extraction_root',
'expected_fields', 'expected_fields_regex_matching']
widgets = {
'expected_fields': forms.Textarea(attrs={'rows': 1}),
'regex_path': forms.Textarea(attrs={'rows': 1}),
@@ -0,0 +1,18 @@
# Generated by Django 3.2.13 on 2023-12-03 13:31

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('ddm', '0045_alter_processingrule_comparison_operator'),
]

operations = [
migrations.AddField(
model_name='donationblueprint',
name='expected_fields_regex_matching',
field=models.BooleanField(default=False, help_text='Select if you use regex expressions in the "Expected fields".'),
),
]
7 changes: 7 additions & 0 deletions ddm/models/core.py
@@ -466,6 +466,12 @@ class FileFormats(models.TextChoices):
)
)

expected_fields_regex_matching = models.BooleanField(
default=False,
null=False,
help_text='Select if you use regex expressions in the "Expected fields".'
)

file_uploader = models.ForeignKey(
'FileUploader',
null=True,
@@ -505,6 +511,7 @@ def get_config(self):
'format': self.exp_file_format,
'json_extraction_root': self.json_extraction_root,
'expected_fields': json.loads("[" + str(self.expected_fields) + "]"),
'exp_fields_regex_matching': self.expected_fields_regex_matching,
'fields_to_extract': self.get_fields_to_extract(),
'regex_path': self.regex_path,
'filter_rules': self.get_filter_rules(),
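The `get_config` hunk above wraps the stored `expected_fields` value in square brackets and parses it with `json.loads`, so the setting has to be a comma-separated sequence of double-quoted names. A minimal sketch of that parsing step outside Django follows; `parse_expected_fields` is a hypothetical helper name used only for illustration and is not part of the commit.

```python
import json


def parse_expected_fields(raw):
    """Parse a comma-separated, double-quoted field list into a Python list.

    Mirrors the expression used in get_config(); the function name itself
    is an assumption made for this illustration.
    """
    return json.loads("[" + str(raw) + "]")


# A well-formed setting yields a list of field names.
print(parse_expected_fields('"Field A", "Field B"'))

# Unquoted names are not valid JSON and raise an error.
try:
    parse_expected_fields("Field A, Field B")
except json.JSONDecodeError as exc:
    print("invalid setting:", exc.msg)

# A single backslash before "d" is not a valid JSON escape either.
try:
    parse_expected_fields(r'"time_\d+"')
except json.JSONDecodeError as exc:
    print("invalid setting:", exc.msg)
```

Because the value is parsed as JSON, a regex containing a backslash would have to be written with a doubled backslash under this expression. Whether the form normalizes the input before this point is not visible in the diff.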
2 changes: 1 addition & 1 deletion ddm/static/ddm/vue/js/vue_questionnaire.js

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions ddm/static/ddm/vue/js/vue_uploader.js

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion ddm/static/ddm/vue/webpack-stats.json
@@ -37,5 +37,7 @@
"js/vue_questionnaire.js"
]
},
"publicPath": "/static/ddm/vue/"
"publicPath": "/static/ddm/vue/",
"error": "ESLintError",
"message": "[eslint] \nC:\\Files\\Arbeit\\Projekte\\Data Donation Lab\\Code\\DDM\\ddm\\frontend\\src\\components\\FileUploader.vue\n 510:52 error 'field' is not defined no-undef\n\n✖ 1 problem (1 error, 0 warnings)\n"
}
@@ -0,0 +1,15 @@
<h5 class="pt-4"><b>Data Extraction</b></h5>
<ul>
<li>The data extraction follows <i>extraction rules</i> which can be configured below. These rules are applied
consecutively in the defined order.
</li>
<li><b>Keep data:</b> For every field/column/variable that you want to
keep in the donated data, you first have to define an extraction rule with the "Keep field" operator.
</li>
<li><b>Filter and alter data:</b> Next, you can add rules to filter (i.e., delete) or alter entries in the
uploaded data
(e.g., to delete all entries where the date is < 01.01.2020, or to replace
e-mail-addresses with "ANONYMIZED EMAIL"). For this, there are several comparison and regex operations
available.
</li>
</ul>
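The help text above describes extraction as a sequence of rules: a field is kept only if a "Keep field" rule names it, and later rules filter or alter entries in the defined order. A rough Python sketch of that idea follows, under stated assumptions: the rule dictionaries, the operator names (`keep`, `delete_if_smaller`, `regex_replace`) and the function `apply_rules` are all hypothetical and do not reflect the module's actual rule model.

```python
import re


def apply_rules(entries, rules):
    """Apply hypothetical extraction rules consecutively, in the given order."""
    kept_fields = [r["field"] for r in rules if r["op"] == "keep"]
    result = []
    for entry in entries:
        # Keep only fields that have an explicit "keep" rule.
        row = {k: v for k, v in entry.items() if k in kept_fields}
        drop = False
        for rule in rules:
            value = row.get(rule["field"])
            if value is None:
                continue
            if rule["op"] == "delete_if_smaller" and value < rule["value"]:
                # Filter rule: a match deletes the whole entry.
                drop = True
                break
            if rule["op"] == "regex_replace":
                # Alter rule: replace every pattern match in the value.
                row[rule["field"]] = re.sub(
                    rule["pattern"], rule["replacement"], value)
        if not drop:
            result.append(row)
    return result


rules = [
    {"op": "keep", "field": "date"},
    {"op": "keep", "field": "mail"},
    {"op": "delete_if_smaller", "field": "date", "value": "2020-01-01"},
    {"op": "regex_replace", "field": "mail",
     "pattern": r"[\w.+-]+@[\w-]+\.[\w.-]+", "replacement": "ANONYMIZED EMAIL"},
]
entries = [
    {"date": "2019-05-01", "mail": "a@b.com", "secret": "x"},
    {"date": "2021-03-02", "mail": "write to jane@example.org", "secret": "y"},
]
print(apply_rules(entries, rules))
```

Dates are compared as ISO-formatted strings here, which keeps the sketch short; the `secret` field is dropped because no keep rule mentions it.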
@@ -0,0 +1,17 @@
<div class="ddm-admin-form pt-5">
<h4>Data Extraction Settings</h4>

<p>Data Extraction is a two-step process consisting of first the <b>file validation</b> and second the <b>data extraction</b>.</p>

<h5><b>File Validation</b></h5>
<ul>
<li>First, it is checked whether the expected file is included in the uploaded data (only applies to ZIP uploads).
If the associated File Uploader expects a ZIP Upload, the correct file is identified
using the provided <code>file path</code> (this is skipped for single file uploads).</li>
<li>Second, it is checked whether the uploaded file is in the <code>expected file format</code>.</li>
<li>Third, it is checked whether the identified file contains <b>all</b> <code>expected fields</code>.</li>
<li>
If any of these validation steps fail, the participant will be shown an
exception message explaining what went wrong and no data is extracted.</li>
</ul>
</div>
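The three validation steps listed above (locate the file by its path pattern, check the format, check the expected fields) can be sketched in Python as follows. This is an illustration under assumptions, not the module's implementation: the upload is modelled as a plain dictionary of path to text content, only JSON is handled, and `validate_upload` and `ValidationError` are hypothetical names. As a simplification, the sketch rejects the whole file at the first entry that lacks a field.

```python
import json
import re


class ValidationError(Exception):
    """Raised when an uploaded file fails one of the validation steps."""


def validate_upload(files, regex_path, expected_fields):
    """Run the three validation steps on a hypothetical ZIP-like upload.

    `files` maps file paths inside the upload to their text content.
    """
    # Step 1: identify the expected file via the file path pattern.
    matches = [name for name in files if re.search(regex_path, name)]
    if not matches:
        raise ValidationError("expected file not found")
    # Step 2: check the file format (JSON in this sketch).
    try:
        entries = json.loads(files[matches[0]])
    except json.JSONDecodeError as exc:
        raise ValidationError("file is not in the expected format") from exc
    # Step 3: every entry must contain all expected fields.
    for entry in entries:
        missing = [f for f in expected_fields if f not in entry]
        if missing:
            raise ValidationError(f"missing fields: {missing}")
    return entries


upload = {
    "readme.txt": "hello",
    "data/history.json": '[{"title": "a", "time": "2021-03-02"}]',
}
print(validate_upload(upload, r"history\.json$", ["title", "time"]))
```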
@@ -23,7 +23,7 @@

<div class="ddm-admin-form">
{% for field in form %}
{% if field.name not in "expected_fields,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
{% if field.name not in "expected_fields,expected_fields_regex_matching,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
<p>
{{ field.label_tag }}
{{ field.errors }}
@@ -34,34 +34,11 @@
{% endfor %}
</div>
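The template checks above compare `field.name` against a single comma-separated string. In Django templates the `in` operator on a string tests for a substring, as Python's `in` does. The short sketch below contrasts that with membership in a list; the field name `regex` is made up for illustration.

```python
GROUP = ("expected_fields,expected_fields_regex_matching,regex_path,"
         "exp_file_format,csv_delimiter,json_extraction_root")

# Substring test: what `in` does when the right-hand side is a string.
print("regex_path" in GROUP)  # an intended member of the group
print("regex" in GROUP)       # hypothetical field name, matches as a substring
print("name" in GROUP)        # not contained anywhere in the string

# Exact membership after splitting the string into a list.
members = GROUP.split(",")
print("regex_path" in members)
print("regex" in members)
```

For the field names listed in `BlueprintEditForm`, the substring test produces the intended grouping. The distinction would only matter if a later field name happened to be contained in one of the listed names.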

<div class="ddm-admin-form">
<h5>Data Extraction Settings</h5>
</div>

<div>
<p>Data Extraction is a two-step process:</p>

<h6><b>1. File Validation</b></h6>
<p>
First, it is checked whether the file that you expect is included in the download.
This means that if the associated File Uploader expects a ZIP Upload, it tries to find the correct
file according to the <i>file path</i> you defined (this is skipped for single file uploads).
</p>
<p>
Next, it is checked whether the uploaded file has the expected format defined in the <i>Expected File Format</i>
setting (and other settings, depending on the file format).
</p>
<p>
Lastly, it is checked whether the identified file contains the expected fields
defined in the <i>Expected Fields</i> setting.<br>
If any of these validation steps fail, the participant will be shown an
exception message explaining what went wrong and the file upload and extraction is aborted.
</p>
</div>
{% include "ddm/admin/data_donation/donation_blueprint/block_file_validation.html" %}

<div class="ddm-admin-form">
{% for field in form %}
{% if field.name in "expected_fields,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
{% if field.name in "expected_fields,expected_fields_regex_matching,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
<p>
{{ field.label_tag }}
{{ field.errors }}
@@ -72,29 +49,7 @@ <h6><b>1. File Validation</b></h6>
{% endfor %}
</div>

<div>
<h6><b>2. Data Extraction</b></h6>
<p>
For the data extraction, the Data Donation Module follows the data sparsity paradigm.
This means that the base assumption is, that you do not want any data from your participants,
and you have to explicitly indicate which data fields you want to have included.
</p>
<p>
To keep data in the data donation, you must define <i>Extraction Rules</i>.<br>
An Extraction Rule is always related to one field/column in the uploaded data file
and a data field will only be kept in a participant's donation if it is explicitly
mentioned in at least one of the extraction rules.
</p>
<p>
An extraction rule can either indicate to just keep a field in the donation
(by mentioning the field/column in an extraction rule without defining any concrete comparison operator),
use data contained in a field to delete data entries (i.e., rows) from the donation
(e.g., to delete all entries where the date is < 01.01.2020) or
alter the data contained in a field (e.g., anonymize an e-mail address by replacing "[email protected]" with "EMAIL").<br>
For this, there are several comparison and regex operations available. For the comparison operations, a match
means that a data entry will be deleted. The rules are applied to the uploaded file in the indicated order.
</p>
</div>
{% include "ddm/admin/data_donation/donation_blueprint/block_data_extraction.html" %}

<div class="pb-3"><i>You will be able to define the extraction rules once you have saved and created the File blueprint.</i></div>

41 changes: 4 additions & 37 deletions ddm/templates/ddm/admin/data_donation/donation_blueprint/edit.html
@@ -14,7 +14,7 @@

<div class="ddm-admin-form">
{% for field in form %}
{% if field.name not in "expected_fields,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
{% if field.name not in "expected_fields,expected_fields_regex_matching,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
<p>
{{ field.label_tag }}
{{ field.errors }}
@@ -25,29 +25,11 @@
{% endfor %}
</div>

<div class="ddm-admin-form pt-5">
<h4>Data Extraction Settings</h4>
</div>

<div>
<p>Data Extraction is a two-step process consisting of first the <b>file validation</b> and second the <b>data extraction</b>.</p>

<h5 class="pt-3"><b>File Validation</b></h5>
<ul>
<li>First, it is checked whether the expected file is included in the uploaded data (only applies to ZIP uploads).
If the associated File Uploader expects a ZIP Upload, the correct file is identified
using the provided <code>file path</code> (this is skipped for single file uploads).</li>
<li>Second, it is checked whether the uploaded file is in the <code>expected file format</code>.</li>
<li>Third, it is checked whether the identified file contains <b>all</b> <code>expected fields</code>.</li>
<li>
If any of these validation steps fail, the participant will be shown an
exception message explaining what went wrong and no data is extracted.</li>
</ul>
</div>
{% include "ddm/admin/data_donation/donation_blueprint/block_file_validation.html" %}

<div class="ddm-admin-form">
{% for field in form %}
{% if field.name in "expected_fields,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
{% if field.name in "expected_fields,expected_fields_regex_matching,regex_path,exp_file_format,csv_delimiter,json_extraction_root" %}
<p>
{{ field.label_tag }}
{{ field.errors }}
@@ -58,22 +40,7 @@ <h5 class="pt-3"><b>File Validation</b></h5>
{% endfor %}
</div>


<h5 class="pt-4"><b>Data Extraction</b></h5>
<ul>
<li>The data extraction follows <i>extraction rules</i> which can be configured below. These rules are applied
consecutively in the defined order.
</li>
<li><b>Keep data:</b> For every field/column/variable that you want to
keep in the donated data, you first have to define an extraction rule with the "Keep field" operator.
</li>
<li><b>Filter and alter data:</b> Next, you can add rules to filter (i.e., delete) or alter entries in the
uploaded data
(e.g., to delete all entries where the date is < 01.01.2020, or to replace
e-mail-addresses with "ANONYMIZED EMAIL"). For this, there are several comparison and regex operations
available.
</li>
</ul>
{% include "ddm/admin/data_donation/donation_blueprint/block_data_extraction.html" %}

{{ formset.management_form }}
<div class="ddm-admin-form">
7 changes: 6 additions & 1 deletion docs/modules/ROOT/pages/for_researchers.adoc
@@ -271,8 +271,13 @@ that can help you test regex patterns.
====

Expected fields:: The fields that must be contained in the donated file. If a file does not contain
one or more of the fields defined here, it will not be accepted as a donation.
*all* fields defined here, it will not be accepted as a donation. +
Put the field names in double quotes (") and separate them with commas ("Field A", "Field B").
You can also use regular expressions (regex) to match expected fields - for this, you
must enable the `expected field regex matching` option (see below).

Expected field regex matching:: Select this option if the `Expected fields`
setting contains regular expressions.
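To illustrate how the two matching modes treat the same setting, here is a small Python sketch. It assumes that regex matching searches for the pattern anywhere in a field name, which is how the `RegExp.test` call in the uploader component behaves. `field_present` and the sample field names are hypothetical.

```python
import re


def field_present(expected, keys, regex_matching=False):
    """Return True if `expected` is found among `keys`.

    With regex_matching the expected value is treated as a pattern and
    searched anywhere in each key; otherwise it must equal a key exactly.
    """
    if regex_matching:
        pattern = re.compile(expected)
        return any(pattern.search(key) for key in keys)
    return expected in keys


keys = ["Watched at (UTC)", "Title", "Username"]

# Exact matching: the name must be identical to a field in the file.
print(field_present("Title", keys))
print(field_present("Watched at .*", keys))

# Regex matching: the same string is now interpreted as a pattern.
print(field_present("Watched at .*", keys, True))
# Patterns are unanchored, so "name" is found inside "Username".
print(field_present("name", keys, True))
# Special characters change meaning: the parentheses form a group here.
print(field_present("Watched at (UTC)", keys, True))
# Escaping restores the literal reading.
print(field_present(re.escape("Watched at (UTC)"), keys, True))
```

Under these assumptions, names containing characters such as parentheses or dots need escaping once regex matching is enabled, and `^`/`$` anchors are needed where a whole-name match is intended.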

Expected File Format:: The file format of the expected data donation. Currently, only JSON and CSV are implemented.

23 changes: 15 additions & 8 deletions frontend/src/components/FileUploader.vue
@@ -497,18 +497,25 @@ export default {
fileContent.forEach(entry => {
// Check if all expected fields are here
// Check if file contains all expected fields
let missingFields = [];
if (!blueprint.expected_fields.every(element => {
let eleRegex = new RegExp(element);
if (Object.keys(entry).filter(entry => eleRegex.test(entry)).length > 0){
return true;
if (!blueprint.expected_fields.every(field => {
// Decide whether the field is present, by regex or by exact key name.
let fieldPresent;
if (blueprint.exp_fields_regex_matching) {
let fieldRegex = new RegExp(field);
fieldPresent = Object.keys(entry).some(key => fieldRegex.test(key));
} else {
fieldPresent = Object.keys(entry).includes(field);
}
if (fieldPresent) {
return true;
} else {
missingFields.push(element)
missingFields.push(field)
nEntriesWithMissingFields += 1;
return false;
}})) {
// Go to next entry and record exception
}
})) {
// TODO: Check whether and how to implement this:
// Go to next entry and record exception
// let errorMsg = `Entry does not contain the expected field(s) "${missingFields.toString()}".`;
// uploader.postError(4203, errorMsg, blueprint.id);
return;
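The per-entry check in this hunk skips entries that lack an expected field and counts them. A Python rendering of that intent is sketched below. It is not a translation of the component: `split_entries` is a hypothetical name, and unlike `Array.prototype.every`, which stops at the first missing field, this version collects every missing field of an entry before moving on.

```python
import re


def split_entries(entries, expected_fields, regex_matching):
    """Separate entries that contain all expected fields from those that do not.

    Returns the complete entries and the number of entries with missing fields.
    """
    complete, n_missing = [], 0
    for entry in entries:
        missing = []
        for field in expected_fields:
            if regex_matching:
                # Pattern search over all keys of the entry.
                found = any(re.search(field, key) for key in entry)
            else:
                # Exact key lookup.
                found = field in entry
            if not found:
                missing.append(field)
        if missing:
            # Count the entry and go to the next one.
            n_missing += 1
            continue
        complete.append(entry)
    return complete, n_missing


entries = [{"title": "a", "time_1": "x"}, {"title": "b"}]
print(split_entries(entries, ["title", r"time_\d+"], True))
print(split_entries(entries, ["title", r"time_\d+"], False))
```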