File description metadata of ingested files are not in the DDI exported metadata #5051

Closed
jggautier opened this issue Sep 12, 2018 · 20 comments · Fixed by #10938
Labels
Feature: Metadata; FY25 Sprint 8 (2024-10-09 - 2024-10-23); Size: 3 (a percentage of a sprint, 2.1 hours)
Comments

@jggautier
Contributor

jggautier commented Sep 12, 2018

I've seen this omission/bug since at least Dataverse version 4.9 and verified it in Dataverse version 6.1:

In the exported DDI metadata of datasets with ingested files, the file description text that depositors enter for uploaded files is missing from the DDI.xml for every file that Dataverse ingests. For example, for the dataset at https://doi.org/10.7910/DVN/1ZPAKL, the ingested file's description, "Data from vignette survey experiment conducted in Denmark in June 2023", is not in the exported DDI.

I think next steps would be to figure out if:

  • this is a bug, so the file descriptions of ingested files should be in the DDI, or
  • this is just an omission, and we have to figure out where in the DDI xml the file description should be
@pdurbin pdurbin added the Type: Suggestion an idea label Nov 16, 2023
@jggautier jggautier removed the Type: Suggestion an idea label May 1, 2024
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project May 6, 2024
@cmbz

cmbz commented May 6, 2024

2024/05/06

  • Determine if this is an omission or a bug.

@cmbz cmbz added the Size: 3 A percentage of a sprint. 2.1 hours. label Jul 10, 2024
@cmbz

cmbz commented Jul 10, 2024

2024/07/10

  • Sized at 3 and assigned to @jggautier for assessment. Resize and reassign as needed.

@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Jul 10, 2024
@pdurbin
Member

pdurbin commented Jul 10, 2024

@amberleahey or @stevenmce might know off the top of their head where a file description should go in DDI.

@stevenmce

The spec is here: https://ddialliance.org/Specification/DDI-Codebook/2.5/

Are we updating to DDI version 2.1 or 2.5? (2.6 is also on its way, but it has only just been released for review: https://github.com/ddialliance/ddi-c_2.)

@stevenmce

Here's a sample XML from the DDI spec site:
https://ddialliance.org/sites/default/files/1990STF1_2_5.xml
(From a study from Minnesota Population Center)

@jggautier
Contributor Author

jggautier commented Jul 12, 2024

Thanks for the links @stevenmce.

I think Dataverse is using DDI 2.5 already, right? That's what we say in the Appendix page of the Dataverse Guides and I see references to that version in the DDI exports. And when I opened this GitHub issue, I was referring to how Dataverse uses that version of DDI Codebook.

I think it'll be helpful to add what some folks from the Dataverse core team said about this GitHub issue during a planning meeting this week:

  • We weren't sure if this was a bug and the file descriptions of ingested files were meant to be included in DDI-C exports, or if those file descriptions were intentionally left out of the DDI-C export.
  • @pdurbin asked why this is important. I wondered if one reason is that excluding file descriptions of ingested files might make some data less discoverable. For example, if DDI-C metadata of one Dataverse repository is harvested into another repository, that repository won't be able to index what's in the description metadata of ingested files, which might help others find that dataset. And I recommended reaching out to learn more from other repositories that seem interested in this GitHub issue, like Dataverse SODHA.

With other priorities I'm not able to focus on this issue, so I'm recommending we move it out of the sprint ready column of the IQSS Dataverse Project board. @sbarbosadataverse, do you agree?

@pdurbin
Member

pdurbin commented Jul 15, 2024

I think exposing file descriptions via DDI is a great idea. I took a quick look at the links above but I wasn't able to quickly figure out which DDI field to use. 🤷

@amberleahey

amberleahey commented Jul 15, 2024

A few things are happening for file metadata and DDI Codebook exports:

  1. Only tabular ingested files are added to the File Description DDI tag set, and all other files in the Dataverse dataset are added to the Other Materials DDI tag; see this example: https://odesi.ca/api/ddi?id=/odesi/doi__10-5683_SP3_LDJZ8Y.xml
  2. For tabular ingested files, the descriptions are not included in that section, but for non-tabular files referenced in OthMat the descriptions are included and mapped to DDI, e.g. "Command code - STATA format" (see the full dataset in Borealis and the tabular file whose description is not included in the exported DDI XML here).
  3. It's interesting that for OthMat files the notes entry is autogenerated by Dataverse for the MIME type (e.g. "text/x-spss-syntax").

Overall, I think the tabular data ingested files could remain in the File Dscr section and we add a TXT or NOTE tag to the set for the descriptions. We also noticed there were issues with mapping the new standard CC licenses (these do not get into the DDI, but custom licenses do), so we had to set this up for all of Odesi. There are other mapping issues with Codebook that could be tackled by the DDI community, and a new exporter could be built to support 2.5 and 2.6 with these improved mappings.
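The otherMat mapping described in point 2 can be checked against an exported codebook with a short sketch like the one below. It is only an illustration: the class and method names are hypothetical, the sample XML is a trimmed stand-in for a real export, and the parser is left non-namespace-aware so that plain element names match.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of Dataverse.
class OtherMatDescriptionsSketch {

    // Maps each otherMat file label (labl) to its description (txt).
    // A file without a txt element gets an empty description.
    static Map<String, String> otherMatDescriptions(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> result = new LinkedHashMap<>();
            NodeList otherMats = doc.getElementsByTagName("otherMat");
            for (int i = 0; i < otherMats.getLength(); i++) {
                Element otherMat = (Element) otherMats.item(i);
                result.put(firstText(otherMat, "labl"), firstText(otherMat, "txt"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse DDI XML", e);
        }
    }

    // Text of the first descendant element with the given name, or "" if absent.
    static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch.
        String xml = "<codeBook>"
                + "<otherMat ID='f1' level='datafile'>"
                + "<labl>example.dct</labl><txt>Command code - STATA format</txt>"
                + "</otherMat>"
                + "</codeBook>";
        System.out.println(otherMatDescriptions(xml));
    }
}
```

Running the same check against fileDscr elements would come back empty for descriptions, which is the gap this issue describes.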

@pdurbin
Member

pdurbin commented Jul 15, 2024

@amberleahey thanks, that helped me find the writeFileDescription method that does indeed write to the DDI txt field, like you're saying, such as <txt>Command code - STATA format</txt> below.

<otherMat ID="f663995" URI="https://borealisdata.ca/api/access/datafile/663995" level="datafile" restricted="false">
<labl>CTNS2022_P_BSW.dct</labl>
<txt>Command code - STATA format</txt>
<notes level="file" subject="Content/MIME Type" type="DATAVERSE:CONTENTTYPE">application/octet-stream</notes>
</otherMat>

private static void writeFileDescription(XMLStreamWriter xmlw, FileDTO fileDTo) throws XMLStreamException {
    xmlw.writeStartElement("txt");
    String description = fileDTo.getDataFile().getDescription();
    if (description != null) {
        xmlw.writeCharacters(description);
    }
    xmlw.writeEndElement(); // txt
}

@jggautier jggautier removed their assignment Aug 13, 2024
@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

@cmbz cmbz closed this as completed Aug 20, 2024
@cmbz

cmbz commented Aug 23, 2024

2024/08/23: Reopening because issue was already sized and prioritized.

@cmbz cmbz reopened this Aug 23, 2024
@landreev landreev moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Oct 15, 2024
@landreev landreev moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Oct 15, 2024
@landreev landreev self-assigned this Oct 15, 2024
@cmbz cmbz added the FY25 Sprint 8 FY25 Sprint 8 (2024-10-09 - 2024-10-23) label Oct 15, 2024
@landreev
Contributor

landreev commented Oct 17, 2024

Just a quick note before I make a PR:

Overall, I think the tabular data ingested files could remain in the File Dscr section and we add a TXT or NOTE tag to the set for the descriptions.

Tabular ("ingested") files do need to remain in fileDscr sections - that's required by the schema essentially, since fileDscr provides the dedicated fields that encode information specific to tabular data (such as dimensns, caseQnty and varQnty).
We cannot add a txt field for the description text there, like we do with otherMat, because it's not in the schema. But a note with an appropriate attribute seems like a good solution - and yes, we should have handled it like that all along.

I'm seeing that this was estimated as a "3", which is what we use for most straightforward fixes - like the amount of effort it would take to implement what I just described above, so I'll try and stay within that. :)

@landreev
Contributor

So, it'll look like this:

<notes level="file" type="DATAVERSE:FILEDESC" subject="DataFile Description">
   This is a tabular file produced from a Stata .dta file with rich descriptive metadata
</notes>
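
A minimal sketch of how a note like this could be written with XMLStreamWriter, in the style of the writeFileDescription method quoted earlier in this thread. The class name, the helper names, and the choice to skip empty descriptions are assumptions made for illustration; this is not the code from the PR.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, not part of Dataverse.
class FileDescriptionNoteSketch {

    // Writes the description as a <notes> element with the attributes proposed above.
    // Assumption: files with no description get no note at all.
    static void writeFileDescriptionNote(XMLStreamWriter xmlw, String description)
            throws XMLStreamException {
        if (description == null || description.trim().isEmpty()) {
            return;
        }
        xmlw.writeStartElement("notes");
        xmlw.writeAttribute("level", "file");
        xmlw.writeAttribute("type", "DATAVERSE:FILEDESC");
        xmlw.writeAttribute("subject", "DataFile Description");
        xmlw.writeCharacters(description); // escapes &, < and > in the text
        xmlw.writeEndElement(); // notes
    }

    // Renders a bare <fileDscr> wrapper around the note so the output can be inspected.
    static String render(String description) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xmlw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xmlw.writeStartElement("fileDscr");
            writeFileDescriptionNote(xmlw, description);
            xmlw.writeEndElement(); // fileDscr
            xmlw.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("This is a tabular file produced from a Stata .dta file"));
    }
}
```

In a real fileDscr the note would sit alongside the other child elements, in whatever position the schema allows; the sketch ignores ordering.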

@jggautier
Contributor Author

Thanks @landreev!

I opened this GitHub issue and merely described something that seemed inconsistent to me. But I think I should have also encouraged us to think about how we'll know whether the way we resolve this is a good one. And I hope that we can discuss this now while considering your solution.

I imagine this would help anyone who needs to export the DDI-Codebook metadata of data in their repository in order to preserve that metadata. Does that sound right?

This change has no effect on how findable harvested datasets are, since I think Dataverse doesn't index any of the file-level metadata that it harvests from DDI-Codebook metadata.

@lubitchv
Contributor

This change may potentially affect our data explore and our other tool (odesi). We will need to test that.

@landreev
Contributor

This change may potentially affect our data explore and our other tool (odesi). We will need to test that.

I'll test it with your Data Explorer also. I can't imagine it actually causing a problem: since the new note has attributes clearly marking it as different from the other kinds of notes that can be found under <fileDscr>, I expect it to just be skipped. But yes, it needs to be tested, of course.
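
To illustrate why a consumer can tell the new note apart, here is a minimal sketch of a reader that selects notes under fileDscr by their type attribute. The class and method names are hypothetical, the second note type in the sample is made up, and nothing here is taken from Data Explorer or odesi.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of Dataverse or any of the tools mentioned.
class FileDscrNotesSketch {

    // Returns the text of every <notes> element under <fileDscr> whose type
    // attribute equals the requested type; notes of other types are skipped.
    static List<String> notesOfType(String xml, String type) {
        try {
            // The default factory is not namespace-aware, so plain names match.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList fileDscrs = doc.getElementsByTagName("fileDscr");
            for (int i = 0; i < fileDscrs.getLength(); i++) {
                NodeList notes = ((Element) fileDscrs.item(i)).getElementsByTagName("notes");
                for (int j = 0; j < notes.getLength(); j++) {
                    Element note = (Element) notes.item(j);
                    if (type.equals(note.getAttribute("type"))) {
                        result.add(note.getTextContent().trim());
                    }
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse DDI XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample input made up for this sketch; EXAMPLE:OTHER is not a real note type.
        String xml = "<codeBook><fileDscr ID='f1'>"
                + "<notes level='file' type='EXAMPLE:OTHER'>some other note</notes>"
                + "<notes level='file' type='DATAVERSE:FILEDESC' subject='DataFile Description'>"
                + "A tabular file with a description</notes>"
                + "</fileDscr></codeBook>";
        System.out.println(notesOfType(xml, "DATAVERSE:FILEDESC"));
    }
}
```

A tool that only asks for the note types it already knows would never see the new one, which matches the expectation that it is simply skipped.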

Was good to see you at Dagstuhl! 🙂

@landreev
Contributor

This change may potentially affect our data explore and our other tool (odesi). We will need to test that.

FWIW, the test in EditDDIIT is passing and EditDDI does not appear to be using <fileDscr>.

@landreev
Contributor

@jggautier Yeah, it was just a weird inconsistency. Was worth fixing just for the sake of striving to export as much of the information about the data as possible. Whether it'll ever benefit anyone significantly in real life, idk.
As I mentioned in the PR, I'm guessing we haven't been exporting it for ingested files because there was no obvious place for it under <fileDscr> in the schema; but we should have used another free text note for it all along.

@landreev landreev removed their assignment Oct 18, 2024
@lubitchv
Contributor

This change may potentially affect our data explore and our other tool (odesi). We will need to test that.

FWIW, the test in EditDDIIT is passing and EditDDI does not appear to be using <fileDscr>.

Right, I was thinking of DataDscr, which uses additional note sections for curation. So yes, you are right: for data curation it should not matter, since it does not use fileDscr. I should still talk to my colleague @nana-boateng, though. He uses the XML codebook for our search tool, odesi. I believe it should not matter, but we need to test that too.

@lubitchv
Contributor

lubitchv commented Oct 24, 2024

@nana-boateng confirms that the change should not affect our odesi.

ofahimIQSS added a commit that referenced this issue Nov 13, 2024
adding description info to the fileDsc seciton in DDI CodeBook. #5051
@pdurbin pdurbin added this to the 6.5 milestone Nov 13, 2024
7 participants