Welcome to "Research Data Management in the Energy Sector"! This course teaches the skills you need to understand the principles and motivation behind Research Data Management (RDM) and to implement RDM in your work and research group.
The course focuses on applicability in the energy sector.
There are three possible ways to work with this course:
- Do you want to gain a thorough understanding of RDM? Follow the course outline described in the following graph. When you complete the course, you will have established a basic Data Management Plan for a project of your choice that you can build upon and adapt.
- Are you looking for a challenge? Answer the questions at the end of each chapter to find the solution to our RDM-puzzle.
- Do you already have some knowledge of RDM? Great! Choose individual chapters that are of interest for your field of work by clicking on the plus-signs.
A quick word on the course format. The course is written in Markdown and implemented in LiaScript. In the upper right corner, you can switch between textbook mode, presentation mode and slides. You may choose to have the course read aloud to you by clicking on the symbol on the bottom of the page in presentation mode. If you want to adapt the course for your own use, you may do so by going on GitHub and opening a branch of your own or downloading individual files.
You can give feedback on this course on GitHub: FlourBerry#2
{{1}} In early 2020, the COVID-19 disease, caused by the coronavirus SARS-CoV-2, broke out globally, which led to the closure of many shops and businesses for quarantine reasons. The result, especially in the USA, was a large number of unemployed people who urgently needed money for their next rent payment, food or other expenses. As a consequence, the government decided to set up a relief package for anyone who registered as unemployed - but why didn't the money get to the people?
{{1}} The reason for this was the overload of critical systems on which COBOL is still running. COBOL is a programming language that was developed in the late 1950s to control commercial applications. From today's perspective, the programming language is very outdated and no longer taught in the training of programmers. That is why there was no personnel to take care of the systems when they collapsed. Unfortunately, many applications with the outdated programming language are still running in the business sector.
{{1}} Source: FDM Thüringen: Scarytales, Licensed under CC-BY 4.0
{{2}}
Exercise: Suggest two possible process changes that could have prevented the outcome of the scenario:
{{2}}
Solution (click to enlarge):
There are several possible solutions: Existing systems should be questioned, since requirements can change and established habits can lead to problems from today's perspective. For example, at some point it might no longer be possible to retrieve data, or data might exist in formats that are increasingly difficult to process. Thorough documentation of the programs might also help in some cases to rebuild them in other languages. For timely relief, the administration called on retired COBOL programmers to work on the issues.
{{3}} While maybe not as critical for society at large, scientists can face similar problems when trying to access old data or programs that were written for other purposes but are now needed for current tasks: Data are insufficiently labeled or have been overwritten, commercial computer programs have been discontinued or process details have not been recorded.
{{3}} Research Data Management (RDM) aims to break this dynamic by ensuring a sustainable and coherent strategy for all data types throughout the research process, enabling researchers to store, access and re-use their own work effectively and safely and opening their findings worldwide to improve on cross-disciplinary collaboration, monitoring and replication.
{{3}} RDM includes all activities associated with processing, storing, preserving and publishing research data.
The Research Data Lifecycle. Original source: UK Data Service, recreated by Antje Ahrens. Licensed under CC-BY-4.0
{{2}}
Exercise: Look at each step of the cycle with your data in mind:
Which tools or software do you use at each step?
Which infrastructure is needed?
How sensitive is the data at this point?
How well is the data protected against data loss? Where are possible flaws where data could get lost?
Does the cycle "close"? Which repositories or platforms do you use to store your results?
Researchers are currently spending a significant portion of their own time dealing with data curation; in some cases, over 50% of their funded time. Scientific equity cannot be fully achieved when individual scientists act as gatekeepers for new models, data, and software.
-- Mullendore et al. (2021)
You…
- have a lack of extra time to devote to RDM?
- don't have enough money for structured RDM?
- don't have the expertise, resources, or incentives to share your data?
- worry whether your data is transferable because some data have ethical or epistemological restrictions?
- have many stakeholders with competing interests in your project?
In this chapter, we will prove to you that RDM is not as tricky as you think.
If done right, RDM will
… save you time, resources, effort and money:
- preserve time that is otherwise lost while searching for, recovering, and deciphering data
- make data reusable for you, so you can revisit and reuse your data later
- enable you to benefit from high quality datasets from other researchers
… improve impact and speed up scientific progress:
- make research reproducible and reusable, so others can verify and build upon your findings. This strengthens integrity and increases your citations.
- improve your research reputation and increase your visibility
- continually influence research developments long after the original research has been completed
- permit new and innovative research to be built on existing information, especially within and across disciplines.
… help to prevent error and enhance data security:
- secure your data against loss and protect sensitive data against misuse, theft, damage and disaster.
- ensure records are synchronized, complete and reliable.
- improve research workflows to make them more resilient to human error and software vulnerability
… show compliance with research obligations of your institution and funder:
- fulfill grant requirements and comply with practices conducted in industry and commerce in order to demonstrate a high-quality research output.
- ensure transparency since individual contributions are documented on-the-fly throughout the complete research and archiving process.
- increase your chances of funding since many funding organizations require a DMP.
- simplify cooperation and collaboration (e.g. through better documentation and affiliation of the data collected).
Open Science strives to make scientific research and its dissemination accessible to all levels of society. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open-notebook science (such as openly sharing data and code), broader dissemination and engagement in science and generally making it easier to publish, access and communicate scientific knowledge.
Source: https://en.wikipedia.org/wiki/Open_science
Research Question:
Is there an Open Science or Research Data Policy at your institution?
What is its scope? What is regulated and how?
If not: Would you like to have a Research Data Policy? What content should it have?
The main idea behind the FAIR principles is: As open as possible, as closed as necessary! The acronym stands for
- Findable
- Accessible
- Interoperable
- Re-Usable
{{1}}
Exercise: Look closely at the graph to identify which measures especially apply to your area of work and data types you are using.
{{2}}
Quiz: What can you do to make your data FAIRer?
[[X]] use Creative-Commons or GNU Licenses
[[ ]] keep processing details undisclosed
[[ ]] ensure access security
[[X]] create detailed Metadata
[[X]] ensure long-term accessibility in repositories
The Data Management Plan (DMP) contains all information that describes and documents sufficiently the collection, processing, storage, archiving and publication of research data within a research project.
Exercise: Watch the following video and answer the questions.
!?Ghent University Data Stewards (2020) --> enter interactive h5p Video link with questions in it
Many public funding organizations require a DMP prior to granting funds for research projects, thus making DMPs an integral part of the scientific process, especially in data-intensive research fields such as the energy sector.
Funder | Required plan | Content | Updates? |
---|---|---|---|
Horizon Europe | Data Management Plan | see Horizon Europe Online Manual | Yes |
German Research Foundation (DFG) | Information on the handling of research data | DFG Guidelines on the Handling of Research Data, Checklist | No |
German Ministry for Economic Affairs and Climate Action (BMWK) | Research Data Exploitation Plan | scientific and economic potential, connectivity and transferability | once a year |
German Ministry of Education and Research (BMBF) | Plan sometimes required | Content depends on the respective programme | |
German Ministry for Digital and Transport (BMDV) | Research Data Exploitation Plan sometimes required | Content depends on the respective programme | |
Tools can help you
- to comply with data management requirements by providing guidance and templates for data management planning. They facilitate data sharing and preservation.
- to organize, analyze, and visualize your data in order to make it easier to draw insights and conclusions from your research. Many research projects generate large and complex datasets that can be difficult to manage without the right tools.
- to facilitate collaboration by allowing multiple researchers to access and analyze data.
- to ensure accuracy and reliability of your data by providing features such as data validation, version control and audit trails.
- to preserve your data long-term with backups, archives, and structured metadata.
How? The specific features and capabilities of research data management tools can vary widely depending on the tool and the intended use case. In general there are 6 steps:
- Choose the right tool
- Plan your data management strategy
- Set up the tool
- Input and manage your data
- Collaborate and share your data
- Preserve and archive your data
There is a wide range of tools for RDM, for example:
- Toolbox of the FDM Team at Leibniz Uni Hannover: https://www.fdm.uni-hannover.de/en/tools
Tools for internal DM:
Data Management Plan Tools:
You can create a DMP in a simple Excel sheet or use format templates or questionnaire-based tools. In the latter, you can choose a set of questions that you want to have included in your DMP. Some focus on data and data formats, others on the project structure. If you want to take a look at an example DMP, you can look here.
- DMPOnline (https://dmponline.dcc.ac.uk/)
- DMPTool (https://dmptool.org/)
- RDMO (https://rdmorganiser.github.io/en/)
DMP Task:
Now it is time to start your DMP! Choose a tool and a set of questions that you want to work with and enter your basic project parameters:
- Title and Research Question
- Project Partners and Institutions
Depending on your settings, you can add other details required by the respective questionnaire.
Congratulations! You have started your first DMP!
<iframe src="https://wp.uni-oldenburg.de/innovative-hochschule-jade-oldenburg-wise20182019/wp-admin/admin-ajax.php?action=h5p_embed&id=20" width="660" height="564" frameborder="0" allowfullscreen="allowfullscreen" title="Tools Quiz"></iframe><script src="https://wp.uni-oldenburg.de/innovative-hochschule-jade-oldenburg-wise20182019/wp-content/plugins/h5p/h5p-php-library/js/h5p-resizer.js" charset="UTF-8"></script>
Licenses, informed consent, protection of personal data, access
Data Protection and Backup Strategies
Secure data handling is crucial for several reasons, primarily focused on:
- Protection against data theft: Data theft refers to the unauthorized acquisition or copying of sensitive information by malicious individuals or groups. This stolen data can be exploited for various purposes, such as identity theft, financial fraud, or corporate espionage.
- Prevention of data misuse: Data misuse involves the unauthorized or inappropriate use of data. This can include accessing, disclosing, or manipulating data in a way that goes against legal or ethical guidelines. Misuse of data can have severe consequences, including reputational damage, legal liabilities, and violation of privacy regulations. Secure data handling practices, such as access controls or encryption, help prevent unauthorized access and ensure data is used appropriately.
- Protection of individuals' sensitive data: Individuals entrust organizations with their sensitive information, such as personal details, financial data, or medical records. It is essential for organizations to handle this data securely to protect individuals' privacy and prevent potential harm. Secure data handling practices involve implementing strong security measures and ensuring that only authorized personnel have access to sensitive data.
By prioritizing secure data handling, organizations can instill trust among their customers or users, maintain compliance with relevant regulations, and mitigate the risks associated with data breaches or misuse.
Moreover, various risks can lead to data loss:
- Technical defects: Malfunctions in hardware or software can result in the loss of data.
- Catastrophes: Natural disasters or severe weather conditions can cause damage to data storage systems.
- Forgetfulness: Accidental deletion or failure to back up data can result in permanent loss.
That is why protection against data loss is essential.
Several strategies help to ensure the security and availability of data:
- Storage on servers with automatic regular backup: Storing data on servers, such as university servers, provides a centralized and controlled environment. Automatic regular backups are performed to create copies of data at specific intervals. This strategy helps protect against data loss due to hardware failures, accidental deletions, or other unforeseen events. By maintaining backups, organizations can restore data to a previous state if the original data becomes compromised or unavailable.
- Backup of important files:
  - To ensure data resilience, it is recommended to have at least three copies of a file. This redundancy minimizes the risk of data loss. The copies should be stored on two different media, such as hard drives, tapes, or cloud storage, to protect against media failures.
  - Setup of backups: It is crucial to regularly back up data according to a defined schedule. This ensures that recent changes and updates are included in the backup. Additionally, testing the data recovery process periodically helps validate the integrity and effectiveness of the backup system.
- Protect sensitive data: To maintain the confidentiality and integrity of sensitive data, several security measures can be implemented:
A. Aspects of access security refer to various factors that need to be considered when ensuring the security of access to information, systems, or resources:
- Privacy: Privacy involves protecting sensitive information from unauthorized access or disclosure. It ensures that only authorized individuals or entities can access and view sensitive data. Privacy measures often involve access controls, encryption, and secure communication protocols.
- Access controls: Granting access to authorized individuals: Limiting access to sensitive data to only authorized personnel reduces the risk of unauthorized use or disclosure. Granting access to at least two individuals helps ensure continuity even if one person is unavailable or compromised.
- Access rights and role assignment are important aspects of access control in an organization's data handling practices:
- Access rights: Access rights refer to the permissions and privileges granted to individuals or groups to access specific resources or perform certain actions within a system or organization. These rights define what actions a user can perform, what data they can view or modify, and what functionalities they can access. Access rights are typically defined and managed through access control mechanisms.
- Role assignment: Role assignment involves assigning specific roles or responsibilities to individuals within an organization. Roles are defined based on job functions, responsibilities, and hierarchical levels. Each role is associated with a set of access rights that are necessary to carry out the duties and tasks associated with that role.
- When it comes to access rights and role assignment, the following principles are typically followed:
- Principle of least privilege: This principle states that individuals should be granted the minimum access rights necessary to perform their job functions effectively. Assigning excessive privileges increases the risk of unauthorized access or misuse of data. By adhering to the principle of least privilege, organizations reduce the attack surface and limit potential damage in case of a security breach.
- Role-based access control (RBAC): RBAC is a widely used approach for access control. It involves assigning roles to individuals and associating access rights with those roles. This simplifies the management of access rights as permissions are granted to roles rather than to individual users. Role assignment is based on the individual's job responsibilities, ensuring that access rights are aligned with their needs.
- Regular review and updates: Access rights and role assignments should be periodically reviewed and updated to reflect changes in job roles, responsibilities, or organizational structure. When individuals change positions or leave the organization, their access rights should be revoked or modified accordingly. Similarly, new employees should be assigned appropriate roles and access rights based on their job requirements.
- Hardware security:
  - Implement physical protection measures such as lockable rooms, safes, or data trustees.
  - Safeguard the physical infrastructure where the data is stored, such as servers or data centers, by using access controls, surveillance systems, and environmental controls (e.g., temperature and humidity monitoring).
- File encryption: Use encryption techniques to protect the contents of files during storage and transmission. Encryption ensures that even if unauthorized individuals gain access to the data, it remains unreadable and unusable without the encryption key.
  - Utilize automatic encryption options like FileVault, BitLocker, or dm-crypt.
- Password security: Enforcing strong password policies, such as using complex passwords, regularly changing them, and avoiding password reuse, helps prevent unauthorized access to the data.
  - Use strong passwords with at least eight characters, including upper and lower case letters, special characters, and numbers.
  - Avoid sequential characters on the keyboard and dictionary words.
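The password rules listed above can be checked automatically. The following is a minimal sketch, not a complete password policy: the function name `password_issues` and the short `SEQUENCES` list are hypothetical choices for illustration, and the dictionary-word rule is not covered because it would need a word list.

```python
import string

# Hypothetical list of keyboard/alphabet runs to reject; extend as needed.
SEQUENCES = ("qwerty", "asdf", "12345", "abcde")


def password_issues(password: str) -> list[str]:
    """Return the rules from the list above that the password violates."""
    issues = []
    if len(password) < 8:
        issues.append("shorter than eight characters")
    if not any(c.islower() for c in password):
        issues.append("no lower case letter")
    if not any(c.isupper() for c in password):
        issues.append("no upper case letter")
    if not any(c.isdigit() for c in password):
        issues.append("no number")
    if not any(c in string.punctuation for c in password):
        issues.append("no special character")
    lowered = password.lower()
    if any(seq in lowered for seq in SEQUENCES):
        issues.append("contains a keyboard sequence")
    return issues
```

An empty list means that none of the checked rules is violated. It does not guarantee that the password is strong.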
B. Integrity ensures that data remains accurate, complete, and unaltered during storage, processing, or transmission. It involves preventing unauthorized modifications, deletions, or additions to data. Techniques such as checksums, digital signatures, and access controls help maintain data integrity.
- Availability: Availability ensures that information and resources are accessible and usable when needed by authorized users. It involves preventing or mitigating disruptions or unauthorized denial of service. Measures like redundancy, backup systems, disaster recovery plans, and network monitoring help maintain availability.
- Controllability: Controllability refers to the ability to manage and control access to information or resources. It involves defining and enforcing access control policies, granting appropriate privileges to users, and monitoring and auditing access activities. Controllability measures include authentication mechanisms, authorization frameworks, and security monitoring systems.
- Archiving refers to the process of creating backups of selected data and storing it for long-term retention. It involves preserving important information, such as final versions of documents or records, in a secure and accessible manner. Archiving serves multiple purposes, including compliance with legal and regulatory requirements, preservation of historical data, and ensuring the availability of critical information for future reference. When archiving data, it is crucial to ensure its integrity. This means that the data remains unchanged and tamper-proof during the archival process and throughout its storage. Techniques such as checksums or digital signatures can be used to verify the integrity of archived data, providing assurance that it has not been altered or corrupted. An essential aspect of archiving is the ability to search and retrieve archived data efficiently. This is particularly important when there is a need to access specific information within the archived data. Implementing effective indexing, metadata tagging, and search capabilities enables users to locate and retrieve the required data without having to sift through vast amounts of archived content manually.
Requirements for long-term archives are:
- Technical requirements:
- Seek trustworthy long-term archives with seals such as CoreTrustSeal, nestor seal, or DIN 31644.
- Consider expenses, data availability, and longevity of the service provider.
- Sustainable file formats:
--> tables
Depending on your field of expertise, there will be different data types relevant to your work. In your DMP, you can specify individual datasets and name their identifiers, file size, origin and so forth. Simulations and software programmes are a special case. Here, not only the resulting datasets are of scientific value but also the programmes, algorithms and settings leading up to the data. ... The data types you choose to publish have consequences for your Research Data Management in regard to the "Accessible", "Interoperable" and "Re-Usable" Criteria of the FAIR Guidelines.
1. Accessible
--> Make sure that (meta)data are long-term accessible
The amount of storage space needed for long-term, safe storage may vary greatly between data types. For example, if your research data consist of high-resolution images of potential wind turbine sites, the storage space needed for backups and repositories will be much higher than for a handful of .csv files with projected power loads.
2. Interoperable
--> Use tools and software that are FAIR themselves
Remember the idea of "as open as possible, as closed as necessary": Strive to choose, where possible, programs and data types that are open and FAIR. For example, do not describe your processing details in a .pdf file, but rather use a .txt file that can be read and edited with various programs.
3. Re-Usable
--> Know and meet community standards
As the expert in your own field, you are the one to decide here: which data types are most likely to be found, easily re-used and shared in your scientific community? If these are "closed", you could provide two files: one that adheres to the common standards and a more open alternative that promotes the idea of open science.
Exercise:
Make a list of the data types commonly used in your working group and their "FAIR value". Use the criteria in the table below.
FAIR value | Machine Readability | Human Readability | Long-term Stability | Metadata | Example |
---|---|---|---|---|---|
very good | with common open software | yes and without special software | normed standard | completely preserved | .csv |
good | with common and well-documented software | compressed with standard procedures, but practically yes | long-term or widely established | technical details are provided | .xlsx |
moderate | proprietary standard format | with open software (reliably?) convertible to higher class | relatively new format | some important ones (e.g. units) are included | MATLAB file |
bad | self-developed reading software | no | newly developed | no information | |
Source: translated and adapted from forschungsdaten.info, CC-BY 4.0
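For the exercise above, a small script can give a first overview of the file types in a working group. This is only a sketch: the ratings are taken from the "Example" column of the table, `.mat` as the extension of a MATLAB file is an assumption, and the file names are invented.

```python
from pathlib import Path

# Mapping taken from the "Example" column of the table above.
# ".mat" as the extension for a MATLAB file is an assumption.
FAIR_VALUE = {".csv": "very good", ".xlsx": "good", ".mat": "moderate"}


def fair_overview(filenames):
    """Group file names by FAIR value; unknown types are kept for manual review."""
    overview = {}
    for name in filenames:
        suffix = Path(name).suffix.lower()
        value = FAIR_VALUE.get(suffix, "not yet rated")
        overview.setdefault(value, []).append(name)
    return overview


# Invented example file names
files = ["loads_2023.csv", "turbine_sites.XLSX", "model_run.mat", "notes.docx"]
for value, names in fair_overview(files).items():
    print(f"{value}: {', '.join(names)}")
```

File types that end up under "not yet rated" still have to be assessed by hand against the criteria in the table.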
DMP Task:
If you already know datasets that you will be (re-)using or producing, enter them in your DMP as thoroughly as you can at this point.
If you do not know your datasets yet, enter one or two "dummy" datasets and follow the respective questionnaire to get an idea of the amount of detail required in the DMP.
Research programmes and simulation software are absolutely essential in energy research. Some researchers are very reluctant to share this part of their research work, sometimes because of economic considerations.
The German Research Foundation (DFG) addresses software development in a research context in its Guidelines for Safeguarding Good Research Practice, stating that "software programmed by researchers is made publicly available along with the source code". The source code must be persistent, citable and documented. Certain individual exceptions are possible, though.
So, whether you plan to publish some, all or none of your self-developed software, you have to document well, plan ahead and ensure a good access strategy in your development team.
- Versioning: Various cloud programming platforms such as GitLab ensure a consistent preservation of all of your software versions, make contributions transparent and simplify collaboration.
- Documentation: Programmes should be documented and commented, e.g. by stating origin, purpose and scope of a programme, limitations of variables and ideally a short manual.
- Publication: With journal publication of an article, corresponding software should be cited with version number and a persistent identifier such as a DOI. Choose a software archive with version control. Sometimes, timely open access is not possible. In this case, at least the algorithm used must be outlined completely. If the source code is not published, state the reason and potentially the time period until release.
With regard to simulation, we are presented with some additional challenges.
The longevity of simulation outputs is harder to assess than that of observational data. In a paper on Open Science in Earth System Modelling, Mullendore et al. (2021) identify the following problems:
- Simulations can generate massive output.
- Interdependencies between hardware and software can limit the portability of models and make the long-term accessibility of their output problematic. Models in many cases involve interconnections between community models, open source software components, and custom code written to investigate particular scientific questions.
- A lack of standardization and documentation for models and their output makes it difficult to achieve the goals of open and FAIR data initiatives.
The following strategies may be applied when working with simulations:
- Analyze your project: does it focus on knowledge or data production? "Most scientific research projects are undertaken with the main goal of knowledge production (e.g., running an experiment with the goal of publishing research findings). Other projects are designed and undertaken with the specific goal of data production, that is, they produce data with the intention that those data will be used by others to support knowledge production research." (Mullendore et al. (2021))
- Publish all elements relevant to the simulation, not just source code (see Figure 3).
- Publish enough output data to evaluate and re-use your findings, but not all simulation runs
- Some software may be released openly while others remain restricted due to security or proprietary concerns. In this case, the documentation should provide enough information to reproduce the process.
Figure 3: Mullendore et al. (2021): Data and software elements to be preserved and shared by all projects.
Checklist for RDM in simulation and software development:
[[x]] Train yourself and your team in software development quality
[[x]] State the purpose of each programme in a few words
[[x]] Keep your software up to date with quality management
[[x]] Keep the intention of every function clear (by naming or comments)
[[x]] Choose an appropriate license
[[x]] Establish quality management in your simulation process
DMP Task:
If you have self-developed programmes in your project, enter them as a separate dataset in your DMP, name all contributors and the purpose of each programme.
In this chapter, we will discuss how to handle and store your research data files.
You should develop a back-up policy early on in your project, since later down the road it is very hard to re-structure established processes.
- Establish a folder structure that is intuitive and simple.
A handy rule is the 3-2-1 rule:
Figure 4: I. Lang / edited by E. Böker: 3-2-1 Backupregel. Edited by A. Ahrens: Translation. Licensed under CC-BY 4.0
--{{2}}-- The 3-2-1 rule states that you should create at least three copies of your data, store them on two different kinds of storage medium and make sure that one of the copies is located off-site. This way, your data is protected against natural disaster, which will most likely only hit one location at a time, against accidental deletion and against deterioration of one kind of storage medium. For example, CDs have quite a long lifespan, but CD drives as a reading device have become increasingly scarce.
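Whether a set of backup copies meets the 3-2-1 rule can be checked with a short script. The sketch below is an illustration under assumptions: the `Copy` record, the medium labels and the function `check_3_2_1` are hypothetical names. It uses SHA-256 checksums from Python's standard `hashlib` module to confirm that all copies are still identical.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class Copy:
    path: Path
    medium: str      # e.g. "ssd", "tape", "cloud" (labels are assumptions)
    off_site: bool


def check_3_2_1(copies: list[Copy]) -> list[str]:
    """Return a list of problems; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than two kinds of storage medium")
    if not any(c.off_site for c in copies):
        problems.append("no off-site copy")
    if len({sha256_of(c.path) for c in copies}) > 1:
        problems.append("copies differ (checksum mismatch)")
    return problems
```

Running such a check on a schedule also serves as a simple test of whether the backups are still readable.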
There are various commercial providers for back-up solutions. Consider the following questions before choosing one:
[[x]] Does it carry a seal for trustworthy long-term archives (e.g. CoreTrustSeal, nestor seal, DIN 31644)?
[[x]] Does it fulfill your technical requirements?
[[x]] Can you cover the expenses?
[[x]] Is the data readily available to you?
[[x]] Does the service provider guarantee long-term storage?
DMP Task:
Which backup solutions does your institution offer? If they do not meet your expectations, calculate additional costs in order to include them in your project proposal.
You might want to limit access to your data to prevent data loss or protect sensitive data - during the research process or before publishing.
In the process, review your folder structure and create groups according to their access permissions, e.g.:
Admin - Project Lead - Researchers - Students
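Such groups can be expressed as a simple role-based permission table. The following sketch uses the four example groups as roles. The permission names, the users and the function `is_allowed` are hypothetical, and a real project would rely on the access control of its storage system rather than on a script like this.

```python
# Roles follow the example groups above; the permission names are
# hypothetical and would be replaced by your project's own.
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "delete", "manage_access"},
    "project_lead": {"read", "write", "delete"},
    "researcher": {"read", "write"},
    "student": {"read"},
}

USER_ROLES = {"alice": "project_lead", "bob": "student"}  # example users


def is_allowed(user: str, action: str) -> bool:
    """Grant an action only if the user's role includes it (deny by default)."""
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())
```

Unknown users and unknown actions are refused, which keeps the table in line with the principle of least privilege.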
You might also consider encrypting sensitive project data.
Setting up and administering access permissions requires some technical skill, so you should factor this into your time and financial planning.
If you want to limit access of the published data, this does not contradict the FAIR guidelines, but you should make the access scheme transparent (e.g. by using appropriate licenses) and give access to as much detail as possible.
--> muss noch übersetzt werden
Well-structured metadata are of high value for re-users. Therefore, all four of the FAIR criteria stress their importance.
Metadata in a research context contain structured information about research outputs, for example datasets or code. They are stored together with, or linked to, the data they describe.
Different types of metadata serve different functions:
- Bibliographic metadata such as title, authors, description or keywords enable the citation of data and code and help with findability and with narrowing down the topic.
- Administrative metadata on file types, locations, access rights and licenses help with managing the data and preserving them in the long term.
- Process metadata describe the steps and actions, together with the methods and tools used, that were applied to create and process the data.
- Content-related or descriptive metadata can be structured very differently depending on the discipline and give additional information on the content and origin of the data.
While bibliographic and administrative metadata can be standardized across disciplines, metadata on the process and content of research outputs often have a very discipline-specific structure and content. It is precisely this discipline-specific information that is often decisive for the findability and traceability of research data. Accordingly, there are many different metadata standards, each specifying a structure for the relevant information in a field or discipline.
Choosing the right standard

A widely used standard for the bibliographic description of research data is the DataCite metadata schema for registering DOIs (digital object identifiers). It specifies which information about a dataset is mandatory (e.g. author, title), which is recommended (e.g. subject area, description) and which is optional (e.g. funding, usage rights). These and further metadata are made available in XML format for interoperable use.
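The distinction between mandatory, recommended and optional fields can be illustrated with a small sketch. Note the assumptions: the field lists below contain only the examples named above, and the element names (`resource`, `creator`, `subject`, ...) are simplified for illustration. The output is not a valid DataCite record; consult the current DataCite schema documentation for the full set of properties.

```python
import xml.etree.ElementTree as ET

# Simplified field lists, taken from the examples in the text (assumption).
MANDATORY = ("creator", "title")
RECOMMENDED = ("subject", "description")
OPTIONAL = ("fundingReference", "rights")


def missing_fields(record: dict) -> dict:
    """List mandatory and recommended fields that are absent or empty."""
    return {
        "mandatory": [f for f in MANDATORY if not record.get(f)],
        "recommended": [f for f in RECOMMENDED if not record.get(f)],
    }


def to_xml(record: dict) -> str:
    """Serialize a record to a flat XML string; refuse incomplete records."""
    gaps = missing_fields(record)
    if gaps["mandatory"]:
        raise ValueError(f"Missing mandatory fields: {gaps['mandatory']}")
    root = ET.Element("resource")
    for field in MANDATORY + RECOMMENDED + OPTIONAL:
        if record.get(field):
            ET.SubElement(root, field).text = str(record[field])
    return ET.tostring(root, encoding="unicode")


# Hypothetical example record.
record = {
    "creator": "Doe, Jane",
    "title": "Hourly load profiles (example dataset)",
    "description": "Hypothetical example record",
}
print(missing_fields(record))
print(to_xml(record))
```

Missing mandatory fields stop the export, while missing recommended fields are only reported, which mirrors the three obligation levels described above.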
A standard for administrative metadata in long-term archiving is PREMIS. With this standard, objects can be described in relation to agents, events and rights.
METS (Metadata Encoding & Transmission Standard), by contrast, is an example of a container standard. It specifies a structure of seven sections (header, descriptive metadata, administrative metadata, file section, structural map, structural links and behavior), and a different metadata standard can be chosen for the content of each section.
A large number of standards exist for discipline-specific metadata. Overviews of existing standards are provided by the Metadata Standards Catalog of the RDA, the site of the RDA Metadata Standards Directory Working Group, FAIRsharing.org and the DCC (Digital Curation Centre).
While XML-based metadata schemas specify a structure, i.e. define which information must, should or can be given in which format, vocabularies and terminologies help to standardize the content. These range from controlled word lists, which unify incorrect or variant spellings of concepts, through taxonomies and thesauri, which contain broader and narrower terms as well as synonyms for concepts, to ontologies, which model the properties of concepts and the relations between them. The Basic Register of Thesauri, Ontologies and Classifications (BARTOC) gives an overview of existing terminologies. Terminology services allow you to search for terms, often within a specific discipline.
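The simplest of these tools, a controlled word list, can be sketched in a few lines. The vocabulary below is a hypothetical example and not taken from any published terminology; the function name `normalise_keyword` is likewise an assumption.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted variants.
CONTROLLED_TERMS = {
    "photovoltaics": {"pv", "photovoltaic", "photovoltaics", "solar pv"},
    "wind energy": {"wind", "wind power", "wind energy"},
}

# Reverse lookup table: variant -> preferred term.
LOOKUP = {
    variant: term
    for term, variants in CONTROLLED_TERMS.items()
    for variant in variants
}


def normalise_keyword(raw: str) -> str:
    """Map a free-text keyword to its preferred term.

    Case and surplus whitespace are ignored; unknown keywords are rejected
    so that they can be reviewed instead of silently entering the metadata.
    """
    key = " ".join(raw.lower().split())
    try:
        return LOOKUP[key]
    except KeyError:
        raise ValueError(f"'{raw}' is not in the controlled vocabulary") from None


print(normalise_keyword("  Solar  PV "))
print(normalise_keyword("Wind Power"))
```

Taxonomies, thesauri and ontologies extend this idea with hierarchies and relations, which is why established terminologies are usually preferable to a self-made list.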
The concrete implementation in everyday research depends heavily on the processes, environments and tools used. The basic steps, however, are:
- Identify relevant metadata: Which information is needed to understand a research result? By which criteria would you like to search or filter the data?
- Define a collection process: At what point in the research process, and in what form, is the identified information available? Can the information be extracted or generated automatically from existing sources? In what form can the information be documented practicably during the research process? How can it be linked meaningfully to the data? Which tools are available for this?
- Define a metadata format: How can the metadata be stored in as structured a form as possible? How can typos and variant spellings be avoided? Are there controlled vocabularies or ontologies that can be used? Where should the data ultimately end up? Does the target system (repository, archive) impose requirements on the content and format of the metadata? How can the metadata add value for data management even while the work is in progress?
- Test and improve the process: Which parts of the documentation can be automated? Can templates save work?
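As an illustration of automated collection, technical metadata can be read directly from a file and stored next to it. The following is a minimal sketch under stated assumptions: the sidecar naming convention (`<file>.meta.json`), the field names and the functions `technical_metadata` and `write_sidecar` are hypothetical and not prescribed by any standard.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def technical_metadata(path: Path) -> dict:
    """Collect metadata that can be derived from the file itself."""
    stat = path.stat()
    return {
        "file_name": path.name,
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }


def write_sidecar(path: Path, manual: dict) -> Path:
    """Write automatic and manually supplied metadata to a JSON sidecar file."""
    sidecar = path.with_name(path.name + ".meta.json")
    metadata = {**technical_metadata(path), **manual}
    sidecar.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar


# Demonstration with a temporary, hypothetical data file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "measurement.csv"
    data_file.write_bytes(b"time,power_kw\n0,1.5\n")
    sidecar_file = write_sidecar(data_file, {"creator": "Jane Doe"})
    print(sidecar_file.read_text(encoding="utf-8"))
```

Only the information that cannot be derived from the file, such as the creator or the method, has to be entered by hand.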
For your own data, you should:
- develop a consistent naming scheme for all files,
- describe your variables thoroughly,
- describe the processing steps, e.g. schematically or in a manual.
Source: forschungsdaten.info, translated by A.Ahrens, licensed under CC-BY 4.0.
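A consistent naming scheme for your files is easiest to maintain if it can be checked automatically. The sketch below assumes a hypothetical scheme of the form `<project>_<YYYY-MM-DD>_<description>_v<NN>.<extension>`; adapt the pattern to whatever convention your group agrees on.

```python
import re
from typing import Optional

# Hypothetical naming scheme: project_YYYY-MM-DD_description_vNN.extension
FILE_NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<description>[a-z0-9-]+)_"
    r"v(?P<version>\d{2})"
    r"\.(?P<extension>[a-z0-9]+)$"
)


def parse_file_name(name: str) -> Optional[dict]:
    """Split a file name into its parts, or return None if it breaks the scheme.

    Only the format of the date is checked, not whether it is a real date.
    """
    match = FILE_NAME_PATTERN.match(name)
    return match.groupdict() if match else None


print(parse_file_name("gridsim_2023-05-17_load-profile_v02.csv"))
print(parse_file_name("Final version NEW.xlsx"))
```

Because the parts are returned as a dictionary, the same function can also feed the project, date and version into your metadata.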
As a last step, you need to decide which data should be permanently archived. As discussed in the section "Simulation", choose carefully which data to archive in a repository and which to discard. The following questions can guide your decision:
- Are the data the foundation of an article?
- Can the data be reproduced with reasonable effort?
- Is a re-use likely?
- How high is the data quality?
- Is the dataset unique?
- How much storage volume is needed?
The choice of repository is often determined by your institution or funding agency. In-house repositories might be mandatory for research data, so find out which policies apply to you and your project.
If you are free to choose a repository, you should use one that is common in your field or discipline. If you are unsure, go to the Registry of Research Data Repositories (re3data), where a wide range of platforms is listed.
If your license terms allow, you may upload your data to multiple platforms. Make sure to use PIDs and cross-references to link the data to your published article.
Open Energy Platform

For energy-related topics, we suggest the Open Energy Platform. The platform helps you to set up a thorough ontology and provides tutorials to prevent confusion and mistakes.
- Upload via GitHub
- Data comparison with other projects
- Take part in developing a domain ontology for energy system sciences
- Data review service
Please visit https://openenergy-platform.org/ for detailed descriptions of the upload process.
Exercise:
Choose a repository for your data and create an account.
DMP Task:
Look through your DMP to find open questions that you have not covered so far and fill any gaps you can find.
By the way - did you find all clues and solve all puzzles? Then you can enter the resulting letters in the box below to see if you are a DMP-Champion!
- Kindling, Maxi; Schirmbacher, Peter; Simukovic, Elena: Forschungsdatenmanagement an Hochschulen: das Beispiel der Humboldt-Universität zu Berlin. LIBREAS. Library Ideas, 23 (2013). https://doi.org/10.18452/9041
- Biernacka, Katarzyna; Bierwirth, Maik; Buchholz, Petra; Dolzycka, Dominika; Helbig, Kerstin; Neumann, Janna; Odebrecht, Carolin; Wiljes, Cord; Wuttke, Ulrike: Train-the-Trainer Concept on Research Data Management. Version 3.0. Berlin, 2020. https://doi.org/10.5281/zenodo.4071471
- FAIR Guiding Principles: https://www.go-fair.org/
- UK Data Service: https://ukdataservice.ac.uk/learning-hub/research-data-management/
- forschungsdaten.info: https://forschungsdaten.info/praxis-kompakt/glossar/#c269829
- Mullendore, G. L.; Mayernik, M. S.; Schuster, D. C.: Open Science Expectations for Simulation-Based Research. Front. Clim. 3:763420 (2021). https://doi.org/10.3389/fclim.2021.763420
This work is licensed under a Creative Commons Attribution 4.0 International License.