Project: Pilot Large Data Support Service #178

Open
19 of 25 tasks
cmbz opened this issue Feb 9, 2024 · 10 comments
Labels: Dataverse Project (issues related to Dataverse Project software); GREI 5 Use Cases; Harvard Dataverse (issues related to Harvard Dataverse Repository); Project: Large Data Support Pilot (pilot of large data support services using NESE resources)

Comments

cmbz commented Feb 9, 2024

Overview

Pilot Harvard Dataverse large data support services using NESE tape resources for several datasets from Harvard affiliates.

Tasks

  • Identify prospective data collections
  • Confirm pilot participants
  • Coordinate with pilot collection owners to estimate collection size and curation needs
  • Create Harvard Dataverse collections
  • Coordinate with NESE support staff for tape provisioning
  • Coordinate with data collection owners, NESE support, and Dataverse team to upload data
  • Coordinate with IQSS finance re. costs
  • Assess pilot and propose workflow improvements

Process Development and Management

  • Develop an intake process including basic and extended consultation process
  • Develop and launch RT queue, including service attributes (will map to service offerings)
  • Develop large data collection intake form @sbarbosadataverse
  • Define what is included in the Large Data Technical Support service component
  • Define what is included in the Large Data Administration service component
  • Define what is included in the Large Data Monitoring service component
  • End-user instructions for NESE data access, add to Dataverse documentation in Guide and HDV support website
  • Document large data curation services process (see hdv-curation issues on large data)

Team

Pilot Participants

School of Engineering and Applied Sciences (Harvard SEAS)

Terabyte-Scale Dataset for Partial Discharge Detection in Covered Conductors via Contact Galvanic Method

  • https://db.cs.vsb.cz/datacolls/pdp2020/
  • https://help.hmdc.harvard.edu/Ticket/Display.html?id=366158&results=56072ee681951385a6015bad30361ae4

The Oregon-Massachusetts Mammography Database (OMAMA-DB)

See also: https://fly.cs.umb.edu/omama/ and https://github.com/IQSS/dataverse-HDV-Curation/issues/443

  • Size: 8 TB
  • Contact: Daniel Haehn, SEAS & UMass Boston

GEOS-Chem 1 and 10 year Benchmark data

Note: recontact early next year. The group lost staff and cannot devote time to this effort at present.
See also:
https://help.hmdc.harvard.edu/Ticket/Display.html?id=361614
https://github.com/IQSS/dataverse-HDV-Curation/issues/456

DrivAerNet: A Parametric Car Dataset for Data-driven Aerodynamic Design and Graph-Based Drag Prediction

See also: https://github.com/IQSS/dataverse-HDV-Curation/issues/444

  • Moving their ‘streamlined’ dataset (<1 TB) to the DeCoDE lab’s dataverse
  • Size: roughly under 1 TB
  • File sizes: largest 1.6 GB, smallest 400 bytes
  • Total number of files: uncertain; the user quoted 10^6 but was not sure of the figure
  • File types: .tar.gz, .txt, .vtk, .stl, .sh; data will be packed in .zip format
  • Full 16 TB dataset: by mid to end of May
  • Author needs to repackage one dataset to meet packaging standards

Under discussion for support

  • https://help.hmdc.harvard.edu/Ticket/Display.html?id=362556
  • https://help.hmdc.harvard.edu/Ticket/Display.html?id=358471

Related RT Tickets

Issues

Related

Resources

cmbz added the labels Project: Large Data Support Pilot, Harvard Dataverse, and Dataverse Project on Mar 2, 2024

cmbz commented Apr 1, 2024

Status: March 2024

Closed

cmbz commented Apr 10, 2024

Status: April 2024

Improvements made to Globus: "The PR fixes the issue with multifile Globus transfers out/downloads not working for draft datasets. It also improves handling of cases where ineligible files are selected for download or Globus transfer by only showing the download/transfer mechanisms that will work (on some files) given the files in the dataset and by improving the UI messages to indicate that files may not be eligible either because the user doesn't have permission (restricted or embargoed), or because the files can't be downloaded/transferred (i.e. files not in a Globus store when the user tries a Globus transfer or files in a Globus store that doesn't support normal downloads when the user selects download.)"
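
For reference, the eligibility rules described in the quoted PR text can be sketched roughly as follows. This is only an illustration of the logic as worded above, not the Dataverse implementation: the `FileInfo` fields, the function names, and the mechanism labels `"download"` and `"globus"` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical per-file flags; not the Dataverse data model."""
    name: str
    user_has_permission: bool    # False if restricted or embargoed for this user
    in_globus_store: bool        # file lives in a Globus-accessible store
    store_allows_download: bool  # store supports normal downloads


def can_download(f: FileInfo) -> bool:
    return f.user_has_permission and f.store_allows_download


def can_globus_transfer(f: FileInfo) -> bool:
    return f.user_has_permission and f.in_globus_store


def available_mechanisms(files: Iterable[FileInfo]) -> List[str]:
    """Offer only the mechanisms that will work for at least one file."""
    files = list(files)
    mechanisms = []
    if any(can_download(f) for f in files):
        mechanisms.append("download")
    if any(can_globus_transfer(f) for f in files):
        mechanisms.append("globus")
    return mechanisms


def ineligibility_reason(f: FileInfo, mechanism: str) -> Optional[str]:
    """Return a message explaining why a file is skipped, or None if eligible."""
    if not f.user_has_permission:
        return "no permission (restricted or embargoed)"
    if mechanism == "globus" and not f.in_globus_store:
        return "not in a Globus store"
    if mechanism == "download" and not f.store_allows_download:
        return "store does not support normal downloads"
    return None
```

Under these assumptions, a dataset whose files all sit in a Globus store without normal download support would offer only the Globus transfer option, and a restricted file would be reported as ineligible under either mechanism.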

sbarbosadataverse commented May 21, 2024

Status: May 2024

  • Added a new Process Development and Management section to this issue
  • Tested Globus/NESE download for the OMAMA dataset. The download succeeded, but the guestbook is not recording downloads, possibly because the dataset is still in "draft"
  • Leonid created a demo dataset for testing Globus access for non-Harvard affiliates: https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/URDDBC
  • Ceilyn revised documentation for large data services, the budget planner, and internal HDV service components
  • Added a new section to record groups that are interested but not yet confirmed for the pilot

cmbz commented May 28, 2024

Status: June 2024

sbarbosadataverse commented Jul 8, 2024

Status: July 2024

cmbz commented Jul 30, 2024

Status: August 2024

cmbz mentioned this issue Sep 11, 2024 (9 tasks)

sbarbosadataverse commented Oct 7, 2024

Status: September 2024

cmbz commented Oct 28, 2024

Status: October 2024

  • A quote was prepared for the Bertarelli Collections at standard pricing
  • An MIT researcher and the Dataverse team tested and troubleshot the Globus upload instructions
  • A draft of the Large Data Services intake form was developed in Qualtrics and will be tested in November

cmbz commented Nov 7, 2024

Status: November 2024

  • Published: DrivAerNet: A Parametric Car Dataset for Data-driven Aerodynamic Design and Graph-Based Drag Prediction. Discussed how to invoice MIT content (as affiliates) and the invoice process for this collaboration. The authors may need to rework some data packaging, as one of their datasets did not follow the required standards.
  • Our IT/RT Support team has completed testing uploads and downloads via Globus. The scope of their work in assisting with these services has been entered into the services table, as requested.
  • Large Data Services intake form was finalized

@sbarbosadataverse

Status: December 2024

  • Initiated Bertarelli large data collection support
