Commit ab47848

Merge branch 'develop' into 6783-s3-tests #6783

pdurbin committed Dec 5, 2023
2 parents b9f4891 + e3e122a

Showing 55 changed files with 2,062 additions and 661 deletions.
3 changes: 3 additions & 0 deletions doc/release-notes/8549-collection-quotas.md
@@ -0,0 +1,3 @@
This release adds support for defining storage size quotas for collections. Please see the API guide for details. This is an experimental feature that has not yet been used in production on any real-life Dataverse instance, but we are planning to try it out at Harvard/IQSS.

Please note that this release includes a database update (via a Flyway script) that will calculate the storage sizes of all the existing datasets and collections on the first deployment. On a large production database with tens of thousands of datasets, this may add a couple of extra minutes to the initial deployment of 6.1.

15 changes: 15 additions & 0 deletions doc/release-notes/8760-bagit.md
@@ -0,0 +1,15 @@
For BagIt export, it is now possible to configure the following information in bag-info.txt...

Source-Organization: Harvard Dataverse
Organization-Address: 1737 Cambridge Street, Cambridge, MA, USA
Organization-Email: [email protected]

... using new JVM/MPCONFIG options:

- dataverse.bagit.sourceorg.name
- dataverse.bagit.sourceorg.address
- dataverse.bagit.sourceorg.email

Previously, customization was possible by editing `Bundle.properties`, but this is no longer supported.

For details, see https://dataverse-guide--10122.org.readthedocs.build/en/10122/installation/config.html#bag-info-txt
3 changes: 3 additions & 0 deletions doc/release-notes/8760-download-tmp-file.md
@@ -0,0 +1,3 @@
A new API has been added for testing purposes that allows files to be downloaded from /tmp.

See the "Download File from /tmp" section of the API Guide.
19 changes: 19 additions & 0 deletions doc/sphinx-guides/source/admin/collectionquotas.rst
@@ -0,0 +1,19 @@

Storage Quotas for Collections
==============================

Please note that this is a new and still experimental feature (as of the Dataverse v6.1 release).

Instance admins can now define storage quota limits for specific collections. These limits can be set, changed, and/or deleted via the provided APIs (please see the :ref:`collection-storage-quotas` section of the :doc:`/api/native-api` guide). The read version of the API is available to individual collection admins (i.e., a collection owner can check the quota configured for their collection), but only superusers can set, change, or disable storage quotas.

Storage quotas are *inherited* by subcollections. In other words, when a storage quota is set on a specific collection, it applies to all the datasets immediately under it and in its sub-collections, unless different quotas are defined there, and so on. Each file added to any dataset in that hierarchy counts toward the quota limit defined for the top collection. A storage quota defined on a child sub-collection overrides whatever quota may be defined on the parent or inherited from an ancestor.

For example, suppose a collection ``A`` has its storage quota set to 10GB and has 3 sub-collections, ``B``, ``C``, and ``D``. Users can keep uploading files into the datasets anywhere in this hierarchy until their combined size reaches 10GB. However, if an admin has reason to limit one of the sub-collections, say ``B``, to 3GB, that quota can be set there explicitly. This both limits the growth of ``B`` to 3GB and *guarantees* that allocation to it; i.e., contributors to collection ``B`` will be able to keep adding data until the 3GB limit is reached, even after the parent collection ``A`` reaches its combined 10GB limit (at which point ``A`` and all its subcollections except ``B`` become read-only).

We do not yet know whether this will be a popular or needed use case, i.e., a child collection quota that differs from the quota it inherits from a parent. It is likely that for many instances it will be sufficient to define quotas for collections and have them apply to all the child objects underneath. We will examine the response to this feature and consider adjustments to this scheme based on it. We are already considering other types of quotas, such as limits by user or by specific storage volume.

Please note that only the sizes of the main datafiles and the archival tab-delimited format versions, as produced by the ingest process, count toward the limits. Automatically generated "auxiliary" files, such as rescaled image thumbnails and dataset metadata exports, do not.

When quotas are set and enforced, users are informed of the remaining storage allocation on the file upload page, together with other upload and processing limits.

Part of the new and experimental nature of this feature is that, despite our best efforts to test it prior to the release, we do not yet know for a fact how well it will function in real life on a very busy production system. One specific concern is having to update the recorded storage use for every parent collection of a given dataset whenever new files are added. This includes updating the combined size of the root, top-level collection, which must be updated after *every* file upload. In the unlikely case that these updates start causing problems with race conditions and database update conflicts, it is possible to disable them (and thus disable the storage quotas feature) by setting the :ref:`dataverse.storageuse.disable-storageuse-increments` JVM setting to true.
1 change: 1 addition & 0 deletions doc/sphinx-guides/source/admin/index.rst
@@ -27,6 +27,7 @@ This guide documents the functionality only available to superusers (such as "da
solr-search-index
ip-groups
mail-groups
collectionquotas
monitoring
reporting-tools-and-queries
maintenance
5 changes: 5 additions & 0 deletions doc/sphinx-guides/source/api/changelog.rst
@@ -8,6 +8,11 @@ API Changelog
6.1
---

New
~~~
- **/api/admin/clearThumbnailFailureFlag**: See :ref:`thumbnail_reset`.
- **/api/admin/downloadTmpFile**: See :ref:`download-file-from-tmp`.

Changes
~~~~~~~
- **/api/datasets/{id}/versions/{versionId}/citation**: This endpoint now accepts an optional boolean query parameter, "includeDeaccessioned", which, if enabled, causes the endpoint to consider deaccessioned versions when searching for versions to obtain the citation. See :ref:`get-citation` and the sketch below.
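
A hypothetical invocation of the new parameter (the server, dataset ID, and version below are placeholders):

.. code-block:: bash

  export SERVER_URL=https://demo.dataverse.org

  # Ask for the citation, considering deaccessioned versions as well
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/datasets/24/versions/1.0/citation?includeDeaccessioned=true"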
63 changes: 62 additions & 1 deletion doc/sphinx-guides/source/api/native-api.rst
@@ -754,6 +754,41 @@ The following attributes are supported:
* ``affiliation`` Affiliation
* ``filePIDsEnabled`` ("true" or "false") Restricted to use by superusers and only when the :ref:`:AllowEnablingFilePIDsPerCollection <:AllowEnablingFilePIDsPerCollection>` setting is true. Enables or disables registration of file-level PIDs in datasets within the collection (overriding the instance-wide setting).

.. _collection-storage-quotas:

Collection Storage Quotas
~~~~~~~~~~~~~~~~~~~~~~~~~

To check the storage quota configured for a collection:

.. code-block:: bash

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota"

This will output the storage quota allocated (in bytes), or a message indicating that the quota is not defined for the specific collection. The user identified by the API token must have the ``Manage`` permission on the collection.

To set or change the storage allocation quota for a collection:

.. code-block:: bash

  curl -X PUT -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota/$SIZE_IN_BYTES"

This API is superuser-only.

To delete a storage quota configured for a collection:

.. code-block:: bash

  curl -X DELETE -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota"

This API is superuser-only.

Use the ``/settings`` API to enable or disable the enforcement of the storage quotas defined across the instance via the following setting:

.. code-block:: bash

  curl -X PUT -d 'true' http://localhost:8080/api/admin/settings/:UseStorageQuotas
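
As a concrete worked example, here is a sketch of setting a 10 GB quota on a collection with the hypothetical alias ``myCollection`` and verifying it afterwards (token and server are placeholders):

.. code-block:: bash

  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  export SERVER_URL=https://demo.dataverse.org
  export ID=myCollection

  # 10 GB = 10 * 1024^3 = 10737418240 bytes
  curl -X PUT -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota/10737418240"

  # Confirm the new allocation
  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota"
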
Datasets
--------
@@ -5334,7 +5369,6 @@ A curl example allowing access to a dataset's metadata
Please see :ref:`dataverse.api.signature-secret` for the configuration option to add a shared secret, enabling extra
security.
.. _send-feedback:
Send Feedback To Contact(s)
@@ -5361,6 +5395,33 @@ A curl example using an ``ID``
Note that this call could be useful in coordinating with dataset authors (assuming they are also contacts) as an alternative/addition to the functionality provided by :ref:`return-a-dataset`.
.. _thumbnail_reset:
Reset Thumbnail Failure Flags
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If Dataverse attempts to create a thumbnail image for an image or PDF file and the attempt fails, Dataverse will set a flag for the file to avoid repeated attempts to generate the thumbnail.
For cases where the problem may have been temporary (or fixed in a later Dataverse release), the API calls below can be used to reset this flag for all files or for a given file.
.. code-block:: bash

  export SERVER_URL=https://demo.dataverse.org
  export FILE_ID=1234

  # Reset the failure flag for all files
  curl -X DELETE $SERVER_URL/api/admin/clearThumbnailFailureFlag

  # Reset the failure flag for a single file
  curl -X DELETE $SERVER_URL/api/admin/clearThumbnailFailureFlag/$FILE_ID
.. _download-file-from-tmp:
Download File from /tmp
~~~~~~~~~~~~~~~~~~~~~~~
As a superuser::

    GET /api/admin/downloadTmpFile?fullyQualifiedPathToFile=/tmp/foo.txt
Note that this API is probably only useful for testing.
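
A curl equivalent, following the export conventions used elsewhere in this guide (``/tmp/foo.txt`` is just an illustrative path):

.. code-block:: bash

  export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  export SERVER_URL=https://demo.dataverse.org

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/downloadTmpFile?fullyQualifiedPathToFile=/tmp/foo.txt"
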
MyData
------
53 changes: 53 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -1629,6 +1629,25 @@ The workflow id returned in this call (or available by doing a GET of /api/admin

Once these steps are taken, new publication requests will automatically trigger submission of an archival copy to the specified archiver, Chronopolis' DuraCloud component in this example. For Chronopolis, as when using the API, it is currently the admin's responsibility to snap-shot the DuraCloud space and monitor the result. Failure of the workflow (e.g., if DuraCloud is unavailable, the configuration is wrong, or the space for this dataset already exists due to a prior publication action or use of the API) will create a failure message but will not affect publication itself.

.. _bag-info.txt:

Configuring bag-info.txt
++++++++++++++++++++++++

Out of the box, placeholder values like the ones below will be placed in bag-info.txt:

.. code-block:: text

  Source-Organization: Dataverse Installation (<Site Url>)
  Organization-Address: <Full address>
  Organization-Email: <Email address>

To customize these values for your institution, use the following JVM options:

- :ref:`dataverse.bagit.sourceorg.name`
- :ref:`dataverse.bagit.sourceorg.address`
- :ref:`dataverse.bagit.sourceorg.email`
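
For example, a sketch of setting all three via the corresponding MicroProfile environment variables (the values shown are illustrative):

.. code-block:: bash

  export DATAVERSE_BAGIT_SOURCEORG_NAME="Harvard Dataverse"
  export DATAVERSE_BAGIT_SOURCEORG_ADDRESS="1737 Cambridge Street, Cambridge, MA, USA"
  export DATAVERSE_BAGIT_SOURCEORG_EMAIL="support@example.edu"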

Going Live: Launching Your Production Deployment
------------------------------------------------

@@ -2510,6 +2529,13 @@ This setting was added to keep S3 direct upload lightweight. When that feature i

See also :ref:`s3-direct-upload-features-disabled`.

.. _dataverse.storageuse.disable-storageuse-increments:

dataverse.storageuse.disable-storageuse-increments
++++++++++++++++++++++++++++++++++++++++++++++++++

This setting serves as an emergency "kill switch" that disables the maintenance of a real-time record of storage use for all the datasets and collections in the database. Because of the experimental nature of this feature (see :doc:`/admin/collectionquotas`), which has not been used in a production setting as of this release (v6.1), the setting is provided in case these updates start causing database race conditions and conflicts on a busy server.
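
If the kill switch is ever needed, one way to set it, assuming the usual Payara ``asadmin`` approach used for other JVM options in this guide, is sketched below:

.. code-block:: bash

  # Disable real-time storage-use accounting (and with it, quota enforcement)
  ./asadmin create-jvm-options "-Ddataverse.storageuse.disable-storageuse-increments=true"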

dataverse.auth.oidc.*
+++++++++++++++++++++

@@ -2527,6 +2553,33 @@ See also :ref:`guestbook-at-request-api` in the API Guide.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_FILES_GUESTBOOK_AT_REQUEST``.

.. _dataverse.bagit.sourceorg.name:

dataverse.bagit.sourceorg.name
++++++++++++++++++++++++++++++

The name for your institution that you'd like to appear in bag-info.txt. See :ref:`bag-info.txt`.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_BAGIT_SOURCEORG_NAME``.

.. _dataverse.bagit.sourceorg.address:

dataverse.bagit.sourceorg.address
+++++++++++++++++++++++++++++++++

The mailing address for your institution that you'd like to appear in bag-info.txt. See :ref:`bag-info.txt`. The example in https://datatracker.ietf.org/doc/html/rfc8493 uses commas as separators: ``1 Main St., Cupertino, California, 11111``.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_BAGIT_SOURCEORG_ADDRESS``.

.. _dataverse.bagit.sourceorg.email:

dataverse.bagit.sourceorg.email
+++++++++++++++++++++++++++++++

The email for your institution that you'd like to appear in bag-info.txt. See :ref:`bag-info.txt`.

Can also be set via *MicroProfile Config API* sources, e.g. the environment variable ``DATAVERSE_BAGIT_SOURCEORG_EMAIL``.

.. _feature-flags:

Feature Flags
2 changes: 1 addition & 1 deletion src/main/java/edu/harvard/iq/dataverse/DataFile.java
@@ -640,7 +640,7 @@ public String getFriendlySize() {
return BundleUtil.getStringFromBundle("file.sizeNotAvailable");
}
}

public boolean isRestricted() {
return restricted;
}
113 changes: 47 additions & 66 deletions src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java
@@ -8,6 +8,9 @@
import edu.harvard.iq.dataverse.ingest.IngestServiceBean;
import edu.harvard.iq.dataverse.search.SolrSearchResult;
import edu.harvard.iq.dataverse.settings.SettingsServiceBean;
import edu.harvard.iq.dataverse.storageuse.StorageQuota;
import edu.harvard.iq.dataverse.storageuse.StorageUseServiceBean;
import edu.harvard.iq.dataverse.storageuse.UploadSessionQuotaLimit;
import edu.harvard.iq.dataverse.util.FileSortFieldAndOrder;
import edu.harvard.iq.dataverse.util.FileUtil;
import edu.harvard.iq.dataverse.util.SystemConfig;
@@ -41,8 +44,6 @@
*
* @author Leonid Andreev
*
* Basic skeleton of the new DataFile service for DVN 4.0
*
*/

@Stateless
@@ -66,6 +67,9 @@ public class DataFileServiceBean implements java.io.Serializable {

@EJB SystemConfig systemConfig;

@EJB
StorageUseServiceBean storageUseService;

@PersistenceContext(unitName = "VDCNet-ejbPU")
private EntityManager em;

@@ -139,39 +143,6 @@ public class DataFileServiceBean implements java.io.Serializable {
*/
public static final String MIME_TYPE_PACKAGE_FILE = "application/vnd.dataverse.file-package";

public class UserStorageQuota {
private Long totalAllocatedInBytes = 0L;
private Long totalUsageInBytes = 0L;

public UserStorageQuota(Long allocated, Long used) {
this.totalAllocatedInBytes = allocated;
this.totalUsageInBytes = used;
}

public Long getTotalAllocatedInBytes() {
return totalAllocatedInBytes;
}

public void setTotalAllocatedInBytes(Long totalAllocatedInBytes) {
this.totalAllocatedInBytes = totalAllocatedInBytes;
}

public Long getTotalUsageInBytes() {
return totalUsageInBytes;
}

public void setTotalUsageInBytes(Long totalUsageInBytes) {
this.totalUsageInBytes = totalUsageInBytes;
}

public Long getRemainingQuotaInBytes() {
if (totalUsageInBytes > totalAllocatedInBytes) {
return 0L;
}
return totalAllocatedInBytes - totalUsageInBytes;
}
}

public DataFile find(Object pk) {
return em.find(DataFile.class, pk);
}
@@ -965,7 +936,7 @@ public boolean isThumbnailAvailable (DataFile file) {
}

// If thumbnails are not even supported for this class of files,
// there's notthing to talk about:
// there's nothing to talk about:
if (!FileUtil.isThumbnailSupported(file)) {
return false;
}
@@ -980,16 +951,16 @@
is more important...
*/


if (ImageThumbConverter.isThumbnailAvailable(file)) {
file = this.find(file.getId());
file.setPreviewImageAvailable(true);
this.save(file);
return true;
}

return false;
file = this.find(file.getId());
if (ImageThumbConverter.isThumbnailAvailable(file)) {
file.setPreviewImageAvailable(true);
this.save(file);
return true;
}
file.setPreviewImageFail(true);
this.save(file);
return false;
}


@@ -1396,28 +1367,38 @@ public Embargo findEmbargo(Long id) {
return d.getEmbargo();
}

public Long getStorageUsageByCreator(AuthenticatedUser user) {
Query query = em.createQuery("SELECT SUM(o.filesize) FROM DataFile o WHERE o.creator.id=:creatorId");

try {
Long totalSize = (Long)query.setParameter("creatorId", user.getId()).getSingleResult();
logger.info("total size for user: "+totalSize);
return totalSize == null ? 0L : totalSize;
} catch (NoResultException nre) { // ?
logger.info("NoResultException, returning 0L");
return 0L;
/**
* Checks if the supplied DvObjectContainer (Dataset or Collection; although
* only collection-level storage quotas are officially supported as of now)
* has a quota configured, and if not, keeps checking if any of the direct
* ancestor Collections further up have a configured quota. If it finds one,
* it will retrieve the current total content size for that specific ancestor
* dvObjectContainer and use it to define the quota limit for the upload
* session in progress.
*
* @param parent - DvObjectContainer, Dataset or Collection
* @return upload session size limit spec, or null if quota not defined on
* any of the ancestor DvObjectContainers
*/
public UploadSessionQuotaLimit getUploadSessionQuotaLimit(DvObjectContainer parent) {
DvObjectContainer testDvContainer = parent;
StorageQuota quota = testDvContainer.getStorageQuota();
while (quota == null && testDvContainer.getOwner() != null) {
testDvContainer = testDvContainer.getOwner();
quota = testDvContainer.getStorageQuota();
if (quota != null) {
break;
}
}
if (quota == null || quota.getAllocation() == null) {
return null;
}
}

public UserStorageQuota getUserStorageQuota(AuthenticatedUser user, Dataset dataset) {
// this is for testing only - one pre-set, installation-wide quota limit
// for everybody:
Long totalAllocated = systemConfig.getTestStorageQuotaLimit();
// again, this is for testing only - we are only counting the total size
// of all the files created by this user; it will likely be a much more
// complex calculation in real life applications:
Long totalUsed = getStorageUsageByCreator(user);

return new UserStorageQuota(totalAllocated, totalUsed);
// Note that we are checking the recorded storage use not on the
// immediate parent necessarily, but on the specific ancestor
// DvObjectContainer on which the storage quota is defined:
Long currentSize = storageUseService.findStorageSizeByDvContainerId(testDvContainer.getId());

return new UploadSessionQuotaLimit(quota.getAllocation(), currentSize);
}
}