This is an application for synchronizing files from different source repositories into a Dataverse installation. The synchronization runs in background processes. The same background processes also hash the Dataverse files when the source repository uses a different hash type than the Dataverse installation. These hashes are needed to compare the files, which makes versioning between dataset versions easier: only the files that have effectively changed are replaced and added to the overview of changes between dataset versions. The frontend application does not need to be running while the synchronization runs on the server (users can close their browsers once the synchronization has been set up). Multiple synchronizations for different users can run simultaneously, each on its own goroutine, scheduled as a "job" in the background. The number of simultaneously running jobs is adjustable, and the jobs are scheduled in "First In First Out" order.
Support for different repositories is implemented as plugins. More plugins will be added in the future. At this moment, the following plugins are provided with the latest version:
- GitHub
- GitLab
- IRODS
- Dataverse (use other Dataverse as source to import the data)
- Microsoft OneDrive
- Microsoft SharePoint Online
- OSF
- SFTP
- REDCap
- Globus (this plugin is not yet released)
Add the included Keycloak (for the OIDC flow) and S3 implementations (for upload and download redirects using S3 storage) to your /etc/hosts or equivalent configuration, for example:
127.0.0.1 keycloak.mydomain.com
127.0.0.1 localstack.mydomain.com
127.0.0.1 minio.mydomain.com
You can start the demo with the following command:
make up
Wait until everything is initialized and started (follow the status in the terminal output). Then you can start testing: for example, go to the main page of the newly created Dataverse, http://localhost:8080, and click on the Log In button. Choose the OpenID Connect option at the button. On the Log In page, click on Log In with OpenID Connect. Log in with the admin/admin credentials:
Complete the new user form by choosing a username and by agreeing to the terms:
After creating that new account, go to the API Token menu option:
Create a new token and go back to the main page to create a new dataset:
Fill out the form and click on Save Dataset. In the new dataset, choose the RDM-integration upload option from the Edit Dataset menu:
Agree to the popups from localhost; you will then be redirected to log in:
After logging in, you can test different plugins and scenarios. You can also go directly to the RDM-integration tool at http://localhost:4180 to initiate different use cases, e.g., creating a new empty dataset with the Create new dataset button:
For building the frontend, you need to have the Angular CLI installed. You will need to have the latest Go installed for compiling the code. If you wish to build the application's container, you will need to have Docker installed. Finally, the state of the application (calculated hashes, scheduled jobs, etc.) is maintained by a Redis data store. When running this application on the server, you will need access to either an external Redis server or one that you run locally. The stand-alone tool does not require a Redis server (or any other tool or library installed on your system), and can be run simply by executing a binary built for your operating system.
This application can be used by accessing the API directly (from cron jobs, etc.), or with a frontend providing a GUI for the end users. The frontend source code can be found here: (rdm-integration-frontend). Besides the frontend dependency, the build process uses the following libraries and their dependencies (the go build command resolves them from the go.mod and go.sum files, and they do not need to be installed separately):
The backend configuration is loaded by the application from a file specified by the path stored in the BACKEND_CONFIG_FILE environment variable. In order to set a value for that variable, you will need to export that variable to the OS running the application, e.g.:
export BACKEND_CONFIG_FILE=../conf/backend_config.json
Note that the stand-alone version does not need the backend configuration file and is configured by the -X ldflags passed to the build command. You can also override these flags by adding arguments to the execution command, as described in the sections above.
An example of backend configuration can be found in backend-config.json. Another example, as can be used to connect to the Demo Dataverse, can be found in backend_config_demo.json. The BACKEND_CONFIG_FILE environment variable specifies which configuration file will be loaded. The only two mandatory fields in the configuration file are the following:
- dataverseServer: URL of the server where the Dataverse API is deployed.
- redisHost: the host containing the Redis data store (storing the application state).
Additionally, the configuration can contain the following fields in the optional "options" field:
- dataverseExternalUrl: this field is used to generate a link to the dataset presented to the user. Set this value if it is different from dataverseServer value, otherwise you can omit it.
- rootDataverseId: root Dataverse collection ID, needed for creating new dataset when no collection was chosen in the UI.
- defaultHash: as mentioned earlier, "MD5" is the default hash for most Dataverse installations. Change this only when your installation uses a different hashing algorithm (e.g., SHA-1).
- myDataRoleIds: role IDs for querying my data, as explained earlier in this section.
- pathToUnblockKey: path to the file containing the API unblock key. Configure this value to enable checking permissions before requesting jobs.
- pathToApiKey: path to the file containing the admin API key. Configure this value to enable URL signing instead of using the users' Dataverse API tokens.
- pathToRedisPassword: by default no password is set, if you need to authenticate with Redis, store the path to the file containing the Redis password in this field.
- redisDB: by default, DB 0 is used. If you need to use another DB, specify it here.
- defaultDriver: default driver as used by the Dataverse installation, only "file" and "s3" are supported. See also the next section.
- pathToFilesDir: path to the folder where Dataverse files are stored (only needed when using the "file" driver).
- s3Config: configuration when using the "s3" driver, similar to the settings for the s3 driver in your Dataverse installation. Only needed when using S3 file system that is not mounted as a volume. See also the next section.
- pathToOauthSecrets: path to the file containing the OAuth client secrets and POST URLs for the plugins configured to use OAuth for authentication. An example of a secrets file can be found in example_oath_secrets.json. As shown in that example, each OAuth client has its own entry, identified by the application ID. Each entry contains two fields: clientSecret, containing the client secret, and postURL, containing the URL to which the POST request for acquiring tokens should be sent. See the frontend configuration section for information on configuring OAuth authorization for the plugins.
- maxFileSize: maximum size of a file that can be uploaded to the Dataverse installation. When not set, or set to 0 (or value less than 0), there is no limit on file size that can be uploaded. The files that cannot be uploaded due to the file size limit are filtered out by the frontend and the user is notified with a warning.
- userHeaderName: URL signing needs the username in order to know for which user to sign; the username should be passed in the header of the request. The default is "Ajp_uid", as sent by the Shibboleth IDP.
- smtpConfig: configure this when you wish to send notification emails to the users: on job error and on job completion. For example, the configuration could look like this:
"smtpConfig": {
"host": "smtp.gmail.com",
"port": "587",
"from": "[email protected]"
},
"pathToSmtpPassword": "/path/to/password/file"
- pathToSmtpPassword: path to the file containing the password needed to authenticate with the SMTP server
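As an illustration of how such a configuration file could be read, here is a minimal sketch. It is not the application's own loader: the `Config` and `Options` structs, the `parseConfig`, `loadConfig` and `allowed` functions, and the way defaults are applied are assumptions made for this example, and only a few of the fields listed above are included.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// Config mirrors a small subset of the fields described above.
// The struct layout is an assumption made for this sketch.
type Config struct {
	DataverseServer string  `json:"dataverseServer"`
	RedisHost       string  `json:"redisHost"`
	Options         Options `json:"options"`
}

type Options struct {
	DataverseExternalUrl string `json:"dataverseExternalUrl"`
	DefaultHash          string `json:"defaultHash"`
	MaxFileSize          int64  `json:"maxFileSize"`
}

// parseConfig decodes the JSON, checks the two mandatory fields and fills
// in the documented defaults for the optional fields used here.
func parseConfig(data []byte) (Config, error) {
	var c Config
	if err := json.Unmarshal(data, &c); err != nil {
		return c, err
	}
	if c.DataverseServer == "" || c.RedisHost == "" {
		return c, errors.New("dataverseServer and redisHost are mandatory")
	}
	if c.Options.DefaultHash == "" {
		c.Options.DefaultHash = "MD5"
	}
	if c.Options.DataverseExternalUrl == "" {
		c.Options.DataverseExternalUrl = c.DataverseServer
	}
	return c, nil
}

// loadConfig reads the file named by BACKEND_CONFIG_FILE.
func loadConfig() (Config, error) {
	path := os.Getenv("BACKEND_CONFIG_FILE")
	if path == "" {
		return Config{}, errors.New("BACKEND_CONFIG_FILE is not set")
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return Config{}, err
	}
	return parseConfig(data)
}

// allowed reports whether a file of the given size may be uploaded;
// a maxFileSize of 0 or less means that there is no limit.
func allowed(size, maxFileSize int64) bool {
	return maxFileSize <= 0 || size <= maxFileSize
}

func main() {
	cfg, err := loadConfig()
	if err != nil {
		fmt.Println("cannot load configuration:", err)
		return
	}
	fmt.Println("Dataverse server:", cfg.DataverseServer)
	fmt.Println("1 GiB file allowed:", allowed(1<<30, cfg.Options.MaxFileSize))
}
```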
When running this tool on the server, you can take advantage of directly uploading files to the file system where Dataverse files are stored (assuming that you have direct access to that file system from the location where this application is running). The most generic way is simply mounting the file system as a volume and configuring the application (in the backend configuration file) to use the "file" driver pointing to the mounted volume. For example:
{
"dataverseServer": "https://some.dataverse.com",
"redisHost": "localhost:6379",
"options": {
"dataverseServer": "https://some.dataverse.com",
"defaultDriver": "file",
"pathToFilesDir": "/path/to/mounted/volume"
}
}
As an alternative, you can access an s3 storage directly from this application, without the need of mounting it. First, you will need to configure the authentication by setting the following environment variables on the OS running this application:
- Access Key ID: AWS_ACCESS_KEY_ID or AWS_ACCESS_KEY
- Secret Access Key: AWS_SECRET_ACCESS_KEY or AWS_SECRET_KEY
The s3 driver is then configured in the backend configuration file, for example:
{
"dataverseServer": "localhost:8080",
"redisHost": "localhost:6379",
"options": {
"dataverseExternalUrl": "https://demo.dataverse.org",
"defaultDriver": "s3",
"s3Config": {
"awsEndpoint": "http://some.endpoint.here",
"awsRegion": "region",
"awsPathstyle": "true",
"awsBucket": "bucket"
}
}
}
Notice that the driver configuration is optional. When it is not set, no direct uploading is in use and simply the Dataverse API is called for storing the files. However, this can result in unnecessary usage of resources (network, CPU, etc.) and might slow down the Dataverse installation.
There are two types of possible customizations to the frontend. The first type is the customization done by the replacement of the HTML files, e.g., the footer.html and the header.html. The files that are going to be replaced are placed in the conf/customizations directory, which can also contain the files referenced by the custom HTML files. By default, only the make executable and make multiplatform_demo commands effectively replace these files while building. In order to add customizations into your make script, add the following line to the script: cp -r conf/customizations/* image/app/frontend/dist/datasync/.
The second type is the configuration with a configuration file. The default configuration file (used when the configuration file is not specified in the FRONTEND_CONFIG_FILE environment variable) can be found in default_frontend_config.json. In order to use a custom configuration file, set the FRONTEND_CONFIG_FILE environment variable accordingly. An example of the configuration file, also used by the make scripts and the docker commands, can be found in frontend_config.json.
The configuration file can contain the following options for the frontend:
- dataverseHeader: the display name of the Dataverse installation.
- collectionOptionsHidden: if set to false (or omitted), an extra dropdown is shown that allows for collection selection within the Dataverse installation. The selected collection is then used for creating a new dataset, when that option is enabled, and for filtering the available datasets.
- collectionFieldEditable: if set to true, the user can paste or type collection identifiers directly, without the use of the dropdown.
- createNewDatasetEnabled: if set to true, it enables the "Create new dataset" button.
- datasetFieldEditable: if set to true, the user can paste or type DOI identifiers directly, without the use of the dropdown.
- externalURL: this option is filled out by the backend from the dataverseExternalUrl backend configuration file field, and should not be set manually.
- showDvTokenGetter: set it to true to show the "Get token" button next to the Dataverse token field.
- showDvToken: set it to true to show the token field (set it to false when using URL signing).
- redirect_uri: when using OAuth, this option should be set to the redirect_uri as configured in the OAuth application setting (e.g., GitHub application settings as described in this guide). The redirect URI must point to the /connect page of this application.
- storeDvToken: set it to true to allow storing the Dataverse API token in the browser of the user.
- sendMails: set it to true to enable sending mails to the user (you need to configure the SMTP settings in the backend configuration).
- plugins: contains one entry for each repository instance, as described below.
Having multiple instances for plugin types is useful when certain features, e.g., OAuth authentication, can be configured for specific installations of a given repository type. It is perfectly possible to have at most one instance for each plugin type, as is the case in default_frontend_config.json. Plugins that are not configured will not be shown in the UI. A repository instance, configured as an entry in the plugins setting of the frontend configuration, can contain the following fields:
- id: unique identifier for the repository instance configuration.
- name: name of the instance, as displayed in the "Repository instance" field on the connect page, e.g. "KU Leuven GitLab".
- plugin: the identifier of the plugin, as implemented in registry.go, e.g., irods, github, gitlab, etc.
- pluginName: display name of the plugin, as displayed in the "Repository type" dropdown.
- optionFieldName: when the plugin implements the Options function, this field is set to the name of the implemented option, e.g., "branch" or "folder".
- optionFieldPlaceholder: the placeholder for the option field.
- tokenFieldName: when the user needs to authenticate with an API token or password to the given repository (e.g., OAuth is not configured for this repository instance), this field should be set to the name of the needed credential, e.g., "Token" or "Password".
- tokenFieldPlaceholder: the placeholder for the token field.
- sourceUrlFieldName: when configured, the UI will show the source URL field, where the user can enter the URL of the repository to connect to.
- sourceUrlFieldPlaceholder: the placeholder for the source URL field.
- sourceUrlFieldValue: when configured, it contains the default value for the source URL field. When this value is always the same for a given plugin, e.g., https://github.com, then the sourceUrlFieldName can be left empty, and the field will not be shown (but will always contain the configured default value).
- usernameFieldName: when the user needs to authenticate with a username to the given repository (e.g., OAuth is not configured for this repository instance), this field should be set to the name of this field, e.g., "Username".
- usernameFieldPlaceholder: the placeholder for the username field.
- repoNameFieldName: repository selection field name.
- repoNameFieldPlaceholder: the placeholder for the repository selection field.
- repoNameFieldEditable: if set to true, the user can paste or type the repository name directly, without the use of the dropdown.
- repoNameFieldValues: suggested or possible repository names. When this is filled out, a dropdown will be presented to the user, otherwise a text field will be presented.
- repoNameFieldHasSearch: when the plugin implements the Search function, this field can be set to true for searchable repository names.
- repoNameFieldHasInit: when the plugin implements the Search function, this field can be set to true for an initial search without the search term (initializes the dropdown to the default values as implemented by the search function).
- parseSourceUrlField: when set to true, the repoName field can be left unconfigured and the repository name is parsed from the source URL field.
- tokenName: when set to a unique value, the credential needed for authentication is stored in the browser.
- tokenGetter: OAuth configuration for the repository instance containing the URL where authorizations should be redirected to, and the oauth_client_id from the OAuth application setting (e.g., GitHub application settings as described in this guide). See also the backend configuration section on how to configure the needed client secrets.
In order to integrate a new repository type, you need to implement a new plugin for the backend. The plugins are implemented in the image/app/plugin/impl folder (each having its own package). The new plugin implementation must be then registered in the registry.go file. As can be seen in the same file, a plugin implements functions that are required by the Plugin type:
type Plugin struct {
	Query   func(ctx context.Context, req types.CompareRequest, dvNodes map[string]tree.Node) (map[string]tree.Node, error)
	Options func(ctx context.Context, params types.OptionsRequest) ([]string, error)
	Search  func(ctx context.Context, params types.OptionsRequest) ([]string, error)
	Streams func(ctx context.Context, in map[string]tree.Node, streamParams types.StreamParams) (map[string]types.Stream, error)
}
Each plugin implements at least these two functions:
- Query: using the standard fields as provided in the "types.CompareRequest" (username, API token, URL, etc.), this function queries the repository for files. The result is a flat mapping of files found on the repository to their paths. A file is represented by a "tree.Node" type containing the file name, file path, hash type and hash value, etc. Notice that it does not contain the file itself. The dvNodes parameter holds a copy of the nodes as present in the dataset on the Dataverse installation (and can be ignored in most cases).
- Streams: files are synchronized using streams from the source repository to the file system, where each file has its own stream. This function implements "types.Stream" objects for the provided files (the "in" parameter contains a filtered list of files that are going to be copied from the repository). Notably, a "types.Stream" object contains a function for opening a stream to the provided file and a function to close that stream.
Additionally, the plugins can implement the following functions:
- Options: this function lists branches (or folders in the case of IRODS) applicable for the current repository. It can be only called when the user has provided the credentials needed to call the repository (this is verified at the frontend) and the repository name that the options will apply to. These credentials and the repository name are then provided in the "types.OptionsRequest" value. This function needs only to be implemented when this functionality is needed by the given type of the repository.
- Search: when implemented, this function can be used for searching repositories by name, based on the search term provided by the user. It makes the selection of the repository process easier for the users.
After implementing the above-mentioned functions on the backend, the plugin needs to be configured at the frontend. It then becomes selectable by the user, with the possibility of different configurations for specific repository instances. See the section on frontend configuration for further details.
The sequence diagrams for search and oauthtoken are very similar to this one, and are not shown separately.
sequenceDiagram
Frontend->>+Backend: /api/plugin/options
Backend->>Repo: Specific call, e.g., list branches
Repo-->>Backend: List of branches
Backend-->>-Frontend: List of options for the dropdown
sequenceDiagram
Frontend->>+Backend: /api/common/dvobjects
loop Until all pages are retrieved
Backend->>Dataverse: /api/v1/mydata/retrieve
Dataverse->>Backend: Dataverse collections
end
Backend-->>-Frontend: Dataverse collections
sequenceDiagram
Frontend->>+Backend: /api/common/newdataset
Backend->>Dataverse: POST /api/dataverses/{{Dataverse collection}}/datasets
Dataverse-->>Backend: Response
Backend-->>-Frontend: Persistent ID of the new dataset
sequenceDiagram
Frontend->>+Backend: /api/plugin/compare
Backend->>+Goroutine: Compare using Key as ref.
activate Goroutine
Backend-->>Frontend: Key
loop Until cached response ready
Frontend->>Backend: /api/common/cached
Backend->>Redis: Get(key)
Redis-->>Backend: Cached response if ready
Backend-->>Frontend: Cached response if ready
end
Goroutine->>Dataverse: List files
Dataverse-->>Goroutine: List of files
Goroutine->>Repo: List files
Repo-->>Goroutine: List of files
Goroutine->>Redis: Get known hashes
Redis-->>Goroutine: Known hashes
Goroutine->>Redis: Hashing job for unknown hashes
Goroutine->>Redis: Cached response is ready
deactivate Goroutine
loop Until all hashes known
Frontend->>Backend: /api/common/compare
Backend->>Redis: Get known hashes
Redis-->>Backend: Response
Backend-->>Frontend: Not all hashes known
end
Worker->>Redis: Get new job
Redis-->>Worker: Hashing job
activate Worker
loop Until all hashes known
Worker-->>Worker: Calculate N hashes
Worker->>Redis: Store calculated hashes
end
Worker->>Redis: All hashes known
deactivate Worker
sequenceDiagram
Frontend->>+Backend: /api/common/store
Backend->>Redis: Add new job
Backend->>Frontend: Job added
loop Until all files processed
Frontend->>Backend: /api/common/compare
Backend->>Redis: get processed files list
Redis-->>Backend: Response
Backend-->>Frontend: Not all files processed
end
Worker->>Redis: Get new job
Redis-->>Worker: Persisting job
activate Worker
loop Until all files processed
Worker-->>Worker: Process file (write or delete in dataset)
Worker-->>Redis: Notify file is processed
end