How to populate installations with "real world" sample data #5235

matthew-a-dunlap · 2018-10-25T17:17:05Z

With Dataverse we need better means to populate test data. Current methods such as our existing scripts and dumping a database work but they are flawed. This has gained greater importance with our ability to automate deployment (via #4990), as these newly spun-up environments come with no data. The easier and more open we make this, the easier it will be for people to demo and develop on Dataverse.

Specific needs:

Data files
Large quantities of data in general (Dataverses, Datasets, Users, etc)
Data capturing various edge cases (some of which may be hard to know unless the data is "real").

Possible solutions:

Leverage the existing harvesting system?
Manually dump and sanitize public data for testing use
Continue to build out the scripts/deploy/phoenix.dataverse.org/post automation for populating more normal data


djbrooke · 2019-02-06T04:18:42Z

@matthew-a-dunlap @mheppler @pdurbin @scolapasta

Now that we live in a post #4990 world, I'd be interested to hear y'all's thoughts (and the thoughts of any other test data enthusiasts out there) on what the next step here is. Is there enough that we should start working on this or do we need more discussion? I'll bring it to backlog grooming tomorrow to remind us to discuss.

mheppler · 2019-02-06T17:02:04Z

@pdurbin is the local expert on scripting test data. He and I have discussed our hopes and dreams for this feature, and I think we're on the same page. He mentioned there are plans to include dataset scripts in the “dataverse-ansible” repo.

Here is more about the Ansible configuration management tool from the Choose Your Own Installation Adventure page of our Installation Guide.

There are some community-led projects to use configuration management tools such as Ansible and Puppet to automate Dataverse installation and configuration, but support for these solutions is limited to what the Dataverse community can offer as described in each project’s webpage:

https://github.com/IQSS/dataverse-ansible
https://github.com/IQSS/dataverse-puppet

(Please note that the “dataverse-ansible” repo is used in a script that allows Dataverse to be installed on Amazon Web Services (AWS) from arbitrary GitHub branches as described in the Deployment section of the Developer Guide.)

@donsizemore has contributed most of the Ansible installer, and I believe had plans for also building on this "test data" script solution.

djbrooke · 2019-02-13T19:31:00Z

Hey @donsizemore - we tried to estimate this today and I heard you may be working on this already. Can you leave a comment or some links here? Thank you!

pdurbin · 2019-02-13T20:19:00Z

IQSS/dataverse-ansible#37 is the issue and https://github.com/IQSS/dataverse-ansible/tree/37_sample_data is the branch. As I reported at http://irclog.iq.harvard.edu/dataverse/2019-01-29#i_85802 I tried (as of IQSS/dataverse-ansible@945e1c1 ) and got this error:

fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'sampledataverses'\n\nThe error appears to have been in '/home/centos/dataverse/tasks/dataverse-sampledata-examples.yml': line 9, column 3

I tried to set expectations during sprint planning today that we're going to want to iterate on this. We very much appreciate @donsizemore working on this and we're happy to retest when he's ready. Once we get something working, we'll probably have feedback on how the data looks, etc.

pdurbin · 2019-02-15T17:26:59Z

@donsizemore as I mentioned in IRC, I'm out next week so I hope you don't mind if I assigned this issue to you. It sounds like you're making great progress. Thanks so much for working on this!

At standup I gave @djbrooke a heads up that you'll ping him when you're ready for someone to try your "37_sample_data" branch in dataverse-ansible (pro tip: be sure to tweak your local copy of ec2-create-instance.sh to check out the right branch!).

Oh, I'll attach here the main.yml file I've been using: main.yml.txt

Call it with this:

ec2-create-instance.sh -g main.yml

Huh, I just noticed a typo at 3bdf6c9 (-b instead of -g) so I'll make a pull request to fix it.

donsizemore · 2019-02-18T15:57:24Z

I believe I have the plumbing for this working in dataverse-ansible. I don't have that much sample data, but the idea is this: if the ansible sampledata group_var is set to true, check for existing sampledata in a given location, or use your own. As-is, ansible first checks a dataverses/ subdirectory for json files, then creates users from another subdirectory, then finally checks for *.sh in a datasets/ subdirectory. The shell script(s) would assign permissions, create datasets and upload files. You're welcome to kick the tires (or me) via my 37_sample_data branch.
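The ordering described above can be sketched in Python. This is only an illustration of the three-step layout, not code from dataverse-ansible: the function name `plan_sample_data`, the `users` directory name, and the assumption that user definitions are JSON files are all hypothetical, since the comment only says users come from "another subdirectory".

```python
import tempfile
from pathlib import Path


def plan_sample_data(root, users_dir="users"):
    """Return (step, path) pairs in the order the comment describes.

    The step names and the users directory name are illustrative
    assumptions; only dataverses/*.json and datasets/*.sh come from
    the description above.
    """
    root = Path(root)
    layout = [
        ("create_dataverse", "dataverses", "*.json"),  # 1. dataverse JSON files
        ("create_user", users_dir, "*.json"),          # 2. users (name and format assumed)
        ("run_dataset_script", "datasets", "*.sh"),    # 3. dataset shell scripts
    ]
    steps = []
    for step, subdir, pattern in layout:
        directory = root / subdir
        if not directory.is_dir():
            continue  # a missing subdirectory simply contributes no steps
        for path in sorted(directory.glob(pattern)):
            steps.append((step, path))
    return steps


if __name__ == "__main__":
    # Build a throwaway sample-data tree and print the resulting plan.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for rel in ("dataverses/root.json", "users/alice.json", "datasets/load.sh"):
            target = base / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()
        for step, path in plan_sample_data(base):
            print(step, path.name)
```

Keeping the plan as plain data makes the ordering easy to inspect before anything is created on a running installation.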

djbrooke · 2019-02-19T17:05:36Z

Thanks @donsizemore! This is really important for Dataverse. I'll move this over to code review.

pdurbin · 2019-08-09T23:53:43Z

Actually, I went ahead and spun it up at http://ec2-3-80-234-144.compute-1.amazonaws.com so a volunteer or two can say if we're done with the content. Here's a screenshot:

djbrooke · 2019-08-12T16:21:44Z

Moving back to Team dev. What's left here:

@djbrooke Need more files, 50 at least
@djbrooke Need one dataset with at least 11 files (for pagination) (updated in IQSS/dataverse-sample-data@2cace49)
@djbrooke Verify that we have a good representation of file types (check previous list in #5235 (comment) - fits, image, geospatial)
@djbrooke reassess whether or not we use Eleni's test data (test facets are showing up)
@mheppler and @pdurbin Someone aside from @pdurbin (and @donsizemore :)) being able to run this

djbrooke · 2019-08-12T20:05:10Z

@TaniaSchlatter @mheppler @pdurbin thanks for discussing this earlier today. I made the changes for the first four bullets in PR IQSS/dataverse-sample-data#1.

I decided to keep Eleni's datasets in, but I did adjust the keywords from test and test2 to something more real-world.

djbrooke · 2019-08-12T21:08:08Z

OK, PR mentioned above has been merged. Thanks @pdurbin for the review!

@pdurbin can you spin up a branch for @TaniaSchlatter to take a look at the most recently-added files?

donsizemore · 2019-08-12T21:11:49Z

hey @djbrooke regarding the last bullet above, what would help there? there's dataverse-ansible documentation for sample-data in its README.md, but some of this is fed through the EC2 script which Phil just had me move over to the dataverse-ansible repo.

I could beef up README.md in some prescribed way, or would you want a developers' workflow, with examples for EC2, Vagrant, or a local install? there isn't currently a dataverse-ansible Dockerfile, but there could be.

I'm personally leaning toward a dev guide / workflow?

djbrooke · 2019-08-13T13:41:51Z

Hey @donsizemore thanks for checking in and for the offer of help!

In regards to what would be the most helpful here, I'll defer to @mheppler. He tried to get this running when @pdurbin was out and was not successful. More documentation in general is always welcomed, but Mike may have some specific trouble spots that he ran into that could be targeted for additional docs.

Thanks again!

pdurbin · 2019-08-13T15:32:28Z

@mheppler @djbrooke and I are planning to meet at 3 to go through the README at https://github.com/IQSS/dataverse-sample-data . I can try to improve the README if it's confusing. 😄

We can use Dataverse running on laptops or the instance I spun up for this issue: http://ec2-3-80-234-144.compute-1.amazonaws.com . We can also try the destroy_all_dvobjects.py script.

djbrooke · 2019-08-13T20:20:12Z

Thanks @pdurbin and @mheppler for the helpful meeting about this, I was able to successfully set up sample data on the instance mentioned above.

I need to remove the data files for larger states (IL) and replace them with data files for smaller states (WY). After updating this I'll destroy what's currently there and replace it with the data for the smaller states.

djbrooke · 2019-08-14T19:58:10Z

OK, I was able to successfully destroy and re-create all sample data from the command line. If these instructions work for lowly me, they should work for anyone. :) Moving to code review.

Passing to @TaniaSchlatter for review of the data on http://ec2-3-80-234-144.compute-1.amazonaws.com. We can make further adjustments to the test data that's available, but I'm pretty happy with the code where it is.

pdurbin · 2019-08-14T20:05:37Z

We could also review what I wrote for http://guides.dataverse.org/en/4.15.1/developers/tips.html#sample-data and add to it if we want.

TaniaSchlatter · 2019-08-19T16:38:54Z

Comments from review (+ looking at the April 25 list above):

exploring via Data Explore opens Whole Tale (http://ec2-3-80-234-144.compute-1.amazonaws.com/file.xhtml?fileId=157&version=1.0)
one or two dataverses should have banner graphics
looked for a geospatial file with map preview/ explore in world map but didn't find. Is there one that just needs a label?
looked for a dataset with file hierarchy but didn't find
it would be nice if the titles of the test datasets (not the subsets of actual datasets) described what they represent ("Test Dataset with 10+ files and geospatial file")
I assume we can restrict a file and flesh out permissions. If we make these changes in the UI, will it change the data we are making available? Should we have a raw version and a version that has more features enabled for Harvard's use, or would the version that has more features be more useful for everyone?

Regarding the text in the guide: recommend adding a use case about using the test data on a server that is not the main installation.

djbrooke · 2019-08-19T16:57:05Z

Thanks @TaniaSchlatter, I'll checklist-ize this and work through the ones that I can. I'll work with the team on others and also discuss some with you.

Issues for Danny to update

Better highlight a dataset with hierarchy (there is at least one in here...)
Update test dataset names to represent the contents in the dataset
Guide text (added in 6103 doc updates #6104)

Issues to discuss

best solution for banner graphics (ok to not implement now)
Investigate adding geospatial preview/explore (requires changes to ansible?)

Issues for @djbrooke to add

File restrictions, tags, descriptions (sample data repo)
exploring via Data Explore opens Whole Tale (http://ec2-3-80-234-144.compute-1.amazonaws.com/file.xhtml?fileId=157&version=1.0) (in ansible)

djbrooke · 2019-08-21T14:25:06Z

I'm going to close this issue. @pdurbin announced the sample data repo's existence on the Google group and it's in a state where people in the community can start using it:

https://groups.google.com/forum/#!topic/dataverse-community/u-Yv0U3v4Bo

We created individual issues in the dataverse-ansible and sample data repos for the follow on tasks to make this useful to more groups (design team, specifically):

We'll take these through our usual development process.

Thanks everyone for the hard work on this issue, this is great to have!!

matthew-a-dunlap changed the title ~~As a dataverse test environment user, I'd like to populate a dataverse installation with data analogous to real world use~~ As a dataverse test environment user, I'd like to populate installations with data analogous to real world use Oct 25, 2018

djbrooke self-assigned this Feb 6, 2019

djbrooke added Status: Ready labels Feb 6, 2019

djbrooke removed their assignment Feb 13, 2019

pdurbin mentioned this issue Feb 13, 2019

allow pre-fab or arbitrary sample data IQSS/dataverse-ansible#37

Closed

djbrooke removed the ready for estimation label Feb 13, 2019

pdurbin added Status: Development and removed Status: Ready labels Feb 14, 2019

pdurbin self-assigned this Feb 14, 2019

pdurbin modified the milestone: 4.11 - Preservation Integrations Feb 15, 2019

pdurbin added Status: Community Dev and removed Status: Development labels Feb 15, 2019

pdurbin assigned donsizemore and unassigned pdurbin Feb 15, 2019

pdurbin added a commit that referenced this issue Feb 15, 2019

typo -g for GRPVRS for main.yml (not -b) #5235

14982ae

pdurbin mentioned this issue Feb 15, 2019

sample data #5235 #5557

Merged

djbrooke added Status: Code Review and removed Status: Community Dev labels Feb 19, 2019

djbrooke unassigned donsizemore Feb 19, 2019

pdurbin removed their assignment Aug 9, 2019

djbrooke self-assigned this Aug 12, 2019

djbrooke assigned mheppler and pdurbin Aug 12, 2019

djbrooke removed their assignment Aug 12, 2019

pdurbin assigned djbrooke Aug 13, 2019

djbrooke unassigned pdurbin and mheppler Aug 13, 2019

djbrooke removed their assignment Aug 14, 2019

djbrooke assigned TaniaSchlatter Aug 14, 2019

TaniaSchlatter assigned djbrooke Aug 19, 2019

djbrooke unassigned TaniaSchlatter Aug 20, 2019

djbrooke closed this as completed Aug 21, 2019
