How to populate installations with "real world" sample data #5235

Closed
matthew-a-dunlap opened this issue Oct 25, 2018 · 75 comments

@matthew-a-dunlap
Contributor

Dataverse needs a better way to populate test data. Current methods, such as our existing scripts and dumping a database, work but are flawed. This has become more important now that we can automate deployment (via #4990), because these newly spun-up environments come with no data. The easier and more open we make this, the easier it will be for people to demo and develop on Dataverse.

Specific needs:

  • Data files
  • Large quantities of data in general (Dataverses, Datasets, Users, etc)
  • Data capturing various edge cases (some of which may be hard to know unless the data is "real").

Possible solutions:

  • Leverage the existing harvesting system?
  • Manually dump and sanitize public data for testing use
  • Continue to build out the scripts/deploy/phoenix.dataverse.org/post automation for populating more normal data
@matthew-a-dunlap matthew-a-dunlap changed the title As someone working on the Dataverse codebase, I'd like to populate a dataverse installation with data analogous with real world use As someone working on the Dataverse codebase, I'd like to populate a dataverse installation with data analogous to real world use Oct 25, 2018
@matthew-a-dunlap matthew-a-dunlap changed the title As someone working on the Dataverse codebase, I'd like to populate a dataverse installation with data analogous to real world use As a dataverse test environment user, I'd like to populate a dataverse installation with data analogous to real world use Oct 25, 2018
@matthew-a-dunlap matthew-a-dunlap changed the title As a dataverse test environment user, I'd like to populate a dataverse installation with data analogous to real world use As a dataverse test environment user, I'd like to populate installations with data analogous to real world use Oct 25, 2018
@djbrooke
Contributor

djbrooke commented Feb 6, 2019

@matthew-a-dunlap @mheppler @pdurbin @scolapasta

Now that we live in a post #4990 world, I'd be interested to hear y'all's thoughts (and the thoughts of any other test data enthusiasts out there) on what the next step here is. Is there enough that we should start working on this or do we need more discussion? I'll bring it to backlog grooming tomorrow to remind us to discuss.

@djbrooke djbrooke self-assigned this Feb 6, 2019
@mheppler
Contributor

mheppler commented Feb 6, 2019

@pdurbin is the local expert on scripting test data. He and I have discussed our hopes and dreams for this feature, and I think we're on the same page. He mentioned there are plans to include dataset scripts in the “dataverse-ansible” repo.

Here is more about the Ansible configuration management tool from the Choose Your Own Installation Adventure page of our Installation Guide.

There are some community-led projects to use configuration management tools such as Ansible and Puppet to automate Dataverse installation and configuration, but support for these solutions is limited to what the Dataverse community can offer as described in each project’s webpage:

https://github.com/IQSS/dataverse-ansible
https://github.com/IQSS/dataverse-puppet

(Please note that the “dataverse-ansible” repo is used in a script that allows Dataverse to be installed on Amazon Web Services (AWS) from arbitrary GitHub branches as described in the Deployment section of the Developer Guide.)

@donsizemore has contributed most of the Ansible installer, and I believe had plans for also building on this "test data" script solution.

@djbrooke djbrooke removed their assignment Feb 13, 2019
@djbrooke
Contributor

Hey @donsizemore - we tried to estimate this today and I heard you may be working on this already. Can you leave a comment or some links here? Thank you!

@pdurbin
Member

pdurbin commented Feb 13, 2019

IQSS/dataverse-ansible#37 is the issue and https://github.com/IQSS/dataverse-ansible/tree/37_sample_data is the branch. As I reported at http://irclog.iq.harvard.edu/dataverse/2019-01-29#i_85802 I tried (as of IQSS/dataverse-ansible@945e1c1 ) and got this error:

fatal: [localhost]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'sampledataverses'\n\nThe error appears to have been in '/home/centos/dataverse/tasks/dataverse-sampledata-examples.yml': line 9, column 3

I tried to set expectations during sprint planning today that we're going to want to iterate on this. We very much appreciate @donsizemore working on this and we're happy to retest when he's ready. Once we get something working, we'll probably have feedback on how the data looks, etc.

@pdurbin
Member

pdurbin commented Feb 15, 2019

@donsizemore as I mentioned in IRC, I'm out next week so I hope you don't mind that I assigned this issue to you. It sounds like you're making great progress. Thanks so much for working on this!

At standup I gave @djbrooke a heads up that you'll ping him when you're ready for someone to try your "37_sample_data" branch in dataverse-ansible (pro tip: be sure to tweak your local copy of ec2-create-instance.sh to check out the right branch!).

Oh, I'll attach here the main.yml file I've been using: main.yml.txt

Call it with this:

ec2-create-instance.sh -g main.yml

Huh, I just noticed a typo at 3bdf6c9 (-b instead of -g) so I'll make a pull request to fix it.

@donsizemore
Contributor

I believe I have the plumbing for this working in dataverse-ansible. I don't have that much sample data, but the idea is this: if the ansible sampledata group_var is set to true, check for existing sampledata in a given location, or use your own. As-is, ansible first checks a dataverses/ subdirectory for json files, then creates users from another subdirectory, then finally checks for *.sh in a datasets/ subdirectory. The shell script(s) would assign permissions, create datasets and upload files. You're welcome to kick the tires (or me) via my 37_sample_data branch.
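
As a rough illustration of that ordering (not the actual Ansible tasks), the discovery step could be sketched in Python as below. The function name `plan_sample_data`, the `users` directory name, and the assumption that user definitions are JSON files are all hypothetical; only the `dataverses/` and `datasets/` subdirectories and the `*.json` / `*.sh` patterns for them come from the description above.

```python
from pathlib import Path


def plan_sample_data(root: Path) -> list[tuple[str, Path]]:
    """Return (step, path) pairs in the order the sample data would be loaded."""
    steps = [
        # (step label, subdirectory, file pattern)
        ("dataverse", "dataverses", "*.json"),  # dataverses are created first
        ("user", "users", "*.json"),            # ASSUMPTION: directory name and file type
        ("dataset", "datasets", "*.sh"),        # scripts assign permissions, create datasets, upload files
    ]
    plan = []
    for step, subdir, pattern in steps:
        directory = root / subdir
        if not directory.is_dir():
            continue  # nothing to load for this step
        # Sort for a stable, repeatable order within each step.
        for path in sorted(directory.glob(pattern)):
            plan.append((step, path))
    return plan
```

Calling `plan_sample_data(Path("sampledata"))` on a directory laid out this way yields the dataverse JSON files, then the user files, then the dataset scripts, which mirrors the dependency order: datasets need both a parent dataverse and a user to own them.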

@djbrooke
Contributor

Thanks @donsizemore! This is really important for Dataverse. I'll move this over to code review.

@pdurbin pdurbin removed their assignment Aug 9, 2019
@pdurbin
Member

pdurbin commented Aug 9, 2019

Actually, I went ahead and spun it up at http://ec2-3-80-234-144.compute-1.amazonaws.com so a volunteer or two can say if we're done with the content. Here's a screenshot:

(image: Screenshot from 2019-08-09 19-50-55)

@djbrooke djbrooke self-assigned this Aug 12, 2019
@djbrooke
Contributor

djbrooke commented Aug 12, 2019

Moving back to Team dev. What's left here:

@djbrooke
Contributor

@TaniaSchlatter @mheppler @pdurbin thanks for discussing this earlier today. I made the changes for the first four bullets in PR IQSS/dataverse-sample-data#1.

I decided to keep Eleni's datasets in, but I did adjust the keywords from test and test2 to something more real-world.

@djbrooke
Contributor

OK, PR mentioned above has been merged. Thanks @pdurbin for the review!

@pdurbin can you spin up a branch for @TaniaSchlatter to take a look at the most recently-added files?

@djbrooke djbrooke removed their assignment Aug 12, 2019
@donsizemore
Contributor

hey @djbrooke regarding the last bullet above, what would help there? there's dataverse-ansible documentation for sample-data in its README.md, but some of this is fed through the EC2 script which Phil just had me move over to the dataverse-ansible repo.

I could beef up README.md in some prescribed way, or would you want a developers' workflow, with examples for EC2, Vagrant, or a local install? there isn't currently a dataverse-ansible Dockerfile, but there could be.

I'm personally leaning toward a dev guide / workflow?

@djbrooke
Contributor

Hey @donsizemore thanks for checking in and for the offer of help!

As for what would be most helpful here, I'll defer to @mheppler. He tried to get this running when @pdurbin was out and was not successful. More documentation in general is always welcome, but Mike may have run into specific trouble spots that could be targeted for additional docs.

Thanks again!

@pdurbin
Member

pdurbin commented Aug 13, 2019

@mheppler @djbrooke and I are planning to meet at 3 to go through the README at https://github.com/IQSS/dataverse-sample-data . I can try to improve the README if it's confusing. 😄

We can use Dataverse running on laptops or the instance I spun up for this issue: http://ec2-3-80-234-144.compute-1.amazonaws.com . We can also try the destroy_all_dvobjects.py script.

@djbrooke
Contributor

Thanks @pdurbin and @mheppler for the helpful meeting about this, I was able to successfully set up sample data on the instance mentioned above.

I need to remove the data files for larger states (IL) and replace them with data files for smaller states (WY). After updating this, I'll destroy what's currently there and re-create it with the data for the smaller states.

@djbrooke djbrooke removed their assignment Aug 14, 2019
@djbrooke
Contributor

OK, I was able to successfully destroy and re-create all sample data from the command line. If these instructions work for lowly me, they should work for anyone. :) Moving to code review.

Passing to @TaniaSchlatter for review of the data on http://ec2-3-80-234-144.compute-1.amazonaws.com. We can make further adjustments to the test data that's available, but I'm pretty happy with the code where it is.

@pdurbin
Member

pdurbin commented Aug 14, 2019

We could also review what I wrote for http://guides.dataverse.org/en/4.15.1/developers/tips.html#sample-data and add to it if we want.

@TaniaSchlatter
Member

Comments from review (+ looking at the April 25 list above):

  • exploring via Data Explore opens Whole Tale (http://ec2-3-80-234-144.compute-1.amazonaws.com/file.xhtml?fileId=157&version=1.0)

  • one or two dataverses should have banner graphics

  • looked for a geospatial file with map preview/explore in WorldMap but didn't find one. Is there one that just needs a label?

  • looked for a dataset with file hierarchy but didn't find one

  • it would be nice if the titles of the test datasets (not the subsets of actual datasets) described what they represent ("Test Dataset with 10+ files and geospatial file")

  • I assume we can restrict a file and flesh out permissions. If we make these changes in the UI, will it change the data we are making available? Should we have a raw version and a version that has more features enabled for Harvard's use, or would the version that has more features be more useful for everyone?
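
One way to keep test dataset titles in sync with their contents, as suggested in the list above, is to derive the title from the file list. The sketch below is only an illustration: the function name `describe_test_dataset`, the set of extensions treated as geospatial, and the rule that a `/` in a file name indicates a hierarchy are all assumptions, not part of the sample data repo.

```python
from pathlib import PurePath

# ASSUMPTION: extensions treated as geospatial; adjust for the files actually in use.
GEOSPATIAL_SUFFIXES = {".shp", ".geojson", ".kml"}


def describe_test_dataset(file_names: list[str]) -> str:
    """Build a descriptive title from the names of the files in a test dataset."""
    features = []
    if len(file_names) >= 10:
        features.append("10+ files")
    # ASSUMPTION: a "/" in a file name means the dataset has a folder hierarchy.
    if any("/" in name for name in file_names):
        features.append("file hierarchy")
    if any(PurePath(name).suffix.lower() in GEOSPATIAL_SUFFIXES for name in file_names):
        features.append("geospatial file")
    if not features:
        return "Test Dataset"
    return "Test Dataset with " + " and ".join(features)
```

For ten CSV files plus one `.shp` file, this produces the title quoted above, "Test Dataset with 10+ files and geospatial file".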

Regarding the text in the guide: recommend adding a use case about using the test data on a server that is not the main installation.

@djbrooke
Contributor

djbrooke commented Aug 19, 2019

Thanks @TaniaSchlatter, I'll checklist-ize this and work through the ones that I can. I'll work with the team on others and also discuss some with you.

Issues for Danny to update

  • Better highlight a dataset with hierarchy (there is at least one in here...)
  • Update test dataset names to represent the contents of the dataset
  • Guide text (added in 6103 doc updates #6104)

Issues to discuss

  • best solution for banner graphics (ok to not implement now)
  • Investigate adding geospatial preview/explore (requires changes to ansible?)

Issues for @djbrooke to add

@djbrooke
Contributor

djbrooke commented Aug 21, 2019

I'm going to close this issue. @pdurbin announced the sample data repo's existence on the Google group and it's in a state where people in the community can start using it:

https://groups.google.com/forum/#!topic/dataverse-community/u-Yv0U3v4Bo

We created individual issues in the dataverse-ansible and sample data repos for the follow-on tasks to make this useful to more groups (design team, specifically):

We'll take these through our usual development process.

Thanks everyone for the hard work on this issue, this is great to have!!
