
list of contributions #2

Open
xrotwang opened this issue Sep 9, 2020 · 14 comments
@xrotwang (Member)

xrotwang commented Sep 9, 2020

My ideal for the list of contributions to be included in CrossGram would be a simple list of Zenodo DOIs for (particular versions of) datasets. (I have some code - to be included in cldf-zenodo - to fetch such datasets from Zenodo.)
The CrossGram app would then function mostly as a (selective) catalogue of CLDF datasets on Zenodo, augmented with as much visualization as is generically possible.

@johenglisch (Collaborator)

Yeah, that makes sense.

Just for my own brain's sake, this is how I imagine this to happen:

  1. CrossGram editors specify a Zenodo DOI in contributions.json (as opposed to a folder name)
  2. clld initdb downloads the data into a cache folder
  3. clld initdb then loads the CLDF data into the database
  4. The cache folder is kept around, so one can repopulate the database without re-downloading the same stuff over and over again

Although I would still keep the option around to include local repositories (for testing unpublished datasets in a local instance).
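The four steps above could be sketched roughly like this (the function name, the DOI heuristic, and the cache layout are all hypothetical, not actual clld API):

```python
import pathlib


def resolve_dataset(spec, cache_dir=pathlib.Path('cache')):
    """Map a contributions.json entry to a local dataset directory.

    `spec` is either a Zenodo DOI (downloaded into `cache_dir` on first
    use, re-used afterwards) or a local folder name for testing
    unpublished datasets.
    """
    if spec.startswith('10.'):  # DOIs all start with the '10.' prefix
        target = cache_dir / spec.replace('/', '_')
        if not target.exists():  # step 4: re-use cached downloads
            target.mkdir(parents=True)
            # step 2 would fetch the dataset here, e.g. via cldf-zenodo
        return target
    return pathlib.Path(spec)  # local repository, used as-is
```

`clld initdb` (step 3) would then load CLDF data from whatever directory this returns.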

@xrotwang (Member, Author)

What about something like this:

  1. initializedb.main prompts for a contributions.json location - defaulting to the one in the repos.
  2. Items in contributions.json are either DOIs or directory names (relative to the parent of contributions.json).

Or maybe we hardwire two contributions.json locations and switch between them with an internal switch, like with dictionaria.

@xrotwang (Member, Author)

Actually, we might want to push the analogy with dictionaria even further, i.e. have a separate repository for the editorial backend - even though right now this might just look like

├── README.md
├── submissions
│   └── contributions.json
├── submissions-internal
│   └── contributions.json

I'd guess we'll want to include at least a short textual description with each dataset, so there may be additional content here.

@johenglisch (Collaborator)

Sounds good. We could push the analogy even further and split contributions.json into per-submission metadata files:

|-- README.md
|-- submissions
|   |-- dryerorder
|   |   |-- intro.md
|   |   |-- md.json
|-- submissions-internal
|   |-- haspelmathcomparison
|   |   |-- intro.md
|   |   |-- md.json
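Consuming such a layout would be straightforward; a sketch (the function is hypothetical, assuming `md.json` holds the per-submission metadata and the directory name serves as submission id):

```python
import json
import pathlib


def iter_submissions(submissions_dir):
    """Yield (submission id, metadata dict) for every */md.json below
    `submissions_dir`, using the directory name as submission id."""
    for md_path in sorted(pathlib.Path(submissions_dir).glob('*/md.json')):
        yield md_path.parent.name, json.loads(md_path.read_text(encoding='utf8'))
```

The accompanying intro.md could be read the same way for the textual description.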

@xrotwang (Member, Author)

True. I'm a bit torn, though, because I wouldn't want CrossGram to add information to datasets that should already be included in the dataset on Zenodo. But then, I can already think of

  • DOI
  • publication data

and possibly

  • responsible editor

@johenglisch (Collaborator)

Yeah, it definitely should not contain the wealth of information that Dictionaria's metadata files provide.

I wouldn't make the metadata much more elaborate than the entries in the current contributions.json:

    {
        "id": "haspelmathcomparison",
        "number": 1,
        "published": "2020-09-08",
        "repo": "../../cldf-datasets/haspelmathcomparison",
        "authors": [
            "Martin Haspelmath",
            "The Comparative Constructions Team"
        ]
    }

(authors and repo will probably be replaced by doi once the dataset is published)

@xrotwang (Member, Author)

Yes. I'm not really happy with the path for repo - but also wouldn't want to go as far as using git submodules. Maybe we should have submissions-internal/datasets/ (hidden via .gitignore) and repo must be a subdirectory name in there?

@xrotwang (Member, Author)

Or actually, have datasets/ in the repos root and use it as the cache dir for the DOIs, too.

@johenglisch (Collaborator)

Yup, that was what I was about to suggest, too.

@xrotwang (Member, Author)

xrotwang commented Sep 10, 2020

Btw, here's my code to download from Zenodo (we might want to include it here until I get around to finishing the cldf-zenodo package):

import io
import re
import json
import pathlib
import zipfile
import urllib.request

from bs4 import BeautifulSoup as bs
import requests


def download_from_doi(doi, outdir=pathlib.Path('.')):
    # Resolve the DOI; it must redirect to a Zenodo record page.
    res = requests.get('https://doi.org/{0}'.format(doi))
    assert re.search(r'zenodo\.org/record/[0-9]+$', res.url)
    # Scrape the record's JSON export for the deposit metadata.
    res = requests.get(res.url + '/export/json')
    soup = bs(res.text, 'html.parser')
    res = json.loads(soup.find('pre').text)
    # Only accept deposits tagged as CLDF datasets.
    assert any(kw.startswith('cldf:') for kw in res['metadata']['keywords'])
    for f in res['files']:
        if f['type'] == 'zip':
            # Unpack zip archives into outdir.
            r = requests.get(f['links']['self'], stream=True)
            z = zipfile.ZipFile(io.BytesIO(r.content))
            z.extractall(str(outdir))
        elif f['type'] == 'gz':
            # what about a tar in there?
            raise NotImplementedError()
        else:
            # Download any other file as-is.
            urllib.request.urlretrieve(
                f['links']['self'],
                str(outdir / f['links']['self'].split('/')[-1]),
            )
    return outdir

The resulting directory can then be searched for datasets using pycldf.iter_datasets.

@johenglisch (Collaborator)

Huh… pycldf.iter_datasets seems great -- now I feel stupid for my

if (path / 'StructureDataset-metadata.json').exists():
    ...
elif (path / 'cldf-metadata.json').exists():
    ...

(<_<)"

@xrotwang (Member, Author)

I feel stupid for not specifying something like "known locations" for the metadata files in the CLDF standard :)

@johenglisch (Collaborator)

Well, checking for the dc:conformsTo bit feels more reliable to me.
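That check could look roughly like this (a sketch only - pycldf.iter_datasets is the real implementation; this just illustrates the dc:conformsTo idea):

```python
import json
import pathlib

# CLDF module URIs all live under this namespace.
CLDF_TERMS = 'http://cldf.clld.org/v1.0/terms.rdf#'


def iter_metadata_files(directory):
    """Yield paths of JSON files whose dc:conformsTo property points at
    a CLDF module, regardless of what the files are named."""
    for path in sorted(pathlib.Path(directory).glob('**/*.json')):
        try:
            md = json.loads(path.read_text(encoding='utf8'))
        except (ValueError, UnicodeDecodeError):
            continue  # not (valid) JSON, skip
        if isinstance(md, dict) and str(md.get('dc:conformsTo', '')).startswith(CLDF_TERMS):
            yield path
```

This finds StructureDataset-metadata.json, cldf-metadata.json, or any other name, as long as the file declares conformance to a CLDF module.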
