
list of contributions #2

Open
xrotwang opened this issue Sep 9, 2020 · 14 comments
@xrotwang (Member)

xrotwang commented Sep 9, 2020

My ideal for the list of contributions to be included in CrossGram would be a simple list of Zenodo DOIs for (particular versions of) datasets. (I have some code - to be included in cldf-zenodo - to fetch such datasets from Zenodo.)
The CrossGram app would then function mostly as a (selective) catalogue of CLDF datasets on Zenodo, augmented with as much visualization as is generically possible.

@johenglisch (Collaborator)

Yeah, that makes sense.

Just for my own brain's sake, this is how I imagine this to happen:

  1. CrossGram editors specify a Zenodo DOI in contributions.json (as opposed to a folder name)
  2. clld initdb downloads the data into a cache folder
  3. clld initdb then loads the CLDF data into the database
  4. The cache folder is kept around, so one can repopulate the database without re-downloading the same stuff over and over again

Although I would still keep the option around to include local repositories (for testing unpublished datasets in a local instance).
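The four steps above could be sketched roughly like this (the function name, the DOI heuristic, and the cache layout are all hypothetical, not actual clld API):

```python
import pathlib


def resolve_dataset(spec, cache_dir=pathlib.Path('cache')):
    """Map a contributions.json entry to a local dataset directory.

    `spec` is either a Zenodo DOI (downloaded into `cache_dir` on first
    use, re-used afterwards) or a local folder name for testing
    unpublished datasets.
    """
    if spec.startswith('10.'):  # DOIs all start with the '10.' prefix
        target = cache_dir / spec.replace('/', '_')
        if not target.exists():  # step 4: re-use cached downloads
            target.mkdir(parents=True)
            # step 2 would fetch the dataset here, e.g. via cldf-zenodo
        return target
    return pathlib.Path(spec)  # local repository, used as-is
```

`clld initdb` (step 3) would then load CLDF data from whatever directory this returns.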

@xrotwang (Member, Author)

What about something like this:

  1. initializedb.main prompts for a contributions.json location - defaulting to the one in the repos.
  2. Items in contributions.json are either DOIs or directory names (relative to the parent of contributions.json).

Or maybe we hardwire two contributions.json locations and switch between them with an internal switch, like with dictionaria.

@xrotwang (Member, Author)

Actually, we might want to push the analogy with dictionaria even further, i.e. have a separate repository for the editorial backend - even though right now this might just look like

├── README.md
├── submissions
│   └── contributions.json
├── submissions-internal
│   └── contributions.json

I'd guess we'll want to include at least a short textual description with each dataset, so there may be additional content here.

@johenglisch (Collaborator)

Sounds good. We could push the analogy even further and split contributions.json into per-submission metadata files:

|-- README.md
|-- submissions
|   |-- dryerorder
|   |   |-- intro.md
|   |   |-- md.json
|-- submissions-internal
|   |-- haspelmathcomparison
|   |   |-- intro.md
|   |   |-- md.json
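Consuming such a layout would be straightforward; a sketch (the function is hypothetical, assuming `md.json` holds the per-submission metadata and the directory name serves as submission id):

```python
import json
import pathlib


def iter_submissions(submissions_dir):
    """Yield (submission id, metadata dict) for every */md.json below
    `submissions_dir`, using the directory name as submission id."""
    for md_path in sorted(pathlib.Path(submissions_dir).glob('*/md.json')):
        yield md_path.parent.name, json.loads(md_path.read_text(encoding='utf8'))
```

The accompanying intro.md could be read the same way for the textual description.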

@xrotwang (Member, Author)

True. I'm a bit torn, though, because I wouldn't want CrossGram to add information to datasets that should already be included in the dataset on Zenodo. But then, I can already think of

  • DOI
  • publication data

and possibly

  • responsible editor

@johenglisch (Collaborator)

Yeah, it definitely should not contain the wealth of information that Dictionaria's metadata files provide.

I wouldn't make the metadata much more elaborate than the entries in the current contributions.json:

    {
        "id": "haspelmathcomparison",
        "number": 1,
        "published": "2020-09-08",
        "repo": "../../cldf-datasets/haspelmathcomparison",
        "authors": [
            "Martin Haspelmath",
            "The Comparative Constructions Team"
        ]
    }

(authors and repo will probably be replaced by doi once the dataset is published)

@xrotwang (Member, Author)

Yes. I'm not really happy with the path for repo - but also wouldn't want to go as far as using git submodules. Maybe we should have submissions-internal/datasets/ (hidden via .gitignore) and repo must be a subdirectory name in there?

@xrotwang (Member, Author)

Or actually, have datasets/ in the repos root and use it as the cache dir for the DOIs, too.

@johenglisch (Collaborator)

Yup, that was what I was about to suggest, too.

@xrotwang (Member, Author)

xrotwang commented Sep 10, 2020

Btw, here's my code to download from Zenodo (we might want to include it here until I get around to finishing the cldf-zenodo package):

import io
import re
import json
import pathlib
import zipfile
import urllib.request

from bs4 import BeautifulSoup as bs
import requests


def download_from_doi(doi, outdir=pathlib.Path('.')):
    # Resolve the DOI; it must redirect to a Zenodo record page.
    res = requests.get('https://doi.org/{0}'.format(doi))
    assert re.search(r'zenodo\.org/record/[0-9]+$', res.url)
    # Scrape the record's JSON export for the deposit metadata.
    res = requests.get(res.url + '/export/json')
    soup = bs(res.text, 'html.parser')
    res = json.loads(soup.find('pre').text)
    # Only accept deposits tagged as CLDF datasets.
    assert any(kw.startswith('cldf:') for kw in res['metadata']['keywords'])
    for f in res['files']:
        if f['type'] == 'zip':
            # Unpack zip archives into outdir.
            r = requests.get(f['links']['self'], stream=True)
            z = zipfile.ZipFile(io.BytesIO(r.content))
            z.extractall(str(outdir))
        elif f['type'] == 'gz':
            # what about a tar in there?
            raise NotImplementedError()
        else:
            # Download any other file as-is.
            urllib.request.urlretrieve(
                f['links']['self'],
                str(outdir / f['links']['self'].split('/')[-1]),
            )
    return outdir

The resulting directory can then be searched for datasets using pycldf.iter_datasets.

@johenglisch (Collaborator)

Huh… pycldf.iter_datasets seems great -- now I feel stupid for my

if (path / 'StructureDataset-metadata.json').exists():
    ...
elif (path / 'cldf-metadata.json').exists():
    ...

(<_<)"

@xrotwang (Member, Author)

I feel stupid for not specifying something like "known locations" for the metadata files in the CLDF standard :)

@johenglisch (Collaborator)

Well, checking for the dc:conformsTo bit feels more reliable to me.
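That check could look roughly like this (a sketch only - pycldf.iter_datasets is the real implementation; this just illustrates the dc:conformsTo idea):

```python
import json
import pathlib

# CLDF module URIs all live under this namespace.
CLDF_TERMS = 'http://cldf.clld.org/v1.0/terms.rdf#'


def iter_metadata_files(directory):
    """Yield paths of JSON files whose dc:conformsTo property points at
    a CLDF module, regardless of what the files are named."""
    for path in sorted(pathlib.Path(directory).glob('**/*.json')):
        try:
            md = json.loads(path.read_text(encoding='utf8'))
        except (ValueError, UnicodeDecodeError):
            continue  # not (valid) JSON, skip
        if isinstance(md, dict) and str(md.get('dc:conformsTo', '')).startswith(CLDF_TERMS):
            yield path
```

This finds StructureDataset-metadata.json, cldf-metadata.json, or any other name, as long as the file declares conformance to a CLDF module.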
