Data Provenance data #61

shayne-longpre · 2024-04-05T02:50:41Z

Adding the Data Provenance data.

blester125

Should this be marked as a WIP? There seem to be a lot of things missing, like the include.txt?

data_provenance/download.py

shayne-longpre · 2024-04-26T04:46:16Z

@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the to_dolma call as I'm unable to pip install dolma for some reason... any ideas?

shayne-longpre · 2024-04-26T17:37:30Z

Also, how do we run the default linter?

Skylion007 · 2024-04-26T19:25:03Z

@shayne-longpre run pre-commit

shayne-longpre · 2024-04-26T19:39:20Z

> @shayne-longpre run pre-commit

Could you direct me on how to run that?

Skylion007 · 2024-04-26T19:59:30Z

pip install pre-commit
pre-commit install
pre-commit run -a

shayne-longpre · 2024-04-27T19:59:09Z

> pip install pre-commit
> pre-commit install
> pre-commit run -a

Done. Thanks!

Skylion007 · 2024-04-27T21:14:06Z

We should just set up pre-commit.ci here to autoformat PRs. I'd be happy to do it if I can get proper permissions.

blester125 · 2024-05-01T15:03:59Z

> We should just set up pre-commit.ci here to autoformat PRs. I'd be happy to do it if I can get proper permissions.

We already have CI that blocks merges until the formatting is correct (it uses black and isort), so I don't think we need the CI to actually rewrite commits. The tools are pretty reliable, but I think using pre-commit locally (so that people have to spot-check formatting changes when they add them back to git staging) is less error-prone.

blester125

Thanks for getting this in! There are a few tweaks to make but we're pretty close!

data_provenance/download.py

data_provenance/to-dolma.py

shayne-longpre · 2024-05-04T16:50:50Z

> @blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the to_dolma call as I'm unable to pip install dolma for some reason... any ideas?

Still hitting this issue cc @blester125

blester125 · 2024-05-06T14:11:38Z

> > @blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the to_dolma call as I'm unable to pip install dolma for some reason... any ideas?
>
> Still hitting this issue cc @blester125

🤔 Which versions are you using? I'm on Python 3.11 and I was able to install both dolma 0.9.1 and dolma 1.0.3, on Ubuntu 22.04 and 23.04 respectively.

Maybe try bumping your Python version so that there is a pre-built wheel for it? The list of wheels is here: https://pypi.org/project/dolma/#files
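To compare a local interpreter against the wheel list linked above, something like the following may help. It is a minimal sketch using only the standard library; the `interpreter_summary` name is made up for illustration, and the `cp` tag is only meaningful on CPython builds.

```python
import platform
import sys
import sysconfig


def interpreter_summary() -> dict:
    """Collect the interpreter details that usually decide which wheel pip picks."""
    return {
        # CPython wheels are tagged like cp311 for Python 3.11.
        "python_tag": f"cp{sys.version_info.major}{sys.version_info.minor}",
        "implementation": platform.python_implementation(),
        # Platform string such as linux-x86_64 or macosx-14.0-arm64.
        "platform": sysconfig.get_platform(),
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in interpreter_summary().items():
        print(f"{key}: {value}")
```

If no file on the wheel list matches the printed tag and platform, pip falls back to building from source, which is a common reason an install fails.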

shayne-longpre · 2024-05-12T20:17:53Z

@blester125 I've reviewed the new licenses with Aviya and we trimmed them down further out of an abundance of caution.

The number of included datasets is now ~340, as compared to ~500 before. Everything appears to be working!

craffel · 2024-05-13T15:19:27Z

licensed_pile/licenses.py

@@ -24,7 +29,16 @@ class PermissiveLicenses(StringEnum):
    GFDL = "GNU Free Documentation License"
    APACHE_2 = "Apache 2 License - https://www.apache.org/licenses/LICENSE-2.0"
    MIT = "MIT License"
-    BSD = "BSD License"


Is there any source in here already that was BSD licensed? If so, we should make sure to update it to use either BSD_2 or BSD_3 to avoid breakage.

blester125 · 2024-05-13

I don't think any code data sources have been checked in yet, so I don't think anyone would have used it.
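The breakage concern can be illustrated with a small sketch. The `StringEnum` base here is a stand-in (assumed to be a str-valued `Enum`), the `BSD_2` and `BSD_3` strings are placeholders because the hunk does not show their real values, and `is_defined` and `resolve` are hypothetical helpers, not part of `licensed_pile`.

```python
from enum import Enum


class StringEnum(str, Enum):
    """Stand-in for the repo's StringEnum (assumed to be a str-valued Enum)."""


class PermissiveLicenses(StringEnum):
    GFDL = "GNU Free Documentation License"
    APACHE_2 = "Apache 2 License - https://www.apache.org/licenses/LICENSE-2.0"
    MIT = "MIT License"
    # Assumed placeholder strings; the actual values are not shown in the hunk.
    BSD_2 = "BSD 2-Clause License"
    BSD_3 = "BSD 3-Clause License"


def is_defined(member_name: str) -> bool:
    """Return True if the enum still defines a member with this name."""
    return member_name in PermissiveLicenses.__members__


def resolve(member_name: str) -> PermissiveLicenses:
    """Look up a member by name, with a clearer error for removed names."""
    try:
        return PermissiveLicenses[member_name]
    except KeyError:
        raise ValueError(f"{member_name!r} is not a defined license") from None
```

Any source that still referenced `PermissiveLicenses.BSD` would fail with an `AttributeError` at the point of access, so searching the repo for that name before merging is a cheap check.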

blester125 · 2024-05-14T19:49:37Z

I pulled the branch to double check that the dolma stuff is working, but when I run download with the include_test.csv file it seems like neither of the datasets has a user_parent key. Is there a step I missed? The README just says to run download.

Also, how much effort would it be to set it up so it runs from the data_provenance dir instead of one level up, where we have to set the PYTHONPATH?

shayne-longpre · 2024-05-15T00:20:38Z

> user_parent

@blester125 sorry the new HuggingFace object key is "dataset" not "user_parent"! I just updated it -- thanks for catching.

As for PYTHONPATH I'm not clear actually on what the changes are? python paths always confuse me

This commit does the following:
* Updates the data_provenance code so it will be run from its dir instead of the repo root.
* Updates the include_test.csv as it didn't match the include.csv (was missing the `GitHub License` column).
* Adds some logging.
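On the PYTHONPATH question: when Python runs a script, it puts that script's own directory at the front of `sys.path`, so modules sitting next to the script import without any extra configuration. The sketch below demonstrates this in a throwaway directory. The `helper.py` module and the contents of the stand-in `download.py` are hypothetical and unrelated to the real files in this PR.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script_from_other_dir() -> str:
    """Run a script that imports a sibling module, with cwd set elsewhere."""
    with tempfile.TemporaryDirectory() as tmp:
        script_dir = Path(tmp) / "data_provenance"
        script_dir.mkdir()
        # Hypothetical sibling module living next to the script.
        (script_dir / "helper.py").write_text(
            "VALUE = 'imported without PYTHONPATH'\n"
        )
        # Stand-in script; it only imports the sibling and prints a value.
        (script_dir / "download.py").write_text(
            "import helper\nprint(helper.VALUE)\n"
        )
        result = subprocess.run(
            [sys.executable, str(script_dir / "download.py")],
            cwd=tmp,  # deliberately not the script's directory
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(run_script_from_other_dir())
```

The import succeeds because the lookup is relative to the script's location rather than the current working directory. PYTHONPATH is typically only needed when a script imports modules that live outside its own directory and are not installed as a package.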

blester125 · 2024-05-15T13:12:15Z

I just pushed a commit that fixes some issues I found while testing: the include_test.csv was missing the `GitHub License` column, and the code should now be run from the `data_provenance` dir instead of the repo root (which removes the need for the `PYTHONPATH`). The datasets in the test csv also seemed to be using `targets` instead of `labels`, so I updated the code to be able to handle both.

I was able to run the download and to-dolma scripts with the new include_test.csv and the results look good. I think it's ready to merge, but @shayne-longpre should take a quick look at my changes to make sure I didn't miss anything.
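Handling both field names could look roughly like the sketch below. This is not the code from the commit: `get_response` and the `inputs` field are hypothetical, and only the `labels` and `targets` key names come from the discussion above.

```python
from typing import Any, Mapping, Sequence


def get_response(
    example: Mapping[str, Any],
    keys: Sequence[str] = ("labels", "targets"),
) -> Any:
    """Return the value under the first key that the example actually has."""
    for key in keys:
        if key in example:
            return example[key]
    # Fail loudly so a renamed field does not silently produce empty output.
    raise KeyError(f"none of {tuple(keys)} found; available keys: {sorted(example)}")


if __name__ == "__main__":
    print(get_response({"inputs": "question", "labels": "answer A"}))
    print(get_response({"inputs": "question", "targets": "answer B"}))
```

Raising on a miss, rather than returning a default, surfaces schema changes such as the `user_parent` to `dataset` rename at download time instead of in the converted output.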

adding data provenance data

dedaa6a

shayne-longpre requested a review from soldni April 5, 2024 02:50

shayne-longpre mentioned this pull request Apr 9, 2024

Previous Datasets #12

Open

2 tasks

blester125 reviewed Apr 10, 2024

data_provenance/download.py (three outdated review threads)

Shayne Longpre added 2 commits April 25, 2024 14:52

Merge branch 'main' into dpi

d79fc33

all additions

5ef28ea

Shayne Longpre added 2 commits April 27, 2024 15:03

reformatting

2dc4cdd

reformatting

2a43582

shayne-longpre requested a review from blester125 April 28, 2024 15:13

blester125 requested changes May 1, 2024

Shayne Longpre added 3 commits May 4, 2024 11:45

update allow list

5fe31f2

Merge branch 'main' into dpi

da92788

linter

b2e5a71

Shayne Longpre and others added 5 commits May 6, 2024 15:46

wip

d5793a3

removing licenses

f1862fb

wip

d07ad3d

updated requirements

bea3f0c

updated requirements

354ff9e

craffel reviewed May 13, 2024

updated requirements

f07c6a7

Shayne Longpre and others added 2 commits May 14, 2024 20:20

fix user parent

2e097f7

blester125 approved these changes May 15, 2024

fixed licenses

30e0871

shayne-longpre merged commit 28e210b into main May 23, 2024
2 checks passed
