-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data Provenance data #61
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be marked as a WIP? There seems to be a lot of things, like the include.txt
missing?
@blester125 could you take a look at this pull request when you get a chance? The only thing I was unable to test is the |
Also, how do we run the default linter? |
@shayne-longpre run pre-commit |
Could you direct me on how to run that? |
|
Done. Thanks! |
We should just setup a pre-commit.ci here to do autoformat PRs. I'd be happy to do it if I can get proper permissions |
We already have a CI that blocks merges until the formatting is correct (uses black and isort), so I don't think we need to have the CI actually re-write commits. The tools are pretty reliable, but I think using pre-commit (and thus having people need to spot check formatting changes by adding them back to git staging) is less error prone. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for getting this in! There are a few tweaks to make but we're pretty close!
Still hitting this issue cc @blester125 |
🤔 What version of things are you using? I'm on python3.11 and I was able to install both dolma 0.9.1 and dolma 1.0.3 on ubuntu 22.04 and 23.04 respectively. Maybe try to see if you can bump your python version so that there is a pre-built wheel for it? The list of wheels are here https://pypi.org/project/dolma/#files |
@blester125 I've reviewed the new licenses with Aviya and we trimmed them down further for an abundance of caution. The number of included datasets is now ~340, as compared to ~500 before. Everything appears to be working! |
@@ -24,7 +29,16 @@ class PermissiveLicenses(StringEnum): | |||
GFDL = "GNU Free Documentation License" | |||
APACHE_2 = "Apache 2 License - https://www.apache.org/licenses/LICENSE-2.0" | |||
MIT = "MIT License" | |||
BSD = "BSD License" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any source in here already that was BSD licensed? If so we should make sure to update it to use one of the BSD_2 or BSD_3 to avoid breakage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any code datasources have been checked in yet so I don't think anyone would have used it.
I pulled the branch to double check that the dolma stuff is working, but when I run download with the Also, how much effort would it be to setup so it runs from the data_provenance dir instead of one level up where we have to set the |
@blester125 sorry the new HuggingFace object key is "dataset" not "user_parent"! I just updated it -- thanks for catching. As for |
This commit does the following: * Updates the data_provenance code so it will be run from its dir instead of the repo root. * Updates the include_test.csv as it didn't match the include.csv (was missing the `GitHub License` column) * Adds some logging
I just pushed a commit that fixes some issues I found while testing (the include_test.csv was missing the 'GitHub License I was able to run the download and to dolma script with the new |
Adding the Data Provenance data.