Tooling to download and process Wikis (#51)
- Add tools to scrape MediaWiki wikis that don't publish dumps.
- Add a tool that exports the XML based on the list of pages.
- Add the ability to convert wikis to dolma.
- Download and extract script supports multiple workers.
- Create a WTF Wikipedia parsing server that uses a worker pool to allow for timeouts.
- Create a script that removes HTML tags we found in many wiki dumps.
- Add shadow paging to the creation of wikitext dolma files.
- Add shadow paging to dolma preprocessing.
- Add a script that removes `None` lines from dolma files.
- Add a script that can combine dolma shards while tracking what was used where, to allow for aligned combinations of later versions.
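As a rough illustration of two items in the commit message (shadow paging and removing `None` lines), here is a minimal sketch. It is not the code from this commit. It assumes that "shadow paging" means writing the new version to a temporary copy and swapping it in only once it is complete, that a dolma shard is a plain-text JSON Lines file, and that a "`None` line" is a line whose entire content is the literal text `None`. The function name `remove_none_lines` is hypothetical.

```python
import os
import tempfile


def remove_none_lines(path: str) -> int:
    """Rewrite `path` without literal `None` lines; return how many were dropped.

    Hypothetical helper: the name and the exact definition of a "None line"
    are assumptions, not taken from the commit.
    """
    dropped = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Create the shadow copy next to the original so the final rename
    # stays on the same filesystem.
    fd, shadow_path = tempfile.mkstemp(dir=directory, suffix=".shadow")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as shadow, open(
            path, encoding="utf-8"
        ) as src:
            for line in src:
                if line.strip() == "None":
                    dropped += 1
                    continue
                shadow.write(line)
        # Swap the finished shadow copy into place in a single step.
        os.replace(shadow_path, path)
    except BaseException:
        # On any failure, discard the shadow copy; the original is untouched.
        if os.path.exists(shadow_path):
            os.remove(shadow_path)
        raise
    return dropped
```

The point of this design is that an interrupted run leaves the original shard intact, because the original is only replaced after the cleaned copy has been fully written.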
1 parent 9a4d292, commit f567cd1
Showing 51 changed files with 4,450 additions and 133 deletions.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -160,3 +160,6 @@ cython_debug/` |
| 160 | 160 | `#.idea/` |
| 161 | 161 | `.python-version` |
| 162 | 162 | `**/licensed_pile_log.txt` |
| | 163 | `+` |
| | 164 | `+ node_modules` |
| | 165 | `+ package-lock.json` |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -0,0 +1 @@` |
| | 1 | `+ shard_to_*.json` |