Commit
Faster HF dataset iteration in docs (#1414)
* Faster HF dataset iteration in docs

* Nit
mariosasko authored Dec 14, 2023
1 parent 8edec53 commit 1146259
Showing 1 changed file with 4 additions and 2 deletions.
@@ -70,8 +70,10 @@ def test_datasets(self):

         # START def_batch_iterator
         def batch_iterator(batch_size=1000):
-            for i in range(0, len(dataset), batch_size):
-                yield dataset[i : i + batch_size]["text"]
+            # Only keep the text column to avoid decoding the rest of the columns unnecessarily
+            tok_dataset = dataset.select_columns("text")
+            for batch in tok_dataset.iter(batch_size):
+                yield batch["text"]

         # END def_batch_iterator
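
For illustration only, the effect of the change can be modelled with a toy, pure-Python dataset. `ToyDataset`, its `stats` counter, and the two iterator functions below are hypothetical stand-ins written for this note. They are not the `datasets` API, and they assume, as the new code comment states, that slicing rows decodes every column. The sketch shows that both iterators yield the same text batches, while the column-selected version never touches the other column.

```python
from collections import Counter


class ToyDataset:
    """Hypothetical stand-in for a column-oriented dataset (not the real library)."""

    def __init__(self, columns, stats):
        self.columns = columns  # dict: column name -> list of raw values
        self.stats = stats      # Counter of decoded cells per column

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        # Assumption: slicing rows decodes every column that is present.
        batch = {}
        for name, values in self.columns.items():
            chunk = list(values[key])
            self.stats[name] += len(chunk)
            batch[name] = chunk
        return batch

    def select_columns(self, name):
        # Keep a single column; the decode counter is shared with the parent.
        return ToyDataset({name: self.columns[name]}, self.stats)

    def iter(self, batch_size):
        for i in range(0, len(self), batch_size):
            yield self[i : i + batch_size]


def old_batch_iterator(dataset, batch_size=1000):
    # Pattern removed by this commit: slice all columns, then pick "text".
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]


def new_batch_iterator(dataset, batch_size=1000):
    # Pattern added by this commit: drop the other columns before iterating.
    tok_dataset = dataset.select_columns("text")
    for batch in tok_dataset.iter(batch_size):
        yield batch["text"]


data = {
    "text": ["a", "b", "c", "d", "e"],
    "image": [b"0", b"1", b"2", b"3", b"4"],  # made-up extra column
}

old_stats = Counter()
old_batches = list(old_batch_iterator(ToyDataset(data, old_stats), batch_size=2))

new_stats = Counter()
new_batches = list(new_batch_iterator(ToyDataset(data, new_stats), batch_size=2))

print(old_batches == new_batches)
print(dict(old_stats), dict(new_stats))
```

In this toy model the saving grows with the number and cost of the non-text columns; how much the real library saves depends on the dataset's features.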
