Refactor data modules #225
Labels: bug, enhancement, topic: data
There are several problems with the current data modules:
Filtering has to happen at the data source level. This means one can't discard examples in `process`, but instead must supply a `filter_fn` and wrap the source with a `FilterDataSource`. This is problematic in two ways: 1) users don't have access to the source in predefined sources such as `MonoTextData`; 2) filtering cannot depend on processing results. The second point led to a previous change that made `TextLineDataSource` return tokenized text, because it also needs to filter out sentences containing more than a certain number of words. This prevented users from applying it to arbitrary text, such as JSONL files.
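For reference, the current workaround looks roughly like this (a sketch only: the constructor signatures and the `train.txt` path are assumptions for illustration):

```python
from texar.torch.data import FilterDataSource, TextLineDataSource

# Filtering must be attached at the source level, before any processing:
source = TextLineDataSource("train.txt")  # now yields tokenized lines
source = FilterDataSource(source, filter_fn=lambda tokens: len(tokens) <= 50)

# There is no such hook for predefined datasets: MonoTextData constructs
# its source internally, so users cannot wrap it in a FilterDataSource.
```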
However, it shouldn't be hard to support discarding examples in `process`. For the interface, we can ask the user to return `None` to indicate discarding. The implementation should be straightforward --- just change the `BatchSampler` (and maybe `DynamicBatchSampler`) code.
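A minimal sketch of the proposed interface, assuming a `DatasetBase`-style subclass; the class name `MyData` and the returned fields are illustrative, not part of the current API:

```python
from typing import Optional

from texar.torch.data import DatasetBase


class MyData(DatasetBase):
    def process(self, raw_example: str) -> Optional[dict]:
        tokens = raw_example.split()
        if len(tokens) > 50:
            return None  # discard; the filter can now depend on processing
        return {"tokens": tokens, "length": len(tokens)}


# On the sampler side, the batch-building loop would skip discarded examples:
#     example = dataset[index]   # triggers lazy processing
#     if example is None:
#         continue               # example was discarded by process()
```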
There's a very subtle bug that I just realized. It shows up when a dataset has the following configuration (a concrete setup is sketched after this list):

- The data source does not support random access (e.g., `TextLine...`, `Iter...`).
- Lazy strategy is `"process"`, i.e., data is eagerly loaded but lazily processed.
- Cache strategy is not `"loaded"`, i.e., either don't cache at all, or cache the processed examples.
- A dynamic batching strategy is used (e.g., `TokenCountBatchingStrategy`).
What happens under the hood is (see the sketch after this list):

- The data source is wrapped in a `_CachedDataSource` to support random access by prefetching.
- Since loaded examples are not cached, each example is deleted from the `_CachedDataSource` when it is accessed.
- `reset()` is then called on the `_CachedDataSource`. This will result in the wrapped data source being unnecessarily iterated over again, and could be problematic if the source is an `IterDataSource`.
- […] `_CachedDataSource`.
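To make the failure mode concrete, here is a stripped-down stand-in for `_CachedDataSource` (a toy sketch, not the actual implementation) showing how prefetch-based random access, erase-on-access, and `reset()` interact:

```python
from typing import Callable, Dict, Iterator


class CachedSourceSketch:
    """Toy stand-in for _CachedDataSource: fakes random access over a plain
    iterator by prefetching, erasing each example once it has been accessed."""

    def __init__(self, make_iter: Callable[[], Iterator[str]]):
        self._make_iter = make_iter
        self._iter = make_iter()
        self._cache: Dict[int, str] = {}
        self._next_index = 0

    def __getitem__(self, index: int) -> str:
        while index >= self._next_index:      # prefetch up to `index`
            self._cache[self._next_index] = next(self._iter)
            self._next_index += 1
        return self._cache.pop(index)         # erased on access

    def reset(self) -> None:
        # Restarts iteration from scratch: the wrapped source is consumed
        # all over again. A one-shot source (e.g., an IterDataSource over a
        # generator) cannot be restarted, so its data is simply gone.
        self._iter = self._make_iter()
        self._cache.clear()
        self._next_index = 0


src = CachedSourceSketch(lambda: iter(["a", "b", "c"]))
assert src[1] == "b"   # prefetches "a" and "b"; "b" is erased after access
src.reset()            # forces a second full pass over the wrapped source
```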
This case was not covered in our tests, because there are simply too many combinations when it comes to the data module. The current design requires a lot of checks and special handling of corner cases, and is not extensible to new features.
A possible new design could be: […]