BH Compliant Text Reader
Drill provides a variety of text readers: CSV, TSV, PSV and so on. As it turns out, they are all variations on the "compliant text reader". (It is compliant with RFC 4180.) To demonstrate the new scan framework, and the result set loader, this project upgraded the compliant reader.
The `TextFormatPlugin` extends the Easy format plugin to define the compliant text reader. Each specific reader (CSV, etc.) is defined by a specific set of plugin options for the compliant plugin.
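The idea that each format is just a set of options on one shared reader can be sketched in a few lines. This is a minimal standalone illustration, not Drill code: `TextOptions`, the three constants, and `splitLine` are hypothetical names, and the splitting deliberately ignores the quoting and escaping that an RFC 4180 parser must handle.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TextOptionsSketch {
  // Hypothetical option set; NOT Drill's actual plugin config class.
  static final class TextOptions {
    final String name;
    final char delimiter;

    TextOptions(String name, char delimiter) {
      this.name = name;
      this.delimiter = delimiter;
    }
  }

  // Each "format" is only a different option set for the same reader.
  static final TextOptions CSV = new TextOptions("csv", ',');
  static final TextOptions TSV = new TextOptions("tsv", '\t');
  static final TextOptions PSV = new TextOptions("psv", '|');

  // One splitting routine serves every format.
  // Simplification: no quote or escape handling.
  static List<String> splitLine(String line, TextOptions options) {
    String regex = Pattern.quote(String.valueOf(options.delimiter));
    // Limit -1 keeps trailing empty fields instead of dropping them.
    return Arrays.asList(line.split(regex, -1));
  }
}
```

Calling `TextOptionsSketch.splitLine("a|b|c", TextOptionsSketch.PSV)` and `TextOptionsSketch.splitLine("a,b,c", TextOptionsSketch.CSV)` runs the same code path with different options.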
Key changes included:
- Use the new `EasyFormatConfig` to configure the plugin.
- Implement the `scanBatchCreator()` method to create the required scan framework.
- Remove methods associated with the prior text record reader class.
The scan framework is assembled using the new `TextScanBatchCreator` nested class. Primary tasks:
- Create a `columns`-aware file scan framework.
- Determine if file headers are to be provided by the reader.
- Specify that the null type is `VarChar`. (Text readers can never produce nullable `INT` columns, so `VarChar` is a better guess. Missing values will be empty, consistent with the fact that text files don't support NULLs.)
- For backward compatibility, specify to use the Drill 1.11 position for partitions. (This line allows existing QA tests to pass. Once the check is committed, this line should be removed and QA tests rebased accordingly.)
As has been noted, the new scan framework creates readers as needed, rather than up front as in the legacy version. The text format plugin must therefore provide a class that creates a reader on request. For simplicity, the text format plugin itself implements the `FileReaderCreator` interface and the `makeBatchReader()` method to create the actual batch reader.
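The difference between creating readers up front and creating them on request can be sketched as follows. The names here (`BatchReader`, `ReaderFactory`, `Scan`, `makeReader`) are hypothetical stand-ins chosen for illustration; they are not the signatures of Drill's `FileReaderCreator` or of the scan framework.

```java
import java.util.Iterator;
import java.util.List;

class LazyReaderSketch {
  // Hypothetical stand-ins; these are NOT Drill's actual interfaces.
  interface BatchReader {
    String describe();
  }

  interface ReaderFactory {
    BatchReader makeReader(String filePath);
  }

  // The scan holds only file paths and builds each reader on request,
  // so no reader exists until the scan is ready for that file.
  static class Scan {
    private final Iterator<String> files;
    private final ReaderFactory factory;
    int readersCreated = 0;

    Scan(List<String> files, ReaderFactory factory) {
      this.files = files.iterator();
      this.factory = factory;
    }

    // Returns the next reader, or null once every file has been handed out.
    BatchReader nextReader() {
      if (!files.hasNext()) {
        return null;
      }
      readersCreated++;
      return factory.makeReader(files.next());
    }
  }
}
```

In this sketch the factory plays the role the text format plugin plays in the description above: the scan asks it for one reader at a time, and `readersCreated` stays at zero until the first request.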
The `CompliantTextBatchReader` class replaces the prior `TextRecordReader` class to do the work of reading a batch using the result set loader.
The changes to this class were pretty straightforward:
- Rip out the code that implemented direct memory access to write to vectors.
- Replace the code with calls to the result set loader.
- Change the code to read records until the result set loader reports that it is full, rather than reading a fixed number of records.
- Clean up some error handling.
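The "read until full" change can be sketched with a toy writer that enforces a byte budget. `BatchWriter`, `isFull()`, `writeRow()`, and `fillBatch()` are hypothetical names for illustration only and do not reproduce the result set loader's API; the point is just that the writer, not a fixed row count, decides when the batch ends.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class FillBatchSketch {
  // Hypothetical stand-in for a batch writer with a size budget.
  static class BatchWriter {
    private final int maxBytes;
    private int usedBytes = 0;
    final List<String> rows = new ArrayList<>();

    BatchWriter(int maxBytes) {
      this.maxBytes = maxBytes;
    }

    boolean isFull() {
      return usedBytes >= maxBytes;
    }

    void writeRow(String row) {
      rows.add(row);
      usedBytes += row.length();
    }
  }

  // Writes rows until the writer reports full or the input runs out.
  // Returns true if the batch filled up (more input may remain),
  // false if the input was exhausted first.
  static boolean fillBatch(Iterator<String> input, BatchWriter writer) {
    while (!writer.isFull()) {
      if (!input.hasNext()) {
        return false;
      }
      writer.writeRow(input.next());
    }
    return true;
  }
}
```

With short rows a batch in this sketch holds many records and with long rows only a few, which is the behavior a fixed record count cannot provide.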
The `FieldVarCharOutput` class handles the case in which the file provides headers. It was modified to use the result set loader to write each column.
The `RepeatedVarCharOutput` class handles the case of using the `columns[]` array, by writing to a `VarChar` array using the result set loader.
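The two output shapes can be contrasted with a small standalone sketch. It uses plain Java collections rather than Drill vectors, and `asNamedColumns` and `asColumnsArray` are hypothetical helper names. As a simplification, fields beyond the header count are ignored here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutputModeSketch {
  // With headers: each field lands in its own named string column.
  static Map<String, String> asNamedColumns(List<String> headers, List<String> fields) {
    Map<String, String> row = new LinkedHashMap<>();
    for (int i = 0; i < headers.size(); i++) {
      // Missing trailing fields become empty strings, not nulls.
      row.put(headers.get(i), i < fields.size() ? fields.get(i) : "");
    }
    return row;
  }

  // Without headers: the whole record becomes a single array-valued
  // column named "columns".
  static Map<String, List<String>> asColumnsArray(List<String> fields) {
    Map<String, List<String>> row = new LinkedHashMap<>();
    row.put("columns", fields);
    return row;
  }
}
```

For the record `1,2` and headers `a,b,c`, the first helper yields three named values with `c` empty, while the second yields one `columns` entry holding both fields.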
Frankly, the revised implementation seems to work fine. A prior version of the code (before adding the JSON reader) passed all the Drill unit tests and the MapR pre-commit tests. This is one part of the project that can be considered done.