
EVF Tutorial Late Schema

Paul Rogers edited this page May 18, 2019 · 1 revision

"Late Schema" Readers

The tutorial has thus far focused on the log reader, which is an example of an "early schema" reader: the log reader has sufficient information to declare a schema before reading the file. Many plugins are of this form: CSV, Parquet, JDBC, etc.

However, other readers are truly "schema on read": they discover columns only as the data is read. JSON is the classic example of such a "late schema" reader. Here we'll look at how to support schema discovery on the fly.

No Up-Front Reader Schema

Almost everything we've discussed stays the same, except that we don't call setTableSchema() on our SchemaNegotiator instance; we just let EVF create a result set loader with no columns:

  @Override
  public boolean open(FileSchemaNegotiator negotiator) {
    ...
    // No call to setTableSchema()
    loader = negotiator.build();
    ...
    return true;
  }

Column Discovery

With late schema, we define columns on the fly. The JSON reader (not yet in master) is an example. Here is a highly simplified version, assuming all columns are VarChar:

  void readNextColumn(TupleWriter writer) {
    String value = ...;  // Get the value
    String name = ...;   // Get the column name

    // Look up the writer by name; null means the column is not yet defined
    ScalarWriter colWriter = writer.scalar(name);
    if (colWriter == null) {
      // First time we see this column: define it as nullable VarChar and add it
      ColumnMetadata colSchema = MetadataUtils.newScalar(name, MinorType.VARCHAR, DataMode.OPTIONAL);
      int colIndex = writer.addColumn(colSchema);
      colWriter = writer.scalar(colIndex);
    }
    colWriter.setString(value);
  }

Here we obtain the column writer by name. If the column has not yet been defined, we get null. In that case, we create the column: we define its metadata, add it to the tuple writer, and retrieve the newly created column writer using the index that addColumn() returns. Finally, we write the value to the underlying column vector.
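To make the lookup-then-add pattern concrete without a Drill installation, here is a minimal, self-contained sketch. It is a toy model, not the EVF API: LateSchemaSketch, ToyColumn, and ToyRowWriter are hypothetical names invented for illustration, and the back-filling of nulls for rows written before a column appeared is an assumption of this toy, chosen so that every column ends up with one value per row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LateSchemaSketch {

  /** Toy stand-in for a column writer: holds one nullable string per row. */
  static class ToyColumn {
    final List<String> values = new ArrayList<>();
  }

  /** Toy stand-in for a tuple writer whose schema grows as columns appear. */
  static class ToyRowWriter {
    private final Map<String, ToyColumn> columns = new LinkedHashMap<>();
    private int rowCount;

    /** Returns the column, or null if it has not been defined yet. */
    ToyColumn column(String name) { return columns.get(name); }

    /** Defines a new column and back-fills nulls for rows already written. */
    ToyColumn addColumn(String name) {
      ToyColumn col = new ToyColumn();
      for (int i = 0; i < rowCount; i++) { col.values.add(null); }
      columns.put(name, col);
      return col;
    }

    /** Writes one row, discovering columns on the fly. */
    void writeRow(Map<String, String> row) {
      for (Map.Entry<String, String> e : row.entrySet()) {
        ToyColumn col = column(e.getKey());
        if (col == null) {
          col = addColumn(e.getKey());   // first time we see this name
        }
        col.values.add(e.getValue());
      }
      rowCount++;
      // Pad any known column that this row did not mention.
      for (ToyColumn col : columns.values()) {
        if (col.values.size() < rowCount) { col.values.add(null); }
      }
    }

    List<String> schema() { return new ArrayList<>(columns.keySet()); }
    List<String> values(String name) { return columns.get(name).values; }
  }

  /** Writes two rows; column "b" first appears in the second row. */
  static ToyRowWriter demo() {
    ToyRowWriter writer = new ToyRowWriter();
    Map<String, String> row1 = new LinkedHashMap<>();
    row1.put("a", "1");
    writer.writeRow(row1);
    Map<String, String> row2 = new LinkedHashMap<>();
    row2.put("a", "2");
    row2.put("b", "x");
    writer.writeRow(row2);
    return writer;
  }

  public static void main(String[] args) {
    ToyRowWriter writer = demo();
    System.out.println(writer.schema());     // [a, b]
    System.out.println(writer.values("a"));  // [1, 2]
    System.out.println(writer.values("b"));  // [null, x]
  }
}
```

The point of the sketch is only the control flow: the reader never declares a schema up front, and the schema after the second row is whatever the data happened to contain.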

There is actually a bit of magic going on here. Suppose our column is foo, but our query was SELECT bar FROM .... When we add the foo column, we'll get back a dummy column writer, since the query does not actually need the value of the foo column. As before, if we care whether the column is projected, we can use the isProjected() method to find out.
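The dummy-writer idea can be illustrated with another small toy model. Again, this is a sketch under stated assumptions rather than EVF itself: ProjectionSketch, ToyWriter, RealWriter, DummyWriter, and ToyRowWriter are hypothetical names, and the projection list is modeled as a plain set of column names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProjectionSketch {

  /** Toy column writer interface (hypothetical, not the EVF ScalarWriter). */
  interface ToyWriter {
    void setString(String value);
    boolean isProjected();
  }

  /** Backs a projected column: values are actually stored. */
  static class RealWriter implements ToyWriter {
    final List<String> values = new ArrayList<>();
    @Override public void setString(String value) { values.add(value); }
    @Override public boolean isProjected() { return true; }
  }

  /** Backs an unprojected column: values are silently discarded. */
  static class DummyWriter implements ToyWriter {
    @Override public void setString(String value) { }
    @Override public boolean isProjected() { return false; }
  }

  /** Toy tuple writer that knows which columns the query asked for. */
  static class ToyRowWriter {
    private final Set<String> projected;
    private final Map<String, ToyWriter> writers = new HashMap<>();

    ToyRowWriter(Set<String> projected) { this.projected = projected; }

    /** Returns the writer, or null if the column is not yet defined. */
    ToyWriter scalar(String name) { return writers.get(name); }

    /** Adds a column, choosing a real or dummy writer based on projection. */
    ToyWriter addColumn(String name) {
      ToyWriter w = projected.contains(name) ? new RealWriter() : new DummyWriter();
      writers.put(name, w);
      return w;
    }
  }

  /** Reader-side code: identical whether or not the column is projected. */
  static void write(ToyRowWriter writer, String name, String value) {
    ToyWriter col = writer.scalar(name);
    if (col == null) {
      col = writer.addColumn(name);
    }
    col.setString(value);
  }

  /** Models SELECT bar FROM ... against data that contains foo and bar. */
  static ToyRowWriter demo() {
    ToyRowWriter writer = new ToyRowWriter(Collections.singleton("bar"));
    write(writer, "foo", "ignored");
    write(writer, "bar", "kept");
    return writer;
  }

  public static void main(String[] args) {
    ToyRowWriter writer = demo();
    System.out.println(writer.scalar("foo").isProjected());        // false
    System.out.println(writer.scalar("bar").isProjected());        // true
    System.out.println(((RealWriter) writer.scalar("bar")).values); // [kept]
  }
}
```

The design choice this illustrates is that the reader does not branch on projection: it writes every column it finds, and the writer it was handed decides whether the value is kept. A reader would check isProjected() only to skip expensive work for columns nobody asked for.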


Next: Enhanced Error Reporting
