This article explains the important types/traits at the storage layer.
Data is stored in columns in RisingLight. Here a column is referred to as an "array". Array
is a trait over all arrays, specifyng the interface for all arrays. PrimitiveArray<T>
implements the Array
trait, where T
can be bool
, i32
, etc. ArrayImpl
is an enum over different arrays.
ArrayBuilder
is a trait over array builders that builds Array
. Similar to how PrimitiveArray<T>
implements Array
, PrimitiveArrayBuilder<T>
implements ArrayBuilder
. ArrayBuilderImpl
is an enum over different array builders.
Some fields/methods are shown below:
pub trait Array: Sized + Send + Sync + 'static {
/// Corresponding builder of this array.
type Builder: ArrayBuilder<Array = Self>;
/// Type of element in the array.
type Item: ToOwned + ?Sized;
/// Retrieve a reference to value.
fn get(&self, idx: usize) -> Option<&Self::Item>;
/// Get iterator of current array.
fn iter(&self) -> ArrayIter<'_, Self> {
ArrayIter::new(self)
}
...
}
pub trait ArrayBuilder: Sized + Send + Sync + 'static {
/// Corresponding `Array` of this builder
type Array: Array<Builder = Self>;
/// Append a value to builder.
fn push(&mut self, value: Option<&<Self::Array as Array>::Item>);
/// Take all elements and return a new array.
fn take(&mut self) -> Self::Array;
/// Finish build and return a new array.
fn finish(mut self) -> Self::Array {
self.take()
}
...
}
DataChunk
is a list of arrays, representing a contiguous set of rows.
DataChunkBuilder
uses a list of array builders to build data chunks.
pub struct DataChunk {
arrays: Arc<[ArrayImpl]>,
...
}
pub struct DataChunkBuilder {
array_builders: Vec<ArrayBuilderImpl>,
...
}
impl DataChunkBuidler {
// Add a row (a list of DataValue's) to the data chunk
fn push_row(&mut self, row: impl IntoIterator<Item = DataValue>) -> Option<DataChunk>;
// Finishes building data chunk
fn take(&mut self) -> Option<DataChunk>;
}
Each transaction has an in-memory table as a write buffer.
The MemTable
trait specifies that the mem-table should take DataChunk
s and outputs a DataChunk
.
pub trait MemTable {
/// add data to memory table
fn append(&mut self, columns: DataChunk) -> StorageResult<()>;
/// flush data to [`DataChunk`]
fn flush(self) -> StorageResult<DataChunk>;
}
pub struct BTreeMapMemTable {
columns: Arc<[ColumnCatalog]>,
primary_key_idx: usize,
multi_btree_map: BTreeMultiMap<ComparableDataValue, Row>, // <primary key, row data>
}
BTreeMapMemTable
implements MemTable
. At append
, it adds each row in the data chunk to its internal B-tree map. At flush
, it creates array builders using type information from ColumnCatalog
, builds arrays from rows using those builders, and constructs a DataChunk
from those arrays.
SecondaryMemRowsetImpl
represents a mem-table that flushes to disk. Those flushed data will be in EncodedRowset
format.
pub struct SecondaryMemRowset<M: MemTable> {
mem_table: M,
rowset_id: u32,
rowset_builder: RowsetBuilder,
}
impl SecondaryMemRowset<BTreeMapMemTable> {
pub fn append(&mut self, columns: DataChunk) -> StorageReulst<()> {
self.mem_table.append(columns)
}
pub fn flush(self, ..) -> StorageResult<()> {
let chunk = self.mem_table.flush();
self.rowset_builder.append(chunk);
let encoded_rowset = self.rowset_builder.finish();
let writer = RowsetWriter::new();
writer.flush(encoded_rowset);
}
}
pub enum SecondaryMemRowsetImpl {
BTree(SecondaryMemRowset<BTreeMapMemTable>),
Column(SecondaryMemRowset<ColumnMemTable>),
}
An EncodedColumn
is a column in serialized format, containing both index and data of the column in binary.
Column builders, defined by the ColumnBuilder
trait, return (Vec<BlockIndex>, Vec<u8>)
when finishing a column. Note the returned index (Vec<u8>
) is not in binary format yet and IndexBuilder
serializes Vec<BlockIndex>
to Vec<u8>
later.
pub struct EncodedColumn {
index: Vec<u8>,
data: Vec<u8>
}
pub trait ColumnBuilder<A: Array> {
/// Append an [`Array`] to the column. [`ColumnBuilder`] will chunk it into small parts.
fn append(&mut self, array: &A);
/// Finish a column, return (index, data) for the block
fn finish(self) -> (Vec<BlockIndex>, Vec<u8>);
}
PrimitiveColumnBuilder<T>
implements ColumnBuilder
trait. Internally, PrimitiveColumnBuilder
rechunks data into Block
s using BlockBuilder
s. A Block
is the minimum operation unit of secondary storage. Details about blocks are omitted here.
pub struct PrimitiveColumnBuilder<T: PrimitiveFixedWidthEncode> {
data: Vec<u8>,
/// Current block builder
current_builder: Option<BlockBuilderImpl<T>>,
/// Block index builder
block_index_builder: BlockIndexBuilder,
...
}
ColumnBuilderImpl
is an enum of different PrimitiveColumnBuilder
s.
impl ColumnBuilderImpl {
pub fn new_from_datatype(datatype: &DataType, options: ColumnBuilderOptions) -> Self;
pub fn append(&mut self, array: &ArrayImpl);
pub fn finish(self) -> (Vec<BlockIndex>, Vec<u8>);
}
An EncodedRowset
is a collection of EncodedColumns
. RowsetBuilder
builds it using column builders.
pub struct EncodedRowset {
columns_info: Arc<[ColumnCatalog]>,
columns: Vec<EncodedColumn>,
...
}
pub struct RowsetBuilder {
columns: Arc<[ColumnCatalog]>,
builders: Vec<ColumnBuilderImpl>,
...
}
impl RowsetBuilder {
fn append(&mut self, chunk: DataChunk) {
for idx in 0..chunk.column_count() {
self.builders[idx].append(chunk.array_at(idx));
}
}
pub fn finish(self) -> EncodedRowset {
EncodedRowset {
columns_info: self.columns.clone(),
columns: self
.builders
.into_iter()
.map(|builder| {
let (block_indices, data) = builder.finish();
let mut index_builder = IndexBuilder::new(...);
for index in block_indices {
index_builder.append(index);
}
let index = index_builder.finish();
EncodedColumn { index, data }
})
.collect_vec(),
}
}
}
A SecondaryTransaction
represents a transaction that persists data to disk. A transaction buffers the writes (e.g. insertions) in the mem-table and only flushes to disk when (1) the mem-table exceeds a certain size threshold, or (2) the transaction commits.
pub struct SecondaryTransaction {
mem: Option<SecondaryMemRowsetImpl>,
...
}
impl SecondaryTransaction {
fn flush_rowset(&mut self) {
self.mem.flush();
Add flushed rowset to self.to_be_committed_rowsets;
}
pub fn append(&mut self, columns: DataChunk) {
if self.mem.is_none() {
self.mem = ...
}
self.mem.append(columns);
if memtable_size_exceeds_threshold() {
self.flush_rowset();
}
}
pub fn commit(mut self) -> StorageResult<()> {
self.flush_rowset();
// Skipped: Flush DVs, Commit changeset to manifests, etc.
}
}
Array
represents a column in memory.DataChunk
represents a contiguous set of rows, implemented as a collection ofArray
's.MemTable
is an in-memory write buffer.append
takesDataChunk
.flush
outputsDataChunk
.SecondaryMemRowset<M: MemTable>
is a mem-table that flushes to secondary storage.append
appends to itsMemTable
.flush
flushes theDataChunk
returned byMemTable
to disk asEncodedRowset
.EncodedColumn
represents a column to be persisted to the secondary storage. It contains the index and data of a column.ColumnBuilder
takes array atappend
, and outputsEncodedColumn
atflush
, roughly.EncodedRowset
is a collection ofEncodedColumn
s.RowsetBuilder
uses column builders internally.RowsetBuilder
takesDataChunk
atappend
, and outputsEncodedRowset
atflush
.
Let's walk through the whole process of executing INSERT INTO table VALUES (1, '11')
:
VALUES (1, '11')
is stored as aDataChunk
and passed to the insert executor.- The transaction adds the
DataChunk
to its memory-table. - Whenever the mem-table is flushed (exceeding threshold/commit), a new
DataChunk
is produced. - The
DataChunk
is passed toRowsetBuilder
. - When the
RowsetBuilder
finishes, it produces anEncodedRowset
, which is persisted to disk.