Building a Graph Engine -- Key-Value Store -- Nested Schema

In this post I explained the reason of why it is necessary to define schema before insert any data into the key-value store. The previous solution did achieved the goal of compact arrangement. User can define a map schema in a DSL like this for a student record:

[[:id :long] [:name :text] [:school :text] [:enrollment :int-array]]

The a map data like this will fit in the schema

{:id 1, :name "Jack", :school "DHU", :enrollment [2010 2011 2012 2013]}

Most of the relational database which need to define schema or tables works like this, users can define a one-layer schema which it's data types are primitive or arrays at most. If we need to present a map like this:

{:id 1, :name "Jack", :school "DHU",
:scores {:developing 99
:data-structure 87
:math 60}}}

the 'score' field must been placed in another table and leave only a identifier in the student table.

But sometimes we need to embed additional information in one document rather than use identifier to link two items. For example, in Morpheus, we need to define an array which each items have a long and id type field for edge lists[1]. In this case we need to define schema in an nested manner. In the student record example, we can describe the schema with scores like this

[[:id :long] [:name :text] [:school :text]
[:scores [[:developing :short]
[:data-structure :short]
[:math :short]]]

This schema is just right for the student record with scores we seen. It looks like what MongoDB was designed for, just without storing key names and structure of each field in the map and their children. We can utilize this feature to save more spaces from repeatedly describe what is the data represents for in the memory.

The key-value also store supports array of nested schema and array in nested schema. It allows almost infinite combination, which should be flexible enough to model any data structure.

There are two parts of the actual implementation of this feature, the writer and the reader.

In the writer there is a planner responsible for taking data and schema, compile them into a flattened sequential instructions contains field data, data type (encoder) and data bytes length. Next, we just need to write field data according to those instructions one after another, without any gaps and tags.

Reader takes the location of the cell, find out the cell schema and reconstruct data from the schema. Both writer and reader require walk on the schema recursively.

There is a downside of this feature, which is it performance is lower than the one-layer schema. It is also harder to do optimization (Precompile. I will cover this later). But it did provide both memory efficiency and flexibility.

For more exact details, please read this.