Building a Graph Engine -- Modeling the Graph: Vertices, Edges, and Their Relations

The graph engine is designed to build its graph data model on top of a key-value schema: each vertex holds lists of identifiers that connect it to other vertices or edges. Fetching data by the key of a key-value pair from a hash map is more efficient than any traditional search through an array. Every unit, both edges and vertices, is stored as a key-value pair in the memory store in a very compact way (see this article). During data exploration, this should deliver much better performance than traditional relational databases. The model is also flexible enough to represent hypergraphs, where one edge may have multiple inbound and outbound vertices.

For performance reasons, how to represent edges inside a vertex is an interesting topic. The problem is that there may be many different kinds of edges. For example, when modeling a school roster, one student has several kinds of relations to other entities: his class, his friends, his teachers, his scores. Usually we only need one kind of relation at a time, such as how many friends he has. If every edge goes into a single list, there is too much overhead in scanning all of the edges linked to a vertex just to find one type. One solution is to define the possible edge types in the schema, as Trinity does when it models the movie and actor graph. But as the dataset grows, a vertex may gain new kinds of relations and edges to other vertices over time; modeling edges this way would require altering the vertex cell schema frequently, which in turn means rebuilding every record of that schema in the store. After studying Neo4j and OrientDB, I decided to develop my own approach: the vertex stores an array map from each edge type to the identifier of an edge-list cell. The edge-list cell contains all of the edges of that type related to the vertex. A vertex contains three such array maps, for inbound, outbound, and undirected edges, and it can also contain additional fields for other data defined in the vertex schema.
An edge-list cell has only one field: the list of identifiers to other vertices or edges. For best performance, the engine allows both plain vertex ids (for simple edges) and ids of edge or hyper-edge cells that carry an additional document for more complex models. The engine can determine the type of a cell from its schema id in the store.
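
To make this concrete, here is a rough sketch of what a vertex and its edge-list cell could look like as plain Clojure maps. The field names, ids, and edge types are made up for illustration; they are not the actual Morpheus schemas.

```clojure
;; Illustrative sketch only -- field names, ids and edge types are hypothetical,
;; not the real Morpheus cell schemas.

;; A vertex keeps one array map per direction; each map goes from an
;; edge-type keyword to the id of an edge-list cell in the store.
(def student-vertex
  {:id         1001
   :name       "Alice"                       ; extra fields from the vertex schema
   :inbound    (array-map :teaches 2001)     ; edge-type -> edge-list cell id
   :outbound   (array-map :friend  2002
                          :attends 2003)
   :undirected (array-map)})

;; An edge-list cell has a single field: the ids it points at.  An entry can
;; be a plain vertex id (simple edge) or the id of an edge/hyper-edge cell
;; that carries its own document.
(def friend-edge-list
  {:id       2002
   :edge-ids [1002 1003 3001]})   ; 3001 could be a hyper-edge cell

;; Looking up one relation type touches only one edge-list cell:
(defn edges-of [vertex direction edge-type lookup-cell]
  (some-> (get-in vertex [direction edge-type]) lookup-cell :edge-ids))
```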

After designing this structure for the graph model, I found that the store still has limitations with custom types like array maps. I need to dedicate more time to improving the key-value store; supporting nested schemas will suit my needs.

Building a Graph Engine -- Key-Value Store -- Defragmentation

For memory efficiency, the key-value store behind the graph engine does not rely on the programming language's native data structures. Instead, I use a constant-size memory space to store all of the data.

In the beginning this works well: as records are added to the space, the cells are packed tightly, and each cell's location and id are stored in a HashMap as an index. But deleting any cell other than the last one leaves a gap between cells. Because I use an append head to allocate positions for new cells, unless those gaps are reclaimed and the append head is reset, its location keeps increasing until it runs past the store boundary.
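
Here is a tiny sketch of that append-head bookkeeping, with illustrative names rather than the real store code:

```clojure
;; Rough sketch of append-head allocation; all names here are illustrative.
(def trunk-size (* 512 1024 1024))                          ; constant-size trunk
(def append-head (atom 0))                                  ; next free write position
(def cell-index (java.util.concurrent.ConcurrentHashMap.))  ; cell id -> offset

(defn allocate! [cell-id length]
  (let [start (first (swap-vals! append-head + length))]    ; old head = new cell's offset
    ;; naive capacity check: once the head runs past the boundary we are stuck
    ;; until defragmentation pulls it back
    (when (> (+ start length) trunk-size)
      (throw (ex-info "trunk exhausted, needs defragmentation" {:append-head start})))
    (.put cell-index cell-id start)
    start))

(defn delete! [cell-id]
  ;; the index entry disappears, but the bytes stay where they are: a gap is born
  (.remove cell-index cell-id))
```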

Defragmentation is a classic procedure in databases and operating systems. This is my first time building such a system, though, and I was stuck on this part at first, but after some drawing the idea became pretty clear. I was able to finish the job on Chinese New Year's Eve (yes, I write code even while everyone else is celebrating and on vacation).
[Figure: hand-drawn defragmentation diagram]

The diagram I drew is shown above. Each trunk keeps a concurrent skip list map to record its fragments. When a fragment is added, a check detects adjacent fragments and merges them into one, as seen in the 4th and 5th rows of the diagram. Defragmentation moves a data cell left to the tail of its nearest preceding cell and marks the new fragment the movement creates, just like the first fragment in the 2nd row. In my case, the move procedure copies the original memory block and records the new fragment for the next turn of the loop. After one round of the defragmentation loop, all of the fragments should have been moved and merged into a single fragment at the tail of the data cells, just like the last row of the diagram. Then we can remove that last fragment and move the append head to its start position.
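
The following is a condensed sketch of the fragment map and the compaction pass described above. `cell-at`, `copy-cell!`, and `set-append-head!` are stand-ins for the trunk's own operations, not the real Morpheus functions (which live in the repository):

```clojure
(import '(java.util.concurrent ConcurrentSkipListMap))

;; Fragment bookkeeping: start offset -> fragment length, ordered by offset.
(def fragments (ConcurrentSkipListMap.))

(defn add-fragment! [start length]
  ;; merge with touching neighbours on both sides before inserting
  (let [right (.ceilingEntry fragments start)
        [start length] (if (and right (= (+ start length) (.getKey right)))
                         (do (.remove fragments (.getKey right))
                             [start (+ length (.getValue right))])
                         [start length])
        left  (.floorEntry fragments (dec start))]
    (if (and left (= (+ (.getKey left) (.getValue left)) start))
      (.put fragments (.getKey left) (+ (.getValue left) length))
      (.put fragments start length))))

(defn defrag-once! [cell-at copy-cell! set-append-head!]
  ;; repeatedly take the left-most fragment and slide the cell that follows
  ;; it onto the gap, until only the tail fragment remains
  (loop []
    (when-let [frag (.pollFirstEntry fragments)]
      (let [gap-start (.getKey frag)
            gap-len   (.getValue frag)
            cell      (cell-at (+ gap-start gap-len))]   ; cell right after the gap
        (if cell
          (do (copy-cell! cell gap-start)                ; move left; also updates its index entry
              (add-fragment! (+ gap-start (:length cell)) gap-len) ; gap re-emerges behind it
              (recur))
          ;; nothing after the gap: it is the tail fragment, hand the space back
          (set-append-head! gap-start))))))
```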

My implementation is about 50 lines of Clojure. I also wrote tests, and they pass for every circumstance I can imagine. You can check the code out here.

Building a Graph Engine -- Key-Value Store -- Off-heap Trunks

I have already finished the memory storage part of the k-v store. The early version was based on a Java byte array. Then I started to worry about its consistency across threads under heavy load. I posted the question on Stack Overflow, and an engineer from OpenHFT confirmed my concern and suggested a solution.

The off-heap features of sun.misc.Unsafe are something most Java developers avoid, because they will blow up an application if used incorrectly. Unsafe bypasses Java memory management, including garbage collection, and can allocate up to terabytes of memory directly, much like what we usually do in C. It can also sidestep thread caches and memory fences, which accelerates the program but can also produce inconsistency across threads.
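
For readers who have not touched Unsafe before, this is roughly how it is obtained and used from Clojure. The reflection trick works around its private constructor; the snippet is just an illustration, not code from Morpheus:

```clojure
;; Getting hold of sun.misc.Unsafe via the usual reflection trick
;; (its constructor is private).  Example only.
(def ^sun.misc.Unsafe unsafe
  (let [field (.getDeclaredField sun.misc.Unsafe "theUnsafe")]
    (.setAccessible field true)
    (.get field nil)))

;; Memory is managed manually, much like malloc/free in C.
(def addr (.allocateMemory unsafe 1024))   ; 1 KB outside the Java heap
(.putLong unsafe addr 42)                  ; write a long at that address
(.getLong unsafe addr)                     ; => 42
(.freeMemory unsafe addr)                  ; forget this and the memory leaks
```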

I had considered going off-heap before I tried the byte array, but I worried that I might not be able to handle the low-level operations, and there is no official documentation for Unsafe because it is an internal API. I then realised the inconsistency problem has a huge impact on performance, because working around it means either a read-write lock or funnelling every write operation through one thread. The byte array also has a maximum size of about 2 GB, so I have to use multiple trunks to hold more data than that.

Taking full control of memory management seemed crucial to this project, and trying new things to make my work more capable has never bothered me. After further research on off-heap storage and help from Peter Lawrey, I spent about an hour converting my original approach to off-heap. Most of the primitive data types have native read/write methods in Unsafe, which made the work much easier. Tests indicate the performance is almost identical to the byte array approach. I did hit some JVM crashes while debugging defragmentation because I forgot to add the base address to the offset when performing copyMemory; apart from that, everything is holding tight.
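
The lesson from that crash, in sketch form: trunk-relative offsets have to be rebased onto the trunk's start address before they reach copyMemory. The function name below is hypothetical:

```clojure
;; Hypothetical helper; the point is only the base-address arithmetic.
(defn copy-cell-off-heap!
  "Move `length` bytes from a trunk-relative `from` offset to a
   trunk-relative `to` offset inside the trunk starting at `base-addr`."
  [^sun.misc.Unsafe unsafe base-addr from to length]
  ;; Offsets are relative to the trunk, but copyMemory wants absolute
  ;; addresses -- omitting base-addr here is what crashed the JVM.
  (.copyMemory unsafe (+ base-addr from) (+ base-addr to) length))
```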

Unsafe is fine as long as it is used properly, but it needs additional safeguards, such as bounds checks on writes, to avoid crashing the whole JVM where the pure Java approach would simply throw an exception.

Building a Graph Engine -- Key-Value Store

Graph access patterns are mostly random. To achieve the best possible performance, it is wise to put everything in memory rather than in a disk-oriented store. Because DRAM is much faster but also much more expensive than hard drives and SSDs, storage efficiency is the essence of the matter. A colleague at my previous employer once developed his own graph database as well. He told me that his design hit memory limits quickly because of low memory efficiency. His system was written in C++, with relations between nodes represented as pointers and each node as an object. I could not get more details on his implementation, but representing nodes and relations in native forms such as objects and maps in object-oriented languages is a bad choice for efficiency, even though it is easy for development. In Java especially, each allocated object takes 12 bytes of header plus additional bytes for alignment [1]. If an object contains other objects, the overhead grows with the total number of objects allocated, which is not acceptable for in-memory applications.

The backend store for Morpheus is an in-memory distributed key-value store that takes some ideas from RAMCloud. Each instance contains a number of trunks, and each trunk holds one constant-size byte array; every data record is serialized to be written into the byte array and deserialized to be read back out. Each trunk also keeps a hash map as an index from record id to the record's location in the byte array. Because a record's location may change due to defragmentation, the values in the index hash map may change correspondingly.
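
As a sketch, a trunk could be pictured like this; the names are illustrative and the real store keeps more bookkeeping than shown:

```clojure
;; Sketch of one trunk and an id-based read; all names are illustrative.
(defrecord Trunk [store          ; one constant-size byte array
                  index          ; ConcurrentHashMap: record id -> current offset
                  append-head])  ; atom holding the next free offset

(defn new-trunk [size]
  (->Trunk (byte-array size)
           (java.util.concurrent.ConcurrentHashMap.)
           (atom 0)))

(defn read-cell
  "Find a record by id and hand its bytes to `deserialize`; the offset comes
   from the index because defragmentation may have moved the record."
  [trunk id deserialize]
  (when-let [offset (.get ^java.util.concurrent.ConcurrentHashMap (:index trunk) id)]
    (deserialize (:store trunk) offset)))
```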

Trinity and most relational databases require a schema to be defined before new records are inserted. That is less flexible than modern database systems like MongoDB and Couchbase, but it makes much more compact storage possible, without wasting precious space describing the data structure of each record. Morpheus takes the same idea for its backend store: the user defines a schema with a name and data type for each field. The fields of a schema form a sequence, and the system writes field data in that order, leaving no gaps and storing no per-record overhead for field types or names. The system can also reconstruct a record from its bytes according to the schema.
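
A toy version of the idea, using a java.nio.ByteBuffer and only two field types to keep it short; the real store writes into its own byte array and supports many more types:

```clojure
(import '(java.nio ByteBuffer))

;; Toy illustration of schema-driven packing; the field set and the two
;; supported types are made up.
(def person-schema
  ;; field order matters: values are written back-to-back with no names,
  ;; no per-field type tags and no padding
  [[:id  :long]
   [:age :int]])

(defn write-record [^ByteBuffer buf schema record]
  (doseq [[field type] schema]
    (case type
      :long (.putLong buf (long (get record field)))
      :int  (.putInt  buf (int  (get record field))))))

(defn read-record [^ByteBuffer buf schema]
  (into {} (for [[field type] schema]
             [field (case type
                      :long (.getLong buf)
                      :int  (.getInt buf))])))

;; (def buf (ByteBuffer/allocate 12))
;; (write-record buf person-schema {:id 42 :age 30})  ; exactly 12 bytes used
;; (read-record (.flip buf) person-schema)            ; => {:id 42, :age 30}
```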

Users may delete or update records in the store. Certain circumstances leave gaps, which we call fragments, in the byte array:

  • Delete: Deleting a record erases the section it occupied in the byte array. That leaves a gap starting at the record's location, with the same size the record took. In this case, the system marks the whole space as a fragment.
  • Update: There are three possible circumstances when updating a record in a trunk. The easiest is when the replacement has the same size as the original: the system only needs to overwrite the bytes with the newly serialized record. Another is when the replacement needs less space than the original: the system can still write the data in the original place, but the leftover space must be recorded in the trunk as a fragment. When the new record takes more space than the original, the system writes it at the append head and updates the index as if inserting a new record, then marks the whole section of the original as a fragment, just as if it had been deleted (a sketch of this branching follows the list).
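
The branching could be sketched like this; the helper functions are hypothetical stand-ins for the store's internals, passed in so the sketch stays self-contained:

```clojure
(defn update-cell!
  "Sketch of the three update branches.  `offset`/`length` describe where the
   record currently lives; the helpers are stand-ins, not the real store API."
  [{:keys [offset length]} new-bytes
   {:keys [write-bytes! mark-fragment! append-cell!]}]
  (let [new-len (alength ^bytes new-bytes)]
    (cond
      ;; same size: overwrite in place, index untouched
      (= new-len length)
      (write-bytes! offset new-bytes)

      ;; smaller: overwrite in place, the leftover tail becomes a fragment
      (< new-len length)
      (do (write-bytes! offset new-bytes)
          (mark-fragment! (+ offset new-len) (- length new-len)))

      ;; larger: append like a new record (index updated there),
      ;; and the whole original section becomes a fragment
      :else
      (do (append-cell! new-bytes)
          (mark-fragment! offset length)))))
```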

The backend store takes care of both circumstances above, so users can treat it as a fully functional key-value store.

When the system has been running for a long time, frequent delete and update operations may leave many fragments in a trunk, which leaves less space for new records. A defragmentation mechanism is required to resolve this by moving record sections to fill the gaps while keeping data intact. After one turn of defragmentation, the append head moves backward and more records can fit in. Implementing pause-free defragmentation is difficult; I will address it later in another article.

Each record has an integer id. The backend store distributes records to servers, and to trunks within servers, according to the id, and fetches them by the id. A distributed hash table works well for this and makes a shared-nothing architecture easy to achieve. I have already implemented this part in another of my projects.
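
A simplistic illustration of id-based placement, assuming plain modulo arithmetic; the actual distribution may use a proper distributed hash table with consistent hashing:

```clojure
;; Simplistic placement by id; names and scheme are illustrative only.
(defn locate-record
  "Map a record id to a server and to a trunk inside that server."
  [id servers trunks-per-server]
  (let [h (hash id)]
    {:server (mod h (count servers))
     :trunk  (mod (quot h (count servers)) trunks-per-server)}))

;; (locate-record 42 ["node-a" "node-b" "node-c"] 16)
;; => a {:server s, :trunk t} pair that is stable for the same id
```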


Building a Graph Engine -- Introduction

Recently, I have been trying to build my own graph database and graph compute engine to support my research. The purpose of the system is to provide a distributed, low-latency store with a computation engine based on RDDs. I named it "Morpheus", saluting another graph engine from Microsoft named "Trinity", whose design concepts I took from this article and implemented in my system.

The system has three parts: a distributed in-memory key-value store, a graph representation layer providing graph read/write operations on top of the k-v store, and a distributed data processing framework.

Right now, this project is under heavy development, and I think it will take quite some time before all of the parts are ready for real use. You can track the progress by visiting my GitHub profile. I will publish more articles to reveal the details of the implementations. I have plenty of ideas at the moment and need to test them to make the right decisions, but I will report on them in these articles and explain why I adopted each solution.