Building a Graph Engine -- Computation Engines

Morpheus is a distributed graph engine. Once I resolved it's  storage problem, I start to build it's computation engines. If I just leave this project in current stage as a pure high-performance graph storage, user has to write their own algorithms. Most importantly, client side traversal may cause performance issues on communications.

Currently Morpheus use message passing to establish BSP and any other traversal based computation models. To reduce the size of each message, Morpheus can distribute task parameters to all of the machines in the cluster. The current commit have already contains distributed depth first search for path search, construct sub-graph for vertex or bulk update. but I have't finished all of them yet because the condition to stop the search or update actions require the existence of query language in the system like Cypher in Neo4j, which I am working on to provide the flexible functionality in search and updates.

I haven't decided the format of the query language. Lisp-like expression (S-expression or Polish notation) is preferable because of it's easy implementation in parser. I will not simply eval the code due to security concern and performance (eval in Clojure will generate new class, it's life-cycle is not manageable and possibly leads to memory leak). The problem of S-expression is that there a few people get used to such expressions. People usually read 1 + 1 = 2, but an expression like (= (+ 1 1) 2) looks strange to them. Another advantage of s-expression is it is more succinct when 1 + 2 + 3 + 4 can be represented as (+ 1 2 3 4).

Morpheus also have a code distribute feature. Users can write their plain code, even pack as Leiningen project, send them to the cluster and run remotely without compile anything. It is useful if users want to run their own algorithms in the cluster. There should be a sort of API or framework, and I have done it yet.

Future plan includes a RDD based computation module for bulk data processing and map-reduce computation models, like GraphX in Apache Spark. It may consider vertices and edges as documents, users can join the documents with other sources. The computation model may not comply graph traversal models, but it is useful When relations are not the first class problems in the cases. The RDD module may have a dedicated memory management policy because the memory spaces required for computation are unpredictable. RDDs may require to temperately flush into disk instead of persist in the memory. For example, in LRU manner.

Computation models seems a much more challenging topic in this project. And I have so few time to finish the job.That is the reason I considered the Morpheus project as my second personal long-term project (the first the the WebFusion project, it was paused, will continue after the Morpheus project).

 

Leave a Reply

Your email address will not be published. Required fields are marked *