I just paused on coding to read some paper   , finding the optimal way to solve cell addressing problem for Nebuchadnezzar. I meet some guys from PingCAP before the spring festival, they asked me why I choose consistent hashing, a P2P way to addressing data in Nebuchadnezzar, over using some addressing service that clients can lookup what is the server for the data and then go to the server to do the jobs (they call it placement driver). PingCAP is building a relational database named TiDB with underlying key-value store TiKV based on theory from Google F1 and Spanner, that meeting was a interview for a remote engineer position, held in a restaurant.
Honestly I can't answer the question at that time. Because I don't know how their placement driver works. Google chose this architecture for almost all of their infrastructure software. Seems like the placement driver won't becomes bottle neck to the whole system. PingCAP claimed their PD can take more than a million per second which is enough for almost any use case. In the other hand, Cassandra (it only use consensus for transactions), Dynamo from Amazon, Trinity from Microsoft and RAMCloud from Stanford all use or propose P2P architecture on top of consistent hashing.
The benefits from PD and range sharding is data can move around in the cluster at will. That means data can store in order by it's fields and fetch them in sequence easily. If one server cannot hold the sequence, PD can dispatch data to other servers. To locate the data, clients have to query the location in PD. In PingCAP PD, it takes at least 1 (when the location for data entry was cached) to 3 (if cached location is stale) network round trip to fetch the data from the actual server.
Consistent hashing don't need a central service to hold the location to the data. The only information it required is members and their weights to generate the lookup table by hashing, sort, and generate range slots. To lookup in the slots, clients need to hash the key and use binary search to find the range slot the key belongs, each slot represents a server address. Servers will maintain it's own hash table to the actual data. There will be only one network round trip and the addressing can be finished in clients. Range slots won't expand with amount of entry the system holds, the amount of range slots is also configurable. The downside to this approach is that range query is somehow difficult to implement and also inefficient compared to PD approach.
It seems like PingCAP considered P2P way but rejected due to it's complexity and the extra network round trips from PD is insignificant to the performance for their product. It is understandable because TiKV is a key-value store backed by RocksDB which it's data persists to stable hard drives. Each operation on TiKV may also trigger a disk operation which is expected. Disk operations may not faster than network (disk latency is about 1 to 15 milliseconds depends on the actual device, network latency should be about 1 milliseconds or less), increased network round trips will not make much impact on such system. In TiKV, due to it's replication, data also need to replicate to at lease one follower, that will also take at least one more network round trip. It is designed to ensure data safety.
Nebuchadnezzar is a key-value store to power my graph engine Morpheus. Morpheus enable users to do basic CRUD and graph traversal by performing parallel breath-first-search. Each vertices or edges are stored in form of cells in the key-value store. When performing graph traversal operations, Morpheus will do
get operations for every cells for vertices, edges and meta data it required. Because graph data are all about relations and their access pattern are unpredictable, we can consider it's pattern is mostly random and seldom sequential access. Another factor to be considered is latency. Because every traversal operations are serials of random access to the key-value store, it is essential to keep every operations to be finished as soon as possible. Keep those ideas in mind, Nebuchadnezzar is designated to be a in-memory key-value store without instant writing data to disks and no internal replication supported. It is similar to Redis and memcached, with more features like sharding and schema support for memory efficiency.
Low latency distributed storage system makes addressing problem more sensitive than traditional data store backed by hard drive disks. Because we have made so much efforts on decreasing latency to less than 1 ms in a single server, redundant network round trips will be unacceptable. It should also considered that because there will be huge amount of random access, cache mechanisms mentioned in PD client will fail due to frequent cache miss. It can be foreseen that there will always be at least 2 network round trips (even with batch) for PD approach.
This makes consistent hashing more preferable for Nebuchadnezzar. Addressing run-time takes no more than
n is the size of range slot size, which would be a constant. There will also be only one and at most one network round trip, overall latency will be more controllable.
Range query problem over consistent hash table still remains. Although Nebuchadnezzar is built for high-performance random access, range query allow us to implement index for client range query which is essential to some use cases. It is not yet solved in the first version and was considered almost impossible. This is because after hashing, data keys lost it's underlying value and the data dispatched to range slots in uniform distribution. There is no way to determine data location from the range slots.
There is some workaround for range query problem in consistent hash table actually.  proposed to use prefix tree (PHT) and  proposed to use segment tree (DST), for strings and numbers. Both of the workarounds required to use another data structure to keep track of data keys, range query in the data structure and use consistent hashing to fetch the data.
The only problem left now is engineering. We have to choose how to distribute the data structure to cluster servers in large scale. Fourtantly, both PHT and DST are tree data structure, means they all can be split into fine grained sub-trees on different servers and execute commands in parallel if the tree is too large to fit into one server.
But this also leads to new problems, in worst cases we have to reach multiply servers to create and delete index for keys. It will increase latency for every commands related to the index. Although we can put index processes into coroutines that will not affect user thread, but user may also suffer inconsistent from index. In the actual implementation, I will let user to do the trade-off.
I will keep using the original architecture in my first version for addressing problem in next generation of Nebuchadnezzar. It is the result of compromise for performance. Next if I have some time, I may built another system that does not have such low-latency requirement, on PD approach.