After a month of desperate tests and adjustments (I nearly crashed one of the hard drives in my test machine), I think I have found a reasonable design for the key-value store.
The hard part of this project is memory management. My previous design allocated a large block of memory for each trunk and appended data cells at a single append header. When a cell was deleted or became obsolete, a defragmentation thread moved all living cells to fill the fragments. That means that if there is a fragment at the very beginning of the trunk, most of the cells in the trunk get moved in one cycle, and the fragmented space is only reclaimed after a full round of defragmentation. The space was usually reclaimed so slowly that the trunk could not take frequent update operations. I used to apply a slow mode in the system, but eventually found it was a stupid idea.
I also tried filling new cells into the fragments, but that caused performance issues. The defragmentation process has to compete with other operation threads for the fragments, so there must be locks to keep things consistent, and those locks can heavily slow down the whole system.
After I wrote my previous blog post from the Celebrate Ubuntu Shanghai Hackathon, I nearly ditched the design I mentioned before and tried to implement a partly log-structured design that I had seen in RAMCloud. It took me only one night to rewrite that part. I did not fully follow the design in the paper. In particular, I do not append tombstones to the log. Instead, I write each tombstone into the place that belonged to the deleted cell, because the paper spends many paragraphs on how to track tombstones and reclaim their space, and I found that approach has too much overhead.
The major improvement compared to the original design is dividing each trunk into 8 MB segments. Each segment tracks its own dead and alive objects, and the maximum object size is reduced from 2 GB to 1 MB. This design has proved to be efficient and manageable. Segmentation makes it possible to start multiple threads and defragment the segments of one trunk in parallel, which significantly improved defragmentation performance. When a writer tries to acquire new space from a trunk, it only needs to search for one segment that has sufficient space. In most common situations, defragmentation finishes in time.
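To make the segment idea concrete, here is a minimal sketch in Python. It is not the actual Nebuchadnezzar code: the names `Segment`, `Trunk`, `acquire` and `defragment` are made up for illustration, and a cell is modeled only by its id, offset and size. It shows a writer picking the first segment with enough room, a delete that only marks space as dead, and a defragmentation pass that touches one segment at a time.

```python
from dataclasses import dataclass, field

SEGMENT_SIZE = 8 * 1024 * 1024   # 8 MB per segment, as described above
MAX_CELL_SIZE = 1024 * 1024      # 1 MB maximum object size


@dataclass
class Segment:
    """Hypothetical segment that tracks its own living and dead cells."""
    append_pos: int = 0                          # next free offset
    dead_bytes: int = 0                          # space held by deleted cells
    cells: dict = field(default_factory=dict)    # cell id -> (offset, size)

    def free_space(self):
        return SEGMENT_SIZE - self.append_pos

    def append(self, cell_id, size):
        offset = self.append_pos
        self.cells[cell_id] = (offset, size)
        self.append_pos += size
        return offset

    def delete(self, cell_id):
        # The tombstone stays where the cell was; the space is only
        # counted as dead until this segment is defragmented.
        _, size = self.cells.pop(cell_id)
        self.dead_bytes += size

    def defragment(self):
        # Slide living cells toward the start of this segment only.
        pos = 0
        by_offset = sorted(self.cells.items(), key=lambda kv: kv[1][0])
        for cell_id, (_, size) in by_offset:
            self.cells[cell_id] = (pos, size)
            pos += size
        self.append_pos = pos
        self.dead_bytes = 0


class Trunk:
    """Hypothetical trunk made of fixed-size segments."""

    def __init__(self, segment_count):
        self.segments = [Segment() for _ in range(segment_count)]

    def acquire(self, cell_id, size):
        if size > MAX_CELL_SIZE:
            raise ValueError("cell exceeds the 1 MB limit")
        # A writer only needs one segment with sufficient space.
        for index, segment in enumerate(self.segments):
            if segment.free_space() >= size:
                return index, segment.append(cell_id, size)
        raise MemoryError("no segment has enough free space")
```

Because `defragment` only reads and writes the state of its own segment, several segments of the same trunk could in principle be handed to different threads, which is the property the redesign relies on. The sketch leaves out locking and the actual memory copies.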
I also redesigned the backup process. With trunks divided into segments, synchronizing in-memory data to disk is much more straightforward. Each trunk has one thread that keeps detecting dirty segments and sends them to remote servers for backup, one at a time. A dirty segment is a segment that contains new cells. Segments that contain only new tombstones are not considered dirty and do not need to be backed up, because the system tracks a version for each cell. During recovery, a cell with a lower version is replaced by one with a higher version, but a higher version is never replaced by a lower one.
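The version rule used during recovery can be sketched as follows. This is only an illustration under assumptions: the `recover` function and the `(cell_id, version, payload)` tuple layout are hypothetical, and real backup segments are binary data rather than Python lists.

```python
def recover(backup_segments):
    """Rebuild a cell table from backed-up segments, in any order.

    Each segment is assumed to be an iterable of
    (cell_id, version, payload) tuples (a made-up layout).
    """
    cells = {}
    for segment in backup_segments:
        for cell_id, version, payload in segment:
            current = cells.get(cell_id)
            # Keep the copy with the highest version; a lower version
            # never overwrites a higher one.
            if current is None or version > current[0]:
                cells[cell_id] = (version, payload)
    return cells
```

With this rule, the order in which segments are replayed does not matter, so an older copy of a cell found in a stale segment cannot clobber a newer one.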
At the application level, the 1 MB object size is a limitation. In project Morpheus, it is quite possible for one cell to exceed this limit, especially in edge lists, where one vertex may have a large number of inbound or outbound edges. But this limitation can be overcome by implementing a linked list. Each edge list cell contains an extra field holding the id of the next list cell. When the list needs to be iterated, we can follow the links to retrieve all of the edges.
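A rough sketch of the linked edge list, again in Python and again with invented names: `write_edge_list`, `iter_edges`, the `"edges"` and `"next"` fields, and the plain dict standing in for the store are all assumptions for illustration, not the Morpheus schema. The `max_edges` parameter stands in for however many edges fit under the 1 MB limit.

```python
from itertools import count


def write_edge_list(store, edges, ids, max_edges):
    """Split edges across linked cells and return the id of the first cell."""
    chunks = [edges[i:i + max_edges] for i in range(0, len(edges), max_edges)]
    if not chunks:
        chunks = [[]]            # an empty list still gets one cell
    cell_ids = [next(ids) for _ in chunks]
    for i, chunk in enumerate(chunks):
        # The extra field points at the next cell, or None at the end.
        next_id = cell_ids[i + 1] if i + 1 < len(cell_ids) else None
        store[cell_ids[i]] = {"edges": chunk, "next": next_id}
    return cell_ids[0]


def iter_edges(store, head_id):
    """Follow the links from the head cell and yield every edge."""
    cell_id = head_id
    while cell_id is not None:
        cell = store[cell_id]
        yield from cell["edges"]
        cell_id = cell["next"]


store = {}
head = write_edge_list(store, list(range(10)), count(1), max_edges=4)
all_edges = list(iter_edges(store, head))
```

Here ten edges with at most four per cell end up in three linked cells, and iteration returns them in their original order.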
There are still some bugs in Nebuchadnezzar, and I do not recommend using it in production, but I am glad to see that this project has made some progress instead of becoming another of my short-term fever projects.