I was struggling for how to implement the durability feature for the key-value store. I have finished my first attempt of write memory data into stable storage. The paper from Microsoft did not give me much clue because it just mentioned that they designed a file system called TFS, which is similar with HDFS from Apache Hadoop. It did not give any more information about how and when to write memory data into the file system. My first and current design for memory management followed their implementation. After some tests by importing large amount of data into the store and then frequently update the inserted cells, I realised the original design from Trinity is problematic.
Memory management, also named defragmentation is used to reclaim spaces from deleted cells. Especially in update operations, the store will mark the original spaces as fragment and append new cell to the append header. If the spaces did not been reclaimed in time, the container will been filled and no more data can been written into. The approach in my previous presented article have major flaws in copying all of cells that follows the position of the fragment. That means if we mark the first cell in the trunk as fragment, in the defragmentation process, all the rest of the cells in the trunk will been copied. In some harsh environment, the defragmentation process will never catch up.
The key-value store also supports durability. Each trunk will have a mirrored backup file in multiply remote servers to ensure high availability. Each operation on the trunk memory space, including defragmentation will mark the modified region of memory space in a skiped list buffer as dirty, order by the memory address offset to the operation. After certain interval of time, a daemon will flush all of the dirty data and their address to remote backup servers. Backup servers will take those data in a buffer and a daemon will take data from buffer and write then to disk asynchronously. With such defragmentation approach, each defragment operation will also leads to updating all of the copied cells. That is a huge io performance issue.
Then I read the paper from Stanford to find more feasible solution. The author provided full details on log-structured memory management. The main feature of this approach is to make memory space into append only segments. In cleaning process, the cleaner will make full empty segments for reuse. Each operation, including add, delete and updates are all appended logs. The system maintains a hash table to locate cells in segments. The cleaner is much complex than defragmentation in my key-value store, but it provides possibility to minimize memory copy and parallel cleanning. The downside is the maximum size of each cell is limited by segment size by 1MB when my current approach can provide maximum 2GB for each cell (even it is not possible to have such big cells) and maintaining segments costs memory, especially when the memory space is large, the number of segments increases accordingly. I prefer to give the log-structured approach a try because defragmentation will become major bottleneck in current design, and large object is not essential in current use cases.
Manage memory is far more complex that I was thought in the first time. Other system like JVM and other programming languages took years on improving their GC, each improvements may reduce pause by 1 ~ 2x. I cannot tell which is the best solution without proper tests because each solution seems specified for certain use cases. I think I can do more research on this topic in the future.