Servers are crashing: bad memory modules or hot weather?

I cannot figure out the exact reason, but recently two of my servers, both with 8 x 16GB of Kingston DDR3 1600 MHz ECC REG memory, started giving me correctable ECC memory error messages. One of the servers can recover by itself; the other is totally dead, stuck at the following screen on startup.

[screenshot: the screen where the server hangs on startup]

I used to resolve such problems by finding and pulling the problematic modules. The server that cannot recover by itself is the one I placed in the data center as a long-term production machine. I have suffered six memory failures from different modules plugged into the same slot. Last year I started to suspect a cooling problem: there may be no airflow over that module, so it overheats. The chassis is a 1U with 4 hard drives, and I invested a lot of resources in this machine, crowding the small space with 2 more SSDs for caching and a shell-less router for IPMI. Once I unplugged the memory module that could not get any airflow, the server was stable for a while (nearly a year), but it crashed again today.

Another server at my home, which serves as a proving ground, computation server, and virtual machine host, showed a similar problem when I was testing Morpheus with the Wikidata graph on it. Instead of crashing, it recovers by itself but leaves messages like the following in the server log:

Apr 30 18:46:00 shisoft-vm kernel: [90407.431330] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Apr 30 18:46:00 shisoft-vm kernel: [90407.431337] EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: 8c00004000010090
Apr 30 18:46:00 shisoft-vm kernel: [90407.431338] EDAC sbridge MC1: TSC 0
Apr 30 18:46:00 shisoft-vm kernel: [90407.431340] EDAC sbridge MC1: ADDR 1d7a258b00 EDAC sbridge MC1: MISC 204a167686
Apr 30 18:46:00 shisoft-vm kernel: [90407.431342] EDAC sbridge MC1: PROCESSOR 0:206d5 TIME 1462013160 SOCKET 1 APIC 20
Apr 30 18:46:00 shisoft-vm kernel: [90407.659365] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x1d7a258 offset:0xb00 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:4 rank:1)
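
As an aside for anyone chasing similar failures: the kernel exposes these EDAC counters under sysfs, so a tiny watcher can flag a module whose correctable-error count keeps climbing before it dies outright. Here is a sketch in Go, assuming the standard EDAC sysfs layout (/sys/devices/system/edac/mc/mc*/ce_count); check your kernel's documentation for the exact paths:

```go
// edaccheck reads the EDAC correctable/uncorrectable error counters
// that the kernel exposes under sysfs, so rising CE counts can be
// spotted before a module fails hard.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// One mc* directory per memory controller, e.g. mc0, mc1.
	controllers, err := filepath.Glob("/sys/devices/system/edac/mc/mc*")
	if err != nil || len(controllers) == 0 {
		fmt.Fprintln(os.Stderr, "no EDAC memory controllers found")
		os.Exit(1)
	}
	for _, mc := range controllers {
		for _, counter := range []string{"ce_count", "ue_count"} {
			data, err := os.ReadFile(filepath.Join(mc, counter))
			if err != nil {
				continue
			}
			fmt.Printf("%s %s: %s\n", filepath.Base(mc), counter,
				strings.TrimSpace(string(data)))
		}
	}
}
```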

If I unplug the modules that were indicated as problematic, other modules fail in the next round of tests. That leaves me no option but to ignore it.

I have another machine with 2 x 16GB of Samsung 2133 MHz DDR4 REG ECC memory. It was assembled at the beginning of last year and has no such problems, even when its memory was exhausted and it started taking swap. I strongly suspect the failures were caused by heat, or perhaps my hardware provider did not give me qualified parts (the motherboard can also cause such problems).

For now, I have decided to upgrade the machine with Samsung memory to 96GB and add one Intel 750 400GB SSD as secondary storage for project Morpheus. I also plan to replace the machine in the data center with a new one. The new server will pay more attention to heat dissipation; I hope it won't be so annoying in the future.

I don't suggest purchasing hardware and placing it in a data center when cloud platforms (for example, DigitalOcean and Amazon EC2) are affordable for your applications. My use cases are harsh: I have to customize my servers to balance performance and price, and I also have to manage server hardware problems by myself.

Nebuchadnezzar, finally evolved

After one month of desperate testing and adjustment (I nearly wrecked one of the hard drives in my proving ground machine), I think I have found a reasonable design for the key-value store.

The hard part of this project is memory management. My previous design allocated a large amount of memory for each trunk and appended data cells at a single append head. When a cell was deleted or became obsolete, a defragmentation thread moved all living cells backward to fill the fragments. That means if there is a fragment at the very beginning of the trunk, most of the cells in the trunk will be moved in one cycle, and the fragment space is only reclaimed after a full round of defragmentation. Usually the fragment space is reclaimed so slowly that the whole trunk cannot take update operations very frequently. I used to apply a slow mode in the system, but finally realized it was a stupid idea.
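
For illustration, here is a minimal sketch of that old scheme (hypothetical Go, not the actual Nebuchadnezzar code): one dead cell near the front forces a copy of nearly every live cell behind it, and nothing is reclaimed until the whole pass finishes.

```go
// Package naivetrunk sketches the old whole-trunk compaction scheme.
package naivetrunk

type Cell struct {
	Offset int
	Size   int
	Live   bool
}

type Trunk struct {
	Data  []byte
	Cells []Cell // ordered by Offset
}

// Compact slides every live cell toward the front over the fragments.
// A fragment at offset 0 forces a copy of almost the whole trunk, and
// the freed space only appears after the full pass completes.
func (t *Trunk) Compact() {
	writePos := 0
	for i := range t.Cells {
		c := &t.Cells[i]
		if !c.Live {
			continue // dead cell: its space becomes part of the hole
		}
		if c.Offset != writePos {
			copy(t.Data[writePos:writePos+c.Size], t.Data[c.Offset:c.Offset+c.Size])
			c.Offset = writePos
		}
		writePos += c.Size
	}
	// Everything past writePos is reclaimed in one shot, at the end.
}
```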

I also tried filling new cells into the fragments, but that causes performance issues: the defragmentation process has to compete with the other operation threads for fragments, and locks are required to keep things consistent, which can heavily slow down the whole system.

After I wrote my previous blog post at the Celebrate Ubuntu Shanghai Hackathon, I nearly ditched the design described above and implemented a partly log-structured design borrowed from RAMCloud. It took me only one night to rewrite that part. I did not fully follow the design in the paper; in particular, I do not append tombstones to the log. Instead, I write tombstones into the space that belonged to the deleted cell, because the paper spends a large number of paragraphs on how to track tombstones and reclaim their space, and I found that approach has too much overhead. The major improvement over my original design is dividing each trunk into 8MB segments. Each segment tracks its own dead and alive objects, and the maximum object size is reduced from 2GB to 1MB. This design has proved efficient and manageable. Segmentation makes it possible to start multiple threads and defragment segments within one trunk in parallel, which significantly improves defragmentation performance. When a writer tries to acquire new space from a trunk, it only needs to find one segment with sufficient space. In most common situations, the defragmentation process can now finish on time.
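
Here is a rough sketch of the segmented layout, with illustrative names and the sizes mentioned above (8MB segments, 1MB cells); the real implementation surely differs, but the shape is per-segment bump allocation, in-place tombstones, and writers that just hunt for any segment with room:

```go
// Package segtrunk sketches the segmented trunk layout.
package segtrunk

import (
	"errors"
	"sync"
)

const (
	SegmentSize = 8 << 20 // 8MB per segment
	MaxCellSize = 1 << 20 // cells may not exceed 1MB
)

type Segment struct {
	mu        sync.Mutex
	data      [SegmentSize]byte
	appendPos int // bump allocator within the segment
	deadBytes int // space held by tombstoned cells
}

// Alloc reserves size bytes in this segment, or fails if it is full.
func (s *Segment) Alloc(size int) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.appendPos+size > SegmentSize {
		return 0, errors.New("segment full")
	}
	offset := s.appendPos
	s.appendPos += size
	return offset, nil
}

// Tombstone marks a dead cell in place instead of appending a
// tombstone record to the log, avoiding the separate tombstone
// tracking the RAMCloud paper spends so many paragraphs on.
func (s *Segment) Tombstone(offset, size int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[offset] = 0xFF // illustrative "dead" marker in the cell header
	s.deadBytes += size
}

type Trunk struct {
	Segments []*Segment
}

// Write finds any segment with room; no global compaction is needed,
// and separate segments can be cleaned by separate threads.
func (t *Trunk) Write(cell []byte) (*Segment, int, error) {
	if len(cell) > MaxCellSize {
		return nil, 0, errors.New("cell exceeds 1MB limit")
	}
	for _, s := range t.Segments {
		if off, err := s.Alloc(len(cell)); err == nil {
			copy(s.data[off:off+len(cell)], cell)
			return s, off, nil
		}
	}
	return nil, 0, errors.New("no segment has sufficient space")
}
```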

I also redesigned the backup process. With trunks divided into segments, synchronizing in-memory data to disk is much more straightforward. One thread per trunk keeps detecting dirty segments and sends them to remote servers for backup, one at a time. A dirty segment is a segment containing new cells; segments with only tombstones are not considered dirty and do not need backup, because the system tracks a version for each cell. During recovery, a cell with a lower version is replaced by one with a higher version, but never the other way around.
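
The recovery rule is simple enough to sketch in a few lines (hypothetical names, assuming each backed-up cell carries its version):

```go
// Package recovery sketches the version-based merge used during
// recovery: replayed cells only ever overwrite older versions.
package recovery

type BackedUpCell struct {
	ID      uint64
	Version uint64
	Body    []byte
}

// Apply replays one cell from a backed-up segment into the store.
// Because every cell carries a version, segments can be replayed in
// any order: a stale copy can never clobber a newer one, which is
// also why tombstone-only segments never need to be shipped at all.
func Apply(store map[uint64]BackedUpCell, c BackedUpCell) {
	if existing, ok := store[c.ID]; ok && existing.Version >= c.Version {
		return // already have an equal or newer version; ignore
	}
	store[c.ID] = c
}
```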

At the application level, the 1MB object size does impose a limitation. In project Morpheus, it is quite possible for one cell to exceed this limit, especially for edge lists: one vertex may have a large number of in or out edges. But the limitation can be overcome by implementing a linked list. Each edge-list cell contains an extra field indicating the id of the next list cell. When the list needs to be iterated, we can follow the links to retrieve all of the edges.
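
A sketch of the chained edge-list idea, with made-up field names rather than Morpheus's actual schema:

```go
// Package edges sketches chained edge-list cells: when one cell would
// exceed the 1MB cell limit, it links to a continuation cell by id.
package edges

// NilCell marks the end of the chain.
const NilCell = uint64(0)

type EdgeListCell struct {
	Edges      []uint64 // edge ids held by this cell
	NextCellID uint64   // id of the continuation cell, or NilCell
}

// AllEdges walks the chain of cells and gathers every edge. lookup
// resolves a cell id to its cell, e.g. via the store's hash table.
func AllEdges(headID uint64, lookup func(uint64) *EdgeListCell) []uint64 {
	var all []uint64
	for id := headID; id != NilCell; {
		cell := lookup(id)
		if cell == nil {
			break
		}
		all = append(all, cell.Edges...)
		id = cell.NextCellID
	}
	return all
}
```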

There are still some bugs in Nebuchadnezzar and I don't mean to recommend it for production use, but I am glad to find that this project has made some progress instead of becoming another of my short-term fever projects.

Some thoughts about the storage system for the in-memory key-value store

I have been struggling with how to implement the durability feature for the key-value store, and have finished my first attempt at writing memory data to stable storage. The paper from Microsoft did not give me much of a clue: it just mentions that they designed a file system called TFS, similar to HDFS from Apache Hadoop, and gives no more information about how and when memory data is written to the file system. My first and current design for memory management followed their implementation. After some tests, importing a large amount of data into the store and then frequently updating the inserted cells, I realised the original design from Trinity is problematic.

Memory management, here meaning defragmentation, is used to reclaim space from deleted cells. In update operations especially, the store marks the original space as a fragment and appends the new cell at the append head. If that space is not reclaimed in time, the container fills up and no more data can be written. The approach in my previously presented article has a major flaw: it copies every cell that follows the position of the fragment. That means if we mark the first cell in the trunk as a fragment, the defragmentation process will copy all the remaining cells in the trunk. Under a harsh workload, defragmentation will never catch up.

The key-value store also supports durability. Each trunk has a mirrored backup file on multiple remote servers to ensure high availability. Each operation on the trunk memory space, including defragmentation, marks the modified region as dirty in a skip-list buffer, ordered by the memory address offset of the operation. At a certain interval, a daemon flushes all of the dirty data, with their addresses, to the remote backup servers. The backup servers keep the data in a buffer, and another daemon takes data from the buffer and writes it to disk asynchronously. With the defragmentation approach above, every defragment operation also dirties all of the copied cells, which is a huge I/O performance problem.
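
Sketching the dirty-region buffer (a sorted slice stands in for the skip list here, and all names are illustrative; the point is the offset ordering, not the data structure):

```go
// Package dirtybuf sketches the dirty-region buffer: writes record the
// modified [offset, offset+length) range, ordered by offset, and a
// daemon periodically drains the buffer to the backup servers.
package dirtybuf

import (
	"sort"
	"sync"
)

type Region struct {
	Offset int64
	Length int64
}

type DirtyBuffer struct {
	mu      sync.Mutex
	regions []Region // kept sorted by Offset
}

// MarkDirty records a modified region, keeping offset order so the
// flusher can ship contiguous ranges together.
func (b *DirtyBuffer) MarkDirty(off, length int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	i := sort.Search(len(b.regions), func(i int) bool {
		return b.regions[i].Offset >= off
	})
	b.regions = append(b.regions, Region{})
	copy(b.regions[i+1:], b.regions[i:])
	b.regions[i] = Region{Offset: off, Length: length}
}

// Drain hands all pending regions to the flush daemon and resets the
// buffer; the daemon then reads those ranges from the trunk and sends
// them to the remote backups.
func (b *DirtyBuffer) Drain() []Region {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.regions
	b.regions = nil
	return out
}
```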

Then I read the paper from Stanford to find a more feasible solution. The authors provide full details on log-structured memory management. The main feature of this approach is organizing memory space into append-only segments. In the cleaning process, the cleaner produces fully empty segments for reuse. Every operation, including add, delete, and update, is an appended log entry, and the system maintains a hash table to locate cells within segments. The cleaner is much more complex than the defragmentation in my key-value store, but it makes it possible to minimize memory copying and to clean in parallel. The downsides are that the maximum cell size is limited to 1MB by the segment size, whereas my current approach allows up to 2GB per cell (even though cells that big are not realistic), and that maintaining segments costs memory: as the memory space grows, the number of segments grows accordingly. I prefer to give the log-structured approach a try, because defragmentation will become the major bottleneck in the current design, and large objects are not essential for my current use cases.
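
A minimal sketch of that layout, under the assumptions above (illustrative names, segment roll-over omitted): the hash table is the only mutable structure, which is what lets the cleaner relocate live cells cheaply.

```go
// Package logstore sketches the log-structured layout: append-only
// segments plus a hash table that maps cell ids to their latest
// location. Deleting or updating never touches old bytes; the cleaner
// later reclaims segments whose entries are all stale.
package logstore

type Location struct {
	Segment int
	Offset  int
}

type LogStore struct {
	segments [][]byte            // append-only segment payloads
	index    map[uint64]Location // cell id -> latest location
}

func NewLogStore() *LogStore {
	return &LogStore{segments: [][]byte{nil}, index: make(map[uint64]Location)}
}

// Append writes a new version of the cell at the log head and points
// the index at it; the previous copy simply becomes garbage.
func (s *LogStore) Append(id uint64, body []byte) {
	last := len(s.segments) - 1
	seg := s.segments[last]
	// (Segment-full handling omitted; a real store would roll over.)
	s.index[id] = Location{Segment: last, Offset: len(seg)}
	s.segments[last] = append(seg, body...)
}

// Lookup resolves a cell through the hash table, the only mutable
// structure in the design.
func (s *LogStore) Lookup(id uint64) (Location, bool) {
	loc, ok := s.index[id]
	return loc, ok
}

// Delete just drops the index entry; the cleaner notices the dead
// bytes when it scans the segment.
func (s *LogStore) Delete(id uint64) {
	delete(s.index, id)
}
```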

Managing memory is far more complex than I thought at first. Other systems, like the JVM and other language runtimes, have spent years improving their GCs, with each improvement reducing pauses by perhaps a factor of one to two. I cannot tell which is the best solution without proper tests, because each solution seems specialized for certain use cases. I plan to do more research on this topic in the future.