Servers are crashing: bad memory modules or hot weather?

I cannot pin down the cause, but recently two of my servers, each with 8 x 16GB of Kingston DDR3 1600 MHz ECC REG memory, have started reporting correctable ECC memory errors. One of the servers recovers by itself; the other is completely dead and gets stuck at the following screen on startup.

[Image: the startup screen where the server gets stuck]

I used to resolve such problems by finding and pulling out the problematic modules. The server that cannot recover by itself is the one I placed in the data center as a long-running server. It has suffered six memory failures, each from a different module plugged into the same slot. Last year I started to suspect a cooling problem: there may be no airflow across the module in that slot, so it overheats. The chassis is a 1U with 4 hard drives. I invested a lot in this machine, and the small space became crowded with 2 additional SSDs for caching and a caseless router for IPMI. Once I unplugged the memory module that gets no airflow, the server was stable for a while (nearly 1 year), but it crashed today.

The other server, which I use at home as a proving ground, computation server, and virtual machine host, showed a similar problem while I was testing Morpheus with the Wikidata graph on it. Instead of crashing, it recovers by itself but leaves the following messages in the server log:

Apr 30 18:46:00 shisoft-vm kernel: [90407.431330] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Apr 30 18:46:00 shisoft-vm kernel: [90407.431337] EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: 8c00004000010090
Apr 30 18:46:00 shisoft-vm kernel: [90407.431338] EDAC sbridge MC1: TSC 0
Apr 30 18:46:00 shisoft-vm kernel: [90407.431340] EDAC sbridge MC1: ADDR 1d7a258b00 EDAC sbridge MC1: MISC 204a167686
Apr 30 18:46:00 shisoft-vm kernel: [90407.431342] EDAC sbridge MC1: PROCESSOR 0:206d5 TIME 1462013160 SOCKET 1 APIC 20
Apr 30 18:46:00 shisoft-vm kernel: [90407.659365] EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x1d7a258 offset:0xb00 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:4 rank:1)

If I unplug the modules reported as problematic, other modules fail in the next round of tests. That leaves me no option but to ignore the errors.
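To check whether the errors cluster on one slot or really do move from module to module, it can help to tally the correctable errors per DIMM label from the kernel log. The following is only a rough sketch: the regular expression is written against the single "CE memory read error" line quoted above, so other kernels or EDAC drivers may format the line differently, and the function name `tally_correctable_errors` is my own, not part of any tool.

```python
import re
from collections import Counter

# Pattern for the EDAC summary line, based only on the one sample in this post:
# "EDAC MC1: 1 CE memory read error on <label> (channel:2 slot:0 ..."
# Other kernel versions may word this differently (assumption).
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def tally_correctable_errors(lines):
    """Sum the reported correctable-error counts per DIMM label."""
    totals = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            totals[match.group("label")] += int(match.group("count"))
    return totals


if __name__ == "__main__":
    # Two lines copied from the log above; only the second is a CE summary.
    sample = [
        "Apr 30 18:46:00 shisoft-vm kernel: [90407.431330] "
        "EDAC sbridge MC1: HANDLING MCE MEMORY ERROR",
        "Apr 30 18:46:00 shisoft-vm kernel: [90407.659365] "
        "EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#2_DIMM#0 "
        "(channel:2 slot:0 page:0x1d7a258 offset:0xb00 grain:32 syndrome:0x0 "
        "- area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:4 rank:1)",
    ]
    for label, count in tally_correctable_errors(sample).most_common():
        print(label, count)
```

Feeding it the whole log file (for example `tally_correctable_errors(open(path))`, where `path` is wherever your system keeps the kernel log) gives a per-DIMM count. If one label dominates, the slot or its airflow is the likely suspect; if the counts are spread evenly, the modules themselves or the motherboard look more suspicious.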

I have another machine with 2 x 16GB of Samsung 2133 MHz DDR4 REG ECC memory. It was assembled at the beginning of last year and has no such problems, even when its memory is exhausted and it starts swapping. I strongly suspect the failures are caused by heat, or that my hardware supplier did not give me sound parts (the motherboard could also cause such problems).

For now, I have decided to upgrade the machine with Samsung memory to 96GB and add an Intel 750 400GB SSD as secondary storage for project Morpheus. I also plan to replace the machine in the data center with a new one. With the new server I will pay more attention to heat sinks and cooling, and I hope it will be less troublesome in the future.

I don't recommend purchasing hardware and placing it in a data center when cloud platforms (for example DigitalOcean and Amazon EC2) are affordable for your applications. My use cases are demanding: I have to customize my servers to balance performance and price, and I also have to manage server hardware problems myself.
