Solved System hang/freezing boot/LSI 3108 and Seagate Exos 7E8

mikkol · Jul 28, 2020

I was happily converting bash scripts to csh, when all of a sudden the server froze. Prior to freezing, it was working well for several days.

After rebooting, the server would hang at Consoles: EFI Consoles. I suspected that a big spider had done damage in the server, but no bugs were to be found in it. However, I noticed that removing one of the five Seagate Exos 7E8 drives would render the server bootable again. ~~I have not yet tried to see if it is that particular drive or if it was enough to remove any one of those five.~~ I tested different configurations with four Exos drives installed at a time, and it did not matter which one of them was left out. Ultimately, the whole thing started booting again with all five Exos drives installed. Leaving the two Kingston KC600s in the machine did not hinder the boot.

All of the disks, the Seagates and the Kingstons alike, are connected to a Supermicro LSI 3108 controller running in JBOD mode with firmware 6.36. The Kingstons are SATA, the Seagates are SAS. The driver used for the controller is mrsas(4) because mfi(4) provided intolerably slow data transfers.

What tools exist for me to see what might have caused the freezing and what would be blocking the boot? /var/log/messages does not contain anything about the crash, dmesg provides a perfectly normal boot log.

Edit 2: on a possibly related note, dmesg will report the transfer speeds to be 150.000MB/s, which is not correct. A dd or cat measurement will yield much higher rates.
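
For reference, a dd measurement of the kind mentioned above might look like the following sketch. The device name da0 and the helper name read_test are assumptions for illustration, not taken from the actual system. The command only reads from the device and discards the data, so it should not alter the disk.

```sh
#!/bin/sh
# Sketch: sequential read test. dd prints the transfer rate on stderr when done.
# read_test is a hypothetical helper; /dev/da0 below is an assumed device name.

# read_test SOURCE [MIB]: read MIB mebibytes (default 1024) from SOURCE.
read_test() {
    dd if="$1" of=/dev/null bs=1048576 count="${2:-1024}"
}

# Usage (as root): read_test /dev/da0 4096
```

A single sequential read like this is only a rough figure, and caching can inflate repeated runs.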

SirDice · Jul 28, 2020

mikkol said:
I was happily converting bash scripts to csh

Don't use csh for scripting. If you want to convert those scripts convert them to sh(1). The csh(1) is great for interactive use but an absolute horror show for scripting. None of FreeBSD's own scripts are written for csh(1), they all use sh(1).

Csh Programming Considered Harmful

mikkol said:
However, I noticed that removing one of five Seagate Exos 7E8 drives would render the server bootable again. I have not yet tried to see if it is that particular drive or if it was enough to remove any one of those five.

Is that drive part of a zpool? Because it happens quite often that a broken disk would just hang up the entire pool. One of my pools had this happen on several occasions and every time this happened one of the disks was bad.

mikkol · Jul 28, 2020

SirDice Thank you for the response. I actually don't specifically want to convert them to csh. I was perfectly happy with the bash scripts but was under the impression that any cronned job that would run as root better be csh and not bash. Come to think of it again, I will probably cron with a specific reference to bash and keep my full working bash scripts.

Yes, all five drives were part of a RAIDZ1 zpool, inside of which I had also created a 1TB zvol, which was geli(8)ed, as I understood this to be the approach for encrypting just part of a zpool. What I have done now is destroy the zvol and the zpool and am considering setting up just two mirrors and putting one drive on the shelf as a cold spare.

SirDice · Jul 28, 2020

mikkol said:
I was perfectly happy with the bash scripts but was under the impression that any cronned job that would run as root better be csh and not bash.

sh(1), not csh(1).
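
As a rough illustration of the point, a cron-friendly sh(1) script could start out like the sketch below. The helper log_line and the script path in the crontab comment are hypothetical. cron starts jobs with a sparse environment, so the sketch sets PATH itself.

```sh
#!/bin/sh
# Minimal sketch of a cron-friendly POSIX sh script.
# Hypothetical root crontab entry that would run it every night at 03:00:
#   0 3 * * * /bin/sh /root/bin/nightly.sh

# cron provides a very small PATH, so make sure the usual directories are present.
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin${PATH:+:$PATH}
export PATH

set -eu  # abort on command failure and on use of unset variables

# log_line is a hypothetical helper: prints its arguments with a timestamp prefix.
log_line() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*"
}

log_line "job started"
```

A script that genuinely needs bash features can still be run from cron by naming the interpreter explicitly, for example /usr/local/bin/bash, which is where the FreeBSD bash package installs it.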

mikkol said:
Yes, all five drives were part of a RAIDZ1 zpool, inside of which I had also created a 1TB zvol, which was geli(8)ed, as I understood this to be the approach for encrypting just part of a zpool. What I have done now is destroy the zvol and the zpool and am considering setting up just two mirrors and putting one drive on the shelf as a cold spare.

Why don't you simply replace the broken drive? Disks break, that's inevitable. The only question is when.

mikkol · Jul 28, 2020

SirDice said:
sh(1), not csh(1).

Why don't you simply replace the broken drive? Disks break, that's inevitable. The only question is when.

I would replace the broken disk if I could identify which one of them, if any, is broken. smartctl indicates that all of them are OK. I was wondering whether I must download SeaTools and test each drive individually, or whether there are other tools available that I could run in FreeBSD.
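
One way to go through the health check for all drives in one pass is sketched below. The device names da0 to da4 and the helper names health_verdict and check_disks are assumptions. The sketch only parses the summary line of `smartctl -H`, which is worded differently for SAS and SATA drives.

```sh
#!/bin/sh
# Sketch: summarize `smartctl -H` output for several disks.
# Device names passed to check_disks (da0 ... da4) are assumptions.

# health_verdict reads `smartctl -H` output on stdin; prints OK, FAILING or UNKNOWN.
health_verdict() {
    awk '
        # SAS/SCSI drives report: "SMART Health Status: OK"
        /SMART Health Status:/ { v = ($4 == "OK") ? "OK" : "FAILING" }
        # SATA drives report: "... self-assessment test result: PASSED"
        /overall-health self-assessment test result:/ { v = ($NF == "PASSED") ? "OK" : "FAILING" }
        END { print (v == "" ? "UNKNOWN" : v) }
    '
}

# check_disks DEV...: print one verdict line per device.
check_disks() {
    for dev in "$@"; do
        printf '%s: ' "$dev"
        smartctl -H "/dev/$dev" 2>/dev/null | health_verdict
    done
}

# Usage (as root): check_disks da0 da1 da2 da3 da4
```

Since the health summary already reports OK here, a long self-test started with `smartctl -t long /dev/da0` and read back later with `smartctl -l selftest /dev/da0` may reveal more, although a clean result still would not rule out a controller or cabling problem.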

mikkol · Jul 28, 2020

In the meantime, the Kingston drives also appeared to have a hiccup. To rule out a faulty controller, I pulled the 3108 controller out of the server and switched to the onboard 3008 (in IT mode). It remains to be seen what happens now.

mikkol · Jul 28, 2020

After replacing the 3108 with the 3008 and replacing mrsas(4) with mpr(4), the following things have happened so far:

The idle CPU power consumption dropped from ~17W to ~11W.
The server's power consumption dropped by 25W.
The transfer rates reported by dmesg rose from 150MB/s to 600MB/s and 1,200MB/s.

I am keeping my fingers crossed and will continue to monitor. If no problems arise, I will mark this as solved, with a faulty controller as the cause.

mikkol · Aug 4, 2020

I must assume that the AVAGO LSI 3108 controller either became faulty or does not play well with the newly added Kingston KC600 drives. I solved this by moving all drives to the AVAGO LSI 3008 and removing the 3108 from the system entirely.
