Other Periodic disk activity and system 'hangs' while waiting for IO to complete

tOsYZYny · Saturday at 2:48 PM

I still use physical spinning disks and can hear when they're busy. A while ago I had an issue where a periodic job was writing to disk every few seconds. AFAIK I am no longer running that job, yet something is still writing to disk every few seconds.

Whenever it does, the system appears to hang: the keyboard and mouse are non-responsive, and sometimes keyboard input is not even captured during that time. So, if I was typing while the system was out to lunch, the keystrokes were silently dropped.

If I open a new terminal window, it takes about a second for it to load (I have a bunch of things it does). After a few minutes when everything is cached, opening a new tab is instantaneous.

I am running iotop with a refresh every second, but I don't see anything standing out. I recently switched to a different hard drive as I rebuilt my system. I should say that all my drives are in some state of failure, and I think this one might be worse off. My workstation and router are running on the same physical box in 2 separate jails, and I have recently set up rctl to limit the resources the jails can use.

My router uses fairly minimal resources. I have it set to 50% CPU and 2G of RAM, and when I list the resources it consumes, it is well below those limits. For the workstation, I have it set to 300% CPU and 16G of RAM. I only approach 16G of RAM when using go fix on some larger projects or when digikam is running.

I have no other resource limits set.

htop shows a fairly minimal load on the system in terms of both CPU and memory. Perhaps I don't know how to read iotop, but nothing stood out there either.

What other tool(s) shall I use to investigate this? My other system for reference did not have rctl setup, but I noticed the pausing even before using rctl, so I don't believe that is the culprit or factor. Perhaps it is indeed the drive, I can always swap over to that for comparison. If the drive were going, would dmesg show that or perhaps SMART tools?

EDIT #1:
drive A:

raw read error rate: 51334312
seek error rate: 444635657

drive B:

raw read error rate: 4270608
seek error rate: 498599102

Drive B has a lower raw read error rate, but higher seek error rate. It was powered on for about 2000 more hours.

EDIT #2:
If I look at iostat -w1x, I periodically see the tout, KB/t, and tps numbers increase every 5 seconds which seems to correspond to the hard drive noise I can hear. My system CPU is an i5-3470 and I'm using the onboard GPU, not an external unit. I know the onboard GPU's performance isn't great, but I can generally watch full HD videos without the system pausing. Beyond that, I notice pauses and the number of dropped frames increases.

iostat -wx1

2 237 16.4 74 1.19 0.0 0 0.00 0.0 0 0.00 2 0 0 0 98
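One way to check the 5-second rhythm without relying on the ear is to filter the iostat stream and print only the samples where tps crosses a threshold. The following is a rough sketch, not a tested recipe for this machine. It assumes the default (non-extended) column layout of the sample line above, where the fourth field is the first disk's tps. The function name `flag_bursts` and the threshold of 50 are made up for illustration.

```shell
# flag_bursts THRESHOLD
# Reads iostat interval lines on stdin and prints "sample-number tps"
# for every sample whose 4th field (assumed: first disk's tps) exceeds
# THRESHOLD. Header lines are skipped because their 4th field is not numeric.
flag_bursts() {
    awk -v limit="$1" '
        $4 ~ /^[0-9]+(\.[0-9]+)?$/ {
            n++
            if ($4 + 0 > limit) print n, $4
        }
    '
}

# Assumed usage on the live system (not run here):
#   iostat -w 1 | flag_bursts 50
```

With a one-second interval the sample number doubles as a rough seconds counter, so flagged samples that sit about 5 apart would match the noise pattern described above.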

I'm mainly wondering if there is a way to improve this pausing that seemed to crop up recently. Perhaps I try disabling resource limits to see if that has an effect.

I disabled rctl and re-enabled it, and that is where I can see the difference. For example, whenever I open a new terminal, the terminal sets up an ssh-agent if it needs to. With rctl enabled, it waits for a lock. With it disabled, I can open many tabs concurrently, and they all complete quickly. It seems rctl is affecting lock files? I need to investigate more.

My idea for using rctl was to prevent a jail from bringing down the host by consuming too many resources. However, the only settings I'm touching are CPU and memory.
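To compare the two cases with a number rather than a feeling, something like the sketch below could time several concurrent launches of the same command, once with the rctl rules loaded and once without. `time_parallel` is a hypothetical helper, and the command to run (for example, whatever startup script the terminal executes) is left to the reader. Resolution is whole seconds, which is coarse but enough to see multi-second stalls.

```shell
# time_parallel N CMD [ARGS...]
# Starts N copies of CMD in the background, waits for all of them,
# and prints the elapsed wall-clock time in whole seconds.
# Note: `wait` also waits for any other background jobs of the calling
# shell, so run this in a fresh shell or inside $( ... ).
time_parallel() {
    n=$1
    shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1 &
        i=$((i + 1))
    done
    wait
    end=$(date +%s)
    echo $((end - start))
}

# Assumed usage (the script path is a placeholder, not from this thread):
#   time_parallel 5 /path/to/terminal-startup-script
```

If the elapsed time grows roughly in proportion to N only when the limits are active, that would point at throttling or serialization rather than at the disk.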

EDIT #3:
I am not certain the perceived hangup has anything to do with the disk. It doesn't seem like there is that much activity presently. I came across another post that suggests that the system hanging could actually be the monitor:

Thread 'Debugging system hang-ups'

May 29, 2017

I'm running into a problem where my system (running 11.0-RELEASE-p9) becomes unresponsive and I'm looking for some advice on how to debug it.

When it happens, the system does not respond to any inputs. I am always in X when it happens, but I can't say whether X is actually the problem. No keyboard or mouse input seems to be recognized, and I can't ssh into the system (it always times out).

All that I can tell is that the fans on my system kick in, which suggests that CPU usage is very high. Again, I have no idea what is actually using it.

I do not know how to trigger the problem. It...

In my case, I notice it now with watching videos. A video with a bandwidth of 1958 kb/s plays fine, but 4416 kb/s is choppy. Both are the same framerate, codec, and resolution.

tOsYZYny · Monday at 10:54 AM

I think the hangups are due to the device starting to fail. This is just speculation, but I am monitoring the raw error rate and I can see it steadily increasing:

while true; do smartctl -a /dev/ada0 | grep Raw_Read_Error_Rate; sleep 15; done

I'm getting roughly 3000 errors every 15 seconds, which seems a bit high to me. When I started monitoring this:

81797472

and now:

86395168

It seems the drive may die today or at least become unusably slow.
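To put a number on the growth instead of eyeballing successive readings, a sketch like the following could print the change between samples. It assumes the raw value is the last column of the Raw_Read_Error_Rate line in the smartctl attribute table. `raw_value` and `deltas` are hypothetical helper names.

```shell
# raw_value: read smartctl attribute output on stdin and print the raw
# (last-column) value of the Raw_Read_Error_Rate attribute.
raw_value() {
    awk '$2 == "Raw_Read_Error_Rate" { print $NF }'
}

# deltas: read one reading per line and print the increase since the
# previous reading (nothing is printed for the first line).
deltas() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Assumed usage on the live system (not run here):
#   while true; do smartctl -A /dev/ada0 | raw_value; sleep 15; done | deltas
```

Fed the two readings quoted above, `printf '%s\n' 81797472 86395168 | deltas` prints 4597696, the total increase over the monitoring period.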

SirDice · Monday at 11:14 AM

tOsYZYny said:
It was powered on for about 2000 more hours.

That's not really old.

Here's one of mine:

Code:

  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1353733405
  9 Power_On_Hours          0x0032   022   022   000    Old_age   Always       -       68570

tOsYZYny said:
which seems to correspond to the hard drive noise I can hear.

Weird noises? Could it be the drive is powering up and you're hearing the initialization "rattle" of the heads? Some power-saving option that turns the drive off? Then gets woken up when accessed? That could cause a slight delay too.

tOsYZYny · Monday at 11:48 AM

I think it is disk access noise, it isn't terribly loud and is roughly about 250 ms in duration. But for instance, when I run digikam to scan my media collection, that makes a ton of that noise for as long as the scan takes place. I believe that is the head moving back and forth. My media collection is on a different disk entirely.

I don't believe it is powered down, as I would hear it spin up first. The drives aren't idle long enough for that to happen: every 5s, I hear about 250ms worth of disk access. It is more pronounced when the system first boots, as the cache is empty.

The raw error rate is now:
96648400

I was just thinking we're comparing whose flesh wound is more serious. There should be a parody of the Black Knight in Monty Python and the Holy Grail, but for computers.

I think I made it spike a bunch as I decided to rebuild my system just in case this drive kicks the bucket. That process pulls from the old system, using it as a package cache and as the source for the git projects and ZFS volumes to restore, so it puts a bit of strain on the disk. My current cold disk image is about 1 week old.

Emrion · Monday at 12:11 PM

tOsYZYny said:
I think the hangups are due to the device starting to fail.

The symptoms you describe are typically those of a hard disk that will soon die.
Don't wait for the disaster; save the valuable data right away.

tOsYZYny · Monday at 11:35 PM

I should say, those error counts are increasing drastically because I am making another system from the current system. It is pulling git and ZFS from the current system. But yes, I back up nightly; it is just the restoration part that might be tricky without a live copy.

tOsYZYny · 2024-12-04T02:13:28+0000

I swapped out the drive for a slightly 'better' one that has a lower error count, and I don't hear the disk activity that I heard with the other one (original drive started at about 8M errors and last had 14M when powered down, this is about 2.2M but similar age). I still think something is going on, but I'm not sure what. The system has been up for a little while, so things should be cached. There is still considerable latency when opening a new terminal window. While I have a bunch of scripts that get loaded, I don't believe that is the cause, since it used to be quick and I have made no changes recently. When that new tab opens, the keyboard and mouse don't respond.

Perhaps I need to do a memory test?

tOsYZYny · 2024-12-04T11:46:37+0000

I moved the drive back over to my backup system and am experiencing the same pausing there after I put it in the main machine. So, perhaps it isn't a hardware issue, or if it is, it is identical, which I find extremely unlikely. The only change I made recently (in the past week) was adding rctl for the jails to limit the amount of resources a jail can use. I will disable that and see if that is the culprit. The original system had periodic disk access every 5s that wasn't a major event, but enough for me to hear. Normally, I don't hear much disk activity unless I'm doing a ton of IO.

Perhaps my configuration for rctl is too low for my workstation jail:
pcpu: 300
memory: 16G

I think I sorted out the pausing on the new system: limiting my workstation to 3 CPUs is too restrictive. It needs more for smooth operation even if the system isn't completely pinned.
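A quick way to see how close the jail runs to its CPU cap, sketched under a few assumptions: the jail is named `workstation` (hypothetical), `rctl -u` reports usage as `resource=value` pairs, and `pcpu_headroom` is a made-up helper name. The replacement value of 400 in the comments is purely illustrative.

```shell
# pcpu_headroom LIMIT
# Reads rctl usage output on stdin (resource=value pairs, separated by
# newlines or commas) and prints LIMIT minus the current pcpu value.
pcpu_headroom() {
    tr ',' '\n' | awk -F= -v limit="$1" '$1 == "pcpu" { print limit - $2 }'
}

# Assumed usage on the live system (not run here):
#   rctl -u jail:workstation | pcpu_headroom 300
#
# Raising the cap would then mean replacing the rule, for example:
#   rctl -r jail:workstation:pcpu
#   rctl -a jail:workstation:pcpu:deny=400
```

Sampling this every second during one of the pauses would show whether usage is actually pressing against the limit at that moment, which an averaged htop view can hide.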
