Solved New hardware, ssh session is hanging

arader · Oct 27, 2015

Hi all,

I just put together what will be a new router for my house, and I'm seeing an issue with SSH hanging while I'm trying to configure it. What happens is the terminal will stop echoing my keystrokes for 10+ seconds at a time, then suddenly everything I've typed will show up. I've seen it hang long enough to cause my client to actually kill the connection with a 'Broken Pipe' error. When it does this I can't even reconnect; new connections just time out. It really seems like sshd is hanging on something, but I'm not experienced enough in the art of dtrace to figure it out on my own.

I've confirmed that I can still interact with the machine using the console during such a hang, so it's either network stack related, or sshd related. This happens using wired and wireless clients running OSX, iOS, FreeBSD, and Windows, so I'm fairly certain I've localized it to the new machine.

I've tried the following:

Stripping my rc.conf down to the bare necessities (ifconfig, sshd_enable)
Switched my connection to each of the 4 onboard Intel NICs
Switched Ethernet cables
Switched ports on the switch
Set sshd(8) to listen only on 1 IP

Machine details:

Supermicro A1SAi-2550F
16GB ECC RAM
2x 120GB SSDs in Mirrored root pool
GPT partitions
FreeBSD 10.2 Release #0

Has anyone experienced something like this before? At this point I'm looking for some next steps to dig deeper. Anyone have some dtrace guides for this sort of thing? It's feeling like the process is hanging on something, but right now I have no clue what.

thanks!

tingo · Oct 28, 2015

Nothing in /var/log/messages? In the old days this often was related to irq storms on various hardware that FreeBSD didn't like (or the other way around).

robroy · Oct 28, 2015

arader, my Supermicro X9SBAA-F has a symptom wherein network connections feel "choppy," or experience sporadic, brief hangs, unless I disable the NIC offload accelerations.

I suspect that my symptom may be related to IPFW and in-kernel NAT not playing nicely with these accelerations, so this may not apply to you.

Yet if you're interested in giving it a whirl, run something like this:

Code:

ifconfig igb0 -tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum
ifconfig igb1 -tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum
ifconfig igb2 -tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum
ifconfig igb3 -tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum

If this is going to help, you'll see the difference without having to reboot.
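
The four commands above can also be written as a loop. This is only a sketch, assuming the same igb0 through igb3 interface names and the same flag list as above. IFCONFIG, OFFLOAD_FLAGS and disable_offloads are made-up names for illustration; pointing IFCONFIG at a stand-in command gives you a dry run before you touch a live interface.

```sh
#!/bin/sh
# Sketch only: the flag list is copied from the commands above.
# IFCONFIG is a hypothetical override so the loop can be dry-run.
IFCONFIG=${IFCONFIG:-ifconfig}
OFFLOAD_FLAGS="-tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum"

# disable_offloads IFACE...
# Turn off the offload features on every interface named on the command line.
disable_offloads() {
    for nic in "$@"; do
        # OFFLOAD_FLAGS is left unquoted on purpose so it splits into
        # one argument per flag.
        "$IFCONFIG" "$nic" $OFFLOAD_FLAGS
    done
}

# Usage (as root):
#   disable_offloads igb0 igb1 igb2 igb3
```

To undo it, the same flags without the leading minus sign should turn each feature back on.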

arader · Oct 28, 2015

tingo said:
Nothing in /var/log/messages? In the old days this often was related to irq storms on various hardware that FreeBSD didn't like (or the other way around).

Nothing out of the ordinary, but I'll double check when I'm home tonight

robroy said:
arader, my Supermicro X9SBAA-F has a symptom wherein network connections feel "choppy," or experience sporadic, brief hangs, unless I disable the NIC offload accelerations.

I suspect that my symptom may be related to IPFW and in-kernel NAT not playing nicely with these accelerations, so this may not apply to you.

Yet if you're interested in giving it a whirl, run something like this:

Code:

ifconfig igb0 -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum
ifconfig igb1 -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum
ifconfig igb2 -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum
ifconfig igb3 -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum

If this is going to help, you'll see the difference without having to reboot.

I thought of trying this on my way in to work today, so glad I'm not way off base. I'll give this a shot.

Since I'm so early in the set up I might even try running -CURRENT and see if it still repros there, but I don't want to lose the chance to learn some new debugging skills quite yet.

robroy · Oct 28, 2015

Here's a little more information that might be of interest, in the meanwhile.

My symptomatic X9SBAA-F that's cured by disabling accelerations runs 10.1-RELEASE, and is at BIOS level 1.1 (it came with 1.0c).

I also have an A1SRM-2758F that's on 10.0-RELEASE, and I've never had any networking symptoms with it despite having left all of its default NIC accelerations enabled. Though, its IPFW and in-kernel NAT configuration's different from the X9SBAA-F's.

I noticed that the A1SAi-2550F's newest BIOS level's 1.1a. If you're not already on 1.1a, 'might be worthwhile to update it and try again (if nuking the accelerations doesn't help).

robroy · Oct 28, 2015

I just added -tso to the commands above; I left it out accidentally.

arader · Oct 29, 2015

Thanks for the pointers robroy, this morning before work I tried the following:

Disabled the various accelerations on all the interfaces, no effect on my SSH issues
I then checked my BIOS, and I was on 1.1 - I then flashed it to 1.1a but still no effect

I haven't yet tried disabling the accelerations on the new BIOS, but I'm not very hopeful it will change anything.

This morning I was lucky though that the hangs were pretty bad, to the point where I couldn't connect via SSH at all for a period of time. I logged in via the console and messed around some, and here are the notes:

The main sshd(8) process (the one that listens on port 22) was in the 'Is' or sometimes 'Ss' state, so nothing seems odd there
After being booted via the 'Broken Pipe' message, there were still two sshd(8) processes running, one as 'root', one under my user name. It might have been coincidence, but I couldn't reconnect until I killed these two processes. Even restarting sshd(8) (/etc/rc.d/sshd restart) didn't get me back in, but I could reconnect after killing those two.
I ran dtruss -p <pid> for both the main sshd(8) process and the process running as my user. Some notes:
- While everything was working as expected, the sshd(8) process running as my user would do the classic BSD sockets loop (block on getsockopt(2), call select(2) then read(2), then block again on getsockopt(2))
- Once a hang would occur, dtruss(1) didn't show any further syscalls being made, it would just continue to block on getsockopt(2). Perhaps this points to some issue in the networking stack or drivers? It really does seem like these processes aren't being woken up when new data arrives on their socket.
- The dtruss(1) instance running on the main sshd(8) instance didn't seem very interesting. When the new connection was made you can see the call to select(2), accept(2), fork(2), and some close(2) calls.

So for my next steps I'm going to try disabling the interface accelerations on the new BIOS, and if that still doesn't work, wipe the machine and install from -CURRENT to see if anything from there improves matters.

thanks so much for your help so far! Keep the suggestions coming!

_martin · Oct 29, 2015

Maybe worth trying to log in differently; good old telnet, enabled via /etc/inetd.conf, could help.
What about a tcpdump trace?
Did you try checking the statistics with netstat -s?
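
For the netstat -s idea, one way to make the counters readable is to snapshot them before and after a hang and diff the two files. A rough sketch, assuming the TCP counters are the interesting ones (hence -p tcp); snapshot_tcp_stats, NETSTAT and the /tmp paths are hypothetical names, not anything standard.

```sh
#!/bin/sh
# Sketch only. NETSTAT is a hypothetical override so the function
# can be exercised without the real netstat.
NETSTAT=${NETSTAT:-netstat}

# snapshot_tcp_stats OUTFILE
# Save the current TCP protocol counters to OUTFILE.
snapshot_tcp_stats() {
    "$NETSTAT" -s -p tcp > "$1"
}

# Usage, from the console:
#   snapshot_tcp_stats /tmp/tcp.before
#   ...reproduce the hang...
#   snapshot_tcp_stats /tmp/tcp.after
#   diff /tmp/tcp.before /tmp/tcp.after
```

Counters that jump between the two snapshots (retransmissions, for example) are the ones worth a closer look.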

arader · Oct 29, 2015

Hi matoatlantis, I tried tcpdump briefly but it's been too long since I've used it, so it was a bit of a firehose and I couldn't really weed out my ssh attempts. I'll need to read the manpage to figure out how to filter out the noise.

I'll give netstat a try as well. thanks!

robroy · Oct 29, 2015

arader, you're welcome. 'wish I could help more. This problem's extra interesting to me 'cause I've been eyeballing the A1SAi-2550F and A1SRi-2758F for my own playground.

Have you considered re-testing after directly connecting your ssh client computer to a port on your A1SAi-2550F (with its other ports disconnected)?

This would bypass your switch, and also the (remote) possibility of any kind of network address conflict on your LAN.

Also, might you be willing to post your /etc/rc.conf lines and ifconfig -a output, to subject them to a friendly sanity check?

Finally, I think I'm barking up the wrong tree even more wildly with this--especially since your console behaves normally--yet have you done anything to qualify the RAM and motherboard's stability? I normally run MemTest86+ on any new computer for 24 to 48 hours before declaring it stable enough to install an operating system on. I've had to send off for warranty replacements a number of times based on its results.

_martin · Oct 29, 2015

You may have trouble filtering it out if you capture over the same connection you want to debug in the first place. You could connect directly (keyboard/monitor to the board) and run it from the console. Or telnet to the server, start tcpdump, and then SSH to it. Or connect to it from a different IP and set the filter to the first one.
Basic filtering could be done by:
# tcpdump -nf port 22 and host <IP_TO_DEBUG>
It's probably better to read it in Wireshark. For that you can save the dump with -w <file>.

I'd lean towards the direct connection from the console, even just to check the status with top(1), especially during the lag you've mentioned. vmstat(8) during that period is also not a bad idea (vmstat -i to see if something obvious pops up).
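
Putting the capture advice together, here is a minimal sketch of saving only the SSH traffic for one client to a file for Wireshark. It assumes you run it as root from the console; capture_ssh, TCPDUMP, the igb0 interface and the output path are illustrative names, so substitute your own.

```sh
#!/bin/sh
# Sketch only. TCPDUMP is a hypothetical override so the function
# can be dry-run with a stand-in command.
TCPDUMP=${TCPDUMP:-tcpdump}

# capture_ssh IFACE CLIENT_IP OUTFILE
# Record port 22 traffic to or from CLIENT_IP on IFACE into OUTFILE.
capture_ssh() {
    # -n: don't resolve names, -i: interface to listen on,
    # -w: write raw packets to a file instead of printing them
    "$TCPDUMP" -n -i "$1" -w "$3" port 22 and host "$2"
}

# Usage (as root, from the console; stop with Ctrl-C after a hang):
#   capture_ssh igb0 <IP_TO_DEBUG> /tmp/ssh-hang.pcap
```

Writing to a file keeps the firehose off the terminal, and the filter keeps everything but the one SSH client out of the dump.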

arader · Nov 2, 2015

Thanks for all your help so far everyone, I've made some progress, though I'm still left scratching my head.

robroy - good call on the direct connection, I'm embarrassed to say I hadn't even tried that yet, but that worked! Here's what I found:

A single ethernet cable from my laptop to any of the 4 ports on the Supermicro board works great. I statically assigned the IPs (10.9.1.1 and 10.9.1.2) on both sides and was able to sit in an active SSH connection for 8+ hours, no hiccups, no drops
I dug out an old 10/100 switch I had lying around, and connected the Supermicro board and my laptop to that, still with static IPs, and I was still able to stay connected just fine
I then added a network cable from the GigE switch to another one of the ports on the Supermicro. The laptop and Supermicro were still communicating fine with their static IPs. The Supermicro got an address via DHCP (10.0.0.23) and could hit the internet
I then moved the laptop over to the GigE switch, let it get an address via DHCP (10.0.0.119), and tried connecting to the Supermicro's DHCP address (10.0.0.23) - this was met with immediate struggles. I could connect, but within the first 15 seconds the connection had already hung
As another experiment, I moved both the laptop and the Supermicro off the GigE switch, and onto two free ports on the Comcast router I have. The DHCP addresses stayed the same, and the connection was still flaky!

So it appears that there's something very odd going on with my switch+router combination. When the Supermicro is off by itself (via direct connection or its own switch) everything works fine. The minute I try to use it on my actual network, be it through my switch or my router, the connection drops.

Unfortunately I had to run to work and couldn't investigate further, but I've been scratching my head. It's been a while since I've had to think critically about layer-2 and layer-3 networking, but what could cause this? I'm guessing it's a misbehaving device, but how would it have this effect? I'll need to fire up tcpdump tonight and see if that offers clues.

My next plan is to disconnect everything and isolate my switch, and see if that works, then slowly add devices back to the network and see when the connection goes south.

robroy · Nov 3, 2015

arader, that's great news! I like your plan.

arader said:
It's been a while since I've had to think critically about layer-2 and layer-3 networking, but what could cause this? I'm guessing it's a misbehaving device, but how would it have this effect?

With my intuition's gain knob turned up, I can think of one plausible, yet improbable cause. In other words, behold I Don't Know, Unabridged Edition.

Your Comcast device may be issuing DHCP leases with netmask and route settings which cause traffic between your A1SAi-2550F and laptop to actually be sent through itself, instead of being transmitted directly between the devices. If it were doing this, it could also be performing unwanted NAT on the traffic, and its NAT mechanism could be working poorly, and gradually starving its own resources (causing your network connection to die after fifteen seconds worth of packets).

If this were happening, the symptom would be reproducible while your laptop and A1SAi-2550F were plugged in to your internal GigE switch, and also while both were plugged in to your Comcast device's built-in switch.

If you feel inclined to post them, I'd be curious to see the netmasks and routing tables from both the A1SAi-2550F and your laptop, in two states: working (with manually configured networking), and symptomatic (with DHCP leases from your Comcast device).
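
To make the netmask theory concrete, here's a small sketch of the same-subnet test a host effectively applies before deciding whether to send a packet directly or hand it to its gateway. The two addresses are the DHCP ones you posted; the netmasks are purely hypothetical, since we haven't seen your actual leases, and ip_to_int and same_subnet are names I made up.

```sh
#!/bin/sh
# Sketch only: the netmasks in the usage lines are hypothetical.

# ip_to_int A.B.C.D
# Print a dotted-quad IPv4 address as a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    # Unquoted on purpose: split the address into its four octets.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# same_subnet IP1 IP2 NETMASK
# Succeed if both addresses share a network under NETMASK.
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    m=$(ip_to_int "$3")
    [ $(( a & m )) -eq $(( b & m )) ]
}

# Usage:
#   same_subnet 10.0.0.23 10.0.0.119 255.255.255.0   && echo "sent directly"
#   same_subnet 10.0.0.23 10.0.0.119 255.255.255.192 || echo "sent via gateway"
```

With a /24 mask the two hosts are neighbours and should talk directly; with a narrower mask like the second one they would land in different networks, and every packet between them would be handed to the Comcast device.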

I'd also be curious to know your GigE switch's make and model. I'd use this information only to verify that it's truly an Ethernet switch and nothing more (that it has no features above the data-link layer). Please disregard if you're already certain of this.

I hesitate to suggest a test that may mask the real problem (and make it harder to properly solve), yet have you already tried power-cycling your Comcast device and GigE switch, and re-trying?

I'm asking because my own Comcast device (an SMC8014) was configured by Comcast in "pass-through mode," which supposedly rendered its functions above the data-link layer dormant, and turned it into a bridge. It's attached directly to my FreeBSD router. I've noticed that changing what it's attached to (for instance, to a spare router instead of my FreeBSD router) causes a long connectivity interruption. The only quick way to re-establish connectivity's to power-cycle the Comcast device; it acts as if the new MAC address fails to replace the old one in the Comcast device's ARP table. This experience has left me wondering whether some Comcast devices may adapt poorly to networking changes.

arader · Dec 1, 2015

First a thousand apologies for my disappearing act, we've been renovating our home so this project had to be put on pause.

Executive summary: The issue lies somewhere in my network, not on the new host, and not in FreeBSD.

Details:
I implemented the final phase of my troubleshooting, which was to start with just the SuperMicro, switch, and laptop. This worked fine, and so I slowly added devices back. It wasn't until I added my TP-LINK access point that I started to see stability issues. It was as simple as "turn off AP, SSH works. turn on AP, SSH dies". I reproed this 10 times.

Unfortunately, wireshark wasn't helpful, as I'm pretty sure the issue lies between the switch and the AP. Maybe if I had a dumb hub I could use wireshark, but the switch was too smart.

The switch is a Netgear ProSafe 8 port Gigabit Switch, and hasn't given me problems before, but it is really old. I've also maxed it out, and have been wanting to play with VLANs for some time, so come Christmas I might get myself an upgrade.

Also, I've just bought my own cable modem to replace Comcast's all-in-one deally, so hopefully that will improve matters as well (if not with this issue, with others I'm sure).

thanks everyone!