Send in your Unix questions today! |
See additional Unix tips and tricks
On a late Friday evening, just as I was about to leave for home, one of the servers I manage suddenly started having what appeared to be serious disk problems. Its NFS clients were collecting "NFS server not responding still trying" errors. The system itself was unable to process share and unshare commands. My initial fear was that the system had suffered a disk crash and would have to be rebuilt from spare parts and backups before the staff's arrival on Monday morning.
The system was still running, but having serious problems running ordinary commands and the console was filling up with messages like these:
WARNING: /pci@1c,600000/scsi@2,1 (glm1):
Disconnected tagged cmd(s) (1) timeout for Target 0.0
|
SCSI command tagged queuing, as I came to understand, refers to queuing multiple commands to a SCSI device in order to improve performance. Why the system would have SCSI queuing problems just when work was slowing down and people were leaving for the day, on the other hand, seemed hard to fathom.
Another option worth considering was the possibility of a SCSI cabling problem. However, the server room was already closed and locked. The chance of any disruption or damage to the cabling at the time the errors started appearing seemed extremely unlikely.
When I attempted a reboot, the system began spewing out a large variety of errors, all seemingly related to the SCSI, but the problem had somehow switched from Target 0.0 to Target 0.1. This oddity in the problem's behavior brought the possibility of a problem with the SCSI controllers into view.
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco Cmd (0xb2b23298) dump for Target 0 Lun 1:
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco cdb=[ 0x2a 0x0 0x0 0x5d 0xb3 0xb3 0x0 0x0 0x1 0x0 ]
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco pkt_flags=0xc000 pkt_statistics=0x61 pkt_state=0x7
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco pkt_scbp=0x0 cmd_flags=0x18e1
Jul 27 17:42:34 zydeco scsi: [ID 107833 kern.warning] WARNING: /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco Disconnected tagged cmd(s) (1) timeout for Target 0.1
Jul 27 17:42:34 zydeco genunix: [ID 408822 kern.info] NOTICE: glm1: fault detected in device; service still available
Jul 27 17:42:34 zydeco genunix: [ID 611667 kern.info] NOTICE: glm1: Disconnected tagged cmd(s) (1) timeout for Target 0.1
Jul 27 17:42:34 zydeco glm: [ID 401478 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6018]
|
A quick Google search revealed a few hundred reports of similar problems, but no easy explanations as to what might be causing the severe timeouts or why the problem would switch drives on a reboot.
The particular system has one internal disk and a fairly large (several hundred gigabytes) of disk in a external disk array. It's several years old and has never had a problem with its disks before.
Without any good guesses on the root cause of the problem, the best approach to addressing it seemed to be to cut the problem in half. If I were to detach the external array, I reasoned, it should be possible to determine whether the array was the source of the timeouts and other errors or that the internal disk and its SCSI controller were the source.
After powering down the disk array (with redundant power supplies, I had to switch it off twice) and then the server, I detached the SCSI cable at the server end of the cable. I then powered the server back on and watched the console. To my surprise, the system booted easily after some not unexpected repairs were needed on the root file system.
The / file system (/dev/rdsk/c1t0d0s0) is being checked.
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=2172 (2 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=2578 (2 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=2711 (6 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=3227 (80 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=3242 (64 should be 48) (CORRECTED)
...
|
After the fsck was complete, the system then booted easily, complaining only that it wasn't able to open the file systems on the disk array with errors like these:
checking ufs filesystems
/dev/rdsk/c2t0d0s6: No such device or address
Can't open /dev/rdsk/c2t0d0s6
/dev/rdsk/c2t0d0s6: CAN'T CHECK FILE SYSTEM.
/dev/rdsk/c2t0d0s6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
/dev/rdsk/c2t0d1s6: No such device or address
Can't open /dev/rdsk/c2t0d1s6
/dev/rdsk/c2t0d1s6: CAN'T CHECK FILE SYSTEM.
/dev/rdsk/c2t0d1s6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
WARNING - Unable to repair one or more filesystems.
Run fsck manually (fsck filesystem...).
Exit the shell when done to continue the boot process.
|
After verifying that the system was stable through reboots and showed no signs of further file system problems, I commented out the file systems on the array in the /etc/vfstab file and shut the system down again. Then I reattached the disk array and repowered the array and then the system.
I mounted the file systems on the array one at a time, verified their integrity with fsck and saw no problems whatsoever. I then restored the /etc/vfstab file to its previous condition and rebooted the system once again. None of the errors that were plaguing the system only hours earlier resurfaced.
Seemingly severe SCSI problems can simply vanish with the right procedure of resetting and restarting your disk devices. Exactly how the sequence of timeouts began or why it was unable to recover without intervention is probably more than I will ever understand about SCSI. The knowledge that such problems can be spurious, on the other hand, and can be resolved without replacing hardware is something I am unlikely to forget.