open.itworld.com
  Search  
Security Home Page Security Webcasts Security White Papers Security Newsletters Security News Open Topics Careers ITworld Voices ITwhirled The Security site of ITworld.com

Unix Tip: A SCSI mystery: Come and gone

ITworld.com 07/31/2007

Sandra Henry-Stocker, ITworld.com

Send in your Unix questions today! | See additional Unix tips and tricks

On a late Friday evening, just as I was about to leave for home, one of the servers I manage suddenly started having what appeared to be serious disk problems. Its NFS clients were collecting "NFS server not responding still trying" errors. The system itself was unable to process share and unshare commands. My initial fear was that the system had suffered a disk crash and would have to be rebuilt from spare parts and backups before the staff's arrival on Monday morning.

The system was still running, but having serious problems running ordinary commands and the console was filling up with messages like these:

WARNING: /pci@1c,600000/scsi@2,1 (glm1):
        Disconnected tagged cmd(s) (1) timeout for Target 0.0
SCSI command tagged queuing, as I came to understand, refers to queuing multiple commands to a SCSI device in order to improve performance. Why the system would have SCSI queuing problems just when work was slowing down and people were leaving for the day, on the other hand, seemed hard to fathom.

Another option worth considering was the possibility of a SCSI cabling problem. However, the server room was already closed and locked. The chance of any disruption or damage to the cabling at the time the errors started appearing seemed extremely unlikely.

When I attempted a reboot, the system began spewing out a large variety of errors, all seemingly related to the SCSI, but the problem had somehow switched from Target 0.0 to Target 0.1. This oddity in the problem's behavior brought the possibility of a problem with the SCSI controllers into view.

Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco         Cmd (0xb2b23298) dump for Target 0 Lun 1:
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco                 cdb=[ 0x2a 0x0 0x0 0x5d 0xb3 0xb3 0x0 0x0 0x1 0x0 ]
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco         pkt_flags=0xc000 pkt_statistics=0x61 pkt_state=0x7
Jul 27 17:42:34 zydeco scsi: [ID 365881 kern.info] /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco         pkt_scbp=0x0 cmd_flags=0x18e1
Jul 27 17:42:34 zydeco scsi: [ID 107833 kern.warning] WARNING: /pci@1c,600000/scsi@2,1 (glm1):
Jul 27 17:42:34 zydeco         Disconnected tagged cmd(s) (1) timeout for Target 0.1
Jul 27 17:42:34 zydeco genunix: [ID 408822 kern.info] NOTICE: glm1: fault detected in device; service still available
Jul 27 17:42:34 zydeco genunix: [ID 611667 kern.info] NOTICE: glm1: Disconnected tagged cmd(s) (1) timeout for Target 0.1
Jul 27 17:42:34 zydeco glm: [ID 401478 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6018]
A quick Google search revealed a few hundred reports of similar problems, but no easy explanations as to what might be causing the severe timeouts or why the problem would switch drives on a reboot.

The particular system has one internal disk and a fairly large (several hundred gigabytes) of disk in a external disk array. It's several years old and has never had a problem with its disks before.

Without any good guesses on the root cause of the problem, the best approach to addressing it seemed to be to cut the problem in half. If I were to detach the external array, I reasoned, it should be possible to determine whether the array was the source of the timeouts and other errors or that the internal disk and its SCSI controller were the source.

After powering down the disk array (with redundant power supplies, I had to switch it off twice) and then the server, I detached the SCSI cable at the server end of the cable. I then powered the server back on and watched the console. To my surprise, the system booted easily after some not unexpected repairs were needed on the root file system.

The / file system (/dev/rdsk/c1t0d0s0) is being checked.
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=2172 (2 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=2578 (2 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=2711 (6 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=3227 (80 should be 0) (CORRECTED)
/dev/rdsk/c1t0d0s0: INCORRECT BLOCK COUNT I=3242 (64 should be 48) (CORRECTED)
...
After the fsck was complete, the system then booted easily, complaining only that it wasn't able to open the file systems on the disk array with errors like these:

checking ufs filesystems
/dev/rdsk/c2t0d0s6: No such device or address
Can't open /dev/rdsk/c2t0d0s6
/dev/rdsk/c2t0d0s6: CAN'T CHECK FILE SYSTEM.
/dev/rdsk/c2t0d0s6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
/dev/rdsk/c2t0d1s6: No such device or address
Can't open /dev/rdsk/c2t0d1s6
/dev/rdsk/c2t0d1s6: CAN'T CHECK FILE SYSTEM.
/dev/rdsk/c2t0d1s6: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.

WARNING - Unable to repair one or more filesystems.
Run fsck manually (fsck filesystem...).
Exit the shell when done to continue the boot process.
After verifying that the system was stable through reboots and showed no signs of further file system problems, I commented out the file systems on the array in the /etc/vfstab file and shut the system down again. Then I reattached the disk array and repowered the array and then the system.

I mounted the file systems on the array one at a time, verified their integrity with fsck and saw no problems whatsoever. I then restored the /etc/vfstab file to its previous condition and rebooted the system once again. None of the errors that were plaguing the system only hours earlier resurfaced.

Seemingly severe SCSI problems can simply vanish with the right procedure of resetting and restarting your disk devices. Exactly how the sequence of timeouts began or why it was unable to recover without intervention is probably more than I will ever understand about SCSI. The knowledge that such problems can be spurious, on the other hand, and can be resolved without replacing hardware is something I am unlikely to forget.

On this topic

 

Sandra Henry-Stocker has been administering Unix systems for more than 18 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She currently works for TeleCommunication Systems, a wireless communications company, in Annapolis, Maryland, where no one else necessarily shares any of her opinions. She lives with her second family on a small farm on Maryland's Eastern Shore. Send comments and suggestions to bugfarm@gmail.com.




Sponsored Links

Sign up for a Microsoft Dynamics® CRM WEBCAST
Hear globally recognized leaders in customer strategy discuss the importance and evolution of CRM.
Sun Microsystems' - FREE 60 DAY TRIAL OFFER!
Test Sun's Newest Servers BEFORE YOU BUY. Plug Them In With Access To Full Technical Support.
100% Web Based Help Desk Software
Easy to use, customizable to meet your needs, powerful and scalable. Free online demo. Try it today!
Sign up for a Microsoft Dynamics® CRM WEBCAST
Hear globally recognized leaders in customer strategy discuss the importance and evolution of CRM.
Used and Refurbished HP ProCurve Switches
Lifetime Warranties, Professional Testing & Shipping on all HP Equipment Purchases!
» Buy a link now

Advertisements
Sponsored links
Locate Hidden Software on business PCs with this free tool
Bring harmony to your mix of UNIX-Linux-Windows computing environments
Top 5 Reasons to Combine App Performance and Security
KODAK i1400 Series Scanners stand up to the challenge
 Home   Open source  Operating systems  Unix
www.itworld.com    open.itworld.com     security.itworld.com     smallbusiness.itworld.com
storage.itworld.com     utilitycomputing.itworld.com     wireless.itworld.com

 
Contact Us   About Us   Privacy Policy    Terms of Service   Reprints  

CIO   Computerworld   CSO   GamePro   Games.net   IDG Connect   IDG World Expo   Infoworld   ITworld   JavaWorld   LinuxWorld  MacUser   Macworld   Network World   PC World   Playlist  

Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.