Thursday, August 30, 2007

Troubleshooting Tips

These tips are copied from some other sites. Basically I am trying to accumulate all the information needed for Linux support on this blog ...



Up to date?

If you are installing new hardware, or having trouble with anything new, do you have the right drivers for any hardware being added? Just because the OS HAS a driver doesn't mean that it is a GOOD driver- for example, SCO OSR5 had a driver for Intel Pro100 cards, but if you wanted something that WORKED, you had to go to SCO's FTP site and download a good driver.
On a similar note, do you have the current recommended patches and updates? Sometimes this is all it takes to fix your problem. It's certainly worth checking.

Evidence
The first rule of troubleshooting is to do no harm. Another way to put that is to say "don't trample all over the evidence". If you aren't careful and methodical in your approach, you may destroy clues that could help you narrow down the source of your problem. For example, Unix and Linux keep a "last accessed" date for files and directories. You can see that date by using "ls -u filename" or "ls -du directoryname". That date can be a critical clue, as we'll see in a moment, but it's very easy to lose it. Try this experiment:

cd /tmp
mkdir testdate
touch testdate/a
ls -lud testdate
sleep 60
ls -lud testdate
# still the same date
ls -l testdate
# nothing to see in there, but
ls -lud testdate
# now the access date has changed
The same change will take place to an ordinary file if you cat it, copy it or access it in any way (because it's the access date!).
(Confused? The "ls -lud" reads the /tmp directory to get info about "testdate". The "ls -l" looks into "testdate" so that's access.)
Needing or wanting to know when something was last accessed comes up all the time; just a few examples might give you some ideas:
• A directory was supposed to be on the backup but isn't. Did the backup program access that directory?
• A misbehaving program is supposed to use a certain file during its startup. Did it?
• What files does a program try to use? Knowing this can sometimes help you track down where a program is failing when it is too dumb to tell you itself.



Unix systems keep two other dates: modify, and inode change. The modified date is what you see when you use "ls -l" and is the date that the file has been changed (or created, if nobody has changed it since). The inode change date (ls -lc) reflects the time that permissions or ownership have changed (but note that anything that affects the modified time also affects this).
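You can watch the difference with a quick experiment along the lines of the one above (just a sketch; any scratch file in /tmp will do):

cd /tmp
touch testfile
ls -l testfile
# modify time
ls -lc testfile
# change time - the same, so far
sleep 60
chmod 600 testfile
ls -l testfile
# modify time has NOT changed
ls -lc testfile
# but the change time now reflects the chmod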
Some systems have a command line "stat" command that shows all three times (and more useful info) at once. Here's the output from "stat" on a Red Hat Linux system:

[tony@linux tony]$ stat .bashrc
File: ".bashrc"
Size: 124 Filetype: Regular File
Mode: (0644/-rw-r--r--) Uid: ( 500/ tony) Gid: ( 500/ tony)
Device: 3,6 Inode: 19999 Links: 1
Access: Fri Aug 18 07:07:17 2000(00000.00:00:13)
Modify: Mon Oct 25 03:51:14 1999(00298.03:16:16)
Change: Mon Oct 25 03:51:14 1999(00298.03:16:16)
Note that "Modify" and "Change" are the same. That's becaue "Change" will also reflect the time that a file was modified. However, if we used chmod or chown on this file, the "Change" date would reflect that and "Modify" would not.
You can get a similar listing on SCO, but you need to know the inode (ls -i), the filesystem the file is on, and then run fsdb on that file system. It's not usually worth the trouble, but it does also tell you where the file's disk blocks are to be found:

# cd /tmp
# l -i y
33184 -rw-r--r-- 1 root sys 143 Aug 11 14:01 y
(33184 is the inode number of the file "y")
# fsdb /dev/root
/dev/root(): HTFS File System
FSIZE = 1922407, ISIZE = 480608
33184i
i#: 33184 md: f---rw-r--r-- ln: 1 uid: 0 gid: 3 sz: 143
a0:337145 a1: 0 a2: 0 a3: 0 a4: 0 a5: 0 a6: 0
a7: 0 a8: 0 a9: 0 a10: 0 a11: 0 a12: 0
at: Tue Aug 15 11:06:35 2000
mt: Fri Aug 11 14:01:10 2000
ct: Fri Aug 11 14:01:10 2000
This tells me the three times, and also that the entire file is located in one block (block 337145). If it were larger, and the next block was not 337146, I'd also know that the file is fragmented (once you get above a9, the rules change; see BFIND.C for a brief introduction to that).
The inode time can be very illuminating: suppose a problem started yesterday, and you can see that "ls -c" shows an inode change then but no "ls -l" change: it might be that ownership or permissions have been changed and are causing your problem.
Of course, if you don't know what the ownership or permissions should be, this may not help a lot. Some systems have a database of file ownership and permissions and can report discrepancies and even return the files to their proper state. SCO Unix, for example, can use the "custom" command to verify software. Old Xenix systems had "fixperm". Linux systems using RPM can verify packages; other Linux package managers have similar capabilities. Add-on security packages like COPS and others can also be useful: although their concern is the security aspects of files changing, their watchfulness can be useful in troubleshooting contexts also.
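For example, on an RPM-based Linux system a quick verification pass might look like this (just a sketch; the package name is only an illustration, and the letter codes in the output are explained in the rpm man page):

rpm -V coreutils
# S = size differs, M = mode differs, 5 = checksum differs, U/G = owner/group changed
rpm -Va
# verifies every installed package - slow, but thorough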
If you need to know what files a process is using, the "lsof" command (standard with Linux, available from Skunkware for SCO) can help. Here's an example from a Linux system:

# lsof -p 1748

COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
httpd 1748 root cwd DIR 8,5 4096 2 /
httpd 1748 root rtd DIR 8,5 4096 2 /
httpd 1748 root txt REG 8,6 242860 361425 /usr/sbin/httpd
httpd 1748 root mem REG 8,5 494250 216911 /lib/ld-2.2.4.so
httpd 1748 root mem REG 8,6 10007 264042 /usr/lib/apache/mod_vhost_alias.so
httpd 1748 root mem REG 8,6 8169 263597 /usr/lib/apache/mod_env.so
httpd 1748 root mem REG 8,6 17794 263604 /usr/lib/apache/mod_log_config.so
httpd 1748 root mem REG 8,6 7562 263603 /usr/lib/apache/mod_log_agent.so
httpd 1748 root mem REG 8,6 8558 263605 /usr/lib/apache/mod_log_referer.so
httpd 1748 root mem REG 8,6 8142 263596 /usr/lib/apache/mod_dir.so
httpd 1748 root mem REG 8,6 370 117962 /usr/lib/locale/en_US/LC_IDENTIFICATION
httpd 1748 root mem REG 8,5 531205 110159 /lib/i686/libpthread-0.9.so
(many lines deleted)
This tool can also list open network connections. For example, if you need to know what processes are listening on port 137, you would use:

# lsof -i :137
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
nmbd 903 root 5u IPv4 1255 UDP *:netbios-ns
nmbd 903 root 7u IPv4 1260 UDP apl:netbios-ns
Another piece of evidence that can be very helpful is when someone has logged in. The "last" command will give you that information, including how long they were on the system. But "last" gets its data from /etc/wtmp, so you may want to get information about wtmp before running last (keep in mind that wtmp is affected by logins and logouts, so unless you were logged in BEFORE whatever problem you are looking for started, your login has changed wtmp).
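So grab wtmp's own dates before you run anything that reads it (a sketch; the file is /etc/wtmp on older systems, /var/log/wtmp on most Linux boxes):

ls -l /var/log/wtmp
ls -lu /var/log/wtmp
ls -lc /var/log/wtmp
# only now run last - last itself reads wtmp, and so changes its access time
last | head -20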
How long the system itself has been up is available from "w" or "uptime". Other evidence you may want to collect before beginning a trouble search includes:
• df -v
• du -s
• memsize
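A little script can snapshot that kind of baseline before you start poking around (just a sketch; "memsize" is SCO-specific and /home is only an example path, so adjust for your own system):

# save baseline evidence to a timestamped file before troubleshooting
LOG=/tmp/evidence.`date +%Y%m%d%H%M%S`
(
date
uptime
df -v
du -s /home
memsize || free
) > $LOG 2>&1
echo "baseline saved in $LOG"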
System logs may be very useful. However, before you look at them, get the "ls -l", "ls -c" and "ls -u" from the logs- this tells you when the log was last used, etc.- if there's no current information and there should be, those dates are important clues.
You need to know what to look for in the logs. SCO makes it easier by using words like "CONFIG", "NOTICE" and "WARNING" that you can grep for, but on other OSes you may have to just look manually until you can figure out what sort of key words they would use.
Another often overlooked clue is "what processes are using this file?". You get the answer to that with "fuser", which will return a list of the processes using a file ("lsof" is another very useful tool in this context).
Software problems can be tough. If you have "trace" (Linux "strace", Mac OS X "ktrace" and "kdump"), knowing what files a program tries to read and write can be very useful. A first attempt is to run the program like this:

trace badprog 2> /tmp/badprog.trace
Then examine /tmp/badprog.trace, particularly noting access, open, read and write calls. These will all have the general format like this:

_open ("/etc/default/lang", 0x0) = 4
read (4, "#\t@(#) lang.src 58.1 96/10/09 \n#\n#\tCopyr".., 1024) = 437
...
_open ("/usr/lib/lang/C/C/C/collate", 0x0) = 5
read (5, "".., 1) = 1
The return value of the call is your clue- if it's positive, it's usually OK. For example, the lines above mean that /etc/default/lang was opened correctly (we would have got a negative number instead of 4 if it failed) and then 437 bytes were read from it- see the "read(4.."? That 4 means it's reading /etc/default/lang. Later on it opens "/usr/lib/lang/C/C/C/collate" and the return value is "5", so the next read is from that file (because it's "read(5.."). As you can see, you really don't need to be a programmer or even understand much about this to be able to (possibly) find a problem here (do watch for "close", though- if you aren't paying attention you won't know what file is being read, because the numbers get reused).
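On Linux, strace can narrow the output down to file-related calls and follow child processes, which saves a lot of scrolling (a sketch; check the strace man page for the flags your version supports):

# trace only file-related calls, follow forks, log to a file
strace -f -e trace=file -o /tmp/badprog.trace badprog
# failed opens show up with a negative return and an errno name
grep ENOENT /tmp/badprog.trace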
Sometimes it's helpful to know WHAT files got modified by an application. To find that out, this can help:

touch /tmp/rightnow
runapplication
find / -newer /tmp/rightnow
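On a busy system that raw find will also report /proc, /tmp and other files you don't care about, so you may want to prune them (a sketch; very old versions of find may not have -path):

touch /tmp/rightnow
runapplication
find / -path /proc -prune -o -newer /tmp/rightnow -type f -print 2>/dev/null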


Keep it simple, stupid

The kiss ("keep it simple stupid") principle is always good to follow. Removing unnecessary hardware, making the network smaller, etc. are good examples. Dirk Hart posted this in response to a problem that involved one of those nasty sub-nets that make our brains hurt:


In this situation, I would use a netmask of 255.255.255.0 on the '486
(and on another pc to avoid confusion) and determine that the card is
indeed working on the 486. _Then_ I would worry about binary arithmetic.

'It don't mean a thing if it ain't got that ping'

Regards, Dirk


Testing Specific Systems

You may have a very good idea of what you need to concentrate on. If a printer isn't working, bad disk blocks aren't a likely suspect. However, the problem may be less obvious: data corruption in the middle of a file could be bad disk blocks, bad memory, a bad cpu, a bad disk controller, a bad DMA controller.. how do you narrow this stuff down? Well, honestly, sometimes it can be hellishly difficult, particularly if you are pressed for time. If you don't know where to start, and you aren't sure if the problem is hardware or software, checking the hardware is probably the best place to start. Unfortunately, some of the things you'll need to do for that require that the system be in single user mode.
Single user mode does NOT mean that just one person is logged in. Specifically, it means that the system has been brought to init state 1, usually by typing "init 1" while logged in as root at the console.


Things you can check while still multi-user

Log files. On some systems, "dmesg" gives you a lot of information, on others (SCO, for instance) it doesn't give you much and you have to look at the logs yourself. You'll find these logs in /var/log or /usr/adm (SCO). The "syslog" is often particularly useful. If printing is the problem, the print logs are in /usr/spool/lp/logs (SCO), or the "lf" entry in /etc/printcap points you to them.
Network statistics. If the network isn't working right, all kinds of other things may be in trouble. To be good at this, you really need to understand networking more than I want to try to cover here, but I will cover some basic points:
• If you can ping 127.0.0.1, your tcp/ip is working, but that says NOTHING about your NIC. Pinging your own ip address gets routed through the loopback as well, so it won't prove anything about your NIC either.
• Always try netstat and arp commands with the "-n" flags to avoid name resolution lookups- this keeps everything local and aids your troubleshooting.
You can see what network ports are in use with "netstat -an" (lsof is also useful for this). Let's just take a quick look:

Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp 0 0 64.13.44.12.1254 209.167.40.69.119 ESTABLISHED
tcp 0 0 64.13.44.12.1252 63.209.14.66.119 CLOSE_WAIT
tcp 0 0 64.13.44.12.1216 216.71.1.37.80 CLOSE_WAIT
tcp 0 0 10.1.36.3.1085 10.1.36.100.23 ESTABLISHED
tcp 0 0 *.80 *.* LISTEN
What can I tell from this? First, I'm reading news from 209.167.40.69. How do I know that? See the "119" that follows that address? If I "grep 119 /etc/services", I get:

nntp 119/tcp readnews untp # USENET News Transfer Protocol
I also was reading news from another site, but I'm not presently- that's why that line says "CLOSE_WAIT" rather than "ESTABLISHED". I had a connection to a web page (the "80"- grep it from /etc/services if you don't recognize it) and I have a telnet session (23) open to 10.1.36.100. This machine also has a web-server, and it's LISTENing on port 80. Of course, none of this means anything unless you know what's SUPPOSED to be going on. So, if the web server isn't working, and you "netstat -an | grep LISTEN" and don't see *.80 in the list, that's why it isn't working. Now WHY isn't it working? If you try to start it and it fails, its logs will likely tell you what the problem is. If not, then maybe trace or strace can give you a clue.
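A quick way to work through that web server example (a sketch; "apachectl" and the log path are assumptions - use whatever your particular server provides):

# is anything listening on port 80?  (Linux shows ":80", SCO/BSD shows ".80")
netstat -an | grep LISTEN | grep 80
# lsof can name the owning process as well
lsof -i :80
# if nothing is listening, try starting the server and read its own log
apachectl start
tail /var/log/httpd/error_log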
A system that is running, but slowly, can be hard to figure out. The slowness is usually coming from either the CPU or the hard drive. The "sar" program (available for Linux, too) will let you figure out which pretty quickly. First try "sar 5 5". That gives you information about your CPU. Then, "sar -d 5 5" will tell you about your hard drives. The only problem is that, unless it's flat obvious (CPU at 0% idle or hard drives showing 1,000 msec in the avserv column), you don't know if what you are seeing is normal or abnormal for this machine running its typical load. If sar has been set up and employed properly, you'll have historical data (in /var/adm/sa on SCO) to compare, but otherwise you have to just use your best judgement, and if you haven't had much experience on similar systems, you just won't know. See Sarcheck Review (May 1999) for more information.
If CPU is the problem, a simple "ps -e | sort -r +2 | head" can tell you a lot, particularly if you put it in a loop like this:
while true
do
ps -e | sort -r +2 | head
sleep 10
done
If that shows a process gaining a substantial amount of time during the 10 second sleep, that process is using a lot of time- if it gained 5 seconds, for example, it is using 50% of your cpu! You might also use "ps -ef" and look at the "C" column (just before the time)- that's total CPU usage.
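On a current Linux box, GNU ps can do the sorting for you, which avoids the old "+2" sort syntax (a sketch):

# top ten CPU users, refreshed every 10 seconds
while true
do
ps -eo pid,pcpu,time,comm --sort=-pcpu | head
sleep 10
done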
Skunkware has "iohog", "memhog", and "cpuhog". These can help you pinpoint misbehaving processes. Download them all from http://www.caldera.com/skunkware/osr5/vols/hog-1.1-VOLS.tar
If you are slow on a network connection, "netstat -in" and (on SCO) "llistat -l" can show you problems. Don't let DNS issues confuse you: if, for example, a telnet session is slow to connect but then works fine, that's DNS- either the connecting machine can't resolve the server's name or vice-versa. Remember that: the server is going to try to resolve the name of the client, and if it can't, there will be a delay- possibly a long delay.
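You can test that theory from the server itself by trying to reverse-resolve the client's address (a sketch; the IP here is just the example client from the netstat output above):

nslookup 10.1.36.3
# or
host 10.1.36.3
# a long pause or a "non-existent domain" answer here explains a slow telnet login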
Sometimes it helps to know just what equipment is in the machine. SCO has the "hw -v" command and "hwconfig", Windows has its Control Panel -> System -> Device Manager, and Linux has lsdev, lspci, pnpdump and, of course, dmesg.
See Why is my system slow for more on this.
If you are having troubles, particularly during an installation, take out any hardware you don't need right now. You can always put it in later.
If you can identify a good-sized file that is not being used by anyone (check with fuser or lsof), running repeated "sum -r"'s on that file should always produce the same result. If it does not, and you are certain no-one else is modifying the file, then the hard disk, memory, disk controller, dma controller or cpu could be suspect. Gee, doesn't that narrow it down? Well, maybe we can get a little better:
The first time you sum the file, the information should be read from the hard drive directly unless it is already in cache (to effectively flush the cache, you need to know how big it is so that you can sum, or just cat to /dev/null, some other, larger file that will overwrite any trace of this file). On SCO, the cache size can be found by

grep "i/o bufs" /usr/adm/messages
On linux "dmesg | grep cache" will get you what you need. If you pick your file size to be slightly less than cache, you've got a good shot at getting it all into memory (assuming no one else is using the machine just then).
The second time you sum, some portion or even all of the file will be read from memory, so if the sum's were constant after flushing cache, but not otherwise, memory would be suspect.
You might get a clue about DMA by testing other devices that use it- the floppy disk is a good candidate: if data can be reliably written and read from floppy (use sum -r to check it) but hard drive data is changing, then it's probably NOT memory or motherboard problems.
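Putting that together, the test looks something like this (a sketch; "quietfile" is the large unused file you identified, and /tmp/bigfile stands in for whatever larger file you use to flush the cache):

sum -r quietfile
# first pass - read from the disk
cat /tmp/bigfile > /dev/null
# push quietfile out of the buffer cache with something larger than the cache
sum -r quietfile
# read from the disk again - should match the first pass
sum -r quietfile
# run immediately, this one comes mostly or entirely from cache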



Things you need to be in single-user mode for

The first thing you will want to do is run fsck. Except for root, file systems should be unmounted when running this on them- you could umount /home while multiuser, of course, but since you are going to be checking all of them, you may as well be single-user. Be sure to check the man page; SCO, for example, doesn't do a full fsck on their modern versions unless you specify "-ofull" (see Filesystems for a more complete discussion of fsck).
What does fsck tell you? Well, if you have hard disk problems, fsck is going to trip over them, so that can be useful. Because fsck uses a fair amount of memory as it's working, memory problems will probably affect it -unfortunately, the effect may be unpleasant, so if you suspect that memory could be an issue, try creating a floppy file system first ( just "mkfs /dev/fd0" is sufficient ) and fsck it- that won't make you 100% sure that fsck isn't going to ruin your life for reasons beyond its control, but it gives you a little more confidence. I wouldn't fret over this- if you've had bad memory problems, things on your disk are likely not healthy already.
After fsck, you want to check the hard drive. SCO has "badtrk", Linux has "badblocks". Read the man pages carefully- you do NOT want to accidentally destroy your system, and these things can.
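For example, on Linux a read-only scan is the safe place to start (a sketch; the device name is only an illustration, and the read-write modes really can destroy data):

badblocks -sv /dev/hda2
# read-only scan: -s shows progress, -v is verbose
# -n (non-destructive read-write) and -w (destructive write test) also exist;
# read the man page before trying either one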


Rebooting

If your problems began after linking a new kernel, maybe you should boot with the previous version. On SCO, you automatically get "unix.old"; just type that at the Boot prompt. On Linux, you get the old kernel if you bothered to arrange for that in lilo.conf (or whatever your boot manager uses). Try hitting TAB at the boot prompt to see if you have other choices.
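The lilo.conf stanzas for keeping an old kernel around look something like this (a sketch; kernel and device names are illustrations, and remember to rerun /sbin/lilo after editing):

image=/boot/vmlinuz
    label=linux
    root=/dev/hda1
    read-only
image=/boot/vmlinuz.old
    label=old
    root=/dev/hda1
    read-only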


Logging

Maybe you still can't figure it out. Some file is getting trashed for unknown reasons and you can't get a handle on it. Perhaps you can set up some cron or background process to try to catch whatever is going on in the act. Something like this:

while true
do
sleep 300
sum -r suspect_file
fuser suspect_file
# or: lsof suspect_file
done > /tmp/mylog &
You can get more sophisticated; if I have to leave this for days or weeks I send the sum to temporary files, compare them and delete them if nothing has changed; or I'll collect the fuser output and run it through "ps" so I can see who or what was responsible.
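A slightly fancier version along those lines might look like this (still just a sketch; adjust the paths and the sleep to taste):

# log only when the file actually changes, and note who had it open at the time
OLD=`sum -r suspect_file`
while true
do
sleep 300
NEW=`sum -r suspect_file`
if [ "$NEW" != "$OLD" ]
then
date
echo "suspect_file changed: $OLD -> $NEW"
fuser suspect_file
ps -ef
# match the fuser PIDs against the ps output to see who did it
OLD="$NEW"
fi
done > /tmp/mylog 2>&1 &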


Seeing the invisible

Sometimes you need to see what's really going on without interpretation. For dumb terminals, you can often turn on a "monitor" mode in the terminal's setup that will then display the hex values of any control characters rather than acting on them. For a telnet session, you can escape to the telnet prompt (generally CTRL-] ) and type "set netdata", then escape again and "set prettydump".
Another way to capture everything is to use "script". For example, "script /tmp/mystuff". Then do whatever you are doing, and when you are done, CTRL-D to end script. The /tmp/mystuff contains everything that happened, including control characters; you can examine it with vi or hd.
