LinuxQuestions.org
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Old 05-28-2010, 02:09 PM   #1
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Rep: Reputation: 39
Debugging Linux


Hi,
Although I have used Linux for at least 6 years, I still often find it difficult to debug when something goes wrong.
I usually search Google and post on forums, but I would like to know what people generally do when they hit a problem using something on Linux.
Which logs do you check? Do you look at the init scripts, or somehow work back to the root of the problem? What is the approach?
 
Old 05-28-2010, 02:29 PM   #2
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
I'm not quite sure what your question is.......

When trouble-shooting a problem, you would look at the relevant log (in /var/log). But, more to the point----tell us what the specific problem is, and you will get ideas on how to trouble-shoot---INCLUDING which log to check.
 
Old 05-28-2010, 09:48 PM   #3
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Actually I recently had a few problems
just giving examples

1) I used a streaming server called Red5 on Debian Lenny. I Googled and followed many blogs, but in the end could not make it work.
A friend of mine came over and added set -vx to one of the Red5 scripts (I think it was the installer); from the output he could tell me exactly what the problem was: a conflicting library in /usr/sbin.

2) Later I wiped Debian and installed CentOS. I used software called eduCommons; I posted a question about its website loading slowly: http://www.linuxquestions.org/questi...o-load-810372/
The cause turned out to be something called a stale PID. I was more than lucky to have a friend who helped me, but the original problem is still there.

3) I used Xen on Debian Lenny and had so many problems that I finally switched to KVM.



I had a few more similar problems, some solved, some not.
My question is not about one specific problem.
The stale PID in example 2 is something I would never have thought of; I was more than lucky that someone helped me.
So I just want to know what general steps someone can take to avoid confusion.
I am not a Linux guru, though I do know some shell scripting. Is there some tool that can debug Linux, or bring it back to a previous state, or perhaps tell which files or processes changed between two points in time? I am just giving an idea and do not know how to put it in words. In short, how can we manage the debugging ourselves?

Perhaps there could be a tool that, based on the errors, pops up a suggestion such as "this is the latest bug" or "this is a problem you might be facing".

Last edited by tkmsr; 05-28-2010 at 09:51 PM.
 
Old 05-29-2010, 01:12 AM   #4
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
I understand you want to know in general how to investigate and resolve problems in Linux. That's a big question but a good one. Mostly people learn how to do it by experience, ideally as a junior member of a support team so they can learn from adepts.

Here are a few ideas
  1. If something used to work and now doesn't, the chances are the problem was triggered by a recent change. For this reason many sysadmins keep a log of all changes and, if applying automated updates, log which packages were changed during the update.
  2. Similarly, whenever changing a configuration file or script, keep a backup of the earlier version(s) so you can go back. Many sysadmins start a change plan by developing a backout -- the procedure to get back to where you were before making the change. Symbolic links are useful because, for example, they can point to the new or old versions of a file and backing out is simply a matter of changing the symlink to point to the previous version.
  3. Note the exact error messages, perhaps appearing on screen, perhaps in a log file. As pixellany wrote, many of these are under /var/log/. Typically /var/log/syslog has top level errors and /var/log/messages has more detail but they may be elsewhere depending on the software, sometimes not under /var/log/. Sometimes relevant messages may be in more than one log so the timestamps are useful to correlate them. Sometimes problems are triggered by other events that are not themselves errors so scan log messages immediately before the error message.
  4. Having got the exact error message, netsearch for it. You will, of course, have to remove specifics so, for example, if investigating "kernel: scsi 6:0:0:0: [sdc] Unhandled error code" then drop the 6:0:0:0: and the sdc and netsearch for kernel, scsi, and "Unhandled error code". On an up-to-date system you can start by filtering the netsearch to show only results up to a year old.
  5. If that doesn't find a solution it may be possible to make the problem software more verbose using perhaps a -v or --verbose option. If it's a daemon it may be possible to make it run at the command prompt so you can see all its output (some of which may not be logged).
  6. If the problem software is a shell script you can add debug commands to make it output/log its progress and show key values and commands (make a backup of the original first!).
  7. If the problem software is a binary you can run it with strace to display the system calls it makes.
  8. Sometimes it helps to simplify the erroring system, taking it to the minimum that will work and then progressively re-introducing the original features until it breaks -- then you will know which feature is causing the problem. For example, in a networking system, temporarily disable the firewall; if that fixes it you need to work on firewall configuration.
  9. If you still can't solve the problem then ask on Linux Questions! This will be most effective if you include key data from the problem investigation -- OS and software versions, error messages, links to similar problems found, what you tried and the results.
That's a start; I'm sure I've forgotten some of the toolkit. It might be useful to evolve this into a WIKI entry.
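
As an illustration of idea 2 (backing out via a symlink), here is a minimal sketch. The directory layout, the file names app.conf.v1/app.conf.v2 and the helper names switch_version and back_out are all made up for the example; a real service would have its own paths and would probably need a reload after the link changes.

```shell
#!/bin/sh
# Hypothetical layout: versioned copies of a config file sit side by side
# in one directory and "current" is a symlink to whichever one is live.

# switch_version DIR FILE: make FILE the live version, remembering the old one
switch_version() {
    dir=$1
    target=$2
    # Record what was live so the change can be backed out later.
    if [ -L "$dir/current" ]; then
        readlink "$dir/current" > "$dir/previous.txt"
    fi
    ln -sf "$target" "$dir/current"
}

# back_out DIR: point the link back at the previously recorded version
back_out() {
    dir=$1
    [ -f "$dir/previous.txt" ] || return 1
    ln -sf "$(cat "$dir/previous.txt")" "$dir/current"
}

# Demonstration in a scratch directory
demo=$(mktemp -d)
echo "old settings" > "$demo/app.conf.v1"
echo "new settings" > "$demo/app.conf.v2"
switch_version "$demo" app.conf.v1
switch_version "$demo" app.conf.v2   # the change
back_out "$demo"                     # the backout
cat "$demo/current"                  # prints: old settings
```

The point of the design is that the backout is decided before the change is made: the old version is never overwritten, so going back is one command rather than a reconstruction from memory.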
 
Old 05-29-2010, 01:54 AM   #5
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by catkin View Post
I understand you want to know in general how to investigate and resolve problems in Linux. That's a big question but a good one. Mostly people learn how to do it by experience,
Yes, you understood my question correctly. If the many people who have been able to solve problems here on LQ could share their experience, that might help those of us who are less experienced at troubleshooting. We may be experienced Linux users, somewhere between average and expert level.

But I still sometimes find it very difficult and feel helpless when a problem occurs.
Take the stale PID error I reported in a thread: even after using Linux for 5 years I never knew such a thing could happen. So I would like some place we can refer to, such as a glossary, a list of commonly occurring problems, or a guide on what to look at.
I understand that log files are the place to start, but the stale PID, which left me frustrated, is what brought this question to mind.
If a person like me cannot troubleshoot, then ordinary users will definitely not be able to.
For example, in 2005 I had a LAN card that was never detected by the Linux versions of the time (please do not start a debate about hardware support; it is just an example).
You compile various kernels and try ndiswrapper, madwifi, etc. On some blog people report it working, but you, sitting in some corner of the world, never understand what went wrong and always wish someone could have given you a clue.
What I have in mind is a thread or tutorial that tells people step by step: this is how strace is used; this is what a stale PID is, and you may be hitting such an error.
Or one that says this type of issue is a known bug in the lenny, squeeze and sid kernels, so it is nothing to worry about; or that you have a dependency missing;
that you are not the only one and developers are working on it.
Or, what book should I read so that I can at least understand all this? I am willing to do a PhD on Linux if that would help, but many would not be.
The documentation is there, but I could not understand it even after following it, and perhaps I should blame my low IQ for that. We need something that says in simple terms what to look for, since we may not be experienced enough to troubleshoot.

Maybe I could meet a Linux consultant to resolve my doubts.

Last edited by tkmsr; 05-29-2010 at 01:58 AM.
 
Old 05-29-2010, 02:58 AM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by tkmsr View Post
See many people who have been able to solve problems here on LQ if they can share their experience that may help us who are not that much experienced to trouble shoot.
That's exactly what does happen (except in the rare cases when the OP reports they have found a solution but does not share it).

A lot of problem investigation and solving technique can be learned by following threads on LQ. Time and time again you will see the same techniques being used: firstly gathering information about the problem -- installed hardware, software versions, error messages etc. -- followed by some diagnostic commands -- for networking ifconfig, route -n, ping <gateway>, ping <host-on-Internet-by-IP-address>, ping <host-on-Internet-by-name> ... . To do this effectively you do need to understand the "architecture" of the problem system, that is how the "building blocks" fit together; in the case of networking that is Ethernet/hardware/MAC-addressing, IP addresses (including netmasks and public/private), gateways (including NAT), routing ...
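
The ping sequence above works outwards one building block at a time, so the first failure tells you which layer to look at. A rough sketch of that idea follows; run_check and check_network are names invented here, the addresses in the usage comment are placeholders, and the -c and -W options are those of the Linux iputils ping (other pings may differ).

```shell
#!/bin/sh
# run_check LABEL COMMAND...: run one diagnostic quietly and report the result
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ok:   $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# check_network GATEWAY IP NAME: test each layer in turn, stop at the first failure
check_network() {
    run_check "gateway $1 answers (local link and addressing)" ping -c 1 -W 2 "$1" &&
    run_check "Internet host $2 answers by IP (routing/NAT)" ping -c 1 -W 2 "$2" &&
    run_check "host $3 answers by name (DNS)" ping -c 1 -W 2 "$3"
}

# Usage, with placeholder values to be replaced by your own:
#   check_network 192.168.1.1 <host-on-Internet-by-IP-address> <host-on-Internet-by-name>
```

If the first check fails there is no point looking at DNS; if only the last one fails, the link, addressing and routing are probably fine and name resolution is the place to look.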
Quote:
Originally Posted by tkmsr View Post
... even after using Linux for 5 years I never knew any such thing can also occur ...
There's always something new!
Quote:
Originally Posted by tkmsr View Post
... so some place where we can refer to as glossary or commonly occurring problems or what to look around.
The Internet is great for that. Linux and all the software you can run on it is a huge and changing field, impossible to document all in one place and keep up to date. That's why the Internet is so good, it is a distributed reference and constantly updated. When you find a page that works for you, bookmark it. That way you build up a personal encyclopaedia of places to look for reference and orientation -- for understanding the "architecture" of the problem system and some of them will have step-by-step troubleshooting procedures -- but often specific problems require netsearching for the latest. I must have ~100 Linux-related bookmarks collected over the last 2 years, many first seen in posts on LQ.
Quote:
Originally Posted by tkmsr View Post
... Or what book should I read that I at least can understand all this I am willing to go through a PhD on Linux if that can help me but many would not be willing to do so.
I understand your frustration but do not know of any "silver bullet" that will work for you. As already said, Linux+ is a huge and rapidly changing field that nobody knows all of; learning it is a major task -- perhaps equivalent to taking a PhD. There are some Linux Systems Administration publications that may help by giving an overview you can orientate yourself with, a framework that you can add detailed information to. But, while they are big books (~1000 pages), they really only scratch the surface and are soon outdated. The Linux System Administration Handbook, 2nd Edition, ISBN 0-13-148004-9 is pretty good but AFAIK the last revision is 2007, a long time ago in terms of how fast Linux is developing; it has nothing about udev rules for example. It doesn't have a glossary but it does have a good index.

The Linux Documentation Project has a lot of, er, documentation including many glossaries (one would be nice!).

The man -k <keyword> command is useful.
 
Old 05-29-2010, 03:32 AM   #7
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by catkin View Post
That's exactly what does happen (except in the rare cases when the OP reports they have found a solution but does not share it).
No offence to anyone, but LQ does not always work.

Quote:
Originally Posted by catkin View Post
The man -k <keyword> command is useful.

Assuming I read all of the above, would I then know what a stale PID file is? The man pages do not mention it, and when you have the problem it would never occur to you, even in a dream, that there is a stale PID. The worst part is that it is nowhere in your logs. No error message at all.
That is just an example.
 
Old 05-29-2010, 03:35 AM   #8
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by tkmsr View Post
No offence to anyone, but LQ does not always work.

Assuming I read all of the above, would I then know what a stale PID file is? The man pages do not mention it, and when you have the problem it would never occur to you, even in a dream, that there is a stale PID. The worst part is that it is nowhere in your logs. No error message at all.
That is just an example.
Then try here.
 
Old 05-29-2010, 05:44 AM   #9
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Ha ha ha
You are not getting the point. It is not that you gave me a link, I read it, and now the problem is solved. The stale PID was just one example.
I understand it very well now, but only because a friend of mine helped me.
There may be more such things that you do not know about. If you have to read 3000 threads to find out, that does not make sense.
Imagine a car owner who goes to a mechanic because his car is not working. The mechanic is quite experienced, but even he finds it difficult to understand what happened. He contacts someone at the company, 10-20 engineers look at the problem, and they finally find the fault. The mechanic had no clue what had happened, even though he has repaired a lot of cars.
So what was it that this mechanic lacked? Assume he did not have the funds to build his own car, but he knows how an engine works.
Someone between average and expert who has to develop an application, say a mail server, will come across a lot of issues that you normally do not. If someone later faces a similar problem, he can easily get it solved.
So I feel you need development experience to be able to debug anything.

But as a home user I do not have that much experience. So I would like a tool that can suggest what to look at and which errors might occur.
What I am looking for is documentation of errors that have solutions and of how people reached them,
not of how Linux works.

http://tldp.org/FAQ/faqs/BLFAQ
mentions a program, xv, which is no longer shipped.
The documentation is that old.

If the error message gave me a clue as to where to look, that would be quite helpful.
Not a message like

"deploy :33 threw an exception garbage type cola."

which leaves you googling the message all over the world.

Last edited by tkmsr; 05-29-2010 at 05:57 AM.
 
Old 05-29-2010, 08:18 AM   #10
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
You are asking for the impractical. A single error message can have many possible underlying causes. Generally speaking, the programmer has given the most accurate and useful error message practicable; the program identifies the error in terms that are meaningful in the program space, that is, from the program's perspective.

IBM used to publish the sort of documentation you requested for their mainframes (MVS and VM operating systems). They were huge, printed in multi-volume sets. For each error they listed the likely cause and how to further analyse the problem. At the end of each section they advised consulting the Systems Programmer (the IBM mainframe equivalent of a UNIX/Linux systems administrator), that is someone who understands the systems architecture and hence the meaning of the error message, the likely triggering causes, how to form hypotheses about what might be going wrong and how to test those hypotheses. In my experience it was rare for this extensive (and expensive) documentation to solve the problem; a Systems Programmer's expertise was required.

Microsoft introduced similar but interactive problem solving wizards (95? 2000? Dunno but mature by XP) and I similarly found they very rarely identified the problem; a netsearch for the error symptoms/messages was much more likely to find the cause and fix.

In the case of your thread about one of your websites taking a very long time to load, running /etc/init.d/educommons start at a command prompt is a basic problem investigation technique. Checking the boot messages for errors is another. Having got the error message "Client1 already running: 2153", it was an obvious step to inspect the script to determine why that message was given, and so to identify an old PID file.

Given that the problem investigation and resolution was so easy, does it make sense for the writers of the script to do the work of documenting it in the way you are asking for? Bear in mind that the same message could mean the daemon is already running. It would be less effort for them to enhance the script to check the PID from the PID file and, if that process does not exist or is not the eduCommons daemon, to delete the PID file and continue. There are limited resources available for software development and documentation; sometimes (maybe too often) this leads to only the essentials being done.
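
To give an idea of how small that enhancement could be, here is a sketch of such a check. It is not the eduCommons script; pid_file_state and launch_checked are names made up here, and the sketch only tests whether a process with that PID exists, not whether it really is the expected daemon.

```shell
#!/bin/sh
# pid_file_state PIDFILE: print "absent", "running" or "stale"
pid_file_state() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        echo absent
        return
    fi
    pid=$(cat "$pidfile")
    # An empty or non-numeric PID file cannot refer to a live process.
    case $pid in
        ''|*[!0-9]*) echo stale; return ;;
    esac
    # kill -0 sends no signal; it only tests whether the process can be
    # signalled. Run as root (as init scripts normally are) that means
    # "exists"; as another user it also fails for processes you do not own.
    if kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}

# launch_checked PIDFILE NAME START_COMMAND
launch_checked() {
    case $(pid_file_state "$1") in
        running)
            echo "$2 appears to be running: PID file $1 holds live PID $(cat "$1")"
            return 0 ;;
        stale)
            echo "PID file $1 is stale (no such process); removing it"
            rm -f "$1" ;;
    esac
    "$3" start
}
```

A stricter version would also compare the command name of the live process with the daemon's name before deciding it is "running", which is the second half of the check described above.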

You are asking many other people to do a lot of work to compensate for your relatively low level of systems administration expertise; it is not going to happen when most systems administrators don't need it.
 
Old 05-29-2010, 08:30 AM   #11
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by catkin View Post
You are asking for the impractical.
Why is this impractical? There can be a meaningful error message.

Why are you fixated on the stale PID? That was just an example; there are many unsolved threads I have posted.
Read this: one fine day I lost all my websites with no clue as to what happened.

Another
read this thread
Also this one, a streaming server problem, which I did solve, but only by wiping Debian from my server.

An old problem I faced 3 years back was with a LAN card from the local market that had a fake chipset. It worked well with Windows, but on Linux I had to compile kernels following instructions from the net; sometimes it worked and sometimes it did not.
Later on I also posted a solution here.

One fine day my Xen virtual machines stopped working. After googling for at least a week and reading every thread I came across, I found that the error message had nothing to do with the actual error.
Read it here; I mentioned the solution as well.

The solution, found on this forum, was to remove two XML files, reboot, and then put the files back; that completely solved my problem.
If you read that link, the person said he could not relate the problem to the solution. That is strange enough to scare a user like me. The friend who helped me with the stale PID could not solve this problem either.


After this experience I had to ask: how do I debug?

Can there be a meaningful error message that a common man can understand?

I am a very ordinary user and I do not understand such complex details.
For example, on Windows, if your network does not work a pop-up walks you through a set of steps or suggests something similar. Can this not happen on Linux?

Suppose someone in a team of network admins makes a change to a server, and at a later date they find something else not working. No one knows how to revert to the earlier configuration, or it is tedious to do so.
Windows has a very nice System Restore utility; I do not know whether Linux has something similar.

Why should a person like me have to be a network admin to use Linux? Can normal human beings not use Linux?

Last edited by tkmsr; 05-29-2010 at 09:06 AM.
 
Old 05-29-2010, 08:48 AM   #12
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by tkmsr View Post
Why is this impractical there can be a meaningful error message.
Because the program describes the error from its perspective and you want it from yours. In the case of eduCommons it is a poor message because it is not the truth. It says "already running" but the detected error condition is that the PID file exists. The error message would more truthfully and usefully be "PID file <PID file name> detected; the daemon may be running or the PID file could be stale".

Linux was born out of UNIX; UNIX was born out of MULTICS because MULTICS was too complex and so died. The genius of UNIX's creators was in making it as simple as possible and one of the simplifications -- which has become part of the Linux culture -- is to assume competence on the part of the sysadmin. Distros such as Ubuntu are reversing this approach and are written for the naive user. This requires a huge amount of code. The more code, the greater the probability of bugs and the greater the development work -- the same reasons that MULTICS died, but we are 40 years on and it is more do-able now. For the competent sysadmin such distros are a nightmare; it is very difficult to identify the root cause of errors in the complexity.

Quote:
Originally Posted by tkmsr View Post
I am not asking people to do any thing for me I am just asking does that exist on this planet some where?
No.

Last edited by catkin; 05-29-2010 at 08:50 AM.
 
Old 05-29-2010, 09:19 AM   #13
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by catkin View Post
Because the program describes the error from its perspective and you want it from yours.
Exactly. I may not be the only one facing that situation.

Quote:
Originally Posted by catkin View Post
In the case of eduCommons it is a poor message because it is not the truth. It says "already running" but the detected error condition is that the PID file exists. The error message would more truthfully and usefully be "PID file <PID file name> detected; the daemon may be running or the PID file could be stale".
Yes, this is the relevant portion of the script:
Code:
launch () {
        RETVAL=0
        if [ -f $1 ]; then
                PID=`cat $1`
                echo $2 already running: $PID;
        else
                $3 "start"
                RETVAL=$?
                if [ $RETVAL -ne 0 ]; then
                        echo Error launching $2.
                fi
        fi
        return $RETVAL
}
Here, instead of
echo $2 already running: $PID;
whoever developed it could have given a more meaningful message, one that makes sense to a human being and not only to a developer.

eduCommons was just an example; forget it for the moment. Read my other posts: there could have been useful, meaningful error messages there. A message could say which file it comes from and point to whatever may have caused the problem.
The message written to stdout could carry something meaningful,
something that makes sense to people other than those who developed it.

Quote:
Originally Posted by catkin View Post
it is very difficult to identify the root cause of errors in the complexity.
That is what I am saying too. The most surprising was the Xen error I posted above. From the message you could never understand the problem, and the log file had hundreds of messages; which one should you search for?
And the solution was not at all related to the problem.
So if a meaningful error message appeared, one that a human being other than the developer could understand, that would be a good feature in Linux, and I feel it is not difficult to do.
While writing a script a developer can be told that these are the possible points of failure and that these exceptions should produce messages, in a format that can easily be understood.
That would be good not only for ordinary users but also for developers debugging the code at a later stage.

But to do so one needs to know the set of problems that may arise.
So where is that documented? You can gain this experience by reading the threads here at LQ. That experience is not written down anywhere; it could be documented as conditions that each produce a meaningful message.

Read this link
http://smarden.org/runit/
and this one
also this
These people have taken care of a small portion of such scripts. But to do that they had to know what can go wrong in start-up scripts and why programs fail.
I am not criticizing anyone, but where can I find this as a user?

Last edited by tkmsr; 05-29-2010 at 09:32 AM.
 
Old 05-29-2010, 10:58 AM   #14
Sergei Steshenko
Senior Member
 
Registered: May 2005
Posts: 4,481

Rep: Reputation: 454
Quote:
Originally Posted by tkmsr View Post
...
Here who so ever developed could have given a message that
echo $2 already running: $PID;
or some more meaningful thing which made sense to a human being not a developer.
...
Sure. OTOH, I remember countless cases with Windows programs when they just complained they couldn't open a file not saying which file.

File bug reports ruthlessly.

Last edited by Sergei Steshenko; 05-29-2010 at 12:17 PM.
 
Old 05-29-2010, 11:48 AM   #15
tkmsr
Member
 
Registered: Oct 2006
Distribution: Ubuntu,Open Suse,Debian,Mac OS X
Posts: 798

Original Poster
Rep: Reputation: 39
Quote:
Originally Posted by Sergei Steshenko View Post
Sure. OTOH, I remember countless cases with Windows programs when they just complained they couldn't open a file not saying which file.

File bug reports ruthlessly.
OK, I did not know about this. I have never used it :(
Got your point.
 
  

