Red Hat: This forum is for the discussion of Red Hat Linux.
I have two machines running Red Hat Linux, attached to EMC SAN storage and running an active-active Oracle cluster. Yesterday one of the machines went down
due to a power fluctuation. It is a multiprocessor PowerEdge 6800 that boots the Red Hat Enterprise Linux AS (2.4.21-37.ELsmp) kernel by default.
With that kernel the machine hangs at the messages below. It looks to me like an HBA driver loading problem. When I pick a different kernel, Red Hat Enterprise Linux AS-up (2.4.21-37.EL),
the machine boots and attaches to the storage. But I cannot run my Oracle cluster on it, because the kernel has to be Red Hat Enterprise Linux AS (2.4.21-37.ELsmp).
Please help me get through the boot process with the 2.4.21-37.ELsmp kernel. The error shown when booting
the ELsmp kernel is below.
Error message detail
--------------------
loading megaraid_sas.o module
/lib/megaraid_sas.o
Hint: insmod errors can be caused by incorrect module parameter, including invalid
I/O or IRQ parameter. You may find more information in syslog or the output from dmesg.
Hi Amir,
Welcome to the Linux world.
Can you tell us a few things about this problem so that we can give you a better solution?
Quote:
Originally Posted by amir_myself
Dear All,
I have two machines running Red Hat Linux, attached to EMC SAN storage and running an active-active Oracle cluster. Yesterday one of the machines went down
due to a power fluctuation. It is a multiprocessor PowerEdge 6800 that boots the Red Hat Enterprise Linux AS (2.4.21-37.ELsmp) kernel by default.
With that kernel the machine hangs at the messages below. It looks to me like an HBA driver loading problem.
Do both of your Linux machines have the same hardware, configuration, and setup?
You said one server went down, so I take it there is no issue with
the other Linux machine?
When does the machine hang: during boot, or after boot while some process is running?
Quote:
When I pick a different kernel, Red Hat Enterprise Linux AS-up (2.4.21-37.EL),
the machine boots and attaches to the storage. But I cannot run my Oracle cluster on it, because the kernel has to be Red Hat Enterprise Linux AS (2.4.21-37.ELsmp).
Please help me get through the boot process with the 2.4.21-37.ELsmp kernel.
Why did you use a different kernel? Is this the first time you have faced this issue?
I mean, was everything running fine with the same kernel before the problem appeared?
1) Regarding your first question: Do both of your Linux machines have the same hardware, configuration, and setup?
Answer: Yes, both servers have the same hardware, configuration, and setup.
2) You said one server went down, so I take it there is no issue with
the other Linux machine?
Answer: Yes, the other machine is working fine with no issues.
3) When does the machine hang: during boot, or after boot while some process is running?
Answer: The machine hangs during boot, before it starts any process. After the countdown on the kernel selection screen it tries to load the drivers, then prints the messages below and goes no further:
loading megaraid_sas.o module
/lib/megaraid_sas.o
Hint: insmod errors can be caused by incorrect module parameter, including invalid
I/O or IRQ parameter. You may find more information in syslog or the output from dmesg.
error: /bin/insmod exited abnormally loading lpfc.o module
Machine Exception 0000000000000000000004
Note: the message says more information is available from dmesg or syslog, but the machine does not boot, so how can I get that information? Even when I add the option 1 to the kernel line to boot into single-user mode, the same error appears. In any case, please find attached the dmesg file from the EL kernel. I could not locate the syslog file.
4) Why did you use a different kernel? Is this the first time you have faced this issue?
I mean, was everything running fine with the same kernel before the problem appeared?
Answer: I booted the other kernel (EL) only to check whether the problem is with the HBA or something else, and it booted. I don't want to stay on the EL kernel: this is a multiprocessor machine, that kernel is for single-processor machines, and my Oracle application service doesn't run on it. Everything was running fine with the ELsmp kernel previously.
If you need more information, please let me know. I really appreciate your help.
Yes, I checked your dmesg. It looks like an issue between the kernel and the driver modules. Some of the modules are not properly installed, which is why you are facing this issue.
I suggest you reinstall the OS. I hope it will start working again.
Let me know the status after you reinstall.
Some of the modules are not properly installed, which is why you are facing this issue.
I suggest you reinstall the OS.
Why reinstall the OS??? When you change oil in your car do you replace your windows, doors, and seats?
Amir, first of all try reinstalling the modules/drivers for your fiber, MegaRAID, and MegaSAS and see if that helps. The reason you could not use Oracle with the EL kernel is probably that some modules/drivers present for the ELsmp kernel are not compiled for the EL kernel. It's hard to see what the problem is without more info. Copy and paste your /var/log/messages and /var/log/failog. If the two machines are identical, you could just image one onto the other with dd.
In any case, what looks problematic to me in your dmesg log is this:
Code:
megasas: PCI hotplug regisration failed
Code:
SCSI device sdc: 555745280 512-byte hdwr sectors (284542 MB)
sdc:<6>Device 08:20 not ready.
I/O error: dev 08:20, sector 0
Device 08:20 not ready.
I/O error: dev 08:20, sector 0
unable to read partition table
SCSI device sdd: 307200 512-byte hdwr sectors (157 MB)
sdd: sdd1
SCSI device sde: 204800 512-byte hdwr sectors (105 MB)
sde:<6>Device 08:40 not ready.
I/O error: dev 08:40, sector 0
Device 08:40 not ready.
I/O error: dev 08:40, sector 0
unable to read partition table
...
Double-check all your configs, and especially the logs for megasas/megaraid, for more details.
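As a rough aid for reading a saved dmesg dump like the one above, the sketch below lists each device number reported as "not ready" and how often it appears. The helper name `not_ready_devices` is made up for illustration, and it assumes the 2.4-style `Device MM:mm not ready.` wording shown in the excerpt.

```shell
# Sketch: count "Device MM:mm not ready." lines in a saved dmesg dump.
# not_ready_devices is a hypothetical helper name; pass the dump file as $1.
not_ready_devices() {
    # Lines may carry a prefix such as "sdc:<6>", so match anywhere in the line.
    sed -n 's/.*Device \([0-9a-fA-F]*:[0-9a-fA-F]*\) not ready\..*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running it as `not_ready_devices dmesg.txt` (the file name is only an example) prints one line per device number followed by its count, which makes it easier to see whether the errors are confined to particular LUNs.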
Well exkor5000, reinstallation is the best method when you can't do much more, and especially when
you are not sure what the issue is. I am not saying it is the only method, but you can't deny it is one of the most effective in extreme scenarios where our logic and minds get stuck.
In the current scenario the OP is not able to boot the machine at all, so I don't think you
will get the logs. Moreover, in my experience this is the quickest way to resolve the issue in this scenario.
Sorry, I actually posted the dmesg file from the EL kernel but said ELsmp by mistake. ELsmp fails right at boot with the error I mentioned in my first post.
Regarding the HBA driver installation: it works fine with the EL kernel, because I can see the storage. But because of the Oracle configuration the machine has to run the ELsmp kernel. This is not a normal Oracle installation but a clustered one, and it also recorded the kernel at installation time.
Please guide me on how I can reinstall the lpfc.o module for the ELsmp kernel while booted into the EL kernel.
This is a production system, so I cannot reinstall the OS. That is the last option, and I don't want to go for it without trying other methods.
To recompile the modules all you need to do is go to the kernel source directory (people usually put it in /usr/src/linux-xxxxx) and issue these commands:
Code:
make modules
make modules_install
This is assuming you compiled your kernel this way from source in the first place.
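Before running those two commands it may be worth confirming that the directory really is a configured kernel tree. The following is only a sketch under that assumption: `check_kernel_tree` is a hypothetical helper, and all it checks is that a top-level `Makefile` and a `.config` exist.

```shell
# Sketch: sanity-check a kernel source tree before running "make modules".
# check_kernel_tree is a hypothetical helper; $1 is the source directory.
check_kernel_tree() {
    dir=$1
    if [ ! -f "$dir/Makefile" ]; then
        echo "no Makefile in $dir: not a kernel source tree?" >&2
        return 1
    fi
    if [ ! -f "$dir/.config" ]; then
        echo "no .config in $dir: the tree has not been configured" >&2
        return 1
    fi
    echo "ok: $dir"
}
```

If the check passes for your source directory, `make modules` and `make modules_install` can then be run from inside it. The check says nothing about whether the configuration matches the ELsmp kernel, which still has to be verified by hand.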
There is a faster way you can try:
go to /lib/modules and make a copy of your ELsmp kernel modules:
Vap, the problem here is with the kernel modules; it is as clear as water.
It is better to try to isolate the problem. The worst case is reinstalling the kernel, not the OS. Plus, you learn absolutely nothing by just reinstalling every time a problem pops up.
I tried the second method you mentioned (quoted below), but with no success. I received the same error message:
Hint: insmod errors can be caused by incorrect module parameter, including invalid
I/O or IRQ parameter. You may find more information in syslog or the output from dmesg.
error: /bin/insmod exited abnormally loading lpfc.o module
(
There is a faster way you can try:
go to /lib/modules and make a copy of your ELsmp kernel modules:
Code:
cp -R ./<ELsmp_KERNELVERSION> ./<ELsmp_KERNELVERSION>.bak
Then copy the fiber, megaSAS, and megaRaid modules (all three) from /lib/modules/<EL_KERNELVERSION> to /lib/modules/<ELsmp_KERNELVERSION>.
)
Note: I even copied the fiber, megaSAS, and megaRaid modules from the ELsmp kernel on my working machine, and it still gave me the same error message.
As this is a production system, I did not try your first option.
I really appreciate your help. Please advise me what to do next.
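Since modules were copied between kernels and between machines, one way to confirm what actually ended up on disk is a byte-for-byte comparison against the working machine's tree. This is a minimal sketch, not a tested procedure for this system: `compare_modules` is a hypothetical helper, and it assumes a copy of the working machine's ELsmp module directory has been placed somewhere readable on the failing machine.

```shell
# Sketch: report driver modules that differ between two module trees.
# compare_modules is a hypothetical helper.
#   $1 = module tree copied from the working machine
#   $2 = module tree on the failing machine
#   remaining arguments = module file names, e.g. lpfc.o megaraid_sas.o
compare_modules() {
    good=$1; bad=$2; shift 2
    result=0
    for name in "$@"; do
        # Take the first file with that name found anywhere under each tree.
        a=$(find "$good" -name "$name" -type f | head -n 1)
        b=$(find "$bad" -name "$name" -type f | head -n 1)
        if [ -z "$a" ] || [ -z "$b" ]; then
            echo "MISSING $name"; result=1; continue
        fi
        if cmp -s "$a" "$b"; then
            echo "SAME $name"
        else
            echo "DIFFERENT $name"; result=1
        fi
    done
    return $result
}
```

A call would pass the two `<ELsmp_KERNELVERSION>` directories followed by `lpfc.o megaraid_sas.o`. If every module reports SAME and the error persists, the module files themselves are probably not what differs between the two machines.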
OK, then try to load the modules manually, one by one, and see what happens.
Boot the system with the EL kernel.
Turn off megaSAS, megaRaid, and fiber so they do not start at boot from the current rc scripts.
You can do it with either this command:
Code:
ntsysv
or
Code:
chkconfig <name> <on|off>
To list all runlevels:
Code:
chkconfig --list
Then boot the system with ELsmp.
When you log in, try to load the modules by hand using:
Code:
modprobe <name>
That way you are loading them without any parameters, so we can see whether the parameters are the problem. Errors and messages will also be logged, so if you get a kernel panic you can reboot and read the logs.
If that's not the problem, then most likely something is wrong with the IRQ table in your ELsmp kernel.
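The manual loading step above could be wrapped in a small loop so that each attempt and its result are printed in order. This is only a sketch: `load_one_by_one` and the `LOADER` variable are made-up names, and `LOADER` exists purely so the loop can be dry-run with a harmless command before using `modprobe` as root.

```shell
# Sketch: try each module in turn and print the outcome.
# load_one_by_one is a hypothetical helper. LOADER defaults to modprobe;
# set it to another command (for example "echo") for a dry run.
load_one_by_one() {
    loader=${LOADER:-modprobe}
    for mod in "$@"; do
        if "$loader" "$mod"; then
            echo "loaded: $mod"
        else
            # $? here still holds the exit status of the loader command.
            echo "FAILED: $mod (exit status $?)"
        fi
    done
}
```

For example, `load_one_by_one megaraid_sas lpfc` would try the two modules named in the boot messages in that order. The order matters if the machine hangs on one of them, because the last line printed shows which module was being loaded.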
Please find attached a file containing the output of chkconfig --list on the server that does not boot with ELsmp. The command was run under the EL kernel.
I am not sure which entries in the attached list I need to turn off.
Please guide me.