I have a new install of Trustix on a basically new system, and it freezes randomly. At first I thought it might be a SCSI drive I just bought, but I swapped in another new drive and it did the same thing: it locks up out of seemingly nowhere.
It's a new board, chip and mem. Wouldn't be the first time new mem crapped out on me.
I did run a diagnostic tool on the SCSI drive the first time it happened and it found nothing, but I am not familiar with SCSI yet, so I tried the other drive to make sure.
If my memtest comes back ok, which I actually hope it doesn't, what do you guys recommend next in terms of testing?
Also, so you know, the system locked up on reboot (initialization) a couple of times, so it doesn't seem to be my install.
Any ideas or suggestions would be appreciated.
Thanks in Advance,
Well, memory comes back fine with memtest. So, I don't know what else it could be.
I read about CPU temperature being a possibility, but I think the temperature on the system is fine.
Could it have anything to do with me setting up my own partitions? /var and /home are on their own partitions (extended), with swap, / and /boot on their own (primary). Should I give it a shot with everything on /?
What about the hard drive? I did not zero each drive on these installs, but simply ran the regular OS install partitioning scheme. Would not doing something like Killdisk and not having the drive zeroed out lead to something like this?
What other tests can I do to figure this out?
Again, two different hard drives did the same thing, and they don't use the same cable or hw routes. And I would imagine the CPU is fine if it is running at normal temperature; otherwise the failure would be all out and not intermittent.
Anyone have any suggestions?
Ok, the cpu temp is fine, as I re-"gooped" the heatsink.
I went to untar a file and it locked up each time, both via ssh and locally. I think the first lockup was on gunzip and the second was on untar. This is what happened the first few times I noticed the problem. But on reboot, while forcing an integrity check, it also locked up a few times, which led me to believe the problem is not with tar or gzip. It also locked up during some other type of activity done remotely, which had nothing to do with tar or gzip.
I am surprised no one has given any suggestions on this one. Hell, comments would be appreciated even.
I have done the following:
1. Done memtest (all tests), which showed no errors on mem.
2. Put in another hard drive and Trustix install. Locked on me.
3. Reformatted the hd and reinstalled OS, same thing.
4. Updated software, and same thing.
5. Re-siliconed the heatsink, and temp is fine.
6. Took out one older NIC and replaced with a new one. (Two PCI and one integrated that doesn't work with Trustix.)
I am going to call the vendor who sold me the case, mobo, and mem tomorrow to see if they have any ideas. Hopefully that will yield something for me. Looks like hardware to me.
Could it be the CPU? Is there a place I might check for errors? I did check most of the logs already though, and didn't find anything I could identify, except for the following: /var/log/kernel/error
MP-BIOS bug: 8254 timer not connected to IO-APIC
This might be a hardware/software issue? I have no clue.
I will mess with the tar and gzip more tonight. It did untar and gunzip successfully on other occasions though. I will run it over and over again, and see what happens.
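Here is a rough sketch of what "run it over and over again" could look like as a script, in case it helps anyone reproduce a lockup like this. It uses Python's standard `tarfile` module rather than the tar and gunzip binaries themselves, and the function names, round count and payload size are just values I picked for illustration. It writes random data, packs it into a .tar.gz, unpacks it again and compares checksums, so a silent corruption shows up as well as a hard freeze.

```python
import hashlib
import os
import shutil
import tarfile
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def roundtrip_once(workdir, size_bytes):
    """Write random data, tar+gzip it, unpack it, and compare checksums."""
    source = os.path.join(workdir, "payload.bin")
    with open(source, "wb") as handle:
        handle.write(os.urandom(size_bytes))

    archive = os.path.join(workdir, "payload.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname="payload.bin")

    restored = os.path.join(workdir, "restored.bin")
    with tarfile.open(archive, "r:gz") as tar:
        member = tar.extractfile("payload.bin")
        with open(restored, "wb") as handle:
            shutil.copyfileobj(member, handle)

    return sha256_of(source) == sha256_of(restored)


def stress(rounds=20, size_bytes=4 * 1024 * 1024):
    """Run the round trip `rounds` times; return the number of mismatches."""
    failures = 0
    for index in range(rounds):
        with tempfile.TemporaryDirectory() as workdir:
            ok = roundtrip_once(workdir, size_bytes)
        # Flush so the last completed round is visible if the box freezes.
        print(f"round {index + 1}/{rounds}: {'ok' if ok else 'MISMATCH'}", flush=True)
        if not ok:
            failures += 1
    return failures
```

Calling `stress()` prints one line per round. If the machine freezes partway, the last line printed tells you roughly how far it got.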
Any help with this would be greatly appreciated. Really... :confused:
Ok, maybe there is something to this error that is creating the lockup.
Along with the boot error, I get the following lines, which end with a hint about a boot option that might help stop the crash. I just don't know how to add it.
Error from the above post (from /var/log/kernel/errors):
Dec 10 20:32:08 www kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC
And the matching lines from the info file:
Dec 10 20:32:08 www kernel: ..TIMER: vector=0x31 pin1=2 pin2=0
Dec 10 20:32:08 www kernel: ...trying to set up timer (IRQ0) through the 8259A ...
Dec 10 20:32:08 www kernel: testing the IO APIC.......................
Dec 10 20:32:09 www kernel: .................................... done.
Dec 10 20:32:09 www kernel: ACPI: Subsystem revision 20030813
Dec 10 20:32:09 www kernel: PCI: PCI BIOS revision 2.10 entry at 0xfb550, last bus=2
Dec 10 20:32:09 www kernel: PCI: Using configuration type 1
Dec 10 20:32:09 www kernel: ACPI: Interpreter enabled
Dec 10 20:32:09 www kernel: ACPI: Using IOAPIC for interrupt routing
Dec 10 20:32:09 www kernel: ACPI: System [ACPI] (supports S0 S1 S4 S5)
Dec 10 20:32:09 www kernel: ACPI: PCI Root Bridge [PCI0] (00:00)
Dec 10 20:32:09 www kernel: ACPI: Power Resource [ISAV] (on)
Dec 10 20:32:10 www kernel: PCI: Probing PCI hardware
Dec 10 20:32:12 www kernel: PCI: Using ACPI for IRQ routing
Dec 10 20:32:12 www kernel: PCI: if you experience problems, try using option 'pci=noacpi' or even 'acpi=off'
Am I on the correct path here, and how do I add this option to the boot? Through grub? If so, could someone let me know how this is done?
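For anyone digging through their own logs for the same thing, here is a small sketch that looks for the timer warning and pulls out whatever boot options the kernel itself suggests. The function names and the regular expression are my own. It only knows about the two message formats quoted above, so treat it as a starting point rather than a general log parser.

```python
import re

# Matches quoted boot options such as 'pci=noacpi' inside a kernel hint line.
OPTION_RE = re.compile(r"'([a-z_]+=[a-z0-9_]+)'")


def has_timer_warning(log_lines):
    """True if any line carries the 8254 timer / IO-APIC warning."""
    return any("8254 timer not connected to IO-APIC" in line for line in log_lines)


def suggested_boot_options(log_lines):
    """Return boot options the kernel suggests, in order of first appearance."""
    options = []
    for line in log_lines:
        if "try using option" not in line:
            continue
        for option in OPTION_RE.findall(line):
            if option not in options:
                options.append(option)
    return options


# The two relevant lines from the logs quoted in this thread.
SAMPLE_LOG = [
    "Dec 10 20:32:08 www kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC",
    "Dec 10 20:32:12 www kernel: PCI: if you experience problems, "
    "try using option 'pci=noacpi' or even 'acpi=off'",
]

print(has_timer_warning(SAMPLE_LOG))
print(suggested_boot_options(SAMPLE_LOG))
```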
That's kewl. Thanks anyway. I figured it out.
I don't know if this is fixed exactly. I don't get the error anymore, and the system is not locking with tar and gunzip done several times over. It is early still, but the errors are gone.
The pci=noacpi didn't work for me. The error was still there and the system locked on me.
I tried acpi=off and the error is gone. No locks so far.
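For anyone landing here later: with GRUB legacy, a boot option like this goes at the end of the `kernel` line of the boot entry, usually in a file such as /boot/grub/menu.lst. Below is a minimal sketch of that edit. The sample entry, the device and file paths, and the function name are all made up for illustration and are not copied from my box, so check your own menu file before changing anything.

```python
def add_kernel_option(menu_text, option):
    """Append `option` to every GRUB legacy `kernel` line that lacks it."""
    result = []
    for line in menu_text.splitlines():
        words = line.split()
        # Only touch real `kernel` lines; commented-out ones start with '#'.
        if words and words[0] == "kernel" and option not in words[1:]:
            line = line.rstrip() + " " + option
        result.append(line)
    return "\n".join(result) + "\n"


# Hypothetical menu.lst content; the device and paths are placeholders.
SAMPLE_MENU = """\
default 0
timeout 10

title Trustix
    root (hd0,0)
    kernel /vmlinuz ro root=/dev/sda3
    initrd /initrd.img
"""

print(add_kernel_option(SAMPLE_MENU, "acpi=off"))
```

After a reboot, `cat /proc/cmdline` shows the options the running kernel was actually booted with, which is a quick way to confirm the change took effect.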
We'll see, but it seems to have resolved my issue.
Hope this helps someone in the future.