LinuxQuestions.org

Red Squirrel 05-26-2016 08:45 PM

Main file server died, took out entire network, do I have any chance of saving it?
 
For some reason my UPS did not kick in properly and I lost my file server during a routine power outage (I had to turn off the main breaker to check something). Now it just sits at a blinking cursor and won't boot. I really, really don't want to have to completely reinstall and reconfigure everything. I have backups, but it's still a royal pain to go through everything: there are so many config files spread all over that I won't even remember them all offhand. It's not like I can just hit a button and restore everything. I'm hoping the actual RAID arrays aren't corrupt, but that's my fear.

OS is CentOS 6.4

Do I have any chance of saving this server? Anything special I can do with a boot CD or something?

Edit: have access to file system now, so it's a good start, but hoping to figure out how to get it to boot now.

Emerson 05-26-2016 08:49 PM

Boot SystemRescueCD (CD or USB stick) and start checking the hard drives. If it boots. You say it won't POST?

Red Squirrel 05-26-2016 08:53 PM

It POSTs: I see the regular BIOS text, but then it just goes to a screen with a blinking cursor. A long time ago I accidentally screwed up a mv /folder command and ran mv / instead. I caught it in time, but a bunch of system stuff got moved. I was able to move it back, but a lot of that stuff needs to be at a specific physical location on the hard drive in order to boot, so I have a feeling it has to do with that. Is there some kind of repair I can run?

As a start I'm going to see if I can mount the raid arrays, if I can at least confirm the data is ok it will make me feel much better. Problem will be trying to assemble 3 raid arrays and know what drive is which. I have a spreadsheet... on the file server. :(

I'm going to end up having to dig through backups either way I think.
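
On knowing which drive belongs to which array without the spreadsheet: mdadm records the array UUID in every member's superblock, so membership can be read back from the disks themselves. A minimal sketch, assuming bash and mdadm are available in the rescue environment. The function names `array_uuid_from_examine` and `map_members` are made up for illustration, and the labels matched (`Array UUID` for 1.x metadata, `UUID` for 0.90) are what `mdadm --examine` normally prints:

```shell
# Print the array UUID found in `mdadm --examine <device>` output on stdin.
# 1.x metadata labels the field "Array UUID"; 0.90 metadata labels it "UUID".
array_uuid_from_examine() {
    awk -F' : ' '/^ *(Array )?UUID : / { split($2, parts, " "); print parts[1]; exit }'
}

# Print "UUID device" for each device that carries an md superblock.
# Sorting puts members of the same array next to each other.
map_members() {
    local dev uuid
    for dev in "$@"; do
        uuid=$(mdadm --examine "$dev" 2>/dev/null | array_uuid_from_examine)
        [ -n "$uuid" ] && printf '%s %s\n' "$uuid" "$dev"
    done | sort
}

# Usage (as root; the partition glob is an assumption about this system):
#   map_members /dev/sd[a-z]1
#   mdadm --assemble --scan    # assemble whatever the superblocks describe
```

If the arrays were built on whole disks rather than partitions, the glob would need adjusting.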

jefro 05-26-2016 09:41 PM

Almost any current "live" media could be used to see what state the drive is in. Your problem may be something other than the drive; it could be any number of failures.

For a number of years, distros have made their installation media live. In fact they don't even bother to mention it now. See if you can get a live CentOS image: https://wiki.centos.org/Download

Red Squirrel 05-26-2016 09:48 PM

Some good news. With a CentOS rescue CD (well a virtual ISO actually) I was able to mount the raid arrays so at least it's one thing less to worry about. Though that does not really mean nothing is corrupted, just that the arrays themselves are fine. The VMs may very well be corrupted given the VM server was also running and the data stores were basically pulled from under the rug so to speak. One of the arrays is resyncing though. I'm going to just leave it alone till that's done.

I've chrooted to the mount point created by the rescue CD and have network access. SSH beats crouching in the hot aisle trying to use a keyboard in that tight space. I really need to get a KVM console, but even a 1-port one is over a grand.

I'm kinda wondering if I can just start all the services from here... Though technically it's probably running on the old kernel that's on the CD, I imagine that could be problematic. Either way I have to wait for that array to resync before I do anything. Any tips on what I may be able to do from here to get the OS to boot?

I have to say mdadm raid is super resilient though, it's never let me down.
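
The resync mentioned above can be followed from `/proc/mdstat` without touching the array. A small sketch, assuming bash and awk; `resync_status` is a hypothetical helper name, and it relies on the usual progress-line layout (`resync = 12.6% ...`):

```shell
# Print "<array> <operation> <percent>" for every md array that is currently
# resyncing, recovering, reshaping or being checked.
# Reads /proc/mdstat by default; a file path can be passed instead.
# Delayed or pending operations (shown as "resync=DELAYED") are not reported.
resync_status() {
    awk '
        /^md[0-9]+ :/ { array = $1 }
        /(resync|recovery|reshape|check) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^(resync|recovery|reshape|check)$/)
                    print array, $i, $(i + 2)
        }
    ' "${1:-/proc/mdstat}"
}

# Usage:
#   resync_status                      # one-off look
#   watch -n 30 cat /proc/mdstat       # or simply watch the raw file
```

An empty result means no array is reporting progress, which is a reasonable point to move on to the boot repair.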

syg00 05-26-2016 09:50 PM

The blinking cursor is grub telling you it can't find the next stage. Like the MBR is there, but no-one else is home. Probably not good. Is your boot partition (if you have one) also RAID'd?
Are we talking mdadm (software) RAID here? If you are lucky (smart) and the arrays were created with current metadata, they should assemble correctly automagically. Are you feeling lucky?

Ahhh, gotta learn to type faster.

Red Squirrel 05-26-2016 10:01 PM

The OS is not on RAID. I always felt that trying to RAID the OS adds too much complexity, as it's a chicken-and-egg scenario: the RAID has to be available for the OS to be seen, but the OS has to be running for the RAID to run... I heard it can be done by having two boot partitions and some kind of mdadm preloader, I just never looked too deeply into it. So the OS is on a single SSD, as I figure it's less likely to fail randomly than a single spindle drive. (It's not an OCZ :P ) /boot is on a separate partition on that drive.

I have access to the server through SSH now, so I can do anything that may need to be done to get it to boot.

Oh, and I can confirm OS is 6.4 and not 6.5.

syg00 05-26-2016 11:16 PM

Get out of the chroot and fsck everything in sight. Starting with the /boot. Force the fsck, but don't auto-reply; I always want to know what is broken, and how many. Even if I don't understand it. Gives me some gut feel as to whether I should really be restoring the whole filesystem rather than trust it again.
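
A forced check that reports problems without repairing anything could look like the sketch below. It assumes ext2/3/4 filesystems (the CentOS 6 default) and bash; `is_mounted` and `survey_fs` are hypothetical helper names. `e2fsck -f` checks even a filesystem marked clean, and `-n` opens it read-only and answers "no" to every prompt:

```shell
# Return 0 if the device is listed as a mount source in the mounts table.
# Note: the match is on the exact path, so an LVM volume must be given the
# way /proc/mounts shows it (for example /dev/mapper/vg-lv).
is_mounted() {
    local dev=$1 table=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Forced, read-only survey of one filesystem. Nothing is modified.
survey_fs() {
    local dev=$1
    if is_mounted "$dev"; then
        echo "refusing: $dev is mounted" >&2
        return 1
    fi
    e2fsck -f -n "$dev"
}

# Usage (as root, outside the chroot, filesystem unmounted):
#   survey_fs /dev/sdv1      # device name taken from this thread
#   e2fsck -f /dev/sdv1      # later: interactive run, answer each prompt
```

Starting with the read-only pass gives the count of problems up front, before deciding whether to repair or restore.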

Red Squirrel 05-26-2016 11:27 PM

I can try that, but I don't suspect this is a file system issue, though it won't hurt to check anyway as a fsck has not been run in 1-2 years. The same goes for the RAID arrays; I should probably run one on all the file systems.

I'm thinking the issue has to do with my mv / mishap I did several years back. The files were moved back where they go, but I'm wondering if there are some attributes that are wrong, or something along those lines. Is there some kind of tool I can run that will fix all that? Like /boot, the MBR etc.

Red Squirrel 05-27-2016 12:18 AM

Oh and it might help to add, I get the blinking cursor before I get to the grub menu. It's set to automatically go to first option after 5 seconds, but that never comes up. So the issue is with grub, most likely.

Actually another thing, the grub boot file refers to (hd0,0) but the first hard drive actually ends up being one of the storage drives, they show up first, and the internal SSD is at the end (sdv). Could this be an issue? I don't really want to map to sdv as I'll be in the same boat if I add more drives to the system. Is there a way to make it use the GUID?

Emerson 05-27-2016 03:54 AM

Linux kernel can mount by PARTUUID.
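
Some context on stable names, offered as a sketch rather than a diagnosis. In GRUB legacy, `(hd0,0)` counts drives in BIOS order, not in Linux `sdX` order, and the `root=/dev/mapper/vg_isengard-lv_root` kernel argument is an LVM name that does not depend on drive letters either. Where a drive letter can still leak in is `/etc/fstab`, and there the filesystem UUID reported by `blkid` is the usual replacement. The helper name `uuid_spec` is made up for illustration:

```shell
# Turn one line of `blkid` output (read on stdin) into an fstab device spec.
# Only the filesystem UUID is matched; PARTUUID and UUID_SUB are ignored.
uuid_spec() {
    sed -n 's/.*[[:space:]]UUID="\([^"]*\)".*/UUID=\1/p'
}

# Usage (as root; /dev/sdv1 is assumed to be the /boot partition):
#   blkid /dev/sdv1 | uuid_spec
# The result replaces the device column of the /boot line in /etc/fstab:
#   UUID=<value printed above>  /boot  ext4  defaults  1 2
```

With that in place, adding or removing storage drives should not change what gets mounted on /boot.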

jpollard 05-27-2016 05:51 AM

Quote:

Originally Posted by Red Squirrel (Post 5551575)
Oh and it might help to add, I get the blinking cursor before I get to the grub menu. It's set to automatically go to first option after 5 seconds, but that never comes up. So the issue is with grub, most likely.

Actually another thing, the grub boot file refers to (hd0,0) but the first hard drive actually ends up being one of the storage drives, they show up first, and the internal SSD is at the end (sdv). Could this be an issue? I don't really want to map to sdv as I'll be in the same boat if I add more drives to the system. Is there a way to make it use the GUID?

That is more like the BIOS lost identification of the disk to boot.

Red Squirrel 05-27-2016 12:53 PM

It's always been that way, I have 3 HBAs, for whatever reason they are always first, then the onboard is last. The OS drive is onboard. The BIOS sees it.

I'm pretty sure the reason it won't boot is because of a screw up I did once involving the mv command many years ago. I knew if I was to reboot that machine I'd be in trouble, which is why I spent over a grand in UPS batteries... but it failed me. I need to look at dual conversion but that's a couple more grand that I don't have to spend. I just need to know how do I go about repairing the MBR, but also how to modify the grub file so it refers to the GUID and not the actual letter, as that will change at times. This is what the grub.conf looks like:

Code:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /boot/, eg.
#          root (hd0,0)
#          kernel /vmlinuz-version ro root=/dev/mapper/vg_isengard-lv_root
#          initrd /initrd-[generic-]version.img
#boot=/dev/sda
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.32-573.22.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-573.22.1.el6.x86_64 ro root=/dev/mapper/vg_isengard-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_isengard/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_isengard/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet selinux=0
        initrd /initramfs-2.6.32-573.22.1.el6.x86_64.img
title CentOS (2.6.32-573.3.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-573.3.1.el6.x86_64 ro root=/dev/mapper/vg_isengard-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_isengard/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_isengard/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet selinux=0
        initrd /initramfs-2.6.32-573.3.1.el6.x86_64.img
title CentOS (2.6.32-504.30.3.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-504.30.3.el6.x86_64 ro root=/dev/mapper/vg_isengard-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_isengard/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_isengard/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet selinux=0
        initrd /initramfs-2.6.32-504.30.3.el6.x86_64.img
title CentOS (2.6.32-504.23.4.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-504.23.4.el6.x86_64 ro root=/dev/mapper/vg_isengard-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_isengard/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_isengard/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet selinux=0
        initrd /initramfs-2.6.32-504.23.4.el6.x86_64.img
title CentOS (2.6.32-358.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-358.el6.x86_64 ro root=/dev/mapper/vg_isengard-lv_root rd_NO_LUKS LANG=en_US.UTF-8 rd_NO_MD rd_LVM_LV=vg_isengard/lv_swap SYSFONT=latarcyrheb-sun16 crashkernel=128M rd_LVM_LV=vg_isengard/lv_root  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet selinux=0
        initrd /initramfs-2.6.32-358.el6.x86_64.img



I've also read about grub-install, but it fails when I try it, it says:

Code:

[root@localhost grub]# grub-install /dev/sdv
/dev/sdv does not have any corresponding BIOS drive.


I also found a path called /dev/mapper/vg_isengard-lv_root, which I think may be a static way to refer to /dev/sdv, so I tried that too and got the same error.

Is there something in Linux equivalent to fdisk /mbr?
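
The "does not have any corresponding BIOS drive" message usually means GRUB legacy has no mapping from the Linux device to a BIOS drive number. One way around it is to state the mapping in the grub shell and rewrite the MBR from there. A hedged sketch: it assumes the BIOS boots from the SSD (so the SSD is `(hd0)` at boot time) and that /boot is its first partition, as the `root (hd0,0)` lines in grub.conf suggest. `grub_setup_script` is a hypothetical helper that only prints the commands, so they can be reviewed before anything is written:

```shell
# Print a GRUB legacy shell script that maps the given disk to (hd0),
# points GRUB at the /boot partition, and installs stage1 into the MBR.
# $1 = disk device, $2 = zero-based index of the /boot partition (default 0).
grub_setup_script() {
    local disk=$1 boot_part_index=${2:-0}
    cat <<EOF
device (hd0) $disk
root (hd0,$boot_part_index)
setup (hd0)
quit
EOF
}

# Usage (as root, inside the chroot so /boot/grub is the real one):
#   grub_setup_script /dev/sdv              # review the commands first
#   grub_setup_script /dev/sdv | grub --batch
```

Adding a `(hd0) /dev/sdv` line to /boot/grub/device.map and rerunning grub-install is the other common route to the same result.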

Red Squirrel 05-27-2016 02:12 PM

Ok so I managed to run grub-install. I had to pull out all the other drives so that it's only the internal drive, and it was now /dev/sda. I ran it, and now I get to the boot loader, and I see the progress bar showing that it's loaded. For some reason it says CentOS 6.7 when /etc/issue says 6.4, so not sure what that is about or if it matters. But now what happens is the progress bar goes to the end, then it just sits there forever. I can actually ping the machine, but can't SSH to it. So it's not exactly frozen, but still not booting fully. It does this with and without all the drives.

Edit: Ok so I found out I can hit esc to see what's happening. Getting a whole bunch of exportfs errors that it can't resolve hostnames... why? Why is this holding up the system from booting? The DNS server is in a VM, the VM's data store is on THAT server! Is there anything I can do to bypass this?
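
A likely reason for the stall is that exportfs has to resolve every hostname listed in /etc/exports, and with the DNS server down each lookup waits for a timeout. One way to break that dependency is to list the NFS clients in /etc/hosts, or to export to IP addresses instead of names. The sketch below only reports which export clients have no /etc/hosts entry. Both function names are hypothetical, and it ignores quoted paths and backslash line continuations in /etc/exports:

```shell
# List plain hostnames used as clients in an exports file.
# IP addresses, networks, wildcards, netgroups and default options are skipped.
exports_hostnames() {
    awk '
        /^[[:space:]]*(#|$)/ { next }
        {
            for (i = 2; i <= NF; i++) {
                client = $i
                sub(/\(.*/, "", client)                         # drop "(rw,sync)"
                if (client == "" || client ~ /^[-@]/) continue  # options, netgroups
                if (client ~ /[][*?]/) continue                 # wildcards
                if (client ~ /:/) continue                      # IPv6 addresses
                if (client ~ /^[0-9.]+(\/[0-9.]+)?$/) continue  # IPv4 and networks
                print client
            }
        }
    ' "${1:-/etc/exports}" | sort -u
}

# Print the export hostnames that are missing from the hosts file,
# meaning they can only be resolved through DNS.
needs_dns() {
    local exports=${1:-/etc/exports} hosts=${2:-/etc/hosts} name
    exports_hostnames "$exports" | while read -r name; do
        awk -v n="$name" '
            { sub(/#.*/, ""); for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$hosts" || echo "$name"
    done
}

# Usage:
#   needs_dns        # every name printed is a candidate for /etc/hosts
```

Each name it prints either gets a line in /etc/hosts or is replaced by an address in /etc/exports, after which the exports no longer wait on the DNS VM.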

Red Squirrel 05-27-2016 02:49 PM

I may have gotten lucky, I totally forgot my old DNS server is still running, just the service was off. I started switching stuff back to that DNS server and was able to get into the file server.

Now to see what the damage is on my VMs, it's not looking too good as everything is locked right up, but now that I fixed DNS it is seeing the data stores at least...


All times are GMT -5.