LinuxQuestions.org - Home made cluster node fails to start sometimes on account of nfs


Hi all,
I've built a small home-made cluster consisting of a master and one diskless slave node.

Lately node 1 sometimes fails to start, reporting the following messages:

--------------------------------------------------------
IP-Config: Complete:
[ 12.318051] device=eth0, addr=192.168.100.21, mask=255.255.255.0, gw=192.168.100.2,
[ 12.414252] host=192.168.100.21, domain=mydomain.com, nis-domain=(none),
[ 12.499742] bootserver=192.168.100.2, rootserver=192.168.100.2, rootpath=
[ 12.589739] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
[ 12.681474] Looking up port of RPC 100003/2 on 192.168.100.2
[ 12.750322] Looking up port of RPC 100005/1 on 192.168.100.2
[ 12.819465] Root-NFS: Server returned error -13 while mounting /diskless/192.168.100.21
[ 12.915257] VFS: Unable to mount root fs via NFS, trying floppy.
[ 12.987233] VFS: Cannot open root device "nfs" or unknown-block(2,0)
[ 13.063343] Please append a correct "root=" boot option; here are the available partitions:
[ 13.163295] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(2,0)
[ 13.262198] Pid: 1, comm: swapper Not tainted 2.6.31-gentoo-r6 #4

---------------------------------------------------------

If I hard-reboot it through the switch it fails again, whereas if I wait 3-5 minutes and then reboot it, it starts normally:

---------------------------------------------------------

IP-Config: Complete:
[ 12.404086] device=eth0, addr=192.168.100.21, mask=255.255.255.0, gw=192.168.100.2,
[ 12.500328] host=192.168.100.21, domain=mydomain.com, nis-domain=(none),
[ 12.585810] bootserver=192.168.100.2, rootserver=192.168.100.2, rootpath=
[ 12.675797] md: Skipping autodetection of RAID arrays. (raid=autodetect will force)
[ 12.767526] Looking up port of RPC 100003/2 on 192.168.100.2
[ 12.836319] Looking up port of RPC 100005/1 on 192.168.100.2
[ 12.929577] VFS: Mounted root (nfs filesystem) readonly on device 0:15.

---------------------------------------------------------

I have never had this problem before, so perhaps I messed things up by unintentionally altering some configuration file.

As I am at a loss for ideas and have checked (or at least I believe so) every relevant file on the PC and every forum I could find on the web, any help in pinpointing the hitch would be very welcome.

Thanks,
Pier

Well, error 13 is a permissions issue I believe. Do the server logs mention anything about this particular failure? Have you tried increasing the log level of the NFS server during one of these failures?
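
To make the "error 13" reading concrete: on Linux, errno 13 is EACCES (Permission denied), and the boot log prints it negated. The following is only a minimal sketch of a lookup for a few common codes; the function name errno_name is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: map the number from a
# "Root-NFS: Server returned error -N" line to a Linux errno name.
# Only a handful of common codes are listed; anything else is "unknown".
errno_name() {
    n="${1#-}"   # strip a leading minus sign, so -13 and 13 both work
    case "$n" in
        1)  echo "EPERM (Operation not permitted)" ;;
        2)  echo "ENOENT (No such file or directory)" ;;
        5)  echo "EIO (Input/output error)" ;;
        13) echo "EACCES (Permission denied)" ;;
        *)  echo "unknown ($n)" ;;
    esac
}

errno_name -13
```

If it really is a permissions refusal, running `exportfs -v` on the server and `showmount -e 192.168.100.2` from another machine should show whether /diskless/192.168.100.21 is exported to 192.168.100.21 at the moment of the failure.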

Quote:

Originally Posted by MS3FGX (Post 3909620)

As a matter of fact, I had a look at the boot messages of node1 and noticed there were problems with the NFS mount.

I managed to get node1 starting again by removing the lines related to NTP, which I had recently added in order to set the master's time correctly. /etc/conf.d/local.start reads:

# /etc/conf.d/local.start

# This is a good place to load any misc programs
# on startup (use &>/dev/null to hide output)
echo
# eth0 -> internet
echo "Setting eth0 192.168.0.129 up ..."
ifconfig eth0 192.168.0.129 up
route add default gw 192.168.0.1 dev eth0
echo
echo
# eth1 -> Gigabit for fast communication with node1
echo "Setting eth1 192.168.99.2 up ..."
ifconfig eth1 192.168.99.2 up
echo
echo "Setting eth2 192.168.100.2 up ..."
# eth2 -> communications with node1
ifconfig eth2 192.168.100.2 up
echo
echo "Enabling fooldns..."
cp /etc/resolv.conf.fooldns /etc/resolv.conf
echo
echo "Enabling WoL mode on on-board eth0"
echo
ethtool -s eth0 wol g
echo

################ Starting node1 ########################
echo "Starting the dhcpd daemon listening on eth2..." #
/etc/init.d/dhcpd start #
echo #
sleep 2 #
echo "Booting node 1..." #
echo #
/sbin/node_01.up #
########################################################

--------------- what follows has been removed ----------
sleep 2

echo
echo "Updating system time"
echo

if ping -c 1 -q -W 2 -w 2 ntp1.ien.it >/dev/null; then
    /usr/sbin/ntpdate ntp1.ien.it
else
    echo
    echo "Internet unreachable: cannot update the time"
    echo
fi

########################################################
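
If the time update is still wanted, one thing that might be worth trying is to run it before dhcpd and node1 are started, instead of two seconds after the node has been woken. This is only a sketch under an unconfirmed assumption: that the trouble comes from the master's clock being stepped while node1 is mounting its root. The function name sync_time and the variables NTP_SERVER and NTPDATE are illustrative and do not come from the original file.

```shell
#!/bin/sh
# Hypothetical helper: sync the master clock once, early in local.start,
# so the clock is not stepped while node1 is booting over NFS.
# NTP_SERVER and NTPDATE are illustrative variables (assumptions).
NTP_SERVER="${NTP_SERVER:-ntp1.ien.it}"
NTPDATE="${NTPDATE:-/usr/sbin/ntpdate}"

sync_time() {
    # Same reachability test as the removed lines: one ping, 2 s limit.
    if ping -c 1 -q -W 2 -w 2 "$NTP_SERVER" >/dev/null 2>&1; then
        "$NTPDATE" "$NTP_SERVER"
    else
        echo "Internet unreachable: cannot update the time"
        return 1
    fi
}
```

In local.start this would be called as a plain `sync_time` line placed after the interfaces and resolv.conf are set up but before `/etc/init.d/dhcpd start`, so that the node is only woken once the clock has settled.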