Hi,
I moved my Proxmox cluster, consisting essentially of two physical servers, two Cisco NAS units where the (KVM) VM images live, and two switches, to a new data centre, where everything now has new IP addresses.
I've also asked this question on the Proxmox forum, but I think it's essentially a DRBD problem.
I reconfigured basic networking on the two servers, updated the IP addresses in /etc/pve/cluster.cfg and rebooted the boxes, master node first.
The storage is set up as /dev/drbdvg0 and /dev/drbdvg1; both are used to store KVM guest virtual machine images and are visible to both servers. I didn't install this myself, and I'm not that familiar with DRBD or, indeed, iSCSI.
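For reference, as far as I understand DRBD, the replication peers are defined by static address lines in the resource files under /etc/drbd.d/ (or directly in /etc/drbd.conf). I haven't pasted our real config here yet, but I believe it looks roughly like the sketch below; the resource name, hostnames, disks and addresses are placeholders, not our actual values:
Code:
resource r0 {
    protocol C;
    on node1 {
        device      /dev/drbd0;
        disk        /dev/sdb1;
        address     10.0.0.1:7788;   # replication IP of node1 (placeholder)
        meta-disk   internal;
    }
    on node2 {
        device      /dev/drbd0;
        disk        /dev/sdb1;
        address     10.0.0.2:7788;   # replication IP of node2 (placeholder)
        meta-disk   internal;
    }
}
If those address lines still refer to the old data centre's addressing, I assume that alone could stop the two sides reaching each other, but I'd appreciate confirmation.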
Everything looked fine until I attempted to start a VM on the second (slave) node. It took ages to start, hanging for thirty seconds at a time, and was clearly having trouble communicating with the NAS.
All of the images, including those originally set up on the second node, run fine on the first node (and that's what I'm doing for now).
On the first box, /proc/drbd looks like this:
Code:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
ns:0 nr:0 dw:27568823 dr:156762105 al:309656 bm:309639 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:10184632
1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r----
ns:0 nr:0 dw:2451648 dr:14918745 al:1244 bm:1211 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:1152564
... and very similar on the second:
Code:
version: 8.3.7 (api:88/proto:86-91)
srcversion: EE47D8BF18AC166BE219757
0: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
ns:0 nr:0 dw:0 dr:1705944 al:0 bm:107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:954596
1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r----
ns:0 nr:0 dw:0 dr:1821288 al:0 bm:107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:520192
So at some level they aren't talking to each other: the first node is sitting in WFConnection, the second has dropped to StandAlone, and I don't see the usual "UpToDate/UpToDate" anywhere.
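In case it helps, these are the checks I've been running (r0 is my guess at the resource name for minor 0; our config may call it something else, and 7788 is just the usual default DRBD port):
Code:
# connection state and disk state for the resource
drbdadm cstate r0
drbdadm dstate r0

# what addresses DRBD has been told to use
drbdadm dump r0 | grep address

# basic reachability of the peer's replication address, and whether DRBD is listening
ping -c 3 <peer-replication-ip>
netstat -tln | grep 7788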
I'm also seeing lots of messages like this on the second node:
Code:
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4329026692, last ping 4329027942, now 4329029192
connection1:0: detected conn error (1011)
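Those look to me like iSCSI initiator timeouts rather than DRBD messages, so on the second node I've also been meaning to check the sessions to the NAS units with something like this (assuming open-iscsi is what's in use here):
Code:
# list current iSCSI sessions and their state
iscsiadm -m session -P 1

# show the node records, i.e. which portal addresses the initiator will try
iscsiadm -m node
If the node records still point at the NAS units' old IP addresses, I guess that would explain the ping timeouts, but again I'd rather check than assume.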
Can anyone suggest what might have gone wrong here? A cabling issue, maybe? Or how to fix it? I'm particularly anxious not to lose the updates to the images as seen by the first node if the two do manage to sync up; I really don't want to lose or corrupt the VM images!
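From what I've read in the DRBD 8.3 documentation, if this turns out to be a split-brain (both sides Primary while disconnected), the manual recovery is roughly the commands below, run with the second node as the "victim" so that the first node's data wins. I haven't run any of this yet, r0 is again a guessed resource name, and I'd really like confirmation that it's the right approach before touching anything:
Code:
# on the second (victim) node - shut down its VMs first, then:
drbdadm secondary r0
drbdadm -- --discard-my-data connect r0

# on the first (surviving) node, only if it is not already sitting in WFConnection:
drbdadm connect r0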
Very grateful for any advice.