SLES 11 cluster (Heartbeat) shared file system config
Hello,
I have two HP DL585 servers with identical hardware and SAN storage. We were planning to go live with a cluster, but go-live deadlines forced us to configure only one server with the SAN. We now run SAP on one server with SLES 11 SP1. All the SAP-related partitions are on the SAN, and the OS partitions (and oraarch) are on the server's local file system. The new server also has SLES 11 SP1 installed, and both servers can see the SAN. Our SAP Basis consultant has now asked me to configure the OS-level cluster so that he can use the OS cluster service and configure the SAP cluster. So I need to start configuring the cluster and need some information to make sure everything goes smoothly.

This is what I plan to do, using Heartbeat for the cluster configuration:
1. Edit /etc/hosts for server communication
2. Connect both servers using a crossover cable
3. Install Heartbeat on the current production system (heartbeat-1:~ # yast2 heartbeat)
4. Install Heartbeat on the new server (heartbeat-1:~ # yast -i heartbeat)
5. Define the communication channels (bind network, redundancy)
6. Define the authentication settings on the production server
7. Transfer the configuration to the new server (with csync2 for synchronization)
8. Start the initial synchronization (csync2 -xv)
9. Configure OpenAIS to start the cluster services automatically at boot on both servers
10. Start the cluster on both servers (/etc/init.d/heartbeat start)
11. Set a password for hacluster (heartbeat-1:~ # passwd hacluster)
12. Configure the cluster (heartbeat-1:~ # hb_gui)
13. Configure the global cluster options (with STONITH enabled)
14. Configure the cluster resources (hb_gui / Pacemaker) and add the IP addresses and file systems

However, I need to resolve the following concerns first. Please help me with them.
1. Do I need to configure a trust relationship between the two servers' root users?
2. I am confused about file system clustering. Do I need to add the file system as a resource in the cluster, or is that an SAP Basis task? If the former, what are the steps I have to go through?
3. Please give me an explanation of how clones come into the picture.
4. Do I need to change the file system to OCF RAs? Can I do that without reformatting my current file system (LVM/ext3)?
5. Other than the server-communication addresses, do I need to configure a logical address on both servers for user access? If yes, how and when do I do that?
6. Where does quorum come into the picture? When do I need to configure it?
* Is there anyone who can help me online when the configuration starts? I know it is a lot of reading, but many thanks if you can help me. Regards, CCIlleperuma. |
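Regarding question 2 above: on a Heartbeat/Pacemaker stack a SAN-backed file system is normally added as a cluster resource via the ocf:heartbeat:Filesystem agent, and the existing ext3 file system does not need to be reformatted for that. As a rough sketch using the crm shell (the device path, mount point, and resource name below are placeholders, not from this thread):

```shell
# Sketch only: manage an existing SAN-backed ext3 mount as a cluster resource.
# Device, directory, and resource names are invented placeholders.
crm configure primitive fs_sapdata ocf:heartbeat:Filesystem \
    params device="/dev/sapvg/sapdata" directory="/sapdata" fstype="ext3" \
    op monitor interval="20s" timeout="40s"
```

The cluster then mounts the file system on whichever node owns the resource, which is what keeps a shared (non-cluster) file system from being mounted on both nodes at once.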
First of all, I'm not an expert in clustering; I just set up my own cluster (active/active) and worked through my troubles at the time, so the only help I can give you is my experience.
When you refer to Heartbeat I assume you mean Heartbeat/Pacemaker: Heartbeat is the lower (messaging) layer and Pacemaker is the component in charge of managing the cluster. You can save yourself a lot of pain by reading the documentation here. I'll try to answer your questions:
You will set your maintenance IP as you always do in Linux, and use the script provided by Heartbeat, /usr/lib/ocf/resource.d/heartbeat/IPaddr, to let the cluster assign the right service IP to the right machine. You will have to configure the CIB to use that script in the following way: Code:
<resources>
The last four lines specify where each IP should run and when. In Pacemaker location constraints, a higher score is more preferred: INFINITY pins a resource to a node, and -INFINITY means never run it there (unless the other machine goes down).
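For reference, a minimal CIB fragment of this shape might look like the following; the IDs, addresses, node names, and scores are placeholders, not the poster's actual configuration:

```xml
<resources>
  <primitive id="ip_1" class="ocf" provider="heartbeat" type="IPaddr">
    <instance_attributes id="ip_1-attrs">
      <nvpair id="ip_1-ip" name="ip" value="192.168.1.10"/>
    </instance_attributes>
  </primitive>
  <primitive id="ip_2" class="ocf" provider="heartbeat" type="IPaddr">
    <instance_attributes id="ip_2-attrs">
      <nvpair id="ip_2-ip" name="ip" value="192.168.1.11"/>
    </instance_attributes>
  </primitive>
</resources>
<constraints>
  <!-- Prefer ip_1 on node1 and ip_2 on node2. INFINITY pins a resource
       to a node, -INFINITY forbids it, intermediate scores are preferences. -->
  <rsc_location id="loc_ip_1" rsc="ip_1" node="node1" score="100"/>
  <rsc_location id="loc_ip_2" rsc="ip_2" node="node2" score="100"/>
</constraints>
```

If one node fails, its preferred resource simply moves to the surviving node, since a preference score (unlike -INFINITY) does not forbid the other placement.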
I hope I have cleared something up, but as I said at the beginning, the documentation will save you A LOT of pain. |
Thanks. It was very helpful.
Also, I am still not certain about the file system. Even if I configure it in /usr/lib/ocf/resource.d, do I need to change the file system of my database to OCF? It is currently ext3. Can you please provide the steps you went through to configure your cluster (in the simplest form, as I am a beginner in Linux)? Regards, ccilleperuma |
I'm glad I could help ;)
What do you mean by 'change the file system of my database to OCF'? OCF is not a file system; it is a standard for scripts. I mean, you have to create a script compliant with OCF if you want it to run in your cluster. OCF seems to be an LSB upgrade for cluster scripts (the /etc/init.d files are examples of LSB scripts); they should accept start, stop, reload, etc. As you can see here, Pacemaker accepts three types of scripts: those compatible with LSB, those compliant with OCF, and legacy Heartbeat resource agents (even though I think the only ones that worked for me were the OCF-compliant ones).

In a quick summary, the steps to follow to start a simple cluster are:

1) Create your /etc/ha.d/ha.cf file with the options you want on both machines (or on however many nodes you will be running). Here is an example from my ha.cf: Code:
logfacility local6

2) Add to your node the piece of configuration I posted in the previous post. That is the simplest node configuration: a cluster where the IP moves from one node to another. Use cibadmin, for instance, to load the configuration into your node.

3) Start your node with /etc/init.d/heartbeat start (I'm running it on Debian). Check the status of the cluster with crm_mon and wait for it to set BOTH service IPs on your machine (note that you have a one-node cluster for now).

4) Start the second node. If ha.cf is correct on both computers, it should connect to the cluster and receive the configuration, so as soon as it starts, one of the IPs should disappear from the first node and move to the second one. Use crm_mon to see what is happening with the cluster.

Of course, you will have to change the node names, IPs, and masks in the configuration files for your own setup.

I almost forgot: if you have a shared resource on your network and you mount it on /usr/lib/ocf/resource.d on both machines, the same scripts will be accessed. Otherwise, you will have to copy the OCF scripts you create to both machines. Be aware that those scripts are a crucial element of your cluster, so if access to that shared resource is lost, your cluster is doomed. Find a way to replicate those files and keep them locally on both machines.

EDIT NOTE: bcast should list both interfaces (in /etc/ha.d/ha.cf). Otherwise, if the link between the machines goes down, both machines will start all services, and we don't want that; if the direct link goes down BUT both machines still have network connectivity, the cluster should work normally (except for the error message in the log). |
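To illustrate the OCF script convention discussed above, here is a minimal sketch of an agent's action dispatch. The service name and state file are invented for the example; a real agent must also implement a meta-data action and return the proper OCF exit codes:

```shell
#!/bin/sh
# Minimal sketch of an OCF-style resource agent dispatcher.
# "myservice" and its state file are placeholders for this example.
STATE_FILE="${STATE_FILE:-/tmp/myservice.state}"

ra_action() {
    case "$1" in
        start)
            # A real agent would start the service here.
            touch "$STATE_FILE" && echo "started" ;;
        stop)
            rm -f "$STATE_FILE" && echo "stopped" ;;
        monitor|status)
            # OCF uses exit code 7 (OCF_NOT_RUNNING) for a cleanly stopped resource.
            if [ -f "$STATE_FILE" ]; then echo "running"; else echo "not running"; return 7; fi ;;
        *)
            echo "usage: $0 {start|stop|monitor}" >&2; return 2 ;;
    esac
}

# Dispatch only when invoked directly with an argument.
if [ $# -gt 0 ]; then ra_action "$1"; fi
```

The cluster calls these actions itself, so the exit codes (not the echoed text) are what Pacemaker actually acts on.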
I started with two PCs for testing purposes before touching the production server. I have another point to clear up. It is said that STONITH is usually implemented as a remote power switch. So do I have to purchase one, or can I implement it as a service on a non-cluster server? What are the risks of not implementing it? Is there any other method to do it?
Regards, ccilleperuma. |
I don't know much about STONITH; I have it disabled, since my cluster was a test project and the services running there are not critical for the company. As far as I can recall, STONITH needs some kind of hardware because it is a mechanism to physically shut a machine down. I can only point you to the documentation here.
About not implementing STONITH... the risk you face depends on the services running on your nodes. As I stated before, you could end up with two machines running at the same time with the same IPs. You could also end up with two databases running at the same time on different machines with the same IPs; you can imagine the mess, even the disaster, that could be. It is up to you to decide how much you want to spend and which services you are going to run.

Making sure you won't start both servers (or the services you care about) at the same time can be enough, but... how can you be sure your machine (or the service) is down before starting the other one? You are supposed to have two network connections on each machine: one straight from one machine to the other just for the heartbeat, and another to the network. So if your network goes down, both machines will know it is a network problem, and they are supposed to recognize that. If the link between the machines goes down, they are supposed to know it is the link, since they can still reach the network. I have not tested this thoroughly, so I can't tell whether it works fine. The line ping 10.50.1.1 in the ha.cf I posted is supposed to check the network (it pings a gateway in my case); about the link, I don't know.

You will certainly have to create your own scripts (OCF-compliant, remember) to start, stop, and check the status of your resource. If those scripts are correct, they will make sure a service is stopped when it is supposed to be, and notify you when it isn't. That will be enough to avoid having a service running on both nodes (from a software point of view; network problems are on another level). If you want to be completely sure one node is not started unless the other is down, I guess you will have to get the hardware to set up STONITH.

Since you have two machines for testing, implement the simple cluster I pointed out in the other post, and once it is running, disconnect one machine from the net and see how the cluster reacts. Then disconnect the link and see how it reacts; shut one machine down, and again see how it reacts. After having the cluster configured correctly, you will be able to decide whether STONITH is necessary for you. Regards. |
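For experimenting before buying fencing hardware: the cluster-glue package ships a few STONITH agents intended purely for testing, such as external/ssh, which "fences" a node by sshing in and rebooting it. It gives no real guarantee (it needs the failing node to still answer ssh), so it is for lab use only. A sketch, with node names assumed:

```shell
# Lab-only sketch: ssh-based fencing for a two-node test cluster.
# Node names are placeholders; never rely on external/ssh in production.
crm configure primitive st-ssh stonith:external/ssh \
    params hostlist="node1 node2"
crm configure property stonith-enabled="true"
```

For production, hardware fencing (IPMI/iLO power control, a managed PDU, or SAN fabric fencing) is what actually guarantees the other node is down.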
I have configured the cluster in the test environment, but I am still not clear about the IP configuration. This is what I have done:

- Configured local LAN IPs on the first NICs (192.100.100.70/71).
- Configured IPs on the second NICs (192.168.30.70/71) and connected them with a crossover cable. Both machines can now ssh to each other without a password.
- The hosts file contains only the cluster-connectivity IPs.

Then, when I started the cluster configuration with Heartbeat, it asked for the following; here is my configuration:

- Communication channels: both given 192.168.30.0
- Multicast address: node 1: 226.0.1.5, node 2: 226.0.1.6 (not clear what this setting does??)
- Multicast port: both given 5454
- Node ID: node 1: 1, node 2: 2

Still, both servers say they are the DC, and each shows the other as offline. Please advise me on this.

PS: I didn't configure any logical IPs yet. Regards, ccilleperuma. |
Hello again.
Could you please post your /etc/ha.d/ha.cf and your /etc/hosts? Could you also post your crm_mon -1 output? I assume you have IPs 192.100.100.70/71 on eth0 and 192.168.30.70/71 on eth1; am I right? I also assume you are using Heartbeat/Pacemaker, and therefore have installed the packages pointed out here (just to make sure we have the same setup). |
Hi,
Sorry for the delay in replying; I had another issue to solve. On my cluster servers there is no ha.cf, probably because of the following: Code:
Ultimately it will change in SLES11, HA will be replaced with OpenAIS and follow the same packaging and naming convention according to the recent changes in the project.
Code:
# This configuration file is not used any more
Code:
aisexec {
Code:
#
Code:
# This is the output of crm_mon -1
Cluster1
Code:
============
Last updated: Sat Nov 19 02:07:35 2011

eth1 on both servers is configured as 192.168.30.70 and .71, connected through a crossover cable. Yes, I have installed both Heartbeat and Pacemaker, but I have not yet configured Pacemaker, as I want to check the cluster connectivity first. In the Heartbeat communication channels, I gave 192.168.30.0 as the bind network address, as it only lets me select subnets (192.100.100.0 / 192.168.30.0). It also asks for a multicast address and port, which I assigned as 226.0.1.5:5454 and 226.0.1.6:5454 respectively. Cluster1's node ID is 1 and Cluster2's is 2; rrp mode is none for both, as I don't have redundant channels. I hope this helps you get an idea of my setup. Thanks very much for the interest you have shown in this issue.

PS: There is a few seconds' difference between the servers' clocks, which can sometimes show as a 1-minute difference. Regards, ccilleperuma. |
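One thing stands out in the setup described above: in OpenAIS/corosync, the multicast address identifies the cluster, so every node must use the SAME mcastaddr and mcastport. Giving node 1 226.0.1.5 and node 2 226.0.1.6 effectively puts them in two separate one-node clusters, which matches the symptom of each node claiming to be the DC and showing the other as offline. A sketch of the relevant totem section (the same on both nodes; the addresses are taken from the thread, the rest is a generic skeleton):

```
totem {
    version: 2
    interface {
        ringnumber: 0
        # Bind to the cross-cable subnet used for cluster traffic.
        bindnetaddr: 192.168.30.0
        # Must be identical on BOTH nodes: one multicast group = one cluster.
        mcastaddr: 226.0.1.5
        mcastport: 5454
    }
}
```

Only the node IDs should differ between the two machines; the communication settings should not.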
Well, first of all, even though you installed Pacemaker, you seem to be using OpenAIS/corosync as the messaging layer instead of Heartbeat.
Either you continue with corosync and set it up according to this link (I haven't used corosync, so I can't help you with that), or you uninstall corosync and use Heartbeat with Pacemaker. In my case (I'm on Debian) the packages that were needed were heartbeat, cluster-glue, pacemaker, and their dependencies. If you decide to use Heartbeat/Pacemaker, you will need a /etc/ha.d/ha.cf file (check my previous posts for reference; the installation should have generated a sample one) and a /etc/ha.d/authkeys (the installation should have generated this one too). This last file can have just two lines, starting with: Code:
auth 2

In ha.cf, make sure you have lines like: Code:
bcast eth0 eth1
node cluster1 cluster2

The bcast line is for the heartbeats: you want them to go through both interfaces. Suppose your link is disconnected or one of those network cards goes down; your cluster will know the other node is alive because it answers through the network, and no resources will be reallocated, which is what should happen (you should only receive a dead-link message in the log). According to the first line you posted, your distro seems to be replacing Heartbeat with OpenAIS; if the packages don't seem to work, go to the link in my previous post and install manually. |
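Put together, a minimal pair of Heartbeat configuration files might look like the following; the shared key, node names, and interfaces are placeholders, and authkeys must be mode 0600 and identical on both nodes:

```
# /etc/ha.d/authkeys (chmod 0600) -- the key string is a placeholder
auth 2
2 sha1 some-shared-secret

# /etc/ha.d/ha.cf -- interfaces and node names are placeholders
logfacility local6
bcast eth0 eth1
node cluster1 cluster2
```

The node names here must match what `uname -n` returns on each machine, or Heartbeat will refuse to bring the node into the cluster.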
I know this is an old thread, but I wanted to say I had the same issue (and beat my head against it for a long time). My cluster was having problems because there were configurations in two places. I was using SLES 11. Eventually I had a phone call with Novell/SUSE support, and it took them a while until they finally found out my config was in two places. It wasn't affecting the heartbeat, but it was affecting cluster function.
That's the problem with having so many different documents to piece together when building a cluster. I was looking at both Novell's docs and the ClusterLabs docs: http://clusterlabs.org/quickstart-suse.html https://www.suse.com/documentation/s...ook_sleha.html If you're interested in the outline of steps I finally put together for myself, it's here: http://geekswing.com/geek/building-a...ter-on-vmware/ It turned out to be much easier than I thought, just using the SLES GUI instead of working from the command line. |