LinuxQuestions.org
Review your favorite Linux distribution.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Software
User Name
Password
Linux - Software This forum is for Software issues.
Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.

Notices


Reply
  Search this Thread
Old 03-22-2017, 02:22 PM   #1
thaylin
LQ Newbie
 
Registered: Oct 2015
Posts: 22

Rep: Reputation: Disabled
Mariadb galera cluster failure issues


So I am running a 3 node galera/mariadb setup on rhel 7, behind haproxy.

On this I am running several applications including 1 instance of confluence and multiple instances of drupal test databases.

OUr webteam, while working on the drupal test instances are causing all sorts of bad SQL statements using drush sql-sync as well as the XML import. At one point I counted about 100 in a 1 second interval on each of the machines.

When this happens the node, and sometimes the cluster itself, crash and burns. In eh case yesterday the webteam was doing an SQL sync and a large number of out of range messages appeared because an int(11) field was trying to be updated with -62106393600

This caused all the nodes to sart throwing errors like this,
Code:
WSREP: Ignoring error for TO isolated action: source: 5a5cadb2-0d9c-11e7-ab52-a200c70c84fe version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24791 trx_id: -1 seqnos (l: 160865, g: 21333163,  s: 21333162, d: 21333162, ts: 2849868880088527
[ERROR] Slave SQL: Error 'Out of range value for column 'created' 
 Internal MariaDB error code: 1264
[Warning] WSREP: RBR event 1 Query apply warning: 1, 21333166
leading to

Code:
WSREP: Failed to apply trx 21165072 4 times
WSREP: Node consistency compromized, aborting...
SREP: Closing send monitor...
SREP: Closed send monitor.
SREP: gcomm: terminating thread
SREP: gcomm: joining thread
SREP: gcomm: closing backend

WSREP: TO isolation failed for: 3, schema: drupal_huntweb_dev01, sql: INSERT INTO semaphore (name, value, expire) VALUES ('variable_init', '186192520558d0140813f645.18635465', '1490031625.08')
Eventually node 0 and node 2 completed crashed , leaving only node 1 active, but rejecting requests because of no quorum. nodes 0 and 2 attempted a self restart but was unable to communicate with node 1. The only way I could seem to correct this was to shut down node 1, but then all the nodes were shutdown improperly and to run a galera_new_cluster required changing safe_to_boot to 1 (all were 0).
Code:
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679524715264 [Note] WSREP: evs::proto(557b5e4c, LEAVING, view_id(REG,41acb7bc,182)) suspecting node: 5a5cadb2
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679524715264 [Note] WSREP: view(view_id(NON_PRIM,41acb7bc,182) memb {
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 557b5e4c,0
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: } joined {
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: } left {
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: } partitioned {
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 41acb7bc,0
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 5a5cadb2,0
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: })
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679524715264 [Note] WSREP: view((empty))
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679524715264 [Note] WSREP: gcomm: closed
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Flow-control interval: [16, 16]
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Received NON-PRIMARY.
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 21346716)
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Received self-leave message.
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Flow-control interval: [0, 0]
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Mar 21 13:46:27 mysqlcl02. mysqld[13085]: 2017-03-21 13:46:27 139679215118080 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 21346716)
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: STATE EXCHANGE: sent state msg: 5f167da0-0e5e-11e7-ab71-2bb6b9da3e79
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: STATE EXCHANGE: got state msg: 5f167da0-0e5e-11e7-ab71-2bb6b9da3e79 from 0 (mysqlcl00)
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: STATE EXCHANGE: got state msg: 5f167da0-0e5e-11e7-ab71-2bb6b9da3e79 from 1 (mysqlcl01)
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: STATE EXCHANGE: got state msg: 5f167da0-0e5e-11e7-ab71-2bb6b9da3e79 from 2 (mysqlcl02)
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: Quorum results:
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: version    = 4,
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: component  = PRIMARY,
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: conf_id    = 153,
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: members    = 1/3 (joined/total),
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: act_id     = 21346759,
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: last_appl. = -1,
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: protocols  = 0/7/3 (gcs/repl/appl),
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: group UUID = abe18d1f-66ce-11e5-be04-6fbc04652ff4
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: Flow-control interval: [866, 866]
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168543594240 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 21346759)
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168853141760 [Note] WSREP: Waiting for SST to complete.
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168852822784 [Note] WSREP: State transfer required:
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: Group state: abe18d1f-66ce-11e5-be04-6fbc04652ff4:21346759
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: Local state: 00000000-0000-0000-0000-000000000000:-1
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168852822784 [Note] WSREP: New cluster view: global state: abe18d1f-66ce-11e5-be04-6fbc04652ff4:21346759, view# 154: Primary, number of nodes: 3, my index: 2, protocol version 3
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168852822784 [Warning] WSREP: Gap in state sequence. Need state transfer.
Mar 21 13:46:54 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:54 140168514238208 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.76.10.103' --datadir '/mysql_data/mysql/'   --parent '8177'  '' '
Mar 21 13:46:54 mysqlcl02. rsyncd[8217]: rsyncd version 3.0.9 starting, listening on port 4444
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Note] WSREP: Prepared SST request: rsync|10.76.10.103:4444/rsync_sst
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Note] WSREP: REPL Protocols: 7 (3, 2)
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Note] WSREP: Assign initial position for certification: 21346759, protocol version: 3
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168602318592 [Note] WSREP: Service thread queue flushed.
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (abe18d1f-66ce-11e5-be04-6fbc04652ff4): 1 (Operation not permitted)
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168543594240 [Warning] WSREP: Member 2.0 (mysqlcl02) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Note] WSREP: Requesting state transfer failed: -11(Resource temporarily unavailable). Will keep retrying every 1 second(s)
Mar 21 13:46:56 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:56 140168543594240 [Warning] WSREP: Member 2.0 (mysqlcl02) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
in this cat TOI is active. I cannot effectively prevent SQL errors, and I would not expect general SQL errors would crash the entire cluster. Can anyone tell me what they think is happening?

ive read this and I think something similar is happening, but this just talks about a node, not the entire cluster:

https://www.percona.com/blog/2014/07...mment-10967902
Code:
MariaDB [(none)]> SHOW VARIABLES LIKE "%version%"; 
+-------------------------+---------------------------------+
| Variable_name           | Value                           |
+-------------------------+---------------------------------+
| innodb_version          | 5.6.35-80.0                     |
| protocol_version        | 10                              |
| slave_type_conversions  |                                 |
| version                 | 10.1.22-MariaDB                 |
| version_comment         | MariaDB Server                  |
| version_compile_machine | x86_64                          |
| version_compile_os      | Linux                           |
| version_malloc_library  | system jemalloc                 |
| version_ssl_library     |  |
| wsrep_patch_version     | wsrep_25.16                     |
+-------------------------+---------------------------------+
10 rows in set (0.00 sec)

#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql/
#

# this is read by the standalone daemon and embedded servers
[server]
bind-address    = 0.0.0.0

# this is only for the mysqld standalone daemon
[mysqld]
#skip-name-resolve
datadir=/mysql_data/mysql
socket=/var/lib/mysql/mysql.sock
innodb_log_file_size=4GB
transaction-isolation=READ-COMMITTED
default-storage-engine=INNODB
character-set-server=utf8
collation-server=utf8_bin
max_allowed_packet=600M
general_log = on
general_log_file=/var/log/query.log

#
# * Galera-related settings
#
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://10.76.10.101,10.76.10.102
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_cluster_name=' cluster1 '
wsrep_node_name='mysqlcl02'
wsrep_node_address='10.76.10.103'
wsrep_sst_method=rsync
wsrep_replicate_myisam=ON
wsrep_dirty_reads=1
wsrep_provider_options="gcs.fc_limit=500;gcs.fc_factor=1.0"
wsrep_notify_cmd=/usr/bin/galeranotify.py
#bind-address=0.0.0.0
#
# Optional setting
#wsrep_slave_threads=1
#innodb_flush_log_at_trx_commit=0

# this is only for embedded server
[embedded]

# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]

# This group is only read by MariaDB-10.0 servers.
# If you use the same .cnf file for MariaDB of different versions,
# use this group for options that older servers don't understand
[mariadb-10.0]
 
Old 03-22-2017, 10:30 PM   #2
dijetlo
Senior Member
 
Registered: Jan 2009
Location: RHELtopia....
Distribution: Solaris 11.2/Slackware/RHEL/
Posts: 1,491
Blog Entries: 2

Rep: Reputation: Disabled
Hi Thaylin,

Quote:
At one point I counted about 100 in a 1 second interval on each of the machines.
When this happens the node, and sometimes the cluster itself, crash and burns....Can anyone tell me what they think is happening?
You're probably suffering from kernel thrashing, check your swap usage during one of these events to confirm. If you are, what's most likely happening is you're nodes are failing their cluster "check_ins" ( because they just aren't processing the cluster signals fast enough, too much contention for kernel bandwidth).
Code:
3085]: 2017-03-21 13:46:27 139679524715264 [Note] WSREP: evs::proto(557b5e4c, LEAVING, view_id(REG,41acb7bc,182)) suspecting node: 5a5cadb2

During the check, if the cluster finds that the time since the last 
time it received a network packet from the node is greater than the value of the evs.keepalive_period parameter, it begins to emit heartbeat beacons.
If the cluster continues to receive no network packets from the node for the period of the evs.suspect_timeout parameter,
the node is declared suspect. Once all members of the Primary Component see the node as suspect, it is declared inactive—that is, failed.
You can adjust these values
wsrep_provider_options="evs.keepalive_period=PT?S",
wsrep_provider_options="evs.inactive_check_period=PT?S",
wsrep_provider_options="evs.suspect_timeout=PT?S",
wsrep_provider_options="evs.inactive_timeout=PT?S"

To give yourself a little more breathing room, since this is a dev environment, it would probably make sense

As far as auto-recovery, make sure all your nodes are set to bootstrap.
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';

or you may have to build some logic around the notify command...
Mar 21 13:46:55 mysqlcl02. mysqld[8177]: 2017-03-21 13:46:55 140168852822784 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification

Something like ->if state change ->quorom 1 of 3: then SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'

As a final suggestion, move the drupal dev web server over to a raspberry pi and jiggle the plug every time you see swapping in the cluster. It wont improve the WebDevs coding skills, but it will reduce the rate at which they can set things on fire.

Hope that helps

Last edited by dijetlo; 03-22-2017 at 10:58 PM.
 
  


Reply


Thread Tools Search this Thread
Search this Thread:

Advanced Search

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
unable to update MariaDB from 10.0 to 10.1, error indicating galera libboost dependencies issue cyberdome Linux - Newbie 1 02-15-2016 03:16 AM
MariaDB Cluster with Galera Replication - max_allowed_packet issue artemidis CentOS 6 01-14-2016 05:51 AM
MariaDB Galera Cluster Starting Issue blakk Red Hat 2 03-05-2015 02:17 PM
LXer: Installing The Galera-Iworx Cluster LXer Syndicated Linux News 0 06-27-2013 11:00 AM

LinuxQuestions.org > Forums > Linux Forums > Linux - Software

All times are GMT -5. The time now is 09:46 AM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration