DRBD / Heartbeat: disconnection of the primary node

Closed
Lassaad MATHLOUTHI - Oct 10, 2008 at 11:01
Last reply: Lassaad MATHLOUTHI - Oct 10, 2008 at 13:51
Hello,

I ran into a problem 4 days ago: I am setting up a cluster
with 2 nodes, one primary and one secondary.

Here is the architecture and the configuration:

2 IBM servers with RAID 1
DRBD version: 8.2.6 (api:88/proto:86-88) and Heartbeat installed on both servers
############################################
############################################
drbd.conf:

#
# drbd.conf
#
resource r1 {
  protocol B;
  #incon-degr-cmd "halt -f";

  #incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
  #handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f"; }

  startup {
    #degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    #on-io-error detach;
  }

  net {
    #sndbuf-size 512k;
    #timeout 60; # 6 seconds (unit = 0.1 seconds)
    #connect-int 10; # 10 seconds (unit = 1 second)
    #ping-int 10; # 10 seconds (unit = 1 second)
    #ping-timeout 50; # 500 ms (unit = 0.1 seconds)
    #max-buffers 8000;
    #max-epoch-size 8000;
  }

  syncer {
    rate 2048;
    #group 1;
    #al-extents 257;
  }

  on ardiaserv11 {
    device /dev/drbd0;
    disk /dev/sda4;
    address 192.168.1.246:7788;
    meta-disk internal;
  }

  on ardiaserv12 {
    device /dev/drbd0;
    disk /dev/sda4;
    address 192.168.1.247:7788;
    meta-disk internal;
  }
}
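
Since the only active tuning in this file is the syncer rate, it may help to see how long a full resynchronisation would take with it. The following is only a rough sketch: it assumes DRBD 8.x reads a bare `rate 2048;` as KiB/s, and the 100 GiB device size is a made-up figure because the size of /dev/sda4 is not given here. The function name `full_resync_seconds` is hypothetical.

```python
# Rough full-resync time estimate for the "syncer { rate 2048; }" setting.
# Assumptions: a bare rate is interpreted as KiB/s, and the device size
# used below is hypothetical (the post does not state it).


def full_resync_seconds(device_gib: float, rate_kib_per_s: int) -> float:
    """Seconds needed to copy the whole device at the configured syncer rate."""
    device_kib = device_gib * 1024 * 1024  # GiB -> KiB
    return device_kib / rate_kib_per_s


if __name__ == "__main__":
    size_gib = 100  # hypothetical size of the backing partition
    secs = full_resync_seconds(size_gib, 2048)
    print(f"{size_gib} GiB at 2048 KiB/s: about {secs / 3600:.1f} hours")
```

During that whole window the node receiving the data is a sync target, so the estimate gives an idea of how long it cannot safely take over.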
############################################
/etc/ha.d/ha.cf:

#bcast eth0 is not used: the network contains 2 clusters, so we use unicast instead

ucast eth0 192.168.1.247
#baud 19200
#serial /dev/ttyS0
#bcast eth1


debugfile /var/log/ha-debug
logfile /var/log/ha-log
logfacility local0

keepalive 2
deadtime 10
warntime 6
initdead 60

udpport 694
node ardiaserv11
node ardiaserv12

auto_failback off
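
As a quick sanity check on the four timers above, here is a minimal sketch that parses them and tests two rules of thumb commonly given for Heartbeat: keepalive < warntime < deadtime, and initdead at least twice deadtime. It assumes plain integer seconds (no `ms` suffix); the helper names `parse_timers` and `check_timers` are hypothetical and are not part of Heartbeat.

```python
# Parse the timing directives of an ha.cf fragment and check their ordering.
# Assumption: values are plain integers in seconds, as in the file above.

HA_CF = """\
keepalive 2
deadtime 10
warntime 6
initdead 60
"""

TIMER_NAMES = ("keepalive", "deadtime", "warntime", "initdead")


def parse_timers(text: str) -> dict:
    """Return {directive: seconds} for the timing directives found in text."""
    timers = {}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop comments, then tokenise
        if len(parts) == 2 and parts[0] in TIMER_NAMES:
            timers[parts[0]] = int(parts[1])
    return timers


def check_timers(t: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not t["keepalive"] < t["warntime"] < t["deadtime"]:
        problems.append("expected keepalive < warntime < deadtime")
    if t["initdead"] < 2 * t["deadtime"]:
        problems.append("initdead should be at least twice deadtime")
    return problems


if __name__ == "__main__":
    print(check_timers(parse_timers(HA_CF)) or "timers look consistent")
```

With the values from this ha.cf the check finds nothing to report, so the timers themselves do not look like the source of the problem.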
##############################################
/etc/ha.d/haresources:
ardiaserv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 IPaddr::192.168.1.250 MailTo::lassaad.mathlouthi@alpha-engineering.net::Cluster1-StatusUpdated fetchmail
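
To make the structure of that single line explicit: the first word is the preferred node, and each following word is a resource script with its `::`-separated arguments. Heartbeat starts the resources left to right and stops them right to left. A minimal sketch of that reading follows; the function names are hypothetical and this is not Heartbeat's own parser.

```python
# Split a haresources line into its preferred node and resource list,
# then derive the start order (left to right) and stop order (right to left).

LINE = (
    "ardiaserv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 "
    "IPaddr::192.168.1.250 "
    "MailTo::lassaad.mathlouthi@alpha-engineering.net::Cluster1-StatusUpdated "
    "fetchmail"
)


def parse_haresources(line: str):
    """Return (preferred_node, [(script, arg1, arg2, ...), ...])."""
    node, *resources = line.split()
    return node, [tuple(res.split("::")) for res in resources]


def start_order(resources) -> list:
    """Script names in the order they are started."""
    return [res[0] for res in resources]


def stop_order(resources) -> list:
    """Script names in the order they are stopped (reverse of start)."""
    return [res[0] for res in reversed(resources)]


if __name__ == "__main__":
    node, resources = parse_haresources(LINE)
    print("preferred node:", node)
    print("start:", " -> ".join(start_order(resources)))
    print("stop: ", " -> ".join(stop_order(resources)))
```

Because drbddisk comes first, nothing else in the group (mount, IP address, mail, fetchmail) can start on a node where DRBD refuses to become Primary.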
#################################################



So, when I restart heartbeat on the primary node, the secondary node takes over,
mounts the /data partition and becomes primary -----> correct behaviour.

But the problem is that when I unplug the primary node -----> the cluster hangs, and here is the log on
the secondary node:


#####################################################

Oct 10 12:16:31 ardiaserv12 kernel: drbd0: PingAck did not arrive in time.
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: asender terminated
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: Terminating asender thread
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: short read expecting header on sock: r=-512
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: Writing meta data super block now.
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: tl_clear()
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: Connection closed
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: receiver terminated
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: receiver (re)started
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: conn( Unconnected -> WFConnection )
Oct 10 12:16:31 ardiaserv12 heartbeat[2810]: WARN: node ardiaserv11: is dead
Oct 10 12:16:31 ardiaserv12 heartbeat[2810]: WARN: No stonith device configured.
Oct 10 12:16:31 ardiaserv12 heartbeat[2810]: WARN: Shared disks are not protected.
Oct 10 12:16:31 ardiaserv12 heartbeat[2810]: info: Resources being acquired from ardiaserv11.
Oct 10 12:16:31 ardiaserv12 heartbeat[2810]: info: Link ardiaserv11:eth2 dead.
Oct 10 12:16:31 ardiaserv12 heartbeat[3039]: debug: notify_world: setting SIGCHLD Handler to SIG_DFL
Oct 10 12:16:31 ardiaserv12 heartbeat: info: Running /etc/ha.d/rc.d/status status
Oct 10 12:16:31 ardiaserv12 heartbeat[3040]: info: No local resources [/usr/lib/heartbeat/ResourceManagement listkeys ardiaserv12] to acquire.
Oct 10 12:16:31 ardiaserv12 heartbeat[2810]: debug: StartNextRemoteRscReq(): child count 1
Oct 10 12:16:31 ardiaserv12 heartbeat: info: Taking over resource group drbddisk::r1
Oct 10 12:16:31 ardiaserv12 heartbeat: info: Acquiring resource group: ardiaserv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 IPaddr::192.168.1.250 MailTo::lassaad.mathlouthi@alpha-engineering.net::Cluster1-StatusUpdated fetchmail
Oct 10 12:16:31 ardiaserv12 heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r1 start
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:31 ardiaserv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:32 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:32 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:32 ardiaserv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:33 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:33 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:33 ardiaserv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:34 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:34 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:48 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:48 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:48 ardiaserv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:49 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:49 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:49 ardiaserv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:50 ardiaserv12 kernel: drbd0: State change failed: Refusing to be Primary without at least one UpToDate disk
Oct 10 12:16:50 ardiaserv12 kernel: drbd0: state = { cs:WFConnection st:Secondary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:50 ardiaserv12 kernel: drbd0: wanted = { cs:WFConnection st:Primary/Unknown ds:Inconsistent/DUnknown r--- }
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: /etc/ha.d/resource.d/drbddisk r1 start done. RC=1
Oct 10 12:16:50 ardiaserv12 heartbeat: ERROR: Return code 1 from /etc/ha.d/resource.d/drbddisk
Oct 10 12:16:50 ardiaserv12 heartbeat: CRIT: Giving up resources due to failure of drbddisk::r1
Oct 10 12:16:50 ardiaserv12 heartbeat: info: Releasing resource group: ardiaserv11 drbddisk::r1 Filesystem::/dev/drbd0::/data::ext3 IPaddr::192.168.1.250 MailTo::lassaad.mathlouthi@alpha-engineering.net::Cluster1-StatusUpdated fetchmail
Oct 10 12:16:50 ardiaserv12 heartbeat: info: Running /etc/init.d/fetchmail stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: Starting /etc/init.d/fetchmail stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: /etc/init.d/fetchmail stop done. RC=0
Oct 10 12:16:50 ardiaserv12 heartbeat: info: Running /etc/ha.d/resource.d/MailTo lassaad.mathlouthi@alpha-engineering.net Cluster1-StatusUpdated stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: Starting /etc/ha.d/resource.d/MailTo lassaad.mathlouthi@alpha-engineering.net Cluster1-StatusUpdated stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: /etc/ha.d/resource.d/MailTo lassaad.mathlouthi@alpha-engineering.net Cluster1-StatusUpdated stop done. RC=0
Oct 10 12:16:50 ardiaserv12 heartbeat: info: Running /etc/ha.d/resource.d/IPaddr 192.168.1.250 stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: Starting /etc/ha.d/resource.d/IPaddr 192.168.1.250 stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: /etc/ha.d/resource.d/IPaddr 192.168.1.250 stop done. RC=0
Oct 10 12:16:50 ardiaserv12 heartbeat: info: Running /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: Starting /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop
Oct 10 12:16:50 ardiaserv12 heartbeat: WARNING: Filesystem /data not mounted?
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: /etc/ha.d/resource.d/Filesystem /dev/drbd0 /data ext3 stop done. RC=0
Oct 10 12:16:50 ardiaserv12 heartbeat: info: Running /etc/ha.d/resource.d/drbddisk r1 stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: Starting /etc/ha.d/resource.d/drbddisk r1 stop
Oct 10 12:16:50 ardiaserv12 heartbeat: debug: /etc/ha.d/resource.d/drbddisk r1 stop done. RC=0
Oct 10 12:16:50 ardiaserv12 heartbeat: info: /usr/lib/heartbeat/mach_down: nice_failback: foreign resources acquired
Oct 10 12:16:50 ardiaserv12 heartbeat[2810]: info: mach_down takeover complete.
Oct 10 12:16:50 ardiaserv12 heartbeat: info: mach_down takeover complete for node ardiaserv11.
##############################################
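
The key lines in the log above are the repeated "State change failed" messages. When the link dropped, the connection state on ardiaserv12 was SyncTarget, so its local disk was still Inconsistent, and DRBD refuses to promote a node that has no UpToDate disk to rely on. The sketch below reads one such state line and applies that rule in simplified form. The names `parse_state` and `can_promote` are hypothetical, and the sample line has its spacing normalised.

```python
import re

# One of the kernel state lines from the log (spacing normalised).
STATE_LINE = (
    "drbd0: state = { cs:WFConnection st:Secondary/Unknown "
    "ds:Inconsistent/DUnknown r--- }"
)

# Matches "cs:X", "st:X/Y" and "ds:X/Y", tolerating stray spaces.
FIELD = re.compile(r"\b(cs|st|ds):\s*(\w+)(?:\s*/\s*(\w+))?")


def parse_state(line: str) -> dict:
    """Return {'cs': (conn, None), 'st': (local, peer), 'ds': (local, peer)}."""
    return {key: (a, b or None) for key, a, b in FIELD.findall(line)}


def can_promote(state: dict) -> bool:
    """Simplified rule: promotion needs at least one UpToDate disk."""
    return "UpToDate" in state["ds"]


if __name__ == "__main__":
    state = parse_state(STATE_LINE)
    print("connection:", state["cs"][0])
    print("disk (local/peer):", "/".join(state["ds"]))
    print("promotion allowed:", can_promote(state))
```

For the state shown in the log this rule does not allow promotion, which is consistent with drbddisk returning RC=1 and Heartbeat giving up the resource group.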

Note: when I plug the primary node's network cable back in, the cluster unblocks and the resources come up on the primary.


Thanks a lot in advance.
Lassaad.

1 reply

Lassaad MATHLOUTHI
Oct 10, 2008 at 13:51
No reply?