public interface CPSubsystemManagementService
Moreover, the current CP subsystem implementation works only in memory, without persisting any state to disk. This means that a crashed CP member cannot recover by reloading its previous state. Crashed CP members therefore create a risk of gradually losing the majority of CP groups and, eventually, of losing the availability of the CP subsystem altogether. To prevent such situations, CPSubsystemManagementService offers APIs for dynamic management of CP members.
The CP subsystem relies on Hazelcast's failure detectors to test reachability of CP members. Before removing a CP member from the CP subsystem, please make sure that it is declared as unreachable by Hazelcast's failure detector and removed from Hazelcast's member list.
CP member additions and removals are internally handled by performing a single membership change at a time. When multiple CP members are shutting down concurrently, their shutdown process is executed serially. First, the Metadata CP group creates a membership change plan for CP groups. Then, the scheduled changes are applied to the CP groups one by one. After all removals are done, the shutting down CP member is removed from the active CP members list and its shutdown process is completed.
When a CP member is being shut down, it is replaced with another available CP member in all of its CP groups, including the Metadata group, in order not to decrease or, more importantly, lose the majority of CP groups. If there is no available CP member to replace a shutting down CP member in a CP group, that group's size is reduced by 1 and its majority value is recalculated.
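The recalculated majority follows the usual Raft quorum rule, floor(size / 2) + 1. The sketch below only illustrates that arithmetic; the class and method names are hypothetical and are not part of the Hazelcast API.

```java
/** Hypothetical helper for illustration only; not part of the Hazelcast API. */
final class CpGroupMajority {

    /** Raft majority (quorum) for a group of the given size: floor(size / 2) + 1. */
    static int majority(int groupSize) {
        if (groupSize < 1) {
            throw new IllegalArgumentException("groupSize must be positive: " + groupSize);
        }
        return groupSize / 2 + 1;
    }

    public static void main(String[] args) {
        // A 5-member group needs 3 members to commit.
        System.out.println("size 5 -> majority " + majority(5));
        // One member leaves without a replacement: size 4, majority is still 3.
        System.out.println("size 4 -> majority " + majority(4));
        // Another member leaves without a replacement: size 3, majority drops to 2.
        System.out.println("size 3 -> majority " + majority(3));
    }
}
```

Note that shrinking a group from 5 to 4 members does not lower the majority, so the group tolerates one fewer failure than before.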
A new CP member can be added to the CP subsystem either to increase the number of available CP members for new CP groups or to fill the missing slots in the existing CP groups. After the initial Hazelcast cluster startup is done, an existing Hazelcast member can be promoted to the CP member role. This new CP member automatically joins the CP groups that have missing members, and the majority value of these CP groups is recalculated.
A CP member may crash due to hardware problems or a defect in user code, or it may become unreachable because of connection problems such as network partitions or network hardware failures. If a CP member is known to be alive and has only temporary communication issues, it will catch up with the other CP members and continue to operate normally after its communication issues are resolved. If it is known to have crashed, or its communication issues cannot be resolved in a short time, it can be preferable to remove this CP member from the CP subsystem, and hence from all its CP groups. In this case, the unreachable CP member should be terminated to prevent any accidental communication with the rest of the CP subsystem.
When the majority of a CP group is lost for any reason, that CP group cannot make progress anymore. Even a new CP member cannot join this CP group, because membership changes also go through the Raft consensus algorithm. For this reason, the only option is to force-destroy the CP group via the forceDestroyCPGroup(String) API. When this API is used, the CP group is terminated non-gracefully, without the Raft algorithm mechanics. Then, all CP data structure proxies that talk to this CP group fail with CPGroupDestroyedException. However, if a new proxy is created afterwards, then this CP group will be re-created from scratch with a new set of CP members. Losing the majority of a CP group can be likened to the partition-loss scenario of AP Hazelcast.
Please note that CP groups that have lost their majority must be force-destroyed immediately, because they can block the Metadata CP group from performing membership changes.
Loss of the majority of the Metadata CP group is the doomsday scenario for the CP subsystem. It is a fatal failure and the only solution is to reset the whole CP subsystem state via the restart() API. To be able to reset the CP subsystem, the initial size of the CP subsystem must be satisfied, which is defined by CPSubsystemConfig.getCPMemberCount(). For instance, if CPSubsystemConfig.getCPMemberCount() is 5 and only 1 CP member is alive when restart() is called, 4 additional regular Hazelcast members must exist in the cluster. New Hazelcast members can be started to satisfy CPSubsystemConfig.getCPMemberCount().
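The member-count requirement is simple arithmetic and can be checked before calling restart(). The sketch below is only an illustration of that check: the class and method names are hypothetical, and the two inputs are assumed to come from CPSubsystemConfig.getCPMemberCount() and from the current Hazelcast member list.

```java
/** Hypothetical pre-check for illustration only; not part of the Hazelcast API. */
final class RestartPrecondition {

    /** True when the cluster has at least as many members as the configured CP member count. */
    static boolean canRestart(int configuredCpMemberCount, int currentClusterSize) {
        return currentClusterSize >= configuredCpMemberCount;
    }

    /** Number of Hazelcast members that still have to be started before a restart can succeed. */
    static int additionalMembersNeeded(int configuredCpMemberCount, int currentClusterSize) {
        if (configuredCpMemberCount < 0 || currentClusterSize < 0) {
            throw new IllegalArgumentException("counts must not be negative");
        }
        return Math.max(0, configuredCpMemberCount - currentClusterSize);
    }

    public static void main(String[] args) {
        // The example from the text: 5 configured CP members, 1 member alive.
        System.out.println("members to start: " + additionalMembersNeeded(5, 1));
        System.out.println("restart possible: " + canRestart(5, 1));
    }
}
```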
There is a subtle point about graceful shutdown of CP members. If there are N CP members in the cluster, HazelcastInstance.shutdown() can be called on N-2 CP members concurrently. Once these N-2 CP members complete their shutdown, the remaining 2 CP members must be shut down serially. Even though the shutdown API is called concurrently on multiple members, the Metadata CP group handles shutdown requests serially. Therefore, it would be simpler to shut down CP members one by one, by calling HazelcastInstance.shutdown() on the next CP member once the current CP member completes its shutdown.
The reason behind this limitation is that each shutdown request internally requires a Raft commit to the Metadata CP group. A CP member proceeds to shutdown after it receives a response to its commit to the Metadata CP group. To be able to perform a Raft commit, the Metadata CP group must have its majority available. When there are only 2 CP members left after graceful shutdowns, the majority of the Metadata CP group becomes 2. If the last 2 CP members shut down concurrently, one of them is likely to perform its Raft commit faster than the other one and leave the cluster before the other CP member completes its Raft commit. In this case, the last CP member waits for a response to its commit attempt on the Metadata group, and eventually times out. This situation causes an unnecessary delay in the shutdown process of the last CP member. On the other hand, when the last 2 CP members shut down serially, the (N-1)th member receives the response to its commit after its shutdown request is also committed on the last CP member. Then, the last CP member checks its local data, notices that it is the last CP member alive, and proceeds with its shutdown without attempting a Raft commit on the Metadata CP group.
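The N-2 rule above can be expressed as a shutdown plan made of "waves": members in the same wave may be shut down concurrently, and each wave starts only after the previous one has completed. The following is a minimal sketch of such a planner; the class and method names are hypothetical and it does not call any Hazelcast API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical planner for illustration only; not part of the Hazelcast API. */
final class CpShutdownPlan {

    /**
     * Splits the CP members into waves. The first wave holds N-2 members that may be
     * shut down concurrently; each of the last 2 members gets a wave of its own.
     */
    static <T> List<List<T>> plan(List<T> cpMembers) {
        int n = cpMembers.size();
        int concurrent = Math.max(0, n - 2);
        List<List<T>> waves = new ArrayList<>();
        if (concurrent > 0) {
            waves.add(new ArrayList<>(cpMembers.subList(0, concurrent)));
        }
        // The remaining members (at most 2) are shut down one after another.
        for (int i = concurrent; i < n; i++) {
            List<T> single = new ArrayList<>();
            single.add(cpMembers.get(i));
            waves.add(single);
        }
        return waves;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("cp1", "cp2", "cp3", "cp4", "cp5");
        System.out.println(plan(members));
    }
}
```

A caller would invoke HazelcastInstance.shutdown() on every member of a wave and wait for all of them to finish before moving on to the next wave.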
See Also:
CPMember, CPSubsystemConfig
Modifier and Type | Method and Description |
---|---|
boolean | awaitUntilDiscoveryCompleted(long timeout, TimeUnit timeUnit): Blocks until CP discovery process is completed, or the timeout occurs, or the current thread is interrupted, whichever happens first. |
ICompletableFuture<Void> | forceDestroyCPGroup(String groupName): Unconditionally destroys the given active CP group without using the Raft algorithm mechanics. |
ICompletableFuture<CPGroup> | getCPGroup(String name): Returns the active CP group with the given name. |
ICompletableFuture<Collection<CPGroupId>> | getCPGroupIds(): Returns all active CP group ids. |
ICompletableFuture<Collection<CPMember>> | getCPMembers(): Returns the current list of CP members. |
CPMember | getLocalCPMember(): Returns the local CP member if this Hazelcast member is part of the CP subsystem, returns null otherwise. |
boolean | isDiscoveryCompleted(): Returns whether CP discovery process is completed or not. |
ICompletableFuture<Void> | promoteToCPMember(): Promotes the local Hazelcast member to a CP member. |
ICompletableFuture<Void> | removeCPMember(String cpMemberUuid): Removes the given unreachable CP member from the active CP members list and all CP groups it belongs to. |
ICompletableFuture<Void> | restart(): Wipes and resets the whole CP subsystem and initializes it as if the Hazelcast cluster is starting up initially. |
CPMember getLocalCPMember()
This field is initialized when the local Hazelcast member is one of the first CPSubsystemConfig.getCPMemberCount() members in the cluster and the CP subsystem discovery process is completed.
Fails with HazelcastException if the CP subsystem is not enabled.
Throws:
HazelcastException - if the CP subsystem is not enabled
See Also:
isDiscoveryCompleted(), awaitUntilDiscoveryCompleted(long, TimeUnit)
ICompletableFuture<Collection<CPGroupId>> getCPGroupIds()
ICompletableFuture<CPGroup> getCPGroup(String name)
ICompletableFuture<Void> forceDestroyCPGroup(String groupName)
After a CP group is force-destroyed, the CP data structure proxies that talk to it fail with CPGroupDestroyedException.
Once a CP group is destroyed, it can be created again with a new set of CP members.
This method is idempotent. It has no effect if the given CP group is already destroyed.
ICompletableFuture<Collection<CPMember>> getCPMembers()
ICompletableFuture<Void> promoteToCPMember()
This method is idempotent. If the local member is already in the active CP members list, then this method has no effect. When the current member is promoted to a CP member, its member UUID is assigned as CP member UUID.
Once the returned Future object is completed, the promoted CP member has been added to the CP groups that have missing members, i.e., whose size is smaller than CPSubsystemConfig.getGroupSize().
If the local member is currently being removed from the active CP members list, then the returned Future object will throw IllegalArgumentException.
If there is an ongoing membership change in the CP subsystem when this method is invoked, then the returned Future object throws IllegalStateException.
If the CP subsystem initial discovery process has not completed when this method is invoked, then the returned Future object throws IllegalStateException.
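Because an ongoing membership change or an unfinished discovery process is a transient condition, one possible reaction to IllegalStateException is a bounded retry. The sketch below is an assumption-laden illustration, not a recommended Hazelcast pattern: PromoteWithRetry is a hypothetical helper, the promotion call is passed in as a Callable so the example has no Hazelcast dependency, and it assumes the failure arrives as the cause of an ExecutionException from Future.get(), as is usual for java.util.concurrent.Future.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical helper for illustration only; not part of the Hazelcast API. */
final class PromoteWithRetry {

    /**
     * Invokes the given call and waits for its future, retrying while the future fails
     * with IllegalStateException. Returns the number of attempts that were made.
     * Any other failure, or exhausting maxAttempts, rethrows the ExecutionException.
     */
    static int promote(Callable<Future<Void>> promoteCall, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                promoteCall.call().get();
                return attempt;
            } catch (ExecutionException e) {
                // IllegalStateException is also documented for lite members, which a retry
                // cannot fix; the attempt limit keeps that case from looping forever.
                boolean retryable = e.getCause() instanceof IllegalStateException;
                if (!retryable || attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated promotion: fails once with IllegalStateException, then succeeds.
        int[] calls = {0};
        int attempts = promote(() -> {
            CompletableFuture<Void> f = new CompletableFuture<>();
            if (calls[0]++ == 0) {
                f.completeExceptionally(new IllegalStateException("membership change in progress"));
            } else {
                f.complete(null);
            }
            return f;
        }, 3, 10L);
        System.out.println("succeeded after " + attempts + " attempt(s)");
    }
}
```

With a real service reference, the call would be passed as a lambda that invokes promoteToCPMember() on it.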
Throws:
IllegalArgumentException - If the local member is currently being removed from the active CP members list
IllegalStateException - If there is an ongoing membership change in the CP subsystem
IllegalStateException - If local member is a lite-member
ICompletableFuture<Void> removeCPMember(String cpMemberUuid)
Before removing a CP member from the CP subsystem, please make sure that it is declared as unreachable by Hazelcast's failure detector and removed from Hazelcast's member list. The behavior is undefined when a running CP member is removed from the CP subsystem.
Throws:
IllegalStateException - When another CP member is being removed from the CP subsystem
IllegalArgumentException - if the given CP member is already removed from the CP member list
ICompletableFuture<Void> restart()
After this method is called, all CP state and data are wiped and the CP members start with empty state.
This method can be invoked only from the Hazelcast master member.
This method must not be called while there are membership changes in the cluster. Before calling this method, please make sure that there is no new member joining and all existing Hazelcast members have seen the same member list.
Use with caution: This method is NOT idempotent and multiple invocations can break the whole system! After calling this API, you must observe the system to see if the restart process is successfully completed or failed before making another call.
Throws:
IllegalStateException - When this method is called on a Hazelcast member that is not the Hazelcast cluster master
IllegalStateException - if current member count of the cluster is smaller than CPSubsystemConfig.getCPMemberCount()
boolean isDiscoveryCompleted()
Returns:
true if CP discovery completed, false otherwise
See Also:
awaitUntilDiscoveryCompleted(long, TimeUnit)
boolean awaitUntilDiscoveryCompleted(long timeout, TimeUnit timeUnit) throws InterruptedException
Parameters:
timeout - maximum time to wait
timeUnit - time unit of the timeout
Returns:
true if CP discovery completed, false otherwise
Throws:
InterruptedException - if interrupted while waiting
See Also:
isDiscoveryCompleted()
Copyright © 2022 Hazelcast, Inc. All Rights Reserved.