High availability problems in clusters

There are several reasons why you might experience problems and unexpected behavior when you configure high availability (HA) in clusters of Eclipse Amlen servers.

If you encounter problems when you configure or use HA in a cluster, check the following items:

The cluster status and server status on both servers in the HA pair.
The cluster membership configuration on both servers in the HA pair.

Error scenario 1: Did you attempt to enable clustering on an HA pair, and now one or both of your HA servers are in maintenance mode?

Check the server status and HA status of your servers. Use the Eclipse Amlen REST API GET method with the following Eclipse Amlen service URI:

http://<Server-IP:Port>/ima/v1/service/status

Check the ErrorCode and ErrorMessage fields in the Server section, and the ReasonCode and ReasonString fields in the HighAvailability section, of the status information that is returned. The following excerpt of status information shows these fields.

{
  "Version": "v1",
  "Server": {
    "Name": "examplesystem01.com:9089",
    "UID": "lz3Qj3Kd",
    "Status": "Running",
    "State": 9,
    "StateDescription": "Running (maintenance)",
    "ServerTime": "2016-04-13T13:32:28.546Z",
    "UpTimeSeconds": 94,
    "UpTimeDescription": "0 days 0 hours 1 minutes 34 seconds",
    "Version": "2.0 20160413-1109",
    "ErrorCode": 509,
    "ErrorMessage": "Store High-Availability error."
  },
  "Container": {
    "UUID": "bb41d6d23772d9062d1eb7c7fe6864246bafae565b7ecae32972492e63c61006"
  },
  "HighAvailability": {
    "Status": "Active",
    "Enabled": true,
    "Group": "mygroup01",
    "NewRole": "UNSYNC_ERROR",
    "OldRole": "UNSYNC",
    "ActiveNodes": 1,
    "SyncNodes": 0,
    "PrimaryLastTime": "",
    "PctSyncCompletion": -1,
    "ReasonCode": 1,
    "ReasonString": "Cluster.EnableClusterMembership - CONFIG_ERROR",
    "RemoteServerName": ""

A possible cause of this error condition is that cluster membership was enabled on the primary server in the HA pair but only one of the servers in the HA pair was restarted.

Restart both servers in the HA pair at the same time.
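
The status check in this scenario can be scripted. The following Python code is a minimal sketch, not a definitive implementation: it assumes that the admin endpoint is reachable over plain HTTP without authentication, and the function names fetch_status and significant_fields are hypothetical helpers introduced for illustration only.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url, timeout=10):
    """GET <base_url>/ima/v1/service/status and return the parsed JSON.

    base_url is assumed to look like "http://<Server-IP:Port>".
    """
    with urlopen(f"{base_url}/ima/v1/service/status", timeout=timeout) as resp:
        return json.load(resp)


def significant_fields(status):
    """Pick out the fields that this topic asks you to check.

    Missing sections or fields are reported as None instead of raising.
    """
    server = status.get("Server", {})
    ha = status.get("HighAvailability", {})
    return {
        "StateDescription": server.get("StateDescription"),
        "ErrorCode": server.get("ErrorCode"),
        "ErrorMessage": server.get("ErrorMessage"),
        "NewRole": ha.get("NewRole"),
        "ReasonCode": ha.get("ReasonCode"),
        "ReasonString": ha.get("ReasonString"),
    }
```

To use the sketch, call significant_fields(fetch_status(url)) once for each server in the HA pair and compare the two results side by side.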

Error scenario 2: Did you attempt to enable clustering on an HA pair, and now, after restarting servers, your HA servers are in maintenance mode?

Check the status of your servers. Use the Eclipse Amlen REST API GET method with the following Eclipse Amlen service URI:

http://<Server-IP:Port>/ima/v1/service/status

On each server, check the ErrorCode and ErrorMessage fields in the Server section of the status information that is returned. The following examples show the status information from each server in the HA pair.

On the server that was the primary server in the HA pair:

{
    "Version": "v1",
    "Server": {
        "Name": "examplesystem01:9089",
        "UID": "DnAUsuJb",
        "Status": "Running",
        "State": 9,
        "StateDescription": "Running (maintenance)",
        "ServerTime": "2016-04-13T13:20:40.702Z",
        "UpTimeSeconds": 515,
        "UpTimeDescription": "0 days 0 hours 8 minutes 35 seconds",
        "Version": "2.0 20160413-1109",
        "ErrorCode": 509,
        "ErrorMessage": "Store High-Availability error."
    },
    "Container": {
        "UUID": "bb41d6d23772d9062d1eb7c7fe6864246bafae565b7ecae32972492e63c61006"
    },
    "HighAvailability": {
        "Status": "Active",
        "Enabled": true,
        "Group": "mygroup02",
        "NewRole": "UNSYNC_ERROR",
        "OldRole": "UNSYNC",
        "ActiveNodes": 1,
        "SyncNodes": 0,
        "PrimaryLastTime": "2016-04-13T13:05:02Z",
        "PctSyncCompletion": -1,
        "ReasonCode": 2,
        "ReasonString": " - DISCOVERY_TIMEOUT",
        "RemoteServerName": ""
    },
    "Cluster": {
        "Status": "Initializing",
        "Name": "MyCluster",
        "Enabled": true,
        "ConnectedServers": 0,
        "DisconnectedServers": 0
    },
    "Plugin": {
        "Status": "Inactive",
        "Enabled": false
    },
    "MQConnectivity": {
        "Status": "Inactive",
        "Enabled": false
    },
    "SNMP": {
        "Status": "Inactive",
        "Enabled": false
    }
}

On the standby server:

{
    "Version": "v1",
    "Server": {
        "Name": "examplesystem02:9089",
        "UID": "DnAUsuJb",
        "Status": "Running",
        "State": 9,
        "StateDescription": "Running (maintenance)",
        "ServerTime": "2016-04-13T19:22:50.403Z",
        "UpTimeSeconds": 958,
        "UpTimeDescription": "0 days 0 hours 15 minutes 58 seconds",
        "Version": "2.0 20160413-1109",
        "ErrorCode": 112,
        "ErrorMessage": "The property value is not valid: Property: Cluster.ControlAddress Value: \"NULL\"."
    },
    "Container": {
        "UUID": "b308915aa0525a62eaf70a8f5c08b508153caac4e6d1200eb0cd9d53396c8c62"
    },
    "HighAvailability": {
        "Status": "Active",
        "Enabled": true,
        "Group": "mygroup02",
        "NewRole": "UNSYNC",
        "OldRole": "UNSYNC",
        "ActiveNodes": 0,
        "SyncNodes": 0,
        "PrimaryLastTime": "",
        "PctSyncCompletion": 0,
        "ReasonCode": 0,
        "RemoteServerName": ""
    },
    "Cluster": {
        "Status": "Unavailable",
        "Enabled": true
    },
    "Plugin": {
        "Status": "Inactive",
        "Enabled": false
    },
    "MQConnectivity": {
        "Status": "Inactive",
        "Enabled": false
    },
    "SNMP": {
        "Status": "Inactive",
        "Enabled": false
    }
}

In this scenario, no value for the cluster control address was specified on the standby server before cluster membership was enabled, which is why the standby server reports error 112 for Cluster.ControlAddress. A similar error can occur if the cluster messaging address is not specified.

Ensure that values for control address and messaging address are specified on both members of the HA pair before you enable them for cluster membership.

Restart both servers in the HA pair.
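
The address check described in this scenario can be automated before you enable cluster membership. The following Python code is a sketch under stated assumptions: it assumes that you have already retrieved the cluster membership settings of each server as a dictionary with ControlAddress and MessagingAddress keys, and that an unset value appears as a missing key, an empty string, or the string "NULL" (as in the error message above). The function names are hypothetical.

```python
# Properties that this topic says must be set on both servers.
REQUIRED_ADDRESSES = ("ControlAddress", "MessagingAddress")


def missing_addresses(cluster_membership):
    """Return the required address properties that have no usable value."""
    missing = []
    for name in REQUIRED_ADDRESSES:
        value = cluster_membership.get(name)
        # Assumption: None, "" and "NULL" all mean "not specified".
        if value is None or str(value).strip() in ("", "NULL"):
            missing.append(name)
    return missing


def check_pair(primary, standby):
    """Map each server label to its missing properties.

    An empty result means both servers have both addresses specified.
    """
    report = {}
    for label, config in (("primary", primary), ("standby", standby)):
        gaps = missing_addresses(config)
        if gaps:
            report[label] = gaps
    return report
```

If check_pair returns a non-empty dictionary, specify the missing addresses on the listed servers before you enable cluster membership.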

Error scenario 3: Did you attempt to disable clustering on an HA pair, and now, after restarting servers, your HA servers are in maintenance mode?

Check the status of your servers. Use the Eclipse Amlen REST API GET method with the following Eclipse Amlen service URI:

http://<Server-IP:Port>/ima/v1/service/status

In this scenario, you are likely to see information similar to that described in Error scenario 1, and you are likely to see it on both of your servers. The significant server status fields and values are:

"ErrorCode": 509,
"ErrorMessage": "Store High-Availability error."

The significant HA status fields and values are:

"ReasonCode": 1,
"ReasonString": "Cluster.EnableClusterMembership - CONFIG_ERROR",

A possible cause of this error condition is that cluster membership was disabled on the primary server in the HA pair while the standby server was inactive.

Disable cluster membership on both servers in the HA pair. Restart both servers.
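
Because this scenario usually affects both servers, it can help to confirm which servers show the signature described above. The following Python code is an illustrative sketch: it assumes that you have already obtained the status document of each server as a dictionary, and the function names and server labels are hypothetical.

```python
def has_membership_config_error(status):
    """True if the status shows error 509 together with the
    Cluster.EnableClusterMembership CONFIG_ERROR reason."""
    server = status.get("Server", {})
    ha = status.get("HighAvailability", {})
    reason = ha.get("ReasonString") or ""
    return (
        server.get("ErrorCode") == 509
        and ha.get("ReasonCode") == 1
        and "Cluster.EnableClusterMembership" in reason
    )


def affected_servers(statuses):
    """statuses maps a server label to its status document.

    Returns the labels of the servers that show the signature.
    """
    return [name for name, st in statuses.items()
            if has_membership_config_error(st)]
```

If both labels are returned, the remedy above applies to the whole pair: disable cluster membership on both servers, then restart both.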

Error scenario 4: "ReasonString": "Store.TotalMemSizeMB - CONFIG_ERROR" is issued

"ReasonString": "Store.TotalMemSizeMB - CONFIG_ERROR" in the HA status indicates that the store memory configuration of the two nodes does not match and that, consequently, the nodes cannot form an HA pair. This error can arise, for example, when one node runs in a Docker container that has a controlled memory configuration and the other node is installed as an RPM on the host OS, where it uses all the memory that is available on the machine.

It is best practice to ensure that the two nodes in an HA pair are identical, particularly with regard to the amount of memory that is available to them.
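
The ReasonString values in this topic follow the pattern "property - reason", where the property part can be empty (as in " - DISCOVERY_TIMEOUT"). The following Python code is a small sketch that splits such a string and looks up a hint. The pattern is inferred only from the examples in this topic, and the function names and hint texts are hypothetical summaries of the scenarios above, not product output.

```python
# Hints summarized from the scenarios in this topic (not product messages).
HINTS = {
    ("Cluster.EnableClusterMembership", "CONFIG_ERROR"):
        "Make the cluster membership setting the same on both servers, "
        "then restart both servers.",
    ("Store.TotalMemSizeMB", "CONFIG_ERROR"):
        "Give both nodes in the HA pair the same amount of memory.",
}
DEFAULT_HINT = "Check the server status on both servers in the HA pair."


def parse_reason(reason_string):
    """Split 'Property - REASON' into (property, reason).

    The property part is returned as '' when it is absent.
    """
    prop, sep, reason = reason_string.partition(" - ")
    if not sep:
        # No separator found: treat the whole string as the reason.
        return "", reason_string.strip()
    return prop.strip(), reason.strip()


def hint_for(reason_string):
    """Return the hint for a ReasonString, or a generic fallback."""
    return HINTS.get(parse_reason(reason_string), DEFAULT_HINT)
```

For example, parse_reason("Store.TotalMemSizeMB - CONFIG_ERROR") yields the property Store.TotalMemSizeMB and the reason CONFIG_ERROR.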