Helix fails to connect with Kerberos enabled ZK #3102

Merged
junkaixue merged 1 commit into apache:master from arshadmohammad:zk_sasl_master
Feb 27, 2026

@arshadmohammad
Contributor

Issues

Description

Refer to #3101 for details on the issue.

Tests

  • Verified the changes through unit test cases
  • Verified the change through the Quickstart sample app
    In the Quickstart sample app, I enabled ZooKeeper Kerberos authentication and verified the fix.
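
For anyone reproducing this check, a minimal sketch of turning on client-side SASL for the ZooKeeper client from Java follows. It relies on the standard JAAS and ZooKeeper client system properties. The class name, the JAAS file path, and the contents of the `Client` section are hypothetical, since the PR does not show the setup that was actually used.

```java
// Hypothetical helper; not part of Helix or of this PR.
final class KerberosClientSetup {

    // Points the JVM at a JAAS file and enables ZooKeeper client SASL.
    // Must run before the first ZooKeeper/Helix client is created.
    static void configure(String jaasConfigPath) {
        // Standard JAAS property read by the JVM login machinery.
        System.setProperty("java.security.auth.login.config", jaasConfigPath);
        // ZooKeeper client properties: enable SASL and name the JAAS section to use.
        System.setProperty("zookeeper.sasl.client", "true");
        System.setProperty("zookeeper.sasl.clientconfig", "Client");
    }

    public static void main(String[] args) {
        // Assumed path; the file would hold a "Client { ... Krb5LoginModule ... };" section.
        configure("/etc/helix/zk-client-jaas.conf");
        System.out.println(System.getProperty("zookeeper.sasl.clientconfig"));
    }
}
```

The same properties can be passed as `-D` flags on the JVM command line instead of being set in code.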

Quickstart Output Before Fix:
Creating cluster: HELIX_QUICKSTART
Adding 2 participants to the cluster
Added participant: localhost_12000
Added participant: localhost_12001
Configuring StateModel: MyStateModel with 1 Leader and 1 Standby
Adding a resource MyResource: with 6 partitions and 2 replicas
Starting Participants
ERROR ZKHelixManager zkClient is not connected after waiting 10000ms., clusterName: HELIX_QUICKSTART, zkAddress: sl73tskrapd1044.visa.com:2181
ERROR ZKHelixManager fail to createClient. retry 1
org.apache.helix.HelixException: HelixManager is not connected within retry timeout for cluster HELIX_QUICKSTART
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:417)
at org.apache.helix.manager.zk.ZKHelixManager.getConfigAccessor(ZKHelixManager.java:688)
at org.apache.helix.manager.zk.ParticipantManager.<init>(ParticipantManager.java:118)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1441)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1391)
at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:783)
at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:818)
at org.apache.helix.examples.Quickstart$MyProcess.start(Quickstart.java:247)
at org.apache.helix.examples.Quickstart.startNodes(Quickstart.java:146)
at org.apache.helix.examples.Quickstart.main(Quickstart.java:164)
ERROR ZKHelixManager fail to createClient. retry 2
org.apache.helix.zookeeper.zkclient.exception.ZkTimeoutException: Waiting to be connected to ZK server has timed out.
at org.apache.helix.zookeeper.zkclient.ZkClient.waitForEstablishedSession(ZkClient.java:2082)
at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:776)
at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:818)
at org.apache.helix.examples.Quickstart$MyProcess.start(Quickstart.java:247)
at org.apache.helix.examples.Quickstart.startNodes(Quickstart.java:146)
at org.apache.helix.examples.Quickstart.main(Quickstart.java:164)

Quickstart Output After Fix:

Creating cluster: HELIX_QUICKSTART
Adding 2 participants to the cluster
Added participant: localhost_12000
Added participant: localhost_12001
Configuring StateModel: MyStateModel with 1 Leader and 1 Standby
Adding a resource MyResource: with 6 partitions and 2 replicas
Starting Participants
Started Participant: localhost_12000
Started Participant: localhost_12001
Starting Helix Controller
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_1
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_4
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_3
LeaderStandbyStateModel.onBecomeStandbyFromOffline():localhost_12000 transitioning from OFFLINE to STANDBY for MyResource MyResource_5

  • The following tests are written for this issue:

org.apache.helix.zookeeper.impl.client.TestRawZkClient
#testWaitForKeeperStateWithSaslAuthenticated
#testWaitForKeeperStateWithConnectedReadOnly
#testWaitForKeeperStateWithOtherStates
#testWaitForKeeperStateExactMatchStillWorks
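
The test names suggest that waiting for a connected state now also accepts the SASL-authenticated and read-only connected states, while other states still do not match and exact matches behave as before. The sketch below illustrates that reading only. The PR diff is not shown here, so the class, enum, and method are hypothetical stand-ins rather than the actual ZkClient code; the enum mirrors a subset of ZooKeeper's `Watcher.Event.KeeperState` values.

```java
// Hypothetical illustration; the real change lives in Helix's ZkClient and may differ.
final class KeeperStateMatch {

    // Stand-in for a subset of org.apache.zookeeper.Watcher.Event.KeeperState.
    enum State { Disconnected, SyncConnected, AuthFailed, ConnectedReadOnly, SaslAuthenticated, Expired, Closed }

    // True when the session is usable: plain, read-only, or SASL-authenticated connection.
    static boolean isConnected(State s) {
        return s == State.SyncConnected
            || s == State.ConnectedReadOnly
            || s == State.SaslAuthenticated;
    }

    // Decides whether a wait for `expected` is satisfied by the `current` state.
    static boolean satisfies(State expected, State current) {
        if (expected == current) {
            return true; // exact match keeps working for every state
        }
        // Assumed relaxation: a wait for SyncConnected accepts any connected variant.
        return expected == State.SyncConnected && isConnected(current);
    }

    public static void main(String[] args) {
        System.out.println(satisfies(State.SyncConnected, State.SaslAuthenticated));
        System.out.println(satisfies(State.SyncConnected, State.Disconnected));
    }
}
```

Under this assumption, a strict equality check would explain the timeout in the "before" output: with Kerberos enabled, the client can end up in the SASL-authenticated state, which never equals `SyncConnected`.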

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

(Consider including all behavior changes for public methods or API. Also include these changes in merge description so that other developers are aware of these changes. This allows them to make relevant code changes in feature branches accounting for the new method/API behavior.)

Documentation (Optional)

  • In case of new functionality, my PR adds documentation in the following wiki page:

(Link the GitHub wiki you added)

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@arshadmohammad
Contributor Author

@junkaixue, could you please review this PR?

@arshadmohammad force-pushed the zk_sasl_master branch 2 times, most recently from 4f8412f to 70f0516 (February 25, 2026 18:57)
@arshadmohammad
Contributor Author

The failure in testEvacuateWithDisabledPartition(org.apache.helix.integration.rebalancer.TestInstanceOperation) appears unrelated to this change. I will retrigger the build.

Contributor

@junkaixue left a comment

Overall, lgtm! Thanks for your contribution!

@arshadmohammad
Contributor Author

> The failure in testEvacuateWithDisabledPartition(org.apache.helix.integration.rebalancer.TestInstanceOperation) appears unrelated to this change. I will retrigger the build.

The build passed successfully this time.

@junkaixue
Contributor

@arshadmohammad please follow the check-in steps. We need the author to confirm that the PR is ready to check in and that no more changes are planned.

@arshadmohammad
Contributor Author

I confirm that this PR is ready for check-in and no further changes are required.

@junkaixue merged commit 89511bd into apache:master on Feb 27, 2026
2 checks passed
@arshadmohammad
Contributor Author

Thanks, @junkaixue, for reviewing and merging the PR.

Should we also merge this change into the helix-1.3.x branch? If so, I can raise a PR for it.

@arshadmohammad
Contributor Author

The changes in this PR are also fully applicable to the helix-1.3.x branch.

@vishalsuvagia

Thank you, @arshadmohammad, for the fix. Hopefully this should help resolve the issue observed in #3071.
@junkaixue, any thoughts or plans for a release with this fix included?

@junkaixue
Contributor

@arshadmohammad @vishalsuvagia if you feel there is a need for a 1.3.x release, please send a request to dev@helix.apache.org. We can process the release with the backported change.

Development

Successfully merging this pull request may close these issues.

Helix fails to connect with Kerberos enabled ZK

3 participants