Skip to content

Port providers to New Failover WIP#8532

Draft
justin-stephenson wants to merge 18 commits intoSSSD:masterfrom
justin-stephenson:failover_justin_v1_1
Draft

Port providers to New Failover WIP#8532
justin-stephenson wants to merge 18 commits intoSSSD:masterfrom
justin-stephenson:failover_justin_v1_1

Conversation

@justin-stephenson
Copy link
Contributor

No description provided.

pbrezina and others added 9 commits March 17, 2026 13:45
so it can be directly modified
So it can be modified later.
This crafts and implements the new failover interface,
it does not provide complete implementation of the failover
mechanism yet. It brings the code to a state were the public
and private interfaces are stable, working and testable so
the following tasks can be split and work on in parallel.

What is missing at this state:
- server configuration and discovery
  (failover_server_group/batch/vtable_op)
- server selection mechanism (sss_failover_vtable_op_server_next)
- kerberos authentication
- sharing servers between IPA/AD LDAP and KDC
- online/offline callbacks (resolve callback should not be needed)

But especially it is possible to start refactoring SSSD code to start
using the new failover implementation.
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new failover mechanism for SSSD providers, exemplified by a new 'minimal' provider. The changes involve significant refactoring of error handling in existing LDAP and AD providers, removing dp_error parameters and related macros, and integrating a new sss_failover_transaction for connection management. The new failover logic is also integrated into various LDAP and IPA components for tasks like account info, enumeration, and sudo rules. However, several critical issues were identified: test hooks that intentionally trigger server failures are present in production code paths (ldap_id.c, minimal_id_services.c) and should be removed or guarded. Additionally, key failover options, DNS discovery domains, and server URIs are hardcoded (failover.c, failover_group.c, failover_srv.c, ldap_init.c), limiting configurability and dynamic behavior, which is unsuitable for a production-ready feature.


count = talloc_array_length(fctx->groups);

for (slot = 0; fctx->groups[slot] != NULL && slot < count; slot++) {

Check failure

Code scanning / CodeQL

Array offset used before range check High

This use of offset 'slot' should follow the
range check
.
}

if (state->sdap_ret == ENOENT && state->noexist_delete == true) {
if (ret == ENOENT && state->noexist_delete == true) {

Check warning

Code scanning / CodeQL

Comparison result is always the same Warning

Comparison is always false because ret <= 0.
Comment on lines +3091 to +3093
/* FIXME: Placeholder to be updated when
* AD is moved to new Failover
* Set to state->conn */

Check notice

Code scanning / CodeQL

FIXME comment Note

FIXME comment: Placeholder to be updated when [...]
@justin-stephenson justin-stephenson force-pushed the failover_justin_v1_1 branch 8 times, most recently from d56510c to 6d953d4 Compare March 18, 2026 13:27
Assisted by: Claude code (Sonnet 4.6)
The data provider handler methods now return a single output
argument. Remove 'dp_error/dp_err' and 'error_message' usage
across provider code.

The getAccountDomain method still needs to return 'domain_name' string.

Assisted by: Claude code (Sonnet 4.6)
@justin-stephenson justin-stephenson force-pushed the failover_justin_v1_1 branch 4 times, most recently from 8540239 to 410fced Compare March 19, 2026 15:04
To allow system tests to run in upstream PRCI
** Temporary commit as each provider is ported to new failover **

These provider tests will often fail because they have not yet been ported
to the new failover code, and they call into ldap functions which
are utilizing the new failover. For example crash in :

 #0  0x00007fdb8d72cb41 in sss_failover_transaction_ex_send () from /usr/lib64/sssd/libsss_ldap_common.so
 #1  0x00007fdb8d72ccbd in sss_failover_transaction_send () from /usr/lib64/sssd/libsss_ldap_common.so
 #2  0x00007fdb8d6e367b in groups_by_user_send () from /usr/lib64/sssd/libsss_ldap_common.so
 #3  0x00007fdb8d6e6a88 in sdap_handle_acct_req_send () from /usr/lib64/sssd/libsss_ldap_common.so
 #4  0x00007fdb8d799076 in ipa_id_get_account_info_get_original_step () from /usr/lib64/sssd/libsss_ipa.so
 #5  0x00007fdb8d7a038b in ipa_id_get_account_info_send () from /usr/lib64/sssd/libsss_ipa.so
 #6  0x00007fdb8d7a2560 in ipa_account_info_send () from /usr/lib64/sssd/libsss_ipa.so
 #7  0x00007fdb8d7a2877 in ipa_account_info_handler_send () from /usr/lib64/sssd/libsss_ipa.so
* test_failover tests are expected to fail due to failover interface changes.

* test_logging 'offline' tests and test_autofs__propagate_offline_status_for_multiple_domains
fail because they are asserting offline related log messages which are not in the new
failover code.

tests/test_autofs.py::test_autofs__propagate_offline_status_for_multiple_domains (ldap) FAILED [ 35%]
tests/test_failover.py::test_failover__reactivation_timeout_is_honored[None-31] (ldap) FAILED [ 39%]
tests/test_failover.py::test_failover__reactivation_timeout_is_honored[15-31] (ldap) FAILED [ 39%]
tests/test_failover.py::test_failover__reactivation_timeout_is_honored[60-60] (ldap) FAILED [ 39%]
tests/test_failover.py::test_failover__connect_using_ipv4_second_family (ldap) FAILED [ 40%]
tests/test_logging.py::test_logging__default_settings_logs_offline_errors (ldap) FAILED [ 60%]
tests/test_logging.py::test_logging__default_settings_logs_to_syslog_when_ldap_is_offline (ldap) FAILED [ 60%]
@justin-stephenson justin-stephenson changed the title Test only WIP Port providers to New Failover WIP Mar 19, 2026
@pbrezina
Copy link
Member

I pushed the failover feature branch, can you please rebase and use it as the base branch?

https://github.com/SSSD/sssd/tree/failover

Copy link
Member

@pbrezina pbrezina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Justin, it's a lot of code and you didn't want a full review yet so I just scrolled over it.

It looks good. I have some concerns about the first two patches that converts the interface to a single return code - it looks like some important logic was removed as well at few places.

And the sdap_id_op code is responsible of terminating the retry logic and bringing SSSD offline. I think it should, at some point, return ERR_NO_MORE_SERVER. And the LDAP search code should return ERR_SERVER_FAILURE so it can be picked up by sss_failover_transaction.

ret = sdap_id_op_done(state->sdap_op, ret, &dp_error);
ret = sdap_id_op_done(state->sdap_op, ret);
if (ret != EOK) {
if (dp_error == DP_ERR_OK) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like some retry logic was accidentally removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and here the call to sdap_id_op_done() is removed entirely in a later commit in this PR.

To fix this should I create a function that performs the same logic as sdap_id_op_done() but returns the ERR_* codes such as ERR_NO_MORE_SERVER and ERR_SERVER_FAILURE instead?

dp_err = DP_ERR_FATAL;
}

if (dp_err == DP_ERR_OK && retval != EOK) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic removed?

sdap_account_info_handler_send, sdap_account_info_handler_recv, id_ctx,
struct sdap_id_ctx, struct dp_id_data, struct dp_reply_std);
sdap_account_info_handler_send, sdap_account_info_handler_recv, init_ctx,
struct ldap_init_ctx, struct dp_id_data, struct dp_reply_std);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimal_init_ctx was created to do minimal changes required for the minimal provider, without completely refactoring it. Here, you should refactor the id_ctx to hold the necessary data (failover contexts) and remove old failover bits.

@pbrezina
Copy link
Member

The end goal is to get rid of sdap_id_op code completely. Maybe you could remove the file and avoid building ipa and ad providers for now to make sure we do not depend on anything from sdap_id_op?

@justin-stephenson
Copy link
Contributor Author

Hi Justin, it's a lot of code and you didn't want a full review yet so I just scrolled over it.

Thank you.

It looks good. I have some concerns about the first two patches that converts the interface to a single return code - it looks like some important logic was removed as well at few places.

Functions below contain logic checking the values of dp_error or sdap_ret, I could use some suggestion on how to retain this logic alongside the changes in this PR.

 sdap_ad_resolve_sids_done()
 ad_gpo_target_dn_retrieval_done()
 groups_by_user_done()
 get_user_and_group_users_done()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants