Discussion:
slapd-meta doesn't continue with multiple URIs
Liam Gretton
2012-08-14 13:02:01 UTC
Permalink
I've been trying to get slapd-meta to failover using multiple URIs but
can't get it to work.

Initially I was using 2.4.26, but having seen the report in ITS#7050
I've now built 2.4.32 but the problem is still there as far as I can
tell. This bug was quashed in 2.4.29 according to the change log.

In the example below, if host1 is not contactable at the point a search
is performed, host2 will be contacted and the result returned correctly,
but ldapsearch then hangs indefinitely and the server's debug (level 1)
output spews the following messages endlessly:


ldap_sasl_bind
ldap_send_initial_request
ldap_int_poll: fd: 10 tm: 0
502a4634 conn=1001 op=1 <<< meta_search_dobind_init[0]=4
502a4634 conn=1001 op=1 >>> meta_search_dobind_init[0]


Here's the relevant portion of slapd.conf:


database meta
suffix dc=local
rootdn cn=administrator,dc=local
rootpw secret

network-timeout 3

uri ldap://host1:3268/ou=dc1,dc=local
uri ldap://host2:3268/ou=dc1,dc=local
uri ldap://host3:3268/ou=dc1,dc=local

suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"

idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"

idassert-authzfrom "dn.exact:cn=administrator,dc=local"


Am I doing something wrong or has the bug described in ITS#7050 crept
back in?
--
Liam Gretton ***@le.ac.uk
HPC Architect http://www.le.ac.uk/its
IT Services Tel: +44 (0)116 2522254
University of Leicester, University Road
Leicestershire LE1 7RH, United Kingdom
Uwe Werler
2012-08-14 13:37:42 UTC
Permalink
-----Original message-----
To: openldap-***@openldap.org;
From: Liam Gretton <***@leicester.ac.uk>
Sent: Tue 14.08.2012 15:18
Subject: slapd-meta doesn't continue with multiple URIs
Post by Liam Gretton
I've been trying to get slapd-meta to failover using multiple URIs but
can't get it to work.
[...]
Did you ever try

uri "ldap://host1:3268/ou=dc1,dc=local" "ldap://host2:3268" "ldap://host3:3268" ?
m***@aero.polimi.it
2012-08-14 13:52:40 UTC
Permalink
Post by Liam Gretton
[...]
database meta
suffix dc=local
rootdn cn=administrator,dc=local
rootpw secret
network-timeout 3
uri ldap://host1:3268/ou=dc1,dc=local
uri ldap://host2:3268/ou=dc1,dc=local
uri ldap://host3:3268/ou=dc1,dc=local
suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"
idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"
idassert-authzfrom "dn.exact:cn=administrator,dc=local"
Am I doing something wrong
You are. The above is creating three targets, one pointing to host1, one
pointing to host2 and one pointing to host3. The rest of the
configuration is associated to the last target, the others are sort of
dangling. A correct configuration for failover would be

uri ldap://host1:3268/ou=dc1,dc=local
ldap://host2:3268/
ldap://host3:3268/
suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"
idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"
idassert-authzfrom "dn.exact:cn=administrator,dc=local"

Note that URIs other than the first one cannot have the DN part (the DN
of the first URI is assumed).

p.
Liam Gretton
2012-08-14 14:08:42 UTC
Permalink
Post by m***@aero.polimi.it
[...] A correct configuration for failover would be
uri ldap://host1:3268/ou=dc1,dc=local
ldap://host2:3268/
ldap://host3:3268/
suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"
idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"
idassert-authzfrom "dn.exact:cn=administrator,dc=local"
Note that URIs other than the first one cannot have the DN part (the DN
of the first URI is assumed).
Understood. However in that case the server never attempts to contact
host2 or host3 at all. Here's the output from the debug log:


502a5ae6 >>> slap_listener(ldapi://%2Fvar%2Frun%2Fslapd%2Fldapi-meta)
502a5ae6 connection_get(8): got connid=1000
502a5ae6 connection_read(8): checking for input on id=1000
ber_get_next
ber_get_next: tag 0x30 len 43 contents:
502a5ae6 op tag 0x60, time 1344953062
ber_get_next
502a5ae6 conn=1000 op=0 do_bind
ber_scanf fmt ({imt) ber:
ber_scanf fmt (m}) ber:
502a5ae6 >>> dnPrettyNormal: <cn=administrator,dc=local>
502a5ae6 <<< dnPrettyNormal: <cn=administrator,dc=local>,
<cn=administrator,dc=local>
502a5ae6 do_bind: version=3 dn="cn=administrator,dc=local" method=128
502a5ae6 conn=1000 op=0: rootdn="cn=administrator,dc=local" bind succeeded
502a5ae6 do_bind: v3 bind: "cn=administrator,dc=local" to
"cn=administrator,dc=local"
502a5ae6 send_ldap_result: conn=1000 op=0 p=3
502a5ae6 send_ldap_response: msgid=1 tag=97 err=0
ber_flush2: 14 bytes to sd 8
502a5ae6 connection_get(8): got connid=1000
502a5ae6 connection_read(8): checking for input on id=1000
ber_get_next
ber_get_next: tag 0x30 len 44 contents:
502a5ae6 op tag 0x63, time 1344953062
ber_get_next
502a5ae6 conn=1000 op=1 do_search
ber_scanf fmt ({miiiib) ber:
502a5ae6 >>> dnPrettyNormal: <dc=local>
502a5ae6 <<< dnPrettyNormal: <dc=local>, <dc=local>
ber_scanf fmt ({mm}) ber:
ber_scanf fmt ({M}}) ber:
ldap_create
ldap_url_parse_ext(ldap://host3:3268)
ldap_url_parse_ext(ldap://host2:3268)
ldap_url_parse_ext(ldap://host1:3268)
502a5ae6 conn=1000 op=1: meta_back_getconn[0]
502a5ae6 conn=1000 op=1 meta_back_getconn: candidates=1 conn=ROOTDN inserted
502a5ae6 conn=1000 op=1 >>> meta_back_search_start[0]
502a5ae6 conn=1000 op=1 >>> meta_search_dobind_init[0]
ldap_sasl_bind
ldap_send_initial_request
ldap_new_connection 1 1 0
ldap_int_open_connection
ldap_connect_to_host: TCP host1:3268
ldap_new_socket: 10
ldap_prepare_socket: 10
ldap_connect_to_host: Trying 192.168.1.1:3268
ldap_pvt_connect: fd: 10 tm: 5 async: -1
ldap_ndelay_on: 10
ldap_int_poll: fd: -1 tm: 0
502a5ae6 conn=1000 op=1 <<< meta_search_dobind_init[0]=4
502a5ae6 conn=1000 op=1 <<< meta_back_search_start[0]=4
502a5ae6 conn=1000 op=1 meta_back_search: ncandidates=1 cnd="*"
502a5ae6 conn=1000 op=1 >>> meta_search_dobind_init[0]
ldap_sasl_bind
ldap_send_initial_request
ldap_int_poll: fd: 10 tm: 0
502a5ae6 conn=1000 op=1 <<< meta_search_dobind_init[0]=4
502a5ae6 conn=1000 op=1 >>> meta_search_dobind_init[0]

ldap_sasl_bind
ldap_send_initial_request
ldap_int_poll: fd: 10 tm: 0
502a5ae6 conn=1000 op=1 <<< meta_search_dobind_init[0]=4
502a5ae6 conn=1000 op=1 >>> meta_search_dobind_init[0]

ldap_sasl_bind
ldap_send_initial_request
ldap_int_poll: fd: 10 tm: 0
502a5ae6 conn=1000 op=1 <<< meta_search_dobind_init[0]=4
502a5ae6 conn=1000 op=1 >>> meta_search_dobind_init[0]

...etc
m***@aero.polimi.it
2012-08-14 14:28:29 UTC
Permalink
Post by Liam Gretton
[...]
Understood. However in that case the server never attempts to contact
Correct. When host1 is down, host2 is contacted instead, and so forth.

p.
Liam Gretton
2012-08-14 14:33:11 UTC
Permalink
Post by m***@aero.polimi.it
[...]
Correct. When host1 is down, host2 is contacted instead, and so forth.
If I wasn't clear, I changed the config as you suggested. The debug
output I posted was from that configuration. The server never attempts
to contact anything other than host1.
m***@aero.polimi.it
2012-08-14 15:06:13 UTC
Permalink
Post by Liam Gretton
[...]
If I wasn't clear, I changed the config as you suggested. The debug
output I posted was from that configuration. The server never attempts
to contact anything other than host1.
Did you try stopping host1 in between client operations? I did and it
works as intended.

p.
Liam Gretton
2012-08-14 15:20:13 UTC
Permalink
Post by m***@aero.polimi.it
Post by Liam Gretton
If I wasn't clear, I changed the config as you suggested. The debug
output I posted was from that configuration. The server never attempts
to contact anything other than host1.
Did you try stopping host1 in between client operations? I did and it
works as intended.
No, initially I've been testing the case where host1 is down when
the LDAP service starts.

If I remove host1 after the LDAP server has started, the debug output is
at least different. It's attempting to contact host1, failing, doubling
the timeout and trying again continuously, never attempting to try host2
or host3.


** ld 0xa2e4e0 Connections:
* host: host1 port: 3268 (default)
refcnt: 2 status: Connected
last used: Tue Aug 14 16:11:36 2012


** ld 0xa2e4e0 Outstanding Requests:
* msgid 7, origid 7, status InProgress
outstanding referrals 0, parent count 0
ld 0xa2e4e0 request count 1 (abandoned 0)
** ld 0xa2e4e0 Response Queue:
Empty
ld 0xa2e4e0 response count 0
ldap_chkResponseList ld 0xa2e4e0 msgid 7 all 2
ldap_chkResponseList returns ld 0xa2e4e0 NULL
ldap_int_select
ldap_result ld 0xa2e4e0 msgid 7
wait4msg ld 0xa2e4e0 msgid 7 (timeout 100000 usec)
wait4msg continue ld 0xa2e4e0 msgid 7 all 2
** ld 0xa2e4e0 Connections:
* host: host1 port: 3268 (default)
refcnt: 2 status: Connected
last used: Tue Aug 14 16:11:36 2012


** ld 0xa2e4e0 Outstanding Requests:
* msgid 7, origid 7, status InProgress
outstanding referrals 0, parent count 0
ld 0xa2e4e0 request count 1 (abandoned 0)
** ld 0xa2e4e0 Response Queue:
Empty
ld 0xa2e4e0 response count 0
ldap_chkResponseList ld 0xa2e4e0 msgid 7 all 2
ldap_chkResponseList returns ld 0xa2e4e0 NULL
ldap_int_select
ldap_result ld 0xa2e4e0 msgid 7
wait4msg ld 0xa2e4e0 msgid 7 (timeout 200000 usec)
wait4msg continue ld 0xa2e4e0 msgid 7 all 2
** ld 0xa2e4e0 Connections:
* host: host1 port: 3268 (default)
refcnt: 2 status: Connected
last used: Tue Aug 14 16:11:36 2012


** ld 0xa2e4e0 Outstanding Requests:
* msgid 7, origid 7, status InProgress
outstanding referrals 0, parent count 0
ld 0xa2e4e0 request count 1 (abandoned 0)
** ld 0xa2e4e0 Response Queue:
Empty
ld 0xa2e4e0 response count 0
ldap_chkResponseList ld 0xa2e4e0 msgid 7 all 2
ldap_chkResponseList returns ld 0xa2e4e0 NULL
ldap_int_select
ldap_result ld 0xa2e4e0 msgid 7
wait4msg ld 0xa2e4e0 msgid 7 (timeout 400000 usec)
wait4msg continue ld 0xa2e4e0 msgid 7 all 2
** ld 0xa2e4e0 Connections:
* host: host1 port: 3268 (default)
refcnt: 2 status: Connected
last used: Tue Aug 14 16:11:36 2012

...etc.
m***@aero.polimi.it
2012-08-14 16:18:57 UTC
Permalink
Post by Liam Gretton
[...]
No, I've been initially testing with the case where host1 is down when
the LDAP service starts.
If I remove host1 after the LDAP server has started, the debug output is
at least different. It's attempting to contact host1, failing, doubling
the timeout and trying again continuously, never attempting to try host2
or host3.
The timeout you see is an internal timeout used for each poll on a
target's connection. It keeps doubling when the connection is valid but
nothing comes. Did you actually kill host1, or just stop it? In the
latter case, the connection is not dead; it's just returning nothing. You
need to kill the process (or let it time out using the "timeout"
directive).

p.
Liam Gretton
2012-08-14 16:35:42 UTC
Permalink
Post by m***@aero.polimi.it
Post by Liam Gretton
If I remove host1 after the LDAP server has started, the debug
output is at least different. It's attempting to contact host1,
failing, doubling the timeout and trying again continuously, never
attempting to try host2 or host3.
The timeout you see is an internal timeout used for each poll on a
target's connection. It keeps doubling when the connection is valid
but nothing comes. Did you actually kill host1, or just stop it?
In the first case (host1 down when LDAP starts), I was testing by
pointing at a host which has no LDAP service running on it at all,
although the host itself was up.

In the second case (host1 down after LDAP starts), I was using a proper
target (an AD domain controller) and setting an iptables rule to prevent
outbound traffic to it:

iptables -A OUTPUT -d host1 -j DROP
Post by m***@aero.polimi.it
In the latter case, the connection is not dead, it's just returning
nothing. You need to kill the process (or let it timeout using the
"timeout" directive).
Which timeout directive? I've already set network-timeout in the config
for slapd-meta, and setting bind-timeout doesn't help either. I have no
control over the configuration of the targets.
m***@aero.polimi.it
2012-08-14 20:57:50 UTC
Permalink
Post by Liam Gretton
[...]
Which timeout directive? I've already set network-timeout in the config
for slapd-meta, and setting bind-timeout doesn't help either. I have no
control over the configuration of the targets.
bind-timeout and network-timeout have specific, connection-level
meanings. Use just "timeout <seconds>" (you can make it search-specific,
if you don't want it to affect other operations, using
"timeout search=<seconds>").
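
A minimal sketch of where this would sit in the slapd-meta stanza (the
5-second value here is purely illustrative):

```
database        meta
# ... suffix, uri, and other directives as before ...

# Give up on a proxied search after 5 seconds, without
# affecting binds or other operations:
timeout         search=5
```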

p.
Liam Gretton
2012-08-15 03:30:03 UTC
Permalink
Post by m***@aero.polimi.it
bind-timeout and network-timeout have specific, connection-level
meanings. Use just "timeout <seconds>" (you can make it search-specific,
if you don't want it to affect other operations, using
"timeout search=<seconds>").
Setting timeout doesn't solve the problem, but it changes the behaviour.
Now the ldapsearch times out after the value specified and reports:

result: 11 Administrative limit exceeded
text: Operation timed out

...but the LDAP server still doesn't attempt to contact the failover
hosts. I've also verified this with tcpdump.

To recap, here's my current config. I can't help but think I'm doing
something obviously wrong here if it's working for others.


database meta
suffix dc=local
rootdn cn=administrator,dc=local
rootpw secret

network-timeout 1
timeout 1

uri ldap://host1:3268/ou=dc1,dc=local
ldap://host2:3268/
ldap://host3:3268/

suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"

idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"

idassert-authzfrom "dn.exact:cn=administrator,dc=local"
Liam Gretton
2012-08-15 19:33:13 UTC
Permalink
Can anyone explain the interaction between 'network-timeout' and
'timeout'? I'm tearing my hair out with this problem and the timeout
options are the only straws I have to clutch at.
Liam Gretton
2012-08-17 15:32:47 UTC
Permalink
I'm trying to get my head round the source code now to see if this is a bug.

One thing that looks odd to me in the debug output:

ldap_url_parse_ext(ldap://host1:3268)
ldap_url_parse_ext(ldap://host2:3268)
ldap_url_parse_ext(ldap://host3:3268)
502e5f7e conn=1000 op=1: meta_back_getconn[0]
502e5f7e conn=1000 op=1 meta_back_getconn: candidates=1 conn=ROOTDN inserted

Shouldn't this be 'candidates=3' in the last line above?

If anyone familiar with the source could let me know I'd be grateful.
Liam Gretton
2012-08-22 19:54:30 UTC
Permalink
My fault: "timeout" is operation-wide; when it's hit, the
operation ends as you reported. "network-timeout" is related to
connect(2) only. As far as I understand, by looking at the code,
there is no practical means, so far, to perform what you're asking
for. Either the connection cannot be established, and in this case
the code works as intended, or if it hangs forever the operation
can only be aborted/timed out. In those cases, you definitely need
to fix the configuration by removing the hung URI.
But what's the point of specifying multiple targets in the uri option if
it doesn't fall through to subsequent ones when the first is not
contactable?

Have I completely missed the point of the documentation?
Liam Gretton
2012-08-23 09:00:49 UTC
Permalink
Post by Liam Gretton
But what's the point of specifying multiple targets in the uri
option if it doesn't fall through to subsequent ones when the first
is not contactable?
Have I completely missed the point of the documentation?
The point is that your condition is *not* a server unreachable.
There's obviously some subtlety I'm missing here. How would you describe
it instead?
Current failover only deals with failures/timeouts of connect(2). I
don't think handling your case using failover is appropriate. Your
case should be handled by removing the non-responding URI from the
list.
I don't understand the difference. If a server is unavailable for
whatever reason (offline, firewalled, switched off, nothing listening on
the specified port), then connect() will timeout as you describe.

Which failures are the current mechanism actually expected to cope with
that don't include a server being unreachable?
Pierangelo Masarati
2012-08-23 09:22:29 UTC
Permalink
Post by Liam Gretton
[...]
I don't understand the difference. If a server is unavailable for
whatever reason (offline, firewalled, switched off, nothing listening on
the specified port), then connect() will timeout as you describe.
When connect(2) times out the code behaves as expected.

p.
Post by Liam Gretton
Which failures are the current mechanism actually expected to cope with
that don't include a server being unreachable?
--
Pierangelo Masarati
Associate Professor
Dipartimento di Ingegneria Aerospaziale
Politecnico di Milano
Liam Gretton
2012-08-23 09:48:06 UTC
Permalink
Post by Pierangelo Masarati
[...]
When connect(2) times out the code behaves as expected.
Can you explain further please? 'As expected' to you is obviously
different to what I expect from the documentation and what you've said
previously. You say that the failover mechanism works when connect()
fails or times out, but that's not the behaviour I'm seeing.
Howard Chu
2012-08-23 10:18:06 UTC
Permalink
Post by Liam Gretton
[...]
Can you explain further please? 'As expected' to you is obviously
different to what I expect from the documentation and what you've said
previously. You say that the failover mechanism works when connect()
fails or times out, but that's not the behaviour I'm seeing.
Your description of your procedure is so vague and imprecise it's difficult
for anybody to decipher what you're talking about.

Reading back through the several posts in this thread, what I see you saying is
that you have tested a few different configurations:

1) target host is up, target LDAP server is down
this should fail immediately because the host OS will immediately send a
TCP Connection Refused response

2) target host is initially down
this will not fail until the first TCP connect request times out

3) target host is initially up and connected, but through your iptables
manipulation you sever the link
this will not fail until the TCP connection times out, which it won't
unless you're using TCP Keepalives, and by default those are only sent once
every 2 hours.
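
The difference between these cases can be reproduced at the plain socket
level. A minimal sketch, with two assumptions flagged: port 47123 is
assumed to have no listener on localhost, and 10.255.255.1 is assumed to
be filtered or unroutable (standing in for the iptables DROP case):

```python
import socket
import time

# Case 1: host is up but nothing listens on the port. The OS answers
# the SYN with a TCP RST, so connect() fails immediately.
s = socket.socket()
t0 = time.monotonic()
try:
    s.connect(("127.0.0.1", 47123))  # assumed-closed port
except ConnectionRefusedError:
    print(f"refused after {time.monotonic() - t0:.3f}s")
finally:
    s.close()

# Case 3: packets are silently dropped (iptables -j DROP). No RST or
# ICMP error ever arrives, so connect() just keeps retransmitting SYNs
# until its own timeout fires -- unless we bound it explicitly, which
# is what a connect-level timeout (cf. network-timeout) does.
s = socket.socket()
s.settimeout(2)  # without this, the hang can last minutes
t0 = time.monotonic()
try:
    s.connect(("10.255.255.1", 389))  # assumed-filtered address
except (socket.timeout, OSError):
    print(f"gave up after {time.monotonic() - t0:.1f}s")
finally:
    s.close()
```

The severed-established-connection case is different again: the socket
is already connected, so no error surfaces until TCP itself gives up or
keepalives are enabled.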
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Liam Gretton
2012-08-24 08:51:05 UTC
Permalink
Post by Howard Chu
Your description of your procedure is so vague and imprecise it's difficult
for anybody to decipher what you're talking about.
[...]
Let me make it less vague then.

What I've been trying to simulate are the various modes by which a URI
target can become unavailable. What I'm trying to achieve is to have
the meta backend point to four domain controllers and cope with one or
more DCs being unavailable.

Having gone through this and letting the system time out each time, I've
found that it does fail over under one of the conditions listed below,
but it takes about 15 minutes to do so.


Scenarios:

1. slapd starts, first target is unreachable;

2. slapd starts, first target is reachable but has no service running;

3. slapd already running, first target up and connected then later
becomes unreachable.


Simulations:

a. 'Unreachable' simulated by blocking outbound access with the
following iptables rule:

iptables -A OUTPUT -d host1 -j DROP

b. 'Unreachable' simulated by making the first target a host that is up
but with no service running.


Results (all with 2.4.32):

Case 1a: slapd retries host1 continuously and times out after about
180s. No attempt is made to contact additional targets.

Case 2b: slapd retries host1 continuously and times out after about
180s. No attempt is made to contact additional targets.

Case 3a: slapd retries host1 continuously, doubling an internal timeout
value each time, eventually timing out after 19 retries and about 15m.
It does then fall through to host2 and subsequent connections don't
attempt to contact host1.

Here's my config. I've also tried setting nretries explicitly to 3, but
it makes no difference.


database meta
suffix dc=local
rootdn cn=administrator,dc=local
rootpw secret

network-timeout 1

uri ldap://host1:3268/ou=dc1,dc=local
ldap://host2:3268/
ldap://host3:3268/

suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"

idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"

idassert-authzfrom "dn.exact:cn=administrator,dc=local"


These results suggest to me that network-timeout and nretries (which
should default to 3) don't work as documented.

Having said that, it does seem to at least cope with scenario 3, albeit
with a long timeout.

Ideally it'd work in all cases. Pierangelo says the failover works when
connect() times out, but I'd have thought that would include scenarios 1
and 2 but not 3.
Liam Gretton
2012-08-24 12:22:58 UTC
Post by Harry Jede
I am not really surprised by your results.
Run your tests again, but use "reject" as the iptables target.
"drop" means that you will never get an answer.
Ok, tried that.

For scenario 1, search against slapd times out after about 3s, doesn't
attempt to contact host1.

For scenario 3 it makes no difference, after about 15 mins slapd times
out against host1 and contacts host2 instead.
--
Liam Gretton ***@le.ac.uk
HPC Architect http://www.le.ac.uk/its
IT Services Tel: +44 (0)116 2522254
University of Leicester, University Road
Leicestershire LE1 7RH, United Kingdom
Brett Maxfield
2012-08-24 13:40:48 UTC
Hi Liam,

IMHO you'd be better off using a hardware or software failover device. There are several free Linux-based ones that will run on commodity or dedicated hardware.

Then you have complete control of the failover policy. Using a single app server to provide failover for other app servers is like cracking walnuts with a Ming vase: it will work until it breaks.

Software like pfSense works at a low level, does IP pooling, can itself be made redundant, and can run as an appliance on VMware etc.

Similarly, setting up two new servers with CentOS/Red Hat gets you LVS, though that is a bit harder to configure unless you are willing to spend the extra time learning how.

The OpenLDAP code is probably not ideal for the way you are using it, most likely because nobody has done failover this way before.

Cheers
Brett
Post by Liam Gretton
I am not really surprised by your results.
Run your tests again, but use "reject" as the iptables target.
"drop" means that you will never get an answer.
Ok, tried that.
For scenario 1, search against slapd times out after about 3s, doesn't attempt to contact host1.
For scenario 3 it makes no difference, after about 15 mins slapd times out against host1 and contacts host2 instead.
Howard Chu
2012-08-24 18:55:58 UTC
Post by Liam Gretton
Post by Howard Chu
Your description of your procedure is so vague and imprecise it's difficult
for anybody to decipher what you're talking about.
Reading back thru the several posts in this thread, what I see you saying is
1) target host is up, target LDAP server is down
this should fail immediately because the host OS will immediately send a
TCP Connection Refused response
2) target host is initially down
this will not fail until the first TCP connect request times out
3) target host is initially up and connected, but thru your iptables
manipulation you sever the link
this will not fail until the TCP connection times out, which it won't
unless you're using TCP Keepalives, and by default those are only sent once
every 2 hours.
Let me make it less vague then.
What I've been trying to simulate are the various modes by which a uri
target will become unavailable. What I'm trying to achieve is to have
the meta backend point to four domain controllers and cope with one or
more DCs being unavailable.
Having gone through this and let the system time out each time, I've
found it does fail over under one of the conditions listed below, but it
takes about 15 minutes to do so.
1. slapd starts, first target is unreachable;
2. slapd starts, first target is reachable but has no service running;
3. slapd already running, first target up and connected then later
becomes unreachable.
a. 'Unreachable' simulated by blocking outbound access with the
iptables -A OUTPUT -d host1 -j DROP
b. 'Unreachable' simulated making the first target a host that is up but
with no service running.
Case 1a: slapd retries host1 continuously and times out after about
180s. No attempt is made to contact additional targets.
Case 2b: slapd retries host1 continuously and times out after about
180s. No attempt is made to contact additional targets.
Case 3a: slapd retries host1 continuously, doubling an internal timeout
value each time, eventually timing out after 19 retries and about 15m.
It does then fall through to host2 and subsequent connections don't
attempt to contact host1.
Here's my config. I've also tried setting nretries explicitly to 3, but
it makes no difference.
database meta
suffix dc=local
rootdn cn=administrator,dc=local
rootpw secret
network-timeout 1
uri ldap://host1:3268/ou=dc1,dc=local
ldap://host2:3268/
ldap://host3:3268/
suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"
idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"
idassert-authzfrom "dn.exact:cn=administrator,dc=local"
These results suggest to me that network-timeout and nretries (which
should default to 3) don't work as documented.
Having said that, it does seem to at least cope with scenario 3, albeit
with a long timeout.
Ideally it'd work in all cases. Pierangelo says the failover works when
connect() times out, but I'd have thought that would include scenarios 1
and 2 but not 3.
Sounds like you should file an ITS.

Pierangelo: looking at libldap/request.c and libldap/open.c, it appears that
request.c:ldap_new_connection() expects open.c:ldap_int_open_connection() to
return -2 on an async open, but ldap_int_open_connection() unconditionally
returns 0. This is probably interfering with back-meta's urllist_proc.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
Liam Gretton
2012-08-28 08:08:19 UTC
Post by Howard Chu
Post by Liam Gretton
Ideally it'd work in all cases. Pierangelo says the failover works when
connect() times out, but I'd have thought that would include scenarios 1
and 2 but not 3.
Sounds like you should file an ITS.
ITS#7372 submitted.
--
Liam Gretton ***@le.ac.uk
HPC Architect http://www.le.ac.uk/its
IT Services Tel: +44 (0)116 2522254
University of Leicester, University Road
Leicestershire LE1 7RH, United Kingdom
h***@arcor.de
2012-08-24 11:48:43 UTC
Post by Liam Gretton
Post by Howard Chu
Your description of your procedure is so vague and imprecise it's
difficult for anybody to decipher what you're talking about.
Reading back thru the several posts in this thread, what I see you
1) target host is up, target LDAP server is down
this should fail immediately because the host OS will
immediately send a
TCP Connection Refused response
2) target host is initially down
this will not fail until the first TCP connect request times out
3) target host is initially up and connected, but thru your
iptables manipulation you sever the link
this will not fail until the TCP connection times out, which it won't
unless you're using TCP Keepalives, and by default those are only
sent once every 2 hours.
Let me make it less vague then.
What I've been trying to simulate are the various modes by which a
uri target will become unavailable. What I'm trying to achieve is to
have the meta backend point to four domain controllers and cope with
one or more DCs being unavailable.
Having gone through this and let the system time out each time, I've
found it does fail over under one of the conditions listed below, but
it takes about 15 minutes to do so.
1. slapd starts, first target is unreachable;
2. slapd starts, first target is reachable but has no service
running;
3. slapd already running, first target up and connected then later
becomes unreachable.
a. 'Unreachable' simulated by blocking outbound access with the
iptables -A OUTPUT -d host1 -j DROP
b. 'Unreachable' simulated making the first target a host that is up
but with no service running.
Case 1a: slapd retries host1 continuously and times out after about
180s. No attempt is made to contact additional targets.
Case 2b: slapd retries host1 continuously and times out after about
180s. No attempt is made to contact additional targets.
Case 3a: slapd retries host1 continuously, doubling an internal
timeout value each time, eventually timing out after 19 retries and
about 15m. It does then fall through to host2 and subsequent
connections don't attempt to contact host1.
Here's my config. I've also tried setting nretries explicitly to 3,
but it makes no difference.
database meta
suffix dc=local
rootdn cn=administrator,dc=local
rootpw secret
network-timeout 1
uri ldap://host1:3268/ou=dc1,dc=local
ldap://host2:3268/
ldap://host3:3268/
suffixmassage "ou=dc1,dc=local" "dc=example,dc=com"
idassert-bind bindmethod=simple
binddn="cn=proxyuser,dc=example,dc=com"
credentials="password"
idassert-authzfrom "dn.exact:cn=administrator,dc=local"
These results suggest to me that network-timeout and nretries (which
should default to 3) don't work as documented.
I am not really surprised by your results.
Run your tests again, but use "reject" as the iptables target.

"drop" means that you will never get an answer.
Post by Liam Gretton
Having said that, it does seem to at least cope with scenario 3,
albeit with a long timeout.
Ideally it'd work in all cases. Pierangelo says the failover works
when connect() times out, but I'd have thought that would include
scenarios 1 and 2 but not 3.
--
Harry Jede