Bug 10652 - Samba 4 consuming a lot of CPU when re-reading printcap info
Status: RESOLVED FIXED
Product: Samba 4.1 and newer
Classification: Unclassified
Component: Printing
Version: 4.1.7
Hardware: All
OS: Linux
Importance: P5 normal
Assigned To: Karolin Seeger
QA Contact: Samba QA Contact
Reported: 2014-06-06 20:12 UTC by Alex K
Modified: 2014-09-19 17:09 UTC

Attachments
  testparm.txt - testparm output (1.22 KB, text/plain) - 2014-06-06 20:12 UTC, Alex K
  strace of idle smbd (1.09 MB, text/plain) - 2014-06-06 20:13 UTC, Alex K
  CPU load before and after upgrade from 3 to 4 (32.26 KB, image/png) - 2014-06-06 20:13 UTC, Alex K
  log.smbd of a spike (74 bytes, text/plain) - 2014-06-12 20:29 UTC, Alex K
  strace.log of a spiking process (74 bytes, text/plain) - 2014-06-12 20:30 UTC, Alex K
  log.smbd of a spike (1.81 MB, application/octet-stream) - 2014-06-12 20:49 UTC, Alex K
  strace.log.gz of a spike (403.27 KB, application/gzip) - 2014-06-12 20:49 UTC, Alex K
  24 CPU server with 1k printers & 400 clients doing an exclusive traverse (80.48 KB, image/png) - 2014-07-11 08:57 UTC, David Disseldorp
  24 CPU server with 1k printers & 400 clients doing a traverse read (82.13 KB, image/png) - 2014-07-11 08:58 UTC, David Disseldorp
  24 CPU server with 1k printers & 400 idle clients doing a traverse read on enumeration only (76.96 KB, image/png) - 2014-07-24 09:45 UTC, David Disseldorp
  24 CPU server with 1k printers, 400 idle clients, and 40 netshareenum rpcclients, doing a traverse read on enumeration only (76.68 KB, image/png) - 2014-07-24 09:59 UTC, David Disseldorp
  printcap update correctness test (2.40 KB, text/plain) - 2014-08-04 13:03 UTC, David Disseldorp
  simple smbclient and rpcclient load generator (2.05 KB, application/octet-stream) - 2014-08-04 13:04 UTC, David Disseldorp
  patch series for v4-1-test branch, cherry-picked from master (44.14 KB, patch) - 2014-08-09 00:53 UTC, David Disseldorp (flags: ddiss: review? (asn), jra: review+)
  patch series for v4-0-test branch (44.92 KB, application/x-download) - 2014-08-11 13:27 UTC, David Disseldorp (flags: ddiss: review? (asn), jra: review+)
  patch series for v3-6 branch (31.17 KB, patch) - 2014-08-11 13:38 UTC, David Disseldorp
  screenshot-after.png (36.02 KB, image/png) - 2014-09-19 17:09 UTC, Alex K

Description Alex K 2014-06-06 20:12:16 UTC
Hi team, 

After I upgraded my printservers from Samba 3 to Samba 4, their CPU load spiked dramatically. Every 60 seconds (the printcap cache time) all Samba processes re-read the printcap configuration. With a couple of hundred processes doing this at exactly the same moment, the load on the server is huge.

Why does Samba 4 cause more load when re-reading printcap info?
Perhaps you could make just one smbd process re-read the printcap?
Or at least have them do it at a random moment instead of in the same second as every other process?

I'm attaching testparm output, strace output of an idle smbd process, and a graph of CPU/memory load over the last three months (before and after upgrade).
Comment 1 Alex K 2014-06-06 20:12:46 UTC
Created attachment 10008 [details]
testparm.txt - testparm output
Comment 2 Alex K 2014-06-06 20:13:26 UTC
Created attachment 10009 [details]
strace of idle smbd
Comment 3 Alex K 2014-06-06 20:13:52 UTC
Created attachment 10010 [details]
CPU load before and after upgrade from 3 to 4
Comment 4 Jeremy Allison 2014-06-11 21:33:43 UTC
Hmmm. Are you sure it's reloading the printcap that's causing this? There are other 60-second cycles in Samba.
Comment 5 Alex K 2014-06-11 22:05:24 UTC
No, I hadn't thought about that. Maybe it is something else; I was hoping strace would provide some clue.

Here is what is defined in the configuration for various timeouts:

$ testparm -sv | grep time.*[0-9]
	passwd chat timeout = 2
	deadtime = 15
	lpq cache time = 30
	name cache timeout = 660
	ctdb timeout = 0
	ctdb locktime warn threshold = 0
	printcap cache time = 60
	cups connection timeout = 30
	machine password timeout = 604800
	username map cache time = 0
	lock spin time = 200
	oplock break wait time = 0
	ldap timeout = 15
	ldap connection timeout = 2
	afs token lifetime = 604800
	idmap cache time = 604800
	idmap negative cache time = 120
	winbind cache time = 300
	dfree cache time = 0
Comment 6 Jeremy Allison 2014-06-11 23:09:08 UTC
Yeah, it probably *is* the printcap cache time though :-).

Can you get an strace from an smbd when the load is spiking?

Jeremy.
Comment 7 Alex K 2014-06-12 20:26:48 UTC
Looks like it reloads the printers. 
I'm attaching the strace and log.smbd (debug level 7) of the spiking process.

I changed printcap cache time to 3 minutes, but it's not clear whether Samba is honouring this delay correctly. From the logs (two processes are running) it seems that one process is refreshing its cache every 3 minutes, and the other every 4 minutes:
[2014/06/12 19:41:43.497343,  5] ..(cups_cache_reload_async)
[2014/06/12 19:42:45.560515,  5] ..(cups_cache_reload_async)
[2014/06/12 19:45:45.670602,  5] ..(cups_cache_reload_async)
[2014/06/12 19:48:45.800344,  5] ..(cups_cache_reload_async)
[2014/06/12 19:49:49.453206,  5] ..(cups_cache_reload_async)
[2014/06/12 19:51:45.971307,  5] ..(cups_cache_reload_async)
[2014/06/12 19:53:52.395327,  5] ..(cups_cache_reload_async)
[2014/06/12 19:54:46.141277,  5] ..(cups_cache_reload_async)
...
[2014/06/12 20:10:04.214391,  5] ..(cups_cache_reload_async)
[2014/06/12 20:12:46.938543,  5] ..(cups_cache_reload_async)
[2014/06/12 20:14:07.182487,  5] ..(cups_cache_reload_async)
[2014/06/12 20:15:47.038809,  5] ..(cups_cache_reload_async)
[2014/06/12 20:18:10.145206,  5] ..(cups_cache_reload_async)
[2014/06/12 20:18:47.145976,  5] ..(cups_cache_reload_async)

print_queue_housekeeping is happening every 60 seconds and doesn't seem to be configurable via smb.conf:
[2014/06/12 20:19:47.197872,  5] ..(print_queue_housekeeping)
[2014/06/12 20:20:47.209912,  5] ..(print_queue_housekeeping)
[2014/06/12 20:21:47.267869,  5] ..(print_queue_housekeeping)
Comment 8 Alex K 2014-06-12 20:29:29 UTC
Created attachment 10026 [details]
log.smbd of a spike

log.smbd of a spiking process.
Comment 9 Alex K 2014-06-12 20:30:55 UTC
Created attachment 10027 [details]
strace.log of a spiking process

strace.log of a spiking process
Comment 10 Alex K 2014-06-12 20:49:09 UTC
Created attachment 10028 [details]
log.smbd of a spike

log.smbd of a spike
Comment 11 Alex K 2014-06-12 20:49:58 UTC
Created attachment 10029 [details]
strace.log.gz of a spike

strace.log.gz of a spike
Comment 12 David Disseldorp 2014-07-10 13:50:09 UTC
I'm looking at a problem similar to this at the moment on a similar 3.6 code-base:

Upon pulling down a list of printers from the printing backend (CUPS in this
case), the background printer process writes this data to printer_list.tdb. It then notifies all smbd processes, each of which traverses printer_list.tdb and updates its share listings. During traversal, the printer_list.tdb database is locked _exclusively_ by every smbd process.

With one smbd process per client connection, contention is only encountered
when the number of clients and number of printers reaches a large number
(hundreds or thousands).

Using traverse_read() for the per-client smbd printer_list.tdb traversal does
reduce lock contention, and appears to improve the responsiveness of smbd. However, with all smbds now able to perform the traversal in parallel,
there is a large spike in CPU usage across the period from printcap update until all processes are finished with the traversal.

I'm now looking at implementing a change notification based method for
printcap updates, such that the background printing process identifies added
and removed printers and only tells per-client smbds of these changes, saving
the cost of processing all printers obtained in every printcap update.

@Jeremy, does this approach sound sane to you?
Comment 13 Alex K 2014-07-10 14:33:02 UTC
I can confirm that I only see this problem on servers where there are 100+ printers and 2000+ users.
Comment 14 Volker Lendecke 2014-07-10 15:02:12 UTC
(In reply to comment #12)
> I'm looking at a problem similar to this at the moment on a similar 3.6
> code-base:
> 
> Upon pulling down a list of printers from the printing backend (CUPS in this
> case), the background printer process writes this data to printer_list.tdb. It
> then notifies all smbd processes, which sees each one traverse printer_list.tdb
> and update its share listings. During traversal, the printer_list.tdb database
> is locked _exclusively_ by every smbd process.
> 
> With one smbd process per client connection, contention is only encountered
> when the number of clients and number of printers reaches a large number
> (hundreds or thousands).
> 
> Using traverse_read() for the per-client smbd printer_list.tdb traversal does
> reduce lock contention, and appears to improve the responsiveness of smbd.
> However, with all smbds now able to perform the traversal in parallel,
> there is a large spike in CPU usage across the period from printcap update
> until all processes are finished with the traversal.
> 
> I'm now looking at implementing a change notification based method for
> printcap updates, such that the background printing process identifies added
> and removed printers and only tells per-client smbds of these changes, saving
> the cost of processing all printers obtained in every printcap update.
> 
> @Jeremy, does this approach sound sane to you?

I'm not Jeremy, but I'm still uttering thoughts here :-)

First, we might be sitting on the fcntl thundering herd problem that eventually led to the tdb mutex work.

Then, a broader question: why does smbd have to have an up-to-date printer list in memory at all? When a client issues an enumeration RPC, sure, that's the last moment when this list should be assembled, but that comes in at arbitrary, random times. smbd only needs to know that its view is no longer current, so that *when* such an enumeration RPC comes in it can update. This can be done very cheaply with the TDB_SEQNUM trick. Other calls always go to specific printer names. If smbd is missing one, it can look specifically at a tdb key without doing a traverse.

Unfortunately I haven't looked at the printing subsystem in a while, so I don't know where I'm wrong, but if there's one thing I've learned in the past, it is that enumeration of *anything* should be avoided at all costs :-)
Comment 15 David Disseldorp 2014-07-10 17:07:18 UTC
(In reply to comment #14)
...
> I'm not Jeremy, but I'm still uttering thoughts here :-)

Thanks for the feedback anyhow :-)

> First, we might be sitting on the fcntl thundering herd problem that eventually
> led to the tdb mutex work.

With traverse_read() the fcntl thundering herd doesn't appear to be an issue; it's the sheer number of processes doing the traversal and processing in parallel that contributes to the high CPU usage.

FWIW, the (3.6 based) traverse_read changes I'm talking about can be seen at 
http://git.samba.org/?p=ddiss/samba.git;a=commitdiff;h=56dbb02308ce54dbe141ab5bc951aa8e40066252

> Then, a broader question: Why does smbd have to have an up-to-date printer list
> in memory at all? When a client issues an enumeration RPC, sure, that's the
> last moment when this list should be assembled, but that comes in at arbitrary,
> random times. smbd only needs to know that its view is not current anymore, so
> that *when* such an enumeration RPC comes in it can update. This can be done
> very cheaply with the TDB_SEQNUM trick. Other calls always go to specific
> printer names. If smbd is missing one, it can look specifically at a tdb key
> without doing a traverse.
> 
> Unfortunately I haven't looked at the printing subsystem in a while, so I don't
> know where I'm wrong, but if there's one thing I've learned in the past, then
> it is that enumeration of *anything* should be avoided at high cost :-)

Updating on client enumeration sounds like a reasonable approach, but would be much more intrusive than the existing push-based model. IIUC, we'd have a workflow similar to:
- the background print process polls the printer backend (e.g. CUPS) at the printcap cache time interval, fills printer_list.tdb (including a TDB_SEQNUM update), handles AD unpublishing of deleted printers, and awaits MSG_PRINTER_PCAP messages.
- forked per-client smbd processes await enumeration requests. On receiving such a request, they compare the printer_list.tdb TDB_SEQNUM with their in-memory value (initially zero).
- if the TDB_SEQNUM doesn't match, they process the current list of printers and update their in-memory seqnum value.

I'm not so sure I see how this removes the traversal, but it would reduce the number of processes hitting the printer_list.tdb to only the ones that are interested.

One reason why I initially favoured the background process change-notification approach is that it would potentially also allow us to convert the CUPS printing backend to accept asynchronous CUPS notifications via the subscriptions API - http://www.cups.org/documentation.php/ref-subscriptions-conf.html .
Comment 16 Jeremy Allison 2014-07-10 18:59:41 UTC
(In reply to comment #15)

> FWIW, the (3.6 based) traverse_read changes I'm talking about can be seen at 
> http://git.samba.org/?p=ddiss/samba.git;a=commitdiff;h=56dbb02308ce54dbe141ab5bc951aa8e40066252

This change looks sensible to me. Does it also apply to 4.0.x/4.1.x (or can it be easily forward-ported)?
Comment 17 Jeremy Allison 2014-07-10 20:54:51 UTC
(In reply to comment #15)

> Updating on client enumeration sounds like a reasonable approach, but would be
> much more intrusive than the existing push-based model. IIUC, we'd have a
> workflow similar to:
> - background print process polls printer backend (e.g. CUPS) at printcap cache
> time interval, fills printer_list.tdb (includes TDB_SEQNUM update), handles AD
> unpublishing of deleted printers, and awaits MSG_PRINTER_PCAP messages.
> - forked per-client smbd processes await enumeration requests. On receiving
> such a request, they compare the printer_list.tdb TDB_SEQNUM with their
> in-memory value (initially zero).
> - if TDB_SEQNUM doesn't match, then they process the current list of printers
> and update their in memory seqnum value.

Yes, this seems to be the most scalable way of doing things I think.
Nice summation of the logic needed!

> I'm not so sure I see how this removes the traversal, but it would reduce the
> number of processes hitting the printer_list.tdb to only the ones that are
> interested.

It doesn't remove the traversal, but it makes it read-only and only
at the point required, which should make it much more efficient.

> One reason why I initially favoured the background process change-notification
> approach is that it would potentially also allow us to convert the CUPS
> printing backend to accept asynchronous CUPS notifications via the
> subscriptions API -

A background process change-notification approach would also fit nicely into the above refactoring, wouldn't it? It would make the background process printer list processing much more efficient.

Jeremy.
Comment 18 Volker Lendecke 2014-07-10 21:00:48 UTC
(In reply to comment #17)
> (In reply to comment #15)
> > I'm not so sure I see how this removes the traversal, but it would reduce the
> > number of processes hitting the printer_list.tdb to only the ones that are
> > interested.
> 
> It doesn't remove the traversal, but it makes it read-only and only
> at the point required, which should make it much more efficient.

Well, not necessarily more efficient as such, but hopefully more distributed over time, so we avoid these heavy spikes.

Volker
Comment 19 David Disseldorp 2014-07-11 08:57:38 UTC
Created attachment 10097 [details]
24 CPU server with 1k printers & 400 clients doing an exclusive traverse
Comment 20 David Disseldorp 2014-07-11 08:58:10 UTC
Created attachment 10098 [details]
24 CPU server with 1k printers & 400 clients doing a traverse read
Comment 21 Volker Lendecke 2014-07-11 09:01:56 UTC
(In reply to comment #20)
> Created attachment 10098 [details]
> 24 CPU server with 1k printers & 400 clients doing a traverse read

Wow, traverse read looks a LOT worse than exclusive traverse. Maybe much more parallelism kicking in?
Comment 22 David Disseldorp 2014-07-11 09:19:05 UTC
(In reply to comment #21)
> (In reply to comment #20)
> > Created attachment 10098 [details] [details]
> > 24 CPU server with 1k printers & 400 clients doing a traverse read
> 
> Wow, traverse read looks a LOT worse than exclusive traverse. Maybe much more
> parallelism kicking in?

I wouldn't say it's worse - smbd is much more responsive, as the lock contention is mostly gone. That said, the traverse_read change alone doesn't appear to be a viable solution for the high CPU usage, which is why I plan to couple it with your suggested update-on-enumerate change.
Comment 23 David Disseldorp 2014-07-11 09:19:46 UTC
(In reply to comment #16)
> (In reply to comment #15)
> 
> > FWIW, the (3.6 based) traverse_read changes I'm talking about can be seen at 
> > http://git.samba.org/?p=ddiss/samba.git;a=commitdiff;h=56dbb02308ce54dbe141ab5bc951aa8e40066252
> 
> This change looks sensible to me. Does it also apply to 4.0.x/4.1.x (or can it
> be easily forward-ported) ?

It should be easy to forward port, but as mentioned in comment#12, converting to printer_list.tdb traverse_read alone doesn't really fix the high CPU usage.

With exclusive traverse (attachment#10097 [details]), the printer_list.tdb lock contention means that the printcap update takes an eternity to be propagated to all client smbd processes, but CPU usage is only moderate due to the lock-step behaviour.

With traverse_read (attachment#10098 [details]), the printer_list.tdb lock contention is gone, but that means that (currently) all smbds process printer_list.tdb in parallel, resulting in a relatively short but large spike in CPU usage.
Comment 24 David Disseldorp 2014-07-23 13:57:23 UTC
I was on vacation last week. Am now back and testing a 3.6 patch-set based on the traverse-only-if-changed + traverse-on-client-enum modifications suggested by Volker:

The following changes since commit 260ab8a34c076e98868a2c8de2a7dc14138d3947:

  VERSION: Bump version up to 3.6.25. (2014-06-23 11:26:19 +0200)

are available in the git repository at:

  git://git.samba.org/ddiss/samba.git printer_list_traverse_read_when_changed_36s

for you to fetch changes up to 2d5a47a9ce35e5105674db7343242c8f82211834:

  smbd: only reprocess printer_list.tdb if it changed (2014-07-23 15:37:18 +0200)

----------------------------------------------------------------
David Disseldorp (11):
      printing: traverse_read the printer list for share updates
      printing: only reload printer shares on client enum
      printing: reload printer_list.tdb from in memory list
      printing: expose printer_list.tdb transaction helpers
      printing: perform atomic printer_list.tdb updates
      printing: remove unused pcap_cache_add()
      printing: add printer_list_clean_all() function
      printing: use printer_list_clean_all() to replace pcap cache
      printing: remove unused printer_list_clean_old()
      printing: return last change time with pcap_cache_loaded()
      smbd: only reprocess printer_list.tdb if it changed

Volker Lendecke (1):
      smbd: Avoid double-free in get_print_db_byname

 source3/printing/load.c                     |   4 ++--
 source3/printing/pcap.c                     |  84 ++++++++++++++++++++++++++++++++++++++++++++++++++++--------------------------------
 source3/printing/pcap.h                     |  13 ++++++-------
 source3/printing/print_aix.c                |  17 ++++++++++++++---
 source3/printing/print_iprint.c             |  16 ++++++++++------
 source3/printing/print_standard.c           |   8 ++++++--
 source3/printing/print_svid.c               |  11 +++++++----
 source3/printing/printer_list.c             | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------------
 source3/printing/printer_list.h             |  16 ++++++++++------
 source3/printing/printing_db.c              |   4 ++--
 source3/printing/spoolssd.c                 |  17 +----------------
 source3/rpc_server/spoolss/srv_spoolss_nt.c |   8 ++++++++
 source3/rpc_server/srvsvc/srv_srvsvc_nt.c   |   1 +
 source3/smbd/server.c                       |  27 +++++----------------------
 source3/smbd/server_reload.c                |  21 +++++++++++++++++++++
 source3/web/swat.c                          |   4 ++--
 16 files changed, 220 insertions(+), 163 deletions(-)

I wouldn't say it's ready for review yet. Will perform some new benchmarks and report back.
Comment 25 David Disseldorp 2014-07-24 09:45:53 UTC
Created attachment 10149 [details]
24 CPU server with 1k printers & 400 idle clients doing a traverse read on enumeration only

Note that (as with the previous graphs) none of the clients are performing share enumeration, and as per-client smbd processes now only update their printer share list on client enumeration, there is very little stress on the system.
Comment 26 Volker Lendecke 2014-07-24 09:54:50 UTC
(In reply to comment #24)
> I was on vacation last week. Am now back and testing a 3.6 patch-set based on
> the traverse-only-if-changed + traverse-on-client-enum modifications suggested
> by Volker:
> 
> The following changes since commit 260ab8a34c076e98868a2c8de2a7dc14138d3947:

Took a look nevertheless, this looks really, really interesting!

Two thoughts come to my mind from a quick glance (I did not do a real review yet):

We already have a good printer_list_get_last_refresh to avoid traverses. This
could be tuned further by only accessing the LAST_REFRESH record if the
database seqnum changed. This would involve two static variables inside
printer_list_get_last_refresh, the current seqnum and the LAST_REFRESH value.
If the seqnum did not change, just return the last refresh value from memory.

The other comment involves the database transactions. In current master dbwrap
(have not looked at 3.6), transactions are only possible on persistent
databases. I'm not sure the printer list is really persistent as passdb.tdb or
secrets.tdb is.
Comment 27 David Disseldorp 2014-07-24 09:59:39 UTC
Created attachment 10150 [details]
24 CPU server with 1k printers, 400 idle clients, and 40 netshareenum rpcclients, doing a traverse read on enumeration only

For this graph, an additional 40 clients are added to the existing 400 idle client load. These extra 40 clients are issuing netshareenum(level=1) requests at regular ~2s intervals.
The short, moderate spikes in CPU usage occur when the background print process updates printer_list.tdb (printcap cache time = 120). On subsequent enumeration the per-client smbd processes notice the printer list change and reload their printer share inventories.
The per-client smbd reload only occurs on the first enumeration after the printer_list.tdb change.
Comment 28 David Disseldorp 2014-07-24 10:21:50 UTC
Thanks for the feedback Volker!

(In reply to comment #26)
> (In reply to comment #24)
> > I was on vacation last week. Am now back and testing a 3.6 patch-set based on
> > the traverse-only-if-changed + traverse-on-client-enum modifications suggested
> > by Volker:
> > 
> > The following changes since commit 260ab8a34c076e98868a2c8de2a7dc14138d3947:
> 
> Took a look nevertheless, this looks really, really interesting!
> 
> Two thoughts come to my mind by a quick glance (did not do a real review yet):
> 
> We already have a good printer_list_get_last_refresh to avoid traverses. This
> could be tuned further by only accessing the LAST_REFRESH record if the
> database seqnum changed. This would involve two static variables inside
> printer_list_get_last_refresh, the current seqnum and the LAST_REFRESH value.
> If the seqnum did not change, just return the last refresh value from memory.

For change detection I used the printer_list_get_last_refresh record only. I could change it over to use the seqnum, but wouldn't expect much in terms of a performance gain. I don't quite see how using both fields would be necessary.

> The other comment involves the database transactions. In current master dbwrap
> (have not looked at 3.6), transactions are only possible on persistent
> databases. I'm not sure the printer list is really persistent as passdb.tdb or
> secrets.tdb is.

Oh, I missed this gem. printer_list.tdb is not persistent, so I'll have to convert it over to use locking. It would be good if dbwrap/TDB transaction_start() returned an error for non-persistent databases.
Comment 29 Volker Lendecke 2014-07-24 10:26:15 UTC
(In reply to comment #28)
> For change detection I used the printer_list_get_last_refresh record only. I
> could change it over to use the seqnum, but wouldn't expect much in terms of a
> performance gain. I don't quite see how using both fields would be necessary.

True, possibly we would not gain much, in particular as it's unlikely that many smbds come in really concurrently trying to read that record. But I've learned the hard way that anything like that can become a bottleneck. As this is a simple local optimization, it's perfectly fine to leave it for a later day, should it turn out to be a problem.

Volker
Comment 30 David Disseldorp 2014-07-24 11:40:25 UTC
(In reply to comment #28)
...
> > The other comment involves the database transactions. In current master dbwrap
> > (have not looked at 3.6), transactions are only possible on persistent
> > databases. I'm not sure the printer list is really persistent as passdb.tdb or
> > secrets.tdb is.
> 
> Oh, I missed this gem. printer_list.tdb is not persistent, so I'll have to
> convert it over to use locking. Would be good if dbwrap/TDB transaction_start()
> returned error against non-persistent databases.

OK, it looks like this is a new restriction, and it's only there for the CTDB case...

int dbwrap_transaction_start(struct db_context *db)
{
        if (!db->persistent) {
                /*
                 * dbwrap_ctdb has two different data models for persistent
                 * and non-persistent databases. Transactions are supported
                 * only for the persistent databases. This check is here to
                 * prevent breakages of the cluster case, autobuild at this
                 * point only tests non-clustered Samba. Before removing this
                 * check, please make sure that this facility has also been
                 * added to dbwrap_ctdb.
                 *
                 * Thanks, vl
                 */
                DEBUG(1, ("transactions not supported on non-persistent "
                          "database %s\n", db->name));
                return -1;
        }

Locking logic is going to make this much uglier :(.
Comment 31 Volker Lendecke 2014-07-24 12:32:11 UTC
(In reply to comment #30)
> int dbwrap_transaction_start(struct db_context *db)
> {
>         if (!db->persistent) {
>                 /*
>                  * dbwrap_ctdb has two different data models for persistent
>                  * and non-persistent databases. Transactions are supported
>                  * only for the persistent databases. This check is here to
>                  * prevent breakages of the cluster case, autobuild at this
>                  * point only tests non-clustered Samba. Before removing this
>                  * check, please make sure that this facility has also been
>                  * added to dbwrap_ctdb.
>                  *
>                  * Thanks, vl
>                  */
>                 DEBUG(1, ("transactions not supported on non-persistent "
>                           "database %s\n", db->name));
>                 return -1;
>         }
> 
> Locking logic is going to make this much uglier :(.

Why do partial lists hurt badly? The individual record should be fine I guess. And with a fresh enumeration everything should clean up eventually.
Comment 32 David Disseldorp 2014-07-24 18:49:07 UTC
(In reply to comment #29)
...
> > Locking logic is going to make this much uglier :(.
> 
> Why do partial lists hurt badly? The individual record should be fine I guess.
> And with a fresh enumeration everything should clean up eventually.

If a client smbd reloads from a partial list, then it will remove (valid) printers that haven't been added to the list yet.

Instead of locking, I could go back to the old procedure of:
    - write timestamp X
    - populate new entries from replacement pcap cache
    - weed out any old entries with a timestamp less than X
Comment 33 Volker Lendecke 2014-07-25 08:16:45 UTC
(In reply to comment #32) 
> If a client smbd reloads from a partial list, then it will remove (valid)
> printers that haven't been added to the list yet.
> 
> Instead of locking, I could go back to the old procedure of:
>     - write timestamp X
>     - populate new entries from replacement pcap cache
>     - weed out any old entries with a timestamp less than X

Ok. I had not thought about the possibility of first deleting all printers and then re-adding the current list. That would indeed be disturbing to users. Your new approach certainly looks better and is of course less disruptive.
Comment 34 Jeremy Allison 2014-07-25 21:25:40 UTC
David - I also looked over this code. Looking good to me (other than the transaction issue).

Ping me when you're done and I'll help review to get it into master + back ported for 4.x.x.

Cheers,

Jeremy.
Comment 35 David Disseldorp 2014-08-04 13:03:33 UTC
Created attachment 10173 [details]
printcap update correctness test
Comment 36 David Disseldorp 2014-08-04 13:04:31 UTC
Created attachment 10174 [details]
simple smbclient and rpcclient load generator
Comment 37 David Disseldorp 2014-08-04 13:15:37 UTC
I plan to submit the following upstream today...

The following changes since commit fb9d8c402614556d7a36f9e9efb72b3f4afe838a:

  lib: Remove unused nstrcpy (2014-08-04 09:58:16 +0200)

are available in the git repository at:

  git://git.samba.org/ddiss/samba.git printer_list_traverse_read_when_changed_notrans_master

for you to fetch changes up to 0b823e038e1c807b2ff9f13256a4b1b7b33c7fbd:

  smbd: split printer reload processing (2014-08-04 14:11:21 +0200)

----------------------------------------------------------------
David Disseldorp (7):
      printing: traverse_read the printer list for share updates
      printing: only reload printer shares on client enum
      printing: reload printer_list.tdb from in memory list
      printing: remove pcap_cache_add()
      printing: return last change time with pcap_cache_loaded()
      smbd: only reprocess printer_list.tdb if it changed
      smbd: split printer reload processing

 source3/printing/load.c                     |  4 ++--
 source3/printing/pcap.c                     | 54 ++++++++++++++++++++++++++----------------------------
 source3/printing/pcap.h                     | 13 ++++++-------
 source3/printing/print_aix.c                | 17 ++++++++++++++---
 source3/printing/print_iprint.c             | 16 ++++++++++------
 source3/printing/print_standard.c           |  8 ++++++--
 source3/printing/print_svid.c               | 11 +++++++----
 source3/printing/printer_list.c             | 17 +++++++++++------
 source3/printing/printer_list.h             |  4 ++--
 source3/printing/queue_process.c            | 10 ++++++----
 source3/printing/spoolssd.c                 | 38 +++++++-------------------------------
 source3/rpc_server/spoolss/srv_spoolss_nt.c | 11 ++++++++++-
 source3/rpc_server/srvsvc/srv_srvsvc_nt.c   |  1 +
 source3/smbd/proto.h                        |  2 ++
 source3/smbd/server.c                       | 20 --------------------
 source3/smbd/server_reload.c                | 95 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
 16 files changed, 194 insertions(+), 127 deletions(-)
Comment 38 David Disseldorp 2014-08-09 00:53:23 UTC
Created attachment 10187 [details]
patch series for v4-1-test branch, cherry-picked from master

One very slight change from the master patch-set, in order to handle the absence of
4286e147a62af784 - loadparm: rename lp[cfg]_print_ok to lp[cfg]_printable for consistency with docs
Comment 39 David Disseldorp 2014-08-11 13:27:23 UTC
Created attachment 10190 [details]
patch series for v4-0-test branch

Mostly the same as the 4.1 patch-set, with Swat and minor context changes.
Comment 40 David Disseldorp 2014-08-11 13:38:23 UTC
Created attachment 10191 [details]
patch series for v3-6 branch
Comment 41 Jeremy Allison 2014-08-11 22:35:27 UTC
Comment on attachment 10187 [details]
patch series for v4-1-test branch, cherry-picked from master

LGTM.
Comment 42 Jeremy Allison 2014-08-11 23:31:07 UTC
Comment on attachment 10190 [details]
patch series for v4-0-test branch

LGTM.
Comment 43 Jeremy Allison 2014-08-11 23:31:33 UTC
Re-assigning to Karolin for inclusion in 4.1.next, 4.0.next.
Comment 44 David Disseldorp 2014-08-12 08:53:49 UTC
Thanks for the review Jeremy!
Comment 45 Karolin Seeger 2014-09-01 19:09:25 UTC
Pushed to autobuild-v4-[0|1]-test.
Comment 46 Karolin Seeger 2014-09-03 07:12:46 UTC
Pushed to both branches.
Closing out bug report.

Thanks!
Comment 47 Alex K 2014-09-19 17:09:13 UTC
Created attachment 10295 [details]
screenshot-after.png

Just to show you guys how awesome you are (look at the blue line -- CPU). 
Thanks, much appreciated!