This issue applies to 3.4, 3.5, and 3.6. The used printing system is CUPS. Known way to work around the issue: send a SIGHUP twice to smbd To reproduce you have to add a new printer via the lpadmin command lpadmin -p dummy0 -E -v /dev/null Use "lpstat -v | grep dummy" to check if the printer was added. Next call "smbclient -N -L localhost | grep dummy" to see if we got the printer. From the logs we see cups_cache_reload_async() gets called and the intended read and write callbacks are executed for the printer in question. From the log we see also pcap_cache_add_specific() gets called.
The function chain lp_servicenumber() -> add_a_service() -> hash_a_service() -> lp_add_printer() isn't triggered. But we see these functions being called in the case a second SIGHUP is sent. And without lp_add_printer() being called our new printer isn't available. Now the next question is why isn't it triggered already before? Therefore I like to understand what's causing the trigger or how this is expected to work.
Created attachment 6104 [details] Initial smbd log without the dummy0 printer (3.4.3)
Created attachment 6105 [details] smbd log after the printer dummy0 got added (3.4.3)
Created attachment 6106 [details] after first SIGHUP (3.4.3)
Created attachment 6107 [details] after 2nd SIGHUP (3.4.3)
For 3.5.6 this looks even worse. Even a third SIGHUP doesn't make the dummy0 printer visible. Only killing the running smbd and launching a new one showed the new printer. Stacked log files of the 3.5.6 case follow.
Created attachment 6112 [details] Initial smbd log without the dummy0 printer (3.5.6)
Created attachment 6113 [details] smbd log after the printer dummy0 got added (3.5.6)
Created attachment 6114 [details] after first SIGHUP (3.5.6)
Created attachment 6115 [details] after 2nd SIGHUP (3.5.6)
Created attachment 6116 [details] after 3rd SIGHUP (3.5.6)
Created attachment 6117 [details] after kill of the running and new launch of smbd (3.5.6) Now we see the dummy0 printer after a smbclient -L
Created attachment 6154 [details] Fix race in cups async printer services reload. Call post_cache_fill_fn() after cups_cache_reload() succeeded. Patch written in pair programming with David Disseldorp <ddiss at suse dot de>
BTW adding the PID to the log file via setting "debug pid = Yes" had been quite useful. Why aren't we doing this by default? In particular since we're using async code this wide spread nowadays?
Hmm, looks like the attached patch is not a simple rebase to master, an additional race condition is introduced with commit 7022554 - s3-printing: Use printer list tdb in pcap. Prior to commit 7022554 each smbd retained it's own pcap cache, therefore ordering of events between smbd processes during pcap update was not important. A two process example: 1) smbd1 receives HUP 2) smbd1 forks child1 to obtain cups printer info 3) smbd1 reloads printer shares prior to child1 completion (stale pcap) 4) smbd2 receives HUP 5) smbd2 forks child to obtain cups printer info 6) smbd2 reloads printer shares prior to child2 completion (stale pcap) 7) child1 completion, smbd1 pcap is updated 8) child2 completion, smbd2 pcap is updated The simple fix for the above race condition was to have smbd only reload printer shares following child completion, such that: 1) smbd1 receives HUP 2) smbd1 forks child1 to obtain cups printer info 3) smbd2 receives HUP 4) smbd2 forks child to obtain cups printer info 5) child1 completion, smbd1 pcap is updated 6) smbd1 reloads printer shares after child1 completion (fresh pcap) 7) child2 completion, smbd2 pcap is updated 8) smbd2 reloads printer shares after child2 completion (fresh pcap) Since commit 7022554, smbds share a pcap cache (printer_list.tdb), therefore ordering of events between smbd1 and smbd2 *is* important. A two process example: 1) smbd1 receives HUP 2) smbd1 checks whether pcap needs refresh, it does 3) smbd1 marks pcap as refreshed 4) smbd1 forks child1 to obtain cups printer info 5) smbd1 reloads printer shares prior to child1 completion (stale pcap) 6) smbd2 receives HUP 7) smbd2 checks whether pcap needs refresh, it does not (due to step 3) 8) smbd2 reloads printer shares prior to child1 completion (stale pcap) 9) child1 completion, pcap is updated In this case both smbd1 and smbd2 are reliant on the pcap update performed on child1 completion. The fix used in 3.4 is suitable for ensuring smbd1 obtains fresh pcap data for print shares, however determining when smbd2 should reload printer shares is not so straight forward.
(In reply to comment #15) > Since commit 7022554, smbds share a pcap cache (printer_list.tdb), > therefore ordering of events between smbd1 and smbd2 *is* important. A > two process example: > 1) smbd1 receives HUP > 2) smbd1 checks whether pcap needs refresh, it does > 3) smbd1 marks pcap as refreshed > 4) smbd1 forks child1 to obtain cups printer info > 5) smbd1 reloads printer shares prior to child1 completion (stale pcap) > 6) smbd2 receives HUP > 7) smbd2 checks whether pcap needs refresh, it does not (due to step 3) > 8) smbd2 reloads printer shares prior to child1 completion (stale pcap) > 9) child1 completion, pcap is updated > > In this case both smbd1 and smbd2 are reliant on the pcap update > performed on child1 completion. The fix used in 3.4 is suitable for > ensuring smbd1 obtains fresh pcap data for print shares, however > determining when smbd2 should reload printer shares is not so straight > forward. > Thinking about this a little more, it may be worth moving the pcap cache reload code to the parent (listener) smbd process such that: - the parent smbd adds a pcap_reload event handler which gets kicked on SIGHUP, smbcontrol reload and pcap cache time expiry - the pcap_reload event handler reloads printer_list.tdb from the pcap backend and emits MSG_PCAP_CHANGED on completion - child smbds register a message handler for MSG_PCAP_CHANGED which reads printer_list.tdb and updates printer services accordingly This wouldn't be a small change, but would have the following advantages from the current architecture: - removal of the race condition outlined in comment 15 - only one process requests printer information from the pcap backend - printer_list.tdb has only a single writer Does this sound sane?
Created attachment 6170 [details] reload printers test script
Comment on attachment 6154 [details] Fix race in cups async printer services reload. This version of the patch got superseded.
Created attachment 6186 [details] Patch set to make pcap cache changes visible after async refactor This patch set consists of: 0001-s3-printing-reload-shares-after-pcap-cache-fill.patch 0002-s3-printing-Initiate-pcap-reload-from-parent-smbd.patch 0003-s3-printing-remove-old-entries-in-pcap_cache_replace.patch 0004-s3-printing-remove-printer_list_need_refresh.patch 0005-s3-printing-use-time_mono-in-housekeeping-functions.patch and is taken from oss.sgi.com/ddiss/samba branch bso7836_fix_pcap_races
Comment on attachment 6154 [details] Fix race in cups async printer services reload. Obsoleted by attachment #6186 [details]
This issue is a release show stopper for 3.6.0.
I know. I'm planning to do the review today (I have a cold/flu :-). Jeremy.
(In reply to comment #22) > I know. I'm planning to do the review today (I have a cold/flu :-). Thanks Jeremy. I've rebased against master and incorporated changes following feedback from bj@SerNet.DE and asn@samba.org. New branch is: git://oss.sgi.com/ddiss/samba bso7836_fix_pcap_races_rb2 Get well soon :-)
Created attachment 6188 [details] bso7836_fix_pcap_races_rb2.patch
Comment on attachment 6186 [details] Patch set to make pcap cache changes visible after async refactor This patch got obsoleted by attachment #6188 [details]
The fix for 3.4 and 3.5 is less invasive, as smbd processes maintain separate printcap caches rather than sharing a common printer_list.tdb. See: git://oss.sgi.com/ddiss/samba bso7836_34t git://oss.sgi.com/ddiss/samba bso7836_35t Both branches include our change to reload shares after pcap cache fill, as well as a further commit which adds a parent smbd housekeeping function.
Created attachment 6190 [details] bso7836_34t.patch
Created attachment 6191 [details] bso7836_35t.patch
Comment on attachment 6190 [details] bso7836_34t.patch The patch looks fine for me.
Comment on attachment 6191 [details] bso7836_35t.patch The patch looks good.
Karolin, as both patches got a positive reviewe please merge attachment #6190 [details] into the 3.4 and #6191 into the 3.5 branch.
(In reply to comment #31) > Karolin, as both patches got a positive reviewe please merge attachment #6190 [details] > into the 3.4 and #6191 into the 3.5 branch. > Done. Closing out bug report. Thanks!
Reverted both patches for now, because "s3-printing: reload shares after pcap cache fill" broke 'make test' on my system (v3-4-test). Could you please investigate again? Thanks!
(In reply to comment #33) > Reverted both patches for now, because "s3-printing: reload shares after pcap > cache fill" broke 'make test' on my system (v3-4-test). v3-5-test seems to be affected also.
(In reply to comment #33) > Reverted both patches for now, because "s3-printing: reload shares after pcap > cache fill" broke 'make test' on my system (v3-4-test). > > Could you please investigate again? Will do, thanks Karolin.
two asserts introduced by the 'reload shares after pcap cache fill' change should be removed in the 3.4 and 3.5 branches: pcap_cache_loaded() assertions were added to the (re)load_printers() functions, to ensure the caller had called pcap_cache_reload() prior to reloading printer shares. The problem is, pcap_cache_loaded() returns false if the the pcap_cache contains no printer entries. i.e. pcap_cache_reload() has run but not detected any printers. Remove these assertions, correct call ordering is already enforced. These assertions do not cause problems in master, as pcap_cache_loaded() is determined by the existence of a time stamp in printer_list.tdb. The time stamp is added by printer_list_mark_reload() in which gets called in pcap_cache_reload(). I've pushed a new commit to both bso7836_34t and bso7836_35t branches which removes the asserts.
Created attachment 6208 [details] bso7836_34t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch
Created attachment 6209 [details] bso7836_35t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch
Re-assigning to Günther for patch review.
David, I'm confused, is it now a 3.5/3.4 issue or something for master/3.6 ?
(In reply to comment #40) > David, I'm confused, is it now a 3.5/3.4 issue or something for master/3.6 ? The fix for this bug made it into master and 3.6. The 3.5 and 3.4 changes are different, as each process maintains it's own pcap cache rather than the shared printer_list.tdb in 3.6+. Those changes have not yet made it into the tree.
ok, thanks.
Comment on attachment 6208 [details] bso7836_34t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch looks good
Comment on attachment 6209 [details] bso7836_35t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch looks good
karolin, please add to 3.4 and 3.5. thanks.
Comment on attachment 6209 [details] bso7836_35t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch Looks good
(In reply to comment #45) > karolin, please add to 3.4 and 3.5. thanks. Both patches do not apply to the current branches: user@host:/data/git/samba/v3-5-test> git am bso7836_35t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch Applying: s3-printing: remove pcap_cache_loaded asserts error: patch failed: source3/printing/load.c:57 error: source3/printing/load.c: patch does not apply error: patch failed: source3/smbd/server.c:760 error: source3/smbd/server.c: patch does not apply Patch failed at 0001 s3-printing: remove pcap_cache_loaded asserts user@host:/data/git/samba/v3-4-test> git am bso7836_34t-0001-s3-printing-remove-pcap_cache_loaded-asserts.patch Applying: s3-printing: remove pcap_cache_loaded asserts error: patch failed: source3/printing/load.c:57 error: source3/printing/load.c: patch does not apply error: patch failed: source3/smbd/server.c:715 error: source3/smbd/server.c: patch does not apply Patch failed at 0001 s3-printing: remove pcap_cache_loaded asserts Have I missed something?
Hi Karolin, (In reply to comment #47) > > Both patches do not apply to the current branches: The patches were on top of the initial changes which were reverted (rightly so) following make test failure. One option is to revert the revert commits, then pull in the "remove pcap_cache_loaded asserts" patch. i.e in v3.5-test: ddiss@lutze:~/isms/samba/source3> git log --pretty=oneline origin/v3-5-test |grep Revert|head -n 2 b6268f507fa3276c2ef22c58bad400a3fed48cd9 Revert "s3-printing: update parent smbd pcap cache" e4579eab7fe3eab7a5209e6de74e6fd2f53099d0 Revert "s3-printing: reload shares after pcap cache fill" > git revert e4579eab7fe3eab7a5209e6de74e6fd2f53099d0 > git revert b6268f507fa3276c2ef22c58bad400a3fed48cd9 I've pushed two new branches: git://oss.sgi.com/ddiss/samba bso7836_35t_rr and git://oss.sgi.com/ddiss/samba bso7836_34t_rr Revert "Revert "s3-printing: reload shares after pcap cache fill"" Revert "Revert "s3-printing: update parent smbd pcap cache"" s3-printing: remove pcap_cache_loaded asserts
Created attachment 6474 [details] 3-5-test.patch
Created attachment 6475 [details] v3-4-test.patch
Karolin, please add to 3.4 and 3.5, thanks.
(In reply to comment #51) > Karolin, please add to 3.4 and 3.5, thanks. Pushed, closing out bug report. Thanks!