5976 – Continuous smbd panics after upgrade - printing affected.

Bug 5976 - Continuous smbd panics after upgrade - printing affected.

Summary: Continuous smbd panics after upgrade - printing affected.

Status:	RESOLVED FIXED

Alias:	None

Product:	Samba 3.2
Classification:	Unclassified
Component:	Printing (show other bugs)
Version:	3.2.6
Hardware:	x86 Windows XP

Importance:	P3 major
Target Milestone:	---
Assignee:	Samba Bugzilla Account
QA Contact:	Samba QA Contact

URL:
Keywords:

Depends on:
Blocks:

Reported:	2008-12-16 05:05 UTC by Pedro Rodrigues
Modified:	2009-05-11 05:44 UTC (History)
CC List:	2 users (show)

See Also:

Attachments
proposed patch (1.91 KB, patch) 2009-01-24 05:32 UTC, Volker Lendecke	no flags	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Pedro Rodrigues 2008-12-16 05:05:01 UTC

Upgraded to Samba 3.2.6, and immediately printing went into coma. Observed
that printers lose their configurations (driver models, locations, etc).
And if I reset the printer drivers on each share, eventually they will be
lost again.

Looking at the logs we are getting panics like this:

---
[2008/12/16 05:12:51,  0] lib/util_tdb.c:tdb_log(682)
 tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode has
changed!
[2008/12/16 05:12:51,  0] lib/util.c:reinit_after_fork(1050)
 tdb_reopen_all failed.
[2008/12/16 05:12:51,  0] printing/print_cups.c:cups_pcap_load_async(392)
 cups_pcap_load_async: reinit_after_fork() failed
[2008/12/16 05:12:51,  0] lib/util.c:smb_panic(1663)
 PANIC (pid 2590): cups_pcap_load_async: reinit_after_fork() failed
[2008/12/16 05:12:51,  0] lib/util.c:log_stack_trace(1767)
 BACKTRACE: 11 stack frames:
  #0 smbd(log_stack_trace+0x2d) [0x479c2f]
  #1 smbd(smb_panic+0x84) [0x479d71]
  #2 smbd(cups_cache_reload+0x299) [0x440d2c]
  #3 smbd(pcap_cache_reload+0x107) [0x43d189]
  #4 smbd(reload_printers+0x31) [0x6c477a]
  #5 smbd(check_reload+0x130) [0x2c6510]
  #6 smbd [0x2c655f]
  #7 smbd(smbd_process+0xc92) [0x2c8a76]
  #8 smbd(main+0x212c) [0x6c70ec]
  #9 /lib/libc.so.6(__libc_start_main+0xdc) [0xdd9dec]
  #10 smbd [0x24c061]
[2008/12/16 05:12:51,  0] lib/fault.c:dump_core(201)
 dumping core in /var/log/samba/cores/smbd
---

They come in pairs, and always relate to one of the printers tdb file at /var/lib/samba/printing/ .

I've already checked the tdb files for corruption and cleaned some of them (the ones I can), with no luck.

Comment 1 Pedro Rodrigues 2008-12-16 05:07:36 UTC

Should add we were upgrading from Samba 3.2.3, and configuration hasn't been changed since 3.2.1. Also, we are running it in Centos 5.2 fully updated.

Comment 2 Jeremy Allison 2008-12-16 12:16:12 UTC

> This message is the key :
>
>>   tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode
>>   has
>> changed!
>> [2008/12/16 05:12:51,  0] lib/util.c:reinit_after_fork(1050)
>>   tdb_reopen_all failed.
>
> which causes a deliberate panic in the code here :
>
>         /* Child. */
>         if (!reinit_after_fork(smbd_messaging_context(), true)) {
>                 DEBUG(0,("cups_pcap_load_async: reinit_after_fork()
> failed\n"));
>                 smb_panic("cups_pcap_load_async: reinit_after_fork()
> failed");
>         }
>
> How many printers do you have ? Do you have some sort of a batch
> job running that may delete printing tdb's ?
>
> Jeremy.
>

There are five printers, four of which with p&p drivers. And no, no batch
running, nothing that even touches the tdb files.

A ten minuet session of lsof shows only smbd accesses those tdb files. I
would be surprised if it was otherwise, since only Windows clients ever
hit this server.

/Pedro

Comment 3 Jeremy Allison 2008-12-16 12:21:35 UTC

This :

tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode has
changed!

is the line that confuses me. On a reopen, how can the dev/inode have changed unless the file has been deleted ? What filesystem type are you using for var/lib/samba/printing/CanonC1.tdb ?

Jeremy.

Comment 4 Pedro Rodrigues 2008-12-16 15:25:19 UTC

(In reply to comment #3)
> This :
> 
> tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode has
> changed!
> 
> is the line that confuses me. On a reopen, how can the dev/inode have changed
> unless the file has been deleted ? What filesystem type are you using for
> var/lib/samba/printing/CanonC1.tdb ?
> 
> Jeremy.
> 

It's an ext3 filesystem:

/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw,noatime)

Output from df:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      16188328   2653988  12698756  18% /

Could some kind of filesystem corruption do this?

/Pedro

Comment 5 Jeremy Allison 2008-12-16 15:28:14 UTC

No, it shouldn't be filesystem corruption. Does this always happen to a <printername>.tdb file when someone is printing to that printer ? How reproducible is this (and exactly what do you do to reproduce it) ?

Jeremy.

Comment 6 Pedro Rodrigues 2008-12-16 15:37:15 UTC

(In reply to comment #3)
> This :
> 
> tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode has
> changed!
> 
> is the line that confuses me. On a reopen, how can the dev/inode have changed
> unless the file has been deleted ? What filesystem type are you using for
> var/lib/samba/printing/CanonC1.tdb ?
> 
> Jeremy.
> 

It's an ext3 filesystem:

/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw,noatime)

Output from df:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      16188328   2653988  12698756  18% /

Could some kind of filesystem corruption do this?

/Pedro




(In reply to comment #5)
> No, it shouldn't be filesystem corruption. Does this always happen to a
> <printername>.tdb file when someone is printing to that printer ? How
> reproducible is this (and exactly what do you do to reproduce it) ?
> 
> Jeremy.
> 

I don't need to reproduce it. In the last 24 hours we got 394 "tdb_reopen: file dev/inode has changed!" messages, with a corresponding panic and core dump. It is happening as I am writing this.

It happens more frequently during the day , of course. At night there is very little if any activity on the server. I assume any activity is because users leave their workstations on.

/Pedro

Comment 7 Pedro Rodrigues 2008-12-16 15:40:25 UTC

(In reply to comment #3)
> This :
> 
> tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode has
> changed!
> 
> is the line that confuses me. On a reopen, how can the dev/inode have changed
> unless the file has been deleted ? What filesystem type are you using for
> var/lib/samba/printing/CanonC1.tdb ?
> 
> Jeremy.
> 

It's an ext3 filesystem:

/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw,noatime)

Output from df:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      16188328   2653988  12698756  18% /

Could some kind of filesystem corruption do this?

/Pedro




(In reply to comment #5)
> No, it shouldn't be filesystem corruption. Does this always happen to a
> <printername>.tdb file when someone is printing to that printer ? How
> reproducible is this (and exactly what do you do to reproduce it) ?
> 
> Jeremy.
> 

I don't need to reproduce it. In the last 24 hours we got 394 "tdb_reopen: file dev/inode has changed!" messages, with a corresponding panic and core dump. It is happening as I am writing this.

It happens more frequently during the day , of course. At night there is very little if any activity on the server. I assume any activity is because users leave their workstations on.

/Pedro(In reply to comment #6)
> (In reply to comment #3)
> > This :
> > 
> > tdb(/var/lib/samba/printing/CanonC1.tdb): tdb_reopen: file dev/inode has
> > changed!
> > 
> > is the line that confuses me. On a reopen, how can the dev/inode have changed
> > unless the file has been deleted ? What filesystem type are you using for
> > var/lib/samba/printing/CanonC1.tdb ?
> > 
> > Jeremy.
> > 
> 
> It's an ext3 filesystem:
> 
> /dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw,noatime)
> 
> Output from df:
> 
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/mapper/VolGroup00-LogVol00
>                       16188328   2653988  12698756  18% /
> 
> Could some kind of filesystem corruption do this?
> 
> /Pedro
> 
> 
> 
> 
> (In reply to comment #5)
> > No, it shouldn't be filesystem corruption. Does this always happen to a
> > <printername>.tdb file when someone is printing to that printer ? How
> > reproducible is this (and exactly what do you do to reproduce it) ?
> > 
> > Jeremy.
> > 
> 
> I don't need to reproduce it. In the last 24 hours we got 394 "tdb_reopen: file
> dev/inode has changed!" messages, with a corresponding panic and core dump. It
> is happening as I am writing this.
> 
> It happens more frequently during the day , of course. At night there is very
> little if any activity on the server. I assume any activity is because users
> leave their workstations on.
> 
> /Pedro
> 

(In reply to comment #5)
> No, it shouldn't be filesystem corruption. Does this always happen to a
> <printername>.tdb file when someone is printing to that printer ? How
> reproducible is this (and exactly what do you do to reproduce it) ?
> 
> Jeremy.
> 


Oh, and no one needs to be printing, also. We are getting these now, and I am sure there is no one working at that office at this time.

/Pedro

Comment 8 Michael Croes 2009-01-22 04:55:53 UTC

Having the same issue as Pedro Rodrigues explains. We applied the patch from commit 101ef64e3181335b66524296af08f7df04391b9a . The panics stopped, but we keep losing printer settings. It seems to be that samba is corrupting the pdb files.

Also it seems that not all threads contain the same information about printers. I've experienced that windows clients will show all the correct settings when displaying the printer properties page for the server printer, but when trying to connect windows will say the driver is missing and then trying to connect again will succeed without errors.

If you you have a possible solution I'll be happy to test it on at least one of our servers.

Comment 9 Michael Croes 2009-01-22 08:30:38 UTC

That is tdb files of course, not pdb.

Comment 10 Martijn Berger 2009-01-23 12:54:06 UTC

I have tried 101ef64e3181335b66524296af08f7df04391b9a but for me that did resolve the crashing issue. 
Currently runing 3.2.7 + 
http://samba.org/~vlendec/0001-Ensure-null-termination.patch

This seems so fix the issue and my issue with rinter settings unsetting.
I will post again on in some days if it the fix still holds (no crashes or printers unsetting)

Comment 11 Volker Lendecke 2009-01-23 12:58:24 UTC

Looking closer, this definitely is another workaround for a deeper problem. The memory overflow that this fixes just "can't happen" from looking at the code. So there's deeper problems. I've spent the last hours staring at that code, I hope I'll find something over the weekend.

Volker

Comment 12 Volker Lendecke 2009-01-23 17:40:27 UTC

Took the better part of today, but I think why my patch fixes a valgrind error. In 3.2.7, after a fork, smbd_event_context() is not correctly re-initialized. Thus if we have just forked a cups printcap updater and connect while that is still active, we have two processes trying to read from the pipe that is only intended for the parent process which has forked the updater. Thus we get garbage in the binary pipe protocol. I have no clue yet how this can affect the tdb dev/ino thing, but this particular one is fixed in v3-2-test where we have a proper event_context_reinit.

Volker

Comment 13 Martijn Berger 2009-01-24 03:52:38 UTC

Ok both my machines ( one 3.2.7 + http://samba.org/~vlendec/0001-Ensure-null-termination.patch)
( and the other 3.2.7 + http://samba.org/~vlendec/0001-Ensure-null-termination.patch + 7648b8f94c0709f39b873b872bc1273a6568ad3b + e72978ae81580b1ec74d3b21df82dd0d49551c64) 
Have not had crashed smbd processes but they have both unset one or more printer drivers overnight.

rpcclient enumprinters
description:[\\atlas\ToshibaESTD2500c,TOSHIBA e-STUDIO3510cSeries PS3,ToshibaESTD2500c]
turned into: 
description:[\\atlas\ToshibaESTD2500c,,ToshibaESTD2500c]
without any users doing anything. Last person left the office yesterday and then at about 20:00 it was still ok this morning it was broken again.

I think I will have to donwgrade to 3.0 since users are not always willing guinea pigs.

Part of my log not sure if this is helpfull:

Jan 23 20:44:58 [smbd] [2009/01/23 20:44:58,  0] lib/substitute.c:alloc_sub_basic(556)_
Jan 23 20:44:58 [smbd] alloc_sub_basic: NULL source string!  This should not happen_

Comment 14 Volker Lendecke 2009-01-24 04:21:34 UTC

See my latest comment. I'll have a patch that fixes the problem I mentioned in a few hours. I have a theory why this can also cause the tdb reopen problems. I'm very certain that the v3-2-test git tree fixes this, but I'll provide a minimal patch latest by sunday.

Volker

Comment 15 Volker Lendecke 2009-01-24 05:32:42 UTC

Created attachment 3891 [details]
proposed patch

Ok, I think this is it. It does not go upstream, v3-2-test has already fixed this in another way.

Volker

Comment 16 Jeremy Allison 2009-01-24 12:27:40 UTC

Wow - thanks for tracking that one down. I would *never* have found that :-).
Great detective work.
Jeremy.

Comment 17 Karolin Seeger 2009-05-11 05:44:11 UTC

Closing out bug report.
Seems to be fixed.

Thanks for reporting!