Bug 1998 - queue/cancel on AIX broken (jobids messed up)
Summary: queue/cancel on AIX broken (jobids messed up)
Status: CLOSED FIXED
Alias: None
Product: Samba 3.0
Classification: Unclassified
Component: Printing (show other bugs)
Version: 3.0.10
Hardware: PPC AIX
: P3 major
Target Milestone: none
Assignee: Gerald (Jerry) Carter (dead mail address)
QA Contact: Samba QA Contact
URL:
Keywords:
: 2298 (view as bug list)
Depends on:
Blocks:
 
Reported: 2004-11-04 12:52 UTC by Eduard Bloch
Modified: 2005-08-24 10:19 UTC (History)
1 user (show)

See Also:


Attachments
log.smbd (comment #12) (52.83 KB, application/x-gzip)
2005-03-15 05:32 UTC, Olaf Imig
no flags Details
This is the fix for this Bug (1.55 KB, patch)
2005-04-26 06:31 UTC, Olaf Imig
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Eduard Bloch 2004-11-04 12:52:51 UTC
Hello,

I have a problem on our AIX system with samba-3.0.7 and 3.0.8-pre2. Summary: I
get weird jobid numbers from the queue command in smbclient using that AIX
server. I cannot delete anything.

Long story:

when I login with smbclient to a printers share for the first time and I run the
queue command, I see the job IDs starting from 1 and enumerating to the last
one. This looks okay. When I repeat the queue command, I see weird numbers. They
beginn with 256 and increase in steps of 256 (not 1). Sometimes, when a fresh
job is comming in, I see an entry with the expected number (say, for 101 job, I
see 245, 512, ..., 25600 and then 101).

When I try to delete any of this jobs, either with the number reported by
smbclient, or the one I get by division of the reported one by 256, it always
says Job FOO deleted. But nothing is deleted, it is still in the queue (visible
with the queue command and lpq on the command line). BTW, I know this behaviour
from my Linux server using lprng. There, it always says "Job FOO deleted" when
nothing has been deleted, but "Error cancelling job FOO : NT_STATUS_OK" if it
was successful.

Only once I was able to do delete a job on the AIX server, when I set cache time
to 30 and entered "cancel nr" very quickly after the _first_ run of "queue". I
was not able to reproduce it again, no matter what I used as cache time.

To make sure that it is not our fault, I added some log commans to the "lprm
command", "lpr command" and "lpq command" vars - the last ones do work as
expected, lprm command does not work. Not executed at all. I compared the output
of our lpq command with the example in lpq_parse.c, the format is identical.

With debugging, the following messages (grep'ed for "job") appear:

[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(84)
  pjobid_to_rap: jobid 10750 maps to RAP jobid 38400
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(64)
  pjobid_to_rap: called.
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(84)
  pjobid_to_rap: jobid 10748 maps to RAP jobid 38144
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(64)
  pjobid_to_rap: called.
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(84)
  pjobid_to_rap: jobid 58 maps to RAP jobid 38656
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(64)
  pjobid_to_rap: called.
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(84)
  pjobid_to_rap: jobid 5668 maps to RAP jobid 38912
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(64)
  pjobid_to_rap: called.
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(84)
  pjobid_to_rap: jobid 10763 maps to RAP jobid 39168
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(64)
  pjobid_to_rap: called.
[2004/11/04 20:30:50, 10] printing/printing.c:pjobid_to_rap(84)
  pjobid_to_rap: jobid 5674 maps to RAP jobid 39424
[2004/11/04 20:30:52, 10] printing/printing.c:rap_to_pjobid(106)
  rap_to_pjobid called.
[2004/11/04 20:30:52, 10] printing/printing.c:rap_to_pjobid(126)
  rap_to_pjobid: Failed to lookup RAP jobid 38656

And now, I really confused. What are those "jobids" assigned to RAP entries?
They are not real jobids (that would be somewhere between 500 and 800) and they
are not lines in lpq output or something like that. And why there is something
not beeing resolved that has been stored before (38656)?

Thanks for your work,
Eduard.
Comment 1 Gerald (Jerry) Carter (dead mail address) 2004-11-12 10:18:26 UTC
Might be realted to BUG 1941
Comment 2 Eduard Bloch 2004-11-13 07:42:22 UTC
(In reply to comment #1)
> Might be realted to BUG 1941

Maybe. But that one is marked as fixed, but this bug is not. I get the same
results with 3.0.8.
Comment 3 Gerald (Jerry) Carter (dead mail address) 2004-11-13 09:55:57 UTC
Yeah./  I mentioned bug 1941 based on the subject 
(related to lprng) but then I looked at the details
and realized that the lprng issue reported was just 
a misconfiguration.

I'm still looking into this one.
Comment 4 Gerald (Jerry) Carter (dead mail address) 2005-01-06 09:46:36 UTC

*** This bug has been marked as a duplicate of 2170 ***
Comment 5 Gerald (Jerry) Carter (dead mail address) 2005-02-03 07:41:20 UTC
not fixed.  Reopening....
Comment 6 Gerald (Jerry) Carter (dead mail address) 2005-02-03 07:41:29 UTC
*** Bug 2298 has been marked as a duplicate of this bug. ***
Comment 7 Gerald (Jerry) Carter (dead mail address) 2005-02-07 09:29:28 UTC
could you test 3.0.11 and let me know how things go.  Lots of 
printing changes in that release.
Comment 8 Iván Couto 2005-02-09 03:56:27 UTC
The problems with jobids are still present in the 3.0.11 release. And it's 
even worse: when I kill a job (assuming that the real jobid is the showed/256) 
the job gets killed ok. But then, when I do a queue command the killed job 
appears with a new jobid (the highest jobid of the queue+256). 
My samba server is running on a Solaris 9 sparc and the clients are Debian 
Linux. 
Comment 9 Gerald (Jerry) Carter (dead mail address) 2005-02-09 04:59:48 UTC
ok.  I didnt' believe that this was fixed in 3.0.11 but wanted to 
start frrom the latest code.  I'll work on this some more.  Thanks 
for the feedback.
Comment 10 Gerald (Jerry) Carter (dead mail address) 2005-03-08 11:46:56 UTC
I cannot reproduce this one.  Can you attach a gzipped level 
10 debug log with this report?

If I understand correctly, all of this surrounds using 
smbclient to cancel print jobs.  Do WinNT/2k/xp clients 
work ok ?  What about Windows 9x ?
Comment 11 Gerald (Jerry) Carter (dead mail address) 2005-03-08 14:37:13 UTC
could you attach your smb.conf and the output from 
`lpq -P<printername>`.   Thanks.
Comment 12 Olaf Imig 2005-03-15 05:28:17 UTC
(In reply to comment #10)
> I cannot reproduce this one.  Can you attach a gzipped level 
> 10 debug log with this report?
> 
> If I understand correctly, all of this surrounds using 
> smbclient to cancel print jobs.  Do WinNT/2k/xp clients 
> work ok ?  What about Windows 9x ?
I can reproduce the described behavior exact. My system is a Samba-Printserver
at Solaris 8 in a domain. smbclient run at the same system.When I start a print
job at a Windows-XP-client and I do the que-command twice at smbclient I see the
following result:
Domain=[BIFAB] OS=[Unix] Server=[Samba 3.0.11]
smb: \> que
1        115760       Testseite
smb: \> que
256      115760       Testseite
I think the canceling-behavior is a result of this. 
I will create an attachment for this test with loglevel 10.
Comment 13 Olaf Imig 2005-03-15 05:32:07 UTC
Created attachment 1033 [details]
log.smbd (comment #12)
Comment 14 Olaf Imig 2005-04-26 06:28:28 UTC
(In reply to comment #13)
> Created an attachment (id=1033) [edit]
> log.smbd (comment #12)
> 
Because nobody is interested in fixing this bug, I did it myself. The following
patch for 
samba-3.0.15pre2 fix the failure in behavior of the the que-command (get weird
jobid numbers from the queue command in smbclient) and the failure in the
behavior of the cancel command in smbclient.
The reason for the bug is a byte order problem, that means we have this bug at
all BIG-ENDIAN machines. In the functions pjobid_to_rap, rap_to_pjobid,
rap_jobid_delete an uint16 is extracted with SVAL from a tdb-file. At BIG-ENDIAN
machines this switch the byte oder. The programmer forgot to put the uint16 with
SSVAL in the tdb-file. An simple cast work only on LITTLE-ENDIAN machines. For
example the functions get_updating_pid and set_updating_pid work in the right way.
Comment 15 Olaf Imig 2005-04-26 06:31:43 UTC
Created attachment 1174 [details]
This is the fix for this Bug
Comment 16 Gerald (Jerry) Carter (dead mail address) 2005-04-26 07:12:48 UTC
Olaf,  Thanks very much for the fix.  I knew it was a byte 
ordering issue when storing and retreiving the job id's.  But the 
actual 32-bit job id are ok.  It didn't realize it was the 16-bit 
RAP job ids.  And with a dead sparc, I could never reproduce it.
Wonderful work.  

Patch applied.  WIll be on the next 3.0.15 snapshot.




Comment 17 Gerald (Jerry) Carter (dead mail address) 2005-08-24 10:19:35 UTC
sorry for the same, cleaning up the database to prevent unecessary reopens of bugs.