Bug 12769 - error allocating core memory buffers (code 22) depending on source file system
Summary: error allocating core memory buffers (code 22) depending on source file system
Status: RESOLVED FIXED
Alias: None
Product: rsync
Classification: Unclassified
Component: core
Version: 3.1.0
Hardware: All Linux
Importance: P5 normal
Target Milestone: ---
Assignee: Wayne Davison
QA Contact: Rsync QA Contact
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-05-05 13:54 UTC by Roland Haberkorn
Modified: 2021-10-05 12:56 UTC
CC List: 2 users

See Also:


Attachments
remove MALLOC_MAX from util2.c (1.33 KB, patch)
2020-06-24 11:42 UTC, MulticoreNOP

Description Roland Haberkorn 2017-05-05 13:54:02 UTC
We run openSuSE Leap 42.2 and Ubuntu 14.04.5 on two servers. Copying a large number of files (about 28 million in this case) leads to different results depending on the source file system.
We copy with rsync -rlptgoDAxHnP --info=progress2 --delete --link-dest=$LINK_DEST root@$SERVER:/$FOLDER /backup/rsynctest/ . Replacing --delete with --delete-delay does not change the behaviour, as expected. The error occurs with and without the option -n; it is included here only for testing.
If the source is located on an Ext4 file system, we run into the following error message after about 26 million files:
ERROR: out of memory in hashtable_node [sender]
rsync error: error allocating core memory buffers (code 22) at util2.c(102) [sender=3.1.0]
If the source is located on an XFS file system, the same command copies all files without error.
Both file systems hold the same data, as one is the backup copy of the other. The behaviour also appears when we use rsync via an rsync daemon instead of SSH, and when we copy locally on one of the two machines. It appears regardless of the operating system (openSuSE 42.2 or Ubuntu 14.04.5).
I have not tried it recently, but restoring the backup from an Ext4 source showed this error at least a year ago as well. After the switch to XFS as the source file system the error disappeared.
Since the error appears even with --dry-run, it seems to be related to the way rsync handles metadata; the data size seems to be irrelevant.
Comment 1 Roland Haberkorn 2017-05-05 15:14:57 UTC
If you want me to run further tests with other file systems, I am happy to generate fake data and test. I haven't done so yet because of my limited knowledge of the underlying mechanisms, and because I am not sure whether this is an rsync problem or a kernel issue.
Two things to add: we also saw this issue when the data was mounted via NFSv3 or v4. And the target file system does not matter; we have seen the issue with btrfs, Ext4 and XFS.
Comment 2 Roland Haberkorn 2017-05-19 11:48:07 UTC
I did some further investigation.
First, one thing to add: the ext4 file systems hold hard-linked differential rsync backups of the real data, which lives on XFS.
I changed the test case by dropping the --link-dest option.
When rsyncing from XFS, the rsync process on the client uses about 3% of the 8 GB of RAM. When rsyncing from ext4, it uses up to about 50%.
The picture changes completely when I drop the option -H: then copying from ext4 also uses less than 2% of RAM.
My guess is that -H perhaps breaks the incremental recursion when copying from ext4.
Comment 3 Roland Haberkorn 2017-07-24 15:20:16 UTC
OK, I dug somewhat deeper and found a second difference between my two sources: one holds the original data, the other a differential rsync backup with hard links.
I then built a test case of about 50 million dummy files with something like this:

#!/bin/bash
for i in {1..50}
do
    mkdir $i
    for j in {1..1000}
    do
        mkdir $i/$j
        for k in {1..1000}
        do
            touch $i/$j/$k
        done
    done
done

Rsyncing this test folder works fine from and to any of the tested file systems (ext4 64-bit, XFS, btrfs), with or without the option -H, as long as the source file system contains no hard-linked copy of the source folder (see the example at the end of this comment for how such a copy is made).
As soon as there is at least one hard-linked copy, the option -H breaks the run:

roland@msspc25:~$ stat /mnt/rsynctestsource/1/1/1
  File: /mnt/rsynctestsource/1/1/1
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 811h/2065d      Inode: 153991349   Links: 2
Access: (0644/-rw-r--r--)  Uid: ( 2001/  roland)   Gid: (  100/   users)
Access: 2017-05-09 10:54:10.341300841 +0200
Modify: 2017-05-08 09:33:51.535967423 +0200
Change: 2017-05-26 16:21:57.610628573 +0200
 Birth: -
roland@msspc25:~$ rsync -rlptgoDxAn --info=name,progress2  --delete --link-dest=/mnt2/link3/ /mnt/rsynctestsource/ /mnt2/link1/.
              0 100%    0.00kB/s    0:00:00 (xfr#0, to-chk=0/49049050)   
roland@msspc25:~$ rsync -rlptgoDxHAn --info=name,progress2  --delete --link-dest=/mnt2/link3/ /mnt/rsynctestsource/ /mnt2/link1/.
              0 100%    0.00kB/s    0:00:00 (xfr#0, ir-chk=1000/25191050)
ERROR: out of memory in hashtable_node [sender]
rsync error: error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]

As you can see, the first run without -H works; the second one with -H does not.

So I should probably rename this bug report to "-H breaks the incremental recursion on hard-linked sources". This holds for all three file systems tested.
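
For reference, the hard-linked copy can be created with something like this (the path names are just examples; cp -al builds a tree of hard links rather than copying data):

cp -al /mnt/rsynctestsource /mnt/rsynctestsource-hardlinks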
Comment 4 Ovidiu Stanila 2019-09-09 10:16:59 UTC
We hit the same issue on a CentOS 6 server (kernel 2.6.32-754.18.2.el6.x86_64); the sync breaks with the following error:

# /usr/bin/rsync --debug=HASH --stats --no-inc-recursive -aHn --delete /app/ <remote>:/app/
[sender] created hashtable 2013770 (size: 16, keys: 64-bit)
[sender] created hashtable 2015370 (size: 512, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 1024, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 2048, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 4096, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 8192, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 16384, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 32768, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 65536, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 131072, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 262144, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 524288, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 1048576, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 2097152, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 4194304, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 8388608, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 16777216, keys: 64-bit)
[sender] growing hashtable 2015370 (size: 33554432, keys: 64-bit)
ERROR: out of memory in hashtable_node [sender]
rsync error: error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]

Both the sender and the receiver have the same rsync and OS versions.

The source we are trying to transfer is over 3.4 TB (65 million files) with a large number of hard links between various directories (to avoid duplicate files).

We initially increased the memory from 16 GB to 32 GB, but even then rsync died with the same error. In the example above, more than 22 GB of RAM were free at the point rsync stopped working.

We also tried "--inplace" instead of "--no-inc-recursive" with the same result.

Did we hit some kind of limitation inside rsync? Is there anything else we should check?
Comment 5 Simon Matter 2019-10-11 12:53:02 UTC
I'm hitting the same problem and was wondering whether anyone has found a solution, a workaround, or another tool that does the job.

At first I thought it might be a bug that has already been fixed, so I tried the latest release, 3.1.3. Unfortunately, no joy; I still get the same issue.

Any help would be much appreciated!
Comment 6 Dave Gordon 2019-10-21 20:47:36 UTC
The hash table doubles each time it reaches 75% full. A hash table for 32m items @ 16 bytes each (8-byte key, 8-byte void *data) needs 512 MB of memory. At the next doubling (to 64m items) it hits the array allocator limit in util2.c:

#define MALLOC_MAX 0x40000000

void *_new_array(unsigned long num, unsigned int size, int use_calloc)
{
        if (num >= MALLOC_MAX/size)
                return NULL;
        return use_calloc ? calloc(num, size) : malloc(num * size);
}

void *_realloc_array(void *ptr, unsigned int size, size_t num)
{
        if (num >= MALLOC_MAX/size)
                return NULL;
        if (!ptr)
                return malloc(size * num);
        return realloc(ptr, size * num);
}

No single array allocation or reallocation may reach MALLOC_MAX (1 GiB). Hence rsync can only handle up to 32m items per invocation whenever a hash table is required (e.g. for tracking hard links when -H is specified).
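
As a quick standalone illustration (this is not rsync code, just the same guard applied to 16-byte nodes), the refusal happens exactly at the doubling from 32m to 64m entries:

#include <stdio.h>

#define MALLOC_MAX 0x40000000   /* 1 GiB, as in util2.c */

int main(void)
{
        unsigned int size = 16; /* 8-byte key + 8-byte data pointer */
        for (unsigned long num = 512; num <= 67108864; num *= 2) {
                /* the same check that _new_array() applies */
                printf("%10lu entries -> %s\n", num,
                       num >= MALLOC_MAX/size ? "REFUSED" : "ok");
        }
        return 0;
}

This matches the log in comment 4: the table reaches 33554432 entries, and the next doubling would need 67108864 * 16 bytes = 1 GiB, which the check refuses.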

HTH,
Dave
Comment 7 Simon Matter 2020-03-20 13:17:27 UTC
I've patched it like this to solve our issues:

--- rsync-3.1.3/util2.c.orig	2018-01-15 04:55:07.000000000 +0100
+++ rsync-3.1.3/util2.c	2020-03-11 13:07:07.138141415 +0100
@@ -59,7 +59,7 @@
 	return True;
 }
 
-#define MALLOC_MAX 0x40000000
+#define MALLOC_MAX 0x100000000
 
 void *_new_array(unsigned long num, unsigned int size, int use_calloc)
 {
Comment 8 Roland Haberkorn 2020-03-27 14:58:22 UTC
Is it possible to get rid of the restriction entirely? I'd rather run into genuine out-of-memory situations than into this restriction. Without it, I could just throw more RAM at the machine; with it, I have to rebuild rsync whenever a new version comes out.
Comment 9 MulticoreNOP 2020-06-24 11:42:20 UTC
Created attachment 16074 [details]
remove MALLOC_MAX from util2.c

The patch changes the data types from "unsigned int" to "size_t", as they should be, and removes the arbitrary size limit.
Comment 10 MulticoreNOP 2020-06-24 11:44:04 UTC
(In reply to Simon Matter from comment #7)
#define MALLOC_MAX 0x100000000

is greater than UINT32_MAX and will therefore overflow, failing in an unpredictable and unfriendly manner.

#define MALLOC_MAX 0xD09DC300
(~3.5 GiB) leaves some headroom to detect a "just too big for this implementation" case and fail gracefully.

Yet the real culprit here is the use of "unsigned int" instead of size_t.

I therefore propose the attached patch, which removes MALLOC_MAX entirely and should allow arrays as big as the available virtual memory can support.
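
The core of the idea looks like this (a minimal sketch of the size_t approach, not the attached patch verbatim; the only remaining guard is against the multiplication overflowing):

#include <stdint.h>
#include <stdlib.h>

void *_new_array(size_t num, size_t size, int use_calloc)
{
        /* refuse only if num * size would overflow size_t;
         * beyond that, let malloc/calloc decide what fits */
        if (size && num > SIZE_MAX / size)
                return NULL;
        return use_calloc ? calloc(num, size) : malloc(num * size);
}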
Comment 11 MulticoreNOP 2020-06-24 11:47:45 UTC
I want to add that the original implementation also leads to the following error:


ERROR: out of memory in flist_expand [sender]
rsync error: error allocating core memory buffers (code 22) at util2.c(106) [sender=3.1.2]

I hit that message with an
rsync --delete-before -H --moreStuff..
at around the 117-million-file mark.

The patch fixed that problem for me.
Comment 12 Wayne Davison 2020-06-26 03:56:35 UTC
I fixed the allocation args to be size_t values (and improved a bunch of allocation error checking while I was at it).

I then added an option that lets you override this allocation sanity-check value. The default is still 1G per allocation, but you can now specify a much larger value (up to "--max-alloc=8192P-1").

If you want to make a larger value the default for your copies, export RSYNC_MAX_ALLOC in the environment with the size value of your choice.
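
For example, to allow up to 4 GiB per allocation (4G here is just an illustrative value, using the same size-suffix syntax as the 8192P-1 maximum above):

rsync --max-alloc=4G -aH source/ dest/

or, as an environment default:

export RSYNC_MAX_ALLOC=4G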

Committed for release in 3.2.2.
Comment 13 MulticoreNOP 2020-06-26 07:32:51 UTC
(In reply to Wayne Davison from comment #12)
Hi Wayne,

that is great news!

Could you shed some light on why there is such a limit in the first place?

Personally, I think such an arbitrary limit is rather unexpected, and many people will have a long-running rsync command (each attempt took ~6 hours for me before I could reproduce the error) fail on them before they even become aware that such a limit exists.

On top of that, I suspect only a small fraction of those people will find this parameter as the solution to their problem.

To me this is like 'cp -R a b' failing if a/ contains more than 1337 files.
Cheers,

Mc NOP
Comment 14 Roland Haberkorn 2021-10-05 12:56:10 UTC
After the new version made it into my system, I can confirm that it works like a charm. Many thanks for the effort.