
archive:create-from and archive:create support for archives larger than 2 GB #2624

Merged
ChristianGruen merged 8 commits into BaseXdb:main from vincentml:archive-create-from-2gb
Mar 26, 2026

Contributor

@vincentml vincentml commented Mar 24, 2026

This pull request makes it possible for archive:create and archive:create-from to create zip files that are larger than 2 GB.

Using BaseX version 12.2, I've run into errors when creating zip files with archive:create or archive:create-from once the total size of the files exceeds about 2 GB. A typical message is "java.lang.ArrayIndexOutOfBoundsException: Maximum array size exceeded (2147483640 > 2147483639)."

For example, this error is produced if the total size of a folder being zipped is 3 GB when passing the result of archive:create-from directly to file:write-binary:

declare variable $large_3gb_folder :=
  "path/to/folder";
declare variable $zipFile :=
  "path/to/file.zip";

file:write-binary($zipFile, archive:create-from($large_3gb_folder))

and when using a variable to pass the result of archive:create-from to file:write-binary:

declare variable $large_3gb_folder :=
  "path/to/folder";
declare variable $zipFile :=
  "path/to/file.zip";

let $archive := archive:create-from($large_3gb_folder)
return file:write-binary($zipFile, $archive)

After the changes in this pull request, the above queries produce the expected zip file and the error does not occur.

The current limitation of about 2 GB is due to the file contents being accumulated in memory and exceeding the maximum array size, which in Java is bounded by Integer.MAX_VALUE.
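To illustrate the limit, here is a minimal sketch that checks whether a payload could be held in a single Java byte array. The class and method names are hypothetical. The constant is an assumption chosen to match the bound reported in the error message above (Integer.MAX_VALUE - 8), not a value taken from the BaseX sources.

```java
public class ArrayLimit {
  /** Assumed practical array bound; equals the 2147483639 shown in the error message. */
  static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /** Returns true if the given number of bytes could be stored in one byte[]. */
  static boolean fitsInArray(final long bytes) {
    return bytes >= 0 && bytes <= MAX_ARRAY_SIZE;
  }

  public static void main(final String[] args) {
    final long threeGb = 3L * 1024 * 1024 * 1024;
    final long hundredMb = 100L * 1024 * 1024;
    // A 3 GB archive exceeds the bound, a 100 MB archive does not.
    System.out.println("3 GB fits:   " + fitsInArray(threeGb));
    System.out.println("100 MB fits: " + fitsInArray(hundredMb));
  }
}
```

Because array lengths are of type int, no amount of heap memory lifts this bound. Data beyond it has to be kept somewhere other than a single array.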

This pull request solves this problem by no longer accumulating all data in a single array. Instead, data is accumulated in memory up to a threshold, and the code switches to a temporary file once the data exceeds that threshold. The threshold is determined from available memory, capped at the maximum array size. The temporary file, if created, is deleted automatically. This approach attempts to optimize for the typical use cases of creating small or mid-size archives while making it possible to create very large archives.
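The general technique can be sketched as follows. This is not the code from the pull request: the class SpillingBuffer, its methods and the temp-file prefix are hypothetical, and the threshold is simply passed in by the caller.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical buffer: keeps data in memory up to a threshold, then spills to a temp file. */
public class SpillingBuffer extends OutputStream {
  private final long threshold;
  private ByteArrayOutputStream memory = new ByteArrayOutputStream();
  private Path file;         // null while all data is still in memory
  private OutputStream out;  // current sink: the memory buffer or the temp file
  private long size;

  public SpillingBuffer(final long threshold) {
    this.threshold = threshold;
    this.out = memory;
  }

  @Override
  public void write(final int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  @Override
  public void write(final byte[] b, final int off, final int len) throws IOException {
    // Switch to the temp file before the in-memory data would exceed the threshold.
    if (file == null && size + len > threshold) spill();
    out.write(b, off, len);
    size += len;
  }

  /** Moves the data collected so far to a temp file and redirects further writes to it. */
  private void spill() throws IOException {
    file = Files.createTempFile("archive", ".tmp");
    out = Files.newOutputStream(file);
    memory.writeTo(out);
    memory = null;
  }

  public boolean onDisk() { return file != null; }

  public long size() { return size; }

  /** Copies everything written so far to the target stream. */
  public void writeTo(final OutputStream target) throws IOException {
    out.flush();
    if (file == null) {
      memory.writeTo(target);
    } else {
      try (InputStream in = Files.newInputStream(file)) {
        in.transferTo(target);
      }
    }
  }

  /** Closes the sink and deletes the temp file, if one was created. */
  @Override
  public void close() throws IOException {
    out.close();
    if (file != null) Files.deleteIfExists(file);
  }

  public static void main(final String[] args) throws IOException {
    try (SpillingBuffer buffer = new SpillingBuffer(4)) {
      buffer.write(new byte[] { 1, 2, 3 }, 0, 3);
      System.out.println("on disk after 3 bytes: " + buffer.onDisk());
      buffer.write(new byte[] { 4, 5, 6 }, 0, 3);
      System.out.println("on disk after 6 bytes: " + buffer.onDisk());
    }
  }
}
```

Used with try-with-resources, the temporary file is removed when the buffer is closed, which mirrors the automatic cleanup described above. Small archives never touch the disk in this sketch.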

@ChristianGruen ChristianGruen merged commit 1ca8383 into BaseXdb:main Mar 26, 2026
1 check passed
@ChristianGruen
Member

@vincentml Thanks for the PR; the solution is complete, well-tested and creative. My subsequent revisions are all secondary (26ce16e).

One observation: For the threshold computation to work properly, rt.maxMemory() - rt.freeMemory() would have needed to be changed to rt.freeMemory() (otherwise, the threshold would increase as available memory decreases). I decided to remove it completely, as memory heuristics turn out to be erratic when multiple threads are running at the same time.
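The effect described here can be shown with plain arithmetic. The sketch below is hypothetical: the exact formula used in the pull request is not shown in this conversation, so the two methods only model the two expressions mentioned above, each capped at an assumed maximum array size.

```java
public class ThresholdSketch {
  /** Assumed cap; equals the bound reported in the original error message. */
  static final long MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

  /** Models rt.maxMemory() - rt.freeMemory(): grows as free memory shrinks. */
  static long usedBased(final long maxMemory, final long freeMemory) {
    return Math.min(maxMemory - freeMemory, MAX_ARRAY_SIZE);
  }

  /** Models rt.freeMemory(): shrinks as free memory shrinks. */
  static long freeBased(final long freeMemory) {
    return Math.min(freeMemory, MAX_ARRAY_SIZE);
  }

  public static void main(final String[] args) {
    final long mb = 1024L * 1024;
    final long max = 1024 * mb;  // illustrative 1 GB heap limit
    for (final long free : new long[] { 800 * mb, 400 * mb, 100 * mb }) {
      System.out.println("free=" + free / mb + " MB"
          + "  usedBased=" + usedBased(max, free) / mb + " MB"
          + "  freeBased=" + freeBased(free) / mb + " MB");
    }
  }
}
```

With these illustrative numbers, the first expression yields a larger in-memory threshold exactly when less memory is free, which is the inverse of what a memory-aware threshold should do.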

@vincentml
Contributor Author

@ChristianGruen Thank you for merging this PR and your subsequent improvements!

I've continued to have second thoughts about the threshold computation based on free memory, and am still unsure of what computation would work well for situations where multiple processes run in parallel using the job module or xquery:fork-join. Using Array.MAX_SIZE as you've done seems like a good way to resolve the problem and avoids the original error.
