Utilize memory allocator in ReadProperties.GetStream #547
Conversation
parquet/file/page_reader.go (outdated):

```go
// underlying reader, even if there is less data available than that. So even if there are no more bytes,
// the buffer must have at least bytes.MinRead capacity remaining to avoid a reallocation.
allocSize := lenCompressed
if p.decompressBuffer.Cap() < lenCompressed+bytes.MinRead {
```
This depends on the combined behavior of io.LimitReader and bytes.Buffer, which seems fragile to me, but I don't have any other ideas for how to deal with it. I'll at least add unit tests verifying that the reallocation happens when I don't add bytes.MinRead to the allocation size, and doesn't happen when I do.
I agree that this seems really fragile. Maybe use io.ReadFull directly into p.decompressBuffer.Bytes()[:lenCompressed] instead of going through the intermediate bytes.Buffer?
Yes, let's go with ReadFull and we can skip `bytes.Buffer` altogether.
```go
if n != lenUncompressed {
	return nil, fmt.Errorf("parquet: expected to read %d bytes but only read %d", lenUncompressed, n)
}
if p.cryptoCtx.DataDecryptor != nil {
```
I'm not sure whether this is needed. For data page v2, the data is just read by ReadFull and Decrypt is not called:
arrow-go/parquet/file/page_reader.go, lines 815 to 824 in 95b3f76:

```go
if compressed {
	if levelsBytelen > 0 {
		io.ReadFull(p.r, buf.Bytes()[:levelsBytelen])
	}
	if _, p.err = p.decompress(p.r, lenCompressed-levelsBytelen, buf.Bytes()[levelsBytelen:]); p.err != nil {
		return false
	}
} else {
	io.ReadFull(p.r, buf.Bytes())
}
```
So maybe the Decrypt call is not needed for data page v1 or dictionary pages either?
The reverse, actually. It looks like this is a bug we just never came across; I'm guessing no one using this library was reading DataPageV2 data that was uncompressed but still encrypted.
Alright, I'll fix it for DataPageV2 then. If I have the time, I'll add a unit test without compression and with encryption that should fail on the current main.
I reran the profiler with the current commit in this PR, with a 2.8 GB parquet file stored in S3, uncompressed. The CPU profiler shows that more time is spent in runtime.memmove (copying memory) than in syscall.Syscall6 (read), which is annoying me. :-D
I think it should be still possible to eliminate at least one copy for the uncompressed case.
So this is my scenario:
1. `ReaderProperties.GetStream` reads the column chunk from a TLS and stores it in a buffer (or just allocates the buffer if `BufferedStreamEnabled`, but let's go with the unbuffered case for now).
2. A `serializedPageReader` is created with the buffer returned from `ReaderProperties.GetStream`.
3. `serializedPageReader.Next` gets the page header and calls `serializedPageReader.readUncompressed`/`serializedPageReader.decompress`, which reads data from the GetStream buffer into dictPageBuffer/dataPageBuffer (3a: for the uncompressed case this is just a copy).
4. A `page` struct is created from the bytes written to dictPageBuffer/dataPageBuffer.
I think I could avoid the copy in 3a and create the page directly from the bytes in the buffer returned by ReaderProperties.GetStream, using a combination of Peek (to get the bytes) and Discard (to move the internal position inside the buffer). This should hold when BufferedStreamEnabled is false; I have to check what happens when it is true.
Awesome! Thanks for diving into this!
Alright, so I "steal" the buffer by using Peek/Discard if the data has been read previously and is available in the BufferedReader. So in the uncompressed and unencrypted case, data is read and stored into a buffer in ReaderProperties.GetStream, then copied to the user-provided buffer in Float32ColumnChunkReader.ReadBatch.
Now, if we have a plain encoder and no compression, it should be possible to write the data directly to the user-provided buffer, which would eliminate even that copy. But that one is more complicated and I need to start doing other stuff. :D
Also, the decryption types allocate buffers for the decrypted data. We could pass in an already allocated buffer to use, or maybe do an in-place decryption (if possible), or give them the custom allocator if it is set.
Anyway, I'll fix the decryption for DataPageV2 next and I'll consider this one done.
Force-pushed e3a38f7 to c7bee5e.
```go
require.NoError(t, err)

icr := col0.(*file.Int64ColumnChunkReader)
// require.NoError(t, icr.SeekToRow(3)) // TODO: this causes a panic currently
```
If I uncomment this, it causes a panic.
That shouldn't be panicking, as SeekToRow works correctly based on my last tests... I'll see if I can debug this.
This still panics:
```
--- FAIL: TestDecryptColumns (0.00s)
    --- FAIL: TestDecryptColumns/DataPageV2_BufferedRead (0.00s)
panic: cipher: message authentication failed [recovered, repanicked]

goroutine 8 [running]:
testing.tRunner.func1.2({0x2899940, 0xc000054100})
	/usr/local/Cellar/go/1.25.6/libexec/src/testing/testing.go:1872 +0x237
testing.tRunner.func1()
	/usr/local/Cellar/go/1.25.6/libexec/src/testing/testing.go:1875 +0x35b
panic({0x2899940?, 0xc000054100?})
	/usr/local/Cellar/go/1.25.6/libexec/src/runtime/panic.go:783 +0x132
github.com/apache/arrow-go/v18/parquet/internal/encryption.(*aesDecryptor).Decrypt(0xc0002c58f0, {0xc000026800, 0x4000, 0x4000}, {0xc0002c5b10?, 0x600c00003ff20?, 0x44d3560?}, {0xc0002c5e90, 0xd, 0x10})
	/daniel-adam-tfs/arrow-go/parquet/internal/encryption/aes.go:262 +0x26d
github.com/apache/arrow-go/v18/parquet/internal/encryption.(*decryptor).Decrypt(0xc0003326c0?, {0xc000026800?, 0xc00003fd40?, 0x1397b05?})
	/daniel-adam-tfs/arrow-go/parquet/internal/encryption/decryptor.go:268 +0x45
github.com/apache/arrow-go/v18/parquet/file.(*serializedPageReader).readPageHeader(0xc0003326c0, {0x2aceb20, 0xc00003ff20}, 0xc0002ff740)
	/daniel-adam-tfs/arrow-go/parquet/file/page_reader.go:704 +0x139
github.com/apache/arrow-go/v18/parquet/file.(*serializedPageReader).Next(0xc0003326c0)
	/daniel-adam-tfs/arrow-go/parquet/file/page_reader.go:798 +0xdd
github.com/apache/arrow-go/v18/parquet/file.(*serializedPageReader).SeekToPageWithRow(0xc0003326c0, 0x3)
	/daniel-adam-tfs/arrow-go/parquet/file/page_reader.go:749 +0x186
github.com/apache/arrow-go/v18/parquet/file.(*columnChunkReader).SeekToRow(0xc0002fa780, 0x3)
	/daniel-adam-tfs/arrow-go/parquet/file/column_reader.go:584 +0x2a
github.com/apache/arrow-go/v18/parquet/file_test.checkDecryptedValues(0xc000326380, 0xc000210870, 0xc0003283f0)
	/daniel-adam-tfs/arrow-go/parquet/file/column_reader_test.go:867 +0x591
github.com/apache/arrow-go/v18/parquet/file_test.TestDecryptColumns.func1(0xc000326380)
	/daniel-adam-tfs/arrow-go/parquet/file/column_reader_test.go:964 +0x18a
testing.tRunner(0xc000326380, 0xc0002fe680)
	/usr/local/Cellar/go/1.25.6/libexec/src/testing/testing.go:1934 +0xea
created by testing.(*T).Run in goroutine 7
	/usr/local/Cellar/go/1.25.6/libexec/src/testing/testing.go:1997 +0x465
FAIL	github.com/apache/arrow-go/v18/parquet/file	1.922s
```
For each of the defined cases; it doesn't matter whether it is a V1 or V2 page, whether it is compressed or not, or whether BufferedStream is used. SeekToRow works in the unencrypted case, so this is again encryption related.
It seems to be related to these lines:

arrow-go/parquet/file/page_reader.go, lines 699 to 702 in 38dc64b:

```go
if oidx == nil {
	if _, err = section.Seek(p.dataOffset-p.baseOffset, io.SeekStart); err != nil {
		return err
	}
```
I think this is intended to skip past the dictionary page to the data page. But for some reason, after this Seek, the call at arrow-go/parquet/file/page_reader.go line 663 in 38dc64b fails:

```go
view = p.cryptoCtx.MetaDecryptor.Decrypt(view)
```

The offset to the data page is correct, so I'm thinking that some internal state of the decryptor or the page reader is affected by skipping past the parsing of the dictionary page.
But I'll remove the SeekToRow call and park this for now in #566, because we are benchmarking various formats + readers of unencrypted data, and I want to make some optimizations (including this PR, which according to my benchmarks speeds things up a little).
```go
arrWriterProps := pqarrow.NewArrowWriterProperties()

var buf bytes.Buffer
wr, err := pqarrow.NewFileWriter(schema, &buf, writerProps, arrWriterProps)
```
@zeroshade I think there is a bug in the pqarrow writer. For the data page v2 buffer, it first writes the levels (definition and repetition) and then the values; only the values are compressed. However, this whole buffer is then encrypted. Unless ChatGPT is hallucinating on me, only the compressed values should be encrypted; the levels should stay unencrypted and uncompressed.
I'll check with an encrypted parquet file created in a different way on Monday and see what happens.
@zeroshade OK, I should have some time this week to finish this. In fact, I think the memory allocation is done, but the decryption needs fixing before the tests pass.
I've been trying to use PyArrow to encrypt/decrypt files, but there seems to be some discrepancy between the implementations: I cannot get files encrypted by PyArrow to decrypt using arrow-go, and vice versa. I'll open an issue for the encryption/decryption.
Please link the issue here when you file it. PyArrow and arrow-go should agree on encrypting/decrypting in both directions, since PyArrow binds to Arrow C++, and the Parquet C++ and arrow-go test suites should share the same passing tests. I'd be interested to reproduce the failure and debug it.
This works correctly now.

Alright, decryption should be correct now. Let me rebase this (...and try to remember what was happening here 😄).
Force-pushed 1dde56d to b9587e2.
Rationale for this change
Optimizes memory usage and enables the use of custom allocators when reading column data, with both buffered and unbuffered readers.
What changes are included in this PR?
Changes to the bufferedReader type, a new bytesBufferReader type, and a modification of ReadProperties.GetStream to propagate the custom memory allocator to the readers.
Are these changes tested?
TODO: add unit tests
Are there any user-facing changes?
The allocator, if provided with the reader properties, will be used to allocate the underlying buffers for the buffered/unbuffered readers.
The BufferedReader interface was extended with a Free method to allow returning the memory to the allocator.