On closing a PDF mOutputStream is NULL #289

BrianErickson-InfoCon · 2024-11-29T20:38:37Z

OS: Windows 10
Compiler: Microsoft Visual C++ 2019 Version 16.11.33

I'm writing a file that is 10+ GB in size.
Smaller files seem to work.
The content is all JPEG images.
During the file closing process (in function ObjectsContext::StartNewIndirectObject()),
mOutputStream is NULL.
The resulting PDF isn't properly closed.

static const std::string scObj = "obj";
ObjectIDType ObjectsContext::StartNewIndirectObject()
{
	ObjectIDType newObjectID = mReferencesRegistry.AllocateNewObjectID();

	// mOutputStream is NULL on the next line
	mReferencesRegistry.MarkObjectAsWritten(newObjectID,mOutputStream->GetCurrentPosition());
	mPrimitiveWriter.WriteInteger(newObjectID);
	mPrimitiveWriter.WriteInteger(0);
	mPrimitiveWriter.WriteKeyword(scObj);

	if(IsEncrypting()) {
		mEncryptionHelper->OnObjectStart((long long)newObjectID, 0);
	}

	return newObjectID;
}

My code:
boolean CQSDPdfFile::Close()
{
	EStatusCode status;
	// If file is already closed, just return true
	if (!m_bIsFileOpen)
		return true;
	status = m_pdfWriter.EndPDF(); // <-- the failure happens inside this call
	ClearFile();
	if (status != PDFHummus::eSuccess)
		return false;
	m_bIsFileOpen = false;
	return true;
}


galkahana · 2024-11-30T20:02:59Z

Sounds like the library's expression of the PDF format's own limitation. The beginning of Appendix C "Implementation limits" of the PDF specs says:
"PDF itself has one architectural limit: Because ten digits are allocated to byte offsets, the size of a file is limited to 10^10 bytes (approximately 10 gigabytes)."

There's probably something in the library's writing algorithm that detects when it has hit the limit for describing offsets and stops. I'd hope there's a better way than just crashing on null... I wonder if there's an earlier warning, e.g. whether one of the earlier writing commands (like the placement of a JPEG) returns an error. (I'll try to take a look sometime soon and figure out whether there's such an early warning by trying to reproduce what you got myself.)

Can you handle the task at hand with smaller files? Like limited to 9-9.5 gigs?
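
For illustration, the ten-digit limit quoted above can be expressed as a simple range check. This is a minimal sketch, not library code: the names kMaxXrefTableOffset and FitsInXrefTable are made up for this example. The largest value ten decimal digits can hold is 9,999,999,999, which is roughly 9.3 GiB.

```cpp
#include <cstdint>

// Largest byte offset that fits in the ten-digit offset field of a
// classic cross-reference table entry (hypothetical constant name).
constexpr std::int64_t kMaxXrefTableOffset = 9999999999LL;

// Returns true when `offset` can be written with at most ten decimal digits.
// Hypothetical helper, not part of the library.
constexpr bool FitsInXrefTable(std::int64_t offset)
{
    return offset >= 0 && offset <= kMaxXrefTableOffset;
}
```

Any object that starts at byte 10,000,000,000 or later fails this check, which is why a file slightly above 10 GB cannot get a valid classic xref table.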

BrianErickson-InfoCon · 2024-12-02T17:02:39Z

It works at less than 10 GB.
I read spec 1.4; there is a 10-digit limit, but that was supposed to be removed in spec 1.5+.
The latest version of Adobe allows file sizes greater than 10 GB.
In either case, it would be great if there were some indication that a limit has been reached or is about to be reached.
Please let me know.
Thanks.

BrianErickson-InfoCon · 2024-12-02T23:31:51Z

I'm still having the problem with 10G+ files, but
I have narrowed it down to:
	status = mObjectsContext->WriteXrefTable(xrefTablePosition); // returns eFailure
	if(status != 0)
		break;
I'm still trying to isolate the problem, in case it's not a spec limitation.
However, the "mOutputStream is NULL" part was because I was closing the file twice. Sorry.
Hope this helps.
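
One way a second close can slip through: in the Close() posted at the top of this issue, the open flag is only cleared after a successful EndPDF(), so a failed close leaves the object looking open and a later Close() runs the finalization again. Whether that was the actual cause here is an assumption. The sketch below shows a guard that clears the flag first; GuardedPdfFile and Finalize() are hypothetical stand-ins, not library types.

```cpp
// Hypothetical wrapper; Finalize() stands in for a call such as EndPDF().
class GuardedPdfFile
{
public:
    explicit GuardedPdfFile(bool finalizeSucceeds) : mFinalizeSucceeds(finalizeSucceeds) {}

    void Open() { mIsOpen = true; }

    // Returns the finalize result on the first call, true on repeated calls.
    bool Close()
    {
        if (!mIsOpen)
            return true;      // already closed: nothing to do
        mIsOpen = false;      // clear the flag before any early return can happen
        return Finalize();
    }

    int FinalizeCalls() const { return mFinalizeCalls; }

private:
    bool Finalize()
    {
        ++mFinalizeCalls;     // count calls so double finalization is observable
        return mFinalizeSucceeds;
    }

    bool mIsOpen = false;
    bool mFinalizeSucceeds;
    int mFinalizeCalls = 0;
};
```

With this ordering, a failed close still reports failure once, but a second Close() never reaches the writer again.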

BrianErickson-InfoCon · 2024-12-04T23:37:15Z

The resulting output file is up to 11.5 GB, but the file is corrupt.
Also,

In function "EStatusCode ObjectsContext::WriteXrefTable(LongFilePositionType& outWritePosition)"
...
	if(objectReference.mObjectWritten) // evaluates to false
	{
		SAFE_SPRINTF_2(entryBuffer,21,"%010lld %05ld n\r\n",objectReference.mWritePosition,objectReference.mGenerationNumber);
		mOutputStream->Write((const IOBasicTypes::Byte *)entryBuffer,20);
	}
	else
	{
		// object not written. at this point this should not happen, and indicates a failure
		status = PDFHummus::eFailure; // execution gets here
		TRACE_LOG1("ObjectsContext::WriteXrefTable, Unexpected Failure. Object of ID = %ld was not registered as written. probably means it was not written",i);
	}

Hope this helps
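
As a side note on the format string quoted above: a classic xref entry is exactly 20 bytes, and "%010lld" only pads to ten digits, it does not cap them. The sketch below reuses that format string with plain std::snprintf to show what happens to the entry length once an offset needs eleven digits. FormatXrefEntry is a hypothetical helper written for this example, not a library function.

```cpp
#include <cstdio>
#include <string>

// Formats one in-use xref table entry with the same format string as the
// quoted WriteXrefTable code. Hypothetical helper for illustration only.
std::string FormatXrefEntry(long long offset, long generation)
{
    char buffer[32]; // larger than 21 so an overlong entry is visible, not truncated
    std::snprintf(buffer, sizeof(buffer), "%010lld %05ld n\r\n", offset, generation);
    return std::string(buffer);
}
```

For an offset of 1234 the result is 20 characters long. For an offset of 10,000,000,000 it is 21 characters long, so it no longer fits the fixed 20-byte entry that the table format requires.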

galkahana · 2024-12-05T05:24:09Z

This probably means that there was an earlier halt.
mObjectWritten is marked true when an object starts (you can see references to MarkObjectAsWritten). This means that an object ID was allocated, but the object itself was never written (specifically, void ObjectsContext::StartNewIndirectObject(ObjectIDType inObjectID) was not called).

Adding logs doesn't help?

In any case, I'm not sure how much this will help. The files as they come out of PDFWriter have the 10 gigs limitation in writing anyway. Maybe there's an earlier warning (so you don't crash), but you should probably plan on smaller files.

Being able to support larger files probably means adding a feature to the library to emit object-stream-based files only (that's the 1.5 feature you refer to... and it remains to be seen whether it does lift the said limitation), which is not the case right now.

galkahana · 2024-12-07T15:10:04Z

ok. located the string of issues.
So at some point the file grows to be more than 10gbs.
Here's what happens next:

1. At some point there's a call to StartNewIndirectObject. This can happen via quite a few routes; it's what happens when starting to write an object. (It can also happen with StartModifiedIndirectObject when writing a modified object in PDF file update scenarios.)
2. In it there's a call to MarkObjectAsWritten. MarkObjectAsWritten returns a status result, but it's not being handled by any of these calls. This is the root of the mishaps.
3. MarkObjectAsWritten normally returns OK, but will return a failure if the position of the object start (recorded at this point) is beyond what can be represented using 10 digits. That's the 10gb limit in the library.
4. Given that StartXXXXIndirectObject ignores such failures, the code continues until WriteXrefTable fails later.

This can be corrected to return a failure immediately (by passing the error code up through StartXXXXIndirectObject in its various forms) instead of doing so later. At the least it will provide an early halt. The file would still be defective, given that it has already reached 10gbs and the xref still can't be written later. I'll introduce a correction along these lines soon.
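
The correction described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the library's code: SketchStatus, SketchRegistry and StartIndirectObjectChecked are invented names, and the real StartNewIndirectObject returns an object ID rather than a status.

```cpp
#include <cstdint>
#include <map>

enum class SketchStatus { Success, Failure };

// Simplified, hypothetical stand-in for the references registry.
class SketchRegistry
{
public:
    std::uint64_t AllocateNewObjectID() { return ++mLastID; }

    // Fails when the position cannot be represented with ten decimal digits.
    SketchStatus MarkObjectAsWritten(std::uint64_t id, std::int64_t position)
    {
        if (position < 0 || position > 9999999999LL)
            return SketchStatus::Failure;
        mPositions[id] = position;
        return SketchStatus::Success;
    }

    bool IsWritten(std::uint64_t id) const { return mPositions.count(id) != 0; }

private:
    std::uint64_t mLastID = 0;
    std::map<std::uint64_t, std::int64_t> mPositions;
};

// Passes the registry's status to the caller instead of dropping it,
// so the failure surfaces at object start rather than at xref writing time.
SketchStatus StartIndirectObjectChecked(SketchRegistry& registry,
                                        std::int64_t currentPosition,
                                        std::uint64_t& outObjectID)
{
    outObjectID = registry.AllocateNewObjectID();
    return registry.MarkObjectAsWritten(outObjectID, currentPosition);
}
```

In this model an object started past the limit is allocated an ID but never marked as written, which matches the "was not registered as written" failure seen later in WriteXrefTable.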

So even with this, it's still required to keep an eye on how the file size grows and halt prior to 10gbs.

I read about 1.5 xref streams again. Looks like they can be used regardless of the usage of object streams. While the library does not write xref streams at this point, I think I've got most of the parts ready to enable this. I can add this as a feature after a bit of a POC to see that I got it right. This will lift the 10gbs limit, since you can determine the offset byte size yourself, something that I'll route to the user as an option (with a good enough default).
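
To make the "offset byte size" point concrete: in a PDF 1.5 xref stream, each entry is stored as binary fields whose widths are declared in the stream's /W array, and values are written high-order byte first. The sketch below computes how many bytes an offset field needs and encodes a value at that width. It is an illustration of the format only, with hypothetical function names, and says nothing about how the library implements it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest number of bytes (1 to 8) that can hold `maxOffset`.
// Hypothetical helper for illustration.
int BytesNeededForOffset(std::uint64_t maxOffset)
{
    int bytes = 1;
    while (bytes < 8 && (maxOffset >> (8 * bytes)) != 0)
        ++bytes;
    return bytes;
}

// Encodes `value` as `width` bytes, high-order byte first.
// Bits that do not fit in `width` bytes are dropped.
std::vector<std::uint8_t> EncodeBigEndian(std::uint64_t value, int width)
{
    std::vector<std::uint8_t> out(static_cast<std::size_t>(width));
    for (int i = width - 1; i >= 0; --i)
    {
        out[static_cast<std::size_t>(i)] = static_cast<std::uint8_t>(value & 0xFF);
        value >>= 8;
    }
    return out;
}
```

An offset of 10,000,000,000 needs 5 bytes, and a 5-byte field covers offsets up to 2^40 - 1 (about 1 TiB), well beyond the ten-digit limit of the classic table.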

galkahana · 2024-12-08T20:26:36Z

ok, so with #291 you should be able to create files larger than 10gbs.
It's now possible to ask PDFWriter to create files with 1.5 xref streams (make sure to also set the file version to 1.5 or higher), and then the 10-digit limitation is no more. I lift the validation check in this case, and you can probably just go ahead and create those large files you wanted (well... I didn't check 10gbs or more... but the principle should work).

To activate this when using StartPDF/StartPDFStream, provide a PDFCreationSettings with inWriteXrefAsXrefStream set to true.
You can see an example here.
If you get to try it out, I'd love to hear whether this was sufficient to create those files you wanted.

cheers,
Gal.

BrianErickson-InfoCon · 2024-12-10T23:53:04Z

#291 version works, great.
FYI, I had the version set to 1.3 but changed it to the maximum. That alone didn't fix it.
Using pdfWriter.GetObjectsContext().GetCurrentPosition() to limit the file size and then starting a new file works as well.
(I end up with multiple files.)
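
A minimal sketch of that size-limiting approach, assuming the position is read before each image is added. The 9.5 GB threshold follows the "9-9.5 gigs" suggestion earlier in the thread; kRolloverLimit and ShouldStartNewFile are hypothetical names, and the caller is assumed to pass in the value returned by GetCurrentPosition().

```cpp
#include <cstdint>

// Assumed safety margin below the ten-digit limit (hypothetical constant).
constexpr std::int64_t kRolloverLimit = 9500000000LL;

// Decides whether the next payload should go into a fresh file.
// `currentPosition` is the writer's current byte position, and
// `nextPayloadBytes` is an estimate of what is about to be written.
constexpr bool ShouldStartNewFile(std::int64_t currentPosition,
                                  std::int64_t nextPayloadBytes,
                                  std::int64_t limit = kRolloverLimit)
{
    return currentPosition + nextPayloadBytes > limit;
}
```

The margin also leaves room for what is written at close time, such as the xref table and trailer.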

But the #291 version works much better. My test was up to 11.5 GB.
Thanks.

galkahana · 2024-12-11T08:18:41Z

yeah, just changing the version wouldn't be enough. it's not required for 1.5 or higher to use xref streams, so that's an optional feature. cool. glad to see it works. i'll make this an official release then and add documentation.

galkahana mentioned this issue Dec 7, 2024

feat: properly treat errors from recording out of bounds positions in very large files #290

Merged
