On closing a PDF mOutputStream is NULL #289

Open
BrianErickson-InfoCon opened this issue Nov 29, 2024 · 9 comments

@BrianErickson-InfoCon

OS: Windows 10
Compiler: Microsoft Visual C++ 2019 Version 16.11.33

I'm writing a file that is 10+GB in size.
Smaller sizes seem to work.
All JPEG images.
During the file closing process (in function ObjectsContext::StartNewIndirectObject()),
mOutputStream is NULL.
The resulting PDF isn't properly closed.

static const std::string scObj = "obj";
ObjectIDType ObjectsContext::StartNewIndirectObject()
{
    ObjectIDType newObjectID = mReferencesRegistry.AllocateNewObjectID();

    mReferencesRegistry.MarkObjectAsWritten(newObjectID,mOutputStream->GetCurrentPosition()); // << mOutputStream is NULL
    mPrimitiveWriter.WriteInteger(newObjectID);
    mPrimitiveWriter.WriteInteger(0);
    mPrimitiveWriter.WriteKeyword(scObj);

    if(IsEncrypting()) {
        mEncryptionHelper->OnObjectStart((long long)newObjectID, 0);
    }

    return newObjectID;
}

My code:
boolean CQSDPdfFile::Close()
{
    EStatusCode status;
    // If file is already closed, just return true
    if (!m_bIsFileOpen)
        return true;
    status = m_pdfWriter.EndPDF(); // <<< the failure happens inside this call
    ClearFile();
    if (status != PDFHummus::eSuccess)
        return false;
    m_bIsFileOpen = false;
    return true;
}

@galkahana
Owner

Sounds like the library reflecting a limitation of the PDF format itself. The beginning of Appendix C, "Implementation limits", of the PDF spec says:
"PDF itself has one architectural limit: Because ten digits are allocated to byte offsets, the size of a file is limited to 10^10 bytes (approximately 10 gigabytes)."

There's probably something in the library's writing algorithm that detects, at some point, that it has hit the limit on describable offsets and stops. I'd hope there's a better outcome than crashing on NULL. I wonder if there's an earlier warning, for example whether one of the earlier writing commands (like the placement of a JPEG) returns an error. (I'll try to take a look sometime soon and figure out whether there's such an early warning, by simulating what you got myself.)

Can you handle the task at hand with smaller files, say limited to 9-9.5 gigs?
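
For reference, the quoted limit works out as follows. This is a small arithmetic sketch, not library code; the constant and function names are made up for illustration. It shows that the largest ten-digit offset is 9,999,999,999 and that 10^10 bytes is roughly 9.31 GiB, which matters when a tool reports sizes in binary gigabytes.

```cpp
#include <cassert>
#include <cstdint>

// Largest byte offset that fits in the ten-digit field of a classic xref entry.
constexpr std::int64_t kMaxTenDigitOffset = 9999999999LL;

// Bytes in one binary gigabyte (GiB).
constexpr std::int64_t kBytesPerGiB = 1024LL * 1024LL * 1024LL;

// True when an object starting at `offset` can still be described with ten digits.
constexpr bool fitsInTenDigits(std::int64_t offset) {
    return offset >= 0 && offset <= kMaxTenDigitOffset;
}

// The 10^10-byte limit expressed in GiB (about 9.31).
constexpr double limitInGiB() {
    return static_cast<double>(kMaxTenDigitOffset + 1) /
           static_cast<double>(kBytesPerGiB);
}
```

So a target of "9 gigs" is safe in either unit, while "9.5 gigs" is only safe if it means decimal gigabytes.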

@BrianErickson-InfoCon
Author

It works at less than 10 GB.
I read spec 1.4; there is a 10-digit limit, but that was supposed to be removed in spec 1.5+.
The latest version of Adobe allows file sizes greater than 10 GB.
In either case, it would be great if there were some indication that a limit has been reached or is about to be reached.
Please, let me know.
Thanks.

@BrianErickson-InfoCon
Author

BrianErickson-InfoCon commented Dec 2, 2024

I'm still having the problem with 10G+ files, but
I have narrowed it down to:
    status = mObjectsContext->WriteXrefTable(xrefTablePosition); // << returns eFailure
    if(status != 0)
        break;
I'm still trying to isolate the problem, in case it's not a spec limitation.
However, the "mOutputStream is NULL" part was because I was closing the file twice. Sorry.
Hope this helps.
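
One way a double close can arise with the Close() quoted earlier: when EndPDF() fails, m_bIsFileOpen stays true, so a later Close() calls EndPDF() again on an already torn-down writer. Below is a minimal sketch of a guard against that. `FakeWriter` and `PdfFile` are hypothetical stand-ins for illustration only, not the library's API or the reporter's class.

```cpp
#include <cassert>

// Hypothetical stand-in for PDFWriter: counts EndPDF() calls and can be told to fail.
struct FakeWriter {
    int endCalls = 0;
    bool failOnEnd = false;
    bool EndPDF() { ++endCalls; return !failOnEnd; }  // true means success
};

// Hypothetical wrapper modeled on the Close() method quoted in the report.
class PdfFile {
public:
    explicit PdfFile(FakeWriter& writer) : mWriter(writer), mIsOpen(true) {}

    bool Close() {
        if (!mIsOpen)
            return true;   // already closed: nothing left to finalize
        mIsOpen = false;   // mark closed first, so a failed EndPDF() is never retried
        return mWriter.EndPDF();
    }

private:
    FakeWriter& mWriter;
    bool mIsOpen;
};
```

The design choice here is that a failed close still counts as closed: the caller gets the failure once, and any later call is a no-op.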

@BrianErickson-InfoCon
Author

The resulting output file is up to 11.5 GB, but the file is corrupt.
Also,

In function "EStatusCode ObjectsContext::WriteXrefTable(LongFilePositionType& outWritePosition)"
...
    if(objectReference.mObjectWritten) // << false
    {
        SAFE_SPRINTF_2(entryBuffer,21,"%010lld %05ld n\r\n",objectReference.mWritePosition,objectReference.mGenerationNumber);
        mOutputStream->Write((const IOBasicTypes::Byte *)entryBuffer,20);
    }
    else
    {
        // object not written. at this point this should not happen, and indicates a failure
        status = PDFHummus::eFailure; // << Gets here
        TRACE_LOG1("ObjectsContext::WriteXrefTable, Unexpected Failure. Object of ID = %ld was not registered as written. probably means it was not written",i);
    }

Hope this helps
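
To see why the ten-digit field matters in the snippet above, here is a standalone sketch that reuses the same format string outside the library. `formatXrefEntry` is a hypothetical helper, not a PDF-Writer function. A classic xref entry must be exactly 20 bytes, and an offset of 10^10 or more pushes it to 21.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats one "in use" xref entry with the format string quoted above.
// Returns the full text so the caller can inspect its length.
std::string formatXrefEntry(long long writePosition, long generationNumber) {
    char buffer[64];  // deliberately larger than 21 so an oversized entry stays visible
    int written = std::snprintf(buffer, sizeof(buffer), "%010lld %05ld n\r\n",
                                writePosition, generationNumber);
    return std::string(buffer, static_cast<std::size_t>(written));
}
```

With this helper, an offset of 9999999999 yields a 20-byte entry, while 10000000000 yields 21 bytes, which no longer fits the fixed-width table layout.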

@galkahana
Owner

This probably means that there was an earlier halt.
mObjectWritten is marked true when an object starts (you can see references to MarkObjectAsWritten). This means that an object ID was allocated but the object itself was never written (specifically, void ObjectsContext::StartNewIndirectObject(ObjectIDType inObjectID) was not called).

Adding logs doesn't help?

In any case, I'm not sure how much this will help. Files that come out of PDFWriter have the 10 GB limit in writing anyway. Maybe there's an earlier warning (so you don't crash), but you should probably plan on smaller files.

Supporting larger files probably means adding a feature to the library to emit only object-stream-based files (that's the 1.5 feature you refer to, and it remains to be seen whether it lifts the limitation), which is not the case right now.

@galkahana
Owner

OK, I located the chain of issues.
At some point the file grows to more than 10 GB.
Here's what happens next:

  1. At some point there's a call to StartNewIndirectObject. This can happen via quite a few routes; it's what starts the writing of an object. (It can also happen with StartModifiedIndirectObject when writing a modified object in PDF file update scenarios.)
  2. In it there's a call to MarkObjectAsWritten. MarkObjectAsWritten returns a status result, but none of these callers handle it. This is the root of the mishaps.
  3. MarkObjectAsWritten normally returns OK, but returns a failure if the position of the object start (recorded at this point) is beyond what can be represented using 10 digits. That's the 10 GB limit in the library.
  4. Since StartXXXXIndirectObject ignores such failures, the code continues until WriteXrefTable fails later.

This can be corrected to return a failure immediately (by passing the error code up through StartXXXXIndirectObject in its various forms) instead of doing so later. At the least it will provide an early halt. The file would still be defective, given that it has already reached 10 GB and the xref may still not be written later. I'll introduce a correction along these lines soon.
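
The shape of that correction can be sketched with simplified, hypothetical types. `Registry`, `Status`, and this free-function `StartNewIndirectObject` are illustrations only; the real library signatures and the actual fix may differ. The point is that the status from MarkObjectAsWritten reaches the caller instead of being dropped.

```cpp
#include <cassert>
#include <map>

enum Status { eSuccess = 0, eFailure = -1 };

// Largest offset a ten-digit xref field can describe.
constexpr long long kMaxTenDigitOffset = 9999999999LL;

// Hypothetical, stripped-down stand-in for the references registry.
class Registry {
public:
    unsigned long AllocateNewObjectID() { return ++mLastID; }

    // Fails when the position cannot be written with ten digits.
    Status MarkObjectAsWritten(unsigned long id, long long position) {
        if (position < 0 || position > kMaxTenDigitOffset)
            return eFailure;
        mPositions[id] = position;
        return eSuccess;
    }

    bool IsWritten(unsigned long id) const { return mPositions.count(id) != 0; }

private:
    unsigned long mLastID = 0;
    std::map<unsigned long, long long> mPositions;
};

// Returns the status and hands the new ID back through outID,
// so the caller can halt as soon as the limit is crossed.
Status StartNewIndirectObject(Registry& registry, long long currentPosition,
                              unsigned long& outID) {
    outID = registry.AllocateNewObjectID();
    return registry.MarkObjectAsWritten(outID, currentPosition);
}
```

A caller that checks the returned status can stop at the first object past the limit, rather than discovering an unwritten object when the xref table is emitted.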

So even with this, you still need to keep an eye on how the file size grows and halt before 10 GB.

I read about 1.5 xref streams again. It looks like they can be used regardless of whether object streams are used. While the library does not write xref streams at this point, I think I have most of the parts ready to enable this. I can add it as a feature after a bit of a POC to confirm I got this right. This will lift the 10 GB limit, since with xref streams you determine the offset byte size yourself, something I'll expose to the user as an option (with a good enough default).
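
As a rough illustration of choosing the offset byte size: in a cross-reference stream, offsets are stored as binary integers whose width in bytes is declared per file, so the question becomes how many bytes the largest offset needs. The helper below is a hypothetical sketch, not the library's implementation or its default.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of bytes that can hold `maxOffset` as an unsigned integer.
int bytesNeededForOffset(std::uint64_t maxOffset) {
    int bytes = 1;
    // Stop at 8 so the shift amount never reaches the full 64-bit width.
    while (bytes < 8 && (maxOffset >> (8 * bytes)) != 0)
        ++bytes;
    return bytes;
}
```

Four bytes cover offsets up to about 4.29 GB, and five bytes cover up to about 1.1 TB, so an 11.5 GB file needs a five-byte offset field.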

@galkahana
Owner

OK, so with #291 you should be able to create files larger than 10 GB.
It's now possible to ask PDFWriter to create files with 1.5 xref streams (make sure the file is also set to version 1.5 or higher), and then the 10-digit limitation is no more. I lift the validation check in this case, so you can probably just go ahead and create those large files you wanted (well, I didn't check 10 GB or more, but the principle should work).

To activate this when using StartPDF/StartPDFStream provide a PDFCreationSettings with inWriteXrefAsXrefStream set to true.
You can see an example here.
If you get to try it out i'd love to hear if this was sufficient to create those files you wanted.

cheers,
Gal.

@BrianErickson-InfoCon
Author

The #291 version works, great.
FYI, I had the version set to 1.3 but changed it to max. That alone didn't fix it.
Using pdfWriter.GetObjectsContext().GetCurrentPosition() to limit the file size and then resetting the file works as well.
(You end up with multiple files.)
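
A minimal sketch of that size check, assuming the current position comes from pdfWriter.GetObjectsContext().GetCurrentPosition() as described above. `shouldStartNewFile` and its `reserveBytes` headroom parameter are hypothetical names for illustration; the right headroom depends on what still has to be written when the file is closed.

```cpp
#include <cassert>
#include <cstdint>

// Largest offset a ten-digit xref field can describe.
constexpr std::int64_t kMaxTenDigitOffset = 9999999999LL;

// Decides whether the next image should go into a new file.
// currentPosition: bytes written so far to the current PDF.
// nextImageBytes:  size of the JPEG about to be embedded.
// reserveBytes:    hypothetical headroom for page objects, xref table and trailer.
bool shouldStartNewFile(std::int64_t currentPosition,
                        std::int64_t nextImageBytes,
                        std::int64_t reserveBytes) {
    return currentPosition + nextImageBytes + reserveBytes > kMaxTenDigitOffset;
}
```

The check runs before each image is placed, so the current file is ended and a new one started while every offset still fits in ten digits.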

But the #291 version works much better. My test was up to 11.5 GB.
Thanks.

@galkahana
Owner

Yeah, just changing the version wouldn't be enough. Files at version 1.5 or higher aren't required to use xref streams, so that's an optional feature. Cool, glad to see it works. I'll make this an official release then and add documentation.
