Skip to content

Commit

Permalink
Fix for Issue #454 (#455)
Browse files Browse the repository at this point in the history
* workaround for the Issue #450

The file makes that 2 of the Page methods fails.

The Page->extractDecodedRawData was not returning the correct string. This was corrected.

The Page->getTextArray breaks when the Page->get(´Contents´) returns a PDFObject, but this object makes that the PDFObject->getTextArray($this) throw an Error. But if you detected it and instead call PDFObject->getTextArray() , it returns the correct data. This is a workaround, because, what is exactly the difference in the format of this PDF and why it fails, needs to have a more deep investigation. I run all the PageTests and they work.

This happends because the sample Pdf file is not format as we usually see in other files. Actually, I have a similar (not exactly the same) case for a file created with FPDI, that also broke the getTextArray and getDataTm methods, but I am doing a research to see what is actually happends before I open an Issue for that. As soon as I know what is happening in that case, I will opened the Issue, hopefully with the workaround or fix already done.

* PageTest: attempt to fix cs issues

* Page.php: fixed cs issues

* ParserTest: fixed failing test testRetainImageContentImpact 

This test is a bit wonky because it relies on memory values which may differ from system to system and run to run.
Adjusted values to fix it.

Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22

* refined memory threshold in ParserTest::testRetainImageContentImpact

* Update Page.php

* Taking out line

Taking out the line:
$decodedText = '';
This was not needed. Thanks @j0k3r

* Changing the catch of the Error

To catching Throwable.

* Fix/workaround for Issue #454

When the pdf files is produced by setasign/fpdi/fpdi or FPDF, this correct that nothing is returning by the methods.
But for doing that things like to know that the producer is FPDF and the page number are required and used in conjunction with getXObjects.

* Update Page.php

Some of the changes asked in Github by kOOni

* Update Page.php

Other changes asked by k00ny

* Some other recomendations

Some other @k00ni recommendations

* After manually doing php-cs-fixer

I manually run dev-tools\vendor\bin\php-cs-fixer fix

* Correcting the phpstan error

* Update Page.php

just to make a code enhacement

* Removing vscode\lauch.json and some corrections

Some corrections metions by @k00ni.

* creating some function to get this clearer

Follow the recomendation of @k00ni on using extra function to have the code clearer.

* After applaying some @k00ni recomendations

Many changes following @k00ni recommendations.

* Updating the comment for the isFpdf function

Better explanation for the function

* Changes for correcting phpstan errors

* some changes

Changes in comments, functions names and variable names.

* Reformatted some code parts

Co-authored-by: Konrad Abicht <[email protected]>
  • Loading branch information
izabala and k00ni authored Oct 18, 2021
1 parent 5ed3040 commit f5fa4d3
Show file tree
Hide file tree
Showing 3 changed files with 226 additions and 35 deletions.
Binary file added samples/bugs/Issue454.pdf
Binary file not shown.
171 changes: 136 additions & 35 deletions src/Smalot/PdfParser/Page.php
Original file line number Diff line number Diff line change
Expand Up @@ -224,54 +224,147 @@ public function getText(self $page = null): string
return '';
}

/**
* Return true if the current page is a (setasign\Fpdi\Fpdi) FPDI/FPDF document
*
* The metadata 'Producer' should have the value of "FPDF" . FPDF_VERSION if the
* pdf file was generated by FPDF/Fpfi.
*
* @return bool true is the current page is a FPDI/FPDF document
*/
public function isFpdf(): bool
{
if (\array_key_exists('Producer', $this->document->getDetails()) &&
\is_string($this->document->getDetails()['Producer']) &&
str_starts_with($this->document->getDetails()['Producer'], 'FPDF')) {
return true;
}

return false;
}

/**
* Return the page number of the PDF document of the page object
*
* @return int the page number
*/
public function getPageNumber(): int
{
$pages = $this->document->getPages();
$numOfPages = \count($pages);
for ($pageNum = 0; $pageNum < $numOfPages; ++$pageNum) {
if ($pages[$pageNum] === $this) {
break;
}
}

return $pageNum;
}

/**
* Return the Object of the page if the document is a FPDF/FPDI document
*
* If the document was generated by FPDF/FPDI it returns the
* PDFObject of the given page
*
* @return PDFObject The PDFObject for the page
*/
public function getPDFObjectForFpdf(): PDFObject
{
$pageNum = $this->getPageNumber();
$xObjects = $this->getXObjects();

return $xObjects[$pageNum];
}

/**
* Return a new PDFObject of the document created with FPDF/FPDI
*
* For a document generated by FPDF/FPDI, it generates a
* new PDFObject for that document
*
* @return PDFObject The PDFObject
*/
public function createPDFObjectForFpdf(): PDFObject
{
$pdfObject = $this->getPDFObjectForFpdf();
$new_content = $pdfObject->getContent();
$header = $pdfObject->getHeader();
$config = $pdfObject->config;

return new PDFObject($pdfObject->document, $header, $new_content, $config);
}

/**
* Return page if document is a FPDF/FPDI document
*
* @return Page The page
*/
public function createPageForFpdf(): self
{
$pdfObject = $this->getPDFObjectForFpdf();
$new_content = $pdfObject->getContent();
$header = $pdfObject->getHeader();
$config = $pdfObject->config;

return new self($pdfObject->document, $header, $new_content, $config);
}

public function getTextArray(self $page = null): array
{
if ($contents = $this->get('Contents')) {
if ($contents instanceof ElementMissing) {
return [];
} elseif ($contents instanceof ElementNull) {
return [];
} elseif ($contents instanceof PDFObject) {
$elements = $contents->getHeader()->getElements();
if ($this->isFpdf()) {
$pdfObject = $this->getPDFObjectForFpdf();
$newPdfObject = $this->createPDFObjectForFpdf();

if (is_numeric(key($elements))) {
$new_content = '';
return $newPdfObject->getTextArray($pdfObject);
} else {
if ($contents = $this->get('Contents')) {
if ($contents instanceof ElementMissing) {
return [];
} elseif ($contents instanceof ElementNull) {
return [];
} elseif ($contents instanceof PDFObject) {
$elements = $contents->getHeader()->getElements();

if (is_numeric(key($elements))) {
$new_content = '';

/** @var PDFObject $element */
foreach ($elements as $element) {
if ($element instanceof ElementXRef) {
$new_content .= $element->getObject()->getContent();
} else {
$new_content .= $element->getContent();
}
}

/** @var PDFObject $element */
foreach ($elements as $element) {
if ($element instanceof ElementXRef) {
$new_content .= $element->getObject()->getContent();
} else {
$new_content .= $element->getContent();
$header = new Header([], $this->document);
$contents = new PDFObject($this->document, $header, $new_content, $this->config);
} else {
try {
$contents->getTextArray($this);
} catch (\Throwable $e) {
return $contents->getTextArray();
}
}
} elseif ($contents instanceof ElementArray) {
// Create a virtual global content.
$new_content = '';

$header = new Header([], $this->document);
$contents = new PDFObject($this->document, $header, $new_content, $this->config);
} else {
try {
$contents->getTextArray($this);
} catch (\Throwable $e) {
return $contents->getTextArray();
/** @var PDFObject $content */
foreach ($contents->getContent() as $content) {
$new_content .= $content->getContent()."\n";
}
}
} elseif ($contents instanceof ElementArray) {
// Create a virtual global content.
$new_content = '';

/** @var PDFObject $content */
foreach ($contents->getContent() as $content) {
$new_content .= $content->getContent()."\n";
$header = new Header([], $this->document);
$contents = new PDFObject($this->document, $header, $new_content, $this->config);
}

$header = new Header([], $this->document);
$contents = new PDFObject($this->document, $header, $new_content, $this->config);
return $contents->getTextArray($this);
}

return $contents->getTextArray($this);
return [];
}

return [];
}

/**
Expand Down Expand Up @@ -300,6 +393,9 @@ public function extractRawData(): array
}
}
} else {
if ($this->isFpdf()) {
$content = $this->getPDFObjectForFpdf();
}
$sectionsText = $content->getSectionsText($content->getContent());
foreach ($sectionsText as $sectionText) {
$extractedData[] = ['t' => '', 'o' => 'BT', 'c' => ''];
Expand Down Expand Up @@ -329,6 +425,10 @@ public function extractDecodedRawData(array $extractedRawData = null): array
}
$currentFont = null; /** @var Font $currentFont */
$clippedFont = null;
$fpdfPage = null;
if ($this->isFpdf()) {
$fpdfPage = $this->createPageForFpdf();
}
foreach ($extractedRawData as &$command) {
if ('Tj' == $command['o'] || 'TJ' == $command['o']) {
$data = $command['c'];
Expand Down Expand Up @@ -371,7 +471,8 @@ public function extractDecodedRawData(array $extractedRawData = null): array
}
} elseif ('Tf' == $command['o'] || 'TF' == $command['o']) {
$fontId = explode(' ', $command['c'])[0];
$currentFont = $this->getFont($fontId);
// If document is a FPDI/FPDF the $page has the correct font
$currentFont = isset($fpdfPage) ? $fpdfPage->getFont($fontId) : $this->getFont($fontId);
continue;
} elseif ('Q' == $command['o']) {
$currentFont = $clippedFont;
Expand Down
90 changes: 90 additions & 0 deletions tests/Integration/PageTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -634,4 +634,94 @@ public function testGetDataTmIssue450()
$this->assertEquals(756.25, $dataTm[0][0][5]);
$this->assertEquals('{signature:signer505906:Please+Sign+Here}', $dataTm[0][1]);
}

public function testIsFpdf(): void
{
$filename = $this->rootDir.'/samples/Document1_foxitreader.pdf';
$parser = $this->getParserInstance();
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertFalse($page->isFpdf());
$filename = $this->rootDir.'/samples/bugs/Issue454.pdf';
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertTrue($page->isFpdf());
}

public function testGetPageNumber(): void
{
$filename = $this->rootDir.'/samples/Document1_foxitreader.pdf';
$parser = $this->getParserInstance();
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertEquals(0, $page->getPageNumber());
$filename = $this->rootDir.'/samples/Document1_pdfcreator.pdf';
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertEquals(0, $page->getPageNumber());
$filename = $this->rootDir.'/samples/Document2_pdfcreator_nocompressed.pdf';
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertEquals(0, $page->getPageNumber());
$filename = $this->rootDir.'/samples/InternationalChars.pdf';
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertEquals(0, $page->getPageNumber());
$filename = $this->rootDir.'/samples/SimpleInvoiceFilledExample1.pdf';
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertEquals(0, $page->getPageNumber());
$filename = $this->rootDir.'/samples/bugs/Issue454.pdf';
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$this->assertEquals(0, $page->getPageNumber());
$page = $pages[1];
$this->assertEquals(1, $page->getPageNumber());
$page = $pages[2];
$this->assertEquals(2, $page->getPageNumber());
$page = $pages[3];
$this->assertEquals(3, $page->getPageNumber());
}

public function testIssue454(): void
{
$filename = $this->rootDir.'/samples/bugs/Issue454.pdf';
$parser = $this->getParserInstance();
$document = $parser->parseFile($filename);
$pages = $document->getPages();
$page = $pages[0];
$dataTm = $page->getDataTm();
$this->assertIsArray($dataTm);
$this->assertGreaterThan(0, \count($dataTm));
$this->assertIsArray($dataTm[0]);
$this->assertEquals(2, \count($dataTm[0]));
$this->assertIsArray($dataTm[0][0]);
$this->assertEquals(6, \count($dataTm[0][0]));
$this->assertEquals(201.96, $dataTm[0][0][4]);
$this->assertEquals(720.68, $dataTm[0][0][5]);
$this->assertStringContainsString('Document title', $dataTm[0][1]);
$textData = $page->getTextXY(201.96, 720.68);
$this->assertStringContainsString('Document title', $textData[0][1]);
$page = $pages[2];
$dataTm = $page->getDataTm();
$this->assertIsArray($dataTm);
$this->assertGreaterThan(0, \count($dataTm));
$this->assertIsArray($dataTm[0]);
$this->assertEquals(2, \count($dataTm[0]));
$this->assertIsArray($dataTm[0][0]);
$this->assertEquals(6, \count($dataTm[0][0]));
$this->assertEquals(67.5, $dataTm[0][0][4]);
$this->assertEquals(756.25, $dataTm[0][0][5]);
$this->assertStringContainsString('{signature:signer505906:Please+Sign+Here}', $dataTm[0][1]);
$textData = $page->getTextXY(67.5, 756.25);
$this->assertStringContainsString('{signature:signer505906:Please+Sign+Here}', $textData[0][1]);
}
}

0 comments on commit f5fa4d3

Please sign in to comment.