Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New document class #20

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

New document class #20

wants to merge 1 commit into from

Conversation

corentin-larose
Copy link
Collaborator

Work in progress, just to know if you like this lead...

@camspiers
Copy link
Owner

I like the idea, but I am still thinking about the implications.

@corentin-larose
Copy link
Collaborator Author

Just figured out that I put my explanations in the commit message, don't know if you saw them (but yes, a lot to think... It's just a lead):

For instance, this document can be used directly in classify() method replacing the commented code and thus the related properties/accessors:

    public function classify(Document $document)
    {
    $results = array();

    /*
        if ($this->documentNormalizer) {
            $document = $this->documentNormalizer->normalize($document);
        }

        $tokens = $this->tokenizer->tokenize($document);

        if ($this->tokenNormalizer) {
            $tokens = $this->tokenNormalizer->normalize($tokens);
        }

        $tokens = array_count_values($tokens);
    */

    $tokens = $document;
    [...]

My pros

  • Strong contract for Documents through interface
  • Document is in a frequency state ASAP
  • Document API is very wide open (cf Unit Tests)
  • Document can still be manipulated as an array/Iterable (shame, Symfony config component (DataStore) doesn't like ArrayObject)
  • Since document is an object, it is more RAM-efficient (no multiple copies as with an array)
  • Agnostic approach using SPL
  • One can even use closures/built-in functions for normalizers/tokenizers (faster?)
  • Hydrators/Extractors made simplier
  • Some more document-level calculations could be done in the instance
  • TokenCountByDocument no longer necessary

My cons

  • Not sure if it should be in tokens state rather than in frequency state (need for calculation/count? we could store these information either)
  • Loose contracts for normalizers/tokenizers since it uses callables instead of classes with interfaces (could still be enforced though, but we would loose the closures/built-in functions advantage)
  • Slower than arrays? (not sure, needs a bench since SPL is incredibly fast, and it removes a lot of logic/iterations around)
  • Static approach for accessors which is sometimes hated by developpers (Unit Tests...)
  • Your cons?

For instance, this document can be used directly in classify() method replacing the commented code and thus the related properties/accessors:
```php
    public function classify(Document $document)
    {
	$results = array();

	/*
        if ($this->documentNormalizer) {
            $document = $this->documentNormalizer->normalize($document);
        }

        $tokens = $this->tokenizer->tokenize($document);

        if ($this->tokenNormalizer) {
            $tokens = $this->tokenNormalizer->normalize($tokens);
        }

        $tokens = array_count_values($tokens);
	*/

	$tokens = $document;
	[...]
```

My pros
- Strong contract for Documents through interface
- Document is in a frequency state ASAP
- Document API is very wide open (cf Unit Tests)
- Document can still be manipulated as an array/Iterable (shame, Symfony config component (DataStore) doesn't like ArrayObject)
- Since document is an object, it is more RAM-efficient (no multiple copies as with an array)
- Agnostic approach using SPL
- One can even use closures/built-in functions for normalizers/tokenizers (faster?)
- Hydrators/Extractors made simplier
- Some more document-level calculations could be done in the instance
- TokenCountByDocument no longer necessary

My cons
- Not sure if it should be in tokens state rather than in frequency state (need for calculation/count? we could store these information either)
- Loose contracts for normalizers/tokenizers since it uses callables instead of classes with interfaces (could still be enforced though, but we would loose the closures/built-in functions advantage)
- Slower than arrays? (not sure, needs a bench since SPL is incredibly fast, and it removes a lot of logic/iterations around)
- Static approach for accessors which is sometimes hated by developpers (Unit Tests...)
- Your cons?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants