Implemented Token Hashing Vectorizer transformer
andrewdalpino committed Sep 7, 2020
1 parent e8f5ed2 commit 18dab62
Showing 10 changed files with 234 additions and 17 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,6 @@
- Unreleased
- Implemented Token Hashing Vectorizer transformer

- 0.1.0-beta
- Add Recursive Feature Eliminator feature selector
- Implement BM25 TF-IDF Transformer
5 changes: 3 additions & 2 deletions composer.json
@@ -5,8 +5,9 @@
"homepage": "https://rubixml.com",
"license": "MIT",
"keywords": [
"php", "machine-learning", "rubix", "ml", "extras", "neural-network", "deep-learning",
"analytics", "data-mining"
"php", "machine learning", "rubix", "ml", "extras", "neural network", "deep learning",
"analytics", "data mining", "php-ml", "php ml", "php ai", "artificial intelligence",
"ai", "rubixml", "rubix ml"
],
"authors": [
{
4 changes: 2 additions & 2 deletions docs/transformers/bm25-transformer.md
@@ -1,9 +1,9 @@
<span style="float:right;"><a href="https://github.com/RubixML/Extras/blob/master/src/Transformers/BM25Transformer.php">[source]</a></span>

# BM25 Transformer
BM25 is a term frequency weighting scheme that takes term frequency (TF) saturation and document length into account.
BM25 is a sublinear term frequency weighting scheme that takes term frequency (TF) saturation and document length into account.

> **Note:** This transformer assumes that its input is made up of word frequency vectors such as those produced by [Word Count Vectorizer](word-count-vectorizer.md).
> **Note:** BM25 Transformer assumes that its inputs are token frequency vectors such as those created by [Word Count Vectorizer](word-count-vectorizer.md).
**Interfaces:** [Transformer](api.md#transformer), [Stateful](api.md#stateful), [Elastic](api.md#elastic)
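The TF saturation mentioned above can be made concrete with the standard BM25 weighting formula. The sketch below is a generic illustration in Python rather than this transformer's PHP implementation, and the `k1` and `b` defaults are common reference values, not necessarily the library's defaults.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating (sublinear) BM25 term frequency weight.

    The raw count `tf` is dampened so that repeated occurrences of a term
    contribute less and less, and documents longer than average are penalized.
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)

    return (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
```

As `tf` grows, the weight approaches `k1 + 1` rather than growing without bound, which is what dampens very frequent terms.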

2 changes: 1 addition & 1 deletion docs/transformers/delta-tf-idf-transformer.md
@@ -3,7 +3,7 @@
# Delta TF-IDF Transformer
A supervised TF-IDF (Term Frequency Inverse Document Frequency) Transformer that uses class labels to boost the TF-IDFs of terms by how informative they are. Terms that receive the highest boost are those whose concentration is primarily in one class, whereas low-weighted terms are more evenly distributed among the classes.

> **Note:** This transformer assumes that its input is made up of word frequency vectors such as those produced by [Word Count Vectorizer](word-count-vectorizer.md).
> **Note:** Delta TF-IDF Transformer assumes that its inputs are token frequency vectors such as those created by [Word Count Vectorizer](word-count-vectorizer.md).
**Interfaces:** [Transformer](api.md#transformer), [Stateful](api.md#stateful), [Elastic](api.md#elastic)
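The class-label boost described above can be sketched for the two-class case following Martineau et al. (2009). This Python fragment illustrates the weighting idea only; the transformer's exact smoothing and multi-class handling are not shown.

```python
import math

def delta_tf_idf(tf, n_pos, n_neg, pos_df, neg_df):
    """Two-class Delta TF-IDF weight: term frequency times the difference of
    per-class inverse document frequencies.

    Terms spread evenly across both classes score near zero, while terms
    concentrated in one class receive a large magnitude (sign indicates class).
    """
    return tf * (math.log2(n_pos / pos_df) - math.log2(n_neg / neg_df))
```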

12 changes: 6 additions & 6 deletions docs/transformers/recursive-feature-eliminator.md
@@ -14,12 +14,6 @@ Recursive Feature Eliminator or *RFE* is a supervised feature selector that uses
| 2 | epochs | 1 | int | The maximum number of iterations to recurse upon the dataset. |
| 3 | base | Auto | RanksFeatures | The base feature ranking learner instance. |

## Additional Methods
Return the final importance scores of the selected feature columns:
``` php
public importances() : ?array
```

## Example
```php
use Rubix\ML\Transformers\RecursiveFeatureEliminator;
@@ -28,5 +22,11 @@ use Rubix\ML\Regressors\RegressionTree;
$transformer = new RecursiveFeatureEliminator(10, 2, new RegressionTree());
```

## Additional Methods
Return the final importance scores of the selected feature columns:
``` php
public importances() : ?array
```

### References
>- I. Guyon et al. (2002). Gene Selection for Cancer Classification using Support Vector Machines.
25 changes: 25 additions & 0 deletions docs/transformers/token-hashing-vectorizer.md
@@ -0,0 +1,25 @@
<span style="float:right;"><a href="https://github.com/RubixML/Extras/blob/master/src/Transformers/TokenHashingVectorizer.php">[source]</a></span>

# Token Hashing Vectorizer
Token Hashing Vectorizer builds token count vectors on the fly by employing a *hashing trick*. It is a stateless transformer that uses the CRC32 (Cyclic Redundancy Check) hashing algorithm to assign token occurrences to a bucket in a vector of user-specified dimensionality. The advantage of hashing over storing a fixed vocabulary is that there is no memory footprint; however, there is a chance that certain tokens will collide with other tokens, especially in lower-dimensional vector spaces.

**Interfaces:** [Transformer](api.md#transformer)

**Data Type Compatibility:** Categorical

## Parameters
| # | Param | Default | Type | Description |
|---|---|---|---|---|
| 1 | dimensions | | int | The dimensionality of the vector space. |
| 2 | tokenizer | Word | Tokenizer | The tokenizer used to extract tokens from blobs of text. |

## Example
```php
use Rubix\ML\Transformers\TokenHashingVectorizer;
use Rubix\ML\Other\Tokenizers\NGram;

$transformer = new TokenHashingVectorizer(10000, new NGram(1, 2));
```
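The mechanics of the hashing trick can also be illustrated outside of PHP. The following Python sketch uses `zlib.crc32` in place of PHP's `crc32()` and a naive whitespace tokenizer; the `min()` clamp guarding the boundary case where the hash equals its maximum value is an addition for safety here, not part of the transformer.

```python
import zlib

CRC32_MAX = 2 ** 32 - 1  # crc32 yields values in [0, 2^32 - 1]

def hash_vectorize(text, dimensions):
    """Map a blob of text to a fixed-length token count vector via the hashing trick."""
    vector = [0] * dimensions
    scale = dimensions / CRC32_MAX

    for token in text.split():  # naive whitespace "word" tokenizer
        bucket = int(zlib.crc32(token.encode()) * scale)
        vector[min(bucket, dimensions - 1)] += 1  # clamp the rare boundary case

    return vector

counts = hash_vectorize('the quick brown fox jumped over the lazy dog', 20)
```

Repeated tokens always land in the same bucket, so counts accumulate deterministically; distinct tokens may share a bucket, which is the collision risk noted above.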

## Additional Methods
This transformer does not have any additional methods.
8 changes: 4 additions & 4 deletions src/Transformers/BM25Transformer.php
@@ -14,11 +14,11 @@
/**
* BM25 Transformer
*
* BM25 is a term frequency weighting scheme that takes term frequency (TF) saturation and
* document length into account.
* BM25 is a sublinear term frequency weighting scheme that takes term frequency (TF)
* saturation and document length into account.
*
* > **Note**: This transformer assumes that its input is made up of term frequency vectors
* such as those created by the Word Count Vectorizer.
* > **Note**: BM25 Transformer assumes that its inputs are made up of token frequency
* vectors such as those created by the Word Count Vectorizer.
*
* References:
* [1] S. Robertson et al. (2009). The Probabilistic Relevance Framework: BM25 and Beyond.
4 changes: 2 additions & 2 deletions src/Transformers/DeltaTfIdfTransformer.php
@@ -19,8 +19,8 @@
* that receive the highest boost are those whose concentration is primarily in one
* class whereas low weighted terms are more evenly distributed among the classes.
*
* > **Note**: This transformer assumes that its input is made up of word frequency
* vectors such as those created by the Word Count Vectorizer.
* > **Note**: Delta TF-IDF Transformer assumes that its inputs are made up of token
* frequency vectors such as those created by the Word Count Vectorizer.
*
* References:
* [1] J. Martineau et al. (2009). Delta TFIDF: An Improved Feature Space for
125 changes: 125 additions & 0 deletions src/Transformers/TokenHashingVectorizer.php
@@ -0,0 +1,125 @@
<?php

namespace Rubix\ML\Transformers;

use Rubix\ML\DataType;
use Rubix\ML\Datasets\Dataset;
use Rubix\ML\Other\Tokenizers\Word;
use Rubix\ML\Other\Tokenizers\Tokenizer;
use InvalidArgumentException;
use Stringable;

use function count;
use function is_string;

/**
 * Token Hashing Vectorizer
 *
 * Token Hashing Vectorizer builds token count vectors on the fly by employing a *hashing
 * trick*. It is a stateless transformer that uses the CRC32 (Cyclic Redundancy Check)
 * hashing algorithm to assign token occurrences to a bucket in a vector of user-defined
 * dimensionality. The advantage of hashing over a fixed vocabulary is that there is no
 * memory footprint; however, there is a chance that certain tokens will collide with
 * other tokens, especially in lower-dimensional vector spaces.
 *
 * @category Machine Learning
 * @package Rubix/ML
 * @author Andrew DalPino
 */
class TokenHashingVectorizer implements Transformer, Stringable
{
    /**
     * The maximum number of dimensions supported.
     *
     * @var int
     */
    protected const MAX_DIMENSIONS = 4294967295;

    /**
     * The dimensionality of the vector space.
     *
     * @var int
     */
    protected $dimensions;

    /**
     * The tokenizer used to extract tokens from blobs of text.
     *
     * @var \Rubix\ML\Other\Tokenizers\Tokenizer
     */
    protected $tokenizer;

    /**
     * @param int $dimensions
     * @param \Rubix\ML\Other\Tokenizers\Tokenizer|null $tokenizer
     * @throws \InvalidArgumentException
     */
    public function __construct(int $dimensions, ?Tokenizer $tokenizer = null)
    {
        if ($dimensions < 1 or $dimensions > self::MAX_DIMENSIONS) {
            throw new InvalidArgumentException('Dimensions must be'
                . ' between 1 and ' . self::MAX_DIMENSIONS
                . ", $dimensions given.");
        }

        $this->dimensions = $dimensions;
        $this->tokenizer = $tokenizer ?? new Word();
    }

    /**
     * Return the data types that this transformer is compatible with.
     *
     * @return \Rubix\ML\DataType[]
     */
    public function compatibility() : array
    {
        return DataType::all();
    }

    /**
     * Transform the dataset in place.
     *
     * @param array[] $samples
     */
    public function transform(array &$samples) : void
    {
        $scale = $this->dimensions / self::MAX_DIMENSIONS;

        foreach ($samples as &$sample) {
            $vectors = [];

            foreach ($sample as $column => $value) {
                if (is_string($value)) {
                    $template = array_fill(0, $this->dimensions, 0);

                    $tokens = $this->tokenizer->tokenize($value);

                    $counts = array_count_values($tokens);

                    foreach ($counts as $token => $count) {
                        // Cast to string since PHP converts numeric array keys to ints.
                        $offset = (int) floor(crc32((string) $token) * $scale);

                        $template[$offset] += $count;
                    }

                    $vectors[] = $template;

                    unset($sample[$column]);
                }
            }

            $sample = array_merge($sample, ...$vectors);
        }
    }

    /**
     * Return the string representation of the object.
     *
     * @return string
     */
    public function __toString() : string
    {
        return "Token Hashing Vectorizer (dimensions: {$this->dimensions},"
            . " tokenizer: {$this->tokenizer})";
    }
}
63 changes: 63 additions & 0 deletions tests/Transformers/TokenHashingVectorizerTest.php
@@ -0,0 +1,63 @@
<?php

namespace Rubix\ML\Tests\Transformers;

use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Other\Tokenizers\Word;
use Rubix\ML\Transformers\Transformer;
use Rubix\ML\Transformers\TokenHashingVectorizer;
use PHPUnit\Framework\TestCase;

/**
 * @group Transformers
 * @covers \Rubix\ML\Transformers\TokenHashingVectorizer
 */
class TokenHashingVectorizerTest extends TestCase
{
    /**
     * @var \Rubix\ML\Datasets\Unlabeled
     */
    protected $dataset;

    /**
     * @var \Rubix\ML\Transformers\TokenHashingVectorizer
     */
    protected $transformer;

    /**
     * @before
     */
    protected function setUp() : void
    {
        $this->dataset = Unlabeled::quick([
            ['the quick brown fox jumped over the lazy man sitting at a bus stop drinking a can of coke'],
            ['with a dandy umbrella'],
        ]);

        $this->transformer = new TokenHashingVectorizer(20, new Word());
    }

    /**
     * @test
     */
    public function build() : void
    {
        $this->assertInstanceOf(TokenHashingVectorizer::class, $this->transformer);
        $this->assertInstanceOf(Transformer::class, $this->transformer);
    }

    /**
     * @test
     */
    public function transform() : void
    {
        $this->dataset->apply($this->transformer);

        $outcome = [
            [1, 1, 0, 1, 2, 0, 0, 1, 3, 0, 0, 1, 0, 0, 2, 1, 0, 1, 5, 0],
            [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        ];

        $this->assertEquals($outcome, $this->dataset->samples());
    }
}
