Add RegexFullMatch operator (onnx#5401)
### Description
This PR introduces the `RegexFullMatch` operator, as originally proposed
in onnx#5317.

The `RegexFullMatch` operator takes one string tensor as input and
returns a bool tensor of identical shape, indicating whether each element
fully matches the regex pattern given in the `pattern` string
attribute. The pattern is expected to be valid
[re2](https://github.com/google/re2) syntax.

Some examples are as follows:
```
RegexFullMatch(["www.google.com", "www.facebook.com", "www.bbc.co.uk"], "www\.[\w.-]+\.\bcom\b")
=> [True, True, False]
RegexFullMatch([["[email protected]", "[email protected]"], ["not email", "[email protected]"]], "(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)")
=> [[True, False], [False, True]]
```
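
As a rough illustration of the semantics described above, the sketch below applies a full match to each string with Python's standard `re` module. This is an assumption made for portability: the operator specifies RE2 syntax, and Python's `re` is a different engine that happens to accept this particular pattern. The helper name `regex_full_match` is hypothetical and is not part of ONNX.

```python
import re


def regex_full_match(strings, pattern):
    """Return one bool per input string: True only if the whole string matches.

    Hypothetical helper, not an ONNX API. Python's `re` engine is used as a
    stand-in for RE2, which is an assumption that holds for simple patterns.
    """
    compiled = re.compile(pattern)
    return [compiled.fullmatch(s) is not None for s in strings]


urls = ["www.google.com", "www.facebook.com", "www.bbc.co.uk"]
print(regex_full_match(urls, r"www\.[\w.-]+\.\bcom\b"))

# A full match differs from a search: trailing text makes the full match fail
# even though the pattern still occurs inside the string.
pattern = re.compile(r"www\.[\w.-]+\.\bcom\b")
print(pattern.search("www.google.com/maps") is not None)
print(pattern.fullmatch("www.google.com/maps") is not None)
```

The last two lines show why the operator is named "full" match: the same pattern is found by a search inside `www.google.com/maps`, but it does not cover the whole string.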

### Motivation and Context
Closes onnx#5317.

Following discussion at the last Operators SIG Weekly, the "engine"
attribute has been dropped in favour of simply using
[re2](https://github.com/google/re2) syntax for now. This reflects the
fact that the
[TensorFlow](https://www.tensorflow.org/api_docs/python/tf/strings/regex_full_match)
and
[PyTorch](https://pytorch.org/text/0.15.0/transforms.html#regextokenizer)
operators that require regex already use re2.

---------

Signed-off-by: Aditya Goel <[email protected]>
Signed-off-by: Chun-Wei Chen <[email protected]>
Signed-off-by: Aditya Goel <[email protected]>
Co-authored-by: Chun-Wei Chen <[email protected]>
Co-authored-by: Christian Bourjau <[email protected]>
Co-authored-by: Xavier Dupré <[email protected]>
4 people authored Aug 4, 2023
1 parent 0f1f98c commit 2e0908d
Showing 32 changed files with 483 additions and 5 deletions.
2 changes: 1 addition & 1 deletion .azure-pipelines/Linux-CI.yml
@@ -68,7 +68,7 @@ jobs:
export CMAKE_ARGS="-DONNX_WERROR=ON -DONNX_USE_PROTOBUF_SHARED_LIBS=ON"
# enable more sanitizer
export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CXX_FLAGS='-fsanitize=undefined -fno-sanitize-recover=all '"
-pip install -e . -v
+pip install -e ".[reference]" -v
displayName: 'Install ONNX and dependencies'
- script: |
2 changes: 1 addition & 1 deletion .azure-pipelines/MacOS-CI.yml
@@ -62,7 +62,7 @@ jobs:
if [ '$(onnx_lite)' == '1' ]; then
export CMAKE_ARGS="${CMAKE_ARGS} -DONNX_USE_LITE_PROTO=ON"
fi
-pip install -e . -v
+pip install -e ".[reference]" -v
displayName: 'Install dependencies and ONNX'
- script: |
2 changes: 1 addition & 1 deletion .azure-pipelines/Windows-CI.yml
@@ -64,7 +64,7 @@ jobs:
set CMAKE_ARGS=-DONNX_USE_PROTOBUF_SHARED_LIBS=OFF -DONNX_USE_LITE_PROTO=ON -DONNX_WERROR=OFF
)
-pip install -e . -v
+pip install -e ".[reference]" -v
pytest
IF NOT %ERRORLEVEL% EQU 0 (
@echo "pytest failed"
3 changes: 3 additions & 0 deletions .github/workflows/release_win.yml
@@ -49,6 +49,9 @@ jobs:
run: |
python -m pip install -q --upgrade pip
cd onnx
if ('${{ matrix.architecture }}' -eq 'x86') {
sed -i '' '/google-re2/d' requirements-release.txt
}
python -m pip install -q -r requirements-release.txt
- name: Build ONNX wheel
2 changes: 1 addition & 1 deletion README.md
@@ -82,7 +82,7 @@ A roadmap process takes place every year. More details can be found [here](https
ONNX released packages are published in PyPi.

```sh
-pip install onnx
+pip install onnx # or pip install onnx[reference] for optional reference implementation dependencies
```

[ONNX weekly packages](https://pypi.org/project/onnx-weekly/) are published in PyPI to enable experimentation and early testing.
38 changes: 38 additions & 0 deletions docs/Changelog.md
@@ -24092,6 +24092,44 @@ This version of the operator has been available since version 20 of the default
<dd>Constrain grid types to float tensors.</dd>
</dl>

### <a name="RegexFullMatch-20"></a>**RegexFullMatch-20**</a>

RegexFullMatch performs a full regex match on each element of the input tensor. If an element fully matches the regex pattern specified as an attribute, the corresponding element in the output is True and it is False otherwise. [RE2](https://github.com/google/re2/wiki/Syntax) regex syntax is used.

#### Version

This version of the operator has been available since version 20 of the default ONNX operator set.

#### Attributes

<dl>
<dt><tt>pattern</tt> : string</dt>
<dd>Regex pattern to match on. This must be valid RE2 syntax.</dd>
</dl>

#### Inputs

<dl>
<dt><tt>X</tt> (non-differentiable) : T1</dt>
<dd>Tensor with strings to match on.</dd>
</dl>

#### Outputs

<dl>
<dt><tt>Y</tt> (non-differentiable) : T2</dt>
<dd>Tensor of bools indicating if each input string fully matches the regex pattern specified.</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(string)</dt>
<dd>Inputs must be UTF-8 strings</dd>
<dt><tt>T2</tt> : tensor(bool)</dt>
<dd>Outputs are bools and are True where there is a full regex match and False otherwise.</dd>
</dl>

### <a name="StringConcat-20"></a>**StringConcat-20**</a>

StringConcat concatenates string tensors elementwise (with NumPy-style broadcasting support)
116 changes: 116 additions & 0 deletions docs/Operators.md
@@ -118,6 +118,7 @@ For an operator input/output's differentiability, it can be differentiable,
|<a href="#ReduceMin">ReduceMin</a>|<a href="Changelog.md#ReduceMin-18">18</a>, <a href="Changelog.md#ReduceMin-13">13</a>, <a href="Changelog.md#ReduceMin-12">12</a>, <a href="Changelog.md#ReduceMin-11">11</a>, <a href="Changelog.md#ReduceMin-1">1</a>|
|<a href="#ReduceProd">ReduceProd</a>|<a href="Changelog.md#ReduceProd-18">18</a>, <a href="Changelog.md#ReduceProd-13">13</a>, <a href="Changelog.md#ReduceProd-11">11</a>, <a href="Changelog.md#ReduceProd-1">1</a>|
|<a href="#ReduceSum">ReduceSum</a>|<a href="Changelog.md#ReduceSum-13">13</a>, <a href="Changelog.md#ReduceSum-11">11</a>, <a href="Changelog.md#ReduceSum-1">1</a>|
|<a href="#RegexFullMatch">RegexFullMatch</a>|<a href="Changelog.md#RegexFullMatch-20">20</a>|
|<a href="#Reshape">Reshape</a>|<a href="Changelog.md#Reshape-19">19</a>, <a href="Changelog.md#Reshape-14">14</a>, <a href="Changelog.md#Reshape-13">13</a>, <a href="Changelog.md#Reshape-5">5</a>, <a href="Changelog.md#Reshape-1">1</a>|
|<a href="#Resize">Resize</a>|<a href="Changelog.md#Resize-19">19</a>, <a href="Changelog.md#Resize-18">18</a>, <a href="Changelog.md#Resize-13">13</a>, <a href="Changelog.md#Resize-11">11</a>, <a href="Changelog.md#Resize-10">10</a>|
|<a href="#ReverseSequence">ReverseSequence</a>|<a href="Changelog.md#ReverseSequence-10">10</a>|
@@ -22588,6 +22589,121 @@ expect(
</details>


### <a name="RegexFullMatch"></a><a name="regexfullmatch">**RegexFullMatch**</a>

RegexFullMatch performs a full regex match on each element of the input tensor. If an element fully matches the regex pattern specified as an attribute, the corresponding element in the output is True and it is False otherwise. [RE2](https://github.com/google/re2/wiki/Syntax) regex syntax is used.

#### Version

This version of the operator has been available since version 20 of the default ONNX operator set.

#### Attributes

<dl>
<dt><tt>pattern</tt> : string</dt>
<dd>Regex pattern to match on. This must be valid RE2 syntax.</dd>
</dl>

#### Inputs

<dl>
<dt><tt>X</tt> (non-differentiable) : T1</dt>
<dd>Tensor with strings to match on.</dd>
</dl>

#### Outputs

<dl>
<dt><tt>Y</tt> (non-differentiable) : T2</dt>
<dd>Tensor of bools indicating if each input string fully matches the regex pattern specified.</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>T1</tt> : tensor(string)</dt>
<dd>Inputs must be UTF-8 strings</dd>
<dt><tt>T2</tt> : tensor(bool)</dt>
<dd>Outputs are bools and are True where there is a full regex match and False otherwise.</dd>
</dl>


#### Examples

<details>
<summary>basic</summary>

```python
node = onnx.helper.make_node(
    "RegexFullMatch",
    inputs=["X"],
    outputs=["Y"],
    pattern=r"www\.[\w.-]+\.\bcom\b",
)

x = np.array(["www.google.com", "www.facebook.com", "www.bbc.co.uk"]).astype(
    object
)
result = np.array([True, True, False])
expect(node, inputs=[x], outputs=[result], name="test_regex_full_match_basic")
```

</details>


<details>
<summary>match_email_domain</summary>

```python
node = onnx.helper.make_node(
    "RegexFullMatch",
    inputs=["X"],
    outputs=["Y"],
    pattern=r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)",
)

x = np.array(
    [
        ["[email protected]", "[email protected]"],
        ["not email", "[email protected]"],
    ]
).astype(object)
result = np.array([[True, False], [False, True]])
expect(
    node,
    inputs=[x],
    outputs=[result],
    name="test_regex_full_match_email_domain",
)
```

</details>


<details>
<summary>match_empty</summary>

```python
node = onnx.helper.make_node(
    "RegexFullMatch",
    inputs=["X"],
    outputs=["Y"],
    pattern=r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)",
)

x = np.array([[], []]).astype(object)
result = np.array([[], []]).astype(bool)
expect(
    node,
    inputs=[x],
    outputs=[result],
    name="test_regex_full_match_empty",
)
```

</details>


### <a name="Relu"></a><a name="relu">**Relu**</a>

Relu takes one input data (Tensor<T>) and produces one output data
74 changes: 73 additions & 1 deletion docs/TestCoverage.md
@@ -6,7 +6,7 @@
* [Overall Test Coverage](#overall-test-coverage)
# Node Test Coverage
## Summary
-Node tests have covered 177/190 (93.16%, 5 generators excluded) common operators.
+Node tests have covered 178/191 (93.19%, 5 generators excluded) common operators.

Node tests have covered 0/0 (N/A) experimental operators.
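
The change in the summary line above can be checked with a small calculation. The sketch below recomputes both percentages from the covered/total counts quoted in the diff; the function name `coverage_percent` is hypothetical and is not how ONNX generates this file.

```python
def coverage_percent(covered, total):
    """Return coverage as a percentage rounded to two decimal places."""
    return round(100 * covered / total, 2)


# Counts taken from the old and new summary lines in docs/TestCoverage.md.
before = coverage_percent(177, 190)
after = coverage_percent(178, 191)
print(f"{before:.2f}% -> {after:.2f}%")
```

Adding one operator together with its node tests raises both the numerator and the denominator by one, which is why the percentage moves only slightly.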

@@ -15258,6 +15258,78 @@ expect(
</details>


### RegexFullMatch
There are 3 test cases, listed as following:
<details>
<summary>basic</summary>

```python
node = onnx.helper.make_node(
    "RegexFullMatch",
    inputs=["X"],
    outputs=["Y"],
    pattern=r"www\.[\w.-]+\.\bcom\b",
)

x = np.array(["www.google.com", "www.facebook.com", "www.bbc.co.uk"]).astype(
    object
)
result = np.array([True, True, False])
expect(node, inputs=[x], outputs=[result], name="test_regex_full_match_basic")
```

</details>
<details>
<summary>match_email_domain</summary>

```python
node = onnx.helper.make_node(
    "RegexFullMatch",
    inputs=["X"],
    outputs=["Y"],
    pattern=r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)",
)

x = np.array(
    [
        ["[email protected]", "[email protected]"],
        ["not email", "[email protected]"],
    ]
).astype(object)
result = np.array([[True, False], [False, True]])
expect(
    node,
    inputs=[x],
    outputs=[result],
    name="test_regex_full_match_email_domain",
)
```

</details>
<details>
<summary>match_empty</summary>

```python
node = onnx.helper.make_node(
    "RegexFullMatch",
    inputs=["X"],
    outputs=["Y"],
    pattern=r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)",
)

x = np.array([[], []]).astype(object)
result = np.array([[], []]).astype(bool)
expect(
    node,
    inputs=[x],
    outputs=[result],
    name="test_regex_full_match_empty",
)
```

</details>


### Relu
There are 1 test cases, listed as following:
<details>
67 changes: 67 additions & 0 deletions onnx/backend/test/case/node/regex_full_match.py
@@ -0,0 +1,67 @@
# Copyright (c) ONNX Project Contributors
#
# SPDX-License-Identifier: Apache-2.0

import numpy as np

import onnx
from onnx.backend.test.case.base import Base
from onnx.backend.test.case.node import expect


class RegexFullMatch(Base):
    @staticmethod
    def export_basic() -> None:
        node = onnx.helper.make_node(
            "RegexFullMatch",
            inputs=["X"],
            outputs=["Y"],
            pattern=r"www\.[\w.-]+\.\bcom\b",
        )

        x = np.array(["www.google.com", "www.facebook.com", "www.bbc.co.uk"]).astype(
            object
        )
        result = np.array([True, True, False])
        expect(node, inputs=[x], outputs=[result], name="test_regex_full_match_basic")

    @staticmethod
    def export_match_email_domain() -> None:
        node = onnx.helper.make_node(
            "RegexFullMatch",
            inputs=["X"],
            outputs=["Y"],
            pattern=r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)",
        )

        x = np.array(
            [
                ["[email protected]", "[email protected]"],
                ["not email", "[email protected]"],
            ]
        ).astype(object)
        result = np.array([[True, False], [False, True]])
        expect(
            node,
            inputs=[x],
            outputs=[result],
            name="test_regex_full_match_email_domain",
        )

    @staticmethod
    def export_match_empty() -> None:
        node = onnx.helper.make_node(
            "RegexFullMatch",
            inputs=["X"],
            outputs=["Y"],
            pattern=r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)",
        )

        x = np.array([[], []]).astype(object)
        result = np.array([[], []]).astype(bool)
        expect(
            node,
            inputs=[x],
            outputs=[result],
            name="test_regex_full_match_empty",
        )
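
The shape-preserving behaviour exercised by the email-domain and empty-input cases in the test file above can be approximated without ONNX or NumPy. The sketch below is only an illustration under stated assumptions: nested Python lists stand in for a string tensor, Python's `re` stands in for RE2, the helper `full_match_nested` is hypothetical, and the addresses are made up because the ones in the scraped diff are not readable.

```python
import re


def full_match_nested(x, pattern):
    """Apply a full regex match elementwise, preserving the nested-list shape.

    Hypothetical helper: nested lists stand in for a string tensor, and
    Python's `re` stands in for RE2 (an assumption, not the ONNX implementation).
    """
    compiled = re.compile(pattern)

    def visit(item):
        # Recurse into sub-lists; match strings at the leaves.
        if isinstance(item, list):
            return [visit(child) for child in item]
        return compiled.fullmatch(item) is not None

    return visit(x)


EMAIL_DOMAIN = r"(\W|^)[\w.\-]{0,25}@(yahoo|gmail)\.com(\W|$)"

# Made-up addresses, chosen only to exercise both outcomes of the pattern.
emails = [
    ["user@yahoo.com", "user@example.org"],
    ["not email", "user@gmail.com"],
]
print(full_match_nested(emails, EMAIL_DOMAIN))

# An empty 2x0 input yields an empty 2x0 output.
print(full_match_nested([[], []], EMAIL_DOMAIN))
```

Recursing over the nesting keeps the output the same shape as the input, which mirrors the requirement that the bool tensor has the same shape as the string tensor, including when a dimension has size zero.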
Binary file not shown.
2 changes: 2 additions & 0 deletions onnx/defs/operator_sets.h
@@ -1107,6 +1107,7 @@ class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, GridSample);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, Gelu);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, ConstantOfShape);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, StringConcat);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, RegexFullMatch);
class ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, StringSplit);

// Iterate over schema from ai.onnx version 20
@@ -1118,6 +1119,7 @@ class OpSet_Onnx_ver20 {
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, Gelu)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, ConstantOfShape)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, StringConcat)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, RegexFullMatch)>());
fn(GetOpSchema<ONNX_OPERATOR_SET_SCHEMA_CLASS_NAME(Onnx, 20, StringSplit)>());
}
};