
Support big endian platform by providing new ops implementation #559

Merged
merged 12 commits into from
Dec 21, 2021

Conversation

andrewsi-z
Contributor

Based on discussion in spaCy issue 9428, a set of modules related to loading and using pre-trained language pipelines and/or models breaks on platforms with a different byte order. This leads to incorrect results when these pretrained models are used on platforms such as s390x.

As discussed in the spaCy issue referenced above, this PR adds support for thinc-bigendian-ops, which is created and located here: https://github.com/andrewsi-z/thinc-bigendian-ops.

thinc-bigendian-ops follows the precedent and use of thinc-apple-ops. When imported, it provides a custom ops class that implements the following:

  • numpy_ops.pyx `hash` has logic that maps an unsigned char array onto the input ids array (uint64). The approach used maps the underlying storage and "views" it as a char array, so this implementation produces different values depending on system byte order. BigEndianOps modifies this algorithm and provides an alternative based on bit-shift operations, which returns the same value regardless of system byte order.
  • A new asarray method is implemented, which checks the byte order of the output before returning it to the caller. In cases where the byte order is little-endian, it is byte-swapped.
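To illustrate the bit-shift idea (this is a rough sketch, not the actual Cython in thinc or thinc-bigendian-ops, and the function name is invented): extracting each byte with shifts and masks always yields the same byte sequence, whereas reinterpreting the buffer with `.view(np.uint8)` reflects host byte order.

```python
import numpy as np

def ids_to_bytes(ids: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: decompose each uint64 id into 8 bytes using
    bit-shifts, always in the same (little-endian) order, so the result
    does not depend on how the host stores the integers in memory."""
    ids = ids.astype(np.uint64)  # normalises any non-native byte order
    out = np.empty(ids.shape + (8,), dtype=np.uint8)
    for i in range(8):
        out[..., i] = (ids >> np.uint64(8 * i)) & np.uint64(0xFF)
    return out

# The same logical values stored with the opposite byte order yield
# identical output, which a raw .view(np.uint8) cast would not.
ids = np.array([0x0102030405060708], dtype=np.uint64)
swapped = ids.byteswap().view(ids.dtype.newbyteorder())
assert np.array_equal(ids_to_bytes(ids), ids_to_bytes(swapped))
```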

The logic in thinc proper is meant to mirror the logic added for thinc_apple_ops in a minimally intrusive way.

Notes:

  • ran test suite with thinc-bigendian-ops on s390x (big endian)
  • ran spaCy _sm model pipelines to validate the original use case.

@andrewsi-z andrewsi-z marked this pull request as ready for review November 12, 2021 19:25
@honnibal
Member

I think this looks like a good match for how we've been doing things. @adrianeboyd do you have any comments? It looks okay to me. If it all works, I'm pleased to resolve this long-standing platform support issue!

I do think the mechanism we've designed for the ops selection could be better, though. It's quite unsightly the way we have to reference the subclasses in Thinc, and it leads to circular import problems.

Here's a suggestion we could try to improve this:

  • Ops classes implement a classmethod get_hardware_suitability(name, **kwargs) that returns an integer. A negative integer indicates incompatibility. A higher integer indicates that the ops class claims it should be used. kwargs can be passed in from above to influence the match.
  • In get_ops, we sort the registered ops by their hardware suitability score, and select the highest available. If no compatible ops are found, we raise an error.

This will prevent Thinc from having to know all of the ops classes in order to resolve which one should be used. Generally ops classes should return -1 for incompatible, 0 for compatible but not optimised, and 1 for optimised. If another plugin wants to overrule existing ones in specific circumstances it can return a higher match score under those conditions (including by supporting arbitrary keyword arguments).
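A minimal sketch of this proposal (class names, scores, and the registry are assumed for illustration; this is not thinc's actual API):

```python
import sys

class NumpyOps:
    name = "numpy"

    @classmethod
    def get_hardware_suitability(cls, **kwargs) -> int:
        return 0  # compatible everywhere, but not optimised

class BigEndianOps(NumpyOps):
    name = "bigendian"

    @classmethod
    def get_hardware_suitability(cls, **kwargs) -> int:
        # Required on big-endian hosts, incompatible on little-endian.
        return 1 if sys.byteorder == "big" else -1

# In practice this list would be populated from plugin entry points.
registered_ops = [NumpyOps, BigEndianOps]

def get_ops(**kwargs):
    """Pick the registered ops class with the highest non-negative
    suitability score, so thinc never has to name plugin subclasses."""
    scored = [(c.get_hardware_suitability(**kwargs), c) for c in registered_ops]
    compatible = [sc for sc in scored if sc[0] >= 0]
    if not compatible:
        raise RuntimeError("No compatible ops implementation registered")
    return max(compatible, key=lambda sc: sc[0])[1]
```

A plugin that wants to overrule others in specific circumstances would simply return a score higher than 1 under those conditions.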

I think we could merge the current PR as-is, as it doesn't seem to make things worse (that I can see). But I think the patch does indicate we could improve our system here.

@andrewsi-z
Contributor Author

For reference on associated issues, I should mention I have the needed murmurhash changes here: https://github.com/andrewsi-z/murmurhash
It uses the endian-agnostic murmurhash3 design introduced (I believe) for Node's use of the library.
I followed the same design and applied it to hash64a in murmurhash2.cpp, which caused errors. I will open an issue against explosion/murmurhash to discuss whether you'd like these fixes or prefer to keep a separate murmurhash library for big endian (which also works fine functionally).
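The endian-agnostic pattern referenced here amounts to reading each hash input block with an explicit byte order instead of casting the raw pointer. A Python illustration of the idea (murmurhash itself is C++; the function name here is invented):

```python
import struct

def read_block_u64(buf: bytes, i: int) -> int:
    # "<Q" forces a little-endian read of the i-th 8-byte block, so the
    # hash sees identical input on any host. A raw pointer cast in C
    # (*(uint64_t *)p) would read the block in host byte order instead.
    return struct.unpack_from("<Q", buf, i * 8)[0]

# Bytes 0x01..0x08 interpreted as one little-endian uint64:
assert read_block_u64(bytes(range(1, 9)), 0) == 0x0807060504030201
```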

@adrianeboyd
Contributor

I think I'd reverse/massage the logic in get_ops a bit, but given the current state of thinc, the basics of this PR seem fine.

Adding AppleOps was simpler than adding BigEndianOps in a number of ways:

  • you can't install it on systems where it wouldn't run correctly
  • backing off to numpy accidentally is slower, not broken

So we need to be more careful around BigEndianOps. I also hesitate a bit to add this without any CI testing on our end, but I'm not sure that we have any feasible options.

Comment on lines 99 to 103
```python
cls = ops_by_name.get("apple", ops_by_name.get("numpy"))

if "bigendian" in ops_by_name:
    cls = ops_by_name.get("bigendian", ops_by_name.get("numpy"))
```

Contributor
Suggested change

```diff
-cls = ops_by_name.get("apple", ops_by_name.get("numpy"))
-if "bigendian" in ops_by_name:
-    cls = ops_by_name.get("bigendian", ops_by_name.get("numpy"))
+cls = ops_by_name.get("numpy")
+cls = ops_by_name.get("apple", cls)
+cls = ops_by_name.get("bigendian", cls)
```

Contributor

This is where a ranking from the ops classes themselves would be better.

Contributor Author

Updated with suggested change and retested.

Comment on lines 125 to 127
```python
# avoid fallback to base NumpyOps if on big endian platform
if current_ops.name == "bigendian" and name == "numpy":
    name = current_ops.name
```
Copy link
Contributor

I think we've removed all the use_ops("numpy") from our code, so I think I'd rather have at most a warning/error rather than having use_ops silently not do what you just told it to do.

Contributor Author

I removed the highlighted code. At best, it was my attempt to keep an explicit use_ops("numpy") call made from outside of thinc from silently reverting to base NumpyOps on a big-endian platform. You're right that it is best not to ignore an explicit request.
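In the spirit of the warning suggested above, the check could look roughly like this (a hypothetical standalone helper, not thinc's actual code):

```python
import sys
import warnings

def check_requested_ops(name: str) -> str:
    """Hypothetical sketch: honour an explicit use_ops("numpy") request,
    but warn when it is known to be unsafe on a big-endian host, rather
    than silently substituting a different backend."""
    if sys.byteorder == "big" and name == "numpy":
        warnings.warn(
            "NumpyOps requested on a big-endian platform; pretrained "
            "little-endian models may produce incorrect results"
        )
    return name  # the caller's explicit choice is never overridden
```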

@andrewsi-z
Contributor Author

andrewsi-z commented Nov 19, 2021

Hi @adrianeboyd, thank you for your comments; I'll start reviewing them in more detail.

You brought up CI. I know GitHub Actions doesn't support s390x yet, but there are a number of supporting options that could potentially be triggered by a GitHub Action. These would be community-available, no-cost resources.

I can explore this as a follow-on action and present some options for discussion (and help get it going, if you and the Explosion team believe it is worthwhile to pursue).

@svlandeg
Member

svlandeg commented Dec 21, 2021

While we probably want to think about more generic solutions in the future, this should be fine to merge as is for now, and will become available in Thinc's next release 8.0.14.

@svlandeg svlandeg merged commit 7b54f72 into explosion:master Dec 21, 2021
@svlandeg svlandeg added feat / ops Backends and maths enhancement Feature requests and improvements labels Dec 21, 2021
@dacdevsgt

Hi, first off, sorry, I know it's 2024, but I have an IBM Power7 (ppc64) machine that I want to use to learn about language models. I installed Debian 12 because it has a ppc64 build. I'm really new to language models; I tried to install PyTorch, but it doesn't support the Power7 processor architecture. Installing spaCy was successful, but I got the error: ValueError: Little-endian buffer not supported on big-endian compiler. Reading this repo, it looks like you fixed the endianness problem. Can somebody help me with how to configure or use it in a Python file? I have this as a little example:

```python
import spacy
from thinc.api import use_ops

with use_ops("numpy", use_blis=False):
    nlp = spacy.load("en_core_web_sm")

# Input text
text = "SpaCy amazing NLP."

# Process the text
doc = nlp(text)

# Tokenization: print each token in the text
print("Tokens:")
for token in doc:
    print(token.text)

# Named Entity Recognition (NER)
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Part-of-speech tagging
print("\nPart-of-Speech Tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
    print(f"{token.text} --> {token.dep_} (head: {token.head.text})")
```

The idea is to learn using the resources of this Power7 server.

Thanks, thanks!
