
Support big endian platform by providing new ops implementation #559

Merged
merged 12 commits into from
Dec 21, 2021

Conversation

andrewsi-z
Contributor

Based on discussion in spaCy issue 9428, a set of modules related to loading and using pre-trained language pipelines and/or models breaks on platforms with a different byte order. This leads to incorrect results when these pretrained models are used on platforms such as s390x.

As discussed in the spaCy issue referenced above, this PR adds support for thinc-bigendian-ops, which is created and located here: https://github.com/andrewsi-z/thinc-bigendian-ops.

thinc-bigendian-ops follows the precedent and use of thinc-apple-ops. When imported, it provides a custom ops class that implements the following:

  • numpy_ops.pyx `hash` has logic that maps an unsigned char array onto the input ids array (uint64). The approach used maps the underlying storage and "views" it as a char array, so this implementation produces different values depending on system byte order. BigEndianOps modifies this algorithm and provides an alternative based on bit-shift operations, which returns the same value regardless of system byte order.
  • A new asarray method is implemented, which checks the byte order of the output before returning it to the caller. In cases where the byte order is little-endian, it is byte-swapped.
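To illustrate the bit-shift idea (this is a rough sketch, not the actual Cython in thinc or thinc-bigendian-ops, and the function name is invented): extracting each byte with shifts and masks always yields the same byte sequence, whereas reinterpreting the buffer with `.view(np.uint8)` reflects host byte order.

```python
import numpy as np

def ids_to_bytes(ids: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: decompose each uint64 id into 8 bytes using
    bit-shifts, always in the same (little-endian) order, so the result
    does not depend on how the host stores the integers in memory."""
    ids = ids.astype(np.uint64)  # normalises any non-native byte order
    out = np.empty(ids.shape + (8,), dtype=np.uint8)
    for i in range(8):
        out[..., i] = (ids >> np.uint64(8 * i)) & np.uint64(0xFF)
    return out

# The same logical values stored with the opposite byte order yield
# identical output, which a raw .view(np.uint8) cast would not.
ids = np.array([0x0102030405060708], dtype=np.uint64)
swapped = ids.byteswap().view(ids.dtype.newbyteorder())
assert np.array_equal(ids_to_bytes(ids), ids_to_bytes(swapped))
```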

The logic in thinc proper is meant to mirror the logic added for thinc_apple_ops in a minimally intrusive way.

Notes:

  • ran test suite with thinc-bigendian-ops on s390x (big endian)
  • ran spaCy _sm model pipelines to validate the original use case.

@andrewsi-z andrewsi-z marked this pull request as ready for review November 12, 2021 19:25
@honnibal
Member

I think this looks like a good match for how we've been doing things. @adrianeboyd do you have any comments? It looks okay to me. If it all works, I'm pleased to resolve this long-standing platform support issue!

I do think the mechanism we've designed for the ops selection could be better, though. It's quite unsightly the way we have to reference the subclasses in Thinc, and it leads to circular import problems.

Here's a suggestion we could try to improve this:

  • Ops classes implement a classmethod get_hardware_suitability(name, **kwargs) that returns an integer. A negative integer indicates incompatibility. A higher integer indicates that the ops class claims it should be used. kwargs can be passed in from above to influence the match.
  • In get_ops, we sort the registered ops by their hardware suitability score, and select the highest available. If no compatible ops are found, we raise an error.

This will prevent Thinc from having to know all of the ops classes in order to resolve which one should be used. Generally ops classes should return -1 for incompatible, 0 for compatible but not optimised, and 1 for optimised. If another plugin wants to overrule existing ones in specific circumstances it can return a higher match score under those conditions (including by supporting arbitrary keyword arguments).
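A minimal sketch of this proposal (class names, scores, and the registry are assumed for illustration; this is not thinc's actual API):

```python
import sys

class NumpyOps:
    name = "numpy"

    @classmethod
    def get_hardware_suitability(cls, **kwargs) -> int:
        return 0  # compatible everywhere, but not optimised

class BigEndianOps(NumpyOps):
    name = "bigendian"

    @classmethod
    def get_hardware_suitability(cls, **kwargs) -> int:
        # Required on big-endian hosts, incompatible on little-endian.
        return 1 if sys.byteorder == "big" else -1

# In practice this list would be populated from plugin entry points.
registered_ops = [NumpyOps, BigEndianOps]

def get_ops(**kwargs):
    """Pick the registered ops class with the highest non-negative
    suitability score, so thinc never has to name plugin subclasses."""
    scored = [(c.get_hardware_suitability(**kwargs), c) for c in registered_ops]
    compatible = [sc for sc in scored if sc[0] >= 0]
    if not compatible:
        raise RuntimeError("No compatible ops implementation registered")
    return max(compatible, key=lambda sc: sc[0])[1]
```

A plugin that wants to overrule others in specific circumstances would simply return a score higher than 1 under those conditions.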

I think we could merge the current PR as-is, as it doesn't seem to make things worse (that I can see). But I think the patch does indicate we could improve our system here.

@andrewsi-z
Contributor Author

For reference on associated issues, I should mention I have the needed murmurhash changes here: https://github.com/andrewsi-z/murmurhash
It uses the endian-agnostic murmurhash3 design introduced (I believe) for Node's use of the library.
I followed the same design and applied it to hash64a in murmurhash2.cpp, which caused errors. I will open an issue against explosion/murmurhash to discuss whether you'd like these fixes or prefer to keep a separate murmurhash library for big endian (which also works fine functionally).
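The endian-agnostic pattern referenced here amounts to reading each hash input block with an explicit byte order instead of casting the raw pointer. A Python illustration of the idea (murmurhash itself is C++; the function name here is invented):

```python
import struct

def read_block_u64(buf: bytes, i: int) -> int:
    # "<Q" forces a little-endian read of the i-th 8-byte block, so the
    # hash sees identical input on any host. A raw pointer cast in C
    # (*(uint64_t *)p) would read the block in host byte order instead.
    return struct.unpack_from("<Q", buf, i * 8)[0]

# Bytes 0x01..0x08 interpreted as one little-endian uint64:
assert read_block_u64(bytes(range(1, 9)), 0) == 0x0807060504030201
```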

@adrianeboyd
Contributor

I think I'd reverse/massage the logic in get_ops a bit, but given the current state of thinc, the basics of this PR seem fine.

Adding AppleOps was simpler than adding BigEndianOps in a number of ways:

  • you can't install it on systems where it wouldn't run correctly
  • backing off to numpy accidentally is slower, not broken

So we need to be more careful around BigEndianOps. I also hesitate a bit to add this without any CI testing on our end, but I'm not sure that we have any feasible options.

Comment on lines 99 to 103
```python
cls = ops_by_name.get("apple", ops_by_name.get("numpy"))

if "bigendian" in ops_by_name:
    cls = ops_by_name.get("bigendian", ops_by_name.get("numpy"))
```

Contributor
Suggested change

```diff
-cls = ops_by_name.get("apple", ops_by_name.get("numpy"))
-if "bigendian" in ops_by_name:
-    cls = ops_by_name.get("bigendian", ops_by_name.get("numpy"))
+cls = ops_by_name.get("numpy")
+cls = ops_by_name.get("apple", cls)
+cls = ops_by_name.get("bigendian", cls)
```

Contributor

This is where a ranking from the ops classes themselves would be better.

Contributor Author

Updated with suggested change and retested.

Comment on lines 125 to 127
```python
# avoid fallback to base NumpyOps if on big endian platform
if current_ops.name == "bigendian" and name == "numpy":
    name = current_ops.name
```
Copy link
Contributor

I think we've removed all the use_ops("numpy") from our code, so I think I'd rather have at most a warning/error rather than having use_ops silently not do what you just told it to do.

Contributor Author

I removed the highlighted code. At best, it was my attempt to keep an explicit use_ops("numpy") call made from outside of thinc from silently reverting to base NumpyOps on a big-endian platform. You're right that it is best not to ignore an explicit request.
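In the spirit of the warning suggested above, the check could look roughly like this (a hypothetical standalone helper, not thinc's actual code):

```python
import sys
import warnings

def check_requested_ops(name: str) -> str:
    """Hypothetical sketch: honour an explicit use_ops("numpy") request,
    but warn when it is known to be unsafe on a big-endian host, rather
    than silently substituting a different backend."""
    if sys.byteorder == "big" and name == "numpy":
        warnings.warn(
            "NumpyOps requested on a big-endian platform; pretrained "
            "little-endian models may produce incorrect results"
        )
    return name  # the caller's explicit choice is never overridden
```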

@andrewsi-z
Contributor Author

andrewsi-z commented Nov 19, 2021

Hi @adrianeboyd, thank you for your comments; I'll start reviewing them in more detail.

You brought up CI. I know GitHub Actions doesn't support s390x yet, but there are a number of supporting options that could potentially be triggered by a GitHub Action. These would be community-available, no-cost resources.

I can explore this as a follow-on action and present some options for discussion (and help get it going, if you and the Explosion team believe it is worthwhile to pursue).

@svlandeg
Member

svlandeg commented Dec 21, 2021

While we probably want to think about more generic solutions in the future, this should be fine to merge as is for now, and will become available in Thinc's next release 8.0.14.

@svlandeg svlandeg merged commit 7b54f72 into explosion:master Dec 21, 2021
@svlandeg svlandeg added feat / ops Backends and maths enhancement Feature requests and improvements labels Dec 21, 2021
@dacdevsgt

Hi, first off, sorry, I know it's 2024, but I have an IBM Power7 (ppc64) machine that I want to use to learn about language models. I installed Debian 12 because it has a ppc64 build. I'm really new to language models; I tried to install PyTorch, but it doesn't support the Power7 processor architecture. Installing spaCy was successful, but I got the error: ValueError: Little-endian buffer not supported on big-endian compiler. Reading this repo, it looks like you fixed the endianness problem. Can somebody help me with how to configure or use it in a Python file? I have this as a little example:

```python
import spacy
from thinc.api import use_ops

with use_ops("numpy", use_blis=False):
    nlp = spacy.load("en_core_web_sm")

# Input text
text = "SpaCy amazing NLP."

# Process the text
doc = nlp(text)

# Tokenization: print each token in the text
print("Tokens:")
for token in doc:
    print(token.text)

# Named Entity Recognition (NER)
print("\nNamed Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

# Part-of-speech tagging
print("\nPart-of-Speech Tags:")
for token in doc:
    print(f"{token.text}: {token.pos_} ({token.tag_})")

# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
    print(f"{token.text} --> {token.dep_} (head: {token.head.text})")
```

The idea is to learn using the resources of this Power7 server.

Thanks, thanks!
