Implement new StableApiDefinition string coderange methods #443

Open
ianks opened this issue Nov 21, 2024 · 4 comments

ianks commented Nov 21, 2024

To support TruffleRuby in magnus, we need to support string coderange operations, which currently rely on CRuby internals.

/// Represents a Ruby coderange which holds information about the contents of a string.
///
/// # Notes
/// This is a low-level API and should not be used directly unless you know what you're doing. The
/// type is non-exhaustive and may have additional variants added in the future.
#[repr(u32)]
#[derive(Debug, Copy, Clone, Hash, PartialEq, Eq)]
#[non_exhaustive]
pub enum CoderangeType {
    /// The coderange is unknown, and will need to be scanned in the future when needed.
    Unknown = 0,
    /// The string contains only 7-bit ASCII characters.
    SevenBit = 1,
    /// The string contains only valid characters for the current encoding of the string.
    Valid = 2,
    /// At one point, the string contained only 7-bit ASCII characters, but it has since been
    /// modified to contain valid characters of the current encoding. As such, for certain
    /// encodings we can know it's VALID, but whether or not it's ASCII-only is currently unknown.
    SevenBitOrValid = 3,
    /// The string contains broken characters for the current encoding (i.e. invalid UTF-8 bytes).
    Broken = 4,
}

pub trait StableApiDefinition {
    /// Gets the inline coderange of an encodable object (akin to `RB_ENC_CODERANGE`).
    ///
    /// # Safety
    /// This function is unsafe because it dereferences a raw pointer to get
    /// access to underlying flags of the [`RBasic`]. The caller must ensure that
    /// the `VALUE` is a valid pointer to an RString, or any other encodable object.
    unsafe fn coderange_inline_get(&self, obj: VALUE) -> CoderangeType;

    /// Set the inline coderange of an encodable object (akin to `RB_ENC_CODERANGE_SET`).
    ///
    /// **Notes**:
    /// - A UTF-8 string should have the 7-bit ASCII coderange if the string
    ///   is ASCII only. Only use the `RUBY_ENC_CODERANGE_VALID` value if the
    ///   string has non-ASCII characters to avoid breaking assumptions made
    ///   by the Ruby interpreter.
    /// - This function does not check if the coderange is valid. It simply
    ///   sets the coderange to the given value. The caller must ensure that
    ///   the coderange is valid.
    /// - This function does not modify the Ruby string's encoding. It only
    ///   sets the coderange.
    ///
    /// # Safety
    /// This function is unsafe because it dereferences a raw pointer to get
    /// access to underlying flags of the [`RBasic`]. The caller must ensure that
    /// the `VALUE` is a valid pointer to an RString, or any other encodable object.
    unsafe fn coderange_inline_set(&self, obj: VALUE, coderange: CoderangeType);

    /// Clear the inline coderange of an encodable object (akin to `RB_ENC_CODERANGE_CLEAR`).
    ///
    /// # Safety
    /// This function is unsafe because it dereferences a raw pointer to get
    /// access to underlying flags of the [`RBasic`]. The caller must ensure that
    /// the `VALUE` is a valid pointer to an RString, or any other encodable object.
    unsafe fn coderange_inline_clear(&self, obj: VALUE);

    /// Mix two code ranges into one. This is handy, for instance, when you
    /// concatenate two strings into one (akin to `RB_ENC_CODERANGE_AND`).
    ///
    /// # Safety
    /// This function is unsafe because it dereferences a raw pointer to get
    /// access to underlying flags of the [`RBasic`]. The caller must ensure that
    /// the `VALUE` is a valid pointer to an RString, or any other encodable object.
    unsafe fn enc_coderange_inline_and(&self, obj: VALUE, coderange: CoderangeType);
}
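
For readers unfamiliar with what "flags of the `RBasic`" means here, the following is a minimal sketch of how get/set/clear could work against a header flags word. It is not the rb-sys implementation: `FakeHeader` and the free functions are hypothetical, the local `Coderange` enum omits `SevenBitOrValid`, and the bit positions are an assumption based on CRuby's public headers (coderange stored in `FL_USER8 | FL_USER9`), which should be verified against the targeted Ruby version.

```rust
// Assumed bit layout, modeled on CRuby's headers (verify per Ruby version).
const FL_USHIFT: u32 = 12;
const CODERANGE_7BIT: usize = 1 << (FL_USHIFT + 8);
const CODERANGE_VALID: usize = 1 << (FL_USHIFT + 9);
const CODERANGE_BROKEN: usize = CODERANGE_7BIT | CODERANGE_VALID;
const CODERANGE_MASK: usize = CODERANGE_BROKEN;

/// Simplified, hypothetical stand-in for the proposed `CoderangeType`.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum Coderange {
    Unknown,
    SevenBit,
    Valid,
    Broken,
}

/// Hypothetical stand-in for the flags word of an object header.
pub struct FakeHeader {
    pub flags: usize,
}

/// Read the coderange bits out of the flags word.
pub fn coderange_get(h: &FakeHeader) -> Coderange {
    match h.flags & CODERANGE_MASK {
        0 => Coderange::Unknown,
        CODERANGE_7BIT => Coderange::SevenBit,
        CODERANGE_VALID => Coderange::Valid,
        _ => Coderange::Broken, // both bits set
    }
}

/// Overwrite the coderange bits, leaving all other flags untouched.
pub fn coderange_set(h: &mut FakeHeader, cr: Coderange) {
    let bits = match cr {
        Coderange::Unknown => 0,
        Coderange::SevenBit => CODERANGE_7BIT,
        Coderange::Valid => CODERANGE_VALID,
        Coderange::Broken => CODERANGE_BROKEN,
    };
    h.flags = (h.flags & !CODERANGE_MASK) | bits;
}

/// Reset the coderange to "unknown" so it is rescanned when next needed.
pub fn coderange_clear(h: &mut FakeHeader) {
    h.flags &= !CODERANGE_MASK;
}
```

On an implementation that stores the code range in a field rather than header bits, the same three operations would presumably delegate to the C API instead of masking flags directly.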
@ianks ianks added this to the Full Truffleruby Support milestone Nov 21, 2024
@ianks ianks changed the title Implement new StableApi string coderange methods Implement new StableApiDefinition string coderange methods Nov 21, 2024
@ianks ianks added help wanted Extra attention is needed good first issue Good for newcomers and removed help wanted Extra attention is needed labels Nov 21, 2024

nirvdrum commented Dec 13, 2024

SevenBitOrValid is a value that doesn't exist in any Ruby implementation that I'm aware of. It's unfortunate that RUBY_ENC_CODERANGE_7BIT and RUBY_ENC_CODERANGE_VALID aren't mutually exclusive. It's even more unfortunate that you can get a different result on some operations by having the wrong code range. I can see why having an extra state might be interesting, but I'm concerned it'll be misunderstood and lead to incorrect usage.

nirvdrum commented Dec 13, 2024

While TruffleRuby uses a different string representation than CRuby, it does support all of the code ranges. Code ranges are a little weird in that they're not directly exposed and should be an implementation detail, but they affect the semantics of several methods, so in practice they need to be implemented. I expound on that in a comprehensive blog post for those interested.

You should get the same resolved code range in both TruffleRuby and CRuby. I say resolved because TruffleRuby will eagerly compute the code range in many cases. And, owing to a rope-based representation, there's no need to clear the code range in other cases.

TruffleRuby supports native code range functionality like the rb_enc_str_coderange function, RB_ENC_CODERANGE macro, and definitions for RUBY_ENC_CODERANGE_{UNKNOWN,7BIT,VALID,BROKEN}. Clearing and scanning for the code range may no-op in some cases, but you'll get the semantically correct code range once fully resolved. The interstitial state before a full code range scan isn't very interesting and is mostly used in CRuby as a way to short-circuit certain operations.
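
To make "resolved code range" concrete, here is a rough sketch of what a full scan yields for a UTF-8 string. It is an illustration only, not how either runtime implements the scan: `scan_utf8_coderange` and the local `Coderange` enum are hypothetical names, and it assumes Rust's UTF-8 validation agrees with Ruby's for the inputs of interest.

```rust
/// Hypothetical, simplified result of a full code range scan.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum Coderange {
    SevenBit,
    Valid,
    Broken,
}

/// Scan raw bytes and classify them, assuming the string's encoding is UTF-8.
pub fn scan_utf8_coderange(bytes: &[u8]) -> Coderange {
    if bytes.is_ascii() {
        // Every byte is below 0x80 (this includes the empty string).
        Coderange::SevenBit
    } else if std::str::from_utf8(bytes).is_ok() {
        // Well-formed UTF-8 containing at least one non-ASCII character.
        Coderange::Valid
    } else {
        // At least one byte sequence is not valid UTF-8.
        Coderange::Broken
    }
}
```

Whether a runtime computes this eagerly or lazily, the answer after the scan should be the same, which is the property described above.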

ianks commented Dec 13, 2024

@nirvdrum do the proposed method definitions make sense at least (even if we may need to no-op some)? Open to improvements here as well

nirvdrum commented Dec 13, 2024

The function signatures look pretty good, but I'm not sure what's meant by "inline". Is that a naming convention used elsewhere? In CRuby, the code range is stored in the header bits. On JRuby and TruffleRuby it's stored in a field in the Ruby string object.

I'll need to think about enc_coderange_inline_and a bit more. I like the general idea, but it's not always a simple operation. In the cases where it works, it can be advantageous, and we have that sprinkled in various places. E.g., String#b would only end up being CoderangeType::SevenBit or CoderangeType::Valid and is cheap to derive based on what you currently know. But in other cases you need the associated encoding. E.g., CoderangeType::Broken && CoderangeType::Broken could result in CoderangeType::Valid or it could result in CoderangeType::Broken, depending on the encoding, and would require a full code range scan of the resulting string to be sure.
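
One conservative way to read the concern above is sketched here. It is not CRuby's `RB_ENC_CODERANGE_AND` and not the proposed trait method: `combine_for_concat` and the local `Coderange` enum are hypothetical, and the sketch assumes both strings share the same stateless, ASCII-compatible encoding such as UTF-8. It only answers when the result is certain and otherwise falls back to `Unknown`, which forces a rescan.

```rust
/// Hypothetical, simplified stand-in for the proposed `CoderangeType`.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub enum Coderange {
    Unknown,
    SevenBit,
    Valid,
    Broken,
}

/// Derive the code range of `a + b` without scanning, where that is safe.
pub fn combine_for_concat(a: Coderange, b: Coderange) -> Coderange {
    use Coderange::*;
    match (a, b) {
        // ASCII followed by ASCII is still ASCII.
        (SevenBit, SevenBit) => SevenBit,
        // Valid characters followed by valid characters stay valid.
        (SevenBit, Valid) | (Valid, SevenBit) | (Valid, Valid) => Valid,
        // Anything involving Broken or Unknown needs a full scan: in UTF-8,
        // the lone bytes 0xC3 and 0xA9 are each broken on their own, but
        // together they form a valid two-byte character.
        _ => Unknown,
    }
}
```

Returning `Unknown` rather than `Broken` for the broken cases trades a possible extra scan for never reporting a wrong code range.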
