Avoid boxing in Framing #1247

Merged
merged 9 commits into from
May 6, 2024

JD557
Contributor

@JD557 JD557 commented Apr 3, 2024

Adds a specialized indexOf(Byte, Int) to all ByteString implementations, similar to what is done on the ByteIterator, in order to avoid boxing/unboxing when searching for a byte.

This speeds up framing quite a bit.
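
A minimal sketch of the idea, not the actual Pekko code: the `Bytes` wrapper below is hypothetical and only assumes a backing `Array[Byte]`. The generic overload receives `elem` as an erased reference, so each comparison boxes the current byte; the `Byte` overload compares primitives directly.

```scala
object IndexOfSketch {
  // Hypothetical stand-in for a ByteString backed by a plain array.
  final class Bytes(bytes: Array[Byte]) {

    // Generic overload: `elem` is erased to a reference type, so each
    // comparison boxes the current byte before the equality check.
    def indexOf[B >: Byte](elem: B, from: Int): Int = {
      var i = math.max(from, 0)
      while (i < bytes.length) {
        if (elem == bytes(i)) return i
        i += 1
      }
      -1
    }

    // Specialized overload: both operands are primitive bytes, no boxing.
    def indexOf(elem: Byte, from: Int): Int = {
      var i = math.max(from, 0)
      while (i < bytes.length) {
        if (bytes(i) == elem) return i
        i += 1
      }
      -1
    }
  }

  val frame = new Bytes("ab\ncd\n".getBytes("US-ASCII"))
  val newline: Byte = '\n'.toByte

  val viaSpecialized = frame.indexOf(newline, 0)      // argument is statically a Byte
  val viaGeneric     = frame.indexOf(newline: Any, 0) // ascription forces the generic path
  val secondNewline  = frame.indexOf(newline, 3)      // search resumes after the first frame
}
```

Both overloads return the same index; only the cost of each comparison differs.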

Benchmark results (Java 11, Scala 2.13, Apple M3 Max):

Old
Benchmark                 (framePerSeq)  (messageSize)   Mode  Cnt         Score         Error  Units
FramingBenchmark.framing              1             32  thrpt    3  10113843.601 ±  347860.528  ops/s
FramingBenchmark.framing              1             64  thrpt    3   9649511.788 ±  275242.133  ops/s
FramingBenchmark.framing              1            128  thrpt    3   8121220.174 ± 1320460.364  ops/s
FramingBenchmark.framing              1            256  thrpt    3   5466330.854 ± 3260665.704  ops/s
FramingBenchmark.framing              1            512  thrpt    3   2859562.317 ±  489750.599  ops/s
FramingBenchmark.framing              1           1024  thrpt    3   1594647.037 ±   52601.259  ops/s
FramingBenchmark.framing              8             32  thrpt    3   1772199.560 ±   36471.572  ops/s
FramingBenchmark.framing              8             64  thrpt    3   1595775.427 ±   94933.072  ops/s
FramingBenchmark.framing              8            128  thrpt    3   1074911.259 ±  198661.935  ops/s
FramingBenchmark.framing              8            256  thrpt    3    715528.264 ±  117832.304  ops/s
FramingBenchmark.framing              8            512  thrpt    3    388890.655 ±   23628.881  ops/s
FramingBenchmark.framing              8           1024  thrpt    3    218333.423 ±    3502.994  ops/s
FramingBenchmark.framing             16             32  thrpt    3   1073203.353 ±   56959.738  ops/s
FramingBenchmark.framing             16             64  thrpt    3    835697.061 ±   61190.351  ops/s
FramingBenchmark.framing             16            128  thrpt    3    620562.797 ±    4529.555  ops/s
FramingBenchmark.framing             16            256  thrpt    3    380776.167 ±    3069.815  ops/s
FramingBenchmark.framing             16            512  thrpt    3    201483.021 ±   27772.140  ops/s
FramingBenchmark.framing             16           1024  thrpt    3    110883.293 ±    2468.349  ops/s
FramingBenchmark.framing             32             32  thrpt    3    589338.870 ±   48664.767  ops/s
FramingBenchmark.framing             32             64  thrpt    3    438839.627 ±   98301.907  ops/s
FramingBenchmark.framing             32            128  thrpt    3    318200.080 ±    3745.204  ops/s
FramingBenchmark.framing             32            256  thrpt    3    196484.642 ±    1439.287  ops/s
FramingBenchmark.framing             32            512  thrpt    3    102559.681 ±    1066.612  ops/s
FramingBenchmark.framing             32           1024  thrpt    3     55713.288 ±    1021.259  ops/s
FramingBenchmark.framing             64             32  thrpt    3    260941.068 ±   11029.238  ops/s
FramingBenchmark.framing             64             64  thrpt    3    213252.824 ±    2763.842  ops/s
FramingBenchmark.framing             64            128  thrpt    3    152106.933 ±    3544.177  ops/s
FramingBenchmark.framing             64            256  thrpt    3     97187.752 ±   16306.733  ops/s
FramingBenchmark.framing             64            512  thrpt    3     51460.170 ±     851.556  ops/s
FramingBenchmark.framing             64           1024  thrpt    3     27853.852 ±    1002.975  ops/s
FramingBenchmark.framing            128             32  thrpt    3    143485.697 ±    2294.380  ops/s
FramingBenchmark.framing            128             64  thrpt    3    108585.695 ±    1580.502  ops/s
FramingBenchmark.framing            128            128  thrpt    3     74709.074 ±    1104.879  ops/s
FramingBenchmark.framing            128            256  thrpt    3     49446.714 ±    3098.863  ops/s
FramingBenchmark.framing            128            512  thrpt    3     25852.970 ±    2643.999  ops/s
FramingBenchmark.framing            128           1024  thrpt    3     13979.600 ±     221.422  ops/s
New
Benchmark                 (framePerSeq)  (messageSize)   Mode  Cnt         Score        Error  Units
FramingBenchmark.framing              1             32  thrpt    3  12488386.857 ± 550524.197  ops/s
FramingBenchmark.framing              1             64  thrpt    3  10825562.059 ± 619829.640  ops/s
FramingBenchmark.framing              1            128  thrpt    3  10298423.557 ± 379279.669  ops/s
FramingBenchmark.framing              1            256  thrpt    3   7195103.283 ± 489354.505  ops/s
FramingBenchmark.framing              1            512  thrpt    3   5318261.366 ± 271701.067  ops/s
FramingBenchmark.framing              1           1024  thrpt    3   3380494.654 ± 293146.013  ops/s
FramingBenchmark.framing              8             32  thrpt    3   1876127.541 ± 228271.371  ops/s
FramingBenchmark.framing              8             64  thrpt    3   1777053.701 ± 146526.586  ops/s
FramingBenchmark.framing              8            128  thrpt    3   1430280.685 ±  97618.551  ops/s
FramingBenchmark.framing              8            256  thrpt    3   1130490.909 ± 134372.173  ops/s
FramingBenchmark.framing              8            512  thrpt    3    784673.367 ±  42187.315  ops/s
FramingBenchmark.framing              8           1024  thrpt    3    496979.878 ±  65898.234  ops/s
FramingBenchmark.framing             16             32  thrpt    3   1085230.634 ±  80359.239  ops/s
FramingBenchmark.framing             16             64  thrpt    3    972758.829 ±  47127.839  ops/s
FramingBenchmark.framing             16            128  thrpt    3    794532.378 ±  35834.092  ops/s
FramingBenchmark.framing             16            256  thrpt    3    601945.868 ± 120485.938  ops/s
FramingBenchmark.framing             16            512  thrpt    3    414966.392 ± 187654.169  ops/s
FramingBenchmark.framing             16           1024  thrpt    3    259000.528 ±  58471.624  ops/s
FramingBenchmark.framing             32             32  thrpt    3    593169.099 ±   4840.018  ops/s
FramingBenchmark.framing             32             64  thrpt    3    528762.400 ±  23527.622  ops/s
FramingBenchmark.framing             32            128  thrpt    3    429482.997 ±  13070.825  ops/s
FramingBenchmark.framing             32            256  thrpt    3    321971.065 ±  35385.581  ops/s
FramingBenchmark.framing             32            512  thrpt    3    214747.165 ±  34767.609  ops/s
FramingBenchmark.framing             32           1024  thrpt    3    130871.717 ±  30464.051  ops/s
FramingBenchmark.framing             64             32  thrpt    3    318087.922 ±   7819.818  ops/s
FramingBenchmark.framing             64             64  thrpt    3    259738.166 ±   9262.097  ops/s
FramingBenchmark.framing             64            128  thrpt    3    208251.362 ±   9641.613  ops/s
FramingBenchmark.framing             64            256  thrpt    3    154602.853 ±  17896.718  ops/s
FramingBenchmark.framing             64            512  thrpt    3    101678.754 ±  20105.737  ops/s
FramingBenchmark.framing             64           1024  thrpt    3     62230.188 ±  19234.156  ops/s
FramingBenchmark.framing            128             32  thrpt    3    150634.966 ±  15251.295  ops/s
FramingBenchmark.framing            128             64  thrpt    3    127661.514 ±   2498.963  ops/s
FramingBenchmark.framing            128            128  thrpt    3    105351.600 ±   6728.938  ops/s
FramingBenchmark.framing            128            256  thrpt    3     80119.663 ±  11005.101  ops/s
FramingBenchmark.framing            128            512  thrpt    3     52383.251 ±  11319.119  ops/s
FramingBenchmark.framing            128           1024  thrpt    3     28313.662 ±  11664.417  ops/s
Both (Interleaved)
Benchmark                     (framePerSeq)  (messageSize)   Mode  Cnt         Score        Error  Units
FramingBenchmark.framing_old              1             32  thrpt    3  10113843.601 ±  347860.528  ops/s
FramingBenchmark.framing_new              1             32  thrpt    3  12488386.857 ± 550524.197  ops/s
FramingBenchmark.framing_old              1             64  thrpt    3   9649511.788 ±  275242.133  ops/s
FramingBenchmark.framing_new              1             64  thrpt    3  10825562.059 ± 619829.640  ops/s
FramingBenchmark.framing_old              1            128  thrpt    3   8121220.174 ± 1320460.364  ops/s
FramingBenchmark.framing_new              1            128  thrpt    3  10298423.557 ± 379279.669  ops/s
FramingBenchmark.framing_old              1            256  thrpt    3   5466330.854 ± 3260665.704  ops/s
FramingBenchmark.framing_new              1            256  thrpt    3   7195103.283 ± 489354.505  ops/s
FramingBenchmark.framing_old              1            512  thrpt    3   2859562.317 ±  489750.599  ops/s
FramingBenchmark.framing_new              1            512  thrpt    3   5318261.366 ± 271701.067  ops/s
FramingBenchmark.framing_old              1           1024  thrpt    3   1594647.037 ±   52601.259  ops/s
FramingBenchmark.framing_new              1           1024  thrpt    3   3380494.654 ± 293146.013  ops/s
FramingBenchmark.framing_old              8             32  thrpt    3   1772199.560 ±   36471.572  ops/s
FramingBenchmark.framing_new              8             32  thrpt    3   1876127.541 ± 228271.371  ops/s
FramingBenchmark.framing_old              8             64  thrpt    3   1595775.427 ±   94933.072  ops/s
FramingBenchmark.framing_new              8             64  thrpt    3   1777053.701 ± 146526.586  ops/s
FramingBenchmark.framing_old              8            128  thrpt    3   1074911.259 ±  198661.935  ops/s
FramingBenchmark.framing_new              8            128  thrpt    3   1430280.685 ±  97618.551  ops/s
FramingBenchmark.framing_old              8            256  thrpt    3    715528.264 ±  117832.304  ops/s
FramingBenchmark.framing_new              8            256  thrpt    3   1130490.909 ± 134372.173  ops/s
FramingBenchmark.framing_old              8            512  thrpt    3    388890.655 ±   23628.881  ops/s
FramingBenchmark.framing_new              8            512  thrpt    3    784673.367 ±  42187.315  ops/s
FramingBenchmark.framing_old              8           1024  thrpt    3    218333.423 ±    3502.994  ops/s
FramingBenchmark.framing_new              8           1024  thrpt    3    496979.878 ±  65898.234  ops/s
FramingBenchmark.framing_old             16             32  thrpt    3   1073203.353 ±   56959.738  ops/s
FramingBenchmark.framing_new             16             32  thrpt    3   1085230.634 ±  80359.239  ops/s
FramingBenchmark.framing_old             16             64  thrpt    3    835697.061 ±   61190.351  ops/s
FramingBenchmark.framing_new             16             64  thrpt    3    972758.829 ±  47127.839  ops/s
FramingBenchmark.framing_old             16            128  thrpt    3    620562.797 ±    4529.555  ops/s
FramingBenchmark.framing_new             16            128  thrpt    3    794532.378 ±  35834.092  ops/s
FramingBenchmark.framing_old             16            256  thrpt    3    380776.167 ±    3069.815  ops/s
FramingBenchmark.framing_new             16            256  thrpt    3    601945.868 ± 120485.938  ops/s
FramingBenchmark.framing_old             16            512  thrpt    3    201483.021 ±   27772.140  ops/s
FramingBenchmark.framing_new             16            512  thrpt    3    414966.392 ± 187654.169  ops/s
FramingBenchmark.framing_old             16           1024  thrpt    3    110883.293 ±    2468.349  ops/s
FramingBenchmark.framing_new             16           1024  thrpt    3    259000.528 ±  58471.624  ops/s
FramingBenchmark.framing_old             32             32  thrpt    3    589338.870 ±   48664.767  ops/s
FramingBenchmark.framing_new             32             32  thrpt    3    593169.099 ±   4840.018  ops/s
FramingBenchmark.framing_old             32             64  thrpt    3    438839.627 ±   98301.907  ops/s
FramingBenchmark.framing_new             32             64  thrpt    3    528762.400 ±  23527.622  ops/s
FramingBenchmark.framing_old             32            128  thrpt    3    318200.080 ±    3745.204  ops/s
FramingBenchmark.framing_new             32            128  thrpt    3    429482.997 ±  13070.825  ops/s
FramingBenchmark.framing_old             32            256  thrpt    3    196484.642 ±    1439.287  ops/s
FramingBenchmark.framing_new             32            256  thrpt    3    321971.065 ±  35385.581  ops/s
FramingBenchmark.framing_old             32            512  thrpt    3    102559.681 ±    1066.612  ops/s
FramingBenchmark.framing_new             32            512  thrpt    3    214747.165 ±  34767.609  ops/s
FramingBenchmark.framing_old             32           1024  thrpt    3     55713.288 ±    1021.259  ops/s
FramingBenchmark.framing_new             32           1024  thrpt    3    130871.717 ±  30464.051  ops/s
FramingBenchmark.framing_old             64             32  thrpt    3    260941.068 ±   11029.238  ops/s
FramingBenchmark.framing_new             64             32  thrpt    3    318087.922 ±   7819.818  ops/s
FramingBenchmark.framing_old             64             64  thrpt    3    213252.824 ±    2763.842  ops/s
FramingBenchmark.framing_new             64             64  thrpt    3    259738.166 ±   9262.097  ops/s
FramingBenchmark.framing_old             64            128  thrpt    3    152106.933 ±    3544.177  ops/s
FramingBenchmark.framing_new             64            128  thrpt    3    208251.362 ±   9641.613  ops/s
FramingBenchmark.framing_old             64            256  thrpt    3     97187.752 ±   16306.733  ops/s
FramingBenchmark.framing_new             64            256  thrpt    3    154602.853 ±  17896.718  ops/s
FramingBenchmark.framing_old             64            512  thrpt    3     51460.170 ±     851.556  ops/s
FramingBenchmark.framing_new             64            512  thrpt    3    101678.754 ±  20105.737  ops/s
FramingBenchmark.framing_old             64           1024  thrpt    3     27853.852 ±    1002.975  ops/s
FramingBenchmark.framing_new             64           1024  thrpt    3     62230.188 ±  19234.156  ops/s
FramingBenchmark.framing_old            128             32  thrpt    3    143485.697 ±    2294.380  ops/s
FramingBenchmark.framing_new            128             32  thrpt    3    150634.966 ±  15251.295  ops/s
FramingBenchmark.framing_old            128             64  thrpt    3    108585.695 ±    1580.502  ops/s
FramingBenchmark.framing_new            128             64  thrpt    3    127661.514 ±   2498.963  ops/s
FramingBenchmark.framing_old            128            128  thrpt    3     74709.074 ±    1104.879  ops/s
FramingBenchmark.framing_new            128            128  thrpt    3    105351.600 ±   6728.938  ops/s
FramingBenchmark.framing_old            128            256  thrpt    3     49446.714 ±    3098.863  ops/s
FramingBenchmark.framing_new            128            256  thrpt    3     80119.663 ±  11005.101  ops/s
FramingBenchmark.framing_old            128            512  thrpt    3     25852.970 ±    2643.999  ops/s
FramingBenchmark.framing_new            128            512  thrpt    3     52383.251 ±  11319.119  ops/s
FramingBenchmark.framing_old            128           1024  thrpt    3     13979.600 ±     221.422  ops/s
FramingBenchmark.framing_new            128           1024  thrpt    3     28313.662 ±  11664.417  ops/s
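
To read the tables as speedups, the sketch below divides new by old throughput for four rows copied from the results above. The object and field names are made up for illustration. With only 3 iterations per row and fairly wide error margins, the ratios are indicative rather than precise: roughly 1.05x to 1.23x at 32-byte messages and roughly 2.0x to 2.1x at 1024-byte messages.

```scala
object FramingSpeedup {
  // (framePerSeq, messageSize, old ops/s, new ops/s), copied from the tables above.
  val samples: List[(Int, Int, Double, Double)] = List(
    (1, 32, 10113843.601, 12488386.857),
    (1, 1024, 1594647.037, 3380494.654),
    (128, 32, 143485.697, 150634.966),
    (128, 1024, 13979.600, 28313.662)
  )

  // Throughput ratio new / old for each sampled configuration.
  val speedups: List[(Int, Int, Double)] =
    samples.map { case (frames, size, oldOps, newOps) => (frames, size, newOps / oldOps) }

  def main(args: Array[String]): Unit =
    speedups.foreach { case (frames, size, ratio) =>
      println(f"framePerSeq=$frames%3d messageSize=$size%4d speedup=$ratio%.2fx")
    }
}
```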

@He-Pin
Member

He-Pin commented Apr 4, 2024

The byte search could also be done with SIMD, but I lack the time to do this :(

@He-Pin He-Pin added the performance Related to performance label Apr 6, 2024
@He-Pin He-Pin added this to the 1.1.0-M1 milestone Apr 6, 2024
Member

@He-Pin He-Pin left a comment

lgtm, thanks

@He-Pin He-Pin requested a review from Roiocam April 12, 2024 06:24
@He-Pin
Member

He-Pin commented Apr 12, 2024

@Roiocam @jxnu-liguobin @jrudolph Would you like to take a look at this?

@He-Pin He-Pin added the late-release-note late breaking changes that will require release notes changes label Apr 15, 2024
else {
var found = -1
var i = math.max(from, 0)
while (i < length && found == -1) {
Member

I think if we swap the condition to (found == -1 && i < length), we can skip a bit of work once the byte has been found :)

if (bytes(startIndex + i) == elem) found = i
i += 1
}
found
Member

How about we hoist the startIndex + i computation to the start of the loop and return found - startIndex?
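
The two review suggestions on this loop can be sketched together as follows. This is an illustration under assumptions, not the merged code: the free-standing signature with `bytes`, `startIndex` and `length` parameters is invented here, and whether either change is measurably faster would need a benchmark.

```scala
object LoopSketch {
  // Searches the window [startIndex, startIndex + length) of `bytes`.
  // Returns an index relative to the window, or -1 when `elem` is absent.
  def indexOf(bytes: Array[Byte], startIndex: Int, length: Int, elem: Byte, from: Int): Int =
    if (from >= length) -1
    else {
      val end = startIndex + length
      var i = startIndex + math.max(from, 0) // offset added once, not on every iteration
      var found = -1
      while (found == -1 && i < end) {       // `found` is tested first
        if (bytes(i) == elem) found = i
        i += 1
      }
      // `found` holds an absolute position, so translate it back to the window.
      if (found == -1) -1 else found - startIndex
    }

  // Window of interest below is positions 2..5, i.e. the bytes 1, 2, 3, 2.
  val backing: Array[Byte] = Array[Byte](9, 9, 1, 2, 3, 2, 9)
}
```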

@He-Pin
Member

He-Pin commented Apr 15, 2024

I think we can delegate indexOf[B >: Byte](elem: B, from: Int): Int to indexOf(elem: Byte, from: Int): Int with Byte.unbox to reduce the duplicated code, wdyt @JD557? Or we can just merge this and do that later.

I have checked the bytecode, will do this after work.

@JD557
Contributor Author

JD557 commented Apr 15, 2024

Unfortunately, I don't think that would work, as the signature requires [B >: Byte]. B could be Any and in that case we need the slow implementation that falls back to Object.equals 😕

Code example:

case class MyByteSeq(data: List[Byte]) {
  def fastIndexOf(byte: Byte): Int = data.indexOf(byte)
  //def indexOf1[B >: Byte](x: B): Int = fastIndexOf(Byte.unbox(x)) // This won't compile
  def indexOf2[B >: Byte](x: B): Int = fastIndexOf(Byte.unbox(x.asInstanceOf[Object])) // This can fail at runtime
  def indexOf3[B >: Byte](x: B): Int = x match {
    case b: Byte => fastIndexOf(b) // I think the pattern match already unboxes it
    case _ => ??? // I can't call fastIndexOf here, so I need a duplicated version of the code anyway
  }
}

@He-Pin
Member

He-Pin commented Apr 15, 2024

Thanks, but I checked the bytecode and it was java.lang.Object, so I'm not sure why that would fail at runtime.

@He-Pin
Member

He-Pin commented Apr 15, 2024

@JD557 do you have any further improvements for this PR? I would like to merge this; improvements can come later.

@JD557
Contributor Author

JD557 commented Apr 15, 2024

@JD557 do you have any further improvements for this PR? I would like to merge this; improvements can come later.

Not really, feel free to merge this and improve later

Contributor

@jrudolph jrudolph left a comment

That still feels like a footgun, since it is quite unclear in which situations the new overload will be selected? Should we document in which case the static overload resolution will choose the new method?

Is it somehow possible to avoid the massive code duplication here (probably hard since the whole existing scheme is built around the function of Any.==)? If not it should be documented.

Ultimately, the main problem of ByteString has always been the attempt to make it fit seamlessly into the rest of the Scala collections by making it part of the collections type hierarchy (instead of making it fast and useful in the first place and then consider useful conversions/views to the Scala collection types).

@JD557
Contributor Author

JD557 commented Apr 15, 2024

That still feels like a footgun, since it is quite unclear in which situations the new overload will be selected? Should we document in which case the static overload resolution will choose the new method?

I don't have any strong feelings about the naming. I used this scheme because it's the same that's used by ByteIterator (See:

def indexOf(elem: Byte): Int = indexOf(elem, 0)
def indexOf(elem: Byte, from: Int): Int = indexWhere(_ == elem, from)
override def indexOf[B >: Byte](elem: B): Int = indexOf(elem, 0)
override def indexOf[B >: Byte](elem: B, from: Int): Int = indexWhere(_ == elem, from)
).

If we change the method here (e.g. to indexOfByte, like in 18c5db1), then I think it should also be changed in ByteIterator.

However, on that note:

Is it somehow possible to avoid the massive code duplication here (probably hard since the whole existing scheme is built around the function of Any.==)? If not it should be documented.

Looks like ByteIterator avoids the duplication by delegating the comparison to indexWhere. Not sure if the lambda could have a performance impact, though. I would need to benchmark this again.

(Actually, I think the equality on that ByteIterator#indexOf call to indexWhere is backwards... the stdlib does the right thing with elem == _ instead of _ == elem)
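
The remark about operand order can be illustrated with a small sketch. It assumes a hypothetical `Wrapped` class whose `equals` accepts a plain `Byte`; with `elem` on the left the element's own `equals` decides, while with the byte on the left `java.lang.Byte.equals` decides and rejects anything that is not a `Byte`.

```scala
object EqualityDirection {
  // Hypothetical wrapper that considers itself equal to a plain Byte.
  final class Wrapped(val b: Byte) {
    override def equals(that: Any): Boolean = that match {
      case other: Byte    => other == b
      case other: Wrapped => other.b == b
      case _              => false
    }
    override def hashCode: Int = b.hashCode
  }

  val data: Array[Byte] = Array[Byte](1, 2, 3)
  val elem: Any = new Wrapped(2)

  // `elem == x` ends up calling Wrapped.equals, which matches the byte 2.
  val elemOnLeft = data.indexWhere(x => elem == x)

  // `x == elem` ends up calling java.lang.Byte.equals, which never matches a Wrapped.
  val elemOnRight = data.indexWhere(x => x == elem)
}
```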

@He-Pin
Member

He-Pin commented Apr 15, 2024

@jrudolph True, but as I saw in the bytecode, the current code is compiled to indexOf(java.lang.Object, int) and a boxed Byte is then used in the method body; I think that's what @JD557 is addressing.

With the specialized version, an if_icmp-style comparison bytecode is used instead.

@JD557
Contributor Author

JD557 commented Apr 15, 2024

So, I was doing some more tests with a specialized version of indexWhere, but that's noticeably slower (even though it's faster than ByteIterator#indexWhere). So I don't see many ways to reduce the duplication.

As such, I'm not sure how to proceed with this PR:

  • Should I introduce a specialized indexWhere anyway, leading to code multiplication?
  • Do I keep the indexOf name (like the ByteIterator does) or do I name it something else?

Member

@He-Pin He-Pin left a comment

Lgtm

@He-Pin
Member

He-Pin commented Apr 15, 2024

@JD557 I think if you want to change the method name, you can use the name firstIndexOf.

As for the API: because it uses the Scala collections' indexOf, the input B, which is really a Byte, can now be Any.

As for selection, I think the compiler will choose this new method when it knows that the elem being tested has type Byte, but not when it doesn't.

As for the old indexOf, it will always return -1 if the elem is not a Number or cannot be boxed to a Byte, so a quick check and a delegation to the specialized one will not hurt too much. Who would use indexOf on a ByteString with an Any?

@He-Pin
Member

He-Pin commented Apr 15, 2024

In BoxesRunTime.java

    public static byte unboxToByte(Object b) {
        return b == null ? 0 : ((java.lang.Byte)b).byteValue();
    }

    public static java.lang.Byte boxToByte(byte b) {
        return java.lang.Byte.valueOf(b);
    }

In Byte.java

    public boolean equals(Object obj) {
        if (obj instanceof Byte) {
            return value == ((Byte)obj).byteValue();
        }
        return false;
    }

Update: BoxesRunTime.toByte will not work either.

I think there is some implicit conversion, because elem == (bytes(startIndex)) returns true and elem.equals(bytes(startIndex)) returns false 😢

@JD557
Contributor Author

JD557 commented Apr 15, 2024

I don't think that BoxesRunTime trick is enough, unfortunately.

Say one asks ByteString.indexOf(500). This should obviously return -1, as it is impossible for a byte to have that value.
However BoxesRunTime.toByte(500) returns -12: Byte.

Ideally we would check if the number is between Byte.MinValue and Byte.MaxValue before boxing, but even that is not enough, due to floating point values (BoxesRunTime.toByte(127.9) == 127.toByte == 127.0).

(There's also the annoying possibility that someone creates their own class RichByte(b: Byte) { override def equals(that: Any): Boolean = b.equals(that) || ... })

Although I imagine most problems would come from Int and Char.
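
A small sketch of the pitfalls listed above, using plain `toByte` conversions and the standard `Seq#indexOf` rather than Pekko's `ByteString` (an assumption made to keep the example self-contained). Narrowing the argument before searching turns "not present" into a false match, and `==` on boxed numbers behaves differently from `equals`.

```scala
object NarrowingPitfall {
  val bytes: Seq[Byte] = Seq[Byte](-12, 127)

  // Narrowing keeps only the low 8 bits, or drops the fraction.
  val narrowed500: Byte = 500.toByte
  val narrowed1279: Byte = 127.9.toByte

  // Generic search with Scala's numeric-aware equality.
  val genericInt      = bytes.indexOf(500: Any)   // no byte equals 500
  val genericFraction = bytes.indexOf(127.9: Any) // no byte equals 127.9
  val genericWhole    = bytes.indexOf(127.0: Any) // 127.0 == 127 under Scala's ==

  // Searching for the narrowed value reports matches that are not real.
  val afterNarrowingInt      = bytes.indexOf(narrowed500)
  val afterNarrowingFraction = bytes.indexOf(narrowed1279)

  // `==` compares boxed numbers by value, `equals` also requires the same boxed class.
  val boxedInt: Any = 127
  val plainByte: Byte = 127
  val viaEqualsOperator = boxedInt == plainByte
  val viaEqualsMethod   = boxedInt.equals(plainByte)
}
```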

@He-Pin
Member

He-Pin commented Apr 15, 2024

Yes, @JD557, I just checked again: it's using BoxesRunTime.equals for the test; otherwise we would need something like indexOfWhere and extract the comparison into a lambda.

    BALOAD
    INVOKESTATIC scala/runtime/BoxesRunTime.boxToByte (B)Ljava/lang/Byte;
    ALOAD 1
    INVOKESTATIC scala/runtime/BoxesRunTime.equals (Ljava/lang/Object;Ljava/lang/Object;)Z
    IFEQ L6

@He-Pin
Member

He-Pin commented Apr 15, 2024

@som-snytt Friendly ping: do you know of a good way to handle this? Thanks.

@jrudolph
Contributor

As for selection, I think the compiler will choose this new method when it knows that the elem being tested has type Byte, but not when it doesn't.

Probably, yes. If the general overload would not exist, the new one would also work for literals like 12 or 'a'. So, maybe it's really the best we can do right now? With this solution it will at least choose the faster version whenever the user is explicitly looking for a byte.
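
Which overload gets picked depends only on the static type of the argument. The sketch below uses a hypothetical `Probe` class with the same pair of signatures, where each overload just reports its own name. It deliberately uses typed values rather than literals.

```scala
object OverloadProbe {
  // Hypothetical class: same signatures as in the discussion, but the bodies
  // only say which overload the compiler selected.
  final class Probe {
    def indexOf(elem: Byte, from: Int): String = "specialized"
    def indexOf[B >: Byte](elem: B, from: Int): String = "generic"
  }

  val probe = new Probe
  val asByte: Byte = 10
  val asInt: Int = 10
  val asAny: Any = asByte

  val forByte = probe.indexOf(asByte, 0) // static type Byte
  val forInt  = probe.indexOf(asInt, 0)  // static type Int
  val forAny  = probe.indexOf(asAny, 0)  // static type Any, although it holds a Byte
}
```

Only the first call reaches the specialized method. A value that is a `Byte` at runtime but typed as `Any` still goes through the generic one.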

@He-Pin
Member

He-Pin commented Apr 20, 2024

@JD557 this needs another update to make MiMa happy.

@JD557
Contributor Author

JD557 commented Apr 20, 2024

I can try to fix it, but out of curiosity, isn't it OK to have MiMa issues, since this only targets 1.1.x?

I was ignoring the issue because I thought 1.1.0 was going to break bincompat with 1.0.x anyway

@mdedetrich
Contributor

mdedetrich commented Apr 20, 2024

I can try to fix it, but out of curiosity, isn't it OK to have MiMa issues, since this only targets 1.1.x?

Pekko core follows SemVer, so the only acceptable MiMa issues are for internal code (i.e. @InternalApi) or with false positives

I was ignoring the issue because I thought 1.1.0 was going to break bincompat with 1.0.x anyway

No it doesn't, that would be for Pekko 2.x.x.

@@ -823,7 +874,33 @@ sealed abstract class ByteString
override def indexWhere(p: Byte => Boolean, from: Int): Int = iterator.indexWhere(p, from)

// optimized in subclasses
- override def indexOf[B >: Byte](elem: B, from: Int): Int = indexOf(elem, from)
+ override def indexOf[B >: Byte](elem: B, from: Int): Int = super.indexOf(elem, from)
Contributor Author

I am aware that this override is a bit weird, but:

  • I think the old code was an infinite loop
  • Removing this override breaks the MiMa checks

This should never be called anyway, but I think having it like this is a bit safer.
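
Why the `super.` prefix matters can be shown with a small stand-alone sketch. `Searchable` and `Wrapper` are hypothetical names, not Pekko types. Without `super.`, the call inside the override would resolve to the override itself and recurse without terminating.

```scala
object SuperDelegation {
  // Hypothetical base trait providing the generic search.
  trait Searchable {
    def bytes: Array[Byte]

    def indexOf[B >: Byte](elem: B, from: Int): Int = {
      var i = math.max(from, 0)
      while (i < bytes.length) {
        if (elem == bytes(i)) return i
        i += 1
      }
      -1
    }
  }

  final class Wrapper(val bytes: Array[Byte]) extends Searchable {
    // Writing `indexOf(elem, from)` here would call this very method again;
    // `super.` reaches the inherited implementation instead.
    override def indexOf[B >: Byte](elem: B, from: Int): Int = super.indexOf(elem, from)
  }

  val result = new Wrapper(Array[Byte](4, 5, 6)).indexOf(6.toByte, 0)
}
```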

@JD557
Contributor Author

JD557 commented Apr 22, 2024

OK, I think the MiMa issues should be fixed now.

Contributor

@pjfanning pjfanning left a comment

lgtm

@pjfanning
Contributor

@jrudolph do you still think this PR change needs more work?

@He-Pin
Member

He-Pin commented Apr 24, 2024

@sirthias Would you like to give some input about this too?

@sirthias
Contributor

The only comment I have is that there would be a potentially much more efficient implementation of indexOf(elem: Byte, from: Int) if we somehow had low-level (unsafe) access to the byte array and could read 8 bytes at once into a long.
Then we could use SWAR (SIMD within a register) and reduce the loop count by 8 (asymptotically) over the current implementation.
Doesn't pekko already have an Unsafe access construct somewhere?
IIRC Akka did...
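
A rough sketch of the SWAR idea, under stated assumptions: instead of Unsafe, it reads 8 bytes at a time through a little-endian `java.nio.ByteBuffer` view of the array, and `SwarSearch` is a made-up name. It uses the well-known "has zero byte" bit trick on the word XORed with the repeated target byte. Whether this beats the simple loop is unmeasured here.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object SwarSearch {
  private val Ones  = 0x0101010101010101L // 0x01 in every byte lane
  private val Highs = 0x8080808080808080L // 0x80 in every byte lane

  def indexOf(bytes: Array[Byte], elem: Byte, from: Int): Int = {
    // Little-endian view: byte at offset i + k lands in bits 8k..8k+7 of the word.
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val pattern = (elem & 0xFFL) * Ones // target byte repeated in all 8 lanes
    var i = math.max(from, 0)

    // Main loop: 8 bytes per iteration.
    while (i <= bytes.length - 8) {
      val x = buf.getLong(i) ^ pattern          // matching lanes become 0x00
      val zeros = (x - Ones) & ~x & Highs       // non-zero iff some lane is 0x00
      if (zeros != 0L)
        // The lowest set bit always marks the first matching lane.
        return i + (java.lang.Long.numberOfTrailingZeros(zeros) >>> 3)
      i += 8
    }

    // Tail: fewer than 8 bytes left, fall back to the plain loop.
    while (i < bytes.length) {
      if (bytes(i) == elem) return i
      i += 1
    }
    -1
  }
}
```

For example, `SwarSearch.indexOf(data, '\n'.toByte, 0)` would scan `data` for the first newline, eight bytes per step until the tail.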

@He-Pin
Member

He-Pin commented Apr 25, 2024

@sirthias Yes, I opened an issue for SIMD in #1264; it would be nice to have that after this has been merged. I think https://github.com/sirthias/borer must have already done this.

@sirthias
Contributor

Yes, there is a (much more complicated) SWAR loop implemented in borer's JSON parser.
Here we only have to look for a single known byte rather than a whole set of different characters and we also don't have to copy segments and do UTF8 decoding at the same time.

But the whole thing only makes sense if we can really get down to raw byte access via Unsafe or more modern means. And on ScalaJS a SWAR approach will just create overhead and be a lot slower than the simple loop.

@pjfanning
Contributor

@He-Pin @mdedetrich @Roiocam @samueleresca @raboof @jxnu-liguobin should we get this sorted out before 1.1.0-M1 RC or should we look at this again after the 1.1.0-M1 release?

@He-Pin
Member

He-Pin commented May 2, 2024

Sorry for the lack of response, holidays here. I think we can include this in M1 and try SWAR after that.

@pjfanning pjfanning dismissed jrudolph’s stale review May 3, 2024 11:08

changes have been made but I will leave the PR open a few days just in case

@pjfanning pjfanning merged commit cce5f9b into apache:main May 6, 2024
17 of 18 checks passed
@pjfanning
Contributor

Merged - thanks @JD557

@pjfanning pjfanning removed the late-release-note late breaking changes that will require release notes changes label May 6, 2024
@JD557 JD557 mentioned this pull request Nov 22, 2024
4 tasks