Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for PERCENTILE_CONT semantics #380

Open
rwperrott opened this issue Jan 2, 2021 · 6 comments
Open

Add support for PERCENTILE_CONT semantics #380

rwperrott opened this issue Jan 2, 2021 · 6 comments

Comments

@rwperrott
Copy link

rwperrott commented Jan 2, 2021

Expected behavior and actual behavior:

Re: Median definition.
See code examples below.

The Agg.percentileBy method incorrectly assumes that index rounding is enough, but isn't when the the (size * percentile) still has an effectively non-zero factional part, and the floor indexed value and the next value are not equal.

I appreciate that not all Comparable types can be, or easily, added and divided, so I suggest that a modified percentile method, and median method, be added, accepting a type specific function, with parameters for the floor indexed value, the next indexed value and the fractional part of the index, to allow returning a more correct value. The function would only need to be called when the index has an effectively non-zero fractional part and the relevant indexed values are different. Any such new functionally should also exposed anywhere else the existing median and percentile* methods are also exposed e.g. Window.

Steps to reproduce the problem:

e.g. for median

Seq.of(1d,2d,3d,4d,5d,6d,7d).median().ifPresent(System.out::println);

prints 4.0, which is correct, because the exact middle value is 4.0,

Seq.of(1d,2d,3d,4d,5d,6d,7d,8d).median().ifPresent(System.out::println);

prints 4.0 which is apparently incorrect, because the middle values are 4 and 5, so should print (4 + 5) / 2 = 4.5

Versions:

  • jOOλ: 0.9.14
  • Java: 15
@rwperrott
Copy link
Author

rwperrott commented Jan 2, 2021

I'm trying out my suggested solution. I think that a function like this could be passed into a modified copy of Agg.percentileBy, to cover ArrayList<Tuple2<T,U>> value storage, and later ArrayList<T> optimised value storage for detected identity function:

import java.util.function.Function;

@FunctionalInterface
public interface PercentileFunction<T,U> {
    T apply(T t0, // value at index before percentile position
            T t1, // next indexed value
            double indexFraction,
            Function<? super T, ? extends U> function); // to avoid need for two functions, for Tuple2<T,U> and T values
}

@rwperrott
Copy link
Author

I note that the https://infogalactic.com/info/Percentile does mention that linear interpolation is an option, just-like https://infogalactic.com/info/Median, for the median value of sorted, even length, numeric values.

@rwperrott
Copy link
Author

rwperrott commented Jan 6, 2021

This seems to work, with the above function:

// imports
public class Agg2 {
    public static <T, U> Collector<T, ?, Optional<T>>
    percentileBy(double percentile,
                 Function<? super T, ? extends U> mapper,
                 Comparator<? super U> comparator,
                 PercentileFunction<T, U> percentileFunction) {
        if (percentile < 0.0 || percentile > 1.0)
            throw new IllegalArgumentException("Percentile must be between 0.0 and 1.0");

        return Collector.of(
                (Supplier<ArrayList<Tuple2<T, U>>>) ArrayList::new,
                (l, v) -> l.add(tuple(v, mapper.apply(v))),
                (l1, l2) -> {
                    l1.addAll(l2);
                    return l1;
                },
                l -> {
                    final int size = l.size();

                    if (size == 0)
                        return Optional.empty();
                    else if (size == 1)
                        return Optional.of(l.get(0).v1);

                    l.sort(Comparator.comparing(t -> t.v2, comparator)); // Compare U

                    if (percentile == 0d)
                        return Optional.of(l.get(0).v1);
                    if (percentile == 1d)
                        return Optional.of(l.get(size - 1).v1);

                    // Limit fraction size, to stop common errors for double percentile values e.g. 2E-16.
                    // 0.5 is added because actual percentile value can be between values.
                    final double dIndex = ((double) Math.round(size * percentile * 1.0E6d) * 1.0E-6d) - 0.5d;
                    int index = (int) dIndex; // floor, for before or exact index
                    if (index >= size)
                        return Optional.of(l.get(size - 1).v1);

                    final Tuple2<T, U> t0 = l.get(index); // before or exact value
                    final double indexFraction = dIndex - index;
                    // If end or exact index, return t0 value.
                    if (++index == size || indexFraction == 0d)
                        return Optional.of(t0.v1);

                    final Tuple2<T, U> t1 = l.get(index); // after value
                    // Only call percentile function if t*.v1 values are different.
                    return (t0.v1.equals(t1.v1))
                           ? Optional.of(t0.v1)
                           : Optional.of(percentileFunction.apply(t0.v1, t1.v1, indexFraction, mapper));
                });
}

Test code

class Test {
    public void main(String[] args) {
            final Double[] values = {0d, 10d, 20d, 30d, 40d, 50d, 60d, 70d, 80d, 90d};

            final PercentileFunction<Double, Double> floor = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return v0;
            };

            final PercentileFunction<Double, Double> ceil = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return v1;
            };

            final PercentileFunction<Double, Double> halfUp = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return f < 0.5d ? v0 : v1;
            };

            final PercentileFunction<Double, Double> interpolate = (t0, t1, f, m) -> {
                double v0 = m.apply(t0);
                double v1 = m.apply(t1);
                return v0 - (v0 * f) + (v1 * f); // Linear interpolation
            };

            final String header = "percentile -> |  Agg | floor | halfUp | interpolate | ceil";
            System.out.println(header);
            for (double p = 0d; p <= 1.00d; p += 0.05d) {
                System.out.printf("   %5.3f   -> | %4.1f |  %4.1f |   %4.1f |    %4.1f     | %4.1f%n",
                                  p,
                                 Seq.of(values).percentile(p).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), floor)).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), halfUp)).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), interpolate)).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), ceil)).orElseThrow());
            }
            System.out.println(header);
    }
}

Results:

percentile -> |  Agg | floor | halfUp | interpolate | ceil
   0.000   -> |  0.0 |   0.0 |    0.0 |     0.0     |  0.0
   0.050   -> |  0.0 |   0.0 |    0.0 |     0.0     |  0.0
   0.100   -> |  0.0 |   0.0 |   10.0 |     5.0     | 10.0
   0.150   -> | 10.0 |  10.0 |   10.0 |    10.0     | 10.0
   0.200   -> | 10.0 |  10.0 |   20.0 |    15.0     | 20.0
   0.250   -> | 20.0 |  20.0 |   20.0 |    20.0     | 20.0
   0.300   -> | 20.0 |  20.0 |   30.0 |    25.0     | 30.0
   0.350   -> | 30.0 |  30.0 |   30.0 |    30.0     | 30.0
   0.400   -> | 30.0 |  30.0 |   40.0 |    35.0     | 40.0
   0.450   -> | 40.0 |  40.0 |   40.0 |    40.0     | 40.0
   0.500   -> | 40.0 |  40.0 |   50.0 |    45.0     | 50.0
   0.550   -> | 50.0 |  50.0 |   50.0 |    50.0     | 50.0
   0.600   -> | 50.0 |  50.0 |   60.0 |    55.0     | 60.0
   0.650   -> | 60.0 |  60.0 |   60.0 |    60.0     | 60.0
   0.700   -> | 70.0 |  60.0 |   70.0 |    65.0     | 70.0
   0.750   -> | 70.0 |  70.0 |   70.0 |    70.0     | 70.0
   0.800   -> | 80.0 |  70.0 |   80.0 |    75.0     | 80.0
   0.850   -> | 80.0 |  80.0 |   80.0 |    80.0     | 80.0
   0.900   -> | 90.0 |  80.0 |   90.0 |    85.0     | 90.0
   0.950   -> | 90.0 |  90.0 |   90.0 |    90.0     | 90.0
percentile -> |  Agg | floor | halfUp | interpolate | ceil

@lukaseder
Copy link
Member

Thanks a lot for your analysis, @rwperrott. I'll look into this next week.

@lukaseder lukaseder changed the title Agg.median looks wrong for even size sequences and other percentile calculations maybe wrong too. Add support for PERCENTILE_CONT semantics Jan 12, 2021
@lukaseder
Copy link
Member

lukaseder commented Jan 12, 2021

The current implementations implement the equivalent of SQL's PERCENTILE_DISC inverse distribution function, not PERCENTILE_CONT. This is probably not the correct default for MEDIAN, indeed, but for the other ones, it's correct by design and Javadoc.

Of course, we should offer PERCENTILE_CONT as well, as that is probably used more often in real world applications.

@rwperrott
Copy link
Author

rwperrott commented Jan 28, 2021

The function parameter in PercentileFunction, thus U, now look redundant because function is unlikely to be useful for percentileBy operations, so can be simplified to this:

public interface PercentileFunction<T> {
    T apply(T t0,
            T t1,
            double indexFraction); // mapper proved to be redundant!

    static <T> PercentileFunction<T> floor() {
        return (t0, t1, f) -> t0;
    }

    static <T> PercentileFunction<T> ceil() {
        return (t0, t1, f) -> t1;
    }

    static <T> PercentileFunction<T> halfUp() {
        return (t0, t1, f) -> f < 0.5d ? t0 : t1;
    }

    static PercentileFunction<Double> interpolateDouble() {
        return (t0, t1, f) -> t0 - (t0 * f) + (t1 * f);
    }
}

It is pointless collecting Tuple2<T,U> just to get a U result, so a list collection only needs to collect U, and be wrapped in collector which maps T to U.

And example classes (not tested):

// imports
public class Agg2 {
    public static <T, U> Collector<T, ?, Optional<T>>
    percentileBy(double percentile,
                 Function<? super T, ? extends U> mapper,
                 Comparator<? super U> comparator,
                 PercentileFunction<T> percentileFunction) {
        if (percentile < 0.0 || percentile > 1.0)
            throw new IllegalArgumentException("Percentile must be between 0.0 and 1.0");

        return Collector.of(
                (Supplier<ArrayList<Tuple2<T, U>>>) ArrayList::new,
                (l, v) -> l.add(tuple(v, mapper.apply(v))),
                (l1, l2) -> {
                    l1.addAll(l2);
                    return l1;
                },
                l -> {
                    final int size = l.size();

                    if (size == 0)
                        return Optional.empty();
                    else if (size == 1)
                        return Optional.of(l.get(0).v1);

                    l.sort(Comparator.comparing(t -> t.v2, comparator)); // Compare U

                    if (percentile == 0d)
                        return Optional.of(l.get(0).v1);
                    if (percentile == 1d)
                        return Optional.of(l.get(size - 1).v1);

                    // Limit fraction size, to stop common errors for double percentile values e.g. 2E-16.
                    // 0.5 is added because actual percentile value can be between values.
                    final double dIndex = ((double) Math.round(size * percentile * 1.0E6d) * 1.0E-6d) - 0.5d;
                    int index = (int) dIndex; // floor, for before or exact index
                    if (index >= size)
                        return Optional.of(l.get(size - 1).v1);

                    final Tuple2<T, U> t0 = l.get(index); // before or exact value
                    final double indexFraction = dIndex - index;
                    // If end or exact index, return t0 value.
                    if (++index == size || indexFraction == 0d)
                        return Optional.of(t0.v1);

                    final Tuple2<T, U> t1 = l.get(index); // after value
                    // Only call percentile function if t*.v1 values are different.
                    return (t0.v1.equals(t1.v1))
                           ? Optional.of(t0.v1)
                           : Optional.of(percentileFunction.apply(t0.v1, t1.v1, indexFraction));
                });
}
import static PercentileFunction.*; // for static PercentileFunction provider methods.

class Test {
    public void main(String[] args) {
            final Double[] values = {0d, 10d, 20d, 30d, 40d, 50d, 60d, 70d, 80d, 90d};

            final String header = "percentile -> |  Agg | floor | halfUp | interpolate | ceil";
            System.out.println(header);
            for (double p = 0d; p <= 1.00d; p += 0.05d) {
                System.out.printf("   %5.3f   -> | %4.1f |  %4.1f |   %4.1f |    %4.1f     | %4.1f%n",
                                  p,
                                 Seq.of(values).percentile(p).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), floor())).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), halfUp())).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), interpolateDouble())).orElseThrow(),
                                 Seq.of(values).collect(Agg2.percentile(p, Comparator.naturalOrder(), ceil())).orElseThrow());
            }
            System.out.println(header);
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants