
MDArray implements a set of descriptive statistics methods for Ruby, making it comparable to many of the other statistics libraries available for Ruby. These methods are imported from Parallel Colt with little wrapping overhead, so they should be quite efficient.

  • Statistics functions:

    • auto_correlation: Returns the auto-correlation of a data sequence;
    • correlation: Returns the correlation of two data sequences. That is covariance(data1,data2)/(standardDev1*standardDev2);
    • covariance: Returns the covariance of two data sequences, which is cov(x,y) = (1/(size()-1)) * Sum((x[i]-mean(x)) * (y[i]-mean(y))) . See the math definition;
    • durbin_watson: Durbin-Watson computation;
    • frequencies: Computes the frequency (number of occurrences, count) of each distinct value in the given sorted data. After this call returns, both distinctValues and frequencies have a new size (equal for both), which is the number of distinct values in the sorted data. Distinct values are filled into distinctValues starting at index 0, and the frequency of each distinct value is filled into frequencies starting at index 0. As a result, the smallest distinct value (and its frequency) can be found at index 0, the second smallest at index 1, ..., and the largest distinct value (and its frequency) at index distinctValues.size()-1. Example: elements = (5,6,6,7,8,8) --> distinctValues = (5,6,7,8), frequencies = (1,2,1,2) (see the short sketch after this list);
    • geometric_mean: Returns the geometric mean of a data sequence. Note that for a geometric mean to be meaningful, the minimum of the data sequence must not be less than or equal to zero. The geometric mean is given by pow( Product( data[i] ), 1/size ), which is equivalent to Math.exp( Sum( Log(data[i]) ) / size );
    • harmonic_mean: Returns the harmonic mean of a data sequence.
    • kurtosis: Returns the kurtosis (aka excess) of a data sequence.
    • lag1: Returns the lag-1 autocorrelation of a data sequence. Note that this method has different semantics from auto_correlation(1);
    • max: Returns the largest member of a data sequence.
    • mean: Returns the arithmetic mean of a data sequence; That is Sum( data[i] ) / data.size().
    • mean_deviation: Returns the mean deviation of the MDArray. That is Sum( Math.abs(data[i]-mean) ) / data.size().
    • median: Returns the median of the MDArray;
    • min: Returns the smallest member of the MDArray;
    • moment: Returns the moment of k-th order with constant c of MDArray, which is Sum( (data[i]-c)^k ) / data.size().
    • pooled_mean: Returns the pooled mean of two MDArrays. That is (size1 * mean1 + size2 * mean2) / (size1 + size2).
    • pooled_variance: Returns the pooled variance of two MDArrays. That is (size1 * variance1 + size2 * variance2) / (size1 + size2);
    • product: Returns the product, which is Prod( data[i] ). In other words: data[0]*data[1]*...*data[data.size()-1]. This method uses the equivalent definition: prod = pow( exp( Sum( Log(x[i]) ) / size() ), size() ).
    • quantile: Returns the phi-quantile; that is, an element elem such that phi percent of the data elements are less than elem. The quantile need not be contained in the MDArray; it can be a linear interpolation.
    • quantile_inverse: Returns the percentage of the elements contained in the receiver that are <= element. Does linear interpolation if the element is not contained but lies between two contained elements.
    • quantiles: Returns the quantiles of the specified percentages. The quantiles need not be contained in the MDArray; they can be linear interpolations.
    • rank_interpolated: Returns the linearly interpolated number of elements in a list that are less than or equal to a given element. The rank is the number of elements <= element. Ranks are of the form {0, 1, 2, ..., sortedList.size()}. If no element is <= element, then the rank is zero. If the element lies between two contained elements, linear interpolation is used and a non-integer value is returned.
    • rms: Returns the RMS (Root-Mean-Square) of a data sequence. That is Math.sqrt( Sum( data[i]*data[i] ) / data.size() ). The RMS of a data sequence is the square root of the mean of the squares of the elements in the MDArray. It is a measure of the average "size" of the elements of the MDArray.
    • sample_kurtosis: Returns the sample kurtosis (aka excess) of MDArray. Ref: R.R. Sokal, F.J. Rohlf, Biometry: the principles and practice of statistics in biological research (W.H. Freeman and Company, New York, 1998, 3rd edition) p. 114-115.
    • sample_kurtosis_standard_error: Return the standard error of the sample kurtosis. Ref: R.R. Sokal, F.J. Rohlf, Biometry: the principles and practice of statistics in biological research (W.H. Freeman and Company, New York, 1998, 3rd edition) p. 138.
    • sample_skew: Returns the sample skew of MDArray. Ref: R.R. Sokal, F.J. Rohlf, Biometry: the principles and practice of statistics in biological research (W.H. Freeman and Company, New York, 1998, 3rd edition) p. 114-115.
    • sample_skew_standard_error: Return the standard error of the sample skew. Ref: R.R. Sokal, F.J. Rohlf, Biometry: the principles and practice of statistics in biological research (W.H. Freeman and Company, New York, 1998, 3rd edition) p. 138.
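
A minimal usage sketch follows, assuming the usual `MDArray.double(shape, data)` constructor (the CSV-reading form appears in the full example below); it reproduces the frequencies example from the list on a small one-dimensional array:

```ruby
require 'mdarray'

# small 1-D array holding the values from the frequencies example above
data = MDArray.double([6], [5.0, 6.0, 6.0, 7.0, 8.0, 8.0])

# statistics must be initialized before any of the methods above is called
data.reset_statistics

p data.mean                            # arithmetic mean, 40/6 = 6.666...
p data.geometric_mean                  # pow( Product( data[i] ), 1/size )
p data.quantile(0.5)                   # phi-quantile; may be interpolated
p data.frequencies[:distinct_values]   # expected: 5.0, 6.0, 7.0, 8.0
p data.frequencies[:frequencies]       # expected: 1, 2, 1, 2
```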

Using the statistics methods

```ruby
#-------------------------------------------------------------------------------------
#
#-------------------------------------------------------------------------------------

should "do stats operations" do

  # read file VALE3.  This file has a header that we need to discard.  VALE3
  # contains quote values from the Brazilian company VALE as obtained from Yahoo
  # Finance (quote vale3.SA)
  vale3 = MDArray.double("#{$COLT_TEST_DIR}/VALE3_short.csv", true)

  # in order to use statistics we need to call reset_statistics on array vale3.  This
  # breaks version 0.4.3 statistics, which did not require a call to reset_statistics.
  # In a future version both forms will be supported, with and without requiring
  # reset_statistics.
  vale3.reset_statistics

  # sum all values of vale3.  This does not make sense from a financial point of view,
  # since it sums all values, including dates...  We are not doing anything with the
  # result, just checking that it does not crash.
  vale3.sum

  # let's get only the open price for the whole period.  We slice vale3 on the
  # second dimension and take the second column.
  open = vale3.slice(1,1)
  
  # let's also get the high value.
  high = vale3.slice(1,2)

  # weights to be used for weighted operations
  weights = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.0, 15, 16, 17, 18, 19]

  # splitters to be used to split the list
  splitters = [34.0, 36.0]

  # quantiles to be used to split the list
  percs = [0.20, 0.40, 0.60, 0.80, 1]

  # getting descriptive statistics for the open value.  open is a new MDArray, so
  # we need to call reset_statistics on open as well.
  open.reset_statistics

  assert_equal(-0.30204751376121775, open.auto_correlation(10))
  assert_equal(0.8854362245369992, open.correlation(high))
  assert_equal(1.4367963988919659, open.covariance(high))
  assert_equal(0.00079607686762408, open.durbin_watson)
  assert_equal(33.837262944797345, open.geometric_mean)
  assert_equal(33.81400448777291, open.harmonic_mean)
  assert_equal(-0.925644222523478, open.kurtosis)
  assert_equal(0.681656774667894, open.lag1)
  assert_equal(36.43, open.max)
  assert_equal(33.86052631578948,open.mean)
  assert_equal(1.0889750692520779,open.mean_deviation)
  assert_equal(33.74,open.median)
  assert_equal(31.7,open.min)
  assert_equal(0.07736522466830013,open.moment3)
  assert_equal(5.147382269264963,open.moment4)
  assert_equal(34.17368421052632,open.pooled_mean(high))
  assert_equal(1.623413296398882,open.pooled_variance(high))
  assert_equal(1.1442193777839571e+29,open.product)
  assert_equal(32.498000000000005,open.quantile(0.2))
  assert_equal(0.8421052631578947,open.quantile_inverse(35.0))
  assert_equal(5.903846153846159,open.rank_interpolated(33.0))
  assert_equal(33.88377930514836,open.rms)
  assert_equal(-0.8280585298104861,open.sample_kurtosis)
  assert_equal(1.5166184210526306,open.sample_covariance(high))
  assert_equal(1.0142698435367294,open.sample_kurtosis_standard_error)
  assert_equal(0.042567930807996486,open.sample_skew)
  assert_equal(0.5237666950104207,open.sample_skew_standard_error)
  assert_equal(1.3075102994156926,open.sample_standard_deviation)
  assert_equal(1.6627719298244807,open.sample_variance)
  assert_equal(1.2035654385963137,open.sample_weighted_variance(weights))
  assert_equal(0.039130771304858564,open.skew)
  assert_equal(1.255092672964214,open.standard_deviation)
  assert_equal(0.28793800664365016,open.standard_error)
  assert_equal(643.35,open.sum)
  assert_equal(0.561897364355954,open.sum_of_inversions)
  assert_equal(66.90969033519778,open.sum_of_logarithms)
  assert_equal(29.92989473684211,open.sum_of_power_deviations(2, open.mean))
  assert_equal(740665.2440910001,open.sum_of_powers(3))
  assert_equal(21814.099500000004,open.sum_of_squares)
  assert_equal(28.354637119112198,open.sum_of_squared_deviations)
  assert_equal(33.862, open.trimmed_mean(2, 2))
  assert_equal(1.5752576177284554,open.variance)
  assert_equal(34.31689473684211,open.weighted_mean(weights))
  assert_equal(0.029110571117388302,open.weighted_rms(weights))
  assert_equal(33.816315789473684,open.winsorized_mean(1, 1))

  p "Distinct values: #{open.frequencies[:distinct_values]}"
  p "Frequencies: #{open.frequencies[:frequencies]}"
  p "Split: #{open.split(splitters)}"
  p "Quantiles: #{open.quantiles(percs)}"
  p "Sorted elements: #{open.sort}"
  p "Standardized elements: #{open.standardize}"

end
```