JDistlib—Java Statistical Distribution Library

Description

A Java package that provides routines for various statistical distributions.

Main Features

  • Computation of the density (pdf), cumulative (cdf), quantile, and random variates of many popular statistical distributions, such as Ansari-Bradley, Beta, Binomial, Cauchy, Chi square, Exponential, Fisher's F, Gamma, Geometric, Hypergeometric, Kendall, Logistic, Log normal, Negative binomial, Noncentral beta, Noncentral chi square, Noncentral F, Noncentral T, Normal, Poisson, Sign rank, Spearman, Student's T, Tukey, Uniform, Weibull, Wilcoxon, and many more.
  • A manual translation based on R. (Updated and current as of R version 3.2.0 Alpha or 3.1.1). Since version 0.2.0, JDistlib has (mostly) passed the same battery of tests of standard R with the same level of accuracy, except those outlined in the Known Problems section (see below).
  • THREAD SAFE (See "Implementation details" below)
  • 100% Java. 100% compatible with JDK 5.0 or later.
  • Normality tests, such as: Kolmogorov-Smirnov, Anderson-Darling, Cramer-Von Mises, D'Agostino-Pearson, Jarque Bera, Kolmogorov-Lilliefors, Shapiro-Francia, Shapiro-Wilk.
  • Two-distribution tests, such as: Kolmogorov-Smirnov two-distribution test, Ansari-Bradley, Mood, Bartlett, Fligner, T-test (one-sample, paired, two-sample), Variance test, Wilcoxon test, Mann-Whitney-U test, Kruskal-Wallis test, Binomial test.
  • Research-oriented random number generators (RNG): Mersenne-Twister and Marsaglia's Complementary-multiply-with-carry 4096.
  • Computation of some useful mathematical functions, such as: Gamma, log gamma, poly gammas (di-, tri-, tetra-, pentagamma), incomplete beta, Stirling approximation function, Bessel functions first and second kind (J and Y) with fractional orders, Modified Bessel functions the first and the second kind (I and K) with fractional orders, etc.
  • Numerical optimization (find maxima / minima), find roots of a function numerically.
  • Basic vector-based math routines.

Download

Current release is version 0.3.5. (April 14, 2014) Go to the project page to download the package, to report bugs, or to discuss about this project.

Documentation

You can browse the rudimentary Javadoc, if you like.

License

GPL version 2.

What's new

Version 0.3.5 (Apr 14, 2014):

  • Synced with R-devel-2014-04-10 (effectively R 3.2.0 alpha or 3.1.1), fixing the following bugs:
    • pchisq(1e-5, 100, 1) == 0 due to underflow in dpois_raw. (PR#15635). Note that for cases like this, due to the underlying formula, you will need to use log space to compute the probability.
    • Calculation error in using function pbinom (PR#15734)

Note: Since JDistlib includes bleeding edge fixes, its output may be different than that of the recently released R. Test cases for such bleeding edge fixes have been (and will always be) incorporated, guaranteeing its accuracy.

Old news

Version 0.3.4 (Apr 7, 2014):

  • Synced with R-rc_2014-04-04_r65373, fixing the following bugs:
    • pbeta(x, a,b, log.p=TRUE) sometimes lost all precision for very small and very differently sized a,b. (PR#15641)
    • More precise Normal density when x > 5 (PR#15620)
    • Adding sinpi, cospi, and tanpi for more precise Bessel function and Cauchy distribution computations (PR#15529)
  • Fixed bug #16, infinite loop in sort functions when the numbers are all negatives (Thanks Gilad Wallach and Idan Peretz!).
  • Imported a lot of comments to sync with the latest R function.
  • Fixes comment on Bessel functions---Bessel functions can handle negatives already!

Version 0.3.3 (Jan 28, 2014):

  • Bessel functions first and second kind (J and Y). Modified Bessel functions, the first and the third kind (I and K), with fractional orders.
  • Added Beta Prime and Kumaraswamy distributions.
  • Added PolyGamma.lmvpsigammafn (log of multivariate psi-gamma function).

Version 0.3.2 (Jan 24, 2014):

  • Fixed bug in MathFunctions.lmvgammafn (see ticket #14). Thanks, anonymous!
  • Fixed bug in Spearman.quantile (off by 1 issue)
  • Fixed bug in binomial_test in DistributionTest
  • Fixed bug in Logarithmic distribution plus some speed up in Logarithmic.quantile
  • Added Bounded Arcsine, Laplace, and Zipf distributions
  • Added density functions for Spearman and Tukey distributions (using differentials; not precise!)
  • Added MathFunctions.sinc, gharmonic, lgharmonic, and sort for various data types
  • Added some incomplete solutions to bug PR#15635
  • Added batch calls for PolyGamma functions
  • Added Poisson test
  • Make MathFunctions.logspace_add and logspace_sub public
  • Removed redundant constants from Constants (M_PI_half, M_LN_2, kLog1OverSqrt2Pi)

Version 0.3.1 (Jan 13, 2014):

  • Added Spearman quantile (using bisection) and random variates (by inversion)
  • Added Order quantile variates (only very minimally tested; caveat emptor!)
  • Added Chi, Inverse Gamma, and Nakagami distributions (based on simple transform from the Gamma distribution)
  • Added many two-distribution tests: Ansari-Bradley, Mood, Bartlett, Fligner, T-test (one-sample, paired, two-sample), Variance test, Wilcoxon test, Mann-Whitney-U test, Kruskal-Wallis test, Binomial test
  • Added lower_tail flag to Ansari distribution
  • Utilities.rank is now index 1 based (not index 0) since many routines seem to depend on that fact
  • Various bug fixes

Version 0.3.0 (Jan 10, 2014):

  • Renamed: Removed the Q prefix of QRandomEngine, QMersenneTwister, QRandomCMWC, and QRandomSampler, for consistency
  • Added Beta binomial distribution (with parameterization of mu, sigma, and size)
  • Added hazard, cumulative hazard, survival, and inverse survival functions for all distributions (instance only)
  • Fixed bugs on Kolmogorov-Smirnov two-sample test when the second array (Y) is longer than the first array (X)
  • Fixed bugs for Binomial.cumulative when x < 0 or x >= n (improperly returns 0 or 1).
  • Updated to R-patched_2014-01-08_r64705 (R 3.1.0 alpha) that contains the following bug fixes:
    • dbeta(x, a, b) with a or b within a factor of 2 of the largest representable number could infinite-loop.
    • qcauchy(p, *) is now fully accurate even when p is very close to 1. (PR#15521)
    • In some extreme cases (more than 10^15) integer inputs to dpqrxxx functions might have been rounded up by one (with a warning about being non-integer). (PR#15624)

Version 0.2.1 (Jan 9, 2014):

  • Fixed crash on Poisson.random (and consequently NegBinomial.random) when mu >= 10
  • Fixed bugs on NonCentralF.random
  • Added codes from p-r-tests.R for further unit testing on random variates.

Version 0.2.0 (Jan 8, 2014):

  • Many bugfixes! As of version 0.2.0, JDistlib passed most of R's standard battery of tests for all basic (prepackaged) distributions (except for the 3 integration tests below). So far, the hardest hit is Noncentral chi square due to the inavailability of long double in Java (See ticket #9). Noncentral beta also suffers a bit (See ticket #13).
  • Most of d-p-q-r-tests.R has been added for unit testing (only 3 integration tests remain).
  • Deprecated GenericDistribution.random(QRandomEngine)
  • Moved: MathFunctions, Constants, and PolyGamma to jdistlib.math
  • Added an API to create multiple random variables
  • Added an API to query multiple values of density, cumulative, and quantile (instance only)
  • Fixed bugs on SignRank.quantile. Variable n was set incorrectly.
  • Fixed bugs on T.quantile(x, df, true, false) that causes NaN when df is close to 1 and x is very small
  • Fixed bugs on many distributions when x is close to the limit of double precision floating point
  • Remove false non-convergence warning messages in NonCentralT.cumulative
  • Fixed bugs on bd0 when np < 1e-306. This will fix the behavior of many distributions when x is very small
  • Fixed bugs on Poisson.random that caused the routine to hang up on certain random states (Ticket #7)
  • Fixed bugs on LogNormal when x <= 0
  • The precision of Gamma.cumulative is on par with R

Version 0.1.3 (Jan 2, 2014):

  • Fixed bugs on SignRank.cumulative. The variable n was set incorrectly.
  • Fixed bugs on Gamma.cumulative when the scale is +Inf.
  • Added some code from d-p-q-r-tests.R for unit testing.
  • Noted some precision loss on Gamma.cumulative
  • Noted some precision loss on NonCentralChiSquare
  • Fixed bugs on most distributions for boundary cases dealing with infinity
  • Converted project to Maven

Version 0.1.2 (Dec 26, 2013):

  • Added Rayleigh and Inverse Normal distributions
  • Bugfixes on Kendall distribution
  • Added Brent's optimization and root finding methods (for brute force quantile search)

Version 0.1.1 (Dec 20, 2013):

  • Order (no quantile) and Extreme (both maxima and minima) distributions for order statistics (from EVD package)
  • Added Box-Muller method to generate random normals
  • Added RandomSampler ripped from Colt. Handy for creating a permutation of list of objects.
  • The sources should be 100% compatible with JDK 1.5.

Version 0.1.0 (Dec 19, 2013):

  • Distributions are now instantiable

Version 0.0.9 (Dec 17, 2013):

  • Proper fix for negative binomial distribution with size=0 (PR#15268)
  • Synced with R version 3.0.2

Version 0.0.8 (Dec 17, 2013):

  • Fix for bug #6. Thanks, Roland Ewald!

Version 0.0.7 (Mar 29, 2013):

  • Proper fix for pt / pf distribution (PR#15162)

Version 0.0.6 (Jan 11, 2013):

Further R synchronization fixes the following bugs / adds the following features:
  • qgeom() could return -1 for extremely small q. (PR#14967)
  • lgamma(x) for very small x (in the denormalized range) is no longer Inf with a warning.
  • plogis(x, lower = FALSE, log.p = TRUE) no longer underflows early for large x (e.g. 800).
  • Imported the simplified logic for T.quantile from R
  • Added multivariate gamma function (MathFunctions.lmvgammafn)
  • Added Wishart distribution sampling (random only)

Version 0.0.5 (Jan 09, 2013):

  • Synchronized with R's patched of the same date. Fixes the following bugs (taken from R's change log):
    • qt(1e-12, 1.2) no longer gives NaN.
    • dt(1e160, 1.2, log=TRUE) no longer gives -Inf.
    • beta(a, b) could overflow to infinity in its calculations when one of a and b was less than one. (PR#15075)
    • lbeta(a, b) no longer gives NaN if a or b is very small (in the denormalized range).

Version 0.0.4 (Jan 09, 2013):

  • Fix for pt / pf distribution. (PR#15162)
  • Added Fretchet, GEV, Generalized Pareto, Gumbel, and Reverse Weibull distributions
Distribution feature matrix
DistributionDensityCumulativeQuantileRandomSource
AnsariXXX0.0.4Base R (src/library/stats/src/ansari.c)
Random variate is created
by inversion method
Arcsine (bounded)0.3.20.3.20.3.20.3.2Coded from scratch
BetaXXXXBase R (nmath)
Beta Binomial0.3.00.3.00.3.00.3.0gamlss.dist package
Beta Prime0.3.30.3.30.3.30.3.3Coded from scratch
Also known as: Beta distribution
of the second kind
BinomialXXXXBase R (nmath)
CauchyXXXXBase R (nmath)
Chi0.3.10.3.10.3.10.3.1Simple transformation
from Gamma distribution
Chi squareXXXXBase R (nmath)
ExponentialXXXXBase R (nmath)
Extreme0.1.10.1.10.1.10.1.1EVD package
Fisher's FXXXXBase R (nmath)
Fretchet0.0.40.0.40.0.40.0.4EVD package
GammaXXXXBase R (nmath)
Generalized Pareto0.0.40.0.40.0.40.0.4EVD package
GeometricXXXXBase R (nmath)
Generalized Extreme Value (GEV)0.0.40.0.40.0.40.0.4EVD package
Gumbel0.0.40.0.40.0.40.0.4EVD package
HypergeometricXXXXBase R (nmath)
Inverse Gamma0.3.10.3.10.3.10.3.1Simple transformation
from Gamma distribution
Inverse Normal0.1.20.1.20.1.20.1.2gamlss.dist package
KendallXX0.0.30.0.3Base R (src/library/stats/src/kendall.c)
Quantile is taken from
SuppDists package
Random variate is created
by inversion method
Kumaraswamy0.3.30.3.30.3.30.3.3Coded from scratch
Laplace0.3.20.3.20.3.20.3.2Taken from VGAM
Logarithm0.0.30.0.30.0.30.0.3gamlss.dist package
LogisticXXXXBase R (nmath)
Log normalXXXXBase R (nmath)
Nakagami0.3.10.3.10.3.10.3.1Simple transformation
from Gamma distribution
Negative binomialXXXXBase R (nmath)
Noncentral betaXXXXBase R (nmath)
Noncentral chi squareXXXXBase R (nmath)
Noncentral FXXXXBase R (nmath)
Noncentral TXXXXBase R (nmath)
NormalXXXXBase R (nmath)
Order0.1.10.1.10.3.10.1.1EVD package
Quantile: Numerical optimization method
PoissonXXXXBase R (nmath)
Rayleigh0.1.20.1.20.1.20.1.2VGAM package
Reverse Weibull0.0.40.0.40.0.40.0.4EVD package
Sign rankXXXXBase R (nmath)
Spearman0.3.2X0.3.10.3.1pspearman package
with fallback to base R
(src/library/stats/src/prho.c)
for n > 22
Density: Differential method
Quantile: Bisection method
Random: Inversion method
Student's TXXXXBase R (nmath)
Tukey0.3.2XX0.0.3Base R (nmath)
Random variate is created
by inversion method.
Density: Differential method
This is for Studentized range only!
UniformXXXXBase R (nmath)
WeibullXXXXBase R (nmath)
WilcoxonXXXXBase R (nmath)
Wishart 0.0.6See Javadoc
Zipf0.3.20.3.20.3.20.3.2Coded from scratch,
with parts taken from VGAM
Implementation details

These routines are implemented as static final functions for ease of use, except for Sign Rank and Wilcoxon distributions. Sign Rank and Wilcoxon distributions are implemented as dynamic classes since they require a storage matrix that is dependent on the supplied parameters.

The primary change I did with the source code is I made the routines thread safe, especially for the RNG routine for each distribution. In R, the RNGs of several distributions require some global state variables, which hinders the implementation of thread safety. I think this is why R is not a multi-threaded program. Multi-threaded libraries in R, such as multicore, got around this by forking processes (i.e., copy the entire memory used by R for each cores). This results in a huge memory waste since multi-threaded R program will consume k times more than it ought to be, where k is the number of cores being used. I got around this by implementing some structures to afford state storage or eliminating the requirement altogether. For Beta and Gamma RNG routines, the time saved by storing the states is meager. So, I eliminated the states for these distributions. In Binomial, Hypergeometric, and Poisson distributions, I implemented the state storage as an inner class that has to be instantiated upon use. Since Sign Rank and Wilcoxon distributions are implemented as dynamic classes, storing the states is as simple as declaring the states as fields.

There was an earlier attempt called distlib, which was based on an earlier version of R. However, the library currently suffers a few shortcomings:

  1. It was based on an earlier version of R. The newer distribution library has been updated to improve accuracy. Gamma distribution, in particular, has been significantly improved. Since many other key distributions use routines in the Gamma distribution, their accuracy is also markedly improved, especially in the extreme lower tail.
  2. Distlib is buggy and cannot even compile presently.
  3. Distlib is not thread safe due to the global states required by some RNG routines.
  4. Distlib was a result of an automatic translation. The resulting code is very messy.

This is why I decided to do the translation over.

Why not Colt? Colt is a very fast statistical library. However, it does not contain as many statistical distribution as standard R code. I also found that Colt is less accurate at computing the probability values at the extreme tail of the distributions. Note that Colt's precision is up to 6 digits at the very extreme tail and that R's functions have been calibrated and are more accurate than Colt. See Bangalore, et al. paper. Relevant quote:

Amongst the Java libraries for the Chi-square distribution, SSJ performs well until around x=100; because SSJ provides a direct (not calculated by complementation) upper tail CDF, this inaccuracy in more extreme tails is likely due to poor algorithm performance Pierre (2007). The JMSL, Jakarta Math, SOCR and JSci libraries perform comparatively poorly, with accuracy steadily decreasing from about x=50 and dropping to zero at x=66 and x=71. This poor performance is likely due to complementation error as none of these libraries provide a direct CDF for the upper tail area, which had to be obtained by complementation (SourceForge.net, 2007, The Apache Software Foundation, 2007, University of California, Los Angeles, 2007 and Visual Numerics, 2007). The Colt library does very well, with only minor inaccuracies across the entire range of x values; it is unclear why the accuracy is not perfect, but this may be due to an implementation bug as the choice of algorithm appears sound (COLT, 2007). The LRE curves for the GSL library and the R statistical language are not plotted because the CDF tail areas were perfectly accurate with respect to the exact values from DCDFLIB; both GSL and R provide separate functions to directly compute both tail areas and the algorithms implemented in these functions are accurate across the entire range of x values we tested (GNU Scientific Library, 2007 and R Development Core Team, 2008).

Known problems
  1. Java does not have "long double". Hence I changed every occurrence of "long double" into "double". This happens in Hypergeometric, Noncentral beta, Noncentral chi square (and, by implication, Noncentral F), Noncentral T, and Tukey distributions. Possible ramifications: loss of precision in these distributions. See the "TODO long double" tag in each of the file. Update since version 0.2.0: So far, the hardest hit is Noncentral chi square (See ticket #9). Noncentral beta also suffers a bit (See ticket #13). Hypergeometric, Noncentral T, and Tukey seem to be fine. Noncentral F is most likely also affected because it depends on Noncentral chi square when df2 is huge (>1e8), but no test cases so far.
  2. R authors noted a precision problem in the quantile routine of the Hypergeometric distribution and have not fixed it. The problem is most pronounced at the very extreme tail of the distribution. I translated the file as such. So, the resulting translation will also suffer from the same problem. In addition to that, further precision loss should be expected due to the "long double" problem above.
  3. R is by no means free of bug. Any bugs in relevant R modules will show up in JDistlib too. I put reasonable effort to monitor relevant R bugs. See tickets for details.
  4. I did not translate unused RNG routines in the normal distribution. R's current standard is by inversion. I did, however, translate the Ahrens-Dieter and Kinderman-Ramage methods as an option. I did not translate the Box-Muller found in the R source code because it is not as good as the others and it requires global state storage. As of version 0.1.1, all RNG routines in base R package are implemented except for those that are buggy and are provided only for compatibility purposes.
  5. I did only minimal testing. So, caveat emptor. As of version 0.2.0, JDistlib passed most of R's standard battery of tests for all basic (prepackaged) distributions (except for the long double-related accuracy issues above). The only tests left untranslated are three integration tests for the Noncentral Beta. I am working on it, but it is for a long term solution.
Maintainer information

This project was initiated by Roby Joehanes in February 2012. JDistlib is currently actively maintained.