A Java package that provides routines for various statistical distributions.
- Computation of the density (pdf), cumulative (cdf), quantile, and random variates of many popular statistical distributions, such as Ansari-Bradley, Beta, Binomial, Cauchy, Chi square, Exponential, Fisher's F, Gamma, Geometric, Hypergeometric, Kendall, Logistic, Log normal, Negative binomial, Noncentral beta, Noncentral chi square, Noncentral F, Noncentral T, Normal, Poisson, Sign rank, Spearman, Student's T, Tukey, Uniform, Weibull, Wilcoxon, and many more.
- A manual translation based on R. (Updated and current as of R version 3.3.2; R-devel October 9, 2016). Since version 0.2.0, JDistlib has (mostly) passed the same battery of tests of standard R with the same level of accuracy, except those outlined in the Known Problems section (see below).
- THREAD SAFE (See "Implementation details" below), except for implementations of MersenneTwister (see ticket #24).
- 100% Java. 100% compatible with JDK 5.0 or later.
- Normality tests, such as: Kolmogorov-Smirnov, Anderson-Darling, Cramer-Von Mises, D'Agostino-Pearson, Jarque Bera, Kolmogorov-Lilliefors, Shapiro-Francia, Shapiro-Wilk.
- Two-distribution tests, such as: Kolmogorov-Smirnov two-distribution test, Ansari-Bradley, Mood, Bartlett, Fligner, T-test (one-sample, paired, two-sample), Variance test, Wilcoxon test, Mann-Whitney-U test, Kruskal-Wallis test, Binomial test.
- Research-oriented random number generators (RNG): Mersenne-Twister, Marsaglia's Complementary-multiply-with-carry 4096, and WELL 44497b.
- Computation of some useful mathematical functions, such as: Gamma, log gamma, poly gammas (di-, tri-, tetra-, pentagamma), incomplete beta, Stirling approximation function, Bessel functions first and second kind (J and Y) with fractional orders, Modified Bessel functions the first and the second kind (I and K) with fractional orders, etc.
- Numerical optimization (find maxima / minima), BOBYQA (multivariate numerical optimization), find roots of a function numerically.
- Basic vector-based math routines.
Current release is version 0.4.5. (October 10, 2016) Go to the project page to download the package, to report bugs, or to discuss about this project.
You can browse the rudimentary Javadoc, if you like.
Note: Daniel King of Broad Institute has kindly provided a copy of JDistlib repository in Maven Central.
Note: Thanks to MichaeL, who alerted me that v0.4.5 binary file contains source code (Bug #33). I have since reuploaded the binary for v0.4.5 (October 21, 2016). Apologies the inconvenience.
Version 0.4.5 (October 10, 2016):
- Bug fix for #31 (PR#16972) on Poisson quantile (and related limit bugs)
- Added Median Absolute Deviation (MAD) to jdistlib.math.VectorMath
- Responded to bug #29 by adding an option to sort for Shapiro-Wilk test. Thank you anonymous bug reporter!
- Sync with R-3.3.1, fixes rgamma(1,Inf) or rgamma(1, 0,0) no longer give NaN but the correct limit.
- Added Levy distribution
- Sync with Development build of October 9, 2016
- Fixes: rbeta(4, NA) and similarly rgamma() and rnbinom() now return NaN's with a warning, as other r<dist>(), and as documented. (PR#17155)
Note: Since JDistlib includes bleeding edge fixes, its output may be different than that of the recently released R. Test cases for such bleeding edge fixes have been (and will always be) incorporated, guaranteeing its accuracy. Note also that JDistlib may still have bugs on its own. Your bug report is highly appreciated
Version 0.4.4 (April 19, 2016):
- Fix for bug #28: Bug on Shapiro-Wilk p-value.
Version 0.4.3 (April 19, 2016):
Version 0.4.2 (April 19, 2016):
- Fix for bug #26: Allow warnings to be cast as an exception.
- Sync with R-devel (Apr 18, 2016-r70508); 3.3.0 alpha
- Fix for bug PR#16521: rchisq(*, df=0, ncp=0) now returns 0 instead of NaN, and dchisq(*, df=0, ncp=*) also no longer returns NaN in limit cases (where the limit is unique)
- Fix for pchisq(*, df=0, ncp > 0, log.p=TRUE) no longer underflows (for ncp > ~60).
- Fix for rhyper(nn, m, n, k) no longer returns NA when one of the three parameters exceeds the maximal integer.
- Fix for bug PR#16727: [dpqr]nbinom(..., size = Inf) should behave like [dpqr]pois(...)
- Added WELL 44497b random number generator
- Added BOBYQA (multivariate numerical optimization)
Version 0.4.1 (September 15, 2015):
- Synced with R version 3.2.2
- Fix for bug PR#16475: qt(*, df=Inf, ncp=.) now uses the natural qnorm() limit instead of returning NaN.
- Fix for bug PR#16489: rhyper(nn, <large>) now works correctly.
Version 0.4.0 (May 6, 2015):
- Quick update: Fixed a typo on Beta.density, which caused failures in regression tests.
Version 0.3.9 (May 5, 2015):
- Sync with the development branch of R 3.2.x (dated April 24, 2015, which is also current for May 4, 2015 with respect to distribution computation)
- Fix for the second half of PR#15554 regarding Bessel.J and Bessel.Y with huge alpha (≥ 261)
- Fix for bug #17 regarding the accuracy of Beta.quantile (PR#15755)
- Added MathFunctions.logspace_sum (per 3.2.x API feature)
- Fix for bug #22 for inadvertent use of Java 1.8 API (Double.isFinite). It has been replaced with !Double.isInfinite (Java 1.5-compatible API)
Version 0.3.8 (Dec 15, 2014):
- Fix for bug #20 regarding Kolmogorov-Smirnov (KS) test. Thanks, Gilad Wallach and Eran Avidan! Note: this bug was introduced in v0.3.6, with the fix of bug #18
Version 0.3.7 (Dec 10, 2014):
- Synced with R release version 3.1.2.
- Fix for bug #19
- Added option to allow inexact KS p-value computation method, if needed. Default option is still exact method. See bug #19 entry for details.
- Fixed integer overflow bug when computing KS exact method---only happen with big data sets.
Version 0.3.6 (Aug 18, 2014):
- Synced with R release version 3.1.1.
- Fix for bug #18
- Added generalized one-distribution Kolmogorov-Smirnov test
- kolmogorov_smirnov_statistic and kolmogorov_smirnov_pvalue are deprecated in favor of kolmogorov_smirnov_test
- Synced MersenneTwister with Sean Luke's version 20
- Incorporated Bintray / Gradle build system, courtesy Schalk W. Cronjé.
Version 0.3.5 (Apr 14, 2014):
- Synced with R-devel-2014-04-10 (effectively R 3.2.0 alpha or 3.1.1), fixing the following bugs:
Version 0.3.4 (Apr 7, 2014):
- Synced with R-rc_2014-04-04_r65373, fixing the following bugs:
- pbeta(x, a,b, log.p=TRUE) sometimes lost all precision for very small and very differently sized a,b. (PR#15641)
- More precise Normal density when x > 5 (PR#15620)
- Adding sinpi, cospi, and tanpi for more precise Bessel function and Cauchy distribution computations (PR#15529)
- Fixed bug #16, infinite loop in sort functions when the numbers are all negatives (Thanks Gilad Wallach and Idan Peretz!).
- Imported a lot of comments to sync with the latest R function.
- Fixes comment on Bessel functions---Bessel functions can handle negatives already!
Version 0.3.3 (Jan 28, 2014):
- Bessel functions first and second kind (J and Y). Modified Bessel functions, the first and the third kind (I and K), with fractional orders.
- Added Beta Prime and Kumaraswamy distributions.
- Added PolyGamma.lmvpsigammafn (log of multivariate psi-gamma function).
Version 0.3.2 (Jan 24, 2014):
- Fixed bug in MathFunctions.lmvgammafn (see ticket #14). Thanks, anonymous!
- Fixed bug in Spearman.quantile (off by 1 issue)
- Fixed bug in binomial_test in DistributionTest
- Fixed bug in Logarithmic distribution plus some speed up in Logarithmic.quantile
- Added Bounded Arcsine, Laplace, and Zipf distributions
- Added density functions for Spearman and Tukey distributions (using differentials; not precise!)
- Added MathFunctions.sinc, gharmonic, lgharmonic, and sort for various data types
- Added some incomplete solutions to bug PR#15635
- Added batch calls for PolyGamma functions
- Added Poisson test
- Make MathFunctions.logspace_add and logspace_sub public
- Removed redundant constants from Constants (M_PI_half, M_LN_2, kLog1OverSqrt2Pi)
Version 0.3.1 (Jan 13, 2014):
- Added Spearman quantile (using bisection) and random variates (by inversion)
- Added Order quantile variates (only very minimally tested; caveat emptor!)
- Added Chi, Inverse Gamma, and Nakagami distributions (based on simple transform from the Gamma distribution)
- Added many two-distribution tests: Ansari-Bradley, Mood, Bartlett, Fligner, T-test (one-sample, paired, two-sample), Variance test, Wilcoxon test, Mann-Whitney-U test, Kruskal-Wallis test, Binomial test
- Added lower_tail flag to Ansari distribution
- Utilities.rank is now index 1 based (not index 0) since many routines seem to depend on that fact
- Various bug fixes
Version 0.3.0 (Jan 10, 2014):
- Renamed: Removed the Q prefix of QRandomEngine, QMersenneTwister, QRandomCMWC, and QRandomSampler, for consistency
- Added Beta binomial distribution (with parameterization of mu, sigma, and size)
- Added hazard, cumulative hazard, survival, and inverse survival functions for all distributions (instance only)
- Fixed bugs on Kolmogorov-Smirnov two-sample test when the second array (Y) is longer than the first array (X)
- Fixed bugs for Binomial.cumulative when x < 0 or x >= n (improperly returns 0 or 1).
- Updated to R-patched_2014-01-08_r64705 (R 3.1.0 alpha) that contains the following bug fixes:
- dbeta(x, a, b) with a or b within a factor of 2 of the largest representable number could infinite-loop.
- qcauchy(p, *) is now fully accurate even when p is very close to 1. (PR#15521)
- In some extreme cases (more than 10^15) integer inputs to dpqrxxx functions might have been rounded up by one (with a warning about being non-integer). (PR#15624)
Version 0.2.1 (Jan 9, 2014):
- Fixed crash on Poisson.random (and consequently NegBinomial.random) when mu >= 10
- Fixed bugs on NonCentralF.random
- Added codes from p-r-tests.R for further unit testing on random variates.
Version 0.2.0 (Jan 8, 2014):
- Many bugfixes! As of version 0.2.0, JDistlib passed most of R's standard battery of tests for all basic (prepackaged) distributions (except for the 3 integration tests below). So far, the hardest hit is Noncentral chi square due to the inavailability of long double in Java (See ticket #9). Noncentral beta also suffers a bit (See ticket #13).
- Most of d-p-q-r-tests.R has been added for unit testing (only 3 integration tests remain).
- Deprecated GenericDistribution.random(QRandomEngine)
- Moved: MathFunctions, Constants, and PolyGamma to jdistlib.math
- Added an API to create multiple random variables
- Added an API to query multiple values of density, cumulative, and quantile (instance only)
- Fixed bugs on SignRank.quantile. Variable n was set incorrectly.
- Fixed bugs on T.quantile(x, df, true, false) that causes NaN when df is close to 1 and x is very small
- Fixed bugs on many distributions when x is close to the limit of double precision floating point
- Remove false non-convergence warning messages in NonCentralT.cumulative
- Fixed bugs on bd0 when np < 1e-306. This will fix the behavior of many distributions when x is very small
- Fixed bugs on Poisson.random that caused the routine to hang up on certain random states (Ticket #7)
- Fixed bugs on LogNormal when x <= 0
- The precision of Gamma.cumulative is on par with R
Version 0.1.3 (Jan 2, 2014):
- Fixed bugs on SignRank.cumulative. The variable n was set incorrectly.
- Fixed bugs on Gamma.cumulative when the scale is +Inf.
- Added some code from d-p-q-r-tests.R for unit testing.
- Noted some precision loss on Gamma.cumulative
- Noted some precision loss on NonCentralChiSquare
- Fixed bugs on most distributions for boundary cases dealing with infinity
- Converted project to Maven
Version 0.1.2 (Dec 26, 2013):
- Added Rayleigh and Inverse Normal distributions
- Bugfixes on Kendall distribution
- Added Brent's optimization and root finding methods (for brute force quantile search)
Version 0.1.1 (Dec 20, 2013):
- Order (no quantile) and Extreme (both maxima and minima) distributions for order statistics (from EVD package)
- Added Box-Muller method to generate random normals
- Added RandomSampler ripped from Colt. Handy for creating a permutation of list of objects.
- The sources should be 100% compatible with JDK 1.5.
Version 0.1.0 (Dec 19, 2013):
- Distributions are now instantiable
Version 0.0.9 (Dec 17, 2013):
- Proper fix for negative binomial distribution with size=0 (PR#15268)
- Synced with R version 3.0.2
Version 0.0.8 (Dec 17, 2013):
- Fix for bug #6. Thanks, Roland Ewald!
Version 0.0.7 (Mar 29, 2013):
- Proper fix for pt / pf distribution (PR#15162)
Version 0.0.6 (Jan 11, 2013):Further R synchronization fixes the following bugs / adds the following features:
- qgeom() could return -1 for extremely small q. (PR#14967)
- lgamma(x) for very small x (in the denormalized range) is no longer Inf with a warning.
- plogis(x, lower = FALSE, log.p = TRUE) no longer underflows early for large x (e.g. 800).
- Imported the simplified logic for T.quantile from R
- Added multivariate gamma function (MathFunctions.lmvgammafn)
- Added Wishart distribution sampling (random only)
Version 0.0.5 (Jan 09, 2013):
- Synchronized with R's patched of the same date. Fixes the following bugs (taken from R's change log):
- qt(1e-12, 1.2) no longer gives NaN.
- dt(1e160, 1.2, log=TRUE) no longer gives -Inf.
- beta(a, b) could overflow to infinity in its calculations when one of a and b was less than one. (PR#15075)
- lbeta(a, b) no longer gives NaN if a or b is very small (in the denormalized range).
Version 0.0.4 (Jan 09, 2013):
- Fix for pt / pf distribution. (PR#15162)
- Added Fretchet, GEV, Generalized Pareto, Gumbel, and Reverse Weibull distributions
|Ansari||X||X||X||0.0.4||Base R (src/library/stats/src/ansari.c)
Random variate is created
by inversion method
|Arcsine (bounded)||0.3.2||0.3.2||0.3.2||0.3.2||Coded from scratch|
|Beta||X||X||X||X||Base R (nmath)|
|Beta Binomial||0.3.0||0.3.0||0.3.0||0.3.0||gamlss.dist package|
|Beta Prime||0.3.3||0.3.3||0.3.3||0.3.3||Coded from scratch|
Also known as: Beta distribution
of the second kind
|Binomial||X||X||X||X||Base R (nmath)|
|Cauchy||X||X||X||X||Base R (nmath)|
from Gamma distribution
|Chi square||X||X||X||X||Base R (nmath)|
|Exponential||X||X||X||X||Base R (nmath)|
|Fisher's F||X||X||X||X||Base R (nmath)|
|Gamma||X||X||X||X||Base R (nmath)|
|Generalized Pareto||0.0.4||0.0.4||0.0.4||0.0.4||EVD package|
|Geometric||X||X||X||X||Base R (nmath)|
|Generalized Extreme Value (GEV)||0.0.4||0.0.4||0.0.4||0.0.4||EVD package|
|Hypergeometric||X||X||X||X||Base R (nmath)|
|Inverse Gamma||0.3.1||0.3.1||0.3.1||0.3.1||Simple transformation|
from Gamma distribution
|Inverse Normal||0.1.2||0.1.2||0.1.2||0.1.2||gamlss.dist package|
|Kendall||X||X||0.0.3||0.0.3||Base R (src/library/stats/src/kendall.c)
Quantile is taken from
Random variate is created
by inversion method
|Kumaraswamy||0.3.3||0.3.3||0.3.3||0.3.3||Coded from scratch|
|Laplace||0.3.2||0.3.2||0.3.2||0.3.2||Taken from VGAM|
|Levy||0.4.5||0.4.5||0.4.5||0.4.5||Simple transformation from Normal distribution|
|Logistic||X||X||X||X||Base R (nmath)|
|Log normal||X||X||X||X||Base R (nmath)|
from Gamma distribution
|Negative binomial||X||X||X||X||Base R (nmath)|
|Noncentral beta||X||X||X||X||Base R (nmath)|
|Noncentral chi square||X||X||X||X||Base R (nmath)|
|Noncentral F||X||X||X||X||Base R (nmath)|
|Noncentral T||X||X||X||X||Base R (nmath)|
|Normal||X||X||X||X||Base R (nmath)|
Quantile: Numerical optimization method
|Poisson||X||X||X||X||Base R (nmath)|
|Reverse Weibull||0.0.4||0.0.4||0.0.4||0.0.4||EVD package|
|Sign rank||X||X||X||X||Base R (nmath)|
with fallback to base R
for n > 22
Density: Differential method
Quantile: Bisection method
Random: Inversion method
|Student's T||X||X||X||X||Base R (nmath)|
|Tukey||0.3.2||X||X||0.0.3||Base R (nmath)
Random variate is created
by inversion method.
Density: Differential method
This is for Studentized range only!
|Uniform||X||X||X||X||Base R (nmath)|
|Weibull||X||X||X||X||Base R (nmath)|
|Wilcoxon||X||X||X||X||Base R (nmath)|
|Zipf||0.3.2||0.3.2||0.3.2||0.3.2||Coded from scratch,|
with parts taken from VGAM
Design philosophy: In this tutorial, I would assume that you already have some basic knowledge of statistics and Java programming. I tried to keep everything simple and intuitive so people could understand and use JDistlib quickly. To that end, I designed the functions to be statically called. I also recognize people's need to abstract the distributions and that is why I also allowed instantiation of each distribution class (each of which extends GenericDistribution abstract class).
1. Basic distributional features
1.1. Static calls
There are four functions needed for each statistical distribution:
- Density, which is the probability density function (pdf) / probability mass function (pmf) for that distribution. In JDistlib, the call is represented by the function density which takes the following format:
density(x, parameter1, parameter2, ..., give_log)
The x is the value you wish to find the density for. Note that the list of parameters vary by distribution. For example, Normal distribution takes two parameters, mu and sigma, and Chi-square distribution takes only one, nu. The give_log parameter is a boolean (true / false), when set to true JDistlib will output the log-transformed form of the density value, which can be useful (and usually more accurate) for very extreme tail end. Example: You want to know the pdf value of x=2 on Normal distribution with mu=2 and sigma=1:
System.out.println(Normal.density(2, 2, 1, false));
This will give you the value of 0.3989423.
- Cumulative, which is the cumulative density function (cdf) / cumulative mass function (cmf) for that distribution, which is basically the integration (or summation) of the pdf / pmf. In JDistlib, the call is represented by the function cumulative which takes the following format:
cumulative(x, parameter1, parameter2, ..., lower_tail, give_log)
The x is the value you wish to find the cdf/cmf for. This is the function that you would call if you want to find the p-value for a given statistics (e.g., Z score or T statistics or F statistics). Again, list of parameters vary by distibution, as above. The give_log parameter is also as above. If lower_tail is set to true, then the integration / summation is performed from -infinity (or whatever the lower bound value the distribution) to x. If lower_tail is set to false, then the integration / summation is from x to infinity (or whatever the upper bound value is). Example: You want to know the lower tail cdf value of x=2 on Normal distribution with mu=2 and sigma=1:
System.out.println(Normal.cumulative(2, 2, 1, true, false));
This will give you the value of 0.5.
- Quantile, which is the inverse cdf / cmf for that distribution. This is the function you would call if you are given a p-value and would like to know the corresponding value for that distribution. For example, for standard normal distribution (i.e., normal distribution with mu=0, sigma=1), if you put in 0.975, you will get 1.95996 (i.e., the Z statistics for (1-alpha)/2, if you set alpha=0.05). In JDistlib, the call is represented by the function quantile which takes the following format:
quantile(p, parameter1, parameter2, ..., lower_tail, give_log)
The p is the p-value for which you wish to find the corresponding distribution value. Again, list of parameters vary by distibution, as above. The give_log parameter indicates whether or not you give the p-value in log-transformed format or not. If lower_tail is set to true, then you would like the statistics computed from the lower tail end, or otherwise when set to false. Using the example above, here is how you do it in JDistlib:
System.out.println(Normal.quantile(0.975, 0, 1, true, false));
This will give you the value of 1.95996.
- Random, which is the function to generate random numbers according to that distribution. In JDistlib, the call is represented by the function random which takes the following format:
random(n, parameter1, parameter2, ..., random)
The n is the number of random numbers you want to generate. For each distribution, you can omit n if you need to generate only one. Again, list of parameters vary by distibution, as above. The parameter random is the random number generator. There are several choices for random number generators available in JDistlib, such as Mersenne-Twister, Marsaglia's Complementary-multiply-with-carry / CMWC 4096, and WELL 44497b. For example, to generate 1000 normally-distributed with mu=0 and sigma=2 is as follows:
double r = Normal.random(1000, 0, 2, new MersenneTwister());
Such intuitive static calls are available for most distributions in JDistlib, except for Sign Rank and Wilcoxon distributions. Sign Rank and Wilcoxon distributions are implemented as dynamic classes since they require a storage matrix that is dependent on the supplied parameters.
1.2. Dynamic calls
In dynamic calls, you need to instantiate the distribution class with the parameters needed. For example: To create a standard normal object, you would invoke new Normal(0,1). JDistlib by default will also instantiate Mersenne Twister as the random number generator if you wish to generate random numbers later. Since parameters are stored with the instance, you do not need to respecify the parameters in any of the dynamic calls, which can be handy if you have a lot of calls with the same parameters. For example:
Normal n = new Normal(0,1);
System.out.println(n.cumulative(0, false, false));
double r = n.random(1000);
In addition, dynamic calls have a few additional perks: computing hazard function, cumulative hazard function, survival function, and inverse survival functions. Survival and inverse survival functions are simply cumulative and quantile functions, but with lower_tail flag set to false. Hazard function is simply pdf / (1-cdf), and cumulative hazard function is simply -ln(1-cdf).
2. Distributional testing
2.1. Normality Testing
Another major feature of JDistlib is to test your data against normal distribution, i.e., whether your data is normally distributed or not. JDistlib offers a number of tests such as: Kolmogorov-Smirnov, Anderson-Darling, Cramer-Von Mises, D'Agostino-Pearson, Jarque Bera, Kolmogorov-Lilliefors, Shapiro-Francia, and Shapiro-Wilk. These tests can be found in jdistlib.disttest.NormalityTest class. They are designed as static calls with the following format:
NormalityTest.xxx_p_value(double stat, int df);
The xxx refers to the test name (e.g., anderson_darling). IMPORTANT: The array x is usually required to be presorted. If you need a fast routine to sort arrays, please refer to jdistlib.util.Utilities.sort functions. The xxx_statistic will return the value for the test statistic. The xxx_p_value returns the p-value for the given test statistic. Some tests may require a degree of freedom parameter, which is typically the length of the array x.
Note: The calling convention above will be deprecated and be replaced by xxx_test(double x), as it now stands for Kolmogorov-Smirnov test.
2.2. Distribution Testing Other than Normality
JDistlib offers a test to test whether your data came from a distribution other than Normal (e.g., Gamma or Beta). Currently, as of version 0.4.5, only Kolmogorov-Smirnov test is available in JDistlib. Other test (such as Anderson-Darling extension or Ansari-Bradley extension) may be offered in the future. This test can be found in jdistlib.disttest.DistributionTest.kolmogorov_smirnov_test(double X, GenericDistribution dist, TestKind kind, boolean isExact). IMPORTANT: The array x must be presorted. You can plug in any instance of GenericDistribution. For example:
double result = DistributionTest.kolmogorov_smirnov_test(x, new ChiSquare(10), TestKind.TWO_SIDED, true);
The result is an array of two elements: The first is the test statistic and the second is the p-value.
2.3. Multi-distribution Testing
JDistlib also has a feature to compare whether the data in two arrays come from the same distribution. As of version 0.4.5, JDistlib offers Kolmogorov-Smirnov two-sample test, Ansari-Bradley test, and Kruskal-Wallis test. They can be found in jdistlib.disttest.DistributionTest class as well. The calling convention is very similar. For example:
double result = DistributionTest.kolmogorov_smirnov_test(x, y, TestKind.TWO_SIDED, true);
With both x and y be arrays of doubles, not necessarily at the same length. IMPORTANT: The arrays x and y must be presorted. Like the above, the result is an array of two elements: The first is the test statistic and the second is the p-value.
2.4. Other distribution testing features
JDistlib offers many other distributional tests. They can be found in jdistlib.disttest.DistributionTest class as well. These tests include:
- Mann-Whitney U test, which is a nonparametric test to compare two sets of data.
- T-test, which is a test to compare two sets of data under assumption of normality. JDistlib also provides the one sample T-test variety.
- Fligner-Killeen test and Bartlett's test to test homogeneity of variance. Levene's test and Brown-Forsythe tests will be added in the future.
- Variance F test to compare the variance of two datasets.
- Mood's tests of scale to compare whether the medians of two datasets differ by a scale (default = 1).
- and many more...
These routines are implemented as static final functions for ease of use, except for Sign Rank and Wilcoxon distributions. Sign Rank and Wilcoxon distributions are implemented as dynamic classes since they require a storage matrix that is dependent on the supplied parameters.
The primary change I did with the source code is I made the routines thread safe, especially for the RNG routine for each distribution. In R, the RNGs of several distributions require some global state variables, which hinders the implementation of thread safety. I think this is why R is not a multi-threaded program. Multi-threaded libraries in R, such as multicore, got around this by forking processes (i.e., copy the entire memory used by R for each cores). This results in a huge memory waste since multi-threaded R program will consume k times more than it ought to be, where k is the number of cores being used. I got around this by implementing some structures to afford state storage or eliminating the requirement altogether. For Beta and Gamma RNG routines, the time saved by storing the states is meager. So, I eliminated the states for these distributions. In Binomial, Hypergeometric, and Poisson distributions, I implemented the state storage as an inner class that has to be instantiated upon use. Since Sign Rank and Wilcoxon distributions are implemented as dynamic classes, storing the states is as simple as declaring the states as fields.
There was an earlier attempt called distlib, which was based on an earlier version of R. However, the library currently suffers a few shortcomings:
- It was based on an earlier version of R. The newer distribution library has been updated to improve accuracy. Gamma distribution, in particular, has been significantly improved. Since many other key distributions use routines in the Gamma distribution, their accuracy is also markedly improved, especially in the extreme lower tail.
- Distlib is buggy and cannot even compile presently.
- Distlib is not thread safe due to the global states required by some RNG routines.
- Distlib was a result of an automatic translation. The resulting code is very messy.
This is why I decided to do the translation over.
Why not Colt? Colt is a very fast statistical library. However, it does not contain as many statistical distribution as standard R code. I also found that Colt is less accurate at computing the probability values at the extreme tail of the distributions. Note that Colt's precision is up to 6 digits at the very extreme tail and that R's functions have been calibrated and are more accurate than Colt. See Bangalore, et al. paper. Relevant quote:
Amongst the Java libraries for the Chi-square distribution, SSJ performs well until around x=100; because SSJ provides a direct (not calculated by complementation) upper tail CDF, this inaccuracy in more extreme tails is likely due to poor algorithm performance Pierre (2007). The JMSL, Jakarta Math, SOCR and JSci libraries perform comparatively poorly, with accuracy steadily decreasing from about x=50 and dropping to zero at x=66 and x=71. This poor performance is likely due to complementation error as none of these libraries provide a direct CDF for the upper tail area, which had to be obtained by complementation (SourceForge.net, 2007, The Apache Software Foundation, 2007, University of California, Los Angeles, 2007 and Visual Numerics, 2007). The Colt library does very well, with only minor inaccuracies across the entire range of x values; it is unclear why the accuracy is not perfect, but this may be due to an implementation bug as the choice of algorithm appears sound (COLT, 2007). The LRE curves for the GSL library and the R statistical language are not plotted because the CDF tail areas were perfectly accurate with respect to the exact values from DCDFLIB; both GSL and R provide separate functions to directly compute both tail areas and the algorithms implemented in these functions are accurate across the entire range of x values we tested (GNU Scientific Library, 2007 and R Development Core Team, 2008).
- Java does not have "long double". Hence I changed every occurrence of "long double" into "double". This happens in Hypergeometric, Noncentral beta, Noncentral chi square (and, by implication, Noncentral F), Noncentral T, and Tukey distributions. Possible ramifications: loss of precision in these distributions. See the "TODO long double" tag in each of the file. Update since version 0.2.0: So far, the hardest hit is Noncentral chi square (See ticket #9). Noncentral beta also suffers a bit (See ticket #13). Hypergeometric, Noncentral T, and Tukey seem to be fine. Noncentral F is most likely also affected because it depends on Noncentral chi square when df2 is huge (>1e8), but no test cases so far.
- R authors noted a precision problem in the quantile routine of the Hypergeometric distribution and have not fixed it. The problem is most pronounced at the very extreme tail of the distribution. I translated the file as such. So, the resulting translation will also suffer from the same problem. In addition to that, further precision loss should be expected due to the "long double" problem above.
- R is by no means free of bug. Any bugs in relevant R modules will show up in JDistlib too. I put reasonable effort to monitor relevant R bugs. See tickets for details.
I did not translate unused RNG routines in the normal distribution. R's current standard is by inversion. I did, however, translate the Ahrens-Dieter and Kinderman-Ramage methods as an option. I did not translate the Box-Muller found in the R source code because it is not as good as the others and it requires global state storage.As of version 0.1.1, all RNG routines in base R package are implemented except for those that are buggy and are provided only for compatibility purposes. I did only minimal testing. So, caveat emptor.As of version 0.2.0, JDistlib passed most of R's standard battery of tests for all basic (prepackaged) distributions (except for the long double-related accuracy issues above). The only tests left untranslated are three integration tests for the Noncentral Beta. I am working on it, but it is for a long term solution.
This project was initiated by Roby Joehanes in February 2012. JDistlib is currently actively maintained.