Statistics

    Dependence (or association)

    If the values of one variable can be deduced, completely or partially, from the values of another variable, there is a dependence between the two variables.

    The relation between two variables can be characterized by 3 criteria (see summary/stat-correction/stat98_6):

    • intensity (strong: close points, weak: distant points, null: grid)
    • shape (linear, non-linear)
    • direction (monotonic positive, or monotonic negative)

    Note: a linear relation is always monotonic (i.e., only increasing or only decreasing)

    Source

    Note: independent variables imply uncorrelated variables, BUT uncorrelated variables DO NOT imply independent variables.

    Source
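
    A minimal sketch of this distinction, assuming Python with numpy/scipy and made-up data: Y = X**2 is completely determined by X (hence dependent), yet its Pearson correlation with a symmetric X is close to zero.

        # Uncorrelated but dependent: Y is a deterministic function of X,
        # yet the linear (Pearson) correlation is ~0 because X is symmetric around 0.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        x = rng.normal(size=10_000)   # symmetric around 0
        y = x ** 2                    # completely determined by x

        r, p = stats.pearsonr(x, y)
        print(f"Pearson r = {r:.3f}")   # close to 0 despite full dependence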

    Correlation functions

    Useful to determine whether variables are correlated, and therefore dependent.

    Pearson

    Measures the linear correlation between two variables. The Pearson coefficient lies in [-1, 1]. A value of 0 means the two variables have no linear correlation.

    Limitations: assumes a bivariate normal distribution and no extreme values, otherwise the coefficient can lead to wrong conclusions about the existence (or not) of a relation.

    Mathematical definition; see also the figure from Wikipedia.
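
    A minimal sketch of the definition, assuming numpy and made-up data: the coefficient is the covariance of the two variables divided by the product of their standard deviations.

        # Pearson r = cov(X, Y) / (std(X) * std(Y)), computed from the definition.
        import numpy as np

        rng = np.random.default_rng(1)
        x = rng.normal(size=500)
        y = 2.0 * x + rng.normal(scale=0.5, size=500)   # roughly linear relation

        cov = np.cov(x, y, ddof=1)[0, 1]
        r = cov / (np.std(x, ddof=1) * np.std(y, ddof=1))
        print(f"Pearson r computed from the definition: {r:.3f}")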

    Spearman

    Similar to the Pearson correlation (value in [-1, 1]), except that it is computed on the ranks of the values.

    Advantage: finds monotonic relations whatever their form (linear, exponential, power).

    Note: it is better to use Spearman when the two distributions are asymmetric and/or contain extreme values.

    Hint between Pearson and Spearman

    signif(Pearson) > signif(Spearman): frequently indicates that exceptional values inflate the Pearson coefficient but do not modify the Spearman coefficient, which is more robust.

    signif(Spearman) > signif(Pearson): frequently indicates that a non-linear relation exists.
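
    A minimal sketch of this comparison, assuming scipy and made-up data: on a monotonic but non-linear (exponential) relation, Spearman stays close to 1 while Pearson is lower.

        # Pearson vs Spearman on a monotonic, non-linear relation.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        x = rng.uniform(0, 5, size=200)
        y = np.exp(x) + rng.normal(scale=5.0, size=200)   # monotonic, non-linear

        pearson_r, pearson_p = stats.pearsonr(x, y)
        spearman_r, spearman_p = stats.spearmanr(x, y)
        print(f"Pearson  r = {pearson_r:.3f} (p = {pearson_p:.2g})")
        print(f"Spearman r = {spearman_r:.3f} (p = {spearman_p:.2g})")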

    Tests

    Some information about Statistic tests:

    Parametric: Tests that require parameters (notably that the data follow a known distribution).

    Non-parametric: Does not require parameters; these methods do not assume a specific distribution for the data. tutorial

    Rank tests: Usually useful when variables do not follow a normal distribution (Spearman, Wilcoxon, Friedman).

    Normality tests: Determine if a data set is well modeled by a normal distribution.

    Note (p-value, alpha usually 0.05 or 0.01):

    • p <= alpha: reject H0
    • p > alpha: fail to reject H0
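
    A minimal sketch of this decision rule, using a hypothetical helper name and an assumed alpha of 0.05:

        # Generic p-value decision rule used by the tests below (helper name is made up).
        def decide(p_value: float, alpha: float = 0.05) -> str:
            if p_value <= alpha:
                return "reject H0"
            return "fail to reject H0"

        print(decide(0.003))   # -> reject H0
        print(decide(0.20))    # -> fail to reject H0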

    Student t-test

    Tests the dependence between two variables by comparing their means.

    Type: Parametric, check dependency

    Limitation: the variables must follow a normal distribution and have the same variance. An alternative is Wilcoxon-Mann-Whitney.

    Notes:

    • A nonparametric version of the paired Student t-test is the Wilcoxon signed-rank.
    • A nonparametric version of the Student t-test is the Mann-Whitney U test.
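
    A minimal sketch, assuming scipy and two made-up normal samples:

        # Student t-test on two independent samples (equal_var=True is the classic Student test).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(3)
        a = rng.normal(loc=0.0, scale=1.0, size=50)
        b = rng.normal(loc=0.5, scale=1.0, size=50)

        t_stat, p_value = stats.ttest_ind(a, b, equal_var=True)
        print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p <= 0.05 would indicate different means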

    Welch

    Tests the dependence between two variables by comparing their means.

    Type: Parametric, check dependency.

    Limitation: the variables must follow a normal distribution; unlike the Student t-test, equal variances are not required.
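
    A minimal sketch, assuming scipy: the same ttest_ind call with equal_var=False performs Welch's test.

        # Welch's t-test: ttest_ind with equal_var=False (samples with different variances).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(4)
        a = rng.normal(loc=0.0, scale=1.0, size=40)
        b = rng.normal(loc=0.6, scale=3.0, size=60)   # different variance and size

        t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
        print(f"Welch t = {t_stat:.3f}, p = {p_value:.4f}")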

    Kolmogorov-Smirnov (K-S)

    Allows checking whether a variable follows a reference distribution such as the normal distribution (one-sample case), or whether two samples follow the same distribution (two-sample case).

    Type: Normality test, non-parametric.

    H0: Null hypothesis that the sample is drawn from the reference distribution (in the one-sample case) or that the samples are drawn from the same distribution (in the two-sample case).

    Note: from reading, Shapiro-Wilk is similar and considered better for normality testing.
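
    A minimal sketch, assuming scipy and made-up data; note that kstest against 'norm' compares to a standard normal unless the distribution parameters are passed via args.

        # One-sample Kolmogorov-Smirnov test against a normal distribution.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(5)
        data = rng.normal(loc=2.0, scale=3.0, size=300)

        # Compare against N(mean, std) estimated from the data (approximate usage).
        stat, p_value = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
        print(f"KS stat = {stat:.3f}, p = {p_value:.4f}")   # p > 0.05 -> fail to reject normality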

    Shapiro-Wilk

    A low p-value (usually less than 0.05 <=> 5%) indicates that the variable does not follow a normal distribution; otherwise we cannot reject the hypothesis that the variable is drawn from a normal distribution.

    Type: Normality test, non-parametric.

    H0: Null hypothesis that the data was drawn from a normal distribution.

    Limitation: may be inaccurate when the number of samples is > 5000.

    Note: The chance of rejecting the null hypothesis when it is true is close to 5% regardless of sample size.
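
    A minimal sketch, assuming scipy and a made-up, clearly non-normal sample:

        # Shapiro-Wilk normality test on an exponential (non-normal) sample.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(6)
        data = rng.exponential(scale=1.0, size=200)

        w_stat, p_value = stats.shapiro(data)
        print(f"W = {w_stat:.3f}, p = {p_value:.2g}")   # low p -> reject H0 of normality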

    Wilcoxon (signed-rank)

    Checks whether the differences between two paired variables are centred on zero (null median difference). Used when the samples are related or matched in some way, or represent two measurements of the same technique. More specifically, each sample is independent, but comes from the same population.

    Examples of paired samples in machine learning might be the same algorithm evaluated on different datasets or different algorithms evaluated on exactly the same training and test data.

    Type: Ranking test, non-parametric, check dependency, paired samples.

    H0: Null hypothesis that two related paired samples come from the same distribution.

    Note: results between R and scipy can differ if the number of samples is too small.
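
    A minimal sketch, assuming scipy and made-up paired scores (e.g., the same algorithm measured before/after a change):

        # Wilcoxon signed-rank test on paired samples.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)
        before = rng.normal(loc=0.70, scale=0.05, size=30)
        after = before + rng.normal(loc=0.02, scale=0.03, size=30)   # paired measurements

        stat, p_value = stats.wilcoxon(before, after)
        print(f"W = {stat:.1f}, p = {p_value:.4f}")   # p <= 0.05 would indicate a significant difference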

    (Wilcoxon-) Mann-Whitney U test

    This test can be used to investigate whether two independent samples were selected from populations having the same distribution. More specifically, the test determines whether it is equally likely that a randomly selected observation from one sample will be greater than or less than a randomly selected observation from the other sample. If this is violated, it suggests differing distributions.

    A similar non-parametric test used on dependent samples is the Wilcoxon signed-rank test (for paired samples).

    Type: Non-parametric, check dependency.

    H0: Null hypothesis is that there is no difference between the distributions of the data samples.

    Requirement: at least 20 observations in each data sample.
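
    A minimal sketch, assuming scipy and two made-up independent samples (at least 20 observations each, as noted above):

        # Mann-Whitney U test on two independent samples.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(8)
        sample1 = rng.normal(loc=0.0, scale=1.0, size=25)
        sample2 = rng.normal(loc=0.8, scale=1.0, size=30)

        u_stat, p_value = stats.mannwhitneyu(sample1, sample2, alternative='two-sided')
        print(f"U = {u_stat:.1f}, p = {p_value:.4f}")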

    Kruskal-Wallis

    This test can be used to determine whether more than two independent samples have different distributions. It can be thought of as the generalization of the Mann-Whitney U test.

    Useful to compare many different samples (>= 3 variables).

    When the Kruskal-Wallis H-test leads to significant results, then at least one of the samples is different from the other samples. However, the test does not identify where the difference(s) occur. Moreover, it does not identify how many differences occur.

    Type: non-parametric, check dependency.

    H0: Null hypothesis is that all data samples were drawn from the same distribution

    Note: If the samples are paired in some way, such as repeated measures, then the Kruskal-Wallis H test would not be appropriate. Instead, the Friedman test can be used, named for Milton Friedman.
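
    A minimal sketch, assuming scipy and three made-up independent samples:

        # Kruskal-Wallis H-test across three independent samples.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(9)
        g1 = rng.normal(loc=0.0, size=30)
        g2 = rng.normal(loc=0.0, size=30)
        g3 = rng.normal(loc=1.0, size=30)   # shifted group

        h_stat, p_value = stats.kruskal(g1, g2, g3)
        print(f"H = {h_stat:.3f}, p = {p_value:.4f}")   # significant -> at least one group differs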

    Friedman

    Compares more than two samples that are related. The parametric equivalent to this test is the repeated measures analysis of variance (ANOVA). When the Friedman test leads to significant results, at least one of the samples is different from the other samples.

    Type: non-parametric, check dependency, paired samples.

    H0: Null hypothesis is that the multiple paired samples have the same distribution.

    Requirement: The test assumes two or more paired data samples with 10 or more samples per group.
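
    A minimal sketch, assuming scipy and made-up repeated measures (e.g., three algorithms scored on the same datasets):

        # Friedman test on three paired (repeated-measures) samples.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(10)
        base = rng.normal(loc=0.75, scale=0.05, size=12)    # 12 datasets (>= 10 per group)
        algo_a = base + rng.normal(scale=0.01, size=12)
        algo_b = base + rng.normal(scale=0.01, size=12)
        algo_c = base + 0.03 + rng.normal(scale=0.01, size=12)

        stat, p_value = stats.friedmanchisquare(algo_a, algo_b, algo_c)
        print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")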

    Miscellaneous

    Bivariate distribution

    Joint probability distribution of two variables.

    Source
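
    A minimal sketch, assuming numpy: sampling from a bivariate (joint) normal distribution with a chosen covariance.

        # Sampling from a bivariate normal distribution (joint distribution of two variables).
        import numpy as np

        rng = np.random.default_rng(11)
        mean = [0.0, 0.0]
        cov = [[1.0, 0.8],    # covariance matrix: variances on the diagonal,
               [0.8, 1.0]]    # covariance (here 0.8) off the diagonal
        samples = rng.multivariate_normal(mean, cov, size=1000)
        print(samples.shape)   # (1000, 2): paired values of the two variables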

    Central Limit Theorem (CLT)

    When independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed.

    Source
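
    A minimal sketch, assuming numpy: means of uniform samples (a non-normal distribution) concentrate around the true mean and look roughly normal.

        # Central Limit Theorem: sample means of a uniform distribution approach a normal shape.
        import numpy as np

        rng = np.random.default_rng(12)
        # 10_000 means, each computed over 50 uniform draws on [0, 1].
        means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)
        print(f"mean ~ {means.mean():.3f}, std ~ {means.std():.3f}")   # ~0.5 and ~sqrt(1/12)/sqrt(50)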

    Degree of freedom

    Source

    P-value

    The p-value or probability value is, for a given statistical model, the probability of obtaining results at least as extreme as the results actually observed, assuming the null hypothesis is true.

    Source

    Tests on continuous variables

    Comparing continuous variables sounds like a harder problem; see StackExchange.

    Notes

    The Mann-Whitney U test for comparing independent data samples: the nonparametric version of the Student t-test.

    The Wilcoxon signed-rank test for comparing paired data samples: the nonparametric version of the paired Student t-test.