Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add confidence interval for MWU #226

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 41 additions & 15 deletions pingouin/nonparametric.py
Original file line number Diff line number Diff line change
Expand Up @@ -144,7 +144,7 @@ def madmedianrule(a):
return (np.fabs(a - np.median(a)) / mad(a)) > k


def mwu(x, y, alternative='two-sided', **kwargs):
def mwu(x, y, alternative='two-sided',confidence=0.95, **kwargs):
"""Mann-Whitney U Test (= Wilcoxon rank-sum test). It is the non-parametric
version of the independent T-test.

Expand All @@ -157,6 +157,8 @@ def mwu(x, y, alternative='two-sided', **kwargs):
Defines the alternative hypothesis, or tail of the test. Must be one of
"two-sided" (default), "greater" or "less". See :py:func:`scipy.stats.mannwhitneyu` for
more details.
confidence : float
Confidence level on confidence interval of difference of medians between x and y.
**kwargs : dict
Additional keywords arguments that are passed to :py:func:`scipy.stats.mannwhitneyu`.

Expand All @@ -169,6 +171,7 @@ def mwu(x, y, alternative='two-sided', **kwargs):
* ``'p-val'``: p-value
* ``'RBC'`` : rank-biserial correlation
* ``'CLES'`` : common language effect size
* ``'CI'`` : confidence interval of difference of medians

See also
--------
Expand Down Expand Up @@ -222,16 +225,20 @@ def mwu(x, y, alternative='two-sided', **kwargs):
Association and the American Statistical Association, 25(2),
101–132. https://doi.org/10.2307/1165329

.. [5] Campbell, M. J. & Gardner, M. J. (1988). Calculating confidence
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add in the "Notes" section a one line explanation of the CI method?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll give that a go. Like I said, I'm not a statistician, so that'll have to be proof-read by someone

intervals for some non-parametric analyses.
British Medical Journal Volume 226, 1988.

Examples
--------
>>> import numpy as np
>>> import pingouin as pg
>>> np.random.seed(123)
>>> x = np.random.uniform(low=0, high=1, size=20)
>>> y = np.random.uniform(low=0.2, high=1.2, size=20)
>>> pg.mwu(x, y, alternative='two-sided')
U-val alternative p-val RBC CLES
MWU 97.0 two-sided 0.00556 0.515 0.2425
>>> pg.mwu(x, y, alternative='two-sided',confidence=0.95)
U-val alternative p-val RBC CLES CI95%
MWU 97.0 two-sided 0.00556 0.515 0.2425 [-0.39290395101879694, -0.09400270319896187]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the actual output that you get? The CI should normally be rounded to two decimals by the _postprocess_dataframe function

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the actual output I get. I was wondering about that, too. But then again, the t-test also gives me full floats (at least when confidence!=0.95), so I thought that was intentional.
I can of course round it in MWU or do you want to adress that elsewhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of the t-test showing that behavior.
grafik


Compare with SciPy

Expand All @@ -241,19 +248,19 @@ def mwu(x, y, alternative='two-sided', **kwargs):

One-sided test

>>> pg.mwu(x, y, alternative='greater')
U-val alternative p-val RBC CLES
MWU 97.0 greater 0.997442 0.515 0.2425
>>> pg.mwu(x, y, alternative='greater',confidence=0.95)
U-val alternative p-val RBC CLES CI95%
MWU 97.0 greater 0.997442 0.515 0.2425 [-0.3711286134873304, inf]

>>> pg.mwu(x, y, alternative='less')
U-val alternative p-val RBC CLES
MWU 97.0 less 0.00278 0.515 0.7575
>>> pg.mwu(x, y, alternative='less',confidence=0.95)
U-val alternative p-val RBC CLES CI95%
MWU 97.0 less 0.00278 0.515 0.7575 [-inf, -0.10192641647044609]

Passing keyword arguments to :py:func:`scipy.stats.mannwhitneyu`:

>>> pg.mwu(x, y, alternative='two-sided', method='exact')
U-val alternative p-val RBC CLES
MWU 97.0 two-sided 0.004681 0.515 0.2425
>>> pg.mwu(x, y, alternative='two-sided',confidence=0.95, method='exact')
U-val alternative p-val RBC CLES CI95%
MWU 97.0 two-sided 0.004681 0.515 0.2425 [-0.39290395101879694, -0.09400270319896187]
"""
x = np.asarray(x)
y = np.asarray(y)
Expand Down Expand Up @@ -281,14 +288,33 @@ def mwu(x, y, alternative='two-sided', **kwargs):

# Effect size 2: rank biserial correlation (Wendt 1972)
rbc = 1 - (2 * uval) / diff.size # diff.size = x.size * y.size


# Confidence interval for the (difference in) medians
# Campbell and Gardner 2000
if alternative == "two-sided":
alpha = 1.0 - confidence
conf = 1.0 - alpha / 2 # 0.975
else:
conf = confidence
N = scipy.stats.norm.ppf(conf)
ct1, ct2 = len(x),len(y) # count samples
diffs = sorted([i-j for i in x for j in y]) # get ct1xct2 difference
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kschuerholt could we use a numpy function / numpy broadcasting here to avoid the nested for loop?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's easy enough. I'll add it in a new commit promptly.

k = int(round(ct1*ct2/2 - (N * (ct1*ct2*(ct1+ct2+1)/12)**0.5)))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure that the code follows the flake8 guideline, i.e. there must be a white space between arithmetic operators

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that.. was editing the file on the fly in github directly, no auto linting/formatting there yet unfortunatley. Next commit will be formatted accordingly.

ci = [diffs[k], diffs[len(diffs)-k]]
if alternative == "greater":
ci[1] = np.inf
elif alternative == "less":
ci[0] = -np.inf
# Rename CI
ci_name = 'CI%.0f%%' % (100 * confidence)
# Fill output DataFrame
stats = pd.DataFrame({
'U-val': uval,
'alternative': alternative,
'p-val': pval,
'RBC': rbc,
'CLES': cles}, index=['MWU'])
'CLES': cles,
ci_name: [ci]}, index=['MWU'])
return _postprocess_dataframe(stats)


Expand Down