
How do I create a box plot from data with weights?


HUX布斯 2022-06-02 12:13:26
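I have the following data: a Name, the number of times that name occurs (Count), and a Score for each name. I want to create a box-and-whisker plot of Score in which each name's Score is weighted by its Count.

The result should be the same as if I had the data in raw (not frequency) form. But I don't want to actually convert the data to that form, because it would blow up in size very quickly.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)
print(df)

   Count   Name  Score
0     20   Sara      2
1     10   John      4
2      5   Mark      7
3      2  Peter      8
4      5   Kate      7

I'm not sure how to approach this in Python. Any help is appreciated!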

2 Answers

红颜莎娜

Contributed 1842 experience points, received more than 12 likes

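This answer comes late, but in case it is useful to anyone who runs into the same question:

When your weights are integers, you can use reindex to expand the rows by their counts and then call boxplot directly. I have done this on DataFrames of a few thousand rows that expand to a few hundred thousand without memory problems, especially when the expanded DataFrame is built inside a second function, so that it is only a temporary and is never assigned to a variable.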


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)

def reindex_df(df, weight_col):
    """Expand the dataframe to prepare for resampling.

    The result has one row per count per sample."""
    # Repeat each index label as many times as its weight.
    df = df.reindex(df.index.repeat(df[weight_col]))
    df.reset_index(drop=True, inplace=True)
    return df

# Keep the original df intact; the expanded copy gets its own name.
expanded = reindex_df(df, weight_col='Count')

sns.boxplot(x='Name', y='Score', data=expanded)
plt.show()

Or, if you are worried about memory:


def weighted_boxplot(df, weight_col):
    # The expanded frame is passed straight to seaborn and never bound to a name.
    sns.boxplot(x='Name',
                y='Score',
                data=reindex_df(df, weight_col=weight_col))

weighted_boxplot(df, 'Count')
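To make the memory trade-off concrete, here is a small sketch that compares the original frame with its expanded form. It is not part of the answer above: the variable names are illustrative, and it assumes only `Index.repeat` and `DataFrame.memory_usage` from pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Sara", "John", "Mark", "Peter", "Kate"],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7],
})

# Repeat each row label Count times, then select those rows.
expanded = df.loc[df.index.repeat(df["Count"])].reset_index(drop=True)

# One row per observation, so the length equals the sum of the counts.
n_rows = len(expanded)
total_count = int(df["Count"].sum())

# Per-name row counts in the expanded frame should match the weights.
rows_per_name = expanded.groupby("Name").size()

# Rough footprint in bytes (deep=True also measures the strings).
bytes_before = int(df.memory_usage(deep=True).sum())
bytes_after = int(expanded.memory_usage(deep=True).sum())

print(n_rows, total_count)
print(bytes_before, bytes_after)
```

The expanded frame grows linearly with the sum of the weights, so this check is a quick way to judge whether expansion is affordable for a given dataset.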


Answered 2022-06-02
白猪掌柜的

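Contributed 1893 experience points, received more than 10 likes

There are two ways to answer this question. The first is probably the one you expect, but it is not a good way to compute the confidence intervals of the median. matplotlib handles that with the code below, excerpted from matplotlib/cbook/__init__.py. The second approach is therefore much better than any custom code, because it reuses that well-tested routine.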


def boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                  autorange=False):
    def _bootstrap_median(data, N=5000):
        # determine 95% confidence intervals of the median
        M = len(data)
        percentiles = [2.5, 97.5]

        bs_index = np.random.randint(M, size=(N, M))
        bsData = data[bs_index]
        estimate = np.median(bsData, axis=1, overwrite_input=True)
        # (the excerpt ends here)
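The excerpt above stops partway through, so here is a standalone sketch of the same idea, a percentile bootstrap for the median. This is not matplotlib's code: the function name `bootstrap_median_ci`, its parameters, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def bootstrap_median_ci(data, n_resamples=5000, seed=0):
    """Percentile-bootstrap 95% interval for the median (illustrative)."""
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    m = len(data)
    # Each row of idx is one resample of the data, drawn with replacement.
    idx = rng.integers(0, m, size=(n_resamples, m))
    medians = np.median(data[idx], axis=1)
    # The 2.5th and 97.5th percentiles of the resampled medians
    # bound the interval.
    lo, hi = np.percentile(medians, [2.5, 97.5])
    return lo, hi

# Raw-form scores built from the question's (Score, Count) pairs.
scores = np.repeat([2, 4, 7, 8, 7], [20, 10, 5, 2, 5])
lo, hi = bootstrap_median_ci(scores)
print(lo, hi)
```

Note that the resampling works on raw observations, which is one reason the frequency-only approach below does not attempt notches.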

First:


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)



def boxplot(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # Sort the values, keeping each frequency aligned with its value.
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()

    # Frequency-weighted mean and sample standard deviation.
    fx = values * freqs
    mean = fx.sum() / count
    variance = ((freqs * values ** 2).sum() / count) - mean ** 2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    print([std, variance])

    # Quartiles: the first sorted value whose cumulative count reaches q * count.
    cumcount = np.cumsum(freqs)
    Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
    Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
    Q3 = values[np.searchsorted(cumcount, 0.75 * count)]

    # The interquartile range (IQR), also called the midspread or middle 50%,
    # is the difference between the 75th and 25th percentiles. It is a
    # commonly used robust measure of scale.
    IQR = Q3 - Q1

    # The fences lie 1.5 * IQR above Q3 and below Q1; for normally distributed
    # data they enclose about 99.3% of the values. Each whisker ends at the
    # most extreme data point inside its fence, and points outside are fliers.
    upper_fence = Q3 + 1.5 * IQR
    lower_fence = Q1 - 1.5 * IQR
    inside = (values >= lower_fence) & (values <= upper_fence)
    whishi = values[inside].max()
    whislo = values[inside].min()
    fliers = values[~inside]

    stats = [{
        'label': 'Scores',  # tick label for the boxplot
        'mean': mean,  # arithmetic mean value
        'iqr': IQR,
        # 'cilo' and 'cihi' (notches around the median) are not computed here
        'whishi': whishi,  # end of the upper whisker
        'whislo': whislo,  # end of the lower whisker
        'fliers': fliers,  # outliers
        'q1': Q1,  # first quartile (25th percentile)
        'med': Q2,  # 50th percentile
        'q3': Q3  # third quartile (75th percentile)
    }]

    fs = 10  # fontsize
    _, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6))
    axes.bxp(stats)
    axes.set_title('Default', fontsize=fs)
    plt.show()


boxplot(df['Score'], df['Count'])
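As a sanity check on the cumulative-count approach, the sketch below compares quartiles computed from the (value, frequency) pairs with quartiles computed from the raw, expanded data. The helper name `weighted_quartiles` is hypothetical. The two agree on this sample, but that is not guaranteed in general: the `searchsorted` rule is an inverse-CDF definition, while `np.percentile` interpolates linearly by default.

```python
import numpy as np

values = np.array([2, 4, 7, 8, 7])
freqs = np.array([20, 10, 5, 2, 5])

def weighted_quartiles(values, freqs):
    """Quartiles from (value, frequency) pairs via cumulative counts."""
    order = np.argsort(values)
    v, f = values[order], freqs[order]
    cum = np.cumsum(f)
    total = cum[-1]
    # First sorted value whose cumulative count reaches q * total.
    return [v[np.searchsorted(cum, q * total)] for q in (0.25, 0.5, 0.75)]

from_freqs = weighted_quartiles(values, freqs)

# Reference: expand to raw form and use NumPy's default percentile method.
from_raw = np.percentile(np.repeat(values, freqs), [25, 50, 75])

print(from_freqs, from_raw)
```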


Second:


import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cbook  # provides boxplot_stats

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)

labels = ['Scores']

# Expand the scores to raw form: each Score repeated Count times.
data = df['Score'].repeat(df['Count']).tolist()

# compute the boxplot stats
stats = cbook.boxplot_stats(data, labels=labels, bootstrap=10000)

print(['stats :', stats])

fs = 10  # fontsize

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
axes.bxp(stats)
axes.set_title('Boxplot', fontsize=fs)

plt.show()
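To see what `cbook.boxplot_stats` returns without drawing anything, the following sketch pulls a few entries out of the first stats dictionary. The variable names are illustrative, and the call relies on the function's default settings (`whis=1.5`, no bootstrap).

```python
import numpy as np
from matplotlib import cbook

# Raw-form scores built from the question's (Score, Count) pairs.
scores = np.repeat([2, 4, 7, 8, 7], [20, 10, 5, 2, 5])

# boxplot_stats returns one dict per dataset; take the only one.
stats = cbook.boxplot_stats(scores)[0]

# Keep just the values that determine the box and whiskers.
summary = {k: stats[k] for k in ("q1", "med", "q3", "iqr", "whislo", "whishi")}
n_fliers = len(stats["fliers"])

print(summary, n_fliers)
```

The same list of dictionaries is what `Axes.bxp` consumes, so inspecting it first is a cheap way to confirm the numbers before plotting.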


Answered 2022-06-02