Beeswarm plots let every data point speak

张开发
2026/5/4 14:06:28 · 15 min read
1. Core features of the beeswarm plot

The cleverest part of a beeswarm plot is its layout algorithm. When several data points have similar values, they are not simply drawn on top of each other. Instead, as if a "repulsive force" acted between them, they are shifted slightly along the categorical axis (horizontally, in a vertical plot), forming a distribution that resembles a swarm of bees.

[Figure: the same data set shown as a scatter plot and as a beeswarm plot]

The comparison shows the core features of a beeswarm plot:

- No overlap: the algorithm checks whether markers would overlap, and when two points have close values it pushes them apart horizontally.
- Preserved distribution shape: the spread-out points naturally form an outline similar to a "violin" or a "mountain peak", which shows the density of the data directly.
- Adjustable parameters: we can adjust the marker size and how tightly the points are packed. Larger markers have more visual impact but need more horizontal space.

2. Beeswarm plot vs. bar chart: from summary to detail

A bar chart is like a summary report: it tells us the mean or total for each category but hides how the data is distributed within it. A beeswarm plot is more like an all-hands meeting where every data point gets a chance to speak.

Below, the same data set is drawn as a bar chart, a box plot, and a beeswarm plot, so we can compare what each one shows.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Generate example data
np.random.seed(123)
categories = ["Product A", "Product B", "Product C", "Product D"]
data_comparison = []

for category in categories:
    n_points = 40
    if category == "Product A":
        values = np.random.normal(75, 8, n_points)
    elif category == "Product B":
        values = np.random.normal(82, 12, n_points)
    elif category == "Product C":
        values = np.random.normal(65, 5, n_points)
    else:  # Product D
        # Build a bimodal distribution
        values1 = np.random.normal(55, 6, n_points // 2)
        values2 = np.random.normal(85, 7, n_points // 2)
        values = np.concatenate([values1, values2])

    for value in values:
        data_comparison.append({"Product": category, "User rating": value})

# Three side-by-side panels. The original snippet does not show this call,
# so the figsize used here is an assumption.
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# 1. Bar chart (means)
means = []
for category in categories:
    cat_data = [d["User rating"] for d in data_comparison if d["Product"] == category]
    means.append(np.mean(cat_data))

bars = axes[0].bar(
    categories,
    means,
    color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"],
)
# Annotate each bar with its mean
# omitted...

# 2. Box plot
box_data = []
for category in categories:
    cat_data = [d["User rating"] for d in data_comparison if d["Product"] == category]
    box_data.append(cat_data)

boxplot = axes[1].boxplot(
    box_data,
    tick_labels=categories,  # Matplotlib 3.9+; older versions use labels=
    patch_artist=True,
    boxprops=dict(facecolor="lightblue"),
)
# omitted...

# 3. Beeswarm plot
data_df = pd.DataFrame(data_comparison)
sns.swarmplot(
    x="Product",
    y="User rating",
    hue="Product",
    data=data_df,
    ax=axes[2],
    size=5,
    palette="Set2",
    edgecolor="black",
    linewidth=0.5,
)
# omitted...

plt.tight_layout()
plt.show()
```

The beeswarm plot is drawn with the swarmplot function from the seaborn library.

The comparison shows that:

- The bar chart tells us that Product D has a mean rating of about 70.
- The box plot hints that Product D's ratings cover a wide range.
- Only the beeswarm plot clearly reveals that Product D actually has two distinct groups of users: a low-rating group and a high-rating group.

3. Beeswarm plot vs. scatter plot: from chaos to order

When a traditional scatter plot is used for categorical data, many points often land on top of each other and form a dark blob, so the true distribution cannot be seen. The beeswarm plot solves this with its layout algorithm.

Below we construct data with different densities and compare how a scatter plot and a beeswarm plot display it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Compare the visual effect of a scatter plot and a beeswarm plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Generate data with different densities
np.random.seed(42)
density_data = []
categories = ["Low density", "Medium density", "High density"]

for i, category in enumerate(categories):
    n_points = 20 + i * 30  # different densities: 20, 50, 80 points
    if category == "Low density":
        values = np.random.normal(50, 15, n_points)
    elif category == "Medium density":
        values = np.random.normal(50, 8, n_points)
    else:  # High density
        values = np.random.normal(50, 4, n_points)

    for value in values:
        density_data.append({"Category": category, "Value": value})

# Left: traditional scatter plot (every point of a category shares one x position)
for i, category in enumerate(categories):
    cat_data = [d["Value"] for d in density_data if d["Category"] == category]
    x_positions = np.full(len(cat_data), i)
    axes[0].scatter(x_positions, cat_data, alpha=0.6, s=60, label=category)
# omitted...

# Right: beeswarm plot
density_data_df = pd.DataFrame(density_data)
sns.swarmplot(
    x="Category",
    y="Value",
    hue="Category",
    data=density_data_df,
    ax=axes[1],
    size=6,
    palette="coolwarm",
    edgecolor="black",
    linewidth=0.5,
)
# omitted...

plt.tight_layout()
plt.show()
```

The beeswarm plot solves the overplotting problem. For moderate amounts of data (roughly up to a few hundred points), it is a strong choice for showing distribution density.

4. When to use a beeswarm plot

The beeswarm plot is not meant to replace the bar chart or the scatter plot. It has its own use cases and limitations.

A beeswarm plot is a good fit when:

- The sample size is moderate (usually fewer than a few hundred points) and the full distribution should be shown
- Both the overall trend and the individual data points need to be visible
- The data has several categories whose distributions need to be compared
- The goal is to spot outliers or special patterns such as a bimodal distribution

Its main limitations are:

- Large data sets can make the chart too crowded
- For very large data, a box plot or violin plot may be more suitable
- Precise numeric comparison is less intuitive than with a bar chart

5. Summary

The beeswarm plot is like a microscope for data visualization: it lets us see both the overall shape of the distribution and the exact position of every data point. Compared with the bar chart, which shows only summary information, and the scatter plot, which is prone to overlap, the beeswarm plot has a distinct advantage for showing the full distribution of small and medium-sized data sets.
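To make the layout idea from section 1 concrete, here is a minimal sketch of a greedy "push apart" placement in plain Python. It is an illustration only and not seaborn's actual implementation: the function name `beeswarm_offsets`, the `diameter` parameter, and the half-diameter search step are all assumptions made for this example. It also assumes that values and marker diameter are measured in the same units, whereas a real plotting library would typically do this in display coordinates.

```python
import math


def beeswarm_offsets(values, diameter=1.0):
    """Return one horizontal offset per value so that no two markers overlap.

    Hypothetical helper for illustration. Points are placed in ascending
    order of value; each point tries the offsets 0, +step, -step, +2*step,
    -2*step, ... and takes the first one whose centre is at least
    `diameter` away from every marker placed so far.
    """
    step = diameter / 2.0
    order = sorted(range(len(values)), key=lambda i: values[i])
    placed = []  # (offset, value) pairs of markers already positioned
    offsets = [0.0] * len(values)

    for i in order:
        y = values[i]
        # Only markers closer than one diameter in value can collide.
        neighbours = [(ox, oy) for ox, oy in placed if abs(y - oy) < diameter]
        k = 0
        while True:
            n = (k + 1) // 2            # distance from the centre line, in steps
            sign = 1 if k % 2 else -1   # odd k goes right, even k goes left
            candidate = (sign * n) * step
            if all(
                math.hypot(candidate - ox, y - oy) >= diameter
                for ox, oy in neighbours
            ):
                break
            k += 1
        offsets[i] = candidate
        placed.append((candidate, y))

    return offsets
```

For example, `beeswarm_offsets([5.0, 5.0, 5.0])` places three identical values at offsets `0.0`, `1.0` and `-1.0`, while well-separated values all stay on the centre line. Plotting `category_index + offset` against `value` would then give a swarm-like picture. The denser the values, the wider the swarm becomes, which is why the outline reflects the data density.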
