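10 Standard Datasets for Practicing Applied Machine Learning

Original: https://machinelearningmastery.com/standard-machine-learning-datasets/

The key to getting good at applied machine learning is practicing on many different datasets.

This is because each problem is different and requires subtly different data preparation and modeling methods.

In this post, you will discover 10 top standard machine learning datasets that you can use for practice.

Let's dive in.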

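  • Update Mar/2018: Added alternate links to download the Pima Indians and Boston Housing datasets, as the originals appear to have been taken down.
  • Update Feb/2019: Minor update to the expected default RMSE for the insurance dataset.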

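Overview

A Structured Approach

Each dataset is summarized in a consistent way. This makes them easy to compare and navigate, so that you can practice a specific data preparation technique or modeling method.

The aspects that you need to know about each dataset are:

  1. Name: How to refer to the dataset.
  2. Problem Type: Whether the problem is regression or classification.
  3. Inputs and Outputs: The number and known names of the input and output features.
  4. Performance: Baseline performance for comparison using the Zero Rule algorithm, as well as the best known performance (if known).
  5. Sample: A snapshot of the first 5 rows of raw data.
  6. Links: Where you can download the dataset and learn more.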
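
The Zero Rule baseline named in item 4 can be sketched in a few lines. The version below is a minimal illustration, not the exact procedure used to produce the baseline figures in this post: it predicts the most common class for classification and the mean value for regression, and it scores the predictions on the same toy values it was fitted on. The function names and the toy values are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def zero_rule_classification(train_labels):
    """Return the most common label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def zero_rule_regression(train_targets):
    """Return the mean of the training targets."""
    return sum(train_targets) / len(train_targets)


def accuracy(actual, predicted):
    """Percentage of predictions that match the actual labels."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return 100.0 * correct / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Toy values for illustration only (not taken from any dataset in this post).
labels = ["0", "0", "1", "0"]
majority = zero_rule_classification(labels)
print("baseline accuracy:", accuracy(labels, [majority] * len(labels)))

targets = [10.0, 20.0, 30.0]
mean = zero_rule_regression(targets)
print("baseline RMSE:", rmse(targets, [mean] * len(targets)))
```

Any model worth keeping should score better than these two numbers on the same data split.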

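Standard Datasets

Below is a list of the 10 datasets we will cover.

Each dataset is small enough to fit into memory and to review in a spreadsheet. All datasets consist of tabular data and have no (explicitly) missing values.

  1. Swedish Auto Insurance Dataset.
  2. Wine Quality Dataset.
  3. Pima Indians Diabetes Dataset.
  4. Sonar Dataset.
  5. Banknote Dataset.
  6. Iris Flowers Dataset.
  7. Abalone Dataset.
  8. Ionosphere Dataset.
  9. Wheat Seeds Dataset.
  10. Boston House Price Dataset.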

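1. Swedish Auto Insurance Dataset

The Swedish Auto Insurance Dataset involves predicting the total payment for all claims, in thousands of Swedish Kronor, given the total number of claims.

It is a regression problem. It consists of 63 observations with 1 input variable and 1 output variable. The variable names are as follows:

  1. Number of claims.
  2. Total payment for all claims in thousands of Swedish Kronor.

The baseline performance of predicting the mean value is an RMSE of approximately 81 thousand Kronor.

A sample of the first 5 rows is listed below.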

108,392.5
19,46.2
13,15.7
124,422.2
40,119.4
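
As a rough sketch of how such a sample could be read into input and output columns, the snippet below parses the 5 rows above with Python's standard csv module. It assumes the last column is the output variable, which matches the variable order listed for this dataset; the helper names are made up for the example, and the mean it prints covers only these 5 rows, not the full 63 observations.

```python
import csv
import io

# First 5 rows of the dataset, copied from the sample above.
SAMPLE = """108,392.5
19,46.2
13,15.7
124,422.2
40,119.4
"""


def load_rows(text):
    """Parse comma-separated text into a list of rows of floats."""
    reader = csv.reader(io.StringIO(text))
    return [[float(value) for value in row] for row in reader if row]


def split_inputs_output(rows):
    """Treat the last column as the output and the rest as inputs."""
    inputs = [row[:-1] for row in rows]
    output = [row[-1] for row in rows]
    return inputs, output


rows = load_rows(SAMPLE)
X, y = split_inputs_output(rows)
print(len(rows), "rows,", len(X[0]), "input column(s)")
print("mean payment in this sample (thousands of Kronor):", sum(y) / len(y))
```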

Below is a scatter plot of the entire dataset.

[Figure: Swedish Auto Insurance Dataset]

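2. Wine Quality Dataset

The Wine Quality Dataset involves predicting the quality of white wines given chemical measures of each wine.

It is a multi-class classification problem, but it could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and 1 output variable. The variable names are as follows:

  1. Fixed acidity.
  2. Volatile acidity.
  3. Citric acid.
  4. Residual sugar.
  5. Chlorides.
  6. Free sulfur dioxide.
  7. Total sulfur dioxide.
  8. Density.
  9. pH.
  10. Sulphates.
  11. Alcohol.
  12. Quality (score between 0 and 10).

The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.

A sample of the first 5 rows is listed below.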

7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6

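3. Pima Indians Diabetes Dataset

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded as zero values. The variable names are as follows:

  1. Number of times pregnant.
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  3. Diastolic blood pressure (mm Hg).
  4. Triceps skinfold thickness (mm).
  5. 2-hour serum insulin (μU/ml).
  6. Body mass index (weight in kg / (height in m)^2).
  7. Diabetes pedigree function.
  8. Age (years).
  9. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.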

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
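
Because missing values are believed to be encoded as zeros, it can help to mark them explicitly before modeling. The sketch below does this for the 5 sample rows. Which columns treat zero as missing is an assumption made here (glucose, blood pressure, skinfold thickness, insulin, and body mass index, where a true zero is not physically plausible); zeros in the pregnancy count and the class column are kept as real values.

```python
import csv
import io

# First 5 rows of the dataset, copied from the sample above.
SAMPLE = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

# ASSUMPTION: zero-based indexes of columns where 0 is taken to mean "missing"
# (glucose, blood pressure, skinfold thickness, insulin, BMI).
ZERO_MEANS_MISSING = (1, 2, 3, 4, 5)


def mark_missing(rows, columns=ZERO_MEANS_MISSING):
    """Return a copy of rows with zeros in the given columns replaced by None."""
    marked = []
    for row in rows:
        marked.append([
            None if index in columns and value == 0 else value
            for index, value in enumerate(row)
        ])
    return marked


def count_missing(rows, columns=ZERO_MEANS_MISSING):
    """Count None entries per column."""
    return {c: sum(1 for row in rows if row[c] is None) for c in columns}


rows = [[float(v) for v in row] for row in csv.reader(io.StringIO(SAMPLE)) if row]
marked = mark_missing(rows)
missing_counts = count_missing(marked)
print(missing_counts)
```

In this small sample, the marked zeros fall in the skinfold thickness and insulin columns. How to handle them afterwards (dropping rows, imputing a mean, and so on) is a separate modeling decision.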

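4. Sonar Dataset

The Sonar Dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:

  1. Sonar returns at different angles.
  2. ...
  3. Class (M for mine, R for rock).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.

A sample of the first 5 rows is listed below.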

0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,0.4918,0.6552,0.6919,0.7797,0.7464,0.9444,1.0000,0.8874,0.8024,0.7818,0.5212,0.4052,0.3957,0.3914,0.3250,0.3200,0.3271,0.2767,0.4423,0.2028,0.3788,0.2947,0.1984,0.2341,0.1306,0.4182,0.3835,0.1057,0.1840,0.1970,0.1674,0.0583,0.1401,0.1628,0.0621,0.0203,0.0530,0.0742,0.0409,0.0061,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,0.6333,0.7060,0.5544,0.5320,0.6479,0.6931,0.6759,0.7551,0.8929,0.8619,0.7974,0.6737,0.4293,0.3648,0.5331,0.2413,0.5070,0.8533,0.6036,0.8514,0.8512,0.5045,0.1862,0.2709,0.4232,0.3043,0.6116,0.6756,0.5375,0.4719,0.4647,0.2587,0.2129,0.2222,0.2111,0.0176,0.1348,0.0744,0.0130,0.0106,0.0033,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,0.0881,0.1992,0.0184,0.2261,0.1729,0.2131,0.0693,0.2281,0.4060,0.3973,0.2741,0.3690,0.5556,0.4846,0.3140,0.5334,0.5256,0.2520,0.2090,0.3559,0.6260,0.7340,0.6120,0.3497,0.3953,0.3012,0.5408,0.8814,0.9857,0.9167,0.6121,0.5006,0.3210,0.3202,0.4295,0.3654,0.2655,0.1576,0.0681,0.0294,0.0241,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,0.4152,0.3952,0.4256,0.4135,0.4528,0.5326,0.7306,0.6193,0.2032,0.4636,0.4148,0.4292,0.5730,0.5399,0.3161,0.2285,0.6995,1.0000,0.7262,0.4724,0.5103,0.5459,0.2881,0.0981,0.1951,0.4181,0.4604,0.3217,0.2828,0.2430,0.1979,0.2444,0.1847,0.0841,0.0692,0.0528,0.0357,0.0085,0.0230,0.0046,0.0156,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R

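5. Banknote Dataset

The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Variance of the wavelet-transformed image (continuous).
  2. Skewness of the wavelet-transformed image (continuous).
  3. Kurtosis of the wavelet-transformed image (continuous).
  4. Entropy of the image (continuous).
  5. Class (0 for authentic, 1 for inauthentic).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.

A sample of the first 6 rows is listed below.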

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
4.3684,9.6718,-3.9606,-3.1625,0

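6. Iris Flowers Dataset

The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.

It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:

  1. Sepal length (cm).
  2. Sepal width (cm).
  3. Petal length (cm).
  4. Petal width (cm).
  5. Class (Iris Setosa, Iris Versicolour, Iris Virginica).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 26%.

A sample of the first 5 rows is listed below.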

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
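
The class column here is a string, and many learning algorithms expect numeric targets. The sketch below shows one simple way to map class names to integers while parsing a row. Only the spelling `Iris-setosa` appears in the sample above; the other two label spellings are assumptions by analogy and should be checked against the downloaded file.

```python
# ASSUMPTION: label spellings as they would appear in the data file.
# Only "Iris-setosa" is confirmed by the sample rows above.
CLASS_NAMES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def build_label_map(names):
    """Map each class name to an integer index in list order."""
    return {name: index for index, name in enumerate(names)}


def parse_row(line, label_map):
    """Split one CSV line into a list of float features and an integer class."""
    *features, label = line.strip().split(",")
    return [float(f) for f in features], label_map[label]


label_map = build_label_map(CLASS_NAMES)
features, target = parse_row("5.1,3.5,1.4,0.2,Iris-setosa", label_map)
print(features, target)
```

An unknown label raises a `KeyError`, which is a cheap way to catch spelling mismatches early.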

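7. Abalone Dataset

The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.

It is a multi-class classification problem, but it can also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:

  1. Sex (M, F, I).
  2. Length.
  3. Diameter.
  4. Height.
  5. Whole weight.
  6. Shucked weight.
  7. Viscera weight.
  8. Shell weight.
  9. Rings.

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.

A sample of the first 5 rows is listed below.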

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
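
The first column is categorical (M, F, I), so it usually needs encoding before it can be fed to a numeric model. The sketch below one-hot encodes that column and returns the ring count as an integer, which can then be used either as a class label or as a regression target. One-hot encoding is one reasonable choice rather than a requirement of the dataset, and the helper names are made up for the example.

```python
# Category order taken from the variable list above.
SEX_CATEGORIES = ("M", "F", "I")


def one_hot(value, categories=SEX_CATEGORIES):
    """Return a list with 1.0 at the position of value and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unexpected category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def parse_row(line):
    """Split one CSV line into numeric inputs and the integer ring count."""
    sex, *rest = line.strip().split(",")
    *measurements, rings = rest
    inputs = one_hot(sex) + [float(m) for m in measurements]
    return inputs, int(rings)


# First and third rows of the sample above.
inputs_m, rings_m = parse_row("M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15")
inputs_f, rings_f = parse_row("F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9")
print(inputs_m, rings_m)
print(inputs_f, rings_f)
```

After encoding, each row has 10 numeric inputs: 3 for the sex indicator and 7 for the measurements.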

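8. Ionosphere Dataset

The Ionosphere Dataset requires predicting structure in the atmosphere given radar returns that target free electrons in the ionosphere.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:

  1. 17 pairs of radar return data.
  2. ...
  3. Class (g for good, b for bad).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.

A sample of the first 5 rows is listed below.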

1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
1,0,1,-0.03365,1,0.00485,1,-0.12062,0.88965,0.01198,0.73082,0.05346,0.85443,0.00827,0.54591,0.00299,0.83775,-0.13644,0.75535,-0.08540,0.70887,-0.27502,0.43385,-0.12062,0.57528,-0.40220,0.58984,-0.22145,0.43100,-0.17365,0.60436,-0.24180,0.56045,-0.38238,g
1,0,1,-0.45161,1,1,0.71216,-1,0,0,0,0,0,0,-1,0.14516,0.54094,-0.39330,-1,-0.54467,-0.69975,1,0,0,1,0.90695,0.51613,1,1,-0.20099,0.25682,1,-0.32382,1,b
1,0,1,-0.02401,0.94140,0.06531,0.92106,-0.23255,0.77152,-0.16399,0.52798,-0.20275,0.56409,-0.00712,0.34395,-0.27457,0.52940,-0.21780,0.45107,-0.17813,0.05982,-0.35575,0.02309,-0.52879,0.03286,-0.65158,0.13290,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g

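9. Wheat Seeds Dataset

The Wheat Seeds Dataset involves predicting the species given measurements of seeds from different varieties of wheat.

It is a multi-class (3-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:

  1. Area.
  2. Perimeter.
  3. Compactness.
  4. Length of kernel.
  5. Width of kernel.
  6. Asymmetry coefficient.
  7. Length of kernel groove.
  8. Class (1, 2, 3).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 28%.

A sample of the first 5 rows is listed below.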

15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1

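10. Boston House Price Dataset

The Boston House Price Dataset involves predicting a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem. There are 506 observations with 13 input variables and 1 output variable. The variable names are as follows:

  1. CRIM: per capita crime rate by town.
  2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
  3. INDUS: proportion of non-retail business acres per town.
  4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. NOX: nitric oxides concentration (parts per 10 million).
  6. RM: average number of rooms per dwelling.
  7. AGE: proportion of owner-occupied units built prior to 1940.
  8. DIS: weighted distances to five Boston employment centers.
  9. RAD: index of accessibility to radial highways.
  10. TAX: full-value property-tax rate per $10,000.
  11. PTRATIO: pupil-teacher ratio by town.
  12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  13. LSTAT: % lower status of the population.
  14. MEDV: median value of owner-occupied homes in $1000s.

The baseline performance of predicting the mean value is an RMSE of approximately 9.21 thousand dollars.

A sample of the first 5 rows is listed below.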

0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60
0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.9671 2 242.0 17.80 392.83 4.03 34.70
0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.0622 3 222.0 18.70 394.63 2.94 33.40
0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.0622 3 222.0 18.70 396.90 5.33 36.20
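
Unlike the comma-separated samples of the other datasets, these rows are separated by spaces, so a comma-based parser will not work on them. The sketch below splits each line on whitespace and attaches the 14 variable names listed above (with AGE and TAX as the assumed identifiers for items 7 and 10). It is a minimal illustration on the 5 sample rows only.

```python
# First 5 rows of the dataset, copied from the sample above.
SAMPLE = """0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60
0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.9671 2 242.0 17.80 392.83 4.03 34.70
0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.0622 3 222.0 18.70 394.63 2.94 33.40
0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.0622 3 222.0 18.70 396.90 5.33 36.20
"""

# Variable names in column order; the last one (MEDV) is the output.
COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def parse_line(line):
    """Split a whitespace-delimited line into a dict of column name to float."""
    values = [float(v) for v in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
    return dict(zip(COLUMNS, values))


records = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
for record in records:
    print(record["RM"], record["LSTAT"], "->", record["MEDV"])
```

Calling `str.split()` with no arguments splits on any run of whitespace, so it also copes with files that pad columns with several spaces.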

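Summary

In this post, you discovered 10 top standard datasets that you can use to practice applied machine learning.

Here is your next step:

  1. Pick one dataset.
  2. Grab your favorite tool (such as Weka, scikit-learn, or R).
  3. See how much you can beat the standard scores.
  4. Report your results in the comments below.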