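1. The challenge

Data Analysis Expert Competition: clustering analysis of car products

Environment: Tianchi notebook

Background

The challenge is set in the context of competitor analysis: cars are grouped by clustering their data, so that the competitors of a given model can be found in the same cluster. It encourages learners to use vehicle data to build model profiles and to support product positioning and competitor analysis with data-driven decisions.

Competitor: a rival product in the same field, for example KFC and McDonald's.

Data

Data source: car_price.csv, containing 26 fields for 205 cars.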

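1 Car_ID Unique ID of each observation (integer)
2 Symboling Assigned insurance risk rating; +3 indicates the car is risky, -3 that it is probably quite safe (categorical)
3 CarName Car name, beginning with the manufacturer (categorical)
4 fueltype Fuel type, i.e. gas or diesel (categorical)
5 aspiration Aspiration used in the engine (categorical)
6 doornumber Number of doors (categorical)
7 carbody Body type (categorical)
8 drivewheel Type of drive wheel (categorical)
9 enginelocation Location of the engine (categorical)
10 wheelbase Wheelbase of the car (numeric)
11 carlength Length of the car (numeric)
12 carwidth Width of the car (numeric)
13 carheight Height of the car (numeric)
14 curbweight Weight of the car without occupants or luggage (numeric)
15 enginetype Type of engine (categorical)
16 cylindernumber Number of cylinders in the engine (categorical)
17 enginesize Size of the engine (numeric)
18 fuelsystem Fuel system of the car (categorical)
19 boreratio Bore ratio of the engine (numeric)
20 stroke Stroke or volume inside the engine (numeric)
21 compressionratio Compression ratio of the engine (numeric)
22 horsepower Horsepower (numeric)
23 peakrpm Peak engine speed in rpm (numeric)
24 citympg Fuel economy in the city, miles per gallon (numeric)
25 highwaympg Fuel economy on the highway, miles per gallon (numeric)
26 price(Dependent variable) Price of the car (numeric)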

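Task

Participants are to run a clustering analysis on this car data and find the competitors of Volkswagen cars. The work must be done in a notebook in the Tianchi lab and shared on the competition forum.
(Clustering is one of the common data analysis methods. It can group users, and it can also group products, as in competitor analysis. Participants may choose the number of clusters themselves based on the characteristics of the data set, and should explain the basis for the choice.)

For a given model, its competing models can be found through clustering.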

2. Data processing

2.1 Importing libraries and the data set

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import copy
# Clustering model
from sklearn.cluster import KMeans
# Preprocessing utilities: label encoding, min-max scaling, standardization
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler, MaxAbsScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score

# Evaluation metrics (calinski_harabaz_score was renamed calinski_harabasz_score in newer scikit-learn)
from sklearn.metrics import accuracy_score, make_scorer, silhouette_score, calinski_harabasz_score
# Load the data
data = pd.read_csv('./car_price.csv')

2.2 Inspecting the data

# Descriptive statistics
print(data.describe(include="all"))

# Distribution of car brands
car_brand = data["CarName"].str.split(expand=True)[0]  # split on whitespace and keep the first token, the brand
car_brand.value_counts().sort_index()  # count per brand, sorted alphabetically

2.3 Handling anomalous values

# Clean up brand names (typos and aliases)
# Volkswagen appears as vw, volkswagen and vokswagen
# Porsche appears as porcshce and porsche
# Toyota appears as toyota and toyouta
# Mazda appears as maxda and mazda
wrong_brand = {"Nissan": "nissan",
               "maxda": "mazda",
               "porcshce": "porsche",
               "toyouta": "toyota",
               "vokswagen": "volkswagen",
               "vw": "volkswagen"}

# Replace the misspelled brand in each affected row
for index, brand in car_brand.items():
    if brand in wrong_brand:
        print(index, data.loc[index, "CarName"], end=" | ")
        data.loc[index, "CarName"] = data.loc[index, "CarName"].replace(brand, wrong_brand[brand])
        print(data.loc[index, "CarName"])

# Brand counts after the fix
car_brand_new = data["CarName"].str.split(expand=True)[0]  # split on whitespace and keep the first token, the brand
car_brand_new.value_counts()
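
The row-by-row loop could also be written as a vectorized operation. The following is only a sketch under assumptions: the sample rows are made up for illustration and the helper name `normalize_brand` is hypothetical, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample rows; the real names come from car_price.csv
sample = pd.DataFrame({"CarName": ["vw rabbit", "toyouta tercel", "Nissan latio",
                                   "audi 100ls", "maxda glc"]})

wrong_brand = {"Nissan": "nissan", "maxda": "mazda", "porcshce": "porsche",
               "toyouta": "toyota", "vokswagen": "volkswagen", "vw": "volkswagen"}

def normalize_brand(names: pd.Series, mapping: dict) -> pd.Series:
    """Replace the first whitespace-separated token (the brand) using mapping."""
    # Column 0 holds the brand, column 1 the rest of the name (may be missing)
    parts = names.str.split(n=1, expand=True).reindex(columns=[0, 1])
    brand = parts[0].replace(mapping)   # exact-match replacement of whole brand tokens
    model = parts[1].fillna("")
    return (brand + " " + model).str.strip()

sample["CarName"] = normalize_brand(sample["CarName"], wrong_brand)
print(sample["CarName"].tolist())
```

Because only the first token is compared against the dictionary, a model name that happens to contain "vw" elsewhere would not be touched.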

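All records are distinct: no two rows are completely identical.

Inspecting the data shows that some cars share the same name but differ in configuration and price. Following common practice in the car market, versions of the same model are assigned a discrete trim level 1, 2, 3, 4, ..., where 1 is the highest trim (highest price).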
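
The trim-level rule described above (rows with the same name ranked by price, 1 being the most expensive) could be sketched as follows. The column name `trim_level` and the sample rows are made up for illustration and are not taken from the original notebook.

```python
import pandas as pd

# Hypothetical rows; prices are illustrative only
cars = pd.DataFrame({
    "CarName": ["toyota corolla", "toyota corolla", "toyota corolla",
                "audi 100ls", "audi 100ls", "bmw x1"],
    "price":   [6938.0, 9258.0, 7898.0, 13950.0, 17710.0, 20970.0],
})

# Rank prices within each car name, highest price first;
# method="dense" keeps the levels consecutive when prices tie
cars["trim_level"] = (
    cars.groupby("CarName")["price"]
        .rank(method="dense", ascending=False)
        .astype(int)
)
print(cars)
```

A name that occurs only once simply receives level 1.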

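Some values are written as English number words (one, two, three). These non-numeric values carry a numeric meaning and are converted to numbers.

This mainly concerns doornumber and cylindernumber.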
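
One possible way to do this conversion is an explicit dictionary lookup. The sketch below uses made-up rows, and the set of number words in `word_to_int` is an assumption about which words occur in these two columns.

```python
import pandas as pd

# Hypothetical rows standing in for the two text columns
cars = pd.DataFrame({
    "doornumber": ["two", "four", "four"],
    "cylindernumber": ["four", "six", "twelve"],
})

# Assumed vocabulary; extend it if other words appear in the data
word_to_int = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

for col in ["doornumber", "cylindernumber"]:
    converted = cars[col].map(word_to_int)
    # map() yields NaN for words missing from the dictionary, so fail loudly instead
    if converted.isna().any():
        missing = cars.loc[converted.isna(), col].unique()
        raise ValueError(f"unmapped values in {col}: {missing}")
    cars[col] = converted.astype(int)

print(cars)
```

Unlike label encoding, this keeps the natural order and spacing of the counts (twelve cylinders really is three times four).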

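As the background states, the competitors of a given model can be found through clustering. The cars can therefore be clustered separately for each body type, so we first look at the distribution of carbody.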

data['carbody'].value_counts()

2.4 Correlation analysis

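# Correlation coefficients between the numeric variables
# (numeric_only=True is needed on pandas 2.0+ because the data still contains text columns)
corr = data.corr(numeric_only=True)

# Plot style
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(20, 20))
ax = sns.heatmap(corr, square=True, annot=True)
ax.tick_params(labelsize=15)
fig.savefig("./Figures/heatmap.png", dpi=400, bbox_inches='tight')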
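
Reading strongly correlated pairs off a large heatmap is error-prone, so they could also be listed programmatically. A minimal sketch on synthetic data: the column names are borrowed from the data set, but the values are random and the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "citympg": base,
    "highwaympg": base + rng.normal(scale=0.1, size=200),  # nearly a copy of citympg
    "carheight": rng.normal(size=200),                      # unrelated column
})

corr = demo.corr()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()          # Series indexed by (row name, column name)
strong = pairs[pairs.abs() > 0.9].sort_values(ascending=False)
print(strong)
```

Each pair in `strong` is a candidate for dropping one of its two columns before clustering.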

2.5 Data preparation

# data_new is the copy used for clustering
data_new = data.copy(deep=True)

# The ID and the name carry no information for clustering, so drop them first
del data_new['car_ID']
del data_new['CarName']

# In the overall correlation analysis, compressionratio vs fueltype (-0.99) and
# highwaympg vs citympg (0.97) are both very strongly correlated, so one column of each pair is dropped
data_new = data_new.drop(columns=["fueltype", "citympg"])

# Check which body types Volkswagen offers; body types it does not offer are not clustered
volkswagen_brand = data[data['CarName'].str.split(expand=True)[0].isin(['volkswagen'])]
volkswagen_brand['carbody'].value_counts()

# Volkswagen has no hardtop model, so that body type is not clustered
# The data set is split by the four remaining body types, processed one at a time
str_carbody = 'sedan'

data_final = data_new[data_new['carbody'].isin([str_carbody])].copy(deep=True)
data_final_haveID = data[data['carbody'].isin([str_carbody])].copy(deep=True)
# Encode non-numeric labels as integers
le = LabelEncoder()
object_columns = data_final.select_dtypes(include=["object"]).columns
data_final[object_columns] = data_final[object_columns].apply(le.fit_transform)

# Min-max normalization

mms = MinMaxScaler()
data_final = mms.fit_transform(data_final)
train_data = data_final
pd.DataFrame(data_final).describe()

# Standardization (alternative, not used)
#K_scaler = StandardScaler()
#data_new = K_scaler.fit_transform(data_new)
#train_data = data_new
#pd.DataFrame(data_new).describe()
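
Min-max normalization rescales each column to the range [0, 1] with x' = (x - min) / (max - min). As a small check under assumed toy values (the numbers below are illustrative, not rows from the data set), the formula can be compared with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two features on very different scales
X = np.array([[2548.0, 111.0],
              [2823.0, 154.0],
              [2337.0, 102.0]])

# Column-wise formula: subtract the column minimum, divide by the column range
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

Putting all features on the same scale matters here because K-means relies on Euclidean distance, which would otherwise be dominated by large-valued columns.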

3. Determining the number of clusters K

3.1 Elbow method

# Elbow method

SSE = []
ks = range(1, 31)
for k in ks:
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(train_data)
    # Sum of squared distances from each sample to its cluster centre
    SSE.append(kmeans.inertia_)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(ks, SSE, "o-")
ax.set_xlabel("k")
ax.set_ylabel("SSE")
ax.set_xticks(ks)
ax.set_title("SSE curve")
fig.savefig("KM_SSE_curve.png", dpi=400, bbox_inches='tight')
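
The `inertia_` value plotted in the elbow curve is the sum of squared distances between each sample and the centre of its assigned cluster. A minimal sketch that recomputes it by hand, on synthetic two-blob data chosen purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs (not the car data)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
               rng.normal(loc=3.0, scale=0.3, size=(30, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Look up the centre assigned to every sample, then sum the squared distances
assigned_centres = kmeans.cluster_centers_[kmeans.labels_]
sse = ((X - assigned_centres) ** 2).sum()

print(sse, kmeans.inertia_)
```

SSE always decreases as k grows, which is why the method looks for the bend in the curve rather than its minimum.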

3.2 Silhouette coefficient

# Silhouette coefficient
SLE = [0]  # placeholder for k = 1, where the silhouette coefficient is undefined
ks = range(1, 31)
for k in ks[1:]:
    kmeans = KMeans(n_clusters=k, random_state=1)
    kmeans.fit(train_data)
    # Mean silhouette coefficient over all samples
    SLE.append(silhouette_score(train_data, kmeans.labels_, metric='euclidean'))

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(ks, SLE, 'o-')
ax.set_xlabel("k")
ax.set_ylabel("Silhouette score")
ax.set_xticks(ks)
ax.set_title("SLE curve")
fig.savefig("KM_SLE_curve.png", dpi=400, bbox_inches='tight')
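
Besides reading the curve, the k with the highest silhouette score can be picked directly. The sketch below assumes synthetic data with three well-separated blobs, and the range of k is arbitrary; on the real car data the maximum may be far less clear-cut.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three blobs around made-up centres
rng = np.random.default_rng(2)
centres = [(0, 0), (5, 0), (0, 5)]
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2)) for c in centres])

scores = {}
for k in range(2, 8):  # the silhouette coefficient needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric="euclidean")

# k with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

A higher score means samples sit closer to their own cluster than to the nearest other cluster.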

3.3 Calinski-Harabasz index

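# Score the clustering for different values of K
# (calinski_harabaz_score was renamed calinski_harabasz_score in newer scikit-learn)
for i in range(2, 13):
    kmeans = KMeans(n_clusters=i, random_state=1).fit(train_data)
    score = calinski_harabasz_score(train_data, kmeans.labels_)
    print('k = %d, Calinski-Harabasz score: %f' % (i, score))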
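
The Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k); higher is better. A minimal sketch that recomputes it by hand on synthetic data (blob centres chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

# Synthetic data: three blobs around made-up centres
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(25, 2))
               for c in [(0, 0), (4, 4), (0, 4)]])
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

n, k = X.shape[0], len(np.unique(labels))
overall_mean = X.mean(axis=0)
between = 0.0  # dispersion of cluster means around the overall mean
within = 0.0   # dispersion of samples around their own cluster mean
for q in np.unique(labels):
    members = X[labels == q]
    centre = members.mean(axis=0)
    between += len(members) * ((centre - overall_mean) ** 2).sum()
    within += ((members - centre) ** 2).sum()

ch_manual = (between / (k - 1)) / (within / (n - k))
print(ch_manual, calinski_harabasz_score(X, labels))
```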

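4. K-means clustering analysis

Overview

Besides the elbow method and the silhouette coefficient, the number of clusters can also be chosen according to the practical situation.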

# K-means clustering and its result
k = 10
kmeans = KMeans(n_clusters=k, init="k-means++", random_state=1)
car_label = kmeans.fit_predict(train_data)
print('silhouette_score = %.4f \n' % silhouette_score(train_data, car_label, metric="euclidean"))

data_final_haveID.loc[:, "label"] = car_label
print("Cluster sizes with k = %d:" % k)
print(data_final_haveID.loc[:, "label"].value_counts().sort_index())
vw_label = {}
print("Original " + str_carbody + " data:")
for index, row in data_final_haveID.iterrows():
    if row["CarName"].startswith("volkswagen"):
        cname = row["CarName"]
        cbody = row["carbody"]
        lbl = row["label"]
        print(f"{index}\t{cname:<30}{cbody:<10}\t{lbl}")
        # Record which cluster each volkswagen model falls into
        if lbl not in vw_label:
            vw_label[lbl] = [(cname, cbody)]
        else:
            vw_label[lbl].append((cname, cbody))
Output:
Original sedan data:
182 volkswagen rabbit sedan 6
183 volkswagen 1131 deluxe sedan sedan 6
184 volkswagen model 111 sedan 2
185 volkswagen type 3 sedan 5
186 volkswagen 411 (sw) sedan 5
187 volkswagen super beetle sedan 2
188 volkswagen dasher sedan 5
191 volkswagen rabbit sedan 5
192 volkswagen rabbit custom sedan
# Print the competitors found in each cluster that contains a volkswagen model
print("\nFinal " + str_carbody + " clustering results:")
for key in vw_label.keys():
    print(f"Cluster {key}:")
    cluster_car = data_final_haveID.loc[car_label == key, ["CarName", "carbody"]]  # cars assigned to the same cluster
    cmp_car = cluster_car[~cluster_car["CarName"].str.contains("volkswagen")]  # exclude the volkswagen models themselves
    for car in vw_label[key]:
        print(car)
        print(cmp_car[cmp_car["carbody"] == car[1]])
Output:
Final sedan clustering results:
Cluster 6:
('volkswagen rabbit', 'sedan')
CarName carbody
5 audi fox sedan
10 bmw 320i sedan
12 bmw x1 sedan
42 honda civic (auto) sedan
44 isuzu D-Max sedan
89 nissan versa sedan
90 nissan gt-r sedan
91 nissan rogue sedan
94 nissan leaf sedan
163 toyota corolla liftback sedan
165 toyota celica gt liftback sedan
('volkswagen 1131 deluxe sedan', 'sedan')
CarName carbody
5 audi fox sedan
10 bmw 320i sedan
12 bmw x1 sedan
42 honda civic (auto) sedan
44 isuzu D-Max sedan
89 nissan versa sedan
90 nissan gt-r sedan
91 nissan rogue sedan
94 nissan leaf sedan
163 toyota corolla liftback sedan
165 toyota celica gt liftback sedan
Cluster 2:
('volkswagen model 111', 'sedan')
CarName carbody
63 mazda glc deluxe sedan
66 mazda rx-7 gs sedan
158 toyota corona sedan
174 toyota celica gt sedan
('volkswagen super beetle', 'sedan')
CarName carbody
63 mazda glc deluxe sedan
66 mazda rx-7 gs sedan
158 toyota corona sedan
174 toyota celica gt sedan
('volkswagen rabbit custom', 'sedan')
CarName carbody
63 mazda glc deluxe sedan
66 mazda rx-7 gs sedan
158 toyota corona sedan
174 toyota celica gt sedan
Cluster 5:
('volkswagen type 3', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan
('volkswagen 411 (sw)', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan
('volkswagen dasher', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan
('volkswagen rabbit', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan

The remaining body types are handled in the same way, so the code is not repeated here.
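
To avoid copying the notebook cells for each body type, the per-body-type steps could be wrapped in one function. This is only a sketch under assumptions, not the author's code: the function name `find_competitors`, its parameters, and the toy rows with invented values are all hypothetical, and k=2 is chosen only because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def find_competitors(data, body, brand="volkswagen", k=2,
                     drop_cols=("car_ID", "CarName")):
    """Cluster all cars of one body type; for each cluster that contains
    `brand`, return the names of the other brands' cars in that cluster."""
    subset = data[data["carbody"] == body].copy()
    features = subset.drop(columns=[c for c in drop_cols if c in subset.columns])
    # Label-encode text columns, then min-max scale, mirroring section 2.5
    for col in features.select_dtypes(exclude=["number"]).columns:
        features[col] = LabelEncoder().fit_transform(features[col])
    scaled = MinMaxScaler().fit_transform(features)
    subset["label"] = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(scaled)

    is_brand = subset["CarName"].str.startswith(brand)
    result = {}
    for lbl in sorted(subset.loc[is_brand, "label"].unique()):
        rivals = subset[(subset["label"] == lbl) & ~is_brand]
        result[int(lbl)] = rivals["CarName"].tolist()
    return result

# Hypothetical rows with invented values, only to exercise the function
toy = pd.DataFrame({
    "car_ID": range(1, 9),
    "CarName": ["volkswagen rabbit", "toyota corolla", "honda civic", "mazda glc",
                "bmw x1", "audi 100ls", "porsche macan", "volkswagen dasher"],
    "carbody": ["sedan"] * 7 + ["wagon"],
    "horsepower": [60, 62, 65, 68, 180, 170, 200, 70],
    "price": [7000, 7200, 7400, 7600, 30000, 28000, 36000, 9000],
})

competitors = find_competitors(toy, "sedan", k=2)
print(competitors)
```

With the real data one would call it once per body type, for example in a loop over 'sedan', 'hatchback', 'wagon' and 'convertible', choosing k for each subset as described in section 3.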

5. Final results

Final sedan clustering results:
Cluster 6
('volkswagen rabbit', 'sedan')
CarName carbody
5 audi fox sedan
10 bmw 320i sedan
12 bmw x1 sedan
42 honda civic (auto) sedan
44 isuzu D-Max sedan
89 nissan versa sedan
90 nissan gt-r sedan
91 nissan rogue sedan
94 nissan leaf sedan
163 toyota corolla liftback sedan
165 toyota celica gt liftback sedan
('volkswagen 1131 deluxe sedan', 'sedan')
CarName carbody
5 audi fox sedan
10 bmw 320i sedan
12 bmw x1 sedan
42 honda civic (auto) sedan
44 isuzu D-Max sedan
89 nissan versa sedan
90 nissan gt-r sedan
91 nissan rogue sedan
94 nissan leaf sedan
163 toyota corolla liftback sedan
165 toyota celica gt liftback sedan
Cluster 2
('volkswagen model 111', 'sedan')
CarName carbody
63 mazda glc deluxe sedan
66 mazda rx-7 gs sedan
158 toyota corona sedan
174 toyota celica gt sedan
('volkswagen super beetle', 'sedan')
CarName carbody
63 mazda glc deluxe sedan
66 mazda rx-7 gs sedan
158 toyota corona sedan
174 toyota celica gt sedan
('volkswagen rabbit custom', 'sedan')
CarName carbody
63 mazda glc deluxe sedan
66 mazda rx-7 gs sedan
158 toyota corona sedan
174 toyota celica gt sedan
Cluster 5
('volkswagen type 3', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan
('volkswagen 411 (sw)', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan
('volkswagen dasher', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan
('volkswagen rabbit', 'sedan')
CarName carbody
3 audi 100 ls sedan
4 audi 100ls sedan
6 audi 100ls sedan
41 honda civic sedan
88 mitsubishi mirage g4 sedan
101 nissan dayz sedan
103 nissan otti sedan
133 saab 99le sedan
135 saab 99gle sedan
143 subaru baja sedan
173 toyota corolla sedan
176 toyota corolla sedan

Final convertible clustering results:
Cluster 0
('volkswagen dasher', 'convertible')
CarName carbody
0 alfa-romero giulia convertible
1 alfa-romero stelvio convertible
172 toyota cressida convertible


Final wagon clustering results:
Cluster 2
('volkswagen dasher', 'wagon')
CarName carbody
93 nissan titan wagon
97 nissan note wagon
130 renault 12tl wagon
146 subaru trezia wagon
147 subaru tribeca wagon
148 subaru dl wagon
149 subaru dl wagon


Final hatchback clustering results:
Cluster 1
('volkswagen rabbit', 'hatchback')
CarName carbody
2 alfa-romero Quadrifoglio hatchback
46 isuzu D-Max hatchback
55 mazda 626 hatchback
56 mazda glc hatchback
57 mazda rx-7 gs hatchback
58 mazda glc 4 hatchback
81 mitsubishi g4 hatchback
104 nissan teana hatchback
106 nissan clipper hatchback
125 porsche macan hatchback
129 porsche cayenne hatchback
131 renault 5 gtl hatchback
132 saab 99e hatchback
134 saab 99le hatchback
166 toyota corolla tercel hatchback
169 toyota starlet hatchback
171 toyota corolla hatchback
178 toyota corolla liftback hatchback
179 toyota corona hatchback