2018 World Cup -> Can the data tell us who wins Man of the Match? Machine learning interpretability -> feature importance


Basic statistics for the 2018 men's World Cup (128 team-level records, one per team for each of the 64 matches)

Full data analysis notebook: https://github.com/adi0229/ML-DL/blob/master/fifa2018.ipynb

The dataset contains the following columns:

Index(['Date', 'Team', 'Opponent', 'Goal Scored', 'Ball Possession %',
'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides',
'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes',
'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card',
'Yellow & Red', 'Red', 'Man of the Match', '1st Goal', 'Round', 'PSO',
'Goals in PSO', 'Own goals', 'Own goal Time'],
dtype='object')

Random forest classifier (baseline) and feature importance

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# `path` is the directory that holds the CSV; it must be defined beforehand.
data = pd.read_csv(path + 'FIFA_2018_Statistics.csv')

# Target: True when the team supplied the Man of the Match.
y = (data['Man of the Match'] == "Yes")

# Feature engineering: keep only the integer-valued (int64) columns as features.
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]

# Default split: 75% training, 25% validation.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=0).fit(train_X, train_y)

predictions = rf.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
accuracy_score: 0.59375
import eli5
from eli5.sklearn import PermutationImportance

# Permutation importance is measured on the validation set, not the training set.
perm = PermutationImportance(rf, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())

Random forest classifier (tuned) and how the feature importances change

# Same forest as before, but with 500 trees instead of the library default.
rf = RandomForestClassifier(random_state=0, n_estimators=500).fit(train_X, train_y)
predictions = rf.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
accuracy_score: 0.71875

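Analysis: after the random forest's validation accuracy rises from about 59% to about 72%:

  • Saves, pass accuracy, and shots on target become more important

  • Corners and total distance covered become less important

    Both shifts are consistent with common football tactical sense.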
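
To put the two accuracy figures in perspective, the short sketch below converts them back into counts of correct predictions. It assumes the 128-row dataset described at the top and scikit-learn's default `test_size=0.25` in `train_test_split`, which would give a 32-row validation set; the variable names are illustrative only.

```python
import math
from fractions import Fraction

n_rows = 128                       # team-level records in the dataset
n_val = math.ceil(n_rows * 0.25)   # assumed default test_size=0.25

baseline_acc = Fraction("0.59375")  # accuracy reported for the baseline forest
tuned_acc = Fraction("0.71875")     # accuracy reported with n_estimators=500

# Convert each accuracy into a number of correctly classified validation rows.
baseline_correct = baseline_acc * n_val
tuned_correct = tuned_acc * n_val

print(n_val)                             # 32
print(baseline_correct, tuned_correct)   # 19 23
print(tuned_correct - baseline_correct)  # 4
```

Under these assumptions the improvement amounts to four additional correct predictions out of 32, so the shift in feature rankings between the two forests is best read as suggestive rather than conclusive.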

XGBoost classifier (tuned) and feature importance

from xgboost import XGBRFClassifier

# Note: XGBRFClassifier is XGBoost's random-forest-style classifier.
# Newer XGBoost releases replace `silent` with `verbosity`.
xgb = XGBRFClassifier(silent=False,
                      scale_pos_weight=1,
                      learning_rate=0.01,
                      colsample_bytree=0.4,
                      subsample=0.8,
                      n_estimators=1000,
                      reg_alpha=0.3,
                      max_depth=4,
                      gamma=10).fit(train_X, train_y)
predictions = xgb.predict(val_X)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))
accuracy_score: 0.71875

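XGBoost finds that goals scored is the only important feature.
This is blunt, but arguably closer to football common sense: a team that scores more is more likely to win, and the winning side is more likely to supply the Man of the Match. The other statistics matter much less.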

perm_xgb = PermutationImportance(xgb, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm_xgb, feature_names=val_X.columns.tolist())

Feature importance (Permutation Importance)

Ref: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html?highlight=PermutationImportance
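
The idea behind permutation importance can be sketched in a few lines of plain Python: shuffle one feature column at a time and measure how much the model's accuracy drops. This is a minimal illustration of the general technique, not eli5's implementation. The toy rows, the labels, the stand-in `toy_model` function, and the helper names are all hypothetical and are not taken from the World Cup dataset.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction equals the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=1):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; every other column stays in place.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + (value,) + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(base - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: each row is (goals_scored, corners).
rows = [(0, 5), (1, 3), (2, 7), (3, 2), (0, 8), (4, 4), (1, 6), (2, 1)]
labels = [goals >= 2 for goals, _ in rows]

def toy_model(row):
    """Stand-in model that looks only at goals and ignores corners."""
    return row[0] >= 2

goal_imp, corner_imp = permutation_importance(toy_model, rows, labels, n_features=2)
print("goals:", goal_imp, "corners:", corner_imp)
```

Because the stand-in model never reads the corners column, shuffling it cannot change any prediction, so its importance is exactly zero. Shuffling the goals column breaks the link between input and label, which shows up as a positive drop in accuracy.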