2018 World Cup -> Can the match stats predict the Man of the Match? Machine learning interpretability -> feature importance

Basic statistics for the 2018 men's World Cup (128 rows, one per team per match, covering 64 matches)

Full data analysis notebook: https://github.com/adi0229/ML-DL/blob/master/fifa2018.ipynb

The dataset contains the following columns:

Index(['Date', 'Team', 'Opponent', 'Goal Scored', 'Ball Possession %',
       'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides',
       'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes',
       'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card',
       'Yellow & Red', 'Red', 'Man of the Match', '1st Goal', 'Round', 'PSO',
       'Goals in PSO', 'Own goals', 'Own goal Time'],
      dtype='object')

Random forest classifier (baseline) and feature importance

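import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

path = ''  # directory holding the CSV; the empty string assumes the current working directory
data = pd.read_csv(path + 'FIFA_2018_Statistics.csv')
# Binary target: True when the team produced the Man of the Match
y = (data['Man of the Match'] == "Yes")
# Feature engineering: keep only the integer-valued (np.int64) columns as training features
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
# The default split holds out 25% of the rows for validation
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
rf = RandomForestClassifier(random_state=0).fit(train_X, train_y)
predictions = rf.predict(val_X)
# accuracy_score expects (y_true, y_pred)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))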

accuracy_score: 0.59375

import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(rf, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())
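
To make clear what the permutation importance weights measure, here is a minimal standard-library sketch of the idea: shuffle one feature column at a time and record how much the validation accuracy drops. The toy data, `toy_model`, and the helper `permutation_importance_manual` are hypothetical illustrations, not part of eli5 or of the original analysis.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance_manual(model, rows, labels, n_features, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, which breaks its link to the labels
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled_rows = [row[:j] + [value] + row[j + 1:]
                             for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled_rows, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 plays the role of "goals", feature 1 is pure noise
data_rng = random.Random(42)
rows = [[data_rng.randint(0, 4), data_rng.random()] for _ in range(200)]
labels = [row[0] >= 2 for row in rows]

def toy_model(row):
    # Stand-in for a fitted classifier: it only ever looks at feature 0
    return row[0] >= 2

importances = permutation_importance_manual(toy_model, rows, labels, n_features=2)
print(importances)
```

A feature the model never uses gets an importance of exactly zero, because shuffling it cannot change any prediction. A feature the model relies on shows a clear drop in accuracy.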

Random forest classifier (tuned) and the change in feature importance

# Same model, with the number of trees raised to 500
rf = RandomForestClassifier(random_state=0, n_estimators=500).fit(train_X, train_y)
predictions = rf.predict(val_X)
# accuracy_score expects (y_true, y_pred)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))

accuracy_score: 0.71875

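Analysis: after the random forest's accuracy rose from about 59% to about 72%:

Saves, pass accuracy, and shots on target became more important.
Corners and total distance covered became less important.

This is consistent with common football tactical sense.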
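
To put the two accuracy figures in context, the sketch below converts them into counts of correct predictions and a rough normal-approximation interval. It assumes the validation set has 32 rows (128 rows with the default `test_size=0.25`). That size is inferred rather than stated in the analysis, although both reported accuracies are exact multiples of 1/32. The helper `accuracy_interval` is a hypothetical name used only for illustration.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Accuracy plus a normal-approximation (roughly 95%) interval."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

n_val = 32  # assumption: 128 rows * default test_size of 0.25

results = {}
for label, acc in [("baseline", 0.59375), ("n_estimators=500", 0.71875)]:
    correct = round(acc * n_val)  # recover the count of correct predictions
    p, low, high = accuracy_interval(correct, n_val)
    results[label] = (correct, low, high)
    print(f"{label}: {correct}/{n_val} correct, "
          f"accuracy {p:.3f}, approx. interval [{low:.2f}, {high:.2f}]")
```

Under that assumption the two runs differ by four validation predictions and their intervals overlap, so the shift in feature importance is best read as suggestive rather than conclusive.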

XGBoost classifier (tuned) and feature importance

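# Note: XGBRFClassifier is XGBoost's random-forest-style estimator;
# XGBClassifier is the gradient-boosted one.
from xgboost import XGBRFClassifier
xgb = XGBRFClassifier(verbosity=1,  # replaces the deprecated `silent` flag
                      scale_pos_weight=1,
                      learning_rate=0.01,
                      colsample_bytree=0.4,
                      subsample=0.8,
                      n_estimators=1000,
                      reg_alpha=0.3,
                      max_depth=4,
                      gamma=10).fit(train_X, train_y)
predictions = xgb.predict(val_X)
# accuracy_score expects (y_true, y_pred)
print("accuracy_score: " + str(accuracy_score(val_y, predictions)))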

accuracy_score: 0.71875

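The XGBoost model found goals scored to be the only important feature.
This is blunt, but it fits football common sense even better. A team that scores more is more likely to win, and the winning side is more likely to produce the Man of the Match. The other statistics matter little.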

perm_xgb = PermutationImportance(xgb, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm_xgb, feature_names = val_X.columns.tolist())

Feature importance (Permutation Importance)
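
As a possible alternative to eli5, scikit-learn (version 0.22 and later) provides `sklearn.inspection.permutation_importance`, which performs the same kind of computation. The sketch below runs it on synthetic data rather than the World Cup CSV. The feature names `signal`, `noise_a`, and `noise_b` are made up for illustration, and only the first column determines the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): the label depends on column 0 only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] > 0
feature_names = ["signal", "noise_a", "noise_b"]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_X, train_y)

# Shuffle each column n_repeats times on the validation set and
# record the resulting drop in the model's score
result = permutation_importance(model, val_X, val_y, n_repeats=10, random_state=1)

# Print the features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(f"{feature_names[i]}: "
          f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

The returned object exposes `importances_mean` and `importances_std`, which correspond to the weight and spread columns that `eli5.show_weights` displays.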

Ref: https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html?highlight=PermutationImportance