A summary of methods for handling imbalanced data.
Data-based Methods
Sampling methods
Undersampling:
Oversampling: SMOTE
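The two sampling strategies above can be sketched in plain Python. This is only an illustrative sketch: the notes do not name an undersampling method, so random undersampling is assumed here, and `smote_like` is a simplified, hypothetical stand-in for SMOTE (interpolation between a minority sample and one of its nearest minority neighbours), not a reference implementation. The function names, the toy data, and the parameter `k` are all assumptions.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop majority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Toy data (assumed): 20 majority points, 4 minority points.
majority = [(float(i), float(i % 3)) for i in range(20)]
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

under_major, under_minor = random_undersample(majority, minority)
new_points = smote_like(minority, n_new=16)
```

Undersampling shrinks the majority class to 4 samples, while the SMOTE-style step grows the minority class by 16 synthetic points that lie on segments between existing minority samples.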
Algorithm-based Methods
Cost Sensitive Learning
Threshold adjustment
By default, a sample is predicted positive when its score $s > 0.5$. This is reasonable when the ratio of positive to negative samples is $1:1$. However, the threshold should be adjusted as the ratio between the numbers of positive and negative samples changes.
Let $m^+$ be the number of positive samples and $m^-$ the number of negative samples. The decision rule should then satisfy $\frac{s}{1-s} > \frac{m^+}{m^-}$, that is, a sample is predicted positive when $s > \frac{m^+}{m^+ + m^-}$.
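A minimal sketch of how such a rule might be applied, assuming it takes the common threshold-moving form in which the odds of the score are compared against the class ratio. The helper names `rescaled_threshold` and `predict_positive` and the sample counts are hypothetical.

```python
def rescaled_threshold(m_pos, m_neg):
    """Threshold on s equivalent to s / (1 - s) > m_pos / m_neg (assumed rule)."""
    return m_pos / (m_pos + m_neg)

def predict_positive(score, m_pos, m_neg):
    """Predict positive when the score exceeds the class-ratio threshold."""
    return score > rescaled_threshold(m_pos, m_neg)

# With balanced classes the rule reduces to the usual s > 0.5.
balanced = rescaled_threshold(50, 50)

# With 10 positives and 90 negatives the threshold drops,
# so a score of 0.3 is already treated as positive.
imbalanced = rescaled_threshold(10, 90)
decision_imbalanced = predict_positive(0.3, 10, 90)
decision_balanced = predict_positive(0.3, 50, 50)
```

The fewer positive samples there are, the lower the threshold becomes, which compensates for a classifier that tends to score the rare class too low.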
In practice, we choose the optimal threshold from the ROC curve.
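One way this selection could look is sketched below. The notes do not say which criterion defines "optimal", so this sketch assumes Youden's J statistic (TPR minus FPR), and it treats a sample as positive when its score is at least the candidate threshold. The function names and the toy scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cut-off."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        # A sample counts as predicted positive when its score is >= t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points

def best_threshold(scores, labels):
    """Pick the cut-off that maximises Youden's J = TPR - FPR (assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]

# Toy imbalanced data (assumed): 3 positives, 7 negatives.
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.80]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
threshold = best_threshold(scores, labels)  # 0.35 on this toy data
```

On this toy data the selected threshold is well below 0.5, because lowering it recovers the third positive sample at the cost of a single false positive.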
Lessons from model training
For problems with extreme class imbalance, the more effective approaches include resampling and meta-learning, among others.