Imbalance Learning

A summary of methods for handling imbalanced data.

Data-based Methods

sampling methods

Undersampling:

Oversampling: SMOTE
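As a rough illustration of the two sampling strategies named above, the sketch below shows random undersampling of the majority class and a SMOTE-style interpolation between minority neighbours. The function names (`random_undersample`, `smote_like_oversample`), the toy Gaussian data, and the choice of `k=5` are assumptions made for this example. It is a simplified sketch, not a reference SMOTE implementation.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Keep every minority sample and an equally sized random subset of the majority."""
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return kept, list(minority)


def smote_like_oversample(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D data (hypothetical): 100 majority points, 10 minority points.
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
minority = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(10)]

under_major, under_minor = random_undersample(majority, minority, rng)
new_points = smote_like_oversample(minority, n_new=90, k=5, rng=rng)
```

Undersampling discards majority information, while the interpolation step only generates points on segments between existing minority samples, so neither adds genuinely new information about the minority class.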

Algorithm-based Methods

Cost Sensitive Learning

Threshold adjustment

Generally, a sample is predicted positive when its score $s > 0.5$. This is reasonable when the ratio of positive to negative samples is $1:1$. However, the threshold should be adjusted as the ratio between the numbers of positive and negative samples changes.

Let $m^+$ denote the number of positive samples and $m^-$ the number of negative samples. The decision rule should then account for the ratio $m^+ / m^-$.
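The inequality for this rule is not shown in these notes. A common formulation of threshold moving, given here as an assumption rather than as the exact formula intended above, compares the predicted odds with the observed class odds:

$$
\frac{s}{1 - s} > \frac{m^+}{m^-}
\quad\Longleftrightarrow\quad
s > \frac{m^+}{m^+ + m^-}
$$

When $m^+ = m^-$ this reduces to the usual $s > 0.5$. With, say, $m^+ = 100$ and $m^- = 900$, the threshold drops to $0.1$.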

In practice, we choose the optimal threshold from the ROC curve.
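The notes do not say which criterion defines "optimal". One common choice, assumed here, is Youden's J statistic, which picks the threshold maximising TPR minus FPR. The sketch below computes the ROC points in plain Python. The helper names (`roc_points`, `best_threshold`) and the toy scores are made up for illustration. In practice, `sklearn.metrics.roc_curve` returns the same kind of FPR, TPR, and threshold arrays.

```python
def roc_points(labels, scores):
    """Return (threshold, fpr, tpr) triples, one per distinct score.

    A sample is predicted positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((thr, fp / n_neg, tp / n_pos))
    return points


def best_threshold(labels, scores):
    """Pick the threshold maximising Youden's J = TPR - FPR (an assumed criterion)."""
    thr, _fpr, _tpr = max(roc_points(labels, scores), key=lambda p: p[2] - p[1])
    return thr


# Toy imbalanced data (hypothetical): 2 positives, 8 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.35, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02]
threshold = best_threshold(labels, scores)
```

On this toy data, every score is below 0.5, so the default threshold would predict no positives at all, whereas the selected threshold recovers both positives at the cost of one false positive.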

Model training notes

For problems with extreme class imbalance, the more effective approaches include resampling and meta-learning, among others.

Resampling