Goals:

Select samples from under-represented classes when the class distribution is imbalanced;
Select hard samples.

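Active learning addresses the setting where only a limited number of samples can be labeled: it mines the most valuable samples from a large unlabeled pool in order to improve model performance. Its methods fall into three scenarios: membership query synthesis, stream-based selective sampling, and pool-based sampling.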

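Given a batch of unlabeled samples, methods can be grouped by selection strategy into four families:

Uncertainty-based method: define and measure the uncertainty of each sample, then select the uncertain ones;
Diversity-based method: select a diverse set of samples from the unlabeled pool so that they represent the distribution of the whole pool;
Expected model change: select the samples that would cause the largest change to the current model parameters;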
Query By Committee: train several different models on the labeled set and look at their disagreement on examples in the unlabeled set.
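
To make the uncertainty-based family concrete, here is a minimal sketch of three commonly used uncertainty scores computed from a model's class probabilities. The function names and the toy `pool_probs` array are illustrative assumptions, not taken from any of the papers below.

```python
import numpy as np

def least_confidence(probs):
    """1 - max class probability; higher means more uncertain."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Gap between the two largest class probabilities; smaller means more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs, eps=1e-12):
    """Shannon entropy of the predictive distribution; higher means more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_most_uncertain(probs, k):
    """Indices of the k pool samples with the highest entropy."""
    return np.argsort(-entropy(probs))[:k]

# Toy pool: 4 unlabeled samples, 3 classes (made-up numbers).
pool_probs = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
])
picked = select_most_uncertain(pool_probs, k=2)
```

In this toy pool the near-uniform rows are picked first, which matches the intuition that the model is least sure about them.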

Question

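1) In a neural network, why not use the softmax score directly as the uncertainty estimate?
Training pushes the softmax scores of the training samples toward one-hot vectors, so the outputs become overconfident. Without regularization, the output distribution is therefore not a reliable measure of uncertainty.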
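
A small numerical illustration of this effect, under the simplifying assumption that training mainly increases the magnitude of the logits: scaling the same logits leaves the predicted class unchanged but drives the softmax output toward one-hot. The logit values and scale factors are made up.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])

# The same ranking of classes at two different logit magnitudes.
p_low = softmax(1.0 * logits)
p_high = softmax(20.0 * logits)

for name, p in (("scale 1", p_low), ("scale 20", p_high)):
    print(name, p.round(3), "argmax =", int(p.argmax()))
```

The predicted class is identical in both cases, yet the reported confidence differs sharply, so the raw softmax maximum says little about how reliable the prediction is.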

Conclusion

Uncertainty sampling is a strong algorithm even in large batch sizes.

Papers

Learning Loss for Active Learning

Yoo, D. and Kweon, I.S., 2019. Learning Loss for Active Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 93-102).

Figure 1: Module structure

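As shown in Figure 1, the method consists of a target model $\Theta_{target}$ and a loss prediction module $\Theta_{loss}$. The output of $\Theta_{target}$ is $\hat{y}=\Theta_{target}(x)$, and the output of $\Theta_{loss}$ is $\hat{l}=\Theta_{loss}(h)$, where $h$ is a set of hidden-layer feature maps taken from $\Theta_{target}$.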

Figure 2: One iteration cycle of active learning

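The main steps of active learning are:

In a real scenario, $\mathcal{U}_{N}$ denotes the unlabeled data pool. Randomly sample $K$ data points from it and have them labeled by hand to build the initial labeled set $\mathcal{L}_{K}^{0}$; the unlabeled pool shrinks to $\mathcal{U}_{N-K}^{0}$.
With the initial labeled data $\mathcal{L}_{K}^{0}$, jointly train the initial models $\Theta_{target}^0$ and $\Theta_{loss}^0$.
After training, evaluate every sample in the unlabeled pool to obtain the data-loss pairs $\{(x, \hat{l}) | x \in \mathcal{U}_{N-K}^{0}\}$, then have the samples with the top-k predicted losses labeled by hand.

Repeat these steps cycle after cycle.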
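
The cycle above can be sketched as a generic loop. This is a sketch under stated assumptions, not the authors' implementation: `train_fn` and `predict_loss_fn` are hypothetical callables standing in for joint training of $\Theta_{target}$ and $\Theta_{loss}$, and the toy versions below (a 1-D threshold classifier and a distance-based "predicted loss") exist only to make the loop runnable.

```python
import numpy as np

def active_learning_loop(pool_x, oracle, train_fn, predict_loss_fn,
                         init_k, query_k, cycles, seed=0):
    rng = np.random.default_rng(seed)
    n = len(pool_x)
    # Step 1: randomly sample K points and label them to form the initial labeled set.
    labeled = [int(i) for i in rng.choice(n, size=init_k, replace=False)]
    unlabeled = sorted(set(range(n)) - set(labeled))
    labels = {i: oracle(pool_x[i]) for i in labeled}
    for _ in range(cycles):
        # Step 2: train on the current labeled set (stand-in for joint training).
        y = np.array([labels[i] for i in labeled])
        model = train_fn(pool_x[labeled], y)
        # Step 3: predict a loss for every unlabeled sample and label the top-k.
        pred_loss = predict_loss_fn(model, pool_x[unlabeled])
        top = np.argsort(-pred_loss)[:query_k]
        chosen = [unlabeled[j] for j in top]
        for i in chosen:
            labels[i] = oracle(pool_x[i])
        labeled.extend(chosen)
        chosen_set = set(chosen)
        unlabeled = [i for i in unlabeled if i not in chosen_set]
    return labeled, unlabeled

def toy_train(x, y):
    # Hypothetical stand-in model: a threshold halfway between the class means.
    if y.min() == y.max():
        return float(x.mean())
    return float((x[y == 0].mean() + x[y == 1].mean()) / 2)

def toy_predict_loss(threshold, x):
    # Hypothetical stand-in for the loss prediction module:
    # points near the decision threshold are assumed to have high loss.
    return -np.abs(x - threshold)

pool_x = np.linspace(-1.0, 1.0, 50)
labeled, unlabeled = active_learning_loop(
    pool_x, oracle=lambda v: int(v > 0), train_fn=toy_train,
    predict_loss_fn=toy_predict_loss, init_k=5, query_k=3, cycles=4)
```

Keeping the query strategy behind `predict_loss_fn` means the same loop can host other scoring rules (entropy, margin, and so on) without changes.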

Deep Bayesian Active Learning with Image Data

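Difficulties of using Active Learning (AL) with deep learning:

AL is usually applied to small datasets;
AL query functions rely on model uncertainty, which deep learning models do not readily represent.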
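
One common way to obtain model uncertainty from a deep network is to run several stochastic forward passes (for example with dropout left on at test time) and look at how the predictions vary. The sketch below assumes such passes have already been collected into an array `mc_probs` of shape (passes, samples, classes); the function names and toy values are illustrative assumptions, not details taken from these notes.

```python
import numpy as np

def predictive_entropy(mc_probs, eps=1e-12):
    """Entropy of the prediction averaged over stochastic passes."""
    mean_p = mc_probs.mean(axis=0)
    return -(mean_p * np.log(mean_p + eps)).sum(axis=1)

def bald(mc_probs, eps=1e-12):
    """Mutual information between prediction and model parameters:
    entropy of the mean prediction minus the mean per-pass entropy."""
    per_pass_entropy = -(mc_probs * np.log(mc_probs + eps)).sum(axis=2)
    return predictive_entropy(mc_probs, eps) - per_pass_entropy.mean(axis=0)

# Toy example: 2 passes, 2 samples, 2 classes (made-up numbers).
mc_probs = np.array([
    [[0.5, 0.5], [1.0, 0.0]],  # pass 1
    [[0.5, 0.5], [0.0, 1.0]],  # pass 2
])
scores = bald(mc_probs)
```

Both toy samples have the same averaged prediction, but only the second one has passes that disagree with each other, so only it gets a high mutual-information score.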

Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds



## Minimum-Margin Active Learning
TODO

Active Learning on Imbalanced Data

TODO

Active Learning for Convolutional Neural Networks: A Core-Set Approach

ICLR2018

A Robust Zero-Sum Game Framework for Pool-based Active Learning

Zhu, D., Li, Z., Wang, X., Gong, B., & Yang, T. (2019, April). A Robust Zero-Sum Game Framework for Pool-based Active Learning. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 517-526).

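Highlights

Easy to optimize, with linear or even logarithmic time complexity
Combines information from both labeled and unlabeled images, avoiding sampling bias
Robust to imbalanced data distributions
Grounded in recent machine learning theory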

Active Learning for Convolutional Neural Networks: A Core-Set Approach (2018-ICLR)

code
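Idea: choose the sampled points so that the Euclidean distance between sampled and unsampled points in the model's feature space is minimized.
Pros: performs well with deep neural networks and large datasets;
Cons: (1) performance degrades when the number of classes is large; (2) ineffective on high-dimensional data;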
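
A distance-based selection of this kind is often approximated with a greedy k-center rule: repeatedly pick the unlabeled point that is farthest from everything selected so far. The sketch below is a minimal NumPy version under that assumption; `k_center_greedy`, its arguments, and the toy features are hypothetical, and `labeled_idx` must be non-empty.

```python
import numpy as np

def k_center_greedy(features, labeled_idx, budget):
    """Greedily pick `budget` points, each time taking the point whose
    distance to its nearest already-selected point is largest."""
    centers = features[labeled_idx]
    # Distance from every point to its nearest labeled point.
    dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    min_dist = dists.min(axis=1)
    selected = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # The new point becomes a center; update nearest-center distances.
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

# Toy feature vectors (made-up numbers); point 0 is already labeled.
features = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
    [10.0, 0.0],
])
selected = k_center_greedy(features, labeled_idx=[0], budget=2)
```

In practice `features` would be embeddings from the network rather than raw inputs, and the pairwise distance computation would need batching for large pools.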

Variational Adversarial Active Learning (2019-ICCV)

code
