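├── 图像人脸OCR语音算法模型整理.docx
└── README.md

/图像人脸OCR语音算法模型整理.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/taylorlu/MachineLearningDOC/HEAD/图像人脸OCR语音算法模型整理.docx

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Algorithms for Image, Face, OCR, and Speech
##### [Outline: Image & Audio & Machine Learning](#0)
##### 1. [General Object Detection/Recognition](#1)
##### 2. [Specific Object Detection, Recognition, and Retrieval (CBIR)](#2)
##### 3. [Object Tracking](#3)
##### 4. [Object Segmentation](#4)
##### 5. [Face Detection](#5)
##### 6. [Face Alignment](#6)
##### 7. [Face Recognition](#7)
##### 8. [Face Reconstruction](#8)
##### 9. [OCR (Wild Scene & Handwritten)](#9)
##### 10. [Automatic Speech Recognition/Speech to Text](#10)
##### 11. [Speaker Recognition/Identification/Verification](#11)
##### 12. [Speaker Diarization](#12)
##### 13. [Text To Speech](#13)
##### 14. [Voice Conversion](#14)
##### 15. [Face Attributes (Age, Gender)](#15)

**Outline: Image & Audio & Machine Learning**
+ Image:
```
1. Transforms: rotation, scaling, translation, affine, projective
```
Rotation and scaling can be viewed through an SVD decomposition; for a 2D image they correspond to a 2x2 matrix.
Translation is an addition, so to express it as a matrix product an extra column is appended; combining the first three transforms therefore gives a 2x3 or 3x3 matrix.
An affine transform adds flipping and shear, preserves the collinearity of points and the parallelism of lines, and has 6 degrees of freedom (DoF).
A projective transform is not linear and has 8 degrees of freedom.
See [Transformations](https://courses.cs.washington.edu/courses/csep576/11sp/pdf/Transformations.pdf).
Manipulating such transforms can rectify distorted text for OCR, for example with the thin-plate spline [TPS algorithm](https://profs.etsmtl.ca/hlombaert/thinplates).
```
2. Convolution: first-order and second-order operators
```
First-order operators include Roberts, Sobel, and Prewitt. Because they take only a first derivative, each kernel detects edges in one direction at a time.
Second-order operators include Laplace, LoG, and DoG. They are the first step of corner (keypoint) detection and are not robust to noise.
Convolution is the multiply-then-sum operation from signal processing. In a CNN the kernels are trainable parameters, but because most of them are shared the parameter count stays small and Dropout is generally unnecessary. Trained kernels are mostly asymmetric, so a CNN is not rotation invariant, and pooling layers contribute only a little scale and translation invariance. The original remedy was data augmentation. At NIPS 2015, [spatial transformer networks](https://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf) proposed a network trained by backpropagation that learns the transform matrix automatically, which greatly reduces the dependence on data augmentation.
```
3. Otsu thresholding (binarization), watershed segmentation
   Discrete Fourier transform (DFT), discrete cosine transform (DCT), wavelet transform
   First- and second-order image moments, shape descriptors
   Color spaces (RGB, YUV, HSV)
   The items above are used mostly in video coding and image analysis
```
```
4. Image blending
```
Image blending can serve as post-processing for deep learning, for example compositing a segmented object onto another background or swapping faces. A commonly used method is [Poisson Image Editing](https://www.cs.virginia.edu/~connelly/class/2014/comp_photo/proj2/poisson.pdf).

+ Audio:
```
1. WAV and MFCC
```
Speech carries time-domain information, so frequency analysis uses the windowed short-time Fourier transform (STFT); different window functions produce different spectral values per band. MFCC is a cepstrum on the mel frequency scale, that is, a nonlinear log cepstrum. For ASR and SV (speaker verification), the WAV file is usually converted to MFCC first, although some models such as WaveNet and SincNet work directly on the waveform. The advantage of MFCC is that it holds both time and frequency information: the short windows compress the signal into a 2D array that an ordinary CNN can process.
```
2. Song identification (audio fingerprinting) and query by humming
```
Traditional methods studied previously, based on MFCC and an inverted index:
1. A Highly Robust Audio Fingerprinting System
2. ROBUST AUDIO FINGERPRINT EXTRACTION ALGORITHM
3. An Industrial-Strength Audio Search Algorithm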
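
The inverted-index lookup behind these fingerprinting methods can be sketched roughly as follows. This is a minimal illustration, not the method of any of the papers listed: the integer hashes, the track names, and the helper functions `build_index` and `identify` are all hypothetical, and a real system would derive the hashes from spectral features.

```python
from collections import Counter, defaultdict


def build_index(tracks):
    """Map each fingerprint hash to the (track_id, time_offset) pairs where it occurs."""
    index = defaultdict(list)
    for track_id, hashes in tracks.items():
        for offset, h in enumerate(hashes):
            index[h].append((track_id, offset))
    return index


def identify(index, query_hashes):
    """Vote for (track, time shift) pairs; a true match piles votes on one consistent shift."""
    votes = Counter()
    for q_offset, h in enumerate(query_hashes):
        for track_id, t_offset in index.get(h, ()):
            votes[(track_id, t_offset - q_offset)] += 1
    if not votes:
        return None
    (track_id, shift), count = votes.most_common(1)[0]
    return track_id, shift, count


# Hypothetical integer hashes standing in for real audio fingerprints
tracks = {
    "song_a": [11, 42, 7, 93, 58, 21, 64],
    "song_b": [5, 42, 88, 17, 93, 30, 2],
}
index = build_index(tracks)
result = identify(index, [7, 93, 58])  # a short excerpt taken from song_a
```

Voting on the time shift rather than on the track alone is what keeps an isolated hash collision (here, `93` also appears in `song_b`) from producing a false match.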
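Deep-learning-based retrieval:
A Tutorial on Deep Learning for Music Information Retrieval


+ Statistical learning:
```
1. SVM (support vector machine)
```
A well-known textbook, available online in both Chinese and English:
An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
It covers kernel functions, VC dimension and maximum-margin generalization, the Lagrangian dual problem under KKT inequality constraints, and the SMO algorithm for solving the Lagrange multipliers, which makes it a fairly complete text.

```
2. AdaBoost
```
AdaBoost is an iterative algorithm that builds a strong learner from weak learners: as long as each weak classifier is slightly better than random guessing, the iterations gradually produce a strong classifier. Its strength is that it does not overfit easily; its weakness is sensitivity to noise.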
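
As a rough illustration of that iteration, here is a minimal sketch of discrete AdaBoost with one-dimensional threshold stumps as the weak learners. The toy data and the function names `train_adaboost` and `predict` are assumptions made for this example, and the code is not taken from the papers cited in this section.

```python
import math


def train_adaboost(xs, ys, rounds=3):
    """Discrete AdaBoost on 1-D inputs with labels in {-1, +1}, using threshold stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # entries are (alpha, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(rounds):
        best = None
        # Weak learner: the stump with the lowest weighted error
        for thr in thresholds:
            for polarity in (1, -1):
                preds = [polarity if x >= thr else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, polarity, preds)
        err, thr, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, polarity))
        # Raise the weight of misclassified samples, then renormalize
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold can separate
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = train_adaboost(xs, ys, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

On this toy set, every single stump misclassifies at least three points, while the weighted vote of three stumps reproduces all ten labels.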
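1. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
2. Multi-class AdaBoost

```
3. Decision tree
```
Used mainly in data mining. Common algorithms for building the tree include ID3, C4.5, and CART. The drawback is instability, especially when the sample counts are imbalanced.

```
4. Bayesian networks, random forests
```

```
5. EM/GMM
```
A clustering model with latent variables. The latent variables (latent distribution) are the weight of each probability distribution and the probability that each sample belongs to each distribution.
The EM algorithm has two steps. The E-step fixes the model parameters (the "known variables") and uses Jensen's inequality to obtain the extremum of the log-likelihood, updating the latent variables. The M-step fixes the latent variables and finds the extremum of the overall likelihood, updating the parameters.
A GMM assumes the distributions are Gaussian; the parameters are the means and variances, and the latent variables are the weights of the Gaussians.
EM is sensitive to initial values and cannot guarantee a global optimum. It has many uses, such as clustering and the UBM voiceprint model.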
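
The two steps can be made concrete with a minimal sketch of EM for a one-dimensional Gaussian mixture. The data, the initial values, and the names `gaussian_pdf` and `em_gmm_1d` are assumptions made for illustration; the input lists are updated in place.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given initial values."""
    k = len(means)
    for _ in range(iterations):
        # E-step: probability that each sample belongs to each component
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate weights, means, and variances from those probabilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return means, variances, weights


# Two well-separated toy clusters, around 0 and around 5
data = [-0.2, 0.0, 0.1, 0.3, -0.1, 4.8, 5.0, 5.2, 5.1, 4.9]
means, variances, weights = em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

Starting the two means on the same side of the data, for example at `[4.0, 4.5]`, is an easy way to observe the sensitivity to initial values noted above.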
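Solving EM with a neural network:
1. Neural Expectation Maximization
https://github.com/sjoerdvansteenkiste/Neural-EM

```
6. Unsupervised clustering: KMeans and Mean Shift, plus graph-based Spectral Clustering
```

```
7. Clustering without a preset number of clusters: DBSCAN, Chinese Whispers
```
+ Deep learning:
Deep learning refers to models built entirely on neural networks, including CNNs for the spatial domain and RNNs for the temporal domain. The emphasis falls on three areas: network design, loss function design, and optimizers.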
**Network design**: representative components include CNN, dilated (atrous) convolution, depthwise separable convolution, Dropout, RNN/LSTM/GRU, Attention/Self-Attention/Transformer, ResNet, the Inception family, and SqueezeNet/MobileNet/ShuffleNet.
**Loss functions**: representative examples include Triplet loss, Center loss, SphereFace, ArcFace, and AMSoftmax.
**Optimizers**: mainly SGD, Momentum, Adagrad, Adadelta, Adam, RMSprop, AdaBound, and ADMM. Other techniques that speed up convergence and prevent overfitting include batch normalization and regularization.


1. **General Object Detection/Recognition**
+ Traditional methods:
```
1. Bag-of-words models: SIFT/SURF + KMeans + SVM
2. Sparse coding: LLC
3. Aggregated features: Fisher Vector/VLAD
4. Deformable part models: DPM, which uses HOG/Latent SVM
5. Keypoint detection and description: recent deep-learning methods such as LIFT, DELF, and LF-Net; their drawback is low speed
```
  - Related papers:
```
1. Visual Object Recognition, Kristen Grauman
2. Locality-constrained Linear Coding for Image Classification
3. Fisher Kernels on Visual Vocabularies for Image Categorization
4. Improving the Fisher Kernel for Large-Scale Image Classification
5. Aggregating local descriptors into a compact image representation
6. Object Detection with Discriminatively Trained Part Based Models
7. LIFT: Learned Invariant Feature Transform
8. Large-Scale Image Retrieval with Attentive Deep Local Feature
9. LF-Net: Learning Local Features from Images
```
  - Related open-source projects:
    * http://www.vlfeat.org
    * https://github.com/rbgirshick/voc-dpm
    * https://github.com/cbod/cs766-llc
    * https://github.com/nashory/DeLF-pytorch
    * https://github.com/vcg-uvic/lf-net-release

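
The encoding step of the bag-of-words pipeline listed above (local descriptors, then KMeans, then SVM) can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-D "descriptors" and the three-word codebook are made up, whereas a real pipeline would cluster 128-dimensional SIFT descriptors with KMeans to get the codebook and pass the resulting histogram to an SVM.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))


def bow_histogram(descriptors, codebook):
    """Count how often each visual word occurs, then L1-normalize the counts."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Hypothetical codebook (normally the KMeans centroids) and local descriptors of one image
codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 1.0), (0.0, 9.0), (1.0, 0.0)]
histogram = bow_histogram(descriptors, codebook)
```

The histogram has a fixed length regardless of how many keypoints the image produced, which is what lets a standard classifier consume it.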

+ Deep learning:
```
RCNN/SPPNet/Faster RCNN, the YOLO family, SSD, R-FCN, RetinaNet, CFENet
```
  - Related papers:
```
1. Rich feature hierarchies for accurate object detection and semantic segmentation
2. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition
3. Fast R-CNN
4. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
5. You Only Look Once: Unified, Real-Time Object Detection
6. YOLO9000: Better, Faster, Stronger
7. YOLOv3: An Incremental Improvement
8. SSD: Single Shot MultiBox Detector
9. R-FCN: Object Detection via Region-based Fully Convolutional Networks
10. Focal Loss for Dense Object Detection
11. CFENet: An Accurate and Efficient Single-Shot Object Detector for Autonomous Driving
```
  - Related open-source projects:
    * https://github.com/rbgirshick/rcnn
    * https://github.com/rbgirshick/fast-rcnn
    * https://github.com/rbgirshick/py-faster-rcnn
    * https://github.com/balancap/SSD-Tensorflow
    * https://github.com/chuanqi305/MobileNet-SSD
    * https://github.com/gliese581gg/YOLO_tensorflow
    * https://github.com/choasup/caffe-yolo9000
    * https://github.com/qqwweee/keras-yolo3
    * https://github.com/daijifeng001/R-FCN
    * https://github.com/YuwenXiong/py-R-FCN
    * https://github.com/daijifeng001/caffe-rfcn
    * https://github.com/facebookresearch/Detectron


2. **Specific Object Detection, Recognition, and Retrieval (Specific Object Detection/CBIR)**
- Specific-object recognition targets one particular image, so large-sample training is not possible and no training or learning is required. Most approaches rely on hand-crafted features such as keypoints. For rigid objects, keypoint matches can be passed through SVD and RANSAC to compute the affine transform matrix, which in turn gives the orientation of the object's boundary. There are also neural-network approaches such as R-MAC and NetVLAD, but they all use pretrained backbone models.
- Keypoint matching: descriptors compared by Euclidean distance include SIFT/SURF, and descriptors compared by Hamming distance include AKAZE/FREAK. Euclidean search can use a KD-Tree or other algorithms such as hnsw and FALCONN; Hamming search uses LSH.
- Fisher Vector/VLAD features can be converted with random hyperplanes into binary codes and then searched by Hamming distance.
- Retrieval: open-source libraries for Euclidean-distance search include hnsw, FALCONN, and Faiss.
+ Related papers:
```
1. Aggregating Deep Convolutional Features for Image Retrieval
2. PARTICULAR OBJECT RETRIEVAL WITH INTEGRAL MAX-POOLING OF CNN ACTIVATIONS
3. Deep Learning of Binary Hash Codes for Fast Image Retrieval
4. Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks
5. Bags of Local Convolutional Features for Scalable Instance Search
6. Deep Image Retrieval: Learning global representations for image search
7. Region-Based Image Retrieval Revisited
```
+ Related open-source projects:
  * https://github.com/Relja/netvlad
  * https://github.com/uzh-rpg/netvlad_tf_open
  * https://github.com/nmslib/hnswlib
  * https://github.com/facebookresearch/faiss
  * https://github.com/FALCONN-LIB/FALCONN
  * https://github.com/imatge-upc/retrieval-2016-icmr


3. **Object Tracking**
- Optical flow
- Kalman filter
- Mean shift

These trackers are all implemented in OpenCV. Most of them target rigid objects and are not suitable for objects such as faces.
Deep-learning methods:
- CFNet
+ Related papers:
```
End-to-end representation learning for Correlation Filter based tracking
```
+ Related open-source projects:
  * https://github.com/bertinetto/cfnet


4. **Object Segmentation**
- The current mainstream methods are all based on neural networks.
- FCN, SegNet, PSPNet, Mask R-CNN, the DeepLab family, RefineNet, DeeperLab
+ Related papers:
```
1. Fully Convolutional Networks for Semantic Segmentation
2. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
3. Pyramid Scene Parsing Network
4. Mask R-CNN
5. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
6. Rethinking Atrous Convolution for Semantic Image Segmentation
7. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
8. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation
9. DeeperLab: Single-Shot Image Parser
10. MobileNetV2: Inverted Residuals and Linear Bottlenecks
```

+ Related open-source projects:
  * https://github.com/shekkizh/FCN.tensorflow
  * https://github.com/alexgkendall/caffe-segnet
  * https://github.com/hszhao/PSPNet
  * https://github.com/Vladkryvoruchko/PSPNet-Keras-tensorflow
  * https://github.com/matterport/Mask_RCNN
  * https://github.com/sthalles/deeplab_v3
  * https://github.com/DrSleep/tensorflow-deeplab-resnet
  * https://github.com/guosheng/refinenet
  * https://github.com/DrSleep/light-weight-refinenet


5. **Face Detection**
+ Traditional methods: feature extraction plus a classifier
```
Features are mainly HOG, HAAR, and similar; classifiers include Adaboost, SVM, and Cascade.
Commonly used open-source libraries: OpenCV, Dlib, and others.
```
+ Deep learning:
```
MTCNN, PyramidBox, HR, Face R-CNN, SSH, RSA, S3FD, FaceBoxes
```
+ Related papers:
```
1. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks
2. PyramidBox: A Context-assisted Single Shot Face Detector.
3. Finding Tiny Faces
4. Face R-CNN
5. SSH: Single Stage Headless Face Detector
6. Recurrent Scale Approximation for Object Detection in CNN
7. S3FD: Single Shot Scale-invariant Face Detector
8. FaceBoxes: A CPU Real-time Face Detector with High Accuracy
```
+ Related open-source projects:
  * https://github.com/kpzhang93/MTCNN_face_detection_alignment
  * https://github.com/EricZgw/PyramidBox
  * https://github.com/cydonia999/Tiny_Faces_in_Tensorflow
  * https://github.com/mahyarnajibi/SSH
  * https://github.com/sciencefans/RSA-for-object-detection
  * https://github.com/louis-she/sfd.pytorch
  * https://github.com/sfzhang15/FaceBoxes


6. **Face Alignment**
+ Some face detection algorithms integrate landmark alignment, in which case the loss functions of the two tasks are combined as a weighted sum during training. Alignment comes in 2D and 3D forms: 2D uses only two-dimensional information, while 3D requires a three-dimensional model and can predict the pose of the face.
+ 2D landmark alignment: DCNN, MTCNN, TCDCN, LAB
+ 3D landmark alignment: 3DDFA, DenseReg, FAN, PRNet, PIFA
+ Related papers:
```
1. Facial Landmark Detection by Deep Multi-task Learning
2. Deep Convolutional Network Cascade for Facial Point Detection
3. Look at Boundary: A Boundary-Aware Face Alignment Algorithm
4. Face Alignment Across Large Poses: A 3D Solution
5. Pose-Invariant Face Alignment via CNN-Based Dense 3D Model Fitting
6. Dense Face Alignment
7. DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild
8. How far are we from solving the 2D & 3D Face Alignment problem
9. Learning Dense Facial Correspondences in Unconstrained Images
10. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network
```
+ Related open-source projects:
  * https://github.com/zhzhanp/TCDCN-face-alignment
  * https://github.com/wywu/LAB
  * https://github.com/cleardusk/3DDFA
  * https://github.com/ralpguler/DenseReg
  * https://github.com/YadiraF/PRNet
  * http://cvlab.cse.msu.edu/project-pifa.html


7. **Face Recognition**
+ Without neural networks: GaussianFace
+ Deep learning: most of the work concerns loss function design
+ DeepFace, the DeepID family, VGGFace, FaceNet, CenterLoss, MarginalLoss, SphereFace, ArcFace, AMSoftmax
+ Related papers:
```
1. Surpassing Human-Level Face Verification Performance on LFW with GaussianFace
2. DeepFace: Closing the Gap to Human-Level Performance in Face Verification
3. Deep Learning Face Representation from Predicting 10,000 Classes
4. Deep Learning Face Representation by Joint Identification-Verification
5. DeepID3: Face Recognition with Very Deep Neural Networks
6. Deep Face Recognition
7. FaceNet: A Unified Embedding for Face Recognition and Clustering
8. A Discriminative Feature Learning Approach for Deep Face Recognition
9. Marginal Loss for Deep Face Recognition
10. SphereFace: Deep Hypersphere Embedding for Face Recognition
11. ArcFace: Additive Angular Margin Loss for Deep Face Recognition
12. Additive Margin Softmax for Face Verification
```
+ Related open-source projects:
  * https://github.com/jangerritharms/GaussianFace
  * http://www.robots.ox.ac.uk/~vgg/software/vgg_face/
  * https://github.com/davidsandberg/facenet
  * https://github.com/wy1iu/sphereface
  * https://github.com/xialuxi/arcface-caffe
  * https://github.com/deepinsight/insightface


8. **Face Reconstruction**
+ Nearly all methods are 3D-based. Once a face is reconstructed, it can be used for pose estimation and face swapping. Some face-swapping algorithms need multiple face images to train a GAN.
+ PRNet, VRN, Face2Face
+ Related papers:
```
1. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications
2. 3D Face Reconstruction with Geometry Details from a Single Image
3. Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network
4. CNN-based Real-time Dense Face Reconstruction with Inverse-rendered Photo-realistic Face Images
5. Large Pose 3D Face Reconstruction from a Single Image via Direct Volumetric CNN Regression
6. Deep Video Portraits
7. VDub: Modifying Face Video of Actors for Plausible Visual Alignment to a Dubbed Audio Track
8. paGAN: Real-time Avatars Using Dynamic Textures
9. On Face Segmentation, Face Swapping, and Face Perception
10. Extreme 3D Face Reconstruction: Looking Past Occlusions
```
+ Related open-source projects:
  * https://github.com/YadiraF/PRNet
  * https://github.com/AaronJackson/vrn
  * https://github.com/deepfakes/faceswap
  * https://github.com/datitran/face2face-demo
  * https://github.com/YuvalNirkin/face_swap
  * https://github.com/anhttran/extreme_3d_faces


9. **OCR (Wild Scene & Handwritten)**
+ OCR involves locating and segmenting text in a scene and then recognizing the characters. The traditional approach segments characters with a vertical projection histogram and feeds each character to a classifier one by one. With the arrival of CNNs, RNNs, the CTC dynamic-programming algorithm, and attention mechanisms, the mainstream model today is CNN+RNN+CTC, which segments the sequence into units automatically in the same way speech recognition does. Detection boxes are usually horizontal; correcting skew additionally requires a Hough transform to straighten the text direction. Recent years have brought many detectors for text regions of different shapes: segmentation-based ones such as PixelLink and TextSnake, regression-based ones such as TextBoxes, DMPNet, and RSDD, and hybrids of the two such as SSTD. There are also end-to-end detection and recognition models such as FOTS, EAA, Mask TextSpotter, and STN-OCR.
+ Text region detection:
CTPN, EAST, TextBoxes++, AdvancedEast, TextSnake, Mask TextSpotter, DMPNet, RSDD, LOMO, PSENet, Pixel-Anchor
+ Related papers:
```
1. Detecting Text in Natural Image with Connectionist Text Proposal Network
2. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
3. Single Shot Scene Text Retrieval
4. EAST: An Efficient and Accurate Scene Text Detector
5. DeepTextSpotter: An End-to-End Trainable Scene Text Localization and Recognition Framework
6. Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
7. Multi-Oriented Text Detection with Fully Convolutional Networks
8. Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network
9. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
10. An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes
11. Rotation-Sensitive Regression for Oriented Scene Text Detection
12. Character Region Awareness for Text Detection
13. Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes
14. Shape Robust Text Detection with Progressive Scale Expansion Network
15. Pixel-Anchor: A Fast Oriented Scene Text Detector with Combined Networks
16. Overview: https://github.com/whitelok/image-text-localization-recognition
17. Challenge: http://rrc.cvc.uab.es
18. An end-to-end textspotter with explicit alignment and attention
19. STN-OCR: A single Neural Network for Text Detection and Text Recognition
```
+ Text recognition:
For distorted text in the wild, detected regions range from coarse rectangles to precise polygons, and they usually need rectification before recognition. Rectification falls broadly into two directions: character-level approaches such as TextSnake and Char-Net, and TPS+STN networks that learn the parameters of a multi-point rectification automatically, as described in many papers.
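
For the CNN+RNN+CTC recognizers mentioned in the OCR overview, the output is one score vector per time step, and the simplest way to turn it into text is greedy CTC decoding: take the best label per frame, collapse repeats, and drop blanks. The sketch below is a minimal illustration; the score values, the label table, and the function name `ctc_greedy_decode` are made up for the example.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Best label per frame, then collapse repeated labels and remove blanks."""
    best_path = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    decoded = []
    previous = blank
    for idx in best_path:
        # A label is emitted only when it differs from the previous frame and is not blank
        if idx != previous and idx != blank:
            decoded.append(labels[idx])
        previous = idx
    return "".join(decoded)


# Hypothetical per-frame scores over the labels [blank, "t", "o"]
labels = ["-", "t", "o"]
frame_scores = [
    [0.1, 0.8, 0.1],    # t
    [0.7, 0.2, 0.1],    # blank
    [0.1, 0.1, 0.8],    # o
    [0.2, 0.1, 0.7],    # o (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank separates the two o's
    [0.1, 0.2, 0.7],    # o
]
text = ctc_greedy_decode(frame_scores, labels)
```

The blank between the repeated frames is what allows a doubled character to survive the collapsing step.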
CRNN, GRCNN, CRAFT, ASTER, MORAN, ESIR, FAN, and AON, which supports vertical text recognition
+ Related papers:
```
1. Gated Recurrent Convolution Neural Network for OCR
2. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
3. What is wrong with scene text recognition model comparisons? dataset and model analysis
4. ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
5. Char-Net: A Character-Aware Neural Network for Distorted Scene Text Recognition
6. MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition
7. SEE: Towards Semi-Supervised End-to-End Scene Text Recognition
8. ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification
9. AON: Towards Arbitrarily-Oriented Text Recognition
10. Simultaneous Recognition of Horizontal and Vertical Text in Natural Images
11. Focusing Attention: Towards Accurate Text Recognition in Natural Images
```
+ Related open-source projects:
  * https://github.com/eragonruan/text-detection-ctpn
  * https://github.com/MhLiao/TextBoxes_plusplus
  * https://github.com/lluisgomez/single-shot-str
  * https://github.com/huoyijie/AdvancedEAST
  * https://github.com/MichalBusta/DeepTextSpotter
  * https://github.com/Jianfeng1991/GRCNN-for-OCR
  * https://github.com/princewang1994/TextSnake.pytorch
  * https://github.com/clovaai/deep-text-recognition-benchmark
  * https://github.com/bgshih/aster
  * https://github.com/liuheng92/tensorflow_PSENet
  * https://github.com/whai362/PSENet
  * https://github.com/Canjie-Luo/MORAN_v2
  * https://github.com/Bartzi/see
  * https://github.com/huizhang0110/AON
  * https://github.com/Bartzi/stn-ocr

+ Handwriting recognition:
Handwriting is much harder than printed text because of the variety of writing styles. Several papers published at NIPS use 2D LSTM-RNNs, later followed by a faster variant with an attention mechanism; these methods support automatic line segmentation and alignment of a handwritten paragraph. A multi-step method subsequently appeared at ECCV.
+ Related papers:
```
1. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
2. Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention
3. Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition
4. Start, Follow, Read: End-to-End Full-Page Handwriting Recognition
5. Handwriting Recognition with Large Multidimensional Long Short-Term Memory Recurrent Neural Networks
6. Handwriting Recognition of Historical Documents with few labeled data
7. Measuring Human Perception to Improve Handwritten Document Transcription
8. Learning Spatial-Semantic Context with Fully Convolutional Recurrent Network for Online Handwritten Chinese Text Recognition
9. Gated Convolutional Recurrent Neural Networks for Multilingual Handwriting Recognition
10. Joint Recognition of Handwritten Text and Named Entities with a Neural End-to-end Model
```
+ Related open-source projects:
  * https://github.com/cwig/start_follow_read
  * https://github.com/0x454447415244/HandwritingRecognitionSystem
  * http://www.tbluche.com/scan_attend_read.html


10. **Automatic Speech Recognition/Speech to Text**
+ Traditional approach: the GMM-HMM model with the Viterbi algorithm
+ Deep learning: extract short-time MFCC spectral features from the WAV signal, then model them with a CNN followed by an LSTM recurrent network, trained with the CTC loss.
GRU-CTC, DFCNN, DFSMN, DeepSpeech, CLDNN
+ Related papers:
```
1. DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION
2. Deep Speech: Scaling up end-to-end speech recognition
3. CONVOLUTIONAL, LONG SHORT-TERM MEMORY, FULLY CONNECTED DEEP NEURAL NETWORKS
```
+ Related open-source projects:
  * https://github.com/buriburisuri/speech-to-text-wavenet
  * https://github.com/Kyubyong/tacotron
  * https://github.com/PaddlePaddle/DeepSpeech


11. **Speaker Recognition/Identification/Verification**
+ The main difficulties in voiceprint recognition are utterance duration, text independence, open-set comparison, and background noise. Deep-learning models based on d-vector and x-vector, together with loss designs such as TE2E/GE2E, currently have the advantage on short utterances. The state of the art among traditional methods is i-vector with PLDA channel compensation, and every deep-learning model cites the EER of i-vector as its comparison baseline. Earlier methods include UBM-GMM and JFA channel compensation, but they need a large corpus of samples from different channels. Open-source frameworks for the traditional methods include Kaldi, ALIZE, SIDEKIT, and pyannote-audio. Deep-learning methods include d-vector, x-vector, j-vector (text dependent), and models combined with E2E loss functions. There are also models based on GhostVLAD, and SincNet, which works directly on the waveform.
+ Related open-source projects:
  * http://www-lium.univ-lemans.fr/sidekit/
  * https://alize.univ-avignon.fr/
  * http://www.kaldi-asr.org/
  * https://github.com/rajathkmp/speaker-verification
  * https://github.com/wangleiai/dVectorSpeakerRecognition
  * https://github.com/Janghyun1230/Speaker_Verification
  * https://github.com/pyannote/pyannote-audio
  * https://github.com/WeidiXie/VGG-Speaker-Recognition
  * https://github.com/mravanelli/SincNet


12. **Speaker Diarization**
- Speaker diarization builds on speaker recognition, so the quality of speaker recognition determines the quality of the segmentation; accurate detection of speaker change points matters as well. The audio is first split with voice activity detection (VAD). The simplest VAD thresholds the amplitude; with background sound, a different VAD algorithm has to be designed. Change points can be detected with the Bayesian information criterion (BIC). The final step is clustering, which decides which segments belong to the same speaker; for unsupervised algorithms, prior knowledge of the number of speakers is especially important. There are now quite a few deep-learning frameworks, such as Google's recent UIS-RNN (in effect an unconventional clustering method) and S4D from the LIUM team in France.
+ Related papers:
```
1. FULLY SUPERVISED SPEAKER DIARIZATION
2. SPEAKER DIARIZATION WITH LSTM
3. S4D: Speaker Diarization Toolkit in Python
```
+ Related open-source projects:
  * https://github.com/google/uis-rnn
  * https://github.com/wq2012/SpectralCluster
  * https://projets-lium.univ-lemans.fr/s4d


13. **Text To Speech**
- For text to speech, the traditional method concatenates recorded speech units; speech synthesized this way sounds stiff and lacks intonation. Baidu, Google, Facebook, and others have since released many deep-learning methods. The usual pipeline is an encoder followed by a decoder, after which the Griffin-Lim algorithm or the autoregressive WaveNet model converts the MFCC features into a waveform.
WaveNet family (MFCC-->WAVE), DeepVoice family, Tacotron family, VoiceLoop, ClariNet

+ Related papers:
```
1. VOICELOOP: VOICE FITTING AND SYNTHESIS VIA A PHONOLOGICAL LOOP
2. TACOTRON: TOWARDS END-TO-END SPEECH SYNTHESIS
3. NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS
4. Deep Voice: Real-time Neural Text-to-Speech
5. Deep Voice 2: Multi-Speaker Neural Text-to-Speech
6. DEEP VOICE 3: 2000-SPEAKER NEURAL TEXT-TO-SPEECH
7. WAVENET: A GENERATIVE MODEL FOR RAW AUDIO
8. Parallel WaveNet: Fast High-Fidelity Speech Synthesis
9. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech
10. SAMPLE EFFICIENT ADAPTIVE TEXT-TO-SPEECH
11. FastSpeech: Fast, Robust and Controllable Text to Speech
```
+ Related open-source projects:
  * https://github.com/ibab/tensorflow-wavenet
  * https://github.com/keithito/tacotron
  * https://github.com/Kyubyong/tacotron
  * https://github.com/c1niv/Voiceloop_TensorFlow
  * https://github.com/israelg99/deepvoice
  * https://github.com/andabi/parallel-wavenet-vocoder
  * https://github.com/xcmyz/FastSpeech


14. **Voice Conversion**
- Voice conversion is essentially the multi-speaker version of TTS: the same text yields a different waveform depending on the speaker. Most models add a speaker embedding vector to the network architecture, as in DeepVoice2/DeepVoice3 and Tacotron2; some even add it to the vocoder, for example WaveNet.
+ Related open-source projects:
  * https://github.com/r9y9/deepvoice3_pytorch
  * https://github.com/Kyubyong/deepvoice3
  * https://github.com/Rayhane-mamah/Tacotron-2
  * https://github.com/GSByeon/multi-speaker-tacotron-tensorflow


15. **Face Attributes (Age and Gender Estimation)**
- The classic DEX model and the compact SSR-Net model
+ Related papers:
```
1. DEX: Deep EXpectation of apparent age from a single image
2. Age Progression/Regression by Conditional Adversarial Autoencoder
3. SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation
4. Deep Regression Forests for Age Estimation
```
+ Related open-source projects:
  * https://github.com/truongnmt/multi-task-learning
  * https://github.com/ZZUTK/Face-Aging-CAAE
  * https://github.com/yu4u/age-gender-estimation
  * https://github.com/shamangary/SSR-Net
  * https://github.com/shenwei1231/caffe-DeepRegressionForests

--------------------------------------------------------------------------------