
The Complete Guide to Deep Learning (Part 9): Computer Vision Applications


From image classification to object detection, semantic segmentation, and face recognition: a comprehensive guide to the core techniques and practical applications of computer vision.

Overview of Computer Vision

Computer Vision (CV) is one of deep learning's most successful application areas. Since AlexNet's breakthrough on ImageNet in 2012, deep learning has fundamentally reshaped the trajectory of CV.

Categories of CV Tasks

Task | Input | Output | Typical applications
Image classification | Image | Class label | Product recognition, medical diagnosis
Object detection | Image | Bounding boxes + classes | Autonomous driving, security surveillance
Semantic segmentation | Image | Per-pixel labels | Medical imaging, autonomous driving
Instance segmentation | Image | Per-instance masks | Robot vision
Pose estimation | Image | Keypoint coordinates | Motion capture, fitness AI
Face recognition | Image | Identity | Access control, payment
OCR | Image | Text | Document digitization

Advanced Image Classification

Evolution of Classic Architectures

LeNet (1998) → AlexNet (2012) → VGG (2014) → GoogLeNet (2014)
     → ResNet (2015) → DenseNet (2017) → EfficientNet (2019)
          → ViT (2020) → Swin Transformer (2021) → ConvNeXt (2022)

Modern Classification Networks: EfficientNet

EfficientNet balances network depth, width, and input resolution through compound scaling.

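As a rough sketch, compound scaling ties all three factors to a single coefficient φ under the constraint α·β²·γ² ≈ 2. The coefficients below are the grid-searched values reported in the EfficientNet paper; treat the snippet as illustrative:

alpha, beta, gamma = 1.2, 1.1, 1.15  # values from the EfficientNet paper

def compound_scale(phi):
    # A larger phi scales depth, width, and input resolution together.
    depth_mult = alpha ** phi       # more layers
    width_mult = beta ** phi        # wider layers (more channels)
    resolution_mult = gamma ** phi  # larger input images
    return depth_mult, width_mult, resolution_mult

print(compound_scale(1))  # roughly the B0 -> B1 scaling step

In PyTorch, a pretrained EfficientNet-B0 can be loaded and its head adapted to a custom task:
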
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision import transforms

# Load a pretrained EfficientNet
model = models.efficientnet_b0(weights='IMAGENET1K_V1')

# Replace the classification head for a custom task
num_classes = 10
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

# Data preprocessing
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                        std=[0.229, 0.224, 0.225])
])
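
For transfer learning on a small dataset, a common follow-up is to freeze the backbone and train only the new head. A minimal sketch (model.features and model.classifier are torchvision's EfficientNet attribute names):

# Freeze the backbone; only the replaced classifier head will be trained
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)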

Vision Transformer (ViT)

ViT brings the Transformer to image classification, opening a new paradigm for CV:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """将图像分割为patches并嵌入"""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, embed_dim, 
                             kernel_size=patch_size, stride=patch_size)
    
    def forward(self, x):
        # x: (B, C, H, W) -> (B, num_patches, embed_dim)
        x = self.proj(x)  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, 
                 num_classes=1000, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.num_patches
        
        # Learnable cls token and positional embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        
        # Transformer Encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, 
            dim_feedforward=embed_dim * 4, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        
        # Classification head
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)
        
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
    
    def forward(self, x):
        B = x.shape[0]
        
        # Patch embedding
        x = self.patch_embed(x)  # (B, num_patches, embed_dim)
        
        # Prepend the cls token
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # (B, num_patches+1, embed_dim)
        
        # Add the positional embedding
        x = x + self.pos_embed
        
        # Transformer
        x = self.transformer(x)
        
        # Classification
        x = self.norm(x[:, 0])  # take the cls token
        x = self.head(x)
        
        return x

# Build the model
model = ViT(num_classes=100)
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M")

Object Detection

Object detection must solve localization and classification at the same time.

Evolution of Detection Paradigms

Paradigm | Representative models | Characteristics
Two-stage | R-CNN family | Generate region proposals first, then classify
Single-stage | YOLO, SSD | End-to-end, fast
Anchor-free | FCOS, CenterNet | No preset anchor boxes
Transformer | DETR, DINO | End-to-end, no NMS (see the sketch below)
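
Several of these rows hinge on NMS (non-maximum suppression), the post-processing step that anchor-based detectors rely on and that DETR removes. A minimal, illustrative implementation; production code would use torchvision.ops.nms:

import torch

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    the remaining boxes that overlap it too much.
    boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,)."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = boxes[order[1:]]
        # IoU of the kept box against all remaining boxes
        x1 = torch.maximum(boxes[i, 0], rest[:, 0])
        y1 = torch.maximum(boxes[i, 1], rest[:, 1])
        x2 = torch.minimum(boxes[i, 2], rest[:, 2])
        y2 = torch.minimum(boxes[i, 3], rest[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_rest - inter + 1e-6)
        order = order[1:][iou <= iou_threshold]
    return keep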

The YOLO Family

YOLO (You Only Look Once) is the most popular real-time detection framework:

from ultralytics import YOLO

# Load a pretrained model
model = YOLO('yolov8n.pt')  # nano variant, the fastest

# Inference
results = model('image.jpg')

# Print the results
for result in results:
    boxes = result.boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())
        print(f"类别: {model.names[cls]}, 置信度: {conf:.2f}, 位置: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")

# Train on a custom dataset
model.train(data='custom_dataset.yaml', epochs=100, imgsz=640)

# Export the model
model.export(format='onnx')

DETR: End-to-End Object Detection

DETR uses a Transformer for end-to-end detection, with no NMS post-processing required:

import torch
import torch.nn as nn
from torchvision.models import resnet50

class DETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, num_queries=100):
        super().__init__()
        
        # CNN backbone
        backbone = resnet50(weights='IMAGENET1K_V1')
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        
        # Transformer
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=8, 
            num_encoder_layers=6, num_decoder_layers=6
        )
        
        # Learnable object queries
        self.query_embed = nn.Embedding(num_queries, hidden_dim)
        
        # Positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        
        # Prediction heads
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for no object
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 4)  # cx, cy, w, h
        )
    
    def forward(self, x):
        # Backbone features
        features = self.backbone(x)  # (B, 2048, H/32, W/32)
        h = self.conv(features)  # (B, hidden_dim, H, W)
        
        B, C, H, W = h.shape
        
        # Positional encoding
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)  # (H*W, 1, C)
        
        # Transformer
        src = h.flatten(2).permute(2, 0, 1)  # (H*W, B, C)
        query = self.query_embed.weight.unsqueeze(1).repeat(1, B, 1)  # (num_queries, B, C)
        
        hs = self.transformer(src + pos, query)  # (num_queries, B, C)
        hs = hs.permute(1, 0, 2)  # (B, num_queries, C)
        
        # Predictions
        class_logits = self.class_head(hs)  # (B, num_queries, num_classes+1)
        bbox_pred = self.bbox_head(hs).sigmoid()  # (B, num_queries, 4)
        
        return class_logits, bbox_pred

Hungarian Matching

DETR uses the Hungarian algorithm to find the optimal matching between predictions and ground truth:

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou  # expects (x1, y1, x2, y2) boxes

def hungarian_matching(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """
    计算预测与GT的最优匹配
    """
    num_queries = pred_logits.shape[0]
    num_gts = len(gt_labels)
    
    # Classification cost
    pred_probs = F.softmax(pred_logits, dim=-1)
    class_cost = -pred_probs[:, gt_labels]  # (num_queries, num_gts)
    
    # L1 bounding-box cost
    bbox_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_queries, num_gts)
    
    # GIoU cost
    giou_cost = -generalized_box_iou(pred_boxes, gt_boxes)  # (num_queries, num_gts)
    
    # Total cost (the weights follow DETR's defaults)
    cost_matrix = class_cost + 5 * bbox_cost + 2 * giou_cost
    
    # Solve with the Hungarian algorithm
    pred_indices, gt_indices = linear_sum_assignment(cost_matrix.cpu().numpy())
    
    return pred_indices, gt_indices
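
The matched indices then drive DETR's set-prediction loss. A simplified sketch (assuming gt_labels and gt_boxes are tensors; in full DETR, unmatched queries are additionally supervised toward the "no object" class, omitted here):

def matched_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                 pred_indices, gt_indices):
    # Classification loss on the matched queries only (simplified)
    cls_loss = F.cross_entropy(pred_logits[pred_indices], gt_labels[gt_indices])
    # L1 regression loss on the matched boxes
    box_loss = F.l1_loss(pred_boxes[pred_indices], gt_boxes[gt_indices])
    return cls_loss + 5 * box_loss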

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel.

Mainstream Architectures

Model | Year | Characteristics
FCN | 2015 | First end-to-end segmentation network
U-Net | 2015 | Skip connections; popular in medical imaging
DeepLab | 2017 | Atrous convolution, ASPP
PSPNet | 2017 | Pyramid pooling
Mask2Former | 2022 | Unified segmentation

A U-Net Implementation

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        return self.conv(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=21):
        super().__init__()
        
        # Encoder (downsampling)
        self.enc1 = DoubleConv(in_channels, 64)
        self.enc2 = DoubleConv(64, 128)
        self.enc3 = DoubleConv(128, 256)
        self.enc4 = DoubleConv(256, 512)
        
        self.pool = nn.MaxPool2d(2)
        
        # Bottleneck
        self.bottleneck = DoubleConv(512, 1024)
        
        # Decoder (upsampling)
        self.up4 = nn.ConvTranspose2d(1024, 512, 2, stride=2)
        self.dec4 = DoubleConv(1024, 512)
        
        self.up3 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec3 = DoubleConv(512, 256)
        
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = DoubleConv(256, 128)
        
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = DoubleConv(128, 64)
        
        # Output layer
        self.out = nn.Conv2d(64, num_classes, 1)
    
    def forward(self, x):
        # Encoder
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        e4 = self.enc4(self.pool(e3))
        
        # Bottleneck
        b = self.bottleneck(self.pool(e4))
        
        # Decoder + Skip connections
        d4 = self.dec4(torch.cat([self.up4(b), e4], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d4), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d3), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        
        return self.out(d1)

# Build the model
model = UNet(num_classes=21)
x = torch.randn(2, 3, 256, 256)
out = model(x)
print(f"输出形状: {out.shape}")  # (2, 21, 256, 256)

DeepLab v3+ with ASPP

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling"""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        
        # 1x1 convolution
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )
        
        # Atrous (dilated) convolutions with different dilation rates
        rates = [6, 12, 18]
        self.atrous_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True)
            ) for r in rates
        ])
        
        # Global average pooling
        self.global_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )
        
        # Fusion
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * 5, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5)
        )
    
    def forward(self, x):
        size = x.shape[2:]
        
        # Multi-scale features
        feat1 = self.conv1(x)
        feat2 = self.atrous_convs[0](x)
        feat3 = self.atrous_convs[1](x)
        feat4 = self.atrous_convs[2](x)
        feat5 = F.interpolate(self.global_pool(x), size=size, mode='bilinear', align_corners=False)
        
        # Concatenate and fuse
        out = torch.cat([feat1, feat2, feat3, feat4, feat5], dim=1)
        return self.project(out)
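
A quick shape check (illustrative; 2048 matches the channel count of a ResNet-50 backbone's final feature map):

aspp = ASPP(in_channels=2048)
feat = torch.randn(2, 2048, 32, 32)
print(aspp(feat).shape)  # torch.Size([2, 256, 32, 32])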

Segmentation Loss Functions

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceLoss(nn.Module):
    """Dice loss; well suited to class-imbalanced segmentation"""
    def __init__(self, smooth=1.0):
        super().__init__()
        self.smooth = smooth
    
    def forward(self, pred, target):
        pred = torch.softmax(pred, dim=1)
        
        # One-hot encoding
        num_classes = pred.shape[1]
        target_one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
        
        # Compute the Dice coefficient
        intersection = (pred * target_one_hot).sum(dim=(2, 3))
        union = pred.sum(dim=(2, 3)) + target_one_hot.sum(dim=(2, 3))
        
        dice = (2 * intersection + self.smooth) / (union + self.smooth)
        return 1 - dice.mean()

class FocalLoss(nn.Module):
    """Focal Loss,关注难样本"""
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, pred, target):
        ce_loss = F.cross_entropy(pred, target, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()

# Combined loss
class CombinedLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.dice = DiceLoss()
    
    def forward(self, pred, target):
        return self.ce(pred, target) + self.dice(pred, target)
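
A quick sanity check with dummy tensors (pred holds raw logits of shape (B, C, H, W); target holds integer class indices):

criterion = CombinedLoss()
pred = torch.randn(2, 21, 64, 64)            # logits
target = torch.randint(0, 21, (2, 64, 64))   # class indices
print(criterion(pred, target).item())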

Instance Segmentation

Instance segmentation = object detection + semantic segmentation: it must distinguish different instances of the same class.

Mask R-CNN

import torch
import torchvision
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a pretrained model
model = maskrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

# Inference
image = torch.randn(3, 800, 800)
with torch.no_grad():
    predictions = model([image])

# Parse the results
boxes = predictions[0]['boxes']      # bounding boxes
labels = predictions[0]['labels']    # class labels
scores = predictions[0]['scores']    # confidence scores
masks = predictions[0]['masks']      # instance masks (N, 1, H, W)

# Custom training setup
def get_model(num_classes):
    model = maskrcnn_resnet50_fpn(weights='DEFAULT')
    
    # Replace the box classification head
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(
        in_features, num_classes
    )
    
    # Replace the mask head
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = torchvision.models.detection.mask_rcnn.MaskRCNNPredictor(
        in_features_mask, 256, num_classes
    )
    
    return model

Face Recognition

A face recognition system consists of: face detection → face alignment → feature extraction → feature matching.

The ArcFace Loss

ArcFace is currently the most popular loss function for face recognition:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import math

class ArcFace(nn.Module):
    """
    ArcFace: Additive Angular Margin Loss
    """
    def __init__(self, in_features, out_features, s=64.0, m=0.50):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.s = s  # scale factor
        self.m = m  # angular margin
        
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        
        self.cos_m = math.cos(m)
        self.sin_m = math.sin(m)
        self.th = math.cos(math.pi - m)
        self.mm = math.sin(math.pi - m) * m
    
    def forward(self, features, labels):
        # Normalize features and weights
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        sine = torch.sqrt(1.0 - torch.pow(cosine, 2))
        
        # cos(theta + m)
        phi = cosine * self.cos_m - sine * self.sin_m
        
        # Handle the edge case where theta + m exceeds pi
        phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        
        # One-hot
        one_hot = torch.zeros_like(cosine)
        one_hot.scatter_(1, labels.view(-1, 1).long(), 1)
        
        # Add the margin only for the ground-truth class
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.s
        
        return output

# Usage example
class FaceRecognitionModel(nn.Module):
    def __init__(self, num_classes, embedding_dim=512):
        super().__init__()
        # Backbone
        self.backbone = torchvision.models.resnet50(weights='IMAGENET1K_V1')
        self.backbone.fc = nn.Linear(2048, embedding_dim)
        
        # ArcFace head
        self.arcface = ArcFace(embedding_dim, num_classes)
    
    def forward(self, x, labels=None):
        features = self.backbone(x)
        features = F.normalize(features)
        
        if labels is not None:
            return self.arcface(features, labels)
        return features
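
ArcFace produces margin-adjusted logits that are fed to an ordinary cross-entropy loss. A minimal training-step sketch with dummy data:

model = FaceRecognitionModel(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

images = torch.randn(8, 3, 112, 112)   # dummy face crops
labels = torch.randint(0, 1000, (8,))

logits = model(images, labels)         # margin-adjusted logits
loss = F.cross_entropy(logits, labels)
loss.backward()
optimizer.step()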

Face Detection: RetinaFace

from retinaface import RetinaFace

# Detect faces
faces = RetinaFace.detect_faces("group_photo.jpg")

for face_id, face_info in faces.items():
    facial_area = face_info["facial_area"]  # [x1, y1, x2, y2]
    landmarks = face_info["landmarks"]       # 5 facial landmarks
    confidence = face_info["score"]
    
    print(f"人脸 {face_id}: 置信度={confidence:.2f}")
    print(f"  位置: {facial_area}")
    print(f"  左眼: {landmarks['left_eye']}")
    print(f"  右眼: {landmarks['right_eye']}")

A Complete Face Recognition Pipeline

import cv2
import numpy as np
from facenet_pytorch import MTCNN, InceptionResnetV1

class FaceRecognitionSystem:
    def __init__(self):
        self.detector = MTCNN(keep_all=True)
        self.encoder = InceptionResnetV1(pretrained='vggface2').eval()
        self.database = {}  # {name: embedding}
    
    def register(self, name, image):
        """注册人脸"""
        faces = self.detector(image)
        if faces is None:
            return False
        
        embedding = self.encoder(faces[0].unsqueeze(0))
        self.database[name] = embedding.detach().numpy()
        return True
    
    def recognize(self, image, threshold=0.6):
        """识别人脸"""
        faces = self.detector(image)
        if faces is None:
            return []
        
        results = []
        for face in faces:
            embedding = self.encoder(face.unsqueeze(0)).detach().numpy()
            
            # Compare against the database
            best_match = None
            best_similarity = 0
            
            for name, db_embedding in self.database.items():
                similarity = np.dot(embedding.flatten(), db_embedding.flatten())
                if similarity > best_similarity:
                    best_similarity = similarity
                    best_match = name
            
            if best_similarity > threshold:
                results.append((best_match, best_similarity))
            else:
                results.append(("Unknown", best_similarity))
        
        return results
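
Usage might look like this (illustrative; facenet_pytorch's MTCNN accepts PIL images or numpy arrays, and the file names are placeholders):

from PIL import Image

system = FaceRecognitionSystem()
system.register("alice", Image.open("alice.jpg"))
print(system.recognize(Image.open("group_photo.jpg")))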

Pose Estimation

Human Body Keypoints

Human pose estimation detects body keypoints (joint locations):

import torch
import torchvision.models as models

# Use torchvision's KeypointRCNN
model = models.detection.keypointrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

image = torch.randn(3, 800, 600)

with torch.no_grad():
    predictions = model([image])

# Parse the keypoints
# COCO format: 17 keypoints
KEYPOINT_NAMES = [
    'nose', 'left_eye', 'right_eye', 'left_ear', 'right_ear',
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_wrist', 'right_wrist', 'left_hip', 'right_hip',
    'left_knee', 'right_knee', 'left_ankle', 'right_ankle'
]

keypoints = predictions[0]['keypoints']  # (N, 17, 3) - x, y, visibility
scores = predictions[0]['keypoints_scores']

Lightweight Pose Estimation: MoveNet

import tensorflow as tf
import tensorflow_hub as hub

# Load MoveNet
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures['serving_default']

def detect_pose(image):
    """检测单人姿态"""
    # Preprocessing
    input_image = tf.cast(image, dtype=tf.int32)
    input_image = tf.image.resize_with_pad(input_image, 192, 192)
    input_image = tf.expand_dims(input_image, axis=0)
    
    # Inference
    outputs = movenet(input_image)
    keypoints = outputs['output_0'].numpy()[0, 0, :, :]
    
    return keypoints  # (17, 3) - y, x, confidence

OCR: Text Recognition

A Deep-Learning OCR Pipeline

Text detection → orientation correction → text recognition → post-processing

Using PaddleOCR

from paddleocr import PaddleOCR

# Initialize
ocr = PaddleOCR(use_angle_cls=True, lang='ch')

# Run OCR
result = ocr.ocr('document.jpg', cls=True)

for line in result[0]:
    box = line[0]      # text box coordinates
    text = line[1][0]  # recognized text
    conf = line[1][1]  # confidence
    print(f"Text: {text}, confidence: {conf:.2f}")

The CRNN Text Recognition Model

import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN + RNN + CTC"""
    def __init__(self, img_height, num_chars, hidden_size=256):
        super().__init__()
        
        # CNN feature extractor
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(256, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(512, 512, 2), nn.ReLU()
        )
        
        # Height of the CNN output feature map (set by the pooling layers above)
        self.feature_height = img_height // 16 - 1
        
        # Bidirectional LSTM
        self.rnn = nn.LSTM(512 * self.feature_height, hidden_size, 
                          num_layers=2, bidirectional=True, batch_first=True)
        
        # Output layer
        self.fc = nn.Linear(hidden_size * 2, num_chars + 1)  # +1 for blank
    
    def forward(self, x):
        # CNN
        conv = self.cnn(x)  # (B, 512, H', W')
        
        # Flatten to a sequence along the width axis
        b, c, h, w = conv.size()
        conv = conv.view(b, c * h, w).permute(0, 2, 1)  # (B, W', C*H')
        
        # RNN
        rnn_out, _ = self.rnn(conv)  # (B, W', hidden*2)
        
        # Output
        output = self.fc(rnn_out)  # (B, W', num_chars+1)
        
        return output.permute(1, 0, 2)  # CTC expects (T, B, C)

# CTC loss
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
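
A minimal sketch of how the CTC loss is applied to CRNN outputs (the shapes and the 36-character alphabet are illustrative):

model = CRNN(img_height=32, num_chars=36)
images = torch.randn(4, 1, 32, 100)       # (B, 1, H, W) grayscale text crops
log_probs = model(images).log_softmax(2)  # (T, B, num_chars+1)

targets = torch.randint(1, 37, (4, 8))    # label indices; 0 is reserved for blank
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 8, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)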

Data Augmentation

Data augmentation is a key technique when training CV models:

import numpy as np
import torch
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Training-time augmentation
train_transform = A.Compose([
    A.RandomResizedCrop(224, 224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    A.GaussNoise(p=0.1),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2()
])

# Augmentation for detection tasks
detection_transform = A.Compose([
    A.RandomResizedCrop(640, 640, scale=(0.5, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2()
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']))

# MixUp augmentation
def mixup(images, labels, alpha=0.2):
    lam = np.random.beta(alpha, alpha)
    batch_size = images.size(0)
    index = torch.randperm(batch_size)
    
    mixed_images = lam * images + (1 - lam) * images[index]
    labels_a, labels_b = labels, labels[index]
    
    return mixed_images, labels_a, labels_b, lam

# CutMix augmentation
def cutmix(images, labels, alpha=1.0):
    lam = np.random.beta(alpha, alpha)
    batch_size, _, H, W = images.size()
    index = torch.randperm(batch_size)
    
    # Compute the cut region
    cut_ratio = np.sqrt(1 - lam)
    cut_w, cut_h = int(W * cut_ratio), int(H * cut_ratio)
    cx, cy = np.random.randint(W), np.random.randint(H)
    
    x1 = np.clip(cx - cut_w // 2, 0, W)
    x2 = np.clip(cx + cut_w // 2, 0, W)
    y1 = np.clip(cy - cut_h // 2, 0, H)
    y2 = np.clip(cy + cut_h // 2, 0, H)
    
    images[:, :, y1:y2, x1:x2] = images[index, :, y1:y2, x1:x2]
    lam = 1 - (x2 - x1) * (y2 - y1) / (W * H)
    
    return images, labels, labels[index], lam
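
Both functions return two sets of labels plus a mixing coefficient, and the training loss is interpolated accordingly (a sketch, assuming model, a criterion such as nn.CrossEntropyLoss, and a batch of images and labels from your training loop):

mixed, labels_a, labels_b, lam = mixup(images, labels)
outputs = model(mixed)
loss = lam * criterion(outputs, labels_a) + (1 - lam) * criterion(outputs, labels_b)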

Evaluation Metrics

Classification Metrics

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Top-K accuracy
def topk_accuracy(output, target, topk=(1, 5)):
    maxk = max(topk)
    batch_size = target.size(0)
    
    _, pred = output.topk(maxk, 1, True, True)
    pred = pred.t()
    correct = pred.eq(target.view(1, -1).expand_as(pred))
    
    results = []
    for k in topk:
        correct_k = correct[:k].reshape(-1).float().sum(0)
        results.append(correct_k / batch_size)
    
    return results

Detection Metric: mAP

import numpy as np

def calculate_ap(recalls, precisions):
    """Compute AP for a single class"""
    # 11-point interpolation
    ap = 0
    for t in np.arange(0, 1.1, 0.1):
        if np.sum(recalls >= t) == 0:
            p = 0
        else:
            p = np.max(precisions[recalls >= t])
        ap += p / 11
    return ap

def calculate_iou(box1, box2):
    """计算IoU"""
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    
    return intersection / (area1 + area2 - intersection + 1e-6)

Segmentation Metric: mIoU

import numpy as np

def calculate_miou(pred, target, num_classes):
    """Compute the mean IoU over all classes"""
    ious = []
    for cls in range(num_classes):
        pred_mask = (pred == cls)
        target_mask = (target == cls)
        
        intersection = (pred_mask & target_mask).sum()
        union = (pred_mask | target_mask).sum()
        
        if union == 0:
            iou = 1.0 if intersection == 0 else 0.0
        else:
            iou = intersection / union
        
        ious.append(iou)
    
    return np.mean(ious)

Hands-On Project: A License Plate Recognition System

import torch
import torch.nn as nn
import numpy as np
import cv2
from ultralytics import YOLO

class LicensePlateRecognizer:
    def __init__(self):
        # License plate detection model
        self.detector = YOLO('license_plate_detector.pt')
        
        # Character recognition model (the CRNN defined above)
        self.chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ京沪津渝冀晋蒙辽吉黑苏浙皖闽赣鲁豫鄂湘粤桂琼川贵云藏陕甘青宁新'
        self.recognizer = CRNN(32, len(self.chars))
        self.recognizer.load_state_dict(torch.load('crnn_license.pt'))
        self.recognizer.eval()
    
    def detect(self, image):
        """检测车牌"""
        results = self.detector(image)
        plates = []
        
        for box in results[0].boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            plate_img = image[y1:y2, x1:x2]
            plates.append({
                'bbox': (x1, y1, x2, y2),
                'image': plate_img,
                'confidence': box.conf[0].item()
            })
        
        return plates
    
    def recognize(self, plate_img):
        """识别车牌号"""
        # Preprocessing
        img = cv2.cvtColor(plate_img, cv2.COLOR_BGR2GRAY)
        img = cv2.resize(img, (100, 32))
        img = img.astype(np.float32) / 255.0
        img = torch.from_numpy(img).unsqueeze(0).unsqueeze(0)
        
        # Inference
        with torch.no_grad():
            output = self.recognizer(img)
        
        # CTC greedy decoding
        _, preds = output.max(2)
        preds = preds.transpose(1, 0).contiguous().view(-1)
        
        # Collapse repeats and drop blanks
        text = []
        prev = 0
        for p in preds:
            if p != 0 and p != prev:
                text.append(self.chars[p - 1])
            prev = p
        
        return ''.join(text)
    
    def process(self, image):
        """完整处理流程"""
        plates = self.detect(image)
        
        for plate in plates:
            plate['text'] = self.recognize(plate['image'])
        
        return plates

# Usage example
recognizer = LicensePlateRecognizer()
image = cv2.imread('car.jpg')
results = recognizer.process(image)

for r in results:
    print(f"车牌: {r['text']}, 置信度: {r['confidence']:.2f}")

Summary

This article covered the core applications of computer vision:

Task | Mainstream approach | Recommended models
Image classification | CNN / ViT | EfficientNet, Swin
Object detection | Single-stage / Transformer | YOLOv8, DINO
Semantic segmentation | Encoder-decoder | U-Net, DeepLab
Instance segmentation | Two-stage | Mask R-CNN
Face recognition | Metric learning | ArcFace
Pose estimation | Keypoint detection | MoveNet, HRNet
OCR | Detection + recognition | PaddleOCR

Next up: deep learning for natural language processing, covering text classification, named entity recognition, machine translation, and more.