Hand-Rolling a ViT (Pattern Recognition Assignment)

Important note: I hand-implemented the Patch + Position Embedding and the Vision Transformer (detailed below), reaching 99.20% classification accuracy. However, even with the pretrained weights loaded, inference is very slow: 1,000 images take about 3 hours, so producing all four .npy feature files would take roughly 12 hours. Since I also have a research project to work on, I use the four .npy files provided by the instructor directly and hand-roll an MLP for the final prediction.

Planned observations: random initialization — dropped for lack of compute; swapping in different weights — dropped for lack of compute; tuning the MLP — dropped because accuracy is essentially at its ~99% ceiling, which makes any effect on recognition performance hard to observe.

Experiment workflow

  1. First inspect the file format and parameter naming of the instructor-provided DINOV2-base.npz (a minimal inspection sketch follows this list).
  2. Implement the Embedding step based on the shapes of the embedding weights and the diagram in the paper.
  3. Feed the task requirements and the parameter names to GPT to hand-roll the transformer, making sure it does not get the "names" wrong; the mask token can be omitted. Each image yields a full token tensor of shape (1, 1370, 768), with the CLS token (1, 768) at the front used for classification.
  4. The CLS token is the feature vector fed into the MLP head for classification; a simple MLP afterwards is enough for the final prediction.
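
Step 1 takes only a few lines of NumPy. A minimal inspection sketch (the checkpoint path is an assumption; point it at wherever DINOV2-base.npz lives):

import numpy as np

# print every parameter name and its shape in the checkpoint
weights = dict(np.load(r"D:\moshishibie\transformer\fangangnb\dinov2-base.npz"))
for name, arr in sorted(weights.items()):
    print(name, arr.shape)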

Hand-rolling the transformer

References: the ViT paper, DINO, and DINOv2.

Implementation

1. Image preprocessing and Patch Embedding

  • Function: image_to_patch_embedding_v2
  • What it does:
    • Resizes the input image to 518x518, then normalizes it to the mean and standard deviation expected by the DINOv2 pretrained model.
    • Splits the image into 14x14 patches with a convolution-style operation (simulating Patch Embedding) and projects each patch into a 768-dimensional vector space via the weight matrix. The patch output has shape [1369, 768], where 1369 comes from the 37x37 patch grid.
  • Input/Output:
    • Input: an image [H, W, 3]; 518x518 is the recommended size.
    • Output: an array of shape [1, 1369, 768], one 768-dimensional vector per patch.
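
The script later in this write-up extracts patches with an explicit double loop, which is slow. The same convolution-as-matmul trick can be vectorized with a single reshape/transpose; the following is an equivalent sketch (the name patch_embed_vectorized is mine, not from the script):

import numpy as np

def patch_embed_vectorized(x, W, B, p=14):
    # x: [1, 3, 518, 518] normalized image; W: [768, 3, 14, 14]; B: [768]
    _, C, H, Wd = x.shape
    gh, gw = H // p, Wd // p  # 37 x 37 = 1369 patches
    # carve non-overlapping p x p patches: [1, gh, gw, C, p, p]
    patches = x.reshape(1, C, gh, p, gw, p).transpose(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(gh * gw, C * p * p)  # [1369, 588]
    # one matmul replaces the 1369 per-patch projections of the loop version
    return (patches @ W.reshape(768, -1).T + B)[np.newaxis]  # [1, 1369, 768]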

2. Adding the CLS token and position embeddings

  • Function: add_cls_and_position_v2
  • What it does:
    • After Patch Embedding, DINOv2 prepends a special CLS token ([CLS]) to the feature sequence. The CLS token aggregates information from the whole image and is typically used for downstream classification.
    • The function prepends the CLS token to the image features, adds the position embeddings, and returns the result.
  • Input/Output:
    • Input: the feature tensor [1, 1369, 768] from image_to_patch_embedding_v2, plus the position-embedding and CLS-token weights.
    • Output: a tensor of shape [1, 1370, 768] (1 CLS token + 1369 patch tokens), with position embeddings added.

3. Transformer inference

  • Functions: forward_transformer and transformer_block
  • What they do:
    • DINOv2 processes the image features with a 12-layer Transformer encoder. Each layer contains multi-head self-attention and a fully connected feed-forward network; after all layers, the CLS token represents the whole image.
    • Each layer computes the following (see the sketch after this list):
      • LayerNorm: normalize the input first (pre-norm).
      • QKV projection: three linear maps of the normalized input produce Query, Key, and Value.
      • Multi-head self-attention: the scaled dot product of Query and Key gives attention scores, which produce a weighted sum of the Values.
      • Feed-forward network: the attention output passes through a two-layer MLP with a GELU activation.
      • Residual connections: each sub-layer's output (scaled per channel by DINOv2's LayerScale, as in the code below) is added back to its input and passed on to the next layer.
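
Put together, one encoder layer is the standard pre-norm block plus DINOv2's LayerScale on each branch. A schematic sketch, where attention, mlp, and the ln*/ls* arguments stand in for the real weight lookups that transformer_block below performs:

def block_schematic(x, attention, mlp, ln1, ln2, ls1, ls2):
    # attention branch: pre-norm -> multi-head self-attention -> LayerScale -> residual
    x = x + ls1 * attention(layer_norm(x, *ln1))
    # MLP branch: pre-norm -> fc1 -> GELU -> fc2 -> LayerScale -> residual
    x = x + ls2 * mlp(layer_norm(x, *ln2))
    return x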

Experiment results

I first reduced the data volume to verify that the pipeline runs end-to-end and is correct:

MLP classification

The MLP stacks four linear layers (three hidden, one output):

  1. (768,256)
  2. (256,128)
  3. (128,128)
  4. (128,1) — binary classification output
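
As a quick size check, the four layers above hold about 246K parameters in total, tiny next to the 768-dimensional input features. A throwaway computation (not part of the original scripts):

# weights + biases for the four linear layers listed above
dims = [(768, 256), (256, 128), (128, 128), (128, 1)]
print(sum(i * o + o for i, o in dims))  # 246401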

Other hyperparameters:

lr = 0.001  # learning rate
epochs = 100
batch_size = 128

# activation functions
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# loss function
# --------------------------
def binary_cross_entropy(y_pred, y_true):
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
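
One caveat worth noting: the plain sigmoid above overflows in np.exp(-x) for large negative inputs and triggers a RuntimeWarning. A numerically stable variant, as a sketch (sigmoid_stable is my name; the scripts below keep the plain version):

def sigmoid_stable(x):
    # evaluate exp only on the side where it cannot overflow
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out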

Run results

1,000 cats and 1,000 dogs:

On my own prediction set of 5 cats and 5 dogs, accuracy is 100%:

Key code

transformer_hand.py

import numpy as np
from skimage.io import imread
from skimage.transform import resize
import matplotlib.pyplot as plt
from tqdm import tqdm
import os

# Embedding step
def image_to_patch_embedding_v2(image, weights):
    """
    Use the 14x14 patch-embedding weights to produce the patch tokens
    (the CLS token is prepended later, for 1370 tokens in total).
    image: [H, W, 3] input image, 518x518 recommended
    weights: dict containing the embedding parameters
    return: x_patch: [1, 1369, 768]
    """
    # Resize to 518x518 (14x14 patches tile into a 37x37 = 1369 grid)
    image_resized = resize(image, (518, 518), anti_aliasing=True)

    # Normalize
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    image_norm = (image_resized - mean) / std

    # [H, W, 3] -> [1, 3, H, W]
    x = image_norm.transpose(2, 0, 1)[np.newaxis, :]  # [1, 3, 518, 518]

    # Convolutional Patch Embedding weights
    W = weights["embeddings.patch_embeddings.projection.weight"]  # [768, 3, 14, 14]
    B = weights["embeddings.patch_embeddings.projection.bias"]    # [768]

    # Manual sliding window with stride 14
    stride = 14
    patch_size = 14
    num_patches = (518 // 14) ** 2  # = 1369

    patches = []
    for i in range(0, 518 - patch_size + 1, stride):
        for j in range(0, 518 - patch_size + 1, stride):
            patch = x[0, :, i:i+14, j:j+14]        # [3, 14, 14]
            patch_flat = patch.reshape(1, -1)      # [1, 3*14*14]
            W_flat = W.reshape(768, -1)            # [768, 3*14*14]
            patch_out = patch_flat @ W_flat.T + B  # [1, 768]
            patches.append(patch_out)

    x_patch = np.concatenate(patches, axis=0)  # [1369, 768]
    return x_patch[np.newaxis, :, :]           # [1, 1369, 768]

def add_cls_and_position_v2(x_patch, weights):
    """
    x_patch: [1, 1369, 768]
    weights: dict
    return: [1, 1370, 768]
    """
    cls_token = weights["embeddings.cls_token"]            # [1, 1, 768]
    pos_embed = weights["embeddings.position_embeddings"]  # [1, 1370, 768]

    x = np.concatenate([cls_token, x_patch], axis=1)  # [1, 1370, 768]
    x += pos_embed
    return x

# Transformer layers
def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * np.power(x, 3))))

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def layer_norm(x, weight, bias, eps=1e-5):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    norm = (x - mean) / np.sqrt(var + eps)
    return norm * weight + bias

def transformer_block(x, weights, layer_id, num_heads=12):
    dim = x.shape[-1]
    head_dim = dim // num_heads

    prefix = f"encoder.layer.{layer_id}"

    # LayerNorm1
    x_ln1 = layer_norm(
        x,
        weights[f"{prefix}.norm1.weight"],
        weights[f"{prefix}.norm1.bias"]
    )

    # QKV Projection
    q = x_ln1 @ weights[f"{prefix}.attention.attention.query.weight"].T + weights[
        f"{prefix}.attention.attention.query.bias"]
    k = x_ln1 @ weights[f"{prefix}.attention.attention.key.weight"].T + weights[
        f"{prefix}.attention.attention.key.bias"]
    v = x_ln1 @ weights[f"{prefix}.attention.attention.value.weight"].T + weights[
        f"{prefix}.attention.attention.value.bias"]

    # Reshape to multi-head
    def reshape_heads(x):  # [B, num_heads, T, head_dim]
        B, T, _ = x.shape
        return x.reshape(B, T, num_heads, head_dim).transpose(0, 2, 1, 3)

    q = reshape_heads(q)
    k = reshape_heads(k)
    v = reshape_heads(v)

    # Scaled Dot-Product Attention
    attn_scores = np.matmul(q, k.transpose(0, 1, 3, 2)) / np.sqrt(head_dim)
    attn_weights = softmax(attn_scores, axis=-1)
    attn_output = np.matmul(attn_weights, v)  # [B, num_heads, T, head_dim]

    # Concatenate heads
    attn_output = attn_output.transpose(0, 2, 1, 3).reshape(x.shape)

    # Linear projection
    attn_output = attn_output @ weights[f"{prefix}.attention.output.dense.weight"].T + weights[
        f"{prefix}.attention.output.dense.bias"]

    # LayerScale1
    attn_output *= weights[f"{prefix}.layer_scale1.lambda1"]

    # Residual 1
    x = x + attn_output

    # LayerNorm2
    x_ln2 = layer_norm(
        x,
        weights[f"{prefix}.norm2.weight"],
        weights[f"{prefix}.norm2.bias"]
    )

    # MLP
    fc1 = x_ln2 @ weights[f"{prefix}.mlp.fc1.weight"].T + weights[f"{prefix}.mlp.fc1.bias"]
    fc1 = gelu(fc1)
    fc2 = fc1 @ weights[f"{prefix}.mlp.fc2.weight"].T + weights[f"{prefix}.mlp.fc2.bias"]

    # LayerScale2
    fc2 *= weights[f"{prefix}.layer_scale2.lambda1"]

    # Residual 2
    x = x + fc2

    return x

def forward_transformer(x, weights, num_layers=12):
    for i in range(num_layers):
        x = transformer_block(x, weights, i)

    # Final LayerNorm
    x = layer_norm(
        x,
        weights["layernorm.weight"],
        weights["layernorm.bias"]
    )

    # Extract the CLS token after the final LayerNorm, matching DINOv2's pooling; shape [batch_size, 768]
    cls_token = x[:, 0]

    return x, cls_token

# Inference for a single image
def dinov2_inference(image, weights):
    """
    Run embedding + transformer inference in one call.
    return: (tokens [1, 1370, 768], cls_token [1, 768])
    """
    x_patch = image_to_patch_embedding_v2(image, weights)  # [1, 1369, 768]
    x = add_cls_and_position_v2(x_patch, weights)          # [1, 1370, 768]
    return forward_transformer(x, weights)                 # (tokens, cls_token)

# Inference for every .jpg file in a folder
def process_images_in_folder(folder_path, weights):
    """
    Process all .jpg images in a folder and return their features and CLS tokens.
    folder_path: str, path to the folder containing the .jpg files
    weights: dict of DINOv2 model weights
    return: features, cls_token
    """
    # Collect all .jpg files in the folder
    image_files = [f for f in os.listdir(folder_path) if f.endswith('.jpg')]

    # Accumulate all features and CLS tokens
    features_list = []
    cls_token_list = []

    # Iterate over the image files
    for image_file in tqdm(image_files, desc="Processing Images", ncols=100):
        # Load the image
        image_path = os.path.join(folder_path, image_file)
        image = imread(image_path) / 255.0

        # Run inference; drop the leading batch dimension before stacking
        tokens, cls_token = dinov2_inference(image, weights)
        features_list.append(tokens[0])      # [1370, 768]
        cls_token_list.append(cls_token[0])  # [768]

    # Convert to numpy arrays
    features_array = np.array(features_list)    # [num_images, 1370, 768]
    cls_token_array = np.array(cls_token_list)  # [num_images, 768]

    return features_array, cls_token_array

# Usage example
weights = dict(np.load(r"D:\moshishibie\transformer\fangangnb\dinov2-base.npz"))
train_data_dir = r'D:\moshishibie\transformer\train'
test_data_dir = r'D:\moshishibie\transformer\test'
save_dir = r'D:\moshishibie\transformer\trans_result'

animals = ['cats', 'dogs']
for i in range(len(animals)):
    folder_path = os.path.join(train_data_dir, animals[i])
    features, cls_token = process_images_in_folder(folder_path, weights)
    np.save(os.path.join(save_dir, "cjh_train_" + animals[i] + "_features.npy"), features)
    np.save(os.path.join(save_dir, "cjh_train_" + animals[i] + "_cls_token.npy"), cls_token)

    folder_path = os.path.join(test_data_dir, animals[i])
    features, cls_token = process_images_in_folder(folder_path, weights)
    np.save(os.path.join(save_dir, "cjh_test_" + animals[i] + "_features.npy"), features)
    np.save(os.path.join(save_dir, "cjh_test_" + animals[i] + "_cls_token.npy"), cls_token)

MLPcls.py

import numpy as np
from tqdm import tqdm
import csv
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, avoids FigureCanvas errors; set before importing pyplot
import matplotlib.pyplot as plt

# --------------------------
# Hyperparameters
# --------------------------
input_dim = 768
hidden1 = 256
hidden2 = 128
hidden3 = 128
output_dim = 1  # binary classification
lr = 0.001
epochs = 20
batch_size = 128

# --------------------------
# Data loading and labels
# --------------------------
X_train_cat = np.load(r'D:\moshishibie\transformer\fangangnb\train_cat_features.npy')  # (4000, 768)
X_train_dog = np.load(r'D:\moshishibie\transformer\fangangnb\train_dog_features.npy')  # (4000, 768)
X_test_cat = np.load(r'D:\moshishibie\transformer\fangangnb\test_cat_features.npy')    # (1000, 768)
X_test_dog = np.load(r'D:\moshishibie\transformer\fangangnb\test_dog_features.npy')    # (1000, 768)

# Stack the training data; label counts must match the number of samples per class
X_train = np.vstack([X_train_cat, X_train_dog])  # (8000, 768)
y_train = np.array([0] * len(X_train_cat) + [1] * len(X_train_dog)).reshape(-1, 1)

# Stack the test data
X_test = np.vstack([X_test_cat, X_test_dog])  # (2000, 768)
y_test = np.array([0] * len(X_test_cat) + [1] * len(X_test_dog)).reshape(-1, 1)

# Shuffle the training data
perm = np.random.permutation(len(X_train))
X_train = X_train[perm]
y_train = y_train[perm]

# --------------------------
# Parameter initialization (He init for the ReLU layers)
# --------------------------
def init_weights():
    W1 = np.random.randn(input_dim, hidden1) * np.sqrt(2. / input_dim)
    b1 = np.zeros((1, hidden1))
    W2 = np.random.randn(hidden1, hidden2) * np.sqrt(2. / hidden1)
    b2 = np.zeros((1, hidden2))
    W3 = np.random.randn(hidden2, hidden3) * np.sqrt(2. / hidden2)
    b3 = np.zeros((1, hidden3))
    W4 = np.random.randn(hidden3, output_dim) * np.sqrt(2. / hidden3)
    b4 = np.zeros((1, output_dim))
    return W1, b1, W2, b2, W3, b3, W4, b4

# --------------------------
# Activation functions
# --------------------------
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# --------------------------
# Loss function
# --------------------------
def binary_cross_entropy(y_pred, y_true):
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# --------------------------
# Forward pass
# --------------------------
def forward(X, W1, b1, W2, b2, W3, b3, W4, b4):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    a2 = relu(z2)
    z3 = a2 @ W3 + b3
    a3 = relu(z3)
    z4 = a3 @ W4 + b4
    a4 = sigmoid(z4)
    return z1, a1, z2, a2, z3, a3, z4, a4

# --------------------------
# Backward pass
# --------------------------
def backward(X, y, z1, a1, z2, a2, z3, a3, z4, a4, W2, W3, W4):
    m = y.shape[0]
    dz4 = a4 - y  # gradient of sigmoid + BCE simplifies to (a4 - y)
    dW4 = a3.T @ dz4 / m
    db4 = np.sum(dz4, axis=0, keepdims=True) / m

    da3 = dz4 @ W4.T
    dz3 = da3 * relu_derivative(z3)
    dW3 = a2.T @ dz3 / m
    db3 = np.sum(dz3, axis=0, keepdims=True) / m

    da2 = dz3 @ W3.T
    dz2 = da2 * relu_derivative(z2)
    dW2 = a1.T @ dz2 / m
    db2 = np.sum(dz2, axis=0, keepdims=True) / m

    da1 = dz2 @ W2.T
    dz1 = da1 * relu_derivative(z1)
    dW1 = X.T @ dz1 / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m

    return dW1, db1, dW2, db2, dW3, db3, dW4, db4

# --------------------------
# Training
# --------------------------
def train(X, y, W1, b1, W2, b2, W3, b3, W4, b4):
    loss_list = []
    acc_list = []
    for epoch in range(epochs):
        epoch_loss = 0
        correct = 0
        total = 0
        pbar = tqdm(range(0, len(X), batch_size), desc=f"Epoch {epoch+1}/{epochs}")
        for i in pbar:
            X_batch = X[i:i+batch_size]
            y_batch = y[i:i+batch_size]

            z1, a1, z2, a2, z3, a3, z4, a4 = forward(X_batch, W1, b1, W2, b2, W3, b3, W4, b4)
            loss = binary_cross_entropy(a4, y_batch)
            epoch_loss += loss * len(X_batch)

            # accuracy
            pred = (a4 > 0.5).astype(int)
            correct += np.sum(pred == y_batch)
            total += len(X_batch)

            grads = backward(X_batch, y_batch, z1, a1, z2, a2, z3, a3, z4, a4, W2, W3, W4)
            dW1, db1, dW2, db2, dW3, db3, dW4, db4 = grads

            # SGD update
            W1 -= lr * dW1
            b1 -= lr * db1
            W2 -= lr * dW2
            b2 -= lr * db2
            W3 -= lr * dW3
            b3 -= lr * db3
            W4 -= lr * dW4
            b4 -= lr * db4

            pbar.set_postfix(loss=loss)
        avg_loss = epoch_loss / total
        acc = correct / total
        loss_list.append(avg_loss)
        acc_list.append(acc)
        print(f"Epoch {epoch+1}: avg_loss={avg_loss:.4f}, acc={acc:.4f}")
    return W1, b1, W2, b2, W3, b3, W4, b4, loss_list, acc_list
# --------------------------
# Prediction
# --------------------------
def predict(X, W1, b1, W2, b2, W3, b3, W4, b4):
    preds = []
    pbar = tqdm(range(0, len(X), batch_size), desc="Predicting")
    for i in pbar:
        X_batch = X[i:i+batch_size]
        _, _, _, _, _, _, _, a4 = forward(X_batch, W1, b1, W2, b2, W3, b3, W4, b4)
        preds.append(a4)
    preds = np.vstack(preds)
    return (preds > 0.5).astype(int), preds

# --------------------------
# Main flow
# --------------------------
W1, b1, W2, b2, W3, b3, W4, b4 = init_weights()
W1, b1, W2, b2, W3, b3, W4, b4, loss_list, acc_list = train(X_train, y_train, W1, b1, W2, b2, W3, b3, W4, b4)

# Plot the loss and accuracy curves
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(loss_list, label='Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True)
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(acc_list, label='Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training Accuracy')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.savefig('training_curve.png')
#plt.show()

# Predict on the test set
pred_labels, pred_probs = predict(X_test, W1, b1, W2, b2, W3, b3, W4, b4)

# Compute accuracy
accuracy = np.mean(pred_labels == y_test)
print(f"✅ Test Accuracy: {accuracy * 100:.2f}%")

# Save results to CSV
with open('prediction_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'true_label', 'predicted_label', 'probability'])
    for i, (true, pred, prob) in enumerate(zip(y_test, pred_labels, pred_probs)):
        writer.writerow([i, int(true[0]), int(pred[0]), float(prob[0])])

print("✅ Prediction results saved to prediction_results.csv")

requirement.txt

colorama==0.4.6
contourpy==1.3.0
cycler==0.12.1
fonttools==4.57.0
imageio==2.37.0
importlib_resources==6.5.2
kiwisolver==1.4.7
lazy_loader==0.4
matplotlib==3.9.4
networkx==3.2.1
numpy==1.24.4
packaging==25.0
pillow==11.2.1
pyparsing==3.2.3
python-dateutil==2.9.0.post0
scikit-image==0.24.0
scipy==1.13.1
six==1.17.0
tifffile==2024.8.30
tqdm==4.67.1
zipp==3.21.0