Hand-Rolling a ViT (Pattern Recognition Assignment)

Important note: I hand-implemented the Patch + Position Embedding and the Vision Transformer (detailed below), reaching 99.20% classification accuracy. However, even with the pretrained weights loaded, inference is very slow: 1,000 images take about 3 hours, so producing all four .npy feature files would take roughly 12 hours. Since I also have a research project to work on, I use the four .npy files provided by the instructor directly and hand-roll an MLP for the final prediction.

Planned observations: random initialization — dropped for lack of compute; swapping in different weights — dropped for lack of compute; tuning the MLP — dropped because accuracy is essentially at its ~99% ceiling, which makes any effect on recognition performance hard to observe.

Experiment workflow

  1. First inspect the file format and parameter naming of the instructor-provided DINOV2-base.npz (a minimal inspection sketch follows this list).
  2. Implement the Embedding step based on the shapes of the embedding weights and the diagram in the paper.
  3. Feed the task requirements and the parameter names to GPT to hand-roll the transformer, making sure it does not get the "names" wrong; the mask token can be omitted. Each image yields a full token tensor of shape (1, 1370, 768), with the CLS token (1, 768) at the front used for classification.
  4. The CLS token is the feature vector fed into the MLP head for classification; a simple MLP afterwards is enough for the final prediction.
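
Step 1 takes only a few lines of NumPy. A minimal inspection sketch (the checkpoint path is an assumption; point it at wherever DINOV2-base.npz lives):

import numpy as np

# print every parameter name and its shape in the checkpoint
weights = dict(np.load(r"D:\moshishibie\transformer\fangangnb\dinov2-base.npz"))
for name, arr in sorted(weights.items()):
    print(name, arr.shape)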

Hand-rolling the transformer

References: the ViT paper, DINO, and DINOv2.

Implementation

1. Image preprocessing and Patch Embedding

  • Function: image_to_patch_embedding_v2
  • What it does:
    • Resizes the input image to 518x518, then normalizes it to the mean and standard deviation expected by the DINOv2 pretrained model.
    • Splits the image into 14x14 patches with a convolution-style operation (simulating Patch Embedding) and projects each patch into a 768-dimensional vector space via the weight matrix. The patch output has shape [1369, 768], where 1369 comes from the 37x37 patch grid.
  • Input/Output:
    • Input: an image [H, W, 3]; 518x518 is the recommended size.
    • Output: an array of shape [1, 1369, 768], one 768-dimensional vector per patch.
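
The script later in this write-up extracts patches with an explicit double loop, which is slow. The same convolution-as-matmul trick can be vectorized with a single reshape/transpose; the following is an equivalent sketch (the name patch_embed_vectorized is mine, not from the script):

import numpy as np

def patch_embed_vectorized(x, W, B, p=14):
    # x: [1, 3, 518, 518] normalized image; W: [768, 3, 14, 14]; B: [768]
    _, C, H, Wd = x.shape
    gh, gw = H // p, Wd // p  # 37 x 37 = 1369 patches
    # carve non-overlapping p x p patches: [1, gh, gw, C, p, p]
    patches = x.reshape(1, C, gh, p, gw, p).transpose(0, 2, 4, 1, 3, 5)
    patches = patches.reshape(gh * gw, C * p * p)  # [1369, 588]
    # one matmul replaces the 1369 per-patch projections of the loop version
    return (patches @ W.reshape(768, -1).T + B)[np.newaxis]  # [1, 1369, 768]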

2. Adding the CLS token and position embeddings

  • Function: add_cls_and_position_v2
  • What it does:
    • After Patch Embedding, DINOv2 prepends a special CLS token ([CLS]) to the feature sequence. The CLS token aggregates information from the whole image and is typically used for downstream classification.
    • The function prepends the CLS token to the image features, adds the position embeddings, and returns the result.
  • Input/Output:
    • Input: the feature tensor [1, 1369, 768] from image_to_patch_embedding_v2, plus the position-embedding and CLS-token weights.
    • Output: a tensor of shape [1, 1370, 768] (1 CLS token + 1369 patch tokens), with position embeddings added.

3. Transformer inference

  • Functions: forward_transformer and transformer_block
  • What they do:
    • DINOv2 processes the image features with a 12-layer Transformer encoder. Each layer contains multi-head self-attention and a fully connected feed-forward network; after all layers, the CLS token represents the whole image.
    • Each layer computes the following (see the sketch after this list):
      • LayerNorm: normalize the input first (pre-norm).
      • QKV projection: three linear maps of the normalized input produce Query, Key, and Value.
      • Multi-head self-attention: the scaled dot product of Query and Key gives attention scores, which produce a weighted sum of the Values.
      • Feed-forward network: the attention output passes through a two-layer MLP with a GELU activation.
      • Residual connections: each sub-layer's output (scaled per channel by DINOv2's LayerScale, as in the code below) is added back to its input and passed on to the next layer.
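
Put together, one encoder layer is the standard pre-norm block plus DINOv2's LayerScale on each branch. A schematic sketch, where attention, mlp, and the ln*/ls* arguments stand in for the real weight lookups that transformer_block below performs:

def block_schematic(x, attention, mlp, ln1, ln2, ls1, ls2):
    # attention branch: pre-norm -> multi-head self-attention -> LayerScale -> residual
    x = x + ls1 * attention(layer_norm(x, *ln1))
    # MLP branch: pre-norm -> fc1 -> GELU -> fc2 -> LayerScale -> residual
    x = x + ls2 * mlp(layer_norm(x, *ln2))
    return x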

Experiment results

I first reduced the data volume to verify that the pipeline runs end-to-end and is correct:

MLP classification

The MLP stacks four linear layers (three hidden, one output):

  1. (768,256)
  2. (256,128)
  3. (128,128)
  4. (128,1) — binary classification output
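
As a quick size check, the four layers above hold about 246K parameters in total, tiny next to the 768-dimensional input features. A throwaway computation (not part of the original scripts):

# weights + biases for the four linear layers listed above
dims = [(768, 256), (256, 128), (128, 128), (128, 1)]
print(sum(i * o + o for i, o in dims))  # 246401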

Other hyperparameters:

lr = 0.001  # learning rate
epochs = 100
batch_size = 128

# activation functions
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# loss function
# --------------------------
def binary_cross_entropy(y_pred, y_true):
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
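
One caveat worth noting: the plain sigmoid above overflows in np.exp(-x) for large negative inputs and triggers a RuntimeWarning. A numerically stable variant, as a sketch (sigmoid_stable is my name; the scripts below keep the plain version):

def sigmoid_stable(x):
    # evaluate exp only on the side where it cannot overflow
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out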

Run results

1,000 cats and 1,000 dogs:

On my own prediction set of 5 cats and 5 dogs, accuracy is 100%:

Key code

transformer_hand.py

import numpy as np
from skimage.io import imread
from skimage.transform import resize
import matplotlib.pyplot as plt
from tqdm import tqdm
import os

# Embedding step
def image_to_patch_embedding_v2(image, weights):
    """
    Use the 14x14 patch-embedding weights to produce the patch tokens
    (the CLS token is prepended later, for 1370 tokens in total).
    image: [H, W, 3] input image, 518x518 recommended
    weights: dict containing the embedding parameters
    return: x_patch: [1, 1369, 768]
    """
    # Resize to 518x518 (14x14 patches tile into a 37x37 = 1369 grid)
    image_resized = resize(image, (518, 518), anti_aliasing=True)

    # Normalize
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    image_norm = (image_resized - mean) / std

    # [H, W, 3] -> [1, 3, H, W]
    x = image_norm.transpose(2, 0, 1)[np.newaxis, :]  # [1, 3, 518, 518]

    # Convolutional Patch Embedding weights
    W = weights["embeddings.patch_embeddings.projection.weight"]  # [768, 3, 14, 14]
    B = weights["embeddings.patch_embeddings.projection.bias"]    # [768]

    # Manual sliding window with stride 14
    stride = 14
    patch_size = 14
    num_patches = (518 // 14) ** 2  # = 1369

    patches = []
    for i in range(0, 518 - patch_size + 1, stride):
        for j in range(0, 518 - patch_size + 1, stride):
            patch = x[0, :, i:i+14, j:j+14]        # [3, 14, 14]
            patch_flat = patch.reshape(1, -1)      # [1, 3*14*14]
            W_flat = W.reshape(768, -1)            # [768, 3*14*14]
            patch_out = patch_flat @ W_flat.T + B  # [1, 768]
            patches.append(patch_out)

    x_patch = np.concatenate(patches, axis=0)  # [1369, 768]
    return x_patch[np.newaxis, :, :]           # [1, 1369, 768]

def add_cls_and_position_v2(x_patch, weights):
    """
    x_patch: [1, 1369, 768]
    weights: dict
    return: [1, 1370, 768]
    """
    cls_token = weights["embeddings.cls_token"]            # [1, 1, 768]
    pos_embed = weights["embeddings.position_embeddings"]  # [1, 1370, 768]

    x = np.concatenate([cls_token, x_patch], axis=1)  # [1, 1370, 768]
    x += pos_embed
    return x

# Transformer layers
def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * np.power(x, 3))))

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

def layer_norm(x, weight, bias, eps=1e-5):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    norm = (x - mean) / np.sqrt(var + eps)
    return norm * weight + bias

def transformer_block(x, weights, layer_id, num_heads=12):
    dim = x.shape[-1]
    head_dim = dim // num_heads

    prefix = f"encoder.layer.{layer_id}"

    # LayerNorm1
    x_ln1 = layer_norm(
        x,
        weights[f"{prefix}.norm1.weight"],
        weights[f"{prefix}.norm1.bias"]
    )

    # QKV Projection
    q = x_ln1 @ weights[f"{prefix}.attention.attention.query.weight"].T + weights[
        f"{prefix}.attention.attention.query.bias"]
    k = x_ln1 @ weights[f"{prefix}.attention.attention.key.weight"].T + weights[
        f"{prefix}.attention.attention.key.bias"]
    v = x_ln1 @ weights[f"{prefix}.attention.attention.value.weight"].T + weights[
        f"{prefix}.attention.attention.value.bias"]

    # Reshape to multi-head
    def reshape_heads(x):  # [B, num_heads, T, head_dim]
        B, T, _ = x.shape
        return x.reshape(B, T, num_heads, head_dim).transpose(0, 2, 1, 3)

    q = reshape_heads(q)
    k = reshape_heads(k)
    v = reshape_heads(v)

    # Scaled Dot-Product Attention
    attn_scores = np.matmul(q, k.transpose(0, 1, 3, 2)) / np.sqrt(head_dim)
    attn_weights = softmax(attn_scores, axis=-1)
    attn_output = np.matmul(attn_weights, v)  # [B, num_heads, T, head_dim]

    # Concatenate heads
    attn_output = attn_output.transpose(0, 2, 1, 3).reshape(x.shape)

    # Linear projection
    attn_output = attn_output @ weights[f"{prefix}.attention.output.dense.weight"].T + weights[
        f"{prefix}.attention.output.dense.bias"]

    # LayerScale1
    attn_output *= weights[f"{prefix}.layer_scale1.lambda1"]

    # Residual 1
    x = x + attn_output

    # LayerNorm2
    x_ln2 = layer_norm(
        x,
        weights[f"{prefix}.norm2.weight"],
        weights[f"{prefix}.norm2.bias"]
    )

    # MLP
    fc1 = x_ln2 @ weights[f"{prefix}.mlp.fc1.weight"].T + weights[f"{prefix}.mlp.fc1.bias"]
    fc1 = gelu(fc1)
    fc2 = fc1 @ weights[f"{prefix}.mlp.fc2.weight"].T + weights[f"{prefix}.mlp.fc2.bias"]

    # LayerScale2
    fc2 *= weights[f"{prefix}.layer_scale2.lambda1"]

    # Residual 2
    x = x + fc2

    return x

def forward_transformer(x, weights, num_layers=12):
    for i in range(num_layers):
        x = transformer_block(x, weights, i)

    # Final LayerNorm
    x = layer_norm(
        x,
        weights["layernorm.weight"],
        weights["layernorm.bias"]
    )

    # Extract the CLS token after the final LayerNorm, matching DINOv2's pooling; shape [batch_size, 768]
    cls_token = x[:, 0]

    return x, cls_token

# Inference for a single image
def dinov2_inference(image, weights):
    """
    Run embedding + transformer inference in one call.
    return: (tokens [1, 1370, 768], cls_token [1, 768])
    """
    x_patch = image_to_patch_embedding_v2(image, weights)  # [1, 1369, 768]
    x = add_cls_and_position_v2(x_patch, weights)          # [1, 1370, 768]
    return forward_transformer(x, weights)                 # (tokens, cls_token)

# Inference for every .jpg file in a folder
def process_images_in_folder(folder_path, weights):
    """
    Process all .jpg images in a folder and return their features and CLS tokens.
    folder_path: str, path to the folder containing the .jpg files
    weights: dict of DINOv2 model weights
    return: features, cls_token
    """
    # Collect all .jpg files in the folder
    image_files = [f for f in os.listdir(folder_path) if f.endswith('.jpg')]

    # Accumulate all features and CLS tokens
    features_list = []
    cls_token_list = []

    # Iterate over the image files
    for image_file in tqdm(image_files, desc="Processing Images", ncols=100):
        # Load the image
        image_path = os.path.join(folder_path, image_file)
        image = imread(image_path) / 255.0

        # Run inference; drop the leading batch dimension before stacking
        tokens, cls_token = dinov2_inference(image, weights)
        features_list.append(tokens[0])      # [1370, 768]
        cls_token_list.append(cls_token[0])  # [768]

    # Convert to numpy arrays
    features_array = np.array(features_list)    # [num_images, 1370, 768]
    cls_token_array = np.array(cls_token_list)  # [num_images, 768]

    return features_array, cls_token_array

# Usage example
weights = dict(np.load(r"D:\moshishibie\transformer\fangangnb\dinov2-base.npz"))
train_data_dir = r'D:\moshishibie\transformer\train'
test_data_dir = r'D:\moshishibie\transformer\test'
save_dir = r'D:\moshishibie\transformer\trans_result'

animals = ['cats', 'dogs']
for i in range(len(animals)):
    folder_path = os.path.join(train_data_dir, animals[i])
    features, cls_token = process_images_in_folder(folder_path, weights)
    np.save(os.path.join(save_dir, "cjh_train_" + animals[i] + "_features.npy"), features)
    np.save(os.path.join(save_dir, "cjh_train_" + animals[i] + "_cls_token.npy"), cls_token)

    folder_path = os.path.join(test_data_dir, animals[i])
    features, cls_token = process_images_in_folder(folder_path, weights)
    np.save(os.path.join(save_dir, "cjh_test_" + animals[i] + "_features.npy"), features)
    np.save(os.path.join(save_dir, "cjh_test_" + animals[i] + "_cls_token.npy"), cls_token)

MLPcls.py

import numpy as np
from tqdm import tqdm
import csv
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, avoids FigureCanvas errors; set before importing pyplot
import matplotlib.pyplot as plt

# --------------------------
# Hyperparameters
# --------------------------
input_dim = 768
hidden1 = 256
hidden2 = 128
hidden3 = 128
output_dim = 1  # binary classification
lr = 0.001
epochs = 20
batch_size = 128

# --------------------------
# Data loading and labels
# --------------------------
X_train_cat = np.load(r'D:\moshishibie\transformer\fangangnb\train_cat_features.npy')  # (4000, 768)
X_train_dog = np.load(r'D:\moshishibie\transformer\fangangnb\train_dog_features.npy')  # (4000, 768)
X_test_cat = np.load(r'D:\moshishibie\transformer\fangangnb\test_cat_features.npy')    # (1000, 768)
X_test_dog = np.load(r'D:\moshishibie\transformer\fangangnb\test_dog_features.npy')    # (1000, 768)

# Stack the training data; label counts must match the number of samples per class
X_train = np.vstack([X_train_cat, X_train_dog])  # (8000, 768)
y_train = np.array([0] * len(X_train_cat) + [1] * len(X_train_dog)).reshape(-1, 1)

# Stack the test data
X_test = np.vstack([X_test_cat, X_test_dog])  # (2000, 768)
y_test = np.array([0] * len(X_test_cat) + [1] * len(X_test_dog)).reshape(-1, 1)

# Shuffle the training data
perm = np.random.permutation(len(X_train))
X_train = X_train[perm]
y_train = y_train[perm]

# --------------------------
# Parameter initialization (He init for the ReLU layers)
# --------------------------
def init_weights():
    W1 = np.random.randn(input_dim, hidden1) * np.sqrt(2. / input_dim)
    b1 = np.zeros((1, hidden1))
    W2 = np.random.randn(hidden1, hidden2) * np.sqrt(2. / hidden1)
    b2 = np.zeros((1, hidden2))
    W3 = np.random.randn(hidden2, hidden3) * np.sqrt(2. / hidden2)
    b3 = np.zeros((1, hidden3))
    W4 = np.random.randn(hidden3, output_dim) * np.sqrt(2. / hidden3)
    b4 = np.zeros((1, output_dim))
    return W1, b1, W2, b2, W3, b3, W4, b4

# --------------------------
# Activation functions
# --------------------------
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# --------------------------
# Loss function
# --------------------------
def binary_cross_entropy(y_pred, y_true):
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# --------------------------
# Forward pass
# --------------------------
def forward(X, W1, b1, W2, b2, W3, b3, W4, b4):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    a2 = relu(z2)
    z3 = a2 @ W3 + b3
    a3 = relu(z3)
    z4 = a3 @ W4 + b4
    a4 = sigmoid(z4)
    return z1, a1, z2, a2, z3, a3, z4, a4

# --------------------------
# Backward pass
# --------------------------
def backward(X, y, z1, a1, z2, a2, z3, a3, z4, a4, W2, W3, W4):
    m = y.shape[0]
    dz4 = a4 - y  # gradient of sigmoid + BCE simplifies to (a4 - y)
    dW4 = a3.T @ dz4 / m
    db4 = np.sum(dz4, axis=0, keepdims=True) / m

    da3 = dz4 @ W4.T
    dz3 = da3 * relu_derivative(z3)
    dW3 = a2.T @ dz3 / m
    db3 = np.sum(dz3, axis=0, keepdims=True) / m

    da2 = dz3 @ W3.T
    dz2 = da2 * relu_derivative(z2)
    dW2 = a1.T @ dz2 / m
    db2 = np.sum(dz2, axis=0, keepdims=True) / m

    da1 = dz2 @ W2.T
    dz1 = da1 * relu_derivative(z1)
    dW1 = X.T @ dz1 / m
    db1 = np.sum(dz1, axis=0, keepdims=True) / m

    return dW1, db1, dW2, db2, dW3, db3, dW4, db4

# --------------------------
# Training
# --------------------------
def train(X, y, W1, b1, W2, b2, W3, b3, W4, b4):
    loss_list = []
    acc_list = []
    for epoch in range(epochs):
        epoch_loss = 0
        correct = 0
        total = 0
        pbar = tqdm(range(0, len(X), batch_size), desc=f"Epoch {epoch+1}/{epochs}")
        for i in pbar:
            X_batch = X[i:i+batch_size]
            y_batch = y[i:i+batch_size]

            z1, a1, z2, a2, z3, a3, z4, a4 = forward(X_batch, W1, b1, W2, b2, W3, b3, W4, b4)
            loss = binary_cross_entropy(a4, y_batch)
            epoch_loss += loss * len(X_batch)

            # accuracy
            pred = (a4 > 0.5).astype(int)
            correct += np.sum(pred == y_batch)
            total += len(X_batch)

            grads = backward(X_batch, y_batch, z1, a1, z2, a2, z3, a3, z4, a4, W2, W3, W4)
            dW1, db1, dW2, db2, dW3, db3, dW4, db4 = grads

            # SGD update
            W1 -= lr * dW1
            b1 -= lr * db1
            W2 -= lr * dW2
            b2 -= lr * db2
            W3 -= lr * dW3
            b3 -= lr * db3
            W4 -= lr * dW4
            b4 -= lr * db4

            pbar.set_postfix(loss=loss)
        avg_loss = epoch_loss / total
        acc = correct / total
        loss_list.append(avg_loss)
        acc_list.append(acc)
        print(f"Epoch {epoch+1}: avg_loss={avg_loss:.4f}, acc={acc:.4f}")
    return W1, b1, W2, b2, W3, b3, W4, b4, loss_list, acc_list
# --------------------------
# Prediction
# --------------------------
def predict(X, W1, b1, W2, b2, W3, b3, W4, b4):
    preds = []
    pbar = tqdm(range(0, len(X), batch_size), desc="Predicting")
    for i in pbar:
        X_batch = X[i:i+batch_size]
        _, _, _, _, _, _, _, a4 = forward(X_batch, W1, b1, W2, b2, W3, b3, W4, b4)
        preds.append(a4)
    preds = np.vstack(preds)
    return (preds > 0.5).astype(int), preds

# --------------------------
# Main flow
# --------------------------
W1, b1, W2, b2, W3, b3, W4, b4 = init_weights()
W1, b1, W2, b2, W3, b3, W4, b4, loss_list, acc_list = train(X_train, y_train, W1, b1, W2, b2, W3, b3, W4, b4)

# Plot the loss and accuracy curves
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(loss_list, label='Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True)
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(acc_list, label='Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training Accuracy')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.savefig('training_curve.png')
#plt.show()

# Predict on the test set
pred_labels, pred_probs = predict(X_test, W1, b1, W2, b2, W3, b3, W4, b4)

# Compute accuracy
accuracy = np.mean(pred_labels == y_test)
print(f"✅ Test Accuracy: {accuracy * 100:.2f}%")

# Save results to CSV
with open('prediction_results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'true_label', 'predicted_label', 'probability'])
    for i, (true, pred, prob) in enumerate(zip(y_test, pred_labels, pred_probs)):
        writer.writerow([i, int(true[0]), int(pred[0]), float(prob[0])])

print("✅ Prediction results saved to prediction_results.csv")

requirement.txt

colorama==0.4.6
contourpy==1.3.0
cycler==0.12.1
fonttools==4.57.0
imageio==2.37.0
importlib_resources==6.5.2
kiwisolver==1.4.7
lazy_loader==0.4
matplotlib==3.9.4
networkx==3.2.1
numpy==1.24.4
packaging==25.0
pillow==11.2.1
pyparsing==3.2.3
python-dateutil==2.9.0.post0
scikit-image==0.24.0
scipy==1.13.1
six==1.17.0
tifffile==2024.8.30
tqdm==4.67.1
zipp==3.21.0