MAXIM: Multi-Axis MLP for Image Processing
0. Preface
The paper was published at CVPR 2022.
Image processing tasks covered: denoising, deblurring, deraining, dehazing, restoration, and enhancement.
Chinese video walkthrough (very detailed, lots of background, beginner-friendly):
youtu.be/gpUrUJwZxRQ
1. Introduction
- MLP
An architecture that mimics neurons, built from fully connected layers and activation layers.
- Contribution
- A novel and generic architecture for image processing, named MAXIM.
- A multi-axis gated MLP module tailored for low-level vision tasks, which always enjoys a global receptive field with linear complexity relative to image size.
- A cross gating block that cross-conditions two separate features, which is also global and fully convolutional.
- Extensive experiments show that MAXIM achieves SOTA results on more than 10 datasets for 5 different restoration tasks.
2. Related Work
There remain challenges in adapting Transformers for low-level vision.
ViT performs very well on high-level tasks, but less well on low-level enhancement and restoration problems.
Prior work on low-level problems mostly relies on self-attention, which requires a fixed input image size, so the original image has to be cropped into patches.
Local-attention based Transformers alleviate this, but they are in turn limited to a bounded receptive field or lose non-locality.
2.1 Why MLP (Multi-Layer Perceptron)?
MLP -> Convolution (2012)
MLPs flatten the image into a one-dimensional vector; since pixels within an image patch are spatially correlated, flattening discards that positional structure, which motivated local convolution kernels.
Convolution -> Transformer (2017)
Images also need global context, so models evolved from local to global.
Transformer -> MLP?? (2021)
Capturing global context this way is computationally too expensive.
2.2 ViT
The Vision Transformer architecture: pure self-attention, no convolution.
2.3 MLP-Mixer
Reduces computation by dropping self-attention and using pure MLPs instead; results are slightly worse than ViT.
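To make "pure MLP instead of self-attention" concrete, here is a minimal NumPy sketch of a Mixer-style block (token-mixing MLP across patches, then channel-mixing MLP within each patch). Function names, shapes, and sizes are illustrative choices, not the official code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel (last) dimension.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

def mlp(x, w1, w2):
    # Two dense layers with a GELU in between.
    return gelu(x @ w1) @ w2

def mixer_block(x, p):
    # x: (tokens, channels). Token mixing operates across the token axis
    # (note the transposes); channel mixing operates within each token.
    x = x + mlp(layer_norm(x).T, p["tok_w1"], p["tok_w2"]).T
    x = x + mlp(layer_norm(x), p["ch_w1"], p["ch_w2"])
    return x

# Toy usage: 64 tokens (e.g. an 8x8 grid of patches), 32 channels.
rng = np.random.default_rng(0)
T, C, hid = 64, 32, 128
p = {"tok_w1": rng.normal(0, 0.02, (T, hid)), "tok_w2": rng.normal(0, 0.02, (hid, T)),
     "ch_w1": rng.normal(0, 0.02, (C, hid)), "ch_w2": rng.normal(0, 0.02, (hid, C))}
print(mixer_block(rng.normal(size=(T, C)), p).shape)  # (64, 32)
```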
2.4 gMLP
An improvement built on MLP-Mixer that likewise avoids self-attention.
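The key addition in gMLP is a spatial gating unit: the channels are split into two halves, one half is linearly projected along the token axis, and the result multiplicatively gates the other half. A rough NumPy sketch of that unit, with assumed toy shapes (not the paper's code):

```python
import numpy as np

def spatial_gating_unit(x, w_spatial, b_spatial):
    # x: (tokens, channels). Split channels into two halves u and v;
    # project v along the TOKEN axis (the spatial interaction), then use
    # it to gate u element-wise.
    u, v = np.split(x, 2, axis=-1)
    v = w_spatial @ v + b_spatial
    return u * v

# Toy usage: 64 tokens, 32 channels (split into 16 + 16).
rng = np.random.default_rng(0)
T, C = 64, 32
x = rng.normal(size=(T, C))
# gMLP initializes the spatial weights near zero and the bias at 1,
# so the unit starts out close to an identity mapping.
w_spatial = rng.normal(0, 0.01, (T, T))
b_spatial = np.ones((T, 1))
print(spatial_gating_unit(x, w_spatial, b_spatial).shape)  # (64, 16)
```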
2.5 Computational complexity of Transformer
Global models allow global spatial interaction over the input feature map: each output pixel is a weighted combination of every point of the input features, which takes O(N) multiplications (with N = HW the spatial size). Producing the whole output feature map of size N therefore takes O(N^2) multiplications, and this is where the high computational complexity of attention/Transformers comes from. Fundamentally, global models with dense receptive fields such as ViT, Mixer, and gMLP all have quadratic complexity.
A quadratic operator that cannot be scaled up is hard to use as a general-purpose module across vision tasks, for example object detection and semantic segmentation, which require high-resolution training/inference, and virtually all low-level vision tasks such as denoising, deblurring, super-resolution, deraining, dehazing, desnowing, shadow removal, demoiréing, reflection removal, watermark removal, demosaicing, and so on.
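A quick back-of-the-envelope comparison (spatial positions only, channels ignored) of full global mixing versus fixed-window local mixing illustrates the quadratic blow-up; the helper names below are my own:

```python
# Rough multiply counts for spatially mixing an H x W feature map
# (channels ignored): global mixing is quadratic in N = HW,
# mixing inside a fixed 8x8 window is linear in N.
def global_cost(h, w):
    n = h * w
    return n * n

def windowed_cost(h, w, win=8):
    return h * w * win * win

for s in (64, 128, 256, 512):
    print(f"{s}x{s}: global {global_cost(s, s):.1e}  windowed {windowed_cost(s, s):.1e}")
```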
That said, it can still be applied directly: the IPT model, built by Huawei, Peking University, and collaborators, was the first to apply ViT to multiple low-level vision tasks, topped various leaderboards, and was published at CVPR 2021 [9][10].
Despite the strong performance, the global attention used by IPT has some clear limitations:
(1) It needs large-scale pretraining data (e.g. ImageNet).
(2) It cannot run inference directly on high-resolution images. In practice the input image has to be cropped into patches, each patch is processed separately, and the outputs are stitched back into the full image. This often leaves visible "blocking artifacts" in the output and is also slow, limiting real-world deployment (a toy sketch of this tile-and-stitch procedure follows).
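The tile-and-stitch procedure behind these blocking artifacts is easy to sketch. The model below is a hypothetical stand-in whose independent per-tile statistics make the seams obvious; names and the patch size are illustrative.

```python
import numpy as np

def tiled_inference(image, model, patch=48):
    # Crop into non-overlapping patches, run the model on each patch
    # independently, and stitch the outputs back together. Because each
    # patch is processed with its own statistics/context, mismatches at
    # the borders show up as the "blocking artifacts" mentioned above.
    h, w, _ = image.shape
    out = np.zeros_like(image)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            out[y:y + patch, x:x + patch] = model(image[y:y + patch, x:x + patch])
    return out

# Toy stand-in for a restoration model: per-patch normalization,
# which makes the seams between patches easy to see.
toy_model = lambda t: (t - t.mean()) / (t.std() + 1e-6)
print(tiled_inference(np.random.rand(96, 96, 3), toy_model).shape)  # (96, 96, 3)
```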
2.6 HiT
A single block covers both local and global information: the channel dimension is split in half, with one half handling global interactions and the other half local ones.
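A rough NumPy sketch of this channel-split idea, with assumed window/grid sizes (my own function name and shapes, not HiT's code): half the channels are regrouped into b×b local windows, the other half into a d×d dilated grid whose members span the whole image, i.e. the local/global axes that MAXIM later builds on.

```python
import numpy as np

def split_local_global(x, b=4, d=4):
    # x: (H, W, C). First half of the channels is regrouped into b x b
    # local windows; the second half into a d x d grid whose members are
    # strided across the whole image, so mixing inside one grid group is global.
    h, w, c = x.shape
    x_local, x_global = np.split(x, 2, axis=-1)
    local = x_local.reshape(h // b, b, w // b, b, c // 2)
    local = local.transpose(0, 2, 1, 3, 4).reshape(-1, b * b, c // 2)
    grid = x_global.reshape(d, h // d, d, w // d, c // 2)
    grid = grid.transpose(1, 3, 0, 2, 4).reshape(-1, d * d, c // 2)
    return local, grid  # each: (groups, tokens_per_group, C/2)

local, grid = split_local_global(np.random.rand(16, 16, 8))
print(local.shape, grid.shape)  # (16, 16, 4) (16, 16, 4)
```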
3. Method
The main properties of MAXIM
- A global receptive field at linear complexity
- No constraint that the input image size be fixed
- Captures both global and local information
The backbone of MAXIM
- Multi-Axis gated MLP block
Complexity analysis
The computational complexity of the proposed Multi-Axis gated MLP block (MAB) is
Ω(MAB) = d²·HWC (global gMLP) + b²·HWC (local gMLP) + 10·HWC² (dense layers),
where H×W is the spatial size of the feature map, C the number of channels, b the local window size, and d the global grid size, so the cost is linear in the number of pixels HW.
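Plugging concrete numbers into the formula makes the linear scaling visible: doubling the resolution quadruples the pixel count HW and the cost grows by the same factor of 4 (a full quadratic attention map would grow by 16×). A quick check with assumed b = d = 8 and C = 32 (the helper name is my own):

```python
def mab_multiplies(h, w, c, b=8, d=8):
    # Ω(MAB) = d²·HWC (global gMLP) + b²·HWC (local gMLP) + 10·HWC² (dense layers)
    return d * d * h * w * c + b * b * h * w * c + 10 * h * w * c * c

prev = None
for s in (128, 256, 512):
    cost = mab_multiplies(s, s, c=32)
    print(f"{s}x{s}: {cost:.2e}" + (f"  ({cost / prev:.0f}x previous)" if prev else ""))
    prev = cost
```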
- Cross-gating MLP Block
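A minimal sketch of the cross-gating idea: two feature paths are projected, multiplicatively gate each other, and are added back residually. Names and projections are illustrative; the real CGB derives its gates through the multi-axis spatial projections described above, which are omitted here for brevity.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

def cross_gating_block(x, y, p):
    # x, y: (H, W, C) features from two different paths (e.g. an encoder
    # skip connection and a decoder feature). Each branch is projected,
    # the two branches gate each other multiplicatively, and the result
    # is added back residually, so information is exchanged across paths.
    x1, y1 = gelu(x @ p["wx"]), gelu(y @ p["wy"])
    return x + (x1 * y1) @ p["wx_out"], y + (y1 * x1) @ p["wy_out"]

rng = np.random.default_rng(0)
H, W, C = 16, 16, 8
p = {k: rng.normal(0, 0.02, (C, C)) for k in ("wx", "wy", "wx_out", "wy_out")}
xo, yo = cross_gating_block(rng.normal(size=(H, W, C)), rng.normal(size=(H, W, C)), p)
print(xo.shape, yo.shape)  # (16, 16, 8) (16, 16, 8)
```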
4. Experiment
5. Conclusion
MAXIM is inspired by the recently popular MLP-based global models.
1. A novel and generic architecture for image processing, dubbed MAXIM, using a stack of encoder-decoder backbones, supervised by a multi-scale, multi-stage loss.
2. A multi-axis gated MLP module tailored for low-level vision tasks, which always enjoys a global receptive field, with linear complexity relative to image size.
3. A cross gating block that cross-conditions two separate features, which is also global and fully convolutional.
4. Extensive experiments show that MAXIM achieves SOTA results on more than 10 datasets across five tasks: denoising, deblurring, deraining, dehazing, and enhancement.