MAXIM: Multi-Axis MLP for Image Processing
0. Preface
The paper was published at CVPR 2022.
Image processing tasks covered: denoising, deblurring, deraining, dehazing, restoration, and enhancement.
Chinese video walkthrough (very detailed, lots of background, beginner-friendly):
youtu.be/gpUrUJwZxRQ
1. Introduction
- MLP
An architecture that mimics neurons, built from fully connected layers and activation layers.
- Contribution
- A novel and generic architecture for image processing, named MAXIM.
- A multi-axis gated MLP module tailored for low-level vision tasks, which always enjoys a global receptive field with linear complexity relative to image size.
- A cross gating block that cross-conditions two separate features, which is also global and fully convolutional.
- Extensive experiments show that MAXIM achieves SOTA results on more than 10 datasets for 5 different restoration tasks.
2. Related Work
There remain challenges in adapting Transformers for low-level vision.
ViT performs very well on high-level tasks, but less well on low-level enhancement and restoration problems.
Prior work on low-level problems mostly relies on self-attention, which requires a fixed input image size, so the original image has to be cropped into patches.
Local-attention based Transformers alleviate this, but they are in turn limited to a bounded receptive field or lose non-locality.
2.1 Why MLP (Multi-Layer Perceptron)?
MLP -> Convolution (2012)
MLPs flatten the image into a one-dimensional vector; since pixels within an image patch are spatially correlated, flattening discards that positional structure, which motivated local convolution kernels.
Convolution -> Transformer (2017)
Images also need global context, so models evolved from local to global.
Transformer -> MLP?? (2021)
Capturing global context this way is computationally too expensive.
2.2 ViT
The Vision Transformer architecture: pure self-attention, no convolution.
2.3 MLP-Mixer
Reduces computation by dropping self-attention and using pure MLPs instead; results are slightly worse than ViT.
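To make "pure MLP instead of self-attention" concrete, here is a minimal NumPy sketch of a Mixer-style block (token-mixing MLP across patches, then channel-mixing MLP within each patch). Function names, shapes, and sizes are illustrative choices, not the official code.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel (last) dimension.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

def mlp(x, w1, w2):
    # Two dense layers with a GELU in between.
    return gelu(x @ w1) @ w2

def mixer_block(x, p):
    # x: (tokens, channels). Token mixing operates across the token axis
    # (note the transposes); channel mixing operates within each token.
    x = x + mlp(layer_norm(x).T, p["tok_w1"], p["tok_w2"]).T
    x = x + mlp(layer_norm(x), p["ch_w1"], p["ch_w2"])
    return x

# Toy usage: 64 tokens (e.g. an 8x8 grid of patches), 32 channels.
rng = np.random.default_rng(0)
T, C, hid = 64, 32, 128
p = {"tok_w1": rng.normal(0, 0.02, (T, hid)), "tok_w2": rng.normal(0, 0.02, (hid, T)),
     "ch_w1": rng.normal(0, 0.02, (C, hid)), "ch_w2": rng.normal(0, 0.02, (hid, C))}
print(mixer_block(rng.normal(size=(T, C)), p).shape)  # (64, 32)
```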
2.4 gMLP
An improvement built on MLP-Mixer that likewise avoids self-attention.
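The key addition in gMLP is a spatial gating unit: the channels are split into two halves, one half is linearly projected along the token axis, and the result multiplicatively gates the other half. A rough NumPy sketch of that unit, with assumed toy shapes (not the paper's code):

```python
import numpy as np

def spatial_gating_unit(x, w_spatial, b_spatial):
    # x: (tokens, channels). Split channels into two halves u and v;
    # project v along the TOKEN axis (the spatial interaction), then use
    # it to gate u element-wise.
    u, v = np.split(x, 2, axis=-1)
    v = w_spatial @ v + b_spatial
    return u * v

# Toy usage: 64 tokens, 32 channels (split into 16 + 16).
rng = np.random.default_rng(0)
T, C = 64, 32
x = rng.normal(size=(T, C))
# gMLP initializes the spatial weights near zero and the bias at 1,
# so the unit starts out close to an identity mapping.
w_spatial = rng.normal(0, 0.01, (T, T))
b_spatial = np.ones((T, 1))
print(spatial_gating_unit(x, w_spatial, b_spatial).shape)  # (64, 16)
```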
2.5 Computational complexity of Transformer
Global models allow global spatial interaction over the input feature map: each output pixel is a weighted combination of every point of the input features, which takes O(N) multiplications (with N = HW the spatial size). Producing the whole output feature map of size N therefore takes O(N^2) multiplications, and this is where the high computational complexity of attention/Transformers comes from. Fundamentally, global models with dense receptive fields such as ViT, Mixer, and gMLP all have quadratic complexity.
A quadratic operator that cannot be scaled up is hard to use as a general-purpose module across vision tasks, for example object detection and semantic segmentation, which require high-resolution training/inference, and virtually all low-level vision tasks such as denoising, deblurring, super-resolution, deraining, dehazing, desnowing, shadow removal, demoiréing, reflection removal, watermark removal, demosaicing, and so on.
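A quick back-of-the-envelope comparison (spatial positions only, channels ignored) of full global mixing versus fixed-window local mixing illustrates the quadratic blow-up; the helper names below are my own:

```python
# Rough multiply counts for spatially mixing an H x W feature map
# (channels ignored): global mixing is quadratic in N = HW,
# mixing inside a fixed 8x8 window is linear in N.
def global_cost(h, w):
    n = h * w
    return n * n

def windowed_cost(h, w, win=8):
    return h * w * win * win

for s in (64, 128, 256, 512):
    print(f"{s}x{s}: global {global_cost(s, s):.1e}  windowed {windowed_cost(s, s):.1e}")
```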
That said, it can still be applied directly: the IPT model, built by Huawei, Peking University, and collaborators, was the first to apply ViT to multiple low-level vision tasks, topped various leaderboards, and was published at CVPR 2021 [9][10].
Despite the strong performance, the global attention used by IPT has some clear limitations:
(1) It needs large-scale pretraining data (e.g. ImageNet).
(2) It cannot run inference directly on high-resolution images. In practice the input image has to be cropped into patches, each patch is processed separately, and the outputs are stitched back into the full image. This often leaves visible "blocking artifacts" in the output and is also slow, limiting real-world deployment (a toy sketch of this tile-and-stitch procedure follows).
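The tile-and-stitch procedure behind these blocking artifacts is easy to sketch. The model below is a hypothetical stand-in whose independent per-tile statistics make the seams obvious; names and the patch size are illustrative.

```python
import numpy as np

def tiled_inference(image, model, patch=48):
    # Crop into non-overlapping patches, run the model on each patch
    # independently, and stitch the outputs back together. Because each
    # patch is processed with its own statistics/context, mismatches at
    # the borders show up as the "blocking artifacts" mentioned above.
    h, w, _ = image.shape
    out = np.zeros_like(image)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            out[y:y + patch, x:x + patch] = model(image[y:y + patch, x:x + patch])
    return out

# Toy stand-in for a restoration model: per-patch normalization,
# which makes the seams between patches easy to see.
toy_model = lambda t: (t - t.mean()) / (t.std() + 1e-6)
print(tiled_inference(np.random.rand(96, 96, 3), toy_model).shape)  # (96, 96, 3)
```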
2.6 HiT
A single block covers both local and global information: the channel dimension is split in half, with one half handling global interactions and the other half local ones.
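A rough NumPy sketch of this channel-split idea, with assumed window/grid sizes (my own function name and shapes, not HiT's code): half the channels are regrouped into b×b local windows, the other half into a d×d dilated grid whose members span the whole image, i.e. the local/global axes that MAXIM later builds on.

```python
import numpy as np

def split_local_global(x, b=4, d=4):
    # x: (H, W, C). First half of the channels is regrouped into b x b
    # local windows; the second half into a d x d grid whose members are
    # strided across the whole image, so mixing inside one grid group is global.
    h, w, c = x.shape
    x_local, x_global = np.split(x, 2, axis=-1)
    local = x_local.reshape(h // b, b, w // b, b, c // 2)
    local = local.transpose(0, 2, 1, 3, 4).reshape(-1, b * b, c // 2)
    grid = x_global.reshape(d, h // d, d, w // d, c // 2)
    grid = grid.transpose(1, 3, 0, 2, 4).reshape(-1, d * d, c // 2)
    return local, grid  # each: (groups, tokens_per_group, C/2)

local, grid = split_local_global(np.random.rand(16, 16, 8))
print(local.shape, grid.shape)  # (16, 16, 4) (16, 16, 4)
```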
3. Method
The main properties of MAXIM
- A global receptive field at linear complexity
- No constraint that the input image size be fixed
- Captures both global and local information
The backbone of MAXIM
- Multi-Axis gated MLP block
Complexity analysis
The computational complexity of the proposed Multi-Axis gated MLP block (MAB) is
Ω(MAB) = d²·HWC (global gMLP) + b²·HWC (local gMLP) + 10·HWC² (dense layers),
where H×W is the spatial size of the feature map, C the number of channels, b the local window size, and d the global grid size, so the cost is linear in the number of pixels HW.
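Plugging concrete numbers into the formula makes the linear scaling visible: doubling the resolution quadruples the pixel count HW and the cost grows by the same factor of 4 (a full quadratic attention map would grow by 16×). A quick check with assumed b = d = 8 and C = 32 (the helper name is my own):

```python
def mab_multiplies(h, w, c, b=8, d=8):
    # Ω(MAB) = d²·HWC (global gMLP) + b²·HWC (local gMLP) + 10·HWC² (dense layers)
    return d * d * h * w * c + b * b * h * w * c + 10 * h * w * c * c

prev = None
for s in (128, 256, 512):
    cost = mab_multiplies(s, s, c=32)
    print(f"{s}x{s}: {cost:.2e}" + (f"  ({cost / prev:.0f}x previous)" if prev else ""))
    prev = cost
```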
- Cross-gating MLP Block
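A minimal sketch of the cross-gating idea: two feature paths are projected, multiplicatively gate each other, and are added back residually. Names and projections are illustrative; the real CGB derives its gates through the multi-axis spatial projections described above, which are omitted here for brevity.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(0.79788456 * (x + 0.044715 * x ** 3)))

def cross_gating_block(x, y, p):
    # x, y: (H, W, C) features from two different paths (e.g. an encoder
    # skip connection and a decoder feature). Each branch is projected,
    # the two branches gate each other multiplicatively, and the result
    # is added back residually, so information is exchanged across paths.
    x1, y1 = gelu(x @ p["wx"]), gelu(y @ p["wy"])
    return x + (x1 * y1) @ p["wx_out"], y + (y1 * x1) @ p["wy_out"]

rng = np.random.default_rng(0)
H, W, C = 16, 16, 8
p = {k: rng.normal(0, 0.02, (C, C)) for k in ("wx", "wy", "wx_out", "wy_out")}
xo, yo = cross_gating_block(rng.normal(size=(H, W, C)), rng.normal(size=(H, W, C)), p)
print(xo.shape, yo.shape)  # (16, 16, 8) (16, 16, 8)
```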
4. Experiment
5. Conclusion
MAXIM is inspired by the recently popular MLP-based global models.
1. A novel and generic architecture for image processing, dubbed MAXIM, using a stack of encoder-decoder backbones, supervised by a multi-scale, multi-stage loss.
2. A multi-axis gated MLP module tailored for low-level vision tasks, which always enjoys a global receptive field, with linear complexity relative to image size.
3. A cross gating block that cross-conditions two separate features, which is also global and fully convolutional.
4. Extensive experiments show that MAXIM achieves SOTA results on more than 10 datasets across five tasks: denoising, deblurring, deraining, dehazing, and enhancement.