参考链接#

参考链接(Reference)汇总了 AI 系统各章节引用的文献与网络链接。

这里在二次文献中标注出与一次文献的网络链接关系,实现二次文献与原始全文之间的直接跳转。

一. AI 系统概述#

  1. Silver, D., Huang, A., Maddison, C. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). https://doi.org/10.1038/nature16961

  2. McCulloch, W.S., Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5, 115–133 (1943).

  3. Rosenblatt, F. The Perceptron: A Perceiving and Recognizing Automaton. Technical Report 85-460-1, Cornell Aeronautical Laboratory, Ithaca, New York, January 1957.

  4. Widrow, B. (1960). Adaptive "Adaline" Neuron Using Chemical "Memistors". Technical Report 1553-2, Stanford Electronics Laboratories, Stanford, CA.

  5. Minsky, M., Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. Cambridge, MA, USA: MIT Press.

  6. Werbos, Paul J. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. (1974).

  7. Rina Dechter. 1986. Learning while searching in constraint-satisfaction-problems. In Proceedings of the Fifth AAAI National Conference on Artificial Intelligence (AAAI'86). AAAI Press, 178–183.

  8. Y. LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," in Neural Computation, vol. 1, no. 4, pp. 541-551, Dec. 1989, doi: 10.1162/neco.1989.1.4.541.

  9. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006 Jul 28;313(5786):504-7. doi: 10.1126/science.1127647. PMID: 16873662.

  10. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

  11. Dong Yu, Frank Seide, and Gang Li. 2012. Conversational speech transcription using context-dependent deep neural networks. In Proceedings of the 29th International Conference on Machine Learning (ICML'12). Omnipress, Madison, WI, USA, 1–2.

  12. Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng. 2012. Building high-level features using large scale unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning (ICML'12). Omnipress, Madison, WI, USA, 507–514.

  13. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (June 2017), 84–90. https://doi.org/10.1145/3065386

  14. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: an imperative style, high-performance deep learning library. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 721, 8026–8037.

  15. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: a system for large-scale machine learning. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, USA, 265–283.

二. AI 硬件体系结构#

  1. https://www.knime.com/blog/a-friendly-introduction-to-deep-neural-networks

  2. https://machine-learning.paperspace.com/wiki/activation-function

  3. https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/

  4. https://arxiv.org/pdf/1704.04861

  5. 英伟达 GPU 架构白皮书:https://www.nvidia.cn/technologies/

  6. In-Datacenter Performance Analysis of a Tensor Processing Unit

  7. [An in-depth look at Google's first Tensor Processing Unit (TPU)](https://cloud.google.com/blog/products/ai-machine-learning/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu)

  8. Google Tensor G3: The new chip that gives your Pixel an AI upgrade

  9. Wikipedia: Tensor Processing Unit

  10. A Domain-Specific Supercomputer for Training Deep Neural Networks

[1] Chen T, Du Z, Sun N, et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning[C]//International Conference on Architectural Support for Programming Languages & Operating Systems. ACM, 2014. DOI:10.1145/2541940.2541967.

[2] Chen Y, Luo T, Liu S, et al. DaDianNao: A Machine-Learning Supercomputer[C]//2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. DOI:10.1109/MICRO.2014.58.

[3] Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng, X., Chen, Y., Temam, O., 2015. ShiDianNao: shifting vision processing closer to the sensor, in: Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, Portland, Oregon, pp. 92–104. https://doi.org/10.1145/2749469

[4] Liu, D., Chen, T., Liu, S., Zhou, J., Zhou, S., Temam, O., Feng, X., Zhou, X., Chen, Y., 2015. PuDianNao: A Polyvalent Machine Learning Accelerator, in: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, Istanbul, Turkey, pp. 369–381. https://doi.org/10.1145/2694344

[5] Liu S, Du Z, Tao J, et al. Cambricon: An Instruction Set Architecture for Neural Networks[C]//ACM/IEEE International Symposium on Computer Architecture. 2016.

[6] 寒武纪 CAMBRICON BANG C/C++ 编程指南

[7] 陈云霁,李玲,李威,郭崎,杜子东,2020. 《智能计算系统》, 机械工业出版社

[1] 未名超算队. "北大未名超算队高性能计算入门讲座(一):概论." Bilibili, [https://www.bilibili.com/video/BV1814y1g7YC/]

[2] 专用架构与 AI 软件栈(1). Zhihu, [https://zhuanlan.zhihu.com/p/387269513]

[3] "AMD’s CDNA 3 Compute Architecture." Chips and Cheese, [https://chipsandcheese.com/2023/12/17/amds-cdna-3-compute-architecture/]

[4] CUDA 生态才是英伟达 AI 霸主护城河-深度分析 2024. WeChat, [https://mp.weixin.qq.com/s/VGej8Jjags5v0JsHIuf_tQ]

[1] "David Patterson: A Decade of Machine Learning Accelerators:Lessons Learned and Carbon Footprint" YouTube, [https://www.youtube.com/watch?v=PLK3pGELbSs]

[2] "TPU 演进十年:谷歌的十大经验教训" 知乎, [https://zhuanlan.zhihu.com/p/573794328]

三. AI 编译器#

C 语言中常见的操作还包括对数组和结构体的访问、内置函数和外部函数的引用等,这些操作在 LLVM IR 中都有对应的表示;更深入的内容可以参考下列第 6 条《简单了解 LLVM IR 基本语法》(CSDN 博客),一个简单的对应示例见下方代码。
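
下面给出一个极简的 C 示例(仅作示意,文件名与编译命令为假设,并非本书配套代码),说明数组访问、结构体访问与外部函数引用在经 clang 生成的 LLVM IR 中大致对应的指令与声明:

```c
/* demo.c:用 clang -S -emit-llvm demo.c -o demo.ll 可以查看对应的 LLVM IR */
#include <stdio.h>   /* printf 是外部函数,IR 中表现为 declare i32 @printf(...) */

struct Point {
    int x;
    int y;
};

int main(void) {
    int arr[4] = {1, 2, 3, 4};   /* 数组:IR 类型为 [4 x i32],下标访问编译为 getelementptr 指令 */
    struct Point p = {10, 20};   /* 结构体:IR 类型为 %struct.Point,字段访问同样使用 getelementptr */
    int sum = arr[2] + p.x + p.y;
    printf("sum = %d\n", sum);   /* 外部函数调用:call i32 (...) @printf */
    return 0;
}
```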

  1. https://zh.wikipedia.org/wiki/三位址碼

  2. https://buaa-se-compiling.github.io/miniSysY-tutorial/pre/llvm_ir_quick_primer.html

  3. https://llvm-tutorial-cn.readthedocs.io/en/latest/chapter-2.html

  4. https://buaa-se-compiling.github.io/miniSysY-tutorial/pre/llvm_ir_ssa.html

  5. https://buaa-se-compiling.github.io/miniSysY-tutorial/pre/design_hints.html

  6. 简单了解 LLVM IR 基本语法-CSDN 博客

  7. https://learning.acm.org/techtalks/computerarchitecture

  8. https://segmentfault.com/a/1190000041739045

四. 推理系统&引擎#

  1. Deep Learning Inference in Meta Data Centers: Characterization, Performance Optimizations and Hardware Implications

  2. Clipper: A Low-Latency Online Prediction Serving System

  3. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

  4. TensorFlow-Serving: Flexible, High-Performance ML Serving

  5. Optimal Aggregation Policy for Reducing Tail Latency of Web Search

  6. A Survey of Model Compression and Acceleration for Deep Neural Networks

  7. CSE 599W: System for ML - Model Serving

  8. https://developer.nvidia.com/deep-learning-performance-training-inference

  9. DEEP COMPRESSION: COMPRESSING DEEP NEURAL NETWORKS WITH PRUNING, TRAINED QUANTIZATION AND HUFFMAN CODING

  10. Learning both Weights and Connections for Efficient Neural Networks

  11. DEEP LEARNING DEPLOYMENT WITH NVIDIA TENSORRT

  12. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines

  13. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

  14. 8-bit Inference with TensorRT

  15. microsoft/AI-System

  16. 推理系统&引擎

  17. NCNN、OpenVino、TensorRT、MediaPipe、ONNX,各种推理部署架构,到底哪家强?

  18. 【AI System】第 8 章:深度学习推理系统

  19. 【AI】推理系统和推理引擎的整体架构

  20. 模型推理服务化之 Triton:如何基于 Triton 开发自己的推理引擎?

  21. Tengine-Kit 人脸检测及关键点

  22. Crazy Rockets-教你如何集成华为 HMS ML Kit 人脸检测和手势识别打造爆款小游戏

  23. 记录自己神经网络模型训练的全流程

  24. 推理系统和推理引擎的整体架构

  25. Pytorch-Onnx-Tensorrt 模型转换教程案例

  26. 昇思 MindSpore 基本介绍

  27. 飞桨产品全景

  28. J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, “MoDNN: Local distributed mobile computing system for deep neural network,” in Proc. Design, Autom. Test Eur. Conf. Exhibit. (DATE), Mar. 2017, pp. 1396–1401.

  29. Z. Zhao, K. M. Barijough, and A. Gerstlauer, “DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2348–2359, Nov. 2018.

1.Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012

2.Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (June 2017), 84–90. https://doi.org/10.1145/3065386

3.Silver, D., Huang, A., Maddison, C. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016). https://doi.org/10.1038/nature16961

4.Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

5.Mohamed S Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell,Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C Ling, et al. Dla: Compiler and fpga overlay for neural network inference acceleration. In International Conference on Field Programmable Logic and Applications, pages 411–4117. IEEE, 2018.

1.François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv:1610.02357, 2016.

2.Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv:1611.05431, 2016.

3.Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

4.Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).

5.Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

6.Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: 2017 IEEE International Conference on Computer Vision (ICCV), IEEE (2017) 2755–2763

7.Zhang, X., Zou, J., Ming, X., He, K., Sun, J.: Efficient and accurate approximations of nonlinear convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1984–1992

8.Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38(10) (2016) 1943–1955

9.Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B.,Shelhamer, E.: cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)

10.Das, D., Avancha, S., Mudigere, D., Vaidynathan, K., Sridharan, S., Kalamkar,D., Kaul, B., Dubey, P.: Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709 (2016)

11.Ioannou, Y., Robertson, D., Cipolla, R., Criminisi, A.: Deep roots: Improving cnn efficiency with hierarchical filter groups. arXiv preprint arXiv:1605.06489 (2016)

12.Zhang, T., Qi, G.J., Xiao, B., Wang, J.: Interleaved group convolutions for deep neural networks. In: International Conference on Computer Vision. (2017)

13.Xie, G., Wang, J., Zhang, T., Lai, J., Hong, R., Qi, G.J.: Igcv 2: Interleaved structured sparse convolutional neural networks. arXiv preprint arXiv:1804.06202(2018)

14.Sun, K., Li, M., Liu, D., Wang, J.: Igcv3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178 (2018)

15.Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI. Volume 4. (2017)

16.Zhang, X., Zou, J., He, K., Sun, J.: Accelerating very deep convolutional networks for classification and detection. IEEE transactions on pattern analysis and machine intelligence 38(10) (2016) 1943–1955

17.Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B.,Shelhamer, E.: cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)

18.O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,et al. Imagenet large scale visual recognition challenge.International Journal of Computer Vision, 115(3):211–252,2015.

19.S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou.Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016. 2

20.B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017. 1,

1.M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen,C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al.TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from TensorFlow. org, 1,2015.

2.I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. arXiv preprint arXiv:1609.07061, 2016. 2

3.F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. arXiv preprint arXiv:1602.07360, 2016. 1, 6

4.S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.arXiv preprint arXiv:1502.03167, 2015.

5.M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions.arXiv preprint arXiv:1405.3866, 2014. 2

6.Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 4

7. J. Jin, A. Dundar, and E. Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014. 1, 3

8.A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei.Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition,Colorado Springs, CO, June 2011. 6

9.J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev,T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. arXiv preprint arXiv:1511.06789, 2015. 6

10.R. Avenash and P. Vishawanth. Semantic segmentation of satellite images using a modified cnn with hard-swish activation function. In VISIGRAPP, 2019. 2, 4

11. Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi,Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR,2017. 7

12.Wei Liu, Dragomir Anguelov, Dumitru Erhan,Christian Szegedy, Scott Reed, Cheng-Yang Fu,and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, 2016.

13.Jonathan Huang, Vivek Rathod, Derek Chow,Chen Sun, and Menglong Zhu. TensorFlow object detection api, 2017. 7

14.Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017. 7

15.Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. 1989. 7

16.Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229, 2013. 7

17.George Papandreou, Iasonas Kokkinos, and Pierre-André Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015. 7

18.T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014. 7

19.C. Liu, B. Zoph, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. CoRR, abs/1712.00559, 2017. 2

20.H. Liu, K. Simonyan, and Y. Yang. DARTS: differentiable architecture search. CoRR, abs/1806.09055, 2018. 2

21.W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. CoRR, abs/1506.04579, 2015. 7

22. J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. 8

23.S. Mehta, M. Rastegari, A. Caspi, L. G. Shapiro, and H. Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part X, pages 561–580, 2018. 8

24.S. Mehta, M. Rastegari, L. G. Shapiro, and H. Hajishirzi. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. CoRR, abs/1811.11431, 2018.

25.H. Park, Y. Yoo, G. Seo, D. Han, S. Yun, and N. Kwak. Concentrated-comprehensive convolutions for lightweight semantic segmentation. CoRR, abs/1812.04920, 2018. 8

26.H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. CoRR, abs/1802.03268, 2018. 2

27.P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017. 2, 4

28.F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016. 2

29.J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng. Quantized convolutional neural networks for mobile devices. CoRR, abs/1512.06473, 2015. 2

30.S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.

31.Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. Fastvit: A fast hybrid vision transformer using structural reparameterization. arXiv preprint arXiv:2303.14189, 2023.

1.Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. (2017)

2.He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: ECCV. (2014)

3.Ess, A., Muller, T., Grabner, H., Van Gool, L.J.: Segmentation-based urban traffic scene understanding. In: BMVC. (2009)

4.Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: CVPR. (2015)

5.Xiang, Y., Fox, D.: DA-RNN: Semantic mapping with data associated recurrent neural networks. Robotics: Science and Systems (RSS) (2017)

6.Chollet, F.: Xception: Deep learning with depthwise separable convolutions. CVPR (2017)

7.Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. ICLR (2016)

8.Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. CVPR (2017)

9.Zhao, H., Qi, X., Shen, X., Shi, J., Jia, J.: Icnet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545 (2017)

10.Dai, J., He, K., Sun, J.: Convolutional feature masking for joint object and stuff segmentation.In: CVPR. (2015)

11.Tao Lei, Yu Zhang, and Yoav Artzi. Training rnns as fast as cnns. In EMNLP, 2018. 8

12.Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017. 5

13.Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011. 6

14.Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578,2016.2

15.M. Siam, M. Gamal, M. Abdel-Razek, S. Yogamani, and M.Jagersand. rtseg: Real-time semantic segmentation comparative study. In 2018 25th IEEE International Conference on Image Processing (ICIP).7

1.Anonymous. Snas: stochastic neural architecture search. In Submitted to International Conference on Learning Representations, 2019. under review.

2.K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages770–778, 2016.

3.G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger.Condensenet: An efficient densenet using learned group convolutions. group, 3(12):11, 2017.

4.E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144,2016.

5.D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

6.H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

7.X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arxiv 2017. arXiv preprint arXiv:1707.01083.

8.T. Veniat and L. Denoyer. Learning time/memory-efficient deep architectures with budgeted super networks. arXiv preprint arXiv:1706.00046, 2017.

9.T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. Energy, 41:46, 2018.

10.Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy aware pruning. arXiv preprint arXiv:1611.05128, 2016.

11.Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc Le. Learning transferable architectures for scalable image recognition. pages 8697–8710, 06 2018

12.Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ICLR, 2018. 5

13.Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. arXiv preprint arXiv:1807.11164, 2018.

14.Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.

15.M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le.Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

16. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database.In CVPR, 2009. 5

17.Piotr Dollár, Mannat Singh, and Ross Girshick. Fast and accurate model scaling. arXiv preprint arXiv:2103.06877, 2021. 12

18.Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts:Differentiable architecture search. ICLR, 2019. 3

19.Jieru Mei, Yingwei Li, Xiaochen Lian, Xiaojie Jin, Linjie Yang, Alan Yuille, and Jianchao Yang. Atomnas: Fine-grained end-to-end neural architecture search. ICLR, 2020. 7

20.Linnan Wang, Yiyang Zhao, Yuu Jinnai, Yuandong Tian,and Rodrigo Fonseca. Neural architecture search using deep neural networks and monte carlo tree search. In AAAI, 2020.

1.Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,K. Q. Densely connected convolutional networks. CVPR,2017.

2.Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? CVPR, 2019.

3.Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009.

4.Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172–6181, 2018.

5.Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2:Practical guidelines for efficient cnn architecture design.ECCV, 2018.

6.Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition.CVPR, 2018.

7.Zagoruyko, S. and Komodakis, N. Wide residual networks.BMVC, 2016.

8.Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba,A. Learning deep features for discriminative localization.CVPR, pp. 2921–2929, 2016.

9.Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. ICLR, 2018.

10.Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009.

11.Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self training with noisy student improves imagenet classification. CVPR, 2020.

12.Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D.Mixup: Beyond empirical risk minimization. ICLR, 2018.

13. Ridnik, T., Lawen, H., Noy, A., Baruch, E. B., Sharir,G., and Friedman, I. Tresnet: High performance gpu dedicated architecture. arXiv preprint arXiv:2003.13630,2020.

14.Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. CVPR, 2018.

15.Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width. NeurIPS, 2018.

1.Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In ICLR, 2019.

2.Hanting Chen, Yunhe Wang, Chang Xu, Zhaohui Yang,Chuanjian Liu, Boxin Shi, Chunjing Xu, Chao Xu, and Qi Tian. Data-free learning of student networks. In ICCV, 2019.

3.Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2016.

4.Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture search for generative adversarial networks. In ICCV, 2019.

5.Kai Han, Jianyuan Guo, Chao Zhang, and Mingjian Zhu.Attribute-aware attention model for fine-grained representation learning. In ACM MM, 2018.

6.Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.

7.Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In ICLR, 2019.

8.Mingzhu Shen, Kai Han, Chunjing Xu, and Yunhe Wang. Searching for accurate binary neural architectures. In ICCV Workshops, 2019.

9.Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, pages 8697–8710, 2018.

10.Shan You, Chang Xu, Chao Xu, and Dacheng Tao. Learning from multiple teacher networks. In SIGKDD, 2017.

11.Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

12.Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.

13.Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV),pages 116–131, 2018.

14.Sachin Mehta and Mohammad Rastegari. Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.

15.Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

1.Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 3286–3295, 2019.

2.Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossVit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021a.

3.Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

4.Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. arXiv preprint arXiv:2108.05895,2021b.

5.François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, 2017.

6.Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment:Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123, 2019.

7.Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.

8.Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697, 2021.

9.Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

10.Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. PMLR, 2019.

11.Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. Advances in Neural Information Processing Systems, 34:30392–30400, 2021.

12.Yufei Xu, Qiming Zhang, Jing Zhang, and Dacheng Tao. Vitae: Vision transformer advanced by exploring intrinsic inductive bias. Advances in Neural Information Processing Systems, 34:28522–28535, 2021b.

13.Qinglong Zhang and Yu-Bin Yang. Rest: An efficient transformer for visual recognition. Advances in Neural Information Processing Systems, 34:15475–15485, 2021.

14.Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

15.Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

16.Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.

17.Apoorv Vyas, Angelos Katharopoulos, and François Fleuret. Fast transformers with clustered attention.Advances in Neural Information Processing Systems, 33:21665–21674, 2020.

18.Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

19.Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

20.Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021.

1.Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 2, 4, 5, 7, 8

2.Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, and Zicheng Liu. Dynamic relu. In ECCV, 2020. 2, 3, 4, 6

3.Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 5

4.Stéphane d'Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697, 2021. 2, 3, 5, 6

5.Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5, 6, 12

6.Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652, 2021. 2

7.Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 1, 2, 3

8.Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet's clothing for faster inference. arXiv preprint arXiv:2104.01136, 2021. 1, 2, 3, 6

9.Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 3, 7, 8

10.Geoffrey E. Hinton. How to represent part-whole hierarchies in a neural network. CoRR, abs/2102.12627, 2021. 2

11.Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu,Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision(ICCV), October 2019. 1, 2, 4, 5, 6, 7, 8, 11, 12

12.Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. 1, 2

13.Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018. 2

14.Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu,Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers, 2021. 1, 2, 3

15.Daquan Zhou, Qibin Hou, Y. Chen, Jiashi Feng, and S. Yan. Rethinking bottleneck structure for efficient mobile network design. In ECCV, August 2020. 2

1.Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6, 8

2.Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. arXiv preprint arXiv:2204.07118, 2022. 13

3.Zizheng Pan, Jianfei Cai, and Bohan Zhuang. Fast vision transformers with hilo attention. arXiv preprint arXiv:2205.13213, 2022. 1

4.Noam Shazeer, Zhenzhong Lan, Youlong Cheng, Nan Ding, and Le Hou. Talking-heads attention. arXiv preprint arXiv:2003.02436, 2020. 4

5.Chenyang Si, Weihao Yu, Pan Zhou, Yichen Zhou, Xinchao Wang, and Shuicheng Yan. Inception transformer. arXiv preprint arXiv:2205.12956, 2022. 1, 2, 4

6.Wenqiang Zhang, Zilong Huang, Guozhong Luo, Tao Chen,Xinggang Wang, Wenyu Liu, Gang Yu, and Chunhua Shen.Topformer: Token pyramid transformer for mobile semantic segmentation, 2022. 2

7.Zizhao Zhang, Han Zhang, Long Zhao, Ting Chen, Sercan Arik, and Tomas Pfister. Nested hierarchical transformer:Towards accurate, data-efficient and interpretable visual understanding. 2022. 2

8.Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng,and Shuicheng Yan. Metaformer is actually what you need for vision. arXiv preprint arXiv:2111.11418, 2021

9.Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems, 34:9895–9907, 2021.

10.Sachin Mehta and Mohammad Rastegari. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.

1.Khalid Ashraf, Bichen Wu, Forrest N. Iandola, Matthew W. Moskewicz, and Kurt Keutzer. Shallow networks for high-accuracy road object-detection. arXiv:1606.01561, 2016.

2.Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.

3.Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274, 2015a.

4.Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531, 2013.

5.Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Shijian Tang, Erich Elsen, Bryan Catanzaro, John Tran, and William J. Dally. Dsd: Regularizing deep neural networks with dense-sparse-dense training flow. arXiv:1607.04381, 2016b

6.C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. Neuflow: A runtime reconfigurable dataflow processor for vision. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 109–116, 2011.

7.M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.

8.M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi.Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542, 2016.

9. S. Williams, A. Waterman, and D. Patterson. Roofline:an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65–76, 2009.

10.B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Convolutional neural nets with recurrent crf for real-time road-object segmentation from 3d lidar point cloud. arXiv preprint arXiv:1710.07368, 2017.

11.K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.arXiv preprint arXiv:1409.1556, 2014.

12.K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

13.S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456, 2015.

14.S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations(ICLR), 2016.

15.A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko,W. Wang, T. Weyand, M. Andreetto, and H. Adam.Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

  • Learning Accurate Low-Bit Deep Neural Networks with Stochastic Quantization

  • Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks(ICCV 2019)

  • IR-Net: Forward and Backward Information Retention for Highly Accurate Binary Neural Networks(CVPR 2020)

  • Towards Unified INT8 Training for Convolutional Neural Network(CVPR 2020)

  • Rotation Consistent Margin Loss for Efficient Low-bit Face Recognition(CVPR 2020)

  • DMS: Differentiable diMension Search for Binary Neural Networks(ICLR 2020 Workshop)

  • Nagel, Markus, et al. "A white paper on neural network quantization." arXiv preprint arXiv:2106.08295 (2021).

  • Krishnamoorthi, Raghuraman. "Quantizing deep convolutional networks for efficient inference: A whitepaper." arXiv preprint arXiv:1806.08342 (2018)

  • 全网最全-网络模型低比特量化 https://zhuanlan.zhihu.com/p/453992336

  • Practical Quantization in PyTorch

  • Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

  • Wu, Hao, et al. "Integer quantization for deep learning inference: Principles and empirical evaluation." arXiv preprint arXiv:2004.09602 (2020).

  • Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., & Keutzer, K. (2021). A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630.

  • 8-bit Inference with TensorRT

  1. Jianping Gou et al. Knowledge Distillation: A Survey. https://doi.org/10.1007/s11263-021-01453-z

  2. Hinton et al. Distilling the Knowledge in a Neural Network. http://arxiv.org/abs/1503.02531

  3. Longhui Wei et al. Circumventing outlier of autoaugment with knowledge distillation. https://doi.org/10.1007/978-3-030-58580-8_36

  4. Caruana et al. Model compression. https://doi.org/10.1145/1150402.1150464

  5. 模型压缩(上)--知识蒸馏(Distilling Knowledge)https://www.jianshu.com/p/a6d87b338bcf

  6. DeiT:注意力也能蒸馏 https://www.cnblogs.com/ZOMI/p/16496326.html

  7. AI 框架部署方案之模型转换

  8. AI 技术方案(个人总结)

  9. 人工智能系统 System for AI 课程介绍 Lecture Introduction

  10. 【AI】推理引擎的模型转换模块

  11. Pytorch 和 TensorFlow 在 padding 实现上的区别

  12. 训练模型到推理模型的转换及优化

  13. 使用 Grappler 优化 TensorFlow 计算图

  14. 死代码消除

  15. AI 编译器之前端优化-下(笔记)

  16. PyTorch 官方教程中文版

  17. MindSpore 教程

  18. TensorFlow Core

  19. 保存和加载 Keras 模型

  20. 探索 ONNX 模型:动态输入尺寸的实践与解决方案

  21. Pytorch 复习笔记--导出 Onnx 模型为动态输入和静态输入

  22. PyTorch 学习—19.模型的加载与保存(序列化与反序列化)

  23. 开源 AI 模型序列化总结

  24. ONNX 学习笔记

  25. 深入 CoreML 模型定义

  26. Swift loves TensorFlow and CoreML

  27. 什么是 Protobuf?

  28. Protobuf 语法指南

  29. 深入浅出 FlatBuffers 之 Schema

  30. FlatBuffers,MNN 模型存储结构基础 ---- 无法解读 MNN 模型文件的秘密

  31. 华为昇思 MindSpore 详细教程(一)

  32. 如何将在 GPU 上训练的模型加载到 CPU(系统)内存中?

  33. 模型的保存加载

  1. Winograd, Shmuel. Arithmetic complexity of computations. Vol. 33. Siam, 1980.

  2. Lavin, Andrew, and Scott Gray. "Fast algorithms for convolutional neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

  3. A simple python module for computing minimal Winograd convolution algorithms for use with convolutional neural networks

  4. video: Fast Algorithms for Convolutional Neural Networks by Andrew Lavin and Scott Gray

  5. video: Even Faster CNNs Exploring the New Class of Winograd Algorithms

  6. Understanding ‘Winograd Fast Convolution’

  7. 详解卷积中的 Winograd 加速算法

  8. 一文看懂 winograd 卷积加速算法

  9. 详解 Winograd 变换矩阵生成原理

  10. AI 算法基础 [4]:Winograd 算法原理

  11. [DL]Winograd 快速卷积算法

  12. MegEngine Inference 卷积优化之 Im2col 和 winograd 优化

  13. Winograd 卷积的纯 Python 实现

  14. Winograd 优化算法

五. AI 框架核心模块#

  1. 深入浅出:AI 框架与计算图的关系

  2. 4.1. 计算图的设计背景和作用

  3. 【AI】推理系统和推理引擎的整体架构

  4. 谈谈深度学习框架的数据排布

  5. 从零构建 AI 推理引擎系列

  6. 一篇就够:高性能推理引擎理论与实践 (TensorRT)

  7. 序列化之 FlatBuffers

  8. 【AI】推理引擎的模型转换模块

  9. 深度学习模型转换

  10. deep-learning-model-convertor

  11. hb_mapper_tools_guide

  12. 模型转换:由 Pytorch 到 TFlite

  13. AI 框架部署方案之模型转换

  14. Open Neural Network Exchange Intermediate Representation (ONNX IR) Specification

  15. 模型部署入门教程(一):模型部署简介

  16. 模型部署入门教程(三):PyTorch 转 ONNX 详解

[1] Boris Ginsburg, Sergei Nikolaev, Paulius Micikevicius. (2017). Training with Mixed Precision. Retrieved from https://on-demand.gputechconf.com/gtc/2017/presentation/s7218-training-with-mixed-precision-boris-ginsburg.pdf.

[2] Wikipedia. Half-precision floating-point format. Retrieved from https://en.wikipedia.org/wiki/Half-precision_floating-point_format.

[3] The Hugging Face Authors. (2024). Methods and tools for efficient training on a single GPU. Retrieved from https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one

[1] Li S, Zhao Y, Varma R, et al. Pytorch distributed: Experiences on accelerating data parallel training[J]. arXiv preprint arXiv:2006.15704, 2020.

[2] Rajbhandari S, Rasley J, Ruwase O, et al. Zero: Memory optimizations toward training trillion parameter models[C]//SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020: 1-16.

[3] Li M, Zhou L, Yang Z, et al. Parameter server for distributed machine learning[C]//Big learning NIPS workshop. 2013, 6(2).

[1] The PyTorch Authors. (2024). Getting Started with Distributed Data Parallel. Retrieved from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html.

[1] Rajbhandari S, Rasley J, Ruwase O, et al. Zero: Memory optimizations toward training trillion parameter models[C]//SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020: 1-16.

[2] Rajbhandari S, Ruwase O, Rasley J, et al. Zero-infinity: Breaking the GPU memory wall for extreme scale deep learning[C]//Proceedings of the international conference for high performance computing, networking, storage and analysis. 2021: 1-14.

[3] Lv K, Yang Y, Liu T, et al. Full parameter fine-tuning for large language models with limited resources[J]. arXiv preprint arXiv:2306.09782, 2023.