Computer vision is a branch of computer science that allows computers to understand and reason about the content of photos and videos. This ranges from identifying the type of scene or the objects in it (where they start and end, approximately how far away they are, and even how they are moving) to understanding higher-level relationships, such as which objects are interacting with one another and how, what actions are taking place, and even decent guesses about how the scene is likely to progress.
The field started in the 1960s and 1970s and focused on handcrafted patterns for depth and motion understanding as well as classification. Around the start of the 2010s, Computer Vision experienced a revolution when the spread of compute and data allowed Machine Learning models to rapidly unlock large jumps in accuracy and precision.
Today, Computer Vision is important for everything from enabling autonomous vehicles to understand the world they are navigating, to using mobile phones to identify skin cancer, read a menu in a foreign language abroad, or identify defective products on a manufacturing line.
We hope that you will find the following resources interesting and helpful in starting to learn about Computer Vision. We always want to improve our selection and curation process by including other topics not covered in this shortlist. To that end, we kindly encourage sending your feedback and suggestions to selects-feedback@acm.org. We look forward to your guidance on how we can continue to improve ACM Selects together.
Read more about ACM's ongoing efforts to provide resources for students and professionals through the ACM Learning Center.
Computer Vision Foundations
Building Rome in a Day
First published in Communications of the ACM, Vol. 54, No. 10, Oct 2011.
A great example of a systems paper for a classical Computer Vision problem that is still incredibly relevant today: understanding the 3D geometry and creating computer vision pipelines on large repositories of data ‘from the wild’. In this paper, the 3D structure of city landmarks was reconstructed given many photos taken around that area from different camera devices.
The Science of Shape: Revolutionizing Graphics and Vision in the Third Dimension
First published in XRDS: Crossroads, The ACM Magazine for Students, Sep 2007.
Computer Vision is built on a strong foundation of geometry, numerical algorithms, and linear algebra, and shares many of its mathematical foundations with Computer Graphics. Justin Solomon is the head of the Geometric Data Processing group in MIT CSAIL and author of the textbook Numerical Algorithms. Solomon wrote The science of shape: revolutionizing graphics and vision with the third dimension as a freshman at Stanford, expressing his enthusiasm for the study of 3D geometry for Computer Vision and Graphics applications.
It’s all about Image
First published in Communications of the ACM, Vol. 60, No. 6, May 2017.
This Communications of the ACM article provides a detailed overview of how artificial neural networks and deep learning have accelerated computer image recognition. We recommend this as a starting point for understanding both the key techniques and the varying applications of computer vision.
Machine Learning enters Computer Vision
In the decade starting in 2010, the spread of compute and dataset collection allowed Deep Learning to unlock incredible new capabilities in Computer Vision. The resources in this section discuss this process and its implications.
Technical Perspective: What led Computer Vision to Deep Learning?
First published in Communications of the ACM, Vol. 60, No. 6, May 2017.
In 2012, Krizhevsky, Sutskever and Hinton published a landmark paper that shaped the trajectory of modern computer vision. Here Jitendra Malik (Arthur J. Chick Professor of Electrical Engineering and Computer Sciences at the University of California, Berkeley) shares his technical perspective on the paper’s lasting impact. We recommend this article for historical context, from a key expert in the field, on how artificial neural network-based techniques influenced computer vision.
ImageNet: Where Have We Gone? Where Are We Going? with Fei-Fei Li
First published as an ACM Tech Talk, September 2017.
In 2009-2010, computer vision underwent a revolution, moving from algorithms built mostly on ‘feature engineering’, or hand-crafted pattern matching, to learning those patterns using machine learning. Here Fei-Fei Li talks about the seminal dataset that propelled this revolution, how it was created, machine learning concepts, how far we’ve come, and where we’re going.
Scene Understanding by Labeling Pixels
First published in Communications of the ACM, Vol. 57, No. 11, Nov 2014.
An early overview paper on an important Computer Vision task enabled by Deep Learning: segmentation of an image into its semantic parts.
Emerging Trends
Synthesis, the Convergence of Computer Vision and Graphics
First published as an ACM Tech Talk, August 2018.
In his talk, “From media to meaning and from meaning to media”, Blaise Agüera y Arcas discusses a domain that takes traditional CV classification problems and flips them on their head: creating generative models to represent media. This talk highlights several important concepts: generative models are an increasingly important and feasible application domain, the fields of graphics and Computer Vision are growing closer together, and the technology we work towards can have large social implications and impact, which are important to keep in mind.
Mobile and Edge Computing
Pervasive use of computer vision is often constrained by compute and memory available for the task, and advances in deep learning are driving computational demands even further.
Model quantization is one of the popular approaches used to deploy deep learning models on edge and mobile devices. Quantization involves converting the floating-point numbers in computer vision models to integers (typically int8), which reduces model size and the computational cost of inference, usually with only a small loss in prediction accuracy.
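To make the idea concrete, here is a minimal sketch of affine (asymmetric) int8 quantization in plain Python. It is an illustration only, not how TensorFlow Lite or PyTorch implement quantization; the function names `quantize_int8` and `dequantize` are hypothetical, and real toolchains add calibration, per-channel scales, and integer-only kernels.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # representable range of a signed 8-bit integer


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to int8 using a single scale and zero point (illustrative)."""
    # Extend the range to include 0.0 so that zero is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # all values are zero; any scale works
        return [0] * len(values), 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)      # float step per integer step
    zero_point = round(QMIN - lo / scale)  # integer that represents 0.0
    quantized = [
        max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate floats from the int8 representation."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    q, scale, zp = quantize_int8(weights)
    print("int8 values:", q)
    print("restored   :", dequantize(q, scale, zp))
```

Each value is stored in one byte instead of the four bytes of a 32-bit float, and the restored values differ from the originals by at most about one quantization step (`scale`), which is the source of the small accuracy loss.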
TensorFlow and PyTorch both provide tools to easily quantize computer vision models.
- TensorFlow Lite framework uses data serialization and quantization techniques to compress and optimize TensorFlow models. [Read More]
- PyTorch 1.3 supports quantization modes for doing computations and memory accesses with lower precision data. [Read More]
Computer Vision in HCI: Unlocking New Input Types and Immersive Experiences
The ability to recognize and interpret human actions such as facial expression or gestures using computer vision can allow us to interact with computers in a more natural way. Here are a few examples of such computer vision powered interfaces.
Eyewear Computers for Human-Computer Interaction
First published in Interactions, Vol. 23, No. 3, April 2016.
Envisioning, designing, and implementing the user interface requires a comprehensive understanding of interaction technologies. In this forum we scout trends and discuss new technologies with the potential to influence interaction design.
Vision-based Hand-gesture Applications
First published in Communications of the ACM, Vol. 54, No. 2, February 2011.
Body posture and finger pointing are a natural modality for human-machine interaction, but first the system must know what it's seeing.
Using Computer Vision: Three Levels
Now that you know about Computer Vision, its history, and its applications, perhaps you would like to use it in your projects. The following section covers how to get started with computer vision at three levels of technical depth: as a user of SaaS APIs, as a developer integrating deep learning models in your app, and as a student aiming to do research in Computer Vision.
Use Computer Vision as a Service
Building and maintaining robust computer vision models for applications such as image classification often requires teams of experts, large datasets, and high-performance computers, which is prohibitive for many. If you are a developer who wants to quickly use computer vision within your application, several companies, including Amazon, Clarifai, Google and Microsoft, offer state-of-the-art visual recognition models that can easily be integrated via REST APIs.
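As a rough illustration of what such an integration tends to look like, the sketch below builds an HTTP request for an image classification call using only the Python standard library. The endpoint URL, the JSON payload shape, and the bearer-token header are all assumptions made for this example; they do not correspond to any particular provider, each of which documents its own URL, authentication scheme, and request format.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint, for illustration only (not a real service).
EXAMPLE_ENDPOINT = "https://vision.example.com/v1/classify"


def build_classify_request(
    image_bytes: bytes,
    api_key: str,
    endpoint: str = EXAMPLE_ENDPOINT,
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a base64-encoded image."""
    # Assumed payload shape: a JSON object with the image as a base64 string.
    payload = {"image": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme; providers differ.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

With a real provider, you would substitute the documented endpoint and payload, send the request with `urllib.request.urlopen(request)`, and parse the JSON response, which typically lists predicted labels with confidence scores.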
Develop using Existing Packages, Libraries, and Models
The best way to learn how to build your own computer vision applications is by following code samples and tutorials offered by popular libraries and frameworks.
- If your goal is to build real-time computer vision applications such as tracking objects in video streams, we recommend OpenCV - the most popular computer vision framework. [OpenCV Tutorials]
- Keras runs on top of TensorFlow and offers simple, easy to use Python APIs for developing deep learning based applications such as image classification and segmentation. [Keras Vision Examples]
- TensorFlow and PyTorch also offer pre-trained models for applications such as image classification and object detection. [TensorFlow Model Collection] [PyTorch Model Collection]
Understand CV Fundamentals: Become a Practitioner or Researcher
If you are interested in understanding the theoretical and functional fundamentals of computer vision, or pursuing research in this domain, here are a few educational resources for you.
Stanford CS231N
Stanford University’s CS231n was unanimously selected by the editors as their favorite resource for getting started with applied computer vision and deep learning using Convolutional Neural Networks. In this course, you will learn the fundamentals of applying CNNs to common computer vision applications such as image classification, object detection/recognition, and visual segmentation.
[Spring 2020 Course Materials]
Deep Learning for Vision Systems
Published through Manning, available on O'Reilly. Ebook access available to ACM members. Please refer to the following FAQ for any issues accessing the O'Reilly learning platform.
How does the computer learn to understand what it sees? Deep Learning for Vision Systems answers that by applying deep learning to computer vision. Using only high school algebra, this book illuminates the concepts behind visual intuition. You'll understand how to use deep learning architectures to build vision system applications for image generation and facial recognition.