
Beyond the Lens: How AI Can Redefine Vision Systems for Manufacturing
AI has long been associated with cloud-based computing, requiring specialized infrastructure. However, advances in hardware, like NVIDIA’s Jetson series, are pushing AI out of the data center and into compact, low-power devices that can run sophisticated models out in the field, at the “edge” (that is, without an internet connection).
IRH decided to spend some time with the Jetson Orin Nano Series, released last year, to determine if we might be able to leverage this and similar technologies in our low-cost automation (LCA) solutions. Here are the specs for the Jetson we looked at:
Jetson Orin Nano 8GB: Features a 1024-core NVIDIA Ampere architecture GPU with 32 Tensor Cores, a 6-core Arm Cortex-A78AE CPU, and 8GB of 128-bit LPDDR5 memory, delivering up to 40 TOPS of AI performance.
The Jetson Orin Nano is compatible with NVIDIA’s AI software stack and supports other AI frameworks, allowing it to run powerful models in real-world applications without cloud dependency: because the AI application runs entirely on the device, no internet connection is required. One of the most amazing things about it is that it does all this while drawing a minuscule 15 W! By no means the only solution of its kind on the market today, it is indicative of a long-term trend toward small, powerful systems that are now becoming AI capable.
When thinking about ideal AI applications for this device, object detection immediately floated to the top. Using just a simple webcam, object detection can identify and locate objects at a rate comparable to the camera’s frame rate (about 30 frames per second). For the sake of discussion, let’s give this proposed AI Vision system the code name “Cebu.”
Seeing the Light
Traditional vision systems for manufacturing often rely on older technology and can be very expensive. They’re typically tied to special cameras and lighting systems offered by specific vendors, meaning less flexibility at greater expense. These vision systems also require specialized programming to function: a sort of step-by-step logic block that can do things like give a pass or fail to a part coming down a conveyor.
Using AI object detection for this task could potentially solve several of these pain points at considerably lower cost. The idea is simple: a camera (something as basic as a webcam could work here) captures an image of the workspace, and the AI analyzes it to recognize objects, figure out their positions, and understand how they are rotated, then tells the robot how to pick up each item. This enables the robot to pick objects up correctly, even if they are scattered or placed at different angles.
As an example, if we wanted Cebu to sort cans from bottles, we would train it using sample photos of the bottles and cans. All of the painstaking programming required by an older vision system would instead be handled by the AI (say it with us: Automation, Simplified!). It isn’t hard to see the potential here.
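To give a concrete flavor of that training step (jumping ahead to the YOLO tooling we describe below), here is a minimal sketch using the Ultralytics Python API. The dataset file, class names, and training settings are illustrative, not our exact configuration.

```python
# Minimal sketch: teach a model to tell cans from bottles by fine-tuning
# small pre-trained YOLO weights on labeled sample photos.
from ultralytics import YOLO

# Start from pre-trained weights so only a modest set of photos is needed.
model = YOLO("yolov8n.pt")

# "cans_bottles.yaml" (hypothetical) would point at the labeled images and
# list the two classes ("can", "bottle") in the standard dataset format.
model.train(data="cans_bottles.yaml", epochs=50, imgsz=640)

# Check accuracy on the validation split; the tuned weights are written to
# runs/detect/train/weights/ by default.
model.val()
```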
Seeing It Through
When choosing an object detection model, we considered several options, including SSD (Single Shot MultiBox Detector) and Faster R-CNN. While these models offer good accuracy, they can be slow or require more computational power. We also looked at NVIDIA’s DeepStream, which is optimized for real-time AI on Jetson devices but is better suited to complex multi-camera applications. YOLO (You Only Look Once) stood out as the best choice for this initial look because it balances speed and accuracy. Unlike approaches that track objects across frames, YOLO processes each frame independently, which simplifies computation and makes it extremely fast and well-suited for real-time applications where rapid detection is prioritized over object tracking. Additionally, each detection comes with a confidence score and location information, which lets us filter out uncertain results and decide what to act on.
One challenge: YOLOv8, which we started with, does not really give orientation, but there are ways to infer it. More about that below.
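As one illustration (not necessarily the route we took), rotation can be roughly estimated from an axis-aligned detection with classical image processing: segment the cropped region and fit a rotated rectangle with OpenCV.

```python
# Rough sketch: estimate an object's in-plane rotation from an axis-aligned
# YOLO box by thresholding the crop and fitting a minimum-area rectangle.
# Works best for high-contrast parts on a plain background.
import cv2

def estimate_angle(frame, box):
    """box = (x1, y1, x2, y2) from an axis-aligned detection."""
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = frame[y1:y2, x1:x2]

    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None

    # The rotated rectangle around the largest contour approximates the
    # part's orientation within the frame.
    largest = max(contours, key=cv2.contourArea)
    (_, _), (_, _), angle = cv2.minAreaRect(largest)
    return angle
```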
Using a pre-trained YOLO model, we wrote a script that loaded the model and captured frames from the camera. The script processed each frame, ran the detection model, and extracted object locations and orientations. With just a few lines of code, we could visualize the detected objects by drawing bounding boxes and labels on the images.
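A minimal sketch of that loop, assuming the Ultralytics Python API and a standard webcam, might look like this (the weights file and confidence threshold are placeholders):

```python
# Sketch of the detection loop: grab frames from a webcam, run a
# pre-trained YOLO model, and draw labeled bounding boxes.
import cv2
from ultralytics import YOLO

model = YOLO("yolov8n.pt")     # pre-trained detection weights
cap = cv2.VideoCapture(0)      # default webcam

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    # Run inference, keeping only reasonably confident detections.
    results = model(frame, conf=0.5, verbose=False)

    # Ultralytics can render boxes, labels, and confidence scores for us.
    annotated = results[0].plot()
    cv2.imshow("Cebu - object detection", annotated)

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```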
To make the system more efficient and demonstrate the use of C++ (if needed), we converted the model to ONNX format. This allowed us to take advantage of optimized inference with ONNX Runtime, which is well-supported in both Python and C++. It also made the system easier to deploy in environments where Python dependencies might be a limitation. We’re also exploring C# development, since some of our developers have experience with it and .NET Core allows C# applications to run on Linux.
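Roughly, the export and inference path looks like the sketch below, shown in Python with CPU inference for simplicity; on the Jetson you would choose a GPU or TensorRT execution provider, and the same .onnx file can be consumed from C++ or C# via the corresponding ONNX Runtime bindings.

```python
# Sketch: convert the YOLO model to ONNX, then run it with ONNX Runtime.
import numpy as np
import onnxruntime as ort
from ultralytics import YOLO

# Export the trained PyTorch weights to ONNX (writes yolov8n.onnx).
YOLO("yolov8n.pt").export(format="onnx")

# Load the exported model with ONNX Runtime.
session = ort.InferenceSession("yolov8n.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# YOLO expects a normalized NCHW float tensor; a dummy 640x640 frame here.
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)  # raw predictions, still needing post-processing (NMS, scaling)
```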
Seeing is Believing
Jumping up from YOLOv8 to YOLOv11 was a key improvement. YOLOv8 uses axis-aligned bounding boxes, which work well for general object detection but may struggle with objects that are rotated. YOLOv11 introduces oriented bounding boxes, allowing the model to detect objects with a more accurate angle of rotation. This is particularly useful in pick-and-place applications, where knowing the precise orientation of an object helps the robot grasp it correctly without a lot of additional work. Once the detections are processed, the object positions and orientations are sent to the robot over Ethernet in near real time. The robot then knows where the object is in space and time and can pick and place per the customer’s requirements.
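The sketch below illustrates that handoff, assuming the Ultralytics oriented-bounding-box results API and a simple JSON-over-TCP message. The controller address, port, and message schema are hypothetical, and camera-to-robot coordinate calibration is omitted.

```python
# Sketch: read oriented-bounding-box (OBB) detections and stream position
# plus rotation to the robot controller over a plain TCP socket.
import json
import socket

import cv2
from ultralytics import YOLO

model = YOLO("yolo11n-obb.pt")   # OBB-capable weights
cap = cv2.VideoCapture(0)

# Hypothetical robot controller endpoint.
robot = socket.create_connection(("192.168.0.50", 5000))

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break

    results = model(frame, conf=0.5, verbose=False)
    obb = results[0].obb
    if obb is None:
        continue

    detections = []
    # xywhr: center x, center y, width, height, rotation (radians), in pixels;
    # mapping to robot coordinates would require a calibration step.
    for (cx, cy, w, h, r), cls in zip(obb.xywhr.tolist(), obb.cls.tolist()):
        detections.append({"class": int(cls), "x": cx, "y": cy, "angle": r})

    # One JSON line per frame keeps parsing simple on the robot side.
    robot.sendall((json.dumps(detections) + "\n").encode())
```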
During initial testing, we found that we were getting some motion blur in the frames, which made detections a bit less reliable than we had hoped. We were testing in an office environment with typical low lighting. By adding some very simple lighting, our detections improved significantly, and the system is now running reliably. The extra light (even from simple fluorescents) lets the webcam’s auto-exposure use a faster shutter speed, which reduces motion blur. A step up in cameras might give us more control over exposure, including shutter speed, while still keeping the simplicity of a basic, inexpensive camera.
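Where the camera and driver expose manual controls, OpenCV can sometimes adjust exposure directly; support and value ranges vary widely by camera, driver, and OS, so treat this as a best-effort tweak rather than a guaranteed fix.

```python
# Sketch: request manual exposure from a webcam through OpenCV.
import cv2

cap = cv2.VideoCapture(0)

# On many UVC webcams under V4L2, 0.25 selects manual exposure mode and the
# exposure value is in driver-specific units; other backends differ.
cap.set(cv2.CAP_PROP_AUTO_EXPOSURE, 0.25)
cap.set(cv2.CAP_PROP_EXPOSURE, -6)

print("exposure now:", cap.get(cv2.CAP_PROP_EXPOSURE))
cap.release()
```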
In just a few weeks, IRH has taken Cebu from an idea to a working prototype and is now tuning the system for delivery to a customer for a high-speed pick-and-place application using three 6-axis robots.
And this is only the beginning. There are endless possibilities for developing an integrated AI Vision system coupled with robotics and other types of automation. This could be a system that learns as it goes, tweaking its own operation to improve performance and accuracy.
Combining AI Vision with an AI Agent to solve factory floor problems? Get in touch! Let's work together to make automation, simplified!