How computers understand photos and video — pixels, faces, text in pictures, and real uses you already know.
To a computer, a photo is not a memory — it is a grid of numbers. Vision AI learns patterns in those numbers, the same way you learned that a banana is yellow and curved.
What is a pixel? Zoom in on any digital photo and you see tiny coloured squares. Each square is a pixel. The computer stores a number (or a few numbers) for each pixel — how bright it is, and what colour.
Colour in simple terms: Many images use three numbers per pixel — how much red, green, and blue (RGB). Mix those and you get the colour you see on screen.
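As a rough illustration of the two ideas above, here is a tiny 2×2 colour image written out as plain Python lists. The pixel values and variable names are invented for this sketch; real photos are far larger and are usually loaded with an image library.

```python
# A tiny 2x2 colour image: each pixel is (red, green, blue), each from 0 to 255.
# These values are invented for illustration.
image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: pure red, pure green
    [(0, 0, 255), (255, 255, 0)],    # bottom row: pure blue, yellow (red + green)
]

height = len(image)
width = len(image[0])

# Read one pixel: row 1, column 1 (counting from 0) is the yellow one.
red, green, blue = image[1][1]
print(f"Image is {width}x{height} pixels")
print(f"Bottom-right pixel: red={red}, green={green}, blue={blue}")
```

Everything vision software does starts from a grid like this one, only with many more rows and columns.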
Humans vs computers: You recognise a friend’s face in a crowd using memory and context. A computer starts with raw numbers and must learn which number patterns mean “face,” “car,” or “crack in the wall.”
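Three main jobs vision AI can do: classification (one label for the whole image), detection (a box around each thing), and segmentation (a label for every pixel).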
Image basics — size and file types
Resolution is pixels wide × tall (e.g. 1920×1080). More pixels = more detail and more work for the computer.
Grayscale = one brightness number per pixel. Factories often use it to count dark blobs on a light tray.
JPEG, PNG: Common image formats. JPEG gives smaller files for photos; PNG keeps sharp edges and suits screenshots.
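To make "more pixels = more work" concrete, the sketch below counts the numbers in one 1920×1080 image and turns a colour pixel into a single grayscale value. The helper name `to_gray` is made up for this example, and the weights are one common choice (a plain average would also give a rough result). The counts are for the uncompressed grid; JPEG and PNG files are smaller because they compress it.

```python
# How many numbers does the computer handle for one 1920x1080 image?
width, height = 1920, 1080
pixels = width * height            # one entry per pixel
rgb_values = pixels * 3            # three numbers per pixel in RGB
gray_values = pixels * 1           # one number per pixel in grayscale

def to_gray(rgb):
    """Turn one (r, g, b) pixel into a single brightness number (0-255).

    Uses the widely used weights 0.299, 0.587, 0.114, which count green
    more than red or blue because our eyes are most sensitive to green.
    """
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)

print("pixels:", pixels)
print("numbers in RGB:", rgb_values)
print("numbers in grayscale:", gray_values)
print("white becomes", to_gray((255, 255, 255)))
print("black becomes", to_gray((0, 0, 0)))
```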
Figure — Pick one job per project; do not do all three on day one.
Who labels training photos?
| Who | Good for | Watch out |
|---|---|---|
| Students | Leaves, demos | Agree on rules |
| Factory workers | Scratch vs OK | Tired eyes; rotate |
| Experts | Medical outlines | Privacy, law |
Figure — The computer never “sees” like you do — it reads number patterns.
Figure — Zoom in enough and every picture looks like coloured squares.
Vision jobs — types and what they are used for
| Job type | What you get | Used for | Example |
|---|---|---|---|
| Classification | One label for whole image | Sort photos, quality OK / not OK | “Beach” search in phone gallery |
| Detection | Boxes around each thing | Count people, find cars, track a ball | Security camera person overlay |
| Segmentation | Colour each pixel by type | Road vs sidewalk, tumour outline | Self-driving research maps |
| Face detection | Find where faces are | Camera focus, blur background | Phone portrait mode |
| Face recognition | Match face to identity | Unlock device (needs care + consent) | Face unlock on phone |
| OCR (read text) | Text from image | Scans, receipts, signs | Deposit a cheque in banking app |
| Pose / gesture | Where body joints are | Fitness apps, sign language research | Dance game scoring |
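
One way to picture the first three rows of the table is to look at the shape of the answer each job returns. The sketch below uses an imaginary 4×4 photo of a cat on a sofa; every label, coordinate, and variable name is made up, and real tools use their own output formats.

```python
# Hypothetical outputs for one 4x4 photo of a cat on a sofa.
# All names and numbers are invented to show the *shape* of each answer.

# Classification: one label for the whole image.
classification = "cat"

# Detection: one box per object, as (label, left, top, right, bottom) in pixels.
# Here right and bottom are exclusive, so the cat box covers columns 1-2, rows 1-2.
detections = [
    ("cat", 1, 1, 3, 3),
    ("sofa", 0, 0, 4, 4),
]

# Segmentation: one label per pixel, on a grid the same size as the image.
segmentation = [
    ["sofa", "sofa", "sofa", "sofa"],
    ["sofa", "cat",  "cat",  "sofa"],
    ["sofa", "cat",  "cat",  "sofa"],
    ["sofa", "sofa", "sofa", "sofa"],
]

cat_pixels = sum(row.count("cat") for row in segmentation)
print("label:", classification)
print("boxes:", len(detections))
print("cat pixels:", cat_pixels)
```

Moving down the list, the answer gets more detailed, and the labelling work needed to train for it usually grows too.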
What training data looks like
To teach vision AI, people collect labelled images: photos paired with the correct answer for each one.
If you only train on bright sunny photos, the system may fail on dark rainy photos. That is not “stupid AI” — it is missing examples.
Figure — Same as Module 3: learn from labelled examples.
Transfer learning — why it saves time
A model trained on millions of general photos already knows edges and textures. You train only a small new final layer (the “head”) for your labels, such as bruised apple or rust spot, with hundreds of photos, not millions.
Bad labels teach bad lessons
| Mistake | Result |
|---|---|
| “Dog” on cat photo | Confuses both |
| Only sunny photos | Fails at night |
| Same photo in train and test | Fake high score |
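
The last row of the table can be checked automatically. The sketch below is one simple approach, assuming photos are compared by their exact bytes; the function names and the stand-in data are hypothetical. It only catches identical copies, so resized or re-saved versions of the same photo would need a different check.

```python
import hashlib

def fingerprint(photo_bytes):
    """Return a hash so that identical files get identical fingerprints."""
    return hashlib.sha256(photo_bytes).hexdigest()

def find_leaks(train, test):
    """Return names of test photos whose exact bytes also appear in train."""
    train_prints = {fingerprint(data) for data in train.values()}
    return sorted(name for name, data in test.items()
                  if fingerprint(data) in train_prints)

# Stand-in data: in practice the bytes would come from open(path, "rb").read().
train = {"apple_01.jpg": b"AAAA", "apple_02.jpg": b"BBBB"}
test = {"apple_90.jpg": b"CCCC", "copy_of_01.jpg": b"AAAA"}

leaks = find_leaks(train, test)
print("Test photos that also appear in training:", leaks)
```

Any photo this finds should be removed from one of the two sets before the score is trusted.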
Would a motion sensor be simpler? What breaks the camera approach?
Would you use classification, detection, or segmentation? What would you label in photos? Who checks mistakes before acting?
Special programs scan small windows over the photo. They learn edges first, then shapes, then whole objects — like learning letters, then words, then sentences.
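The "small window" idea can be sketched in a few lines. In this example the 3×3 grid of weights (often called a kernel) is written by hand to respond to a dark-to-bright change from left to right; in a trained network those weights would be learned from examples instead. The function name and the test image are made up for illustration.

```python
def slide_filter(image, kernel):
    """Slide a 3x3 kernel over a grayscale image (a list of rows).

    At each position, multiply the 9 pixels by the 9 kernel weights and add
    them up. Returns a slightly smaller grid: one number per window position.
    """
    out = []
    for top in range(len(image) - 2):
        row = []
        for left in range(len(image[0]) - 2):
            total = 0
            for dy in range(3):
                for dx in range(3):
                    total += image[top + dy][left + dx] * kernel[dy][dx]
            row.append(total)
        out.append(row)
    return out

# Hand-written kernel: negative on the left, positive on the right.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# 4 rows x 5 columns: dark (0) on the left, bright (9) on the right.
image = [
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]

for row in slide_filter(image, vertical_edge):
    print(row)
```

The output is 0 where the window sees only flat dark pixels and large where it covers the dark-to-bright boundary, which is what "finding an edge" means in numbers.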
You do not need the math. The important ideas are:
Transfer learning (shortcut): Start from a model already trained on millions of general photos (cats, cars, chairs). Then teach it your smaller job — “bruised apple” vs “good apple” — with fewer pictures of your own.
Things that hurt vision AI: dark rooms, blur, shiny reflections, hidden objects, and labels that disagree (one person says “OK,” another says “defect”).
Figure — Early layers see simple parts; later layers combine them into meaning.
Figure — Like studying with practice questions, then taking a new exam.
Common data problems — and what to do
| Problem | What goes wrong | What helps |
|---|---|---|
| Too few photos | Guesses randomly on new images | Collect more; use transfer learning |
| All photos look the same | Works in lab, fails in real room | Add night, blur, different angles |
| Wrong labels | AI learns the wrong lesson | Two people label; spot-check |
| Class imbalance | 99% “OK” → always says OK | Collect more defect photos; measure fairly |
| Leaky test set | Near-duplicates in train and test give a fake high score | Split by time or camera, not only at random |
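
The class imbalance row is easy to demonstrate with made-up numbers. The sketch below imagines 1000 inspection photos and a lazy "model" that always answers OK; the counts are invented, and recall is used here as one possible way to "measure fairly".

```python
# 1000 hypothetical inspection photos: 990 are "OK", 10 are "defect".
truth = ["OK"] * 990 + ["defect"] * 10

# A lazy model that ignores the photo and always answers "OK".
predictions = ["OK"] * len(truth)

# Accuracy: share of all photos answered correctly.
correct = sum(t == p for t, p in zip(truth, predictions))
accuracy = correct / len(truth)

# Defect recall: of the real defects, how many did we catch?
defects_total = truth.count("defect")
defects_caught = sum(t == "defect" and p == "defect"
                     for t, p in zip(truth, predictions))
defect_recall = defects_caught / defects_total

print(f"accuracy = {accuracy:.0%}")
print(f"defect recall = {defect_recall:.0%}")
```

The accuracy looks excellent while the model catches no defects at all, which is why a single overall score can mislead when one class is rare.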
Long before big neural networks, engineers used simple picture steps. Many factory and medical systems still use them because they are fast, cheap, and easy to explain.
Why simple steps still matter: If lighting is controlled (same lamp, same distance), you can count white pills or measure a screw head without training a huge model. You can explain every step to an inspector.
Typical pipeline (classic vision):
Figure — Same flow in many factories: prepare image → measure → decision.
Figure — Pills on tray: white blobs after threshold.
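The prepare → measure → decision flow can be sketched for the pill tray without any AI. The example below assumes a small grayscale grid with bright pills on a dark tray; the cutoff of 128, the expected count of 2, and all function names are assumptions chosen for illustration.

```python
def threshold(gray, cutoff):
    """Prepare: turn a grayscale grid into 1 (bright) or 0 (dark)."""
    return [[1 if value > cutoff else 0 for value in row] for row in gray]

def count_blobs(mask):
    """Measure: count groups of touching 1-pixels (up, down, left, right)."""
    height, width = len(mask), len(mask[0])
    seen = set()
    blobs = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x] == 1 and (y, x) not in seen:
                blobs += 1                      # found a new blob
                seen.add((y, x))
                stack = [(y, x)]
                while stack:                    # visit every pixel it touches
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] == 1 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return blobs

# Made-up tray: background about 10, pills about 200.
tray = [
    [10,  10,  10, 10,  10, 10],
    [10, 200, 210, 10,  10, 10],
    [10, 205, 220, 10, 190, 10],
    [10,  10,  10, 10, 195, 10],
    [10,  10,  10, 10,  10, 10],
]
EXPECTED_PILLS = 2

mask = threshold(tray, cutoff=128)
pills = count_blobs(mask)
decision = "pass" if pills == EXPECTED_PILLS else "reject"   # Decision step
print("pills found:", pills, "->", decision)
```

This only works while lighting stays steady enough that one cutoff separates pills from tray, which is the condition the text above describes.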
Simple filter types — what they do
| Filter / step | What it does | Used for | Example |
|---|---|---|---|
| Blur | Smooths tiny speckles | Cleaner image before counting | Remove camera noise on grey tray |
| Sharpen | Makes edges crisper | See fine cracks (careful — also sharpens noise) | Inspecting metal surface |
| Edge finder | Highlights outlines | Measure width, find shape | Is the screw head round? |
| Brightness / contrast | Darken or brighten | Same rule morning and afternoon | Factory belt under changing sun |
| Threshold | Black and white only | Count white blobs | Pills on dark tray |
| Crop / resize | Cut or shrink image | Faster processing; focus on belt only | Ignore factory background |
| Colour filter | Keep one colour range | Find red defects on grey part | Tomato sorting by redness |
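
As one example from the table, a colour filter can be a simple rule on the three RGB numbers. The sketch below estimates how red a tomato photo is; the cut-off values (150 and 100) and the function names are assumptions that would need tuning for a real camera and lamp.

```python
def is_red(pixel):
    """Rough 'red enough' rule; the cut-off numbers are guesses to tune."""
    r, g, b = pixel
    return r > 150 and g < 100 and b < 100

def red_fraction(image):
    """Share of pixels (0.0 to 1.0) that pass the red rule."""
    pixels = [p for row in image for p in row]
    return sum(is_red(p) for p in pixels) / len(pixels)

# Made-up 2x2 tomato photo: three red pixels and one that is still green.
tomato = [
    [(200, 40, 30), (210, 50, 40)],
    [(190, 60, 50), (90, 160, 60)],
]

print("red share:", red_fraction(tomato))
```

A sorting line could then compare this share with a chosen limit to decide which bin the tomato goes to.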
Rules vs AI — when to pick which
| Situation | Often best choice | Why |
|---|---|---|
| Same camera, same lighting, same object | Rules + threshold | Fast, cheap, easy to audit |
| Objects vary in look or background | Trained vision AI | Hard to write rules for every case |
| Need to explain every decision to a regulator or court | Rules first; AI as helper | “Pixel count > 500” is clearer than “layer 7 said so” |
| Phone app for billions of photos | Big pre-trained AI | Scale and variety need deep learning |
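
The "pixel count > 500" rule from the table might look like this in code. The function name and its return format are hypothetical; the point is that the decision comes with a reason a person can read and check.

```python
def inspect(defect_pixels, limit=500):
    """Return a decision plus the reason, so an inspector can audit it."""
    if defect_pixels > limit:
        return "reject", f"defect pixel count {defect_pixels} > {limit}"
    return "accept", f"defect pixel count {defect_pixels} <= {limit}"

print(inspect(620))
print(inspect(120))
```

Changing the rule means changing one visible number, which is part of why such systems are easy to audit.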
What would have to stay the same every day? What would break the system?
You use computer vision more than you think — unlocking your phone, scanning homework, filtering selfies, and getting alerts when a camera sees a person. The ideas from Topics 1–3 show up in all of them.
Same loop everywhere: Camera captures image → software finds patterns → something happens (unlock, beep, stop belt, draw box on screen).
Good uses save time, catch defects, or help people with disabilities (text-to-speech on signs).
Risky uses need extra care: face recognition in public, emotion guessing in hiring, fully automatic medical decisions.
Where vision AI shows up — types and uses
| Where | What vision does | Why people use it | Example product |
|---|---|---|---|
| Phone | Face unlock, photo search, filters, QR scan | Convenience, fun, security | Gallery search, portrait mode |
| Car | Backup lines, lane hints, sign reading | Help driver see danger | Reversing camera, lane assist |
| Factory | Spot scratches, wrong parts, missing cap | Quality 24/7 | Camera over conveyor belt |
| Shop | Recognise items at self-checkout | Faster queues | Camera above basket |
| Home security | Person / animal / package alert | Notify owner so nobody has to watch 24/7 | Video doorbell |
| Farming | Weed vs crop, ripeness, pest damage | Target spraying, less waste | Drone over field |
| Healthcare (with doctor) | Highlight region on scan | Second pair of eyes | X-ray assist — doctor decides |
| Accessibility | Describe scene, read text aloud | Help blind or low-vision users | Phone “describe image” feature |
Figure — Same idea as a robot from Module 4: see, think, act.
Uses that need extra care
| Use | Risk if wrong | What responsible teams do |
|---|---|---|
| Face recognition in public | Wrong person accused; privacy harm | Consent, law check, human review, audit logs |
| Emotion AI in hiring | Unfair rejection; pseudo-science | Often avoided; humans interview |
| Medical decision from image alone | Missed disease | Doctor makes final call; regulated testing |
| Security “weapon” detection | False alarm, bias | Test on diverse data; human verifies alert |
Vision in your Module 8 project
| Idea | Vision job | Keep small |
|---|---|---|
| Recycling bins | Classify material | ~50 photos per class |
| Plant health | OK vs wilted leaf | Same camera distance |
| Parking slot | Car vs empty | One camera angle |
| Bottle cap | Cap missing? | Rules may be enough |
With IoT (Module 5): Camera → vision says defect → MQTT message → chart or belt stop. Draw the full chain on your poster.
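The chain above can be sketched end to end before any hardware is connected. In this example the vision step and the MQTT publish call are stand-ins written for illustration: the topic name `factory/line1/vision`, the camera id, and every function name are assumptions, and a real project would swap in a trained model and an MQTT client library.

```python
import json

def vision_check(photo):
    """Stand-in for the vision step; a real project would run a model here."""
    return {"label": photo["label"], "confidence": photo["confidence"]}

def build_message(camera_id, result):
    """Package the result as a JSON string, a common choice for MQTT payloads."""
    return json.dumps({"camera": camera_id, **result})

def publish(topic, payload, outbox):
    """Stand-in for an MQTT client's publish call: it only records the message."""
    outbox.append((topic, payload))

def belt_action(payload):
    """What the receiving side does when the message arrives."""
    data = json.loads(payload)
    return "stop belt" if data["label"] == "defect" else "keep running"

outbox = []
photo = {"label": "defect", "confidence": 0.93}   # pretend model output

message = build_message("cam-1", vision_check(photo))
publish("factory/line1/vision", message, outbox)

topic, payload = outbox[0]
print("topic:", topic)
print("payload:", payload)
print("action:", belt_action(payload))
```

Each function maps to one arrow in the chain, so the same boxes can go straight onto the poster.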
Explain why in one sentence each. Who should be allowed to override the system?
Module 6 in short: photos are grids of numbers; AI learns patterns to recognise things.