• YOLUX: Comments on the use of YOLO (You Only Look Once) to detect defects on luxury leather goods and stains on textiles

    Olivier Vitrac, PhD., HDR | olivier.vitrac@adservio.fr – 2025-10-23

    Summary

    YOLO can work well for defect detection on luxury leather goods and stains on textiles, but you’ll get the best results with a hybrid pipeline and some domain-specific care in data, optics, and training. Below is a technical plan that reflects what typically works in production for small, subtle, highly variable defects.



    1 | YOLO Overview

    🧩 1.1 | Open-source status

    The original YOLOv1 was released by Joseph Redmon et al. in 2016 with fully open-source code in the Darknet framework. Since then, many variants have been released by the community, all open-source, although license terms vary by repository (the Ultralytics releases, for example, are AGPL-3.0):

    Version    | Developer / Organization                    | Year      | Framework        | License / Repo
    YOLOv1–v3  | Joseph Redmon & Ali Farhadi (U. Washington) | 2016–2018 | Darknet (C/CUDA) | pjreddie/darknet
    YOLOv4     | Alexey Bochkovskiy                          | 2020      | Darknet          | AlexeyAB/darknet
    YOLOv5     | Ultralytics                                 | 2020      | PyTorch          | ultralytics/yolov5
    YOLOv6     | Meituan                                     | 2022      | PyTorch          | meituan/YOLOv6
    YOLOv7     | WongKinYiu                                  | 2022      | PyTorch          | WongKinYiu/yolov7
    YOLOv8     | Ultralytics                                 | 2023      | PyTorch          | ultralytics/ultralytics

    All these are freely available and can be retrained on custom datasets (COCO, Pascal VOC, your own images, etc.).

    📚 References

    • Redmon, J., et al. (2016). You Only Look Once: Unified, Real-Time Object Detection. CVPR.

    • Bochkovskiy, A., et al. (2020). YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934.

    • Ultralytics (2023). YOLOv8 Documentation. https://docs.ultralytics.com


    ⚙️ 1.2 | The method — “You Only Look Once”

    YOLO introduced a single-stage, fully-convolutional approach to object detection.

    Key idea

    Instead of generating region proposals (like R-CNN), YOLO divides the input image into an S×S grid. Each grid cell directly predicts B bounding boxes, each parameterized by (x, y, w, h, confidence), together with C class probabilities:

    Output tensor

    For example, for COCO (C=80 classes):

    (1)  output shape = S × S × (B × 5 + C)

    where 5 corresponds to (x, y, w, h, confidence).

    Example (YOLOv3, one detection scale): S = 13, B = 3, C = 80 → 13 × 13 × (3 × 5 + 80) = 13 × 13 × 95.
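
    For concreteness, a minimal sketch of this arithmetic (plain Python; the values simply reproduce the YOLOv3 example above):

    ```python
    # Output shape of a YOLO detection head for an S x S grid,
    # B boxes per cell and C classes (each box = x, y, w, h, confidence).
    def head_output_shape(S: int, B: int, C: int) -> tuple:
        return (S, S, B * 5 + C)

    print(head_output_shape(S=13, B=3, C=80))  # -> (13, 13, 95)
    ```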

    Network architecture

    This enables end-to-end detection in one forward pass — hence the name.


    🧠 1.3 | How the detection model was trained

    Training YOLO is a standard supervised deep learning process using annotated datasets (bounding boxes + labels).

    (a) Datasets

    (b) Loss functions

    Earlier versions used a sum-of-squared-errors loss. Modern YOLOs use compound losses combining a box-regression term, an objectness term, and a classification term.

    For example, in YOLOv5:

    (2)  L = λ_box · L_box + λ_obj · L_obj + λ_cls · L_cls
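
    As a rough illustration only (not the Ultralytics implementation), the weighting in Eq. (2) can be sketched in PyTorch as below; the plain-IoU box term stands in for the CIoU loss actually used, and the λ defaults are placeholders:

    ```python
    import torch
    import torch.nn as nn

    bce = nn.BCEWithLogitsLoss()

    def iou_loss(pred_boxes, target_boxes):
        """1 - IoU for xyxy boxes (simplified stand-in for the CIoU term used in YOLOv5)."""
        x1 = torch.max(pred_boxes[:, 0], target_boxes[:, 0])
        y1 = torch.max(pred_boxes[:, 1], target_boxes[:, 1])
        x2 = torch.min(pred_boxes[:, 2], target_boxes[:, 2])
        y2 = torch.min(pred_boxes[:, 3], target_boxes[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
        area_t = (target_boxes[:, 2] - target_boxes[:, 0]) * (target_boxes[:, 3] - target_boxes[:, 1])
        iou = inter / (area_p + area_t - inter + 1e-7)
        return (1.0 - iou).mean()

    def yolo_compound_loss(pred, target, lambda_box=0.05, lambda_obj=1.0, lambda_cls=0.5):
        """Schematic compound loss (Eq. 2) on already-matched predictions/targets.
        Real implementations also handle anchor/cell assignment and per-scale balancing."""
        l_box = iou_loss(pred["boxes"], target["boxes"])
        l_obj = bce(pred["obj_logits"], target["obj"])   # objectness
        l_cls = bce(pred["cls_logits"], target["cls"])   # one-vs-all classification
        return lambda_box * l_box + lambda_obj * l_obj + lambda_cls * l_cls
    ```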

    (c) Optimization

    (d) Anchors

    YOLOv2+ introduced anchor boxes (like Faster R-CNN) to capture multi-scale shapes. Recent versions (YOLOv8) use anchor-free decoupled heads for simplicity and performance.


    🚀 1.4 | Performance evolution

    Version  | mAP                        | Speed    | Notes
    YOLOv1   | 63.4 % (VOC)               | ~45 fps  | First real-time detector
    YOLOv3   | 57.9 % (COCO, AP@0.5)      | ~30 fps  | Multi-scale + anchors
    YOLOv5s  | 37.4 % (COCO, AP@0.5:0.95) | ~140 fps | PyTorch + modern training
    YOLOv8n  | 37.3 % (COCO, AP@0.5:0.95) | ~150 fps | Anchor-free, flexible tasks (detect/segment/pose)

    Note that the figures are not directly comparable: YOLOv1 is reported on Pascal VOC, YOLOv3 as AP@0.5 on COCO, and YOLOv5s/YOLOv8n as the stricter AP@0.5:0.95 on COCO.

    🧩 1.5 | Training your own YOLO model

    In PyTorch (YOLOv5+):

    You can define your own dataset YAML and fine-tune pretrained weights.
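
    For instance, a minimal sketch with the current Ultralytics Python API (package `ultralytics`; `defects_luxe.yaml` refers to the dataset template of Section 3.1, and the weight/image names are placeholders):

    ```python
    from ultralytics import YOLO

    # Start from pretrained weights and fine-tune on a custom dataset YAML.
    model = YOLO("yolov8s.pt")                     # or "yolov8s-seg.pt" for segmentation
    model.train(data="defects_luxe.yaml", epochs=100, imgsz=1024, batch=8)

    metrics = model.val()                          # per-class mAP on the validation split
    model.predict("sample.jpg", conf=0.25, save=True)
    ```

    (YOLOv5 itself is trained with its train.py script, with equivalent --data/--img/--epochs arguments.)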


    🧩 1.6 | Invariance properties (details)

    Let’s analyze the invariances (and non-invariances) of YOLO-type convolutional detectors precisely and mechanistically.


    🧩 1.6.1 | Translation invariance (✓ approximate but effective)

    ⚙️ Mechanism
    ✅ Consequence

    YOLO exhibits:

    Note

    Later YOLOs mitigate this with multi-scale features and anchor boxes; translation invariance is not mathematically exact but empirically strong.


    🌀 1.6.2 | Rotation invariance (✗ not inherent)

    ⚙️ Mechanism
    🧠 Remedies

    To improve rotation invariance:

    1. Data augmentation: random rotations, affine transforms (standard in YOLO training).

    2. Rotated bounding boxes: variants like Rotated-YOLO, Oriented-YOLO, or YOLOv8-OBB explicitly predict orientation angles (θ).

    3. Group-equivariant CNNs (G-CNNs): theoretical frameworks using rotationally symmetric filters.

    ✅ Practical result

    Modern YOLOs achieve robustness (through augmentation), not true rotation invariance (i.e., feature maps are not equivariant to rotation).
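
    To illustrate the augmentation route (remedy 1 above), rotation/affine augmentation can be requested directly through the Ultralytics training arguments; the values below are illustrative starting points, not validated settings:

    ```python
    from ultralytics import YOLO

    model = YOLO("yolov8s.pt")
    model.train(
        data="defects_luxe.yaml",   # dataset spec (see Section 3.1)
        imgsz=1024,
        epochs=100,
        degrees=15.0,               # random rotation in [-15, +15] degrees
        shear=2.0,                  # mild affine shear
        translate=0.1,              # random translation (fraction of image size)
        fliplr=0.5,                 # horizontal flip probability
        flipud=0.0,                 # vertical flips off by default
    )
    ```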


    🪞 1.6.3 | Reflection / symmetry invariance (± partial)

    ⚙️ Mechanism
    ✅ Remedies

    🔍 1.6.4 | Scale invariance (✓ multi-scale, partial)

    Scale is a core invariance for detection.

    ⚙️ Mechanism

    🧠 Summary Table

    Invariance              | Mechanism                    | Degree | Notes
    Translation             | Convolution, grid assignment | ★★★★☆  | Excellent but quantized by cell grid
    Rotation                | None intrinsic               | ★★☆☆☆  | Needs augmentation or special heads
    Reflection / symmetry   | Augmentation (flip)          | ★★★☆☆  | Horizontal flip only
    Scale                   | FPN / anchors / pyramids     | ★★★★☆  | Robust across octaves
    Illumination / contrast | Augmentation                 | ★★☆☆☆  | Not architectural

    🧩 1.7 | Testing invariance properties

    Formally, for a transformation T acting on the input x, two properties are distinguished:

    Invariance:   f(Tx) = f(x)
    Equivariance: f(Tx) = T·f(x)

    YOLO’s convolutions are equivariant to translation, not invariant (detections move consistently with the object).


    🔬 Experimental verification

    You can test this by inference on transformed images:
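
    A minimal sketch of such a check with the Ultralytics API and OpenCV (weights and image path are placeholders):

    ```python
    import cv2
    import numpy as np
    from ultralytics import YOLO

    model = YOLO("yolov8s.pt")                     # placeholder weights
    img = cv2.imread("sample.jpg")                 # placeholder image
    h, w = img.shape[:2]

    def detect(image):
        """Return (N, 4) xyxy boxes and class ids for one image."""
        r = model.predict(image, conf=0.25, verbose=False)[0]
        return r.boxes.xyxy.cpu().numpy(), r.boxes.cls.cpu().numpy()

    variants = {
        "original":   img,
        "translated": cv2.warpAffine(img, np.float32([[1, 0, 40], [0, 1, 0]]), (w, h)),  # 40 px shift
        "rotated":    cv2.warpAffine(img, cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0), (w, h)),
        "flipped":    cv2.flip(img, 1),            # horizontal mirror
    }

    for name, image in variants.items():
        boxes, classes = detect(image)
        print(name, len(boxes), boxes.round(1))
    ```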

    You’ll observe that detections shift consistently under translation (equivariance), tend to degrade under large rotations, and are largely preserved under horizontal flips thanks to flip augmentation.


    2 | When YOLO is a good fit (and when it’s not)

    2.1 | Supervised vs. unsupervised approaches

    Good fit (supervised)

    Less ideal (unsupervised/anomaly)

    Practical recommendation: two-stage hybrid (sketched below).
    • Stage A: unsupervised anomaly heatmap (trained on a few hundred “good” images) → candidate regions.
    • Stage B: supervised YOLO (detect/segment) on the mined ROIs for the defects you care about (taxonomy below).
    This yields high recall on unknowns + stable precision on known classes.
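
    A schematic sketch of that hybrid pipeline follows; `anomaly_heatmap()` is a hypothetical placeholder for whichever unsupervised model is used (e.g., PaDiM/PatchCore), and the YOLO weights name is illustrative:

    ```python
    import cv2
    import numpy as np
    from ultralytics import YOLO

    yolo = YOLO("defects_yolov8m-seg.pt")            # hypothetical fine-tuned weights (Stage B)

    def anomaly_heatmap(image: np.ndarray) -> np.ndarray:
        """Placeholder for Stage A (e.g., PaDiM/PatchCore trained on 'good' images only).
        Returns a per-pixel anomaly score in [0, 1]."""
        raise NotImplementedError

    def candidate_rois(heatmap, thr=0.5, pad=64):
        """Threshold the heatmap and return padded bounding boxes of connected blobs."""
        mask = (heatmap > thr).astype(np.uint8)
        n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
        h, w = mask.shape
        rois = []
        for x, y, bw, bh, _ in stats[1:]:            # skip background component 0
            rois.append((max(0, x - pad), max(0, y - pad), min(w, x + bw + pad), min(h, y + bh + pad)))
        return rois

    def inspect(image):
        detections = []
        for x0, y0, x1, y1 in candidate_rois(anomaly_heatmap(image)):
            r = yolo.predict(image[y0:y1, x0:x1], conf=0.15, verbose=False)[0]   # Stage B on each ROI
            for box in r.boxes.xyxy.cpu().numpy():
                detections.append([box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0])
        return detections
    ```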


    2.2 | Data acquisition: optics & illumination (critical here)

    Small, low-contrast surface defects are often photometry-limited more than model-limited.


    2.3 | Taxonomy & annotation strategy

    Define a controlled vocabulary with visual criteria and severity grades:


    2.4 | Model choices (YOLO variants & companions)


    2.5 | Training recipe (supervised YOLO)

    Hyperparameters (starting point)

    Losses & imbalance

    Augmentations (be careful here)

    Validation protocol


    2.6 | Inference at scale


    2.7 | Robustness & invariances for your case


    2.8 | Practical pitfalls (seen often)


    2.9 | Suggested minimal POC plan (4–6 weeks)

    1. Design of experiments for capture: 2–3 lighting setups, 1–2 cameras, tile size & overlap grid.

    2. Collect ~1–2k images across ≥6 product SKUs, annotate 8–10 defect classes + masks for stains.

    3. Train YOLOv8-seg (m or s) at 1024, light augmentations; measure per-class Recall@0.5.

    4. If recall on rare defects <90%, add anomaly stage (PaDiM/PatchCore) to propose ROIs, retrain YOLO on mined ROIs.

    5. Calibrate severity in mm and integrate AQL thresholds; pilot on a blind batch of new items.


    2.10 | Deliverables to industrialize


    If you want, I can draft:


    3 | Templates

    3.1 | Dataset spec (YAML) — defects_luxe.yaml

    Intent: a controlled taxonomy for leather & textile defects, paths, and meta needed by Ultralytics (YOLOv8/9). Note: keep class names lowercase, _ separators; you can prune/extend later.
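
    A minimal sketch of what such a file could contain (the root path is a placeholder and the class list merely illustrates the taxonomy; adapt both):

    ```yaml
    # defects_luxe.yaml : illustrative dataset spec for Ultralytics (detect or segment task)
    path: /data/defects_luxe          # dataset root (placeholder)
    train: images/train
    val: images/val
    test: images/test                 # optional

    names:
      0: scratch
      1: cut
      2: oil_stain
      3: water_stain
      4: discoloration
      5: glue_residue
      6: stitching_defect
      7: abrasion
    ```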

    For segmentation, annotate masks (polygons) for stains and irregular wear (e.g., oil_stain, water_stain, discoloration), optionally also for scratch if you want shape features; for detection-only, use tight bboxes. If you later need orientation, we can add an oriented-box head (OBB) or fit line segments post hoc.


    3.2 | Small-defect–oriented training config — yolo_defects_train.yaml

    Intent: a starting recipe tuned for small, subtle defects (higher input size, gentle augs, recall-friendly losses). Use with:
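
    A sketch of the kind of overrides such a config could carry (values are starting points, not validated settings):

    ```yaml
    # yolo_defects_train.yaml : illustrative training overrides for small, subtle defects
    imgsz: 1024          # larger input preserves small defects
    epochs: 200
    batch: 8
    patience: 50
    lr0: 0.005
    cos_lr: true
    mosaic: 0.2          # gentle: heavy mosaic can destroy small-defect context
    mixup: 0.0
    degrees: 10.0        # mild rotation
    fliplr: 0.5
    flipud: 0.0
    hsv_v: 0.2           # limited brightness jitter (illumination handled by capture design)
    ```

    and a plausible invocation, assuming the Ultralytics CLI:

    ```bash
    yolo segment train model=yolov8m-seg.pt data=defects_luxe.yaml cfg=yolo_defects_train.yaml
    ```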

    (Change task=detect and model=yolov8m.pt if you don’t do segmentation.)

    Notes & knobs to try later


    3.3 | Tiling + fusion inference with severity in mm — infer_tiled_severity.py

    Intent: robust detection/segmentation on large images with small defects; overlap-tiling, cross-tile NMS, mm-unit severity scoring, multi-illumination fusion (optional).
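
    A condensed sketch of what such a script could look like (the severity weights, default paths, and the contrast proxy are placeholders; the full template would add mask handling, JSON export, and multi-illumination fusion):

    ```python
    #!/usr/bin/env python3
    """infer_tiled_severity.py (sketch): overlap tiling + cross-tile NMS + mm-unit severity."""
    import argparse

    import cv2
    import numpy as np
    import torch
    from torchvision.ops import nms
    from ultralytics import YOLO

    def tiles(w, h, tile=1024, overlap=0.35):
        """Yield (x0, y0) origins of overlapping tiles covering a w x h image."""
        step = max(1, int(tile * (1.0 - overlap)))
        xs = list(range(0, max(w - tile, 0) + 1, step)) or [0]
        ys = list(range(0, max(h - tile, 0) + 1, step)) or [0]
        for y in ys:
            for x in xs:
                yield x, y

    def severity_mm(box_xyxy, mm_per_pixel, contrast, alpha=1.0, beta=0.5, gamma=10.0):
        """Toy severity in the spirit of Eq. (4), location term omitted; replace with the official AQL mapping."""
        length = (box_xyxy[2] - box_xyxy[0]) * mm_per_pixel
        width = (box_xyxy[3] - box_xyxy[1]) * mm_per_pixel
        return alpha * length + beta * width + gamma * contrast

    def main():
        ap = argparse.ArgumentParser()
        ap.add_argument("image")
        ap.add_argument("--weights", default="best.pt")      # fine-tuned detection/segmentation weights
        ap.add_argument("--tile", type=int, default=1024)
        ap.add_argument("--overlap", type=float, default=0.35)
        ap.add_argument("--iou", type=float, default=0.5)
        ap.add_argument("--conf", type=float, default=0.15)
        ap.add_argument("--mm_per_pixel", type=float, required=True)
        args = ap.parse_args()

        model = YOLO(args.weights)
        img = cv2.imread(args.image)
        h, w = img.shape[:2]

        boxes, scores, classes = [], [], []
        for x0, y0 in tiles(w, h, args.tile, args.overlap):
            crop = img[y0:y0 + args.tile, x0:x0 + args.tile]
            r = model.predict(crop, conf=args.conf, verbose=False)[0]
            for b, s, c in zip(r.boxes.xyxy.cpu().numpy(), r.boxes.conf.cpu().numpy(), r.boxes.cls.cpu().numpy()):
                boxes.append([b[0] + x0, b[1] + y0, b[2] + x0, b[3] + y0])   # tile -> global coordinates
                scores.append(float(s))
                classes.append(int(c))

        if boxes:
            # class-aware global NMS: offsetting boxes by class id keeps classes from suppressing each other
            t_boxes = torch.tensor(boxes, dtype=torch.float32)
            offsets = torch.tensor(classes, dtype=torch.float32).unsqueeze(1) * 1e5
            keep = nms(t_boxes + offsets, torch.tensor(scores), args.iou)
            for i in keep.tolist():
                x1, y1, x2, y2 = (int(v) for v in boxes[i])
                contrast = float(np.std(img[y1:y2, x1:x2]) / 255.0)          # crude contrast proxy
                print(classes[i], round(scores[i], 3), round(severity_mm(boxes[i], args.mm_per_pixel, contrast), 2))

    if __name__ == "__main__":
        main()
    ```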

    Note

    • Tiling: default tile=1024, overlap=0.35 (increase overlap for micro-defects).

    • Global NMS: merges all tile detections per class (IoU=0.5 default).

    • Severity: simple linear form using length/width (mm) + contrast proxy. Replace with your official AQL mapping:

      (4)  S = α · length(mm) + β · width(mm) + γ · contrast + δ · location_weight
    • mm/pixel: pass --mm_per_pixel from your calibration; for multi-SKU, maintain a per-SKU table.

    • Multi-illumination bursts: run the script across each illumination folder and fuse JSONs with a rule like OR-on-presence or score pooling (can add a second pass to merge by IoU across illuminations).


    Quick-start commands

    Train (segmentation)
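
    For example (assuming the Ultralytics CLI and the template files from Sections 3.1–3.2):

    ```bash
    yolo segment train model=yolov8m-seg.pt data=defects_luxe.yaml cfg=yolo_defects_train.yaml imgsz=1024
    ```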

    Infer (tiled, severity)
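
    For example (image name, weights path, and calibration value are placeholders):

    ```bash
    python infer_tiled_severity.py panel_001.jpg --weights runs/segment/train/weights/best.pt \
        --tile 1024 --overlap 0.35 --mm_per_pixel 0.08
    ```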


    4 | Possible iterations