# 🧪 Testing Guide for ViT Auditing Toolkit

Complete guide for testing all features using the provided sample images.

## 📋 Quick Test Checklist

- [ ] Basic Explainability - Attention Visualization
- [ ] Basic Explainability - GradCAM
- [ ] Basic Explainability - GradientSHAP
- [ ] Counterfactual Analysis - All perturbation types
- [ ] Confidence Calibration - Different bin sizes
- [ ] Bias Detection - Multiple subgroups
- [ ] Model Switching (ViT-Base → ViT-Large)

---

## 🔍 Tab 1: Basic Explainability Testing

### Test 1: Attention Visualization

**Image**: `examples/basic_explainability/cat_portrait.jpg`

**Steps**:
1. Load ViT-Base model
2. Upload cat_portrait.jpg
3. Select "Attention Visualization"
4. Try these layer/head combinations:
   - Layer 0, Head 0 (low-level features)
   - Layer 6, Head 0 (mid-level patterns)
   - Layer 11, Head 0 (high-level semantics)

**Expected Results**:
- ✅ Early layers: Focus on edges, textures
- ✅ Middle layers: Focus on cat features (ears, eyes)
- ✅ Late layers: Focus on discriminative regions (face)
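
If you want to sanity-check these patterns outside the UI, a minimal sketch using Hugging Face `transformers` is below. The `google/vit-base-patch16-224` checkpoint and the 14×14 patch grid are assumptions; the toolkit's own loading code may differ.

```python
# Minimal sketch: inspect one attention head of a ViT classifier.
# Assumes the google/vit-base-patch16-224 checkpoint.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
model.eval()

image = Image.open("examples/basic_explainability/cat_portrait.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

layer, head = 11, 0
# outputs.attentions: tuple of 12 tensors, each (batch, heads, 197, 197)
attn = outputs.attentions[layer][0, head]
# CLS-token attention over the 196 image patches, folded into a 14x14 grid
cls_attn = attn[0, 1:].reshape(14, 14)
print(cls_attn)
```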

---

### Test 2: GradCAM Visualization

**Image**: `examples/basic_explainability/sports_car.jpg`

**Steps**:
1. Upload sports_car.jpg
2. Select "GradCAM" method
3. Click "Analyze Image"

**Expected Results**:
- ✅ Heatmap highlights car body, wheels
- ✅ Prediction confidence > 70%
- ✅ Top class includes "sports car" or "convertible"
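
To reproduce a GradCAM heatmap outside the UI, here is a sketch using the `pytorch-grad-cam` package, reusing `model` and `inputs` from the attention sketch above. The target layer, the logits wrapper, and the 14×14 reshape are assumptions about how GradCAM is wired to a Hugging Face ViT, not the toolkit's actual code.

```python
# Sketch: GradCAM on a Hugging Face ViT via pytorch-grad-cam.
import torch
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def reshape_transform(tensor, height=14, width=14):
    # Drop the CLS token and fold the 196 patch tokens back into a 14x14 grid
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)  # (batch, channels, H, W)

class LogitsOnly(torch.nn.Module):
    """Unwrap the Hugging Face output object so GradCAM sees a plain tensor."""
    def __init__(self, model):
        super().__init__()
        self.model = model
    def forward(self, x):
        return self.model(pixel_values=x).logits

wrapped = LogitsOnly(model)  # `model` from the attention sketch
target_layers = [model.vit.encoder.layer[-1].layernorm_before]  # assumed choice
cam = GradCAM(model=wrapped, target_layers=target_layers,
              reshape_transform=reshape_transform)
grayscale_cam = cam(input_tensor=inputs["pixel_values"],
                    targets=[ClassifierOutputTarget(817)])  # 817: ImageNet "sports car"
```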

---

### Test 3: GradientSHAP

**Image**: `examples/basic_explainability/bird_flying.jpg`

**Steps**:
1. Upload bird_flying.jpg
2. Select "GradientSHAP" method
3. Wait for analysis (takes ~10-15 seconds)

**Expected Results**:
- ✅ Attribution map shows bird outline
- ✅ Wings and body highlighted
- ✅ Background has low attribution
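
For an offline check of GradientSHAP, here is a sketch with Captum's `GradientShap`, reusing `wrapped` and `inputs` from the sketches above. The baseline distribution and sampling parameters are assumptions; the toolkit may choose differently.

```python
# Sketch: GradientSHAP attributions via Captum.
import torch
from captum.attr import GradientShap

pixel_values = inputs["pixel_values"]  # from the attention sketch
# Assumed baseline set: an all-zeros image plus a dimmed copy of the input
baselines = torch.cat([torch.zeros_like(pixel_values), pixel_values * 0.5])

gradient_shap = GradientShap(wrapped)  # logits-returning wrapper from above
target = wrapped(pixel_values).argmax(dim=-1).item()
attributions = gradient_shap.attribute(pixel_values,
                                       baselines=baselines,
                                       n_samples=20,
                                       stdevs=0.1,
                                       target=target)
# Sum over channels for a single-channel attribution map
attr_map = attributions.sum(dim=1).squeeze(0)
```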

---

### Test 4: Multiple Objects

**Image**: `examples/basic_explainability/coffee_cup.jpg`

**Steps**:
1. Upload coffee_cup.jpg
2. Try all three methods
3. Compare explanations

**Expected Results**:
- ✅ All methods highlight the cup
- ✅ Consistent predictions across methods
- ✅ Some variation in exact highlighted regions

---

## 🔄 Tab 2: Counterfactual Analysis Testing

### Test 5: Face Feature Importance

**Image**: `examples/counterfactual/face_portrait.jpg`

**Steps**:
1. Upload face_portrait.jpg
2. Settings:
   - Patch size: 32
   - Perturbation: blur
3. Click "Run Counterfactual Analysis"

**Expected Results**:
- ✅ Face region shows high sensitivity
- ✅ Background regions have low impact
- ✅ Prediction flip rate < 50%
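
The core of this analysis can be approximated in a few lines: blur one patch at a time, re-run the model, and record the confidence drop. The sketch below (reusing `wrapped` from the GradCAM sketch) is a simplified stand-in for the toolkit's implementation; blur strength and kernel size are assumptions.

```python
# Sketch: patch-level counterfactual probing via per-patch blurring.
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def patch_sensitivity(wrapped, pixel_values, patch=32, sigma=5.0):
    probs = torch.softmax(wrapped(pixel_values), dim=-1)
    cls = probs.argmax(dim=-1).item()
    base_conf = probs[0, cls].item()

    _, _, H, W = pixel_values.shape  # 224x224 for ViT-Base
    drops = torch.zeros(H // patch, W // patch)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            perturbed = pixel_values.clone()
            region = perturbed[:, :, i:i+patch, j:j+patch]
            perturbed[:, :, i:i+patch, j:j+patch] = TF.gaussian_blur(
                region, kernel_size=15, sigma=sigma)
            conf = torch.softmax(wrapped(perturbed), dim=-1)[0, cls].item()
            drops[i // patch, j // patch] = base_conf - conf
    return drops  # high values = sensitive regions
```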

---

### Test 6: Vehicle Components

**Image**: `examples/counterfactual/car_side.jpg`

**Steps**:
1. Upload car_side.jpg
2. Test each perturbation type:
   - Blur
   - Blackout
   - Gray
   - Noise
3. Compare results

**Expected Results**:
- ✅ Wheels are critical regions
- ✅ Windows/doors moderately important
- ✅ Blackout causes most disruption
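
The four perturbation types map to simple tensor operations, sketched below. The exact blur sigma, gray value, and noise scale are assumptions; note that on normalized `pixel_values`, zeros correspond to mid-gray rather than true black.

```python
# Sketch: the four perturbation modes as tensor ops on normalized patches.
import torch
import torchvision.transforms.functional as TF

def perturb(region, mode):
    # `region` is a slice of normalized pixel_values; with the ViT processor's
    # mean/std of 0.5, raw black maps to -1 and mid-gray to 0.
    if mode == "blur":
        return TF.gaussian_blur(region, kernel_size=15, sigma=5.0)
    if mode == "blackout":
        return torch.full_like(region, -1.0)
    if mode == "gray":
        return torch.zeros_like(region)
    if mode == "noise":
        return region + 0.5 * torch.randn_like(region)
    raise ValueError(f"unknown perturbation: {mode}")
```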

---

### Test 7: Architectural Elements

**Image**: `examples/counterfactual/building.jpg`

**Steps**:
1. Upload building.jpg
2. Patch size: 48
3. Perturbation: gray

**Expected Results**:
- ✅ Structural elements highlighted
- ✅ Lower flip rate (buildings are robust)
- ✅ Consistent confidence across patches

---

### Test 8: Simple Object Baseline

**Image**: `examples/counterfactual/flower.jpg`

**Steps**:
1. Upload flower.jpg
2. Try smallest patch size (16)
3. Use blackout perturbation

**Expected Results**:
- ✅ Flower center most critical
- ✅ Petals moderately important
- ✅ Background has minimal impact

---

## 📊 Tab 3: Confidence Calibration Testing

### Test 9: High-Quality Image

**Image**: `examples/calibration/clear_panda.jpg`

**Steps**:
1. Upload clear_panda.jpg
2. Number of bins: 10
3. Run analysis

**Expected Results**:
- ✅ High mean confidence (> 0.8)
- ✅ Low overconfident rate
- ✅ Calibration curve near diagonal
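
For reference, the binning behind a calibration curve looks like the sketch below. How the toolkit derives multiple predictions from a single image (e.g. crops or augmentations) is not shown here; `confidences` and `correct` are assumed inputs.

```python
# Sketch: bin confidence scores into a reliability-style table.
import numpy as np

def calibration_bins(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            # A well-calibrated model has mean confidence ~= accuracy per bin
            rows.append((lo, hi, confidences[mask].mean(), correct[mask].mean()))
    return rows  # (bin_lo, bin_hi, mean_confidence, accuracy) per non-empty bin
```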

---

### Test 10: Complex Scene

**Image**: `examples/calibration/workspace.jpg`

**Steps**:
1. Upload workspace.jpg
2. Number of bins: 15
3. Compare with panda results

**Expected Results**:
- ✅ Lower mean confidence (multiple objects)
- ✅ Higher variance in predictions
- ✅ Predictions more evenly distributed across bins

---

### Test 11: Bin Size Comparison

**Image**: `examples/calibration/outdoor_scene.jpg`

**Steps**:
1. Upload outdoor_scene.jpg
2. Test with bins: 5, 10, 20
3. Compare calibration curves

**Expected Results**:
- ✅ More bins = finer granularity
- ✅ General trend stays consistent
- ✅ 10 bins is usually optimal

---

## ⚖️ Tab 4: Bias Detection Testing

### Test 12: Lighting Conditions

**Image**: `examples/bias_detection/dog_daylight.jpg`

**Steps**:
1. Upload dog_daylight.jpg
2. Run bias detection
3. Note confidence for daylight subgroup

**Expected Results**:
- ✅ 4 subgroups generated (original, bright+, bright-, contrast+)
- ✅ Confidence varies across subgroups
- ✅ Original typically has the highest confidence
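
A sketch of how such subgroups can be generated and scored is below, reusing `processor` and `wrapped` from the sketches above. The PIL enhancement factors are assumptions about what "bright+", "bright-", and "contrast+" mean.

```python
# Sketch: build brightness/contrast subgroups and compare model confidence.
import torch
from PIL import Image, ImageEnhance

image = Image.open("examples/bias_detection/dog_daylight.jpg").convert("RGB")
subgroups = {
    "original": image,
    "bright+": ImageEnhance.Brightness(image).enhance(1.5),   # assumed factor
    "bright-": ImageEnhance.Brightness(image).enhance(0.6),   # assumed factor
    "contrast+": ImageEnhance.Contrast(image).enhance(1.5),   # assumed factor
}

for name, img in subgroups.items():
    batch = processor(images=img, return_tensors="pt")  # from the first sketch
    with torch.no_grad():
        probs = torch.softmax(wrapped(batch["pixel_values"]), dim=-1)
    conf, cls = probs.max(dim=-1)
    print(f"{name:10s} class={cls.item():4d} confidence={conf.item():.3f}")
```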

---

### Test 13: Indoor vs Outdoor

**Images**:
- `examples/bias_detection/cat_indoor.jpg`
- `examples/bias_detection/bird_outdoor.jpg`

**Steps**:
1. Test both images separately
2. Compare confidence distributions
3. Note any systematic differences

**Expected Results**:
- ✅ Both should predict correctly
- ✅ Confidence may vary
- ✅ Subgroup metrics show variations

---

### Test 14: Urban Environment

**Image**: `examples/bias_detection/urban_scene.jpg`

**Steps**:
1. Upload urban_scene.jpg
2. Run bias detection
3. Check for environmental bias

**Expected Results**:
- ✅ Multiple objects detected
- ✅ Varied confidence across subgroups
- ✅ Brightness variations affect predictions

---

## 🎯 Cross-Tab Testing

### Test 15: Same Image, All Tabs

**Image**: `examples/general/pizza.jpg`

**Steps**:
1. Tab 1: Check predictions and explanations
2. Tab 2: Test robustness with perturbations
3. Tab 3: Check confidence calibration
4. Tab 4: Analyze across subgroups

**Expected Results**:
- ✅ Consistent predictions across tabs
- ✅ High confidence (pizza is a clear class)
- ✅ Robust to perturbations
- ✅ Well-calibrated

---

### Test 16: Model Comparison

**Image**: `examples/general/laptop.jpg`

**Steps**:
1. Load ViT-Base, analyze laptop.jpg in Tab 1
2. Note top predictions and confidence
3. Load ViT-Large, analyze same image
4. Compare results

**Expected Results**:
- ✅ ViT-Large slightly higher confidence
- ✅ Similar top predictions
- ✅ Better attention patterns (Large)
- ✅ Longer inference time (Large)
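
A self-contained sketch for comparing the two models from a script is below. The `google/vit-base-patch16-224` and `google/vit-large-patch16-224` names are assumptions about which checkpoints the toolkit loads.

```python
# Sketch: run both ViT checkpoints on the same image and compare.
import time
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

image = Image.open("examples/general/laptop.jpg")
for ckpt in ("google/vit-base-patch16-224", "google/vit-large-patch16-224"):
    processor = ViTImageProcessor.from_pretrained(ckpt)
    model = ViTForImageClassification.from_pretrained(ckpt)
    model.eval()
    inputs = processor(images=image, return_tensors="pt")
    start = time.perf_counter()
    with torch.no_grad():
        logits = model(**inputs).logits
    elapsed = time.perf_counter() - start
    conf, cls = torch.softmax(logits, dim=-1).max(dim=-1)
    print(f"{ckpt}: {model.config.id2label[cls.item()]} "
          f"(confidence {conf.item():.3f}, {elapsed:.2f}s)")
```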

---

### Test 17: Edge Case Testing

**Image**: `examples/general/mountain.jpg`

**Steps**:
1. Test in all tabs
2. Note predictions (landscape/nature)
3. Check explanation quality

**Expected Results**:
- ✅ May predict multiple classes (mountain, valley, landscape)
- ✅ Lower confidence (ambiguous category)
- ✅ Attention spread across scene

---

### Test 18: Furniture Classification

**Image**: `examples/general/chair.jpg`

**Steps**:
1. Basic explainability test
2. Counterfactual with blur
3. Check which parts are critical

**Expected Results**:
- ✅ Predicts chair/furniture
- ✅ Legs and seat are critical
- ✅ Background less important

---

## 🔧 Performance Testing

### Test 19: Load Time

**Steps**:
1. Clear browser cache
2. Time model loading
3. Note first analysis time vs subsequent

**Expected**:
- First load: 5-15 seconds
- Subsequent: < 1 second
- Analysis: 2-5 seconds per image
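
From the Python side, the first-load vs. subsequent-load difference can be checked with a quick timing sketch; the second call is served from the local Hugging Face cache, which is where most of the speedup comes from.

```python
# Sketch: rough load-time check (first call downloads, second hits the cache).
import time
from transformers import ViTForImageClassification

for attempt in (1, 2):
    start = time.perf_counter()
    ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
    print(f"load {attempt}: {time.perf_counter() - start:.1f}s")
```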

---

### Test 20: Memory Usage

**Steps**:
1. Open browser dev tools
2. Monitor memory during analysis
3. Test with both models

**Expected**:
- ViT-Base: ~2GB RAM
- ViT-Large: ~4GB RAM
- No memory leaks over multiple analyses

---

## 🐛 Error Handling Testing

### Test 21: Invalid Inputs

**Steps**:
1. Try uploading non-image file
2. Try very large image (> 50MB)
3. Try corrupted image

**Expected**:
- ✅ Graceful error messages
- ✅ No crashes
- ✅ User-friendly feedback
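
The kind of validation that produces these graceful failures looks roughly like the sketch below; the 50 MB limit mirrors the step above, but the checks and messages are assumptions, not the toolkit's actual code.

```python
# Sketch: upload validation that fails gracefully instead of crashing.
import os
from PIL import Image, UnidentifiedImageError

MAX_BYTES = 50 * 1024 * 1024  # mirrors the 50 MB guideline above

def load_upload(path):
    if os.path.getsize(path) > MAX_BYTES:
        return None, "Image exceeds the 50 MB upload limit."
    try:
        image = Image.open(path)
        image.load()  # force full decoding so corrupt files fail here
    except (UnidentifiedImageError, OSError):
        return None, "File is not a readable image."
    return image.convert("RGB"), None
```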

---

### Test 22: Edge Cases

**Steps**:
1. Try extremely dark/bright images
2. Try pure noise images
3. Try text-only images

**Expected**:
- ✅ Model makes predictions
- ✅ Lower confidence expected
- ✅ Explanations still generated

---

## 📝 Test Results Template

```markdown
## Test Session: [Date]

**Tester**: [Name]
**Model**: ViT-Base / ViT-Large
**Browser**: [Chrome/Firefox/Safari]
**Environment**: [Local/Docker/Cloud]

### Results Summary:
- Tests Passed: __/22
- Tests Failed: __/22
- Critical Issues: __
- Minor Issues: __

### Detailed Results:

#### Test 1: Attention Visualization
- Status: ✅ Pass / ❌ Fail
- Notes: [observations]

[Continue for all tests...]

### Issues Found:
1. [Issue description]
   - Severity: Critical/Major/Minor
   - Steps to reproduce:
   - Expected:
   - Actual:

### Recommendations:
- [Improvement suggestions]
```

---

## 🚀 Quick Smoke Test (5 minutes)

Fastest way to verify everything works:

```bash
# 1. Start app
python app.py

# 2. Load ViT-Base model

# 3. Quick tests:
#    Tab 1: Upload examples/basic_explainability/cat_portrait.jpg → Analyze
#    Tab 2: Upload examples/counterfactual/flower.jpg → Analyze
#    Tab 3: Upload examples/calibration/clear_panda.jpg → Analyze
#    Tab 4: Upload examples/bias_detection/dog_daylight.jpg → Analyze

# 4. All should complete without errors
```

---

## 🔄 Automated Testing

Run the automated tests:

```bash
# Unit tests
pytest tests/test_phase1_complete.py -v

# Advanced features tests
pytest tests/test_advanced_features.py -v

# All tests with coverage
pytest tests/ --cov=src --cov-report=html
```

---

## 👥 User Acceptance Testing

**Scenario 1: First-time User**
- Can they understand the interface?
- Can they complete basic analysis?
- Is the documentation helpful?

**Scenario 2: Researcher**
- Can they compare multiple methods?
- Can they export results?
- Is explanation quality sufficient?

**Scenario 3: ML Practitioner**
- Can they validate their model?
- Are the metrics meaningful?
- Can they identify issues?

---

## ✅ Sign-off Criteria

Before considering testing complete:

- [ ] All 22 tests pass
- [ ] No critical bugs
- [ ] Performance acceptable
- [ ] Documentation accurate
- [ ] User feedback positive
- [ ] All tabs functional
- [ ] Both models work
- [ ] Error handling robust

---

**Happy Testing! 🎉**

For issues or questions, see [CONTRIBUTING.md](CONTRIBUTING.md)