takomattyy committed on
Commit db10255 · verified · 1 Parent(s): 8a460c1

Upload 20 files
.gitignore ADDED
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# PaddleOCR
.paddleocr/
inference/
*.pdparams
*.pdopt
*.pdmodel

# Temporary files
temp_*.jpg
temp_*.png
*.tmp

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Logs
*.log
logs/

# Environment variables
.env
.env.local

# Output
output/
results/
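The entries above use gitignore glob syntax. As a rough illustration of how such globs match file names (Python's `fnmatch` only approximates gitignore semantics and ignores directory rules and negation):

```python
from fnmatch import fnmatch

# A few of the patterns from the .gitignore above
patterns = ["*.py[cod]", "*.log", "temp_*.jpg", "*.egg"]

def roughly_ignored(name):
    """Return True if a bare filename matches any pattern."""
    return any(fnmatch(name, p) for p in patterns)

print(roughly_ignored("module.pyc"))     # matches *.py[cod]
print(roughly_ignored("temp_scan.jpg"))  # matches temp_*.jpg
print(roughly_ignored("app.py"))         # no pattern matches
```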
DEPLOYMENT_GUIDE.md ADDED
# Deployment Guide for Hugging Face Spaces

This guide will help you deploy the HandyHome OCR API to Hugging Face Spaces.

## Prerequisites

- A Hugging Face account (free at https://huggingface.co/join)
- Git installed on your machine (optional, for command-line deployment)

## Deployment Options

### Option 1: Web UI Deployment (Easiest)

#### Step 1: Create a New Space

1. Go to https://huggingface.co/new-space
2. Fill in the details:
- **Owner**: Your username
- **Space name**: `handyhome-ocr-api` (or any name you prefer)
- **License**: MIT
- **Select the Space SDK**: Choose **Docker**
- **Space hardware**: Start with **CPU basic** (free tier)
- **Visibility**: Choose Public or Private

3. Click **Create Space**

#### Step 2: Upload Files via Web UI

1. In your new Space, click the **Files** tab
2. Click **Add file** → **Upload files**
3. Upload the following files from the `huggingface-ocr` folder:
```
app.py
requirements.txt
Dockerfile
README.md
.gitignore
extract_national_id.py
extract_drivers_license.py
extract_prc.py
extract_umid.py
extract_sss.py
extract_passport.py
extract_postal.py
extract_phic.py
extract_nbi_ocr.py
extract_police_ocr.py
extract_tesda_ocr.py
analyze_document.py
```

4. Click **Commit changes to main**

#### Step 3: Wait for Build

1. Go to the **App** tab
2. You'll see the build progress
3. The initial build takes **5-10 minutes** due to:
- Installing PaddleOCR and dependencies
- Downloading OCR models (~500MB)
- Building the Docker container

4. Watch the build logs for any errors

#### Step 4: Verify Deployment

Once built, test your API:

```bash
# Check health
curl https://YOUR-USERNAME-handyhome-ocr-api.hf.space/health

# Expected response:
# {"status":"healthy","service":"handyhome-ocr-api","version":"1.0.0"}
```

### Option 2: Git Command Line Deployment

#### Step 1: Create Space on Web

Follow Step 1 from Option 1 above.

#### Step 2: Clone Space Repository

```bash
# Install Git LFS (if not already installed)
git lfs install

# Clone your space
git clone https://huggingface.co/spaces/YOUR-USERNAME/handyhome-ocr-api
cd handyhome-ocr-api
```

#### Step 3: Copy Files

```bash
# Copy all files from the huggingface-ocr folder
cp -r ../huggingface-ocr/* .
```

#### Step 4: Commit and Push

```bash
# Add all files
git add .

# Commit
git commit -m "Initial deployment of HandyHome OCR API"

# Push to Hugging Face
git push
```

#### Step 5: Monitor Build

Go to your Space URL to watch the build progress.

## Configuration

### Space Settings

In your Space settings, you can configure:

1. **Hardware**:
- **CPU basic** (free): 2 vCPU, 16GB RAM - suitable for testing
- **CPU upgrade** (paid): Better performance
- **GPU** (paid): Faster OCR processing

2. **Sleep time**:
- Free tier: Sleeps after 48 hours of inactivity
- Paid tier: Sleep can be disabled

3. **Secrets** (if needed):
- Add environment variables in Settings → Repository secrets

### Custom Domain (Optional)

For production, you can set up a custom domain in Space settings.

## Testing Your Deployment

### Test Health Endpoint

```bash
curl https://YOUR-USERNAME-handyhome-ocr-api.hf.space/health
```

### Test OCR Extraction

```bash
# Test National ID extraction
curl -X POST https://YOUR-USERNAME-handyhome-ocr-api.hf.space/api/extract-national-id \
  -H "Content-Type: application/json" \
  -d '{"document_url": "YOUR_IMAGE_URL"}'
```

### Test in Python

```python
import requests

base_url = "https://YOUR-USERNAME-handyhome-ocr-api.hf.space"

# Test health
response = requests.get(f"{base_url}/health")
print(response.json())

# Test extraction
response = requests.post(
    f"{base_url}/api/extract-national-id",
    json={"document_url": "YOUR_IMAGE_URL"}
)
print(response.json())
```
## Integration with Your Main App

Update your main Flask app (`handyhome-web-scripts/app.py`) to use the Hugging Face Space:

```python
import requests
from flask import request, jsonify  # assumes the app's existing Flask setup

HUGGINGFACE_OCR_API = "https://YOUR-USERNAME-handyhome-ocr-api.hf.space"

@app.route('/extract-document', methods=['POST'])
def extract_document():
    data = request.json
    image_url = data.get('image_url')
    document_type = data.get('document_type')

    # Map document types to HF Space endpoints
    endpoint_mapping = {
        'National ID': '/api/extract-national-id',
        "Driver's License": '/api/extract-drivers-license',
        'PRC ID': '/api/extract-prc',
        'UMID': '/api/extract-umid',
        'SSS ID': '/api/extract-sss',
        'Passport': '/api/extract-passport',
        'Postal ID': '/api/extract-postal',
        'PHIC': '/api/extract-phic',
        'NBI Clearance': '/api/extract-nbi',
        'Police Clearance': '/api/extract-police-clearance',
        'TESDA': '/api/extract-tesda'
    }

    endpoint = endpoint_mapping.get(document_type)
    if not endpoint:
        return jsonify({'error': 'Unsupported document type'}), 400

    # Call the Hugging Face Space API
    try:
        response = requests.post(
            f"{HUGGINGFACE_OCR_API}{endpoint}",
            json={'document_url': image_url},
            timeout=300
        )
        return jsonify(response.json())
    except Exception as e:
        return jsonify({'error': str(e)}), 500
```
## Monitoring and Maintenance

### Check Space Status

1. Go to your Space URL
2. Click **Settings** → **Usage**
3. Monitor:
- Request count
- Error rate
- Response times
- Memory usage

### View Logs

1. In your Space, click the **App** tab
2. Scroll down to see real-time logs
3. Useful for debugging errors

### Update Deployment

To update your deployment:

**Web UI Method:**
1. Click the **Files** tab
2. Click on a file to edit it
3. Make changes
4. Click **Commit changes**

**Git Method:**
```bash
cd handyhome-ocr-api
# Make changes to files
git add .
git commit -m "Update description"
git push
```

## Troubleshooting

### Build Fails

**Error: Out of memory**
- Solution: Reduce Gunicorn workers in the Dockerfile or upgrade hardware

**Error: Timeout during build**
- Solution: This is normal for the first build. Wait or restart the build.

**Error: Missing dependencies**
- Solution: Check requirements.txt and the Dockerfile

### Runtime Errors

**Error: Script not found**
- Solution: Ensure all `extract_*.py` files are uploaded

**Error: PaddleOCR model download fails**
- Solution: Models download on first use. Check internet connectivity.

**Error: 503 Service Unavailable**
- Solution: The Space is sleeping. Wake it up by accessing the URL.
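When the Space has gone to sleep, a client can also wake it programmatically by polling `/health` before sending real work. A minimal sketch (the `wait_until_awake` helper and the retry settings are illustrative, not part of the API):

```python
import time

def wait_until_awake(check, retries=5, delay=2.0):
    """Poll `check` (a zero-argument callable that returns True when the
    service is up) until it succeeds or the retries are exhausted."""
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)
    return False

# Usage with requests (hypothetical Space URL), allowing ~2.5 minutes to wake:
# import requests
# awake = wait_until_awake(
#     lambda: requests.get(
#         "https://YOUR-USERNAME-handyhome-ocr-api.hf.space/health",
#         timeout=10,
#     ).ok,
#     retries=10,
#     delay=15.0,
# )
```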
### Performance Issues

**Slow response times**
- Upgrade to a better hardware tier
- Increase Gunicorn workers (may need more RAM)
- Consider caching frequently accessed documents

**Out of memory errors**
- Reduce Gunicorn workers in the Dockerfile
- Upgrade to a higher memory tier
- Process smaller images

## Cost Considerations

### Free Tier
- CPU basic hardware
- 48-hour sleep timeout
- Suitable for testing and low-traffic use

### Paid Tiers
- **CPU upgrade**: $0.03/hour (~$22/month)
- **GPU T4**: $0.60/hour (~$432/month)
- No sleep timeout
- Better performance
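The monthly figures above are simply the hourly rate extrapolated to an always-on month (rate × 24 hours × 30 days):

```python
def monthly_cost(hourly_rate, hours_per_day=24, days=30):
    """Rough monthly cost in USD for an always-on Space."""
    return round(hourly_rate * hours_per_day * days, 2)

print(monthly_cost(0.03))  # CPU upgrade: ~$22/month
print(monthly_cost(0.60))  # GPU T4: ~$432/month
```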
### Optimization Tips
- Use CPU for cost-effective deployment
- Enable the sleep timeout for development
- Only upgrade if you need 24/7 availability or high performance

## Security Best Practices

1. **Use Private Spaces** for sensitive data
2. **Add authentication** if needed (custom middleware)
3. **Rate limiting** - Add to prevent abuse
4. **HTTPS only** - Hugging Face provides this by default
5. **Input validation** - Already implemented in scripts
6. **Secrets management** - Use HF Space secrets for API keys
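As a sketch of how points 2 and 6 can work together, an API key stored as a Space secret can be checked with a constant-time comparison; the `OCR_API_KEY` secret name and the helper below are hypothetical, not part of the deployed app:

```python
import hmac
import os

def is_authorized(auth_header, expected=None):
    """Constant-time check of an `Authorization: Bearer <key>` header.

    `expected` would normally come from a Space secret, e.g.
    os.environ["OCR_API_KEY"] (hypothetical secret name).
    """
    if expected is None:
        expected = os.environ.get("OCR_API_KEY", "")
    if not auth_header or not expected:
        return False
    return hmac.compare_digest(auth_header, f"Bearer {expected}")

# In a Flask app this could run in a before_request hook that returns 401
# whenever is_authorized(request.headers.get("Authorization")) is False.
```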
## Support Resources

- **Hugging Face Spaces Docs**: https://huggingface.co/docs/hub/spaces
- **Docker SDK Guide**: https://huggingface.co/docs/hub/spaces-sdks-docker
- **Community Forum**: https://discuss.huggingface.co/

## Next Steps

After successful deployment:

1. ✅ Update your main app to use the HF Space API
2. ✅ Test all document types thoroughly
3. ✅ Set up monitoring and alerts
4. ✅ Document the API endpoints for your team
5. ✅ Consider setting up staging and production spaces

---

Happy deploying! 🚀
Dockerfile ADDED
# Dockerfile for HandyHome OCR API
FROM python:3.9-slim

# Install system dependencies for PaddleOCR and OpenCV
RUN apt-get update && apt-get install -y \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgomp1 \
    libgl1 \
    curl \
    wget \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create non-root user for security
RUN useradd --create-home --shell /bin/bash app && chown -R app:app /app
USER app

# Download PaddleOCR models during build to speed up the first run.
# This runs as the `app` user so the models land in a home directory
# the runtime process can actually read.
RUN python -c "from paddleocr import PaddleOCR; ocr = PaddleOCR(use_angle_cls=False, lang='en', show_log=False)"

EXPOSE 7860

# Health check
HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:7860/health || exit 1

# Use Gunicorn for production
CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "2", "--timeout", "300", "--max-requests", "1000", "--max-requests-jitter", "100", "app:app"]
QUICKSTART.md ADDED
# Quick Start Guide

Get your OCR API running on Hugging Face Spaces in minutes!

## 🚀 Deploy in 3 Steps

### Step 1: Create Space (2 minutes)
1. Go to https://huggingface.co/new-space
2. Name: `handyhome-ocr-api`
3. SDK: **Docker**
4. Click "Create Space"

### Step 2: Upload Files (3 minutes)
1. Click "Files" tab → "Add file" → "Upload files"
2. Upload all files from the `huggingface-ocr` folder
3. Click "Commit changes to main"

### Step 3: Wait for Build (5-10 minutes)
- Go to the "App" tab
- Watch the build logs
- When done, you'll see: "Running on http://0.0.0.0:7860"

## ✅ Test Your API

```bash
# Check health
curl https://YOUR-USERNAME-handyhome-ocr-api.hf.space/health

# Test extraction
curl -X POST https://YOUR-USERNAME-handyhome-ocr-api.hf.space/api/extract-national-id \
  -H "Content-Type: application/json" \
  -d '{"document_url": "YOUR_IMAGE_URL"}'
```

## 📚 Available Endpoints

### Philippine IDs
- `/api/extract-national-id` - National ID
- `/api/extract-drivers-license` - Driver's License
- `/api/extract-prc` - PRC ID
- `/api/extract-umid` - UMID
- `/api/extract-sss` - SSS ID
- `/api/extract-passport` - Passport
- `/api/extract-postal` - Postal ID
- `/api/extract-phic` - PhilHealth ID

### Clearances
- `/api/extract-nbi` - NBI Clearance
- `/api/extract-police-clearance` - Police Clearance
- `/api/extract-tesda` - TESDA Certificate

### Utility
- `/api/analyze-document` - Identify document type
- `/health` - Health check
- `/` - Full API documentation

## 🔧 Integration Example

```python
import requests

# Your Hugging Face Space URL
API_BASE = "https://YOUR-USERNAME-handyhome-ocr-api.hf.space"

def extract_national_id(image_url):
    response = requests.post(
        f"{API_BASE}/api/extract-national-id",
        json={"document_url": image_url},
        timeout=300
    )
    return response.json()

# Use it
result = extract_national_id("https://example.com/id.jpg")
print(result)
```

## 💡 Tips

- **First request is slow**: PaddleOCR loads models on first use (~30 seconds)
- **Image quality matters**: Use clear, well-lit photos
- **Timeout**: Set the timeout to 5 minutes for the first request
- **Free tier**: The Space sleeps after 48 hours of inactivity

## 🐛 Troubleshooting

**Build fails?**
- Wait and retry - the first build can time out
- Check the build logs for specific errors

**503 Error?**
- The Space is sleeping - just visit the URL to wake it

**Slow responses?**
- Normal for the first request (loading models)
- Subsequent requests are faster (2-5 seconds)

## 📖 Full Documentation

- See `README.md` for complete API documentation
- See `DEPLOYMENT_GUIDE.md` for detailed deployment steps

## 🎯 Next Steps

1. Deploy your OCR Space
2. Update your main app to use it
3. Test with real documents
4. Monitor usage in Space settings

---

Need help? Check `DEPLOYMENT_GUIDE.md` for detailed instructions!
README.md CHANGED
---
title: HandyHome OCR API
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---

# HandyHome OCR Extraction API

Philippine ID and Document OCR Extraction Service using PaddleOCR

## 🎯 Features

### Supported Documents

#### Philippine Government IDs
- **National ID** - 19-digit ID number, full name, birth date
- **Driver's License** - License number, full name, address, birth date
- **UMID** - CRN, full name, birth date
- **SSS ID** - SSS number, full name, birth date
- **PRC ID** - PRC number, profession, full name, validity
- **Postal ID** - PRN, full name, address, birth date
- **PhilHealth ID** - ID number, full name, birth date, sex, address

#### Clearances & Certificates
- **NBI Clearance** - ID number, full name, birth date
- **Police Clearance** - ID number, full name, address, birth date, status
- **TESDA Certificate** - Registry number, full name, qualification, date issued

#### Passport
- **Philippine Passport** - Passport number, surname, given names, birth date, nationality

### Additional Features
- **Document Analysis** - Automatic document type identification

## 🚀 Quick Start

### API Endpoints

All extraction endpoints accept POST requests with the following format:

```json
{
  "document_url": "https://example.com/document.jpg"
}
```

#### Philippine ID Endpoints
- `POST /api/extract-national-id` - Extract National ID
- `POST /api/extract-drivers-license` - Extract Driver's License
- `POST /api/extract-prc` - Extract PRC ID
- `POST /api/extract-umid` - Extract UMID
- `POST /api/extract-sss` - Extract SSS ID
- `POST /api/extract-passport` - Extract Passport
- `POST /api/extract-postal` - Extract Postal ID
- `POST /api/extract-phic` - Extract PhilHealth ID

#### Clearance Endpoints
- `POST /api/extract-nbi` - Extract NBI Clearance
- `POST /api/extract-police-clearance` - Extract Police Clearance
- `POST /api/extract-tesda` - Extract TESDA Certificate

#### Analysis Endpoint
- `POST /api/analyze-document` - Identify document type

#### Utility Endpoints
- `GET /health` - Health check
- `GET /` - API documentation
- `GET /api/routes` - List all routes

## 📝 Usage Examples

### Python Example

```python
import requests

# Extract National ID
response = requests.post(
    'https://YOUR-SPACE.hf.space/api/extract-national-id',
    json={'document_url': 'https://example.com/national_id.jpg'}
)

result = response.json()
print(result)

# Expected output:
# {
#   "success": true,
#   "id_number": "1234-5678-9012-3456",
#   "full_name": "Juan Dela Cruz",
#   "birth_date": "1990-01-15"
# }
```

### cURL Example

```bash
curl -X POST https://YOUR-SPACE.hf.space/api/extract-national-id \
  -H "Content-Type: application/json" \
  -d '{"document_url": "https://example.com/national_id.jpg"}'
```

### JavaScript Example

```javascript
const response = await fetch('https://YOUR-SPACE.hf.space/api/extract-national-id', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    document_url: 'https://example.com/national_id.jpg'
  })
});

const result = await response.json();
console.log(result);
```

## 🛠️ Technical Details

### Technology Stack
- **OCR Engine**: PaddleOCR 2.7+
- **Framework**: Flask + Gunicorn
- **Image Processing**: OpenCV, Pillow
- **Runtime**: Python 3.9

### Performance
- Average response time: 2-5 seconds per document
- Supports images up to 10MB
- Concurrent request handling with Gunicorn workers

### Resource Requirements
- RAM: 4GB minimum
- Storage: 2GB (includes PaddleOCR models)
- CPU: 2 cores recommended

## 📦 Deployment to Hugging Face Spaces

### Step 1: Create a New Space

1. Go to [Hugging Face Spaces](https://huggingface.co/new-space)
2. Enter a space name: `handyhome-ocr-api` (or your preferred name)
3. Select **Docker** as the SDK
4. Choose visibility: Public or Private
5. Click "Create Space"

### Step 2: Upload Files

Upload all files from this directory to your Space:
- `app.py`
- `requirements.txt`
- `Dockerfile`
- `README.md`
- All `extract_*.py` scripts
- `analyze_document.py`

### Step 3: Configure Space Settings

1. In your Space settings, set:
- **SDK**: Docker
- **Port**: 7860
- **Sleep time**: 48 hours (optional)

2. The Space will automatically build and deploy

### Step 4: Wait for Build

- The initial build takes 5-10 minutes
- PaddleOCR models are downloaded during the build
- Check the build logs for any errors

### Step 5: Test Your API

Once deployed, test the health endpoint:
```bash
curl https://YOUR-USERNAME-handyhome-ocr-api.hf.space/health
```

## 🔧 Local Development

### Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Run the Flask development server
python app.py
```

### Testing

```bash
# Test with a document URL
curl -X POST http://localhost:7860/api/extract-national-id \
  -H "Content-Type: application/json" \
  -d '{"document_url": "YOUR_IMAGE_URL"}'
```

## 📊 Response Format

### Successful Response

```json
{
  "success": true,
  "id_number": "1234-5678-9012-3456",
  "full_name": "Juan Dela Cruz",
  "birth_date": "1990-01-15",
  ...additional fields...
}
```

### Error Response

```json
{
  "success": false,
  "error": "Error description",
  "stderr": "Detailed error message"
}
```
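Both shapes carry the `success` flag, so callers can branch on it once. A small client-side helper could normalize the two cases (the `parse_extraction` name is illustrative, not part of the API):

```python
def parse_extraction(payload):
    """Split an API response dict into (ok, fields_or_error_message)."""
    if payload.get("success"):
        fields = {k: v for k, v in payload.items() if k != "success"}
        return True, fields
    return False, payload.get("error", "unknown error")

ok, data = parse_extraction(
    {"success": True, "id_number": "1234-5678-9012-3456",
     "full_name": "Juan Dela Cruz", "birth_date": "1990-01-15"})
print(ok, data["full_name"])

ok, err = parse_extraction({"success": False, "error": "Error description"})
print(ok, err)
```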
226
## ⚠️ Limitations

- Requires clear, readable document images
- Works best with well-lit, high-resolution scans
- OCR accuracy depends on image quality
- Some fields may be null if not detected
- Processing time varies based on image size

## 🔐 Security Considerations

- Images are processed in memory and not stored permanently
- All processing happens server-side
- Sensitive data should be transmitted over HTTPS
- Consider rate limiting for production use

## 📄 License

MIT License - See LICENSE file for details

## 🤝 Contributing

Contributions welcome! Please submit issues and pull requests.

## 📞 Support

For issues and questions:
- Open an issue on GitHub
- Contact: [Your contact information]

---

Built with ❤️ using PaddleOCR and Flask
analyze_document.py ADDED
1
+ #!/usr/bin/env python3
2
+ """
3
+ Document Authenticity Analysis Script
4
+
5
+ Purpose:
6
+ Performs forensic analysis on document images to detect tampering and verify authenticity.
7
+ Uses Error Level Analysis (ELA) and metadata examination techniques.
8
+
9
+ Why this script exists:
10
+ - Need to verify document authenticity in verification workflows
11
+ - Detect potential image tampering or editing
12
+ - Provide forensic evidence for document verification
13
+ - Support fraud prevention in document processing systems
14
+
15
+ Key Features:
16
+ - Error Level Analysis (ELA) for tampering detection
17
+ - EXIF metadata analysis for editing software detection
18
+ - Brightness ratio analysis for tampering assessment
19
+ - Support for single document analysis
20
+
21
+ Dependencies:
22
+ - Pillow (PIL): Image processing (https://pillow.readthedocs.io/)
23
+ - exifread: EXIF metadata extraction (https://pypi.org/project/ExifRead/)
24
+ - numpy: Numerical operations (https://numpy.org/doc/)
25
+ - requests: HTTP library (https://docs.python-requests.org/)
26
+
27
+ Usage:
28
+ python analyze_document.py "https://example.com/document.jpg"
29
+
30
+ Output:
31
+ JSON with tampering and metadata analysis results
32
+ """
33
+
34
+ import sys, json, requests, exifread,numpy as np
35
+ from PIL import Image, ImageChops, ImageEnhance
36
+ from io import BytesIO
37
+ import uuid, os
38
+
39
+ # List of photo editing software to detect in metadata
40
+ # This helps identify if an image has been edited
41
+ software_list = ['photoshop', 'lightroom', 'gimp', 'paint', 'paint.net', 'paintshop pro', 'paintshop pro x', 'paintshop pro x2', 'paintshop pro x3', 'paintshop pro x4', 'paintshop pro x5', 'paintshop pro x6', 'paintshop pro x7', 'paintshop pro x8', 'paintshop pro x9', 'paintshop pro x10']
42
+
43
+ def download_image(url, output_path='temp_image.jpg'):
44
+ """
45
+ Download image from URL and process it for analysis.
46
+
47
+ Args:
48
+ url (str): Image URL
49
+ output_path (str): Local save path
50
+
51
+ Returns:
52
+ str: Path to processed image or None if failed
53
+
54
+ Why this approach:
55
+ - Handles JSON-wrapped URLs (common in web applications)
56
+ - Converts RGBA to RGB for JPEG compatibility
57
+ - Provides detailed error handling and logging
58
+ - Uses high quality to preserve analysis accuracy
59
+ """
60
+ print(f"DOCUMENT Starting download for URL: {url}", file=sys.stderr)
61
+
62
+ # Handle URL that might be a JSON string
63
+ if isinstance(url, str) and (url.startswith('{') or url.startswith('[')):
64
+ try:
65
+ url_data = json.loads(url)
66
+ print(f"DOCUMENT Parsed JSON URL data: {url_data}", file=sys.stderr)
67
+ if isinstance(url_data, dict) and 'url' in url_data:
68
+ url = url_data['url']
69
+ print(f"DOCUMENT Extracted URL from dict: {url}", file=sys.stderr)
70
+ elif isinstance(url_data, list) and len(url_data) > 0:
71
+ url = url_data[0] if isinstance(url_data[0], str) else url_data[0].get('url', '')
72
+ print(f"DOCUMENT Extracted URL from list: {url}", file=sys.stderr)
73
+ except json.JSONDecodeError as e:
74
+ print(f"DOCUMENT Error parsing URL JSON: {url}, Error: {str(e)}", file=sys.stderr)
75
+ return None
76
+
77
+ if not url or url == '':
78
+ print(f"DOCUMENT Empty URL after processing", file=sys.stderr)
79
+ return None
80
+
81
+ try:
82
+ response = requests.get(url)
83
+ response.raise_for_status()
84
+ image_data = response.content
85
+
86
+ print(f"DOCUMENT Downloaded image from {url}", file=sys.stderr)
87
+ print(f"DOCUMENT Image data size: {len(image_data)} bytes", file=sys.stderr)
88
+
89
+ # Process the image
90
+ image = Image.open(BytesIO(image_data))
91
+ print(f"DOCUMENT Original image format: {image.format}", file=sys.stderr)
92
+ print(f"DOCUMENT Original image mode: {image.mode}", file=sys.stderr)
93
+
94
+ # Convert to RGB if necessary (JPEG doesn't support alpha channel)
95
+ if image.mode == 'RGBA':
96
+ background = Image.new('RGB', image.size, (255, 255, 255))
97
+ background.paste(image, mask=image.split()[-1])
98
+ image = background
99
+ print(f"DOCUMENT Converted RGBA to RGB", file=sys.stderr)
100
+ elif image.mode != 'RGB':
101
+ image = image.convert('RGB')
102
+ print(f"DOCUMENT Converted {image.mode} to RGB", file=sys.stderr)
103
+
104
+ # Save as JPG without trying to preserve EXIF
105
+ image.save(output_path, 'JPEG', quality=95)
106
+ print(f"DOCUMENT Saved image to {output_path}", file=sys.stderr)
107
+
108
+ return output_path
109
+
110
+ except Exception as e:
111
+ print(f"DOCUMENT Error processing image: {str(e)}", file=sys.stderr)
112
+ return None
113
+
114
+ # Tampering Function
115
+ def perform_error_level_analysis(image_path, output_path="ELA.png", quality=90):
116
+ """
117
+ Perform Error Level Analysis (ELA) on image.
118
+
119
+ Args:
120
+ image_path (str): Path to original image
121
+ output_path (str): Path to save ELA result
122
+ quality (int): JPEG quality for recompression
123
+
124
+ Returns:
125
+ str: Path to ELA result image
126
+
127
+ Why ELA:
128
+ - Detects areas of image that have been edited or tampered with
129
+ - Works by comparing original image with recompressed version
130
+ - Edited areas show different compression artifacts
131
+ - Standard forensic technique for image authenticity
132
+ """
133
+ print(f"DOCUMENT TAMPERING Starting Error Level Analysis...", file=sys.stderr)
134
+
135
+ try:
136
+ original_image = Image.open(image_path)
137
+
138
+ # Convert RGBA to RGB if necessary (JPEG doesn't support alpha channel)
139
+ if original_image.mode == 'RGBA':
140
+ # Create a white background
141
+ background = Image.new('RGB', original_image.size, (255, 255, 255))
142
+ # Paste the image onto the background
143
+ background.paste(original_image, mask=original_image.split()[-1]) # Use alpha channel as mask
144
+ original_image = background
145
+ elif original_image.mode != 'RGB':
146
+ original_image = original_image.convert('RGB')
147
+
148
+ # Use a unique temp file name to avoid conflicts
149
+ temp_file = f"temp_ela_{str(uuid.uuid4())[:8]}.jpg"
150
+ original_image.save(temp_file, "JPEG", quality=quality)
151
+ resaved_image = Image.open(temp_file)
152
+
153
+ difference = ImageChops.difference(original_image, resaved_image)
154
+ difference = ImageEnhance.Brightness(difference).enhance(10)
155
+
156
+ # Save the difference image
157
+ difference.save(output_path, format='PNG')
158
+
159
+ # Clean up temp file
160
+ try:
161
+ os.remove(temp_file)
162
+ except:
163
+ pass
164
+
165
+ print(f"DOCUMENT TAMPERING ELA analysis completed, saved to {output_path}", file=sys.stderr)
166
+ return output_path
167
+
168
+ except Exception as e:
169
+ print(f"DOCUMENT TAMPERING Error during ELA: {str(e)}", file=sys.stderr)
170
+ return None
171
+
172
+ def detect_tampering(ela_path, threshold=100):
+     """
+     Analyze ELA results to detect tampering.
+ 
+     Args:
+         ela_path (str): Path to the ELA result image
+         threshold (int): Brightness threshold for tampering detection
+ 
+     Returns:
+         dict: Tampering analysis results
+ 
+     Why this approach:
+     - Analyzes the brightness distribution in the ELA image
+     - High brightness indicates areas of potential tampering
+     - Uses statistical thresholds for the tampering assessment
+     - Provides confidence levels for tampering detection
+     """
+     print("DOCUMENT TAMPERING Analyzing ELA results...", file=sys.stderr)
+ 
+     ela_image = Image.open(ela_path)
+     ela_array = np.array(ela_image)
+ 
+     bright_pixels = np.sum(ela_array > threshold)
+     total_pixels = ela_array.size
+ 
+     brightness_ratio = bright_pixels / total_pixels
+ 
+     print(f"DOCUMENT TAMPERING Brightness ratio: {brightness_ratio:.4f}", file=sys.stderr)
+ 
+     if brightness_ratio < 0.02:
+         return {
+             "tampered": "False",
+             "brightness_ratio": round(float(brightness_ratio), 4)
+         }
+     elif brightness_ratio <= 0.05:
+         return {
+             "tampered": "Investigation Required",
+             "brightness_ratio": round(float(brightness_ratio), 4)
+         }
+     else:
+         return {
+             "tampered": "True",
+             "brightness_ratio": round(float(brightness_ratio), 4)
+         }
+ 
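The three decision bands can be exercised in isolation. A standalone copy of the threshold logic (the helper name `classify_ratio` is hypothetical):

```python
def classify_ratio(brightness_ratio: float) -> str:
    """Map an ELA brightness ratio to the script's three tampering verdicts."""
    if brightness_ratio < 0.02:
        return "False"                    # clean: few bright ELA pixels
    elif brightness_ratio <= 0.05:
        return "Investigation Required"   # borderline band
    else:
        return "True"                     # heavy recompression differences

print(classify_ratio(0.01))   # -> False
print(classify_ratio(0.03))   # -> Investigation Required
print(classify_ratio(0.10))   # -> True
```

Note that the middle band is closed on both ends here, so a ratio of exactly 0.02 or 0.05 still yields a verdict rather than falling through.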
217
+ # Metadata Analysis Function (TO BE TESTED IN MOBILE)
+ def analyze_metadata(image_path):
+     """
+     Analyze EXIF metadata for editing software and authenticity indicators.
+ 
+     Args:
+         image_path (str): Path to the image file
+ 
+     Returns:
+         dict: Metadata analysis results
+ 
+     Why this approach:
+     - EXIF metadata contains information about image creation and editing
+     - Detects photo-editing software used
+     - Identifies camera information and timestamps
+     - Provides forensic evidence for image authenticity
+     """
+     print("DOCUMENT METADATA Starting metadata analysis...", file=sys.stderr)
+ 
+     try:
+         with open(image_path, 'rb') as f:
+             tags = exifread.process_file(f)
+             metadata = {tag: str(tags.get(tag)) for tag in tags.keys()}
+ 
+         # Debug: report what we found
+         print(f"DOCUMENT METADATA Found {len(metadata)} metadata tags", file=sys.stderr)
+         if metadata:
+             print(f"DOCUMENT METADATA Metadata keys: {list(metadata.keys())}", file=sys.stderr)
+ 
+     except Exception as e:
+         print(f"DOCUMENT METADATA Error reading metadata: {str(e)}", file=sys.stderr)
+         return {
+             "result": "error",
+             "message": f"Error reading metadata: {str(e)}",
+             "metadata": {}
+         }
+ 
+     if not metadata:
+         print(f"DOCUMENT METADATA No metadata found in {image_path}", file=sys.stderr)
+         return {
+             "result": "no metadata",
+             "message": "No metadata found in image.",
+             "metadata": metadata
+         }
+ 
+     # Check for timestamp metadata
+     has_timestamp = any(key in metadata for key in ['EXIF DateTimeOriginal', 'Image DateTime', 'EXIF DateTime'])
+ 
+     # Check for editing software
+     software_used = metadata.get('Image Software', '').lower()
+ 
+     # Check for camera information
+     has_camera_info = any(key in metadata for key in ['Image Make', 'Image Model', 'EXIF Make', 'EXIF Model'])
+ 
+     # Provide a more detailed analysis
+     analysis_summary = []
+     if has_timestamp:
+         analysis_summary.append("Timestamp found")
+     if has_camera_info:
+         analysis_summary.append("Camera information found")
+     if software_used:
+         analysis_summary.append(f"Software detected: {software_used}")
+ 
+     if not has_timestamp and not has_camera_info and not software_used:
+         return {
+             "result": "minimal metadata",
+             "message": "Image contains minimal metadata (no timestamp, camera info, or editing software detected)",
+             "metadata": metadata,
+             "analysis": "Image appears to be legitimate with no signs of editing"
+         }
+ 
+     if any(term in software_used for term in software_list):
+         return {
+             "result": "edited",
+             "message": f"Image appears to have been edited with software: {software_used}",
+             "metadata": metadata,
+             "analysis": "Suspicious editing software detected"
+         }
+ 
+     return {
+         "result": "success",
+         "message": f"Successfully analyzed metadata: {', '.join(analysis_summary)}",
+         "metadata": metadata,
+         "analysis": "Image appears to be legitimate with no signs of editing"
+     }
+ 
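The script reads tags through `exifread`; for a quick dependency-light spot check, Pillow's `getexif` gives the same basic signal. A freshly encoded JPEG carries no EXIF block at all, which is the "no metadata" case above (the helper name `has_exif` is illustrative):

```python
from io import BytesIO
from PIL import Image

def has_exif(path_or_file) -> bool:
    """True if the image carries any EXIF tags at all."""
    with Image.open(path_or_file) as img:
        return len(img.getexif()) > 0

# A JPEG created from scratch has no EXIF data.
buf = BytesIO()
Image.new('RGB', (8, 8), (255, 255, 255)).save(buf, 'JPEG')
buf.seek(0)
print(has_exif(buf))  # -> False
```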
303
+ # Main execution
+ if __name__ == "__main__":
+     # Validate command-line arguments
+     if len(sys.argv) < 2:
+         print(json.dumps({"error": "No image URL provided"}))
+         sys.exit(1)
+ 
+     image_url = sys.argv[1]
+ 
+     try:
+         # Download and process the image
+         image_path = download_image(image_url)
+         if not image_path:
+             raise Exception("Failed to download image")
+ 
+         # Perform metadata analysis first (before any file gets deleted)
+         metadata_results = analyze_metadata(image_path)
+ 
+         # Perform ELA and tampering detection
+         ela_path = perform_error_level_analysis(image_path)
+         if not ela_path:
+             raise Exception("Failed to perform error level analysis")
+ 
+         tampering_results = detect_tampering(ela_path)
+ 
+         # Clean up files after all analysis is done
+         try:
+             if ela_path:
+                 os.remove(ela_path)
+             if image_path:
+                 os.remove(image_path)
+         except OSError:
+             pass
+ 
+         # Return combined results
+         print(json.dumps({
+             "success": True,
+             "tampering_results": tampering_results,
+             "metadata_results": metadata_results
+         }))
+ 
+     except Exception as e:
+         print(json.dumps({
+             "success": False,
+             "error": str(e),
+             "context": {
+                 "url": image_url
+             }
+         }))
+         sys.exit(1)
app.py ADDED
@@ -0,0 +1,403 @@
+ from flask import Flask, request, jsonify
+ import sys
+ import os
+ import subprocess
+ import json
+ 
+ # Suppress PaddleOCR verbose logging
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+ 
+ def run_extraction_script(script_name, document_url):
+     """Generic helper that runs an OCR extraction script in a subprocess."""
+     try:
+         cmd = [sys.executable, script_name, document_url]
+         result = subprocess.run(
+             cmd,
+             capture_output=True,
+             text=True,
+             timeout=300,
+             cwd=os.getcwd()
+         )
+ 
+         if result.returncode != 0:
+             return {
+                 'success': False,
+                 'error': f'Script failed with return code {result.returncode}',
+                 'stderr': result.stderr,
+                 'stdout': result.stdout
+             }
+ 
+         # Parse JSON output
+         try:
+             output_str = result.stdout.strip()
+ 
+             # Try a direct parse first
+             try:
+                 return json.loads(output_str)
+             except json.JSONDecodeError:
+                 # Fall back to the last line that looks like a JSON object
+                 lines = output_str.split('\n')
+                 json_lines = [line.strip() for line in lines if line.strip().startswith('{')]
+ 
+                 if json_lines:
+                     return json.loads(json_lines[-1])
+ 
+                 # Last resort: extract the trailing {...} span from the output
+                 start_idx = output_str.rfind('{')
+                 end_idx = output_str.rfind('}')
+                 if start_idx != -1 and end_idx != -1 and end_idx >= start_idx:
+                     return json.loads(output_str[start_idx:end_idx+1])
+ 
+                 raise ValueError("No valid JSON found in output")
+ 
+         except Exception as e:
+             return {
+                 'success': False,
+                 'error': 'Invalid JSON output from script',
+                 'raw_output': result.stdout[:500],  # limit output size
+                 'json_error': str(e)
+             }
+ 
+     except subprocess.TimeoutExpired:
+         return {
+             'success': False,
+             'error': 'Script execution timed out after 5 minutes'
+         }
+     except Exception as e:
+         return {
+             'success': False,
+             'error': f'Unexpected error: {str(e)}'
+         }
+ 
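The layered parsing above exists because the OCR scripts may emit stray log lines on stdout around the final JSON. The first two layers of that strategy, condensed into a standalone helper (the name `parse_tail_json` and the sample license number are illustrative):

```python
import json

def parse_tail_json(output: str):
    """Parse stdout that may mix log noise with a final JSON object."""
    output = output.strip()
    try:
        return json.loads(output)           # clean case: pure JSON
    except json.JSONDecodeError:
        pass
    candidates = [ln.strip() for ln in output.splitlines() if ln.strip().startswith('{')]
    if candidates:
        return json.loads(candidates[-1])   # last JSON-looking line wins
    raise ValueError("No valid JSON found in output")

noisy = 'DEBUG: loading model\n{"success": true, "id_number": "N01-23-456789"}'
print(parse_tail_json(noisy)['id_number'])  # -> N01-23-456789
```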
74
+ # Create Flask app
+ app = Flask(__name__)
+ 
+ # Configure Flask for production
+ app.config['JSON_SORT_KEYS'] = False
+ app.config['JSONIFY_PRETTYPRINT_REGULAR'] = False
+ 
+ # ============================================================================
+ # PHILIPPINE ID OCR EXTRACTION ENDPOINTS
+ # ============================================================================
+ 
+ @app.route('/api/extract-national-id', methods=['POST'])
+ def api_extract_national_id():
+     """Extract Philippine National ID details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_national_id.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-drivers-license', methods=['POST'])
+ def api_extract_drivers_license():
+     """Extract Philippine Driver's License details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_drivers_license.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-prc', methods=['POST'])
+ def api_extract_prc():
+     """Extract PRC ID details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_prc.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-umid', methods=['POST'])
+ def api_extract_umid():
+     """Extract UMID details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_umid.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-sss', methods=['POST'])
+ def api_extract_sss():
+     """Extract SSS ID details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_sss.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-passport', methods=['POST'])
+ def api_extract_passport():
+     """Extract Philippine Passport details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_passport.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-postal', methods=['POST'])
+ def api_extract_postal():
+     """Extract Postal ID details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_postal.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-phic', methods=['POST'])
+ def api_extract_phic():
+     """Extract PhilHealth ID details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_phic.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
213
+ # ============================================================================
+ # CLEARANCE & CERTIFICATE OCR EXTRACTION ENDPOINTS
+ # ============================================================================
+ 
+ @app.route('/api/extract-nbi', methods=['POST'])
+ def api_extract_nbi():
+     """Extract NBI Clearance details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_nbi_ocr.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-police-clearance', methods=['POST'])
+ def api_extract_police_clearance():
+     """Extract Police Clearance details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_police_ocr.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ @app.route('/api/extract-tesda', methods=['POST'])
+ def api_extract_tesda():
+     """Extract TESDA Certificate details"""
+     try:
+         data = request.get_json(silent=True) or {}
+         document_url = data.get('document_url')
+ 
+         if not document_url:
+             return jsonify({'error': 'Missing document_url'}), 400
+ 
+         result = run_extraction_script('extract_tesda_ocr.py', document_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ # ============================================================================
+ # DOCUMENT ANALYSIS ENDPOINT
+ # ============================================================================
+ 
+ @app.route('/api/analyze-document', methods=['POST'])
+ def api_analyze_document():
+     """Analyze and identify document type"""
+     try:
+         data = request.get_json(silent=True) or {}
+         image_url = data.get('image_url')
+ 
+         if not image_url:
+             return jsonify({'error': 'Missing image_url'}), 400
+ 
+         result = run_extraction_script('analyze_document.py', image_url)
+         return jsonify(result)
+ 
+     except Exception as e:
+         return jsonify({'success': False, 'error': str(e)}), 500
+ 
+ # ============================================================================
+ # UTILITY ENDPOINTS
+ # ============================================================================
+ 
+ @app.route('/health', methods=['GET'])
+ def health_check():
+     """Health check endpoint"""
+     return jsonify({
+         'status': 'healthy',
+         'service': 'handyhome-ocr-api',
+         'version': '1.0.0'
+     })
+ 
+ @app.route('/', methods=['GET'])
+ def index():
+     """API documentation endpoint"""
+     return jsonify({
+         'service': 'HandyHome OCR Extraction API',
+         'version': '1.0.0',
+         'description': 'Philippine ID and Document OCR Extraction using PaddleOCR',
+         'endpoints': {
+             'Philippine IDs': {
+                 'POST /api/extract-national-id': {
+                     'description': 'Extract Philippine National ID details',
+                     'fields': ['id_number', 'full_name', 'birth_date']
+                 },
+                 'POST /api/extract-drivers-license': {
+                     'description': 'Extract Driver\'s License details',
+                     'fields': ['license_number', 'full_name', 'birth_date', 'address']
+                 },
+                 'POST /api/extract-prc': {
+                     'description': 'Extract PRC ID details',
+                     'fields': ['prc_number', 'full_name', 'profession', 'valid_until']
+                 },
+                 'POST /api/extract-umid': {
+                     'description': 'Extract UMID details',
+                     'fields': ['crn', 'full_name', 'birth_date']
+                 },
+                 'POST /api/extract-sss': {
+                     'description': 'Extract SSS ID details',
+                     'fields': ['sss_number', 'full_name', 'birth_date']
+                 },
+                 'POST /api/extract-passport': {
+                     'description': 'Extract Philippine Passport details',
+                     'fields': ['passport_number', 'surname', 'given_names', 'birth_date']
+                 },
+                 'POST /api/extract-postal': {
+                     'description': 'Extract Postal ID details',
+                     'fields': ['prn', 'full_name', 'address', 'birth_date']
+                 },
+                 'POST /api/extract-phic': {
+                     'description': 'Extract PhilHealth ID details',
+                     'fields': ['id_number', 'full_name', 'birth_date', 'sex', 'address']
+                 }
+             },
+             'Clearances & Certificates': {
+                 'POST /api/extract-nbi': {
+                     'description': 'Extract NBI Clearance details',
+                     'fields': ['id_number', 'full_name', 'birth_date']
+                 },
+                 'POST /api/extract-police-clearance': {
+                     'description': 'Extract Police Clearance details',
+                     'fields': ['id_number', 'full_name', 'address', 'birth_date', 'status']
+                 },
+                 'POST /api/extract-tesda': {
+                     'description': 'Extract TESDA Certificate details',
+                     'fields': ['registry_number', 'full_name', 'qualification', 'date_issued']
+                 }
+             },
+             'Document Analysis': {
+                 'POST /api/analyze-document': {
+                     'description': 'Analyze and identify document type from image',
+                     'fields': ['document_type', 'confidence']
+                 }
+             },
+             'Utility': {
+                 'GET /health': 'Health check endpoint',
+                 'GET /': 'This API documentation'
+             }
+         },
+         'request_format': {
+             'body': {
+                 'document_url': 'string (required) - URL of the document image to process'
+             },
+             'example': {
+                 'document_url': 'https://example.com/national_id.jpg'
+             }
+         },
+         'response_format': {
+             'success': 'boolean - Whether extraction was successful',
+             'extracted_fields': 'object - Extracted data fields',
+             'error': 'string - Error message if failed'
+         }
+     })
+ 
+ @app.route('/api/routes', methods=['GET'])
+ def list_routes():
+     """List all available API routes"""
+     routes = []
+     for rule in app.url_map.iter_rules():
+         if rule.endpoint != 'static':
+             methods = sorted(rule.methods - {'HEAD', 'OPTIONS'})
+             routes.append({
+                 'endpoint': rule.endpoint,
+                 'methods': methods,
+                 'path': rule.rule
+             })
+ 
+     routes.sort(key=lambda x: x['path'])
+     return jsonify({
+         'total_routes': len(routes),
+         'routes': routes
+     })
+ 
+ # Launch Flask app
+ if __name__ == '__main__':
+     port = int(os.environ.get('PORT', 7860))
+     app.run(host='0.0.0.0', port=port, debug=False)
+ 
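With the routes wired as above, any HTTP client can drive the API: POST a JSON body with `document_url`, get back the `success`/`extracted_fields`/`error` shape documented at `/`. A sketch of that contract against a stub Flask app (the stub and its single extraction route are hypothetical stand-ins, not the real `app.py`, which shells out to the OCR scripts):

```python
from flask import Flask, jsonify, request

stub = Flask(__name__)

@stub.route('/health')
def health():
    # mirrors the real /health payload
    return jsonify({'status': 'healthy', 'service': 'handyhome-ocr-api', 'version': '1.0.0'})

@stub.route('/api/extract-national-id', methods=['POST'])
def extract_stub():
    data = request.get_json(silent=True) or {}
    if not data.get('document_url'):
        return jsonify({'error': 'Missing document_url'}), 400
    return jsonify({'success': True, 'extracted_fields': {}})

client = stub.test_client()
print(client.get('/health').get_json()['status'])                      # health check
print(client.post('/api/extract-national-id', json={}).status_code)   # missing URL -> 400
```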
extract_drivers_license.py ADDED
@@ -0,0 +1,353 @@
+ import sys, json, os, glob, re, requests
+ from PIL import Image
+ from io import BytesIO
+ from datetime import datetime
+ from contextlib import redirect_stdout, redirect_stderr
+ 
+ # Immediately redirect all output to stderr except for our final JSON
+ original_stdout = sys.stdout
+ sys.stdout = sys.stderr
+ 
+ # Suppress all PaddleOCR output
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+ 
+ # Import PaddleOCR after setting environment variables
+ from paddleocr import PaddleOCR
+ 
+ def dprint(msg, obj=None):
+     try:
+         print(f"DEBUG: {msg}" + (f": {obj}" if obj is not None else ""), file=sys.stderr)
+     except Exception:
+         pass
+ 
+ def clean_cache():
+     cache_files = ['temp_image.jpg', 'temp_image_ocr_res_img.jpg', 'temp_image_preprocessed_img.jpg', 'temp_image_res.json']
+     for f in cache_files:
+         if os.path.exists(f):
+             os.remove(f)
+             dprint("Removed cache file", f)
+     if os.path.exists("output"):
+         import shutil
+         shutil.rmtree("output")
+         dprint("Removed output directory")
+ 
+ def download_image(url, output_path='temp_image.jpg'):
+     dprint("Starting download", url)
+     clean_cache()
+     r = requests.get(url, timeout=30)
+     dprint("HTTP status", r.status_code)
+     r.raise_for_status()
+     img = Image.open(BytesIO(r.content))
+     if img.mode == 'RGBA':
+         bg = Image.new('RGB', img.size, (255, 255, 255))
+         bg.paste(img, mask=img.split()[-1])
+         img = bg
+     elif img.mode != 'RGB':
+         img = img.convert('RGB')
+     img.save(output_path, 'JPEG', quality=95)
+     dprint("Saved image", output_path)
+     return output_path
+ 
+ def format_date(s):
+     if not s:
+         return None
+     raw = s
+     s = s.strip().replace(' ', '').replace('\\', '/').replace('.', '/')
+     if re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', s):
+         return s.replace('/', '-')
+     m = re.match(r'([A-Za-z]+)(\d{1,2}),?(\d{4})', s)
+     if m:
+         mo, d, y = m.groups()
+         try:
+             mnum = datetime.strptime(mo[:3], '%b').month
+             return f"{y}-{int(mnum):02d}-{int(d):02d}"
+         except Exception:
+             pass
+     for fmt in ("%B%d,%Y", "%b%d,%Y", "%B %d, %Y", "%b %d, %Y"):
+         try:
+             return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
+         except Exception:
+             pass
+     return raw
+ 
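`format_date` accepts both ISO-like strings and spelled-out month names. Its month-name branch, condensed into a standalone helper for illustration (the name `to_iso` is hypothetical):

```python
import re
from datetime import datetime

def to_iso(text: str) -> str:
    """Normalize 'January 5, 1990'-style dates to YYYY-MM-DD; pass through otherwise."""
    compact = text.strip().replace(' ', '')
    m = re.match(r'([A-Za-z]+)(\d{1,2}),?(\d{4})$', compact)
    if m:
        month, day, year = m.groups()
        month_num = datetime.strptime(month[:3], '%b').month  # 'Jan' -> 1
        return f"{year}-{month_num:02d}-{int(day):02d}"
    return text

print(to_iso("January 5, 1990"))  # -> 1990-01-05
print(to_iso("2001-12-31"))       # -> 2001-12-31
```

Stripping spaces first is what lets OCR output like "January5,1990" and a cleanly printed "January 5, 1990" take the same path.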
74
+ def cap_words(name):
+     return None if not name else ' '.join(w.capitalize() for w in name.split())
+ 
+ def normalize_full_name(s):
+     if not s:
+         return None
+     raw = ' '.join(s.split())
+ 
+     # Format: "LAST, FIRST [SECOND ...] [MIDDLE ...]"
+     if ',' in raw:
+         parts = [p.strip() for p in raw.split(',')]
+         last = parts[0] if parts else ''
+         given_block = ','.join(parts[1:]).strip() if len(parts) > 1 else ''
+         tokens = [t for t in given_block.split(' ') if t]
+         # Keep up to two given names, drop the rest (assumed middle)
+         given_kept = tokens[:2] if tokens else []
+         return cap_words(' '.join(given_kept + [last]).strip())
+ 
+     # Fallback: "FIRST [SECOND ...] LAST"
+     tokens = [t for t in raw.split(' ') if t]
+     if len(tokens) >= 2:
+         last = tokens[-1]
+         given_kept = tokens[:-1][:2]  # keep up to two given names, excluding the surname
+         return cap_words(' '.join(given_kept + [last]).strip())
+     return cap_words(raw)
+ 
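The comma branch can be checked with the sample name the script uses elsewhere in its comments. A standalone copy of the normalization (the helper name `normalize` is illustrative):

```python
def normalize(raw: str) -> str:
    """'LAST, GIVEN ... MIDDLE' -> 'Given [Second] Last', keeping at most two given names."""
    raw = ' '.join(raw.split())
    if ',' in raw:
        last, _, given_block = raw.partition(',')
        kept = given_block.split()[:2]            # at most two given names
        parts = kept + [last.strip()]
    else:
        tokens = raw.split()
        parts = tokens[:-1][:2] + tokens[-1:]     # given names first, surname last
    return ' '.join(w.capitalize() for w in parts)

print(normalize("FURAGGANAN,REIN ANDRE DUCUSIN"))  # -> Rein Andre Furagganan
print(normalize("JUAN CRUZ"))                      # -> Juan Cruz
```

Dropping trailing tokens after the second given name is how the script discards the middle/maternal name that LTO licenses print after the given names.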
100
+ def take_within(lines, start_idx, max_ahead=5):
101
+ out = []
102
+ for j in range(1, max_ahead+1):
103
+ if start_idx + j < len(lines):
104
+ t = str(lines[start_idx + j]).strip()
105
+ if t:
106
+ out.append(t)
107
+ return out
108
+
109
+ def find_match(ahead, predicate):
110
+ for t in ahead:
111
+ if predicate(t):
112
+ return t
113
+ return None
114
+
115
+ def is_date_text(t):
116
+ t1 = t.replace(' ', '').replace('\\','/').replace('.','/')
117
+ return bool(re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', t1)) or bool(re.match(r'^[A-Za-z]+\s*\d{1,2},?\s*\d{4}$', t))
118
+
119
+ def is_weight(t):
120
+ return bool(re.match(r'^\d{2,3}(\.\d+)?$', t))
121
+
122
+ def is_height(t):
123
+ return bool(re.match(r'^\d(\.\d{1,2})$', t)) or bool(re.match(r'^\d\.\d+$', t))
124
+
125
+ def is_sex(t):
126
+ return t.upper() in ('M','F','MALE','FEMALE')
127
+
128
+ def join_name_lines(line1, line2):
129
+ # e.g., "FURAGGANAN,REIN" and "ANDRE DUCUSIN" -> "FURAGGANAN,REIN ANDRE DUCUSIN"
130
+ if not line1: return None
131
+ s = line1
132
+ if line2: s = f"{line1} {line2}"
133
+ return s
134
+
135
+ def extract_drivers_license_info(lines):
136
+ dprint("Lines to extract", lines)
137
+
138
+ license_number = None
139
+ full_name = None
140
+ birth_date = None
141
+ nationality = None
142
+ sex = None
143
+ weight = None
144
+ height = None
145
+ address = None
146
+ expiration_date = None
147
+ agency_code = None
148
+ blood_type = None
149
+ eyes_color = None
150
+ dl_codes = None
151
+ conditions = None
152
+
153
+ i = 0
154
+ L = [str(x or '').strip() for x in lines]
155
+ while i < len(L):
156
+ line = L[i]
157
+ low = line.lower()
158
+ dprint("Line", {"i": i, "text": line})
159
+
160
+ if not license_number and re.match(r'^[A-Z]\d{2}-\d{2}-\d{6}$', line):
161
+ license_number = line
162
+ dprint("Found license_number", license_number)
163
+
164
+ if ('last name' in low and 'first name' in low) or 'last name.first name.middle' in low:
165
+ name_l1 = L[i+1] if i+1 < len(L) else ''
166
+ name_l2 = L[i+2] if i+2 < len(L) else ''
167
+ # Skip a stray "Name" line
168
+ if name_l1.lower() == 'name':
169
+ name_l1, name_l2 = name_l2, (L[i+3] if i+3 < len(L) else '')
170
+ combined = join_name_lines(name_l1, name_l2)
171
+ full_name = normalize_full_name(combined)
172
+ dprint("Found full_name", full_name)
173
+ i += 3
174
+ continue
175
+
176
+ if re.search(r'nation.?l.?ity', low) or 'phl' == line:
177
+ ahead = take_within(L, i, 5)
178
+ cand = find_match(ahead, lambda t: t.upper() in ('PHL','FILIPINO','PHILIPPINES'))
179
+ if cand:
180
+ nationality = cand
181
+ dprint("Found nationality", nationality)
182
+
183
+ if 'sex' in low:
184
+ ahead = take_within(L, i, 5)
185
+ cand = find_match(ahead, is_sex)
186
+ if cand:
187
+ sex = 'M' if cand.upper().startswith('M') else 'F'
188
+ dprint("Found sex", sex)
189
+
190
+ if 'date of birth' in low:
191
+ ahead = take_within(L, i, 5)
192
+ cand = find_match(ahead, is_date_text)
193
+ if cand:
194
+ birth_date = format_date(cand)
195
+ dprint("Found birth_date", birth_date)
196
+
197
+ if birth_date is None and is_date_text(line):
198
+ birth_date = format_date(line)
199
+ dprint("Found standalone birth_date", birth_date)
200
+
201
+ if 'weight' in low:
202
+ ahead = take_within(L, i, 5)
203
+ cand = find_match(ahead, is_weight)
204
+ if cand:
205
+ weight = cand
206
+ dprint("Found weight", weight)
207
+
208
+ if 'height' in low:
209
+ ahead = take_within(L, i, 5)
210
+ cand = find_match(ahead, is_height)
211
+ if cand:
212
+ height = cand
213
+ dprint("Found height", height)
214
+
215
+ if 'address' in low:
216
+ parts = []
217
+ for k in range(1, 4):
218
+ if i+k < len(L):
219
+ t = L[i+k]
220
+ if t and not any(lbl in t.lower() for lbl in ['license', 'expiration', 'agency code', 'blood', 'eyes', 'dl codes', 'conditions']):
221
+ parts.append(t)
222
+ if parts:
223
+ address = ', '.join(parts)
224
+ dprint("Found address", address)
225
+
226
+ if 'expiration date' in low:
227
+ ahead = take_within(L, i, 6)
228
+ cand = find_match(ahead, is_date_text)
229
+ if cand:
230
+ expiration_date = format_date(cand)
231
+ dprint("Found expiration_date", expiration_date)
232
+
233
+ if 'agency code' in low:
234
+ ahead = take_within(L, i, 5)
235
+ cand = find_match(ahead, lambda t: re.match(r'^[A-Z]\d{2}$', t) is not None)
236
+ if cand:
237
+ agency_code = cand
238
+ dprint("Found agency_code", agency_code)
239
+
240
+ if 'blood' in low and 'type' in low:
241
+ ahead = take_within(L, i, 3)
242
+ cand = find_match(ahead, lambda t: re.match(r'^[ABO][+-]?$', t.upper()) is not None)
243
+ if cand:
244
+ blood_type = cand.upper()
245
+ dprint("Found blood_type", blood_type)
246
+
247
+ if 'eyes' in low and 'color' in low:
248
+ ahead = take_within(L, i, 5)
249
+ cand = find_match(ahead, lambda t: t.isalpha())
250
+ if cand:
251
+ eyes_color = cap_words(cand)
252
+ dprint("Found eyes_color", eyes_color)
253
+
254
+ if 'dl codes' in low:
255
+ ahead = take_within(L, i, 3)
256
+ cand = find_match(ahead, lambda t: True) # take first non-empty
257
+ if cand:
258
+ dl_codes = cand
259
+ dprint("Found dl_codes", dl_codes)
260
+
261
+ if 'conditions' in low:
262
+ ahead = take_within(L, i, 3)
263
+ cand = find_match(ahead, lambda t: True)
264
+ if cand:
265
+ conditions = cand
266
+ dprint("Found conditions", conditions)
267
+
268
+ i += 1
269
+
270
+ result = {
271
+ 'id_type': 'drivers_license',
272
+ 'license_number': license_number,
273
+ 'id_number': license_number, # for frontend compatibility
274
+ 'full_name': full_name,
275
+ 'birth_date': birth_date,
276
+ 'nationality': nationality,
277
+ 'sex': sex,
278
+ 'weight': weight,
279
+ 'height': height,
280
+ 'address': address,
281
+ 'blood_type': blood_type,
282
+ 'eyes_color': eyes_color,
283
+ 'dl_codes': dl_codes,
284
+ 'conditions': conditions,
285
+ 'agency_code': agency_code,
286
+ 'expiration_date': expiration_date
287
+ }
288
+ dprint("Final result", result)
289
+ return result
290
+
291
+ def extract_ocr_lines(image_path):
292
+ os.makedirs("output", exist_ok=True)
293
+ dprint("Initializing PaddleOCR")
294
+
295
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
296
+ ocr = PaddleOCR(
297
+ use_doc_orientation_classify=False,
298
+ use_doc_unwarping=False,
299
+ use_textline_orientation=False,
300
+ lang='en',
301
+ show_log=False
302
+ )
303
+ dprint("OCR initialized")
304
+ dprint("Running OCR predict", image_path)
305
+ results = ocr.ocr(image_path, cls=False)
306
+ dprint("OCR predict done, results_count", len(results))
307
+
308
+ # Process OCR results directly
309
+ all_text = []
310
+ try:
311
+ lines = results[0] if results and isinstance(results[0], list) else results
312
+ for item in lines:
313
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
314
+ meta = item[1]
315
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
316
+ all_text.append(str(meta[0]))
317
+ except Exception as e:
318
+ dprint("Error processing OCR results", str(e))
319
+
320
+ dprint("All direct texts", all_text)
321
+ return extract_drivers_license_info(all_text) if all_text else {'id_type':'drivers_license','license_number':None,'id_number':None,'full_name':None,'birth_date':None}
322
+
323
+ if len(sys.argv) < 2:
324
+ sys.stdout = original_stdout
325
+ print(json.dumps({"error": "No image URL provided"}))
326
+ sys.exit(1)
327
+
328
+ image_url = sys.argv[1]
329
+ dprint("Processing image URL", image_url)
330
+ try:
331
+ image_path = download_image(image_url)
332
+ dprint("Image downloaded to", image_path)
333
+ ocr_results = extract_ocr_lines(image_path)
334
+ dprint("OCR results ready")
335
+
336
+ # Restore stdout and print only the JSON response
337
+ sys.stdout = original_stdout
338
+ sys.stdout.write(json.dumps({"success": True, "ocr_results": ocr_results}))
339
+ sys.stdout.flush()
340
+
341
+ except Exception as e:
342
+ dprint("Exception", str(e))
343
+ # Restore stdout for error JSON
344
+ sys.stdout = original_stdout
345
+ sys.stdout.write(json.dumps({"error": str(e)}))
346
+ sys.stdout.flush()
347
+ sys.exit(1)
348
+ finally:
349
+ # Clean up
350
+ try:
351
+ clean_cache()
352
+ except Exception:
353
+ pass
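All of these scripts share the same stdout-hygiene pattern: swap `sys.stdout` for `sys.stderr` so library chatter cannot corrupt the JSON payload, then restore it just before writing the final response. A minimal standalone sketch of that pattern (`run_with_clean_stdout` and `work` are illustrative names, not part of the repository):

```python
import json
import sys

def run_with_clean_stdout(work):
    """Run work() while routing stray prints to stderr, then emit one JSON object.

    Sketch of the pattern used by the extraction scripts; names are hypothetical.
    """
    original_stdout = sys.stdout
    sys.stdout = sys.stderr  # debug/library prints can no longer corrupt the payload
    try:
        payload = {"success": True, "ocr_results": work()}
    except Exception as exc:
        payload = {"error": str(exc)}
    finally:
        sys.stdout = original_stdout  # restore before writing the JSON response
    original_stdout.write(json.dumps(payload))
    original_stdout.flush()
    return payload
```

The caller (e.g. a Node.js service spawning the script) can then parse stdout as a single JSON document regardless of how noisy the OCR libraries are.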
extract_national_id.py ADDED
@@ -0,0 +1,364 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Philippine National ID Information Extraction Script
4
+
5
+ Purpose:
6
+ Extracts structured information from Philippine National ID card images using OCR.
7
+ This script is designed to work as a microservice for document verification systems.
8
+
9
+ Why this script exists:
10
+ - Manual verification of National IDs is time-consuming and error-prone
11
+ - Need standardized data extraction from bilingual (English/Filipino) ID cards
12
+ - Required for automated document verification workflows
13
+ - Supports API integration for web applications
14
+
15
+ Key Features:
16
+ - Extracts the 16-digit National ID number (format: XXXX-XXXX-XXXX-XXXX, 19 characters including hyphens)
17
+ - Handles bilingual labels (English/Filipino)
18
+ - Processes name composition from separate fields
19
+ - Formats dates consistently (ISO format: YYYY-MM-DD)
20
+ - Preserves EXIF metadata when possible
21
+
22
+ Dependencies:
23
+ - PaddleOCR: High-accuracy OCR engine (https://github.com/PaddlePaddle/PaddleOCR)
24
+ - Pillow (PIL): Image processing (https://pillow.readthedocs.io/)
25
+ - requests: HTTP library for image downloads (https://docs.python-requests.org/)
26
+
27
+ Usage:
28
+ python extract_national_id.py "https://example.com/national_id.jpg"
29
+
30
+ Output:
31
+ JSON with extracted information: id_number, full_name, birth_date
32
+ """
33
+
34
+ import sys, json, os, glob, re, requests
35
+ from PIL import Image
36
+ from io import BytesIO
37
+ from datetime import datetime
38
+ from contextlib import redirect_stdout, redirect_stderr
39
+
40
+ # Immediately redirect all output to stderr except for our final JSON
41
+ original_stdout = sys.stdout
42
+ sys.stdout = sys.stderr
43
+
44
+ # Suppress all PaddleOCR output
45
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
46
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
47
+ os.environ['DISPLAY'] = ':99'
48
+
49
+ # Import PaddleOCR after setting environment variables
50
+ from paddleocr import PaddleOCR
51
+
52
+ def clean_cache():
53
+ """
54
+ Delete all cached images and output files to prevent disk space issues.
55
+
56
+ Why this is needed:
57
+ - OCR processing creates temporary files that can accumulate
58
+ - Prevents disk space exhaustion in production environments
59
+ - Ensures clean state for each new document processing
60
+ """
61
+ # List of temporary files created by PaddleOCR and our processing
62
+ cache_files = [
63
+ 'temp_image.jpg', # Downloaded image
64
+ 'temp_image_ocr_res_img.jpg', # OCR result visualization
65
+ 'temp_image_preprocessed_img.jpg', # Preprocessed image
66
+ 'temp_image_res.json' # OCR result JSON
67
+ ]
68
+
69
+ # Delete individual cache files
70
+ for file in cache_files:
71
+ if os.path.exists(file):
72
+ os.remove(file)
73
+ print(f"DEBUG: Removed cached file: {file}", file=sys.stderr)
74
+
75
+ # Delete output directory and its contents
76
+ # This removes any OCR output files from previous runs
77
+ if os.path.exists("output"):
78
+ import shutil
79
+ shutil.rmtree("output")
80
+ print("DEBUG: Removed output directory", file=sys.stderr)
81
+
82
+ def download_image(url, output_path='temp_image.jpg'):
83
+ """
84
+ Download image from URL and process it for OCR.
85
+
86
+ Args:
87
+ url (str): URL of the National ID image to process
88
+ output_path (str): Local path to save the downloaded image
89
+
90
+ Returns:
91
+ str: Path to the downloaded and processed image
92
+
93
+ Why this approach:
94
+ - Downloads images from URLs (common in web applications)
95
+ - Converts RGBA to RGB (JPEG doesn't support alpha channels)
96
+ - Preserves EXIF data when possible (important for document authenticity)
97
+ - Uses high quality (95%) to maintain OCR accuracy
98
+ """
99
+ # Clean cache before downloading new image to prevent conflicts
100
+ clean_cache()
101
+
102
+ # Download the image with error handling
103
+ response = requests.get(url)
104
+ response.raise_for_status() # Raise exception for HTTP errors
105
+ image_data = response.content
106
+
107
+ # First, try to extract EXIF data from the original image
108
+ # EXIF data is important for document authenticity verification
109
+ original_exif = None
110
+ try:
111
+ original_image = Image.open(BytesIO(image_data))
112
+
113
+ # Check if image has EXIF data (some formats don't support it)
114
+ if hasattr(original_image, '_getexif') and original_image._getexif():
115
+ original_exif = original_image._getexif()
116
+ except Exception as e:
117
+ # If EXIF extraction fails, continue without it
118
+ pass
119
+
120
+ # Now process the image for OCR
121
+ image = Image.open(BytesIO(image_data))
122
+
123
+ # Convert to RGB if necessary (JPEG doesn't support alpha channel)
124
+ # This is crucial for OCR accuracy as PaddleOCR expects RGB images
125
+ if image.mode == 'RGBA':
126
+ # Create a white background for transparent images
127
+ background = Image.new('RGB', image.size, (255, 255, 255))
128
+ # Paste the image onto the background using alpha channel as mask
129
+ background.paste(image, mask=image.split()[-1]) # Use alpha channel as mask
130
+ image = background
131
+ elif image.mode != 'RGB':
132
+ # Convert other formats (like L for grayscale) to RGB
133
+ image = image.convert('RGB')
134
+
135
+ # Save as JPG and preserve EXIF data if it existed
136
+ # High quality (95%) maintains OCR accuracy while keeping file size reasonable
137
+ if original_exif:
138
+ image.save(output_path, 'JPEG', quality=95, exif=original_exif)
139
+ else:
140
+ image.save(output_path, 'JPEG', quality=95)
141
+
142
+ print(f"DEBUG: Downloaded and saved image to: {output_path}", file=sys.stderr)
143
+ return output_path
144
+
145
+ # Formatting Helpers
146
+ def format_birth_date(date_str):
147
+ """
148
+ Format birth date string to ISO format (YYYY-MM-DD).
149
+
150
+ Args:
151
+ date_str (str): Raw date string from OCR
152
+
153
+ Returns:
154
+ str: Formatted date in YYYY-MM-DD format, or original string if parsing fails
155
+
156
+ Why this complexity:
157
+ - OCR often produces inconsistent date formats
158
+ - Philippine IDs use various date formats
159
+ - Need standardized output for database storage
160
+ - Handles common OCR errors like missing spaces
161
+ """
162
+ if not date_str:
163
+ return None
164
+
165
+ # Try to match formats like 'JULY10,2003' or 'JULY 10, 2003'
166
+ # Remove spaces first to handle OCR inconsistencies
167
+ date_str = date_str.replace(' ', '')
168
+ match = re.match(r'([A-Za-z]+)(\d{1,2}),?(\d{4})', date_str)
169
+ if match:
170
+ month_str, day, year = match.groups()
171
+ try:
172
+ # Convert month name to number using first 3 characters
173
+ month = datetime.strptime(month_str[:3], '%b').month
174
+ return f"{year}-{int(month):02d}-{int(day):02d}"
175
+ except Exception:
176
+ pass
177
+
178
+ # Try to parse with datetime for other possible formats
179
+ # This handles various common date formats found in Philippine IDs
180
+ for fmt in ("%B%d,%Y", "%b%d,%Y", "%Y-%m-%d"):  # spaces were stripped above, so only compact formats can match
181
+ try:
182
+ dt = datetime.strptime(date_str, fmt)
183
+ return dt.strftime("%Y-%m-%d")
184
+ except Exception:
185
+ continue
186
+
187
+ # Fallback to original if parsing fails
188
+ return date_str
189
+
190
+ def capitalize_name(name):
191
+ """
192
+ Properly capitalize name string.
193
+
194
+ Args:
195
+ name (str): Raw name string from OCR
196
+
197
+ Returns:
198
+ str: Properly capitalized name
199
+
200
+ Why this is needed:
201
+ - OCR often produces inconsistent capitalization
202
+ - Need standardized name format for database storage
203
+ - Handles multiple spaces and OCR artifacts
204
+ """
205
+ if not name:
206
+ return name
207
+
208
+ # Capitalize each word, handling possible multiple spaces or OCR errors
209
+ return ' '.join([w.capitalize() for w in name.split()])
210
+
211
+ # OCR Function
212
+ def extract_id_info(lines):
213
+ """
214
+ Extract structured information from OCR text lines.
215
+
216
+ Args:
217
+ lines (list): List of text lines from OCR processing
218
+
219
+ Returns:
220
+ dict: Extracted information with keys: id_number, full_name, birth_date
221
+
222
+ Why this approach:
223
+ - Philippine National IDs have specific format requirements
224
+ - ID number is always 19 characters with 3 hyphens
225
+ - Uses bilingual labels (English/Filipino) for field identification
226
+ - Handles cases where labels and values are on separate lines
227
+ """
228
+ print("DEBUG: Processing lines:", lines, file=sys.stderr)
229
+
230
+ # Initialize variables for extracted information
231
+ id_number = None
232
+ last_name = None
233
+ given_names = None
234
+ birth_date = None
235
+
236
+ # Process each line to find relevant information
237
+ for i in range(len(lines)):
238
+ line = lines[i]
239
+ print(f"DEBUG: Processing line {i}: '{line}'", file=sys.stderr)
240
+
241
+ # Check for National ID number format: XXXX-XXXX-XXXX-XXXX
242
+ # This is the standard format for Philippine National IDs
243
+ if isinstance(line, str) and len(line) == 19 and line.count('-') == 3 and all(part.isdigit() for part in line.split('-')):
244
+ id_number = line
245
+ print(f"DEBUG: Found ID number: {id_number}", file=sys.stderr)
246
+
247
+ # Look for bilingual "Last Name" label
248
+ # Philippine IDs often have both English and Filipino labels
249
+ if line == "Apelyido/Last Name" and i+1 < len(lines):
250
+ last_name = lines[i+1]
251
+ print(f"DEBUG: Found last name: {last_name}", file=sys.stderr)
252
+
253
+ # Look for bilingual "Given Names" label
254
+ if line == "Mga Pangalan/Given Names" and i+1 < len(lines):
255
+ given_names = lines[i+1]
256
+ print(f"DEBUG: Found given names: {given_names}", file=sys.stderr)
257
+
258
+ # Look for bilingual "Date of Birth" label
259
+ if line == "Petsa ng Kapanganakan/Date of Birth" and i+1 < len(lines):
260
+ birth_date = lines[i+1]
261
+ print(f"DEBUG: Found birth date: {birth_date}", file=sys.stderr)
262
+
263
+ # Compose full name from separate fields
264
+ # Philippine names typically follow: Given Names + Last Name
265
+ name_parts = [given_names, last_name]
266
+ full_name = ' '.join([part for part in name_parts if part])
267
+ full_name = capitalize_name(full_name)
268
+
269
+ # Format birth date to ISO standard
270
+ formatted_birth_date = format_birth_date(birth_date) if birth_date else None
271
+
272
+ # Return structured result
273
+ result = {
274
+ 'id_number': id_number,
275
+ 'full_name': full_name,
276
+ 'birth_date': formatted_birth_date
277
+ }
278
+
279
+ print("DEBUG: Final result:", json.dumps(result, indent=2), file=sys.stderr)
280
+ return result
281
+
282
+ def extract_ocr_lines(image_path):
283
+ """
284
+ Perform OCR on image and extract text lines.
285
+
286
+ Args:
287
+ image_path (str): Path to the image file
288
+
289
+ Returns:
290
+ dict: Extracted information from the image
291
+
292
+ Why these OCR settings:
293
+ - Disabled expensive features for better performance in production
294
+ - English language setting (Philippine IDs primarily use English)
295
+ - Suppressed logs to keep output clean for API responses
296
+ - Offscreen rendering for server environments
297
+ """
298
+ # Ensure output directory exists for OCR results
299
+ os.makedirs("output", exist_ok=True)
300
+
301
+ # Initialize PaddleOCR with optimized settings
302
+ # These settings balance accuracy with performance
303
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
304
+ ocr = PaddleOCR(
305
+ use_doc_orientation_classify=False, # Disable for better performance
306
+ use_doc_unwarping=False, # Disable for better performance
307
+ use_textline_orientation=False, # Disable for better performance
308
+ lang='en', # English language
309
+ show_log=False # Suppress logs for clean output
310
+ )
311
+ results = ocr.ocr(image_path, cls=False)
312
+
313
+ # Process OCR results directly
314
+ all_text = []
315
+ try:
316
+ lines = results[0] if results and isinstance(results[0], list) else results
317
+ for item in lines:
318
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
319
+ meta = item[1]
320
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
321
+ all_text.append(str(meta[0]))
322
+ except Exception as e:
323
+ print(f"DEBUG: Error processing OCR results: {str(e)}", file=sys.stderr)
324
+
325
+ print(f"DEBUG: Extracted text lines: {all_text}", file=sys.stderr)
326
+ return extract_id_info(all_text) if all_text else {'id_number': None, 'full_name': None, 'birth_date': None}
327
+
328
+ # Main execution
329
+ if __name__ == "__main__":
330
+ # Validate command line arguments
331
+ if len(sys.argv) < 2:
332
+ print(json.dumps({"error": "No image URL provided"}))
333
+ sys.exit(1)
334
+
335
+ image_url = sys.argv[1]
336
+ print(f"DEBUG: Processing image URL: {image_url}", file=sys.stderr)
337
+
338
+ try:
339
+ # Download and process the image
340
+ image_path = download_image(image_url)
341
+
342
+ # Perform OCR and extract information
343
+ ocr_results = extract_ocr_lines(image_path)
344
+
345
+ # Restore stdout and print only the JSON response
346
+ sys.stdout = original_stdout
347
+ sys.stdout.write(json.dumps({
348
+ "success": True,
349
+ "ocr_results": ocr_results
350
+ }))
351
+ sys.stdout.flush()
352
+
353
+ except Exception as e:
354
+ # Restore stdout for error JSON
355
+ sys.stdout = original_stdout
356
+ sys.stdout.write(json.dumps({"error": str(e)}))
357
+ sys.stdout.flush()
358
+ sys.exit(1)
359
+ finally:
360
+ # Clean up
361
+ try:
362
+ clean_cache()
363
+ except Exception:
364
+ pass
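The ID-number check in `extract_id_info` (exactly 19 characters, three hyphens, all-digit groups) can be expressed as a single, slightly stricter regular expression that also pins each group to four digits. A standalone sketch (helper name is illustrative):

```python
import re

# Slightly stricter equivalent of the check extract_id_info() performs:
# the PhilSys card number is four 4-digit groups joined by hyphens,
# i.e. 16 digits in 19 characters.
PSN_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{4}")

def looks_like_psn(line: str) -> bool:
    """Return True when a line matches the XXXX-XXXX-XXXX-XXXX card-number layout."""
    return bool(PSN_RE.fullmatch(line))
```

Unlike the length-and-count check, the regex rejects malformed groupings such as `12345-678-9012-3456`, which also total 19 characters.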
extract_nbi_ocr.py ADDED
@@ -0,0 +1,200 @@
1
+ import sys, json, os, glob, requests
2
+ import re
3
+ import time
4
+ import shutil
5
+ from contextlib import redirect_stdout, redirect_stderr
6
+
7
+ # Immediately redirect all output to stderr except for our final JSON
8
+ original_stdout = sys.stdout
9
+ sys.stdout = sys.stderr
10
+
11
+ # Suppress all PaddleOCR output
12
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
13
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
14
+ os.environ['DISPLAY'] = ':99'
15
+
16
+ # Import PaddleOCR after setting environment variables
17
+ from paddleocr import PaddleOCR
18
+
19
+ def download_image(url, output_path='temp_image.jpg'):
20
+ # Remove any existing temp file
21
+ if os.path.exists(output_path):
22
+ os.remove(output_path)
23
+
24
+ # Add cache-busting parameters
25
+ timestamp = int(time.time())
26
+ if '?' in url:
27
+ url += f'&t={timestamp}'
28
+ else:
29
+ url += f'?t={timestamp}'
30
+
31
+ # Add headers to prevent caching
32
+ headers = {
33
+ 'Cache-Control': 'no-cache, no-store, must-revalidate',
34
+ 'Pragma': 'no-cache',
35
+ 'Expires': '0'
36
+ }
37
+
38
+ response = requests.get(url, headers=headers)
39
+ response.raise_for_status()
40
+ image_data = response.content
41
+
42
+
43
+ # Save the image and verify it's the right one
44
+ with open(output_path, 'wb') as f:
45
+ f.write(image_data)
46
+
47
+
48
+ return output_path
49
+
50
+ # OCR Function to extract NBI ID NO
51
+ def extract_nbi_id(lines):
52
+ nbi_id = None
53
+
54
+ for i, line in enumerate(lines):
55
+ if isinstance(line, str):
56
+ # Look for "NBI ID NO:" pattern
57
+ if "NBI ID NO:" in line.upper() or "NBIIDNO" in line.upper():
58
+ # Extract the ID after the colon
59
+ parts = line.split(':')
60
+ if len(parts) > 1 and parts[1].strip():
61
+ nbi_id = parts[1].strip()
62
+ break
63
+ # Also check if the next line contains the ID (in case it's on a separate line)
64
+ elif i < len(lines) - 1 and "NBI ID NO" in line.upper():
65
+ next_line = lines[i + 1]
66
+ if isinstance(next_line, str) and len(next_line.strip()) > 5:
67
+ nbi_id = next_line.strip()
68
+ break
69
+
70
+ # If not found with "NBI ID NO:" pattern, look for the specific format
71
+ if not nbi_id:
72
+ for line in lines:
73
+ if isinstance(line, str):
74
+ # Look for pattern like HGUR87H38D-U47204A873 (alphanumeric with one hyphen)
75
+ pattern = r'[A-Z0-9]{10,12}-[A-Z0-9]{10,12}'
76
+ match = re.search(pattern, line)
77
+ if match:
78
+ nbi_id = match.group()
79
+ break
80
+
81
+ return {
82
+ 'id_number': nbi_id,
83
+ 'full_name': None,
84
+ 'birth_date': None,
85
+ 'success': nbi_id is not None
86
+ }
87
+
88
+ def extract_ocr_lines_simple(image_path):
89
+
90
+ # Try with different PaddleOCR settings
91
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
92
+ ocr = PaddleOCR(
93
+ use_doc_orientation_classify=True, # Enable orientation detection
94
+ use_doc_unwarping=True, # Enable document unwarping
95
+ use_textline_orientation=True, # Enable text line orientation
96
+ lang='en', # Set language to English
97
+ show_log=False
98
+ )
99
+ results = ocr.ocr(image_path, cls=True)
100
+
101
+ all_text = []
102
+ try:
103
+ lines = results[0] if results and isinstance(results[0], list) else results
104
+ for item in lines:
105
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
106
+ meta = item[1]
107
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
108
+ all_text.append(str(meta[0]))
109
+ except Exception:
110
+ pass
111
+
112
+ return extract_nbi_id(all_text) if all_text else {'id_number': None, 'full_name': None, 'birth_date': None, 'success': False}
113
+
114
+ def extract_ocr_lines(image_path):
115
+ # Check if file exists and has content
116
+ if not os.path.exists(image_path):
117
+ return {'id_number': None, 'success': False}
118
+
119
+ # Ensure output directory exists
120
+ os.makedirs("output", exist_ok=True)
121
+
122
+ # Clear previous output files
123
+ for old_file in glob.glob("output/*"):
124
+ os.remove(old_file)
125
+
126
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
127
+ ocr = PaddleOCR(
128
+ use_doc_orientation_classify=False,
129
+ use_doc_unwarping=False,
130
+ use_textline_orientation=False,
131
+ show_log=False
132
+ )
133
+ results = ocr.ocr(image_path, cls=False)
134
+
135
+ # Process OCR results directly
136
+ all_text = []
137
+ try:
138
+ lines = results[0] if results and isinstance(results[0], list) else results
139
+ for item in lines:
140
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
141
+ meta = item[1]
142
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
143
+ all_text.append(str(meta[0]))
144
+ except Exception as e:
145
+ print(f"DEBUG: Error processing OCR results: {str(e)}", file=sys.stderr)
146
+
147
+ print(f"DEBUG: Extracted text lines: {all_text}", file=sys.stderr)
148
+ return extract_nbi_id(all_text) if all_text else {'id_number': None, 'full_name': None, 'birth_date': None, 'success': False}
149
+
150
+ # Main
151
+ if len(sys.argv) < 2:
152
+ sys.stdout = original_stdout
153
+ print(json.dumps({"success": False, "error": "No image URL provided"}))
154
+ sys.exit(1)
155
+
156
+ image_url = sys.argv[1]
157
+ print(f"DEBUG: Processing NBI image URL: {image_url}", file=sys.stderr)
158
+
159
+ try:
160
+ image_path = download_image(image_url, 'temp_image.jpg')
161
+ print(f"DEBUG: Image downloaded to: {image_path}", file=sys.stderr)
162
+
163
+ # Try the original OCR method first
164
+ ocr_results = extract_ocr_lines(image_path)
165
+ print(f"DEBUG: OCR results from extract_ocr_lines: {ocr_results}", file=sys.stderr)
166
+
167
+ # If original method fails, try simple method
168
+ if not ocr_results['success']:
169
+ print("DEBUG: Original method failed, trying simple method", file=sys.stderr)
170
+ ocr_results = extract_ocr_lines_simple(image_path)
171
+ print(f"DEBUG: OCR results from extract_ocr_lines_simple: {ocr_results}", file=sys.stderr)
172
+
173
+ # Clean up the temporary file
174
+ if os.path.exists(image_path):
175
+ os.remove(image_path)
176
+
177
+ # Create the response object
178
+ response = {
179
+ "success": ocr_results['success'],
180
+ "ocr_results": ocr_results
181
+ }
182
+
183
+ # Restore stdout and print only the JSON response
184
+ sys.stdout = original_stdout
185
+ sys.stdout.write(json.dumps(response))
186
+ sys.stdout.flush()
187
+
188
+ except Exception as e:
189
+ # Restore stdout for error JSON
190
+ sys.stdout = original_stdout
191
+ sys.stdout.write(json.dumps({"success": False, "error": str(e)}))
192
+ sys.stdout.flush()
193
+ sys.exit(1)
194
+ finally:
195
+ # Clean up
196
+ try:
197
+ if os.path.exists('temp_image.jpg'):
198
+ os.remove('temp_image.jpg')
199
+ except Exception:
200
+ pass
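When no `NBI ID NO:` label survives OCR, `extract_nbi_id` falls back to scanning for the hyphenated alphanumeric ID shape itself. A self-contained sketch of that fallback (names are illustrative):

```python
import re

# The same fallback pattern extract_nbi_id() uses: a hyphenated alphanumeric
# ID such as HGUR87H38D-U47204A873 (two 10-12 character halves).
NBI_ID_RE = re.compile(r"[A-Z0-9]{10,12}-[A-Z0-9]{10,12}")

def find_nbi_id(lines):
    """Return the first NBI-ID-shaped token found in the OCR lines, else None."""
    for line in lines:
        match = NBI_ID_RE.search(line)
        if match:
            return match.group()
    return None
```

Because `search` is used rather than `fullmatch`, the ID is still found when it shares a line with label text.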
extract_passport.py ADDED
@@ -0,0 +1,459 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Philippine Passport Information Extraction Script
4
+
5
+ Purpose:
6
+ Extracts structured information from Philippine passport images using OCR.
7
+ Handles complex passport layouts with multiple information fields.
8
+
9
+ Why this script exists:
10
+ - Passports have complex layouts with multiple information fields
11
+ - Need to extract international-standard passport information
12
+ - Handles bilingual labels (English/Filipino)
13
+ - Required for passport verification workflows
14
+
15
+ Key Features:
16
+ - Extracts passport number (format: X0000000A)
17
+ - Handles complex name structures (surname, given names, middle name)
18
+ - Processes multiple date fields (birth, issue, expiration)
19
+ - Extracts nationality and place of birth
20
+ - Handles OCR digit correction
21
+
22
+ Dependencies:
23
+ - PaddleOCR: High-accuracy OCR engine (https://github.com/PaddlePaddle/PaddleOCR)
24
+ - Pillow (PIL): Image processing (https://pillow.readthedocs.io/)
25
+ - requests: HTTP library (https://docs.python-requests.org/)
26
+
27
+ Usage:
28
+ python extract_passport.py "https://example.com/passport.jpg"
29
+
30
+ Output:
31
+ JSON with extracted information: passport_number, full_name, birth_date, valid_until, etc.
32
+ """
33
+
34
+ import sys, json, os, glob, re, requests
35
+ from PIL import Image
36
+ from io import BytesIO
37
+ from datetime import datetime
38
+ from contextlib import redirect_stdout, redirect_stderr
39
+
40
+ # Route any non-JSON prints to stderr by default
41
+ _ORIG_STDOUT = sys.stdout
42
+ sys.stdout = sys.stderr
43
+
44
+ # Suppress all PaddleOCR output
45
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
46
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
47
+ os.environ['DISPLAY'] = ':99'
48
+
49
+ # Import PaddleOCR after setting environment variables
50
+ from paddleocr import PaddleOCR
51
+
52
+ def dprint(msg, obj=None):
53
+ try:
54
+ print(f"DEBUG: {msg}" + (f": {obj}" if obj is not None else ""), file=sys.stderr)
55
+ except Exception:
56
+ pass
57
+
58
+ def clean_cache():
59
+ files = ['temp_image.jpg', 'temp_image_ocr_res_img.jpg', 'temp_image_preprocessed_img.jpg', 'temp_image_res.json']
60
+ for f in files:
61
+ if os.path.exists(f):
62
+ os.remove(f)
63
+ dprint("Removed cache file", f)
64
+ if os.path.exists("output"):
65
+ import shutil
66
+ shutil.rmtree("output")
67
+ dprint("Removed output directory")
68
+
69
+ def download_image(url, output_path='temp_image.jpg'):
70
+ dprint("Starting download", url)
71
+ clean_cache()
72
+ r = requests.get(url)
73
+ dprint("HTTP status", r.status_code)
74
+ r.raise_for_status()
75
+ img = Image.open(BytesIO(r.content))
76
+ if img.mode == 'RGBA':
77
+ bg = Image.new('RGB', img.size, (255, 255, 255))
78
+ bg.paste(img, mask=img.split()[-1])
79
+ img = bg
80
+ elif img.mode != 'RGB':
81
+ img = img.convert('RGB')
82
+ img.save(output_path, 'JPEG', quality=95)
83
+ dprint("Saved image", output_path)
84
+ return output_path
85
+
86
+ def cap_words(s):
87
+ return None if not s else ' '.join(w.capitalize() for w in s.split())
88
+
89
+ def normalize_digits(s):
90
+ """
91
+ Fix common OCR digit confusions.
92
+
93
+ Args:
94
+ s (str): Text string that may contain OCR errors
95
+
96
+ Returns:
97
+ str: Text with corrected digits
98
+
99
+ Why this is needed:
100
+ - OCR often misreads similar-looking characters
101
+ - Common errors: O→0, o→0, I/l→1, S→5, B→8
102
+ - Critical for accurate ID number extraction
103
+ """
104
+ return (
105
+ str(s)
106
+ .replace('O','0').replace('o','0')
107
+ .replace('I','1').replace('l','1')
108
+ .replace('S','5')
109
+ .replace('B','8')
110
+ )
111
+
112
+ def normalize_full_name(surname, given_names, middle_name=None):
113
+ if not surname and not given_names:
114
+ return None
115
+
116
+ surname = surname.strip() if surname else ""
117
+ given_names = given_names.strip() if given_names else ""
118
+ middle_name = middle_name.strip() if middle_name else ""
119
+
120
+ # Combine given names (first + second if present)
121
+ given_parts = [p for p in given_names.split() if p]
122
+ if len(given_parts) >= 2:
123
+ # Keep first two given names, ignore middle name
124
+ name_parts = [given_parts[0], given_parts[1], surname]
125
+ elif len(given_parts) == 1:
126
+ name_parts = [given_parts[0], surname]
127
+ else:
128
+ name_parts = [surname]
129
+
130
+ return cap_words(' '.join(name_parts))
131
+
132
+ def format_date(s):
133
+ if not s:
134
+ return None
135
+ raw = str(s).strip()
136
+
137
+ # Month abbreviations such as OCT, SEP and NOV contain O/S, so
138
+ # normalize_digits() is not applied to the whole string here; doing so
+ # (O->0, S->5) would break the strptime() month parsing below.
139
+
140
+ # Handle "16MAR1980" format (no spaces)
141
+ try:
142
+ if re.match(r'\d{2}[A-Z]{3}\d{4}', raw):
143
+ return datetime.strptime(raw, "%d%b%Y").strftime("%Y-%m-%d")
144
+ except Exception:
145
+ pass
146
+
147
+ # Handle "16 MAR 1980" format (with spaces)
148
+ try:
149
+ return datetime.strptime(raw, "%d %b %Y").strftime("%Y-%m-%d")
150
+ except Exception:
151
+ pass
152
158
+
159
+ # Handle remaining, purely numeric date formats (digit normalization is safe here)
160
+ t = normalize_digits(raw).replace(' ', '').replace('\\', '/').replace('.', '/')
161
+ if re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', t):
162
+ return t.replace('/', '-')
163
+ if re.match(r'^\d{2}/\d{2}/\d{4}$', raw):
164
+ m, d, y = raw.split('/')
165
+ return f"{y}-{int(m):02d}-{int(d):02d}"
166
+
167
+ return raw
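The two month-abbreviation branches above ("16MAR1980" and "16 MAR 1980") reduce to one loop over `strptime` formats. A hypothetical standalone helper mirroring just that path, shown here outside the script for clarity:

```python
from datetime import datetime

def parse_passport_date(raw: str):
    """Try the two passport date layouts handled above: '16MAR1980' and '16 MAR 1980'.

    Illustrative helper, not part of extract_passport.py.
    """
    for fmt in ("%d%b%Y", "%d %b %Y"):
        try:
            # %b matches abbreviated month names case-insensitively
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None
```

Anything that matches neither layout falls through to the numeric-format handling.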
168
+
169
+ def extract_passport_number(text):
170
+ # Fix OCR digits first
171
+ text = normalize_digits(text)
172
+ # Look for passport number pattern like "P0000000A"
173
+ passport_pattern = r'\b([A-Z]\d{7}[A-Z0-9])\b'
174
+ match = re.search(passport_pattern, text)
175
+ if match:
176
+ return match.group(1)
177
+ return None
178
+
179
+ def take_within(lines, i, k=5):
180
+ out = []
181
+ for j in range(1, k+1):
182
+ if i+j < len(lines):
183
+ t = str(lines[i+j]).strip()
184
+ if t:
185
+ out.append(t)
186
+ return out
187
+
188
+ def extract_passport_info(lines):
189
+ """
190
+ Extract passport information from OCR text lines.
191
+
192
+ Args:
193
+ lines (list): List of text lines from OCR processing
194
+
195
+ Returns:
196
+ dict: Extracted passport information
197
+
198
+ Why this approach:
199
+ - Passports follow ICAO standards with specific formats
200
+ - Complex name structure requires separate handling
201
+ - Multiple date fields need individual processing
202
+ - Uses lookahead pattern matching for field extraction
203
+ """
204
+ dprint("Lines to extract", lines)
205
+
206
+ # Initialize variables for extracted information
207
+ full_name = None
208
+ surname = None
209
+ given_names = None
210
+ middle_name = None
211
+ passport_number = None
212
+ birth_date = None
213
+ sex = None
214
+ nationality = None
215
+ place_of_birth = None
216
+ date_of_issue = None
217
+ valid_until = None
218
+ issuing_authority = None
219
+
220
+ L = [str(x or '').strip() for x in lines]
221
+ i = 0
222
+ while i < len(L):
223
+ line = L[i]
224
+ low = line.lower()
225
+ dprint("Line", {"i": i, "text": line})
226
+
227
+ # Extract passport number using pattern matching
228
+ if not passport_number:
229
+ passport_num = extract_passport_number(line)
230
+ if passport_num:
231
+ passport_number = passport_num
232
+ dprint("Found passport number", passport_number)
233
+
234
+ # Extract Surname using lookahead pattern
235
+ if 'surname' in low or 'apelyido' in low:
236
+ ahead = take_within(L, i, 3)
237
+ for t in ahead:
238
+ if re.search(r'[A-Z]{2,}', t) and not re.search(r'[0-9]', t):
239
+ surname = t
240
+ dprint("Found surname", surname)
241
+ break
242
+ # Also look for "DELA CRUZ" directly
243
+ if not surname and 'dela' in low and 'cruz' in low:
244
+ surname = line
245
+ dprint("Found surname (direct)", surname)
246
+
247
+ # Extract Given Names
248
+ if ('given' in low and 'name' in low) or 'pangalan' in low:
249
+ ahead = take_within(L, i, 3)
250
+ for t in ahead:
251
+ if re.search(r'[A-Z]{2,}', t) and not re.search(r'[0-9]', t):
252
+ given_names = t
253
+ dprint("Found given names", given_names)
254
+ break
255
+ # Also look for "MARIA" directly
256
+ if not given_names and line == 'MARIA':
257
+ given_names = line
258
+ dprint("Found given names (direct)", given_names)
259
+
260
+ # Extract Middle Name
261
+ if 'middle' in low or 'panggitnang' in low:
262
+ ahead = take_within(L, i, 3)
263
+ for t in ahead:
264
+ if re.search(r'[A-Z]{2,}', t) and not re.search(r'[0-9]', t):
265
+ middle_name = t
266
+ dprint("Found middle name", middle_name)
267
+ break
268
+ # Also look for "SANTOS" directly
269
+ if not middle_name and line == 'SANTOS':
270
+ middle_name = line
271
+ dprint("Found middle name (direct)", middle_name)
272
+
273
+ # Extract Date of Birth
274
+ if 'birth' in low or 'kapanganakan' in low:
275
+ ahead = take_within(L, i, 3)
276
+ for t in ahead:
277
+ if re.search(r'\d{1,2}[A-Z]{3}\d{4}', t) or re.search(r'\d{1,2}\s+[A-Z]{3}\s+\d{4}', t):
278
+ birth_date = format_date(t)
279
+ dprint("Found birth date", birth_date)
280
+ break
281
+ # Also look for "16MAR1980" directly
282
+ if not birth_date and re.search(r'\d{1,2}[A-Z]{3}\d{4}', line):
283
+ birth_date = format_date(line)
284
+ dprint("Found birth date (direct)", birth_date)
285
+
286
+ # Extract Sex
287
+ if 'sex' in low or 'kasarian' in low:
288
+ ahead = take_within(L, i, 2)
289
+ for t in ahead:
290
+ if t.upper() in ['M', 'F', 'MALE', 'FEMALE']:
291
+ sex = 'M' if t.upper().startswith('M') else 'F'
292
+ dprint("Found sex", sex)
293
+ break
294
+ # Also look for "F" directly
295
+ if not sex and line == 'F':
296
+ sex = 'F'
297
+ dprint("Found sex (direct)", sex)
298
+
299
+ # Extract Nationality
300
+ if 'nationality' in low or 'nasyonalidad' in low:
301
+ ahead = take_within(L, i, 3)
302
+ for t in ahead:
303
+ if re.search(r'[A-Z]{2,}', t) and not re.search(r'[0-9]', t):
304
+ nationality = t
305
+ dprint("Found nationality", nationality)
306
+ break
307
+ # Also look for "FILIPINO" directly
308
+ if not nationality and line == 'FILIPINO':
309
+ nationality = line
310
+ dprint("Found nationality (direct)", nationality)
311
+
312
+ # Extract Place of Birth
313
+ if 'place' in low and 'birth' in low or 'lugar' in low:
314
+ ahead = take_within(L, i, 3)
315
+ for t in ahead:
316
+ if re.search(r'[A-Z]{2,}', t) and not re.search(r'[0-9]', t):
317
+ place_of_birth = t
318
+ dprint("Found place of birth", place_of_birth)
319
+ break
320
+ # Also look for "MANILA" directly
321
+ if not place_of_birth and line == 'MANILA':
322
+ place_of_birth = line
323
+ dprint("Found place of birth (direct)", place_of_birth)
324
+
325
+ # Extract Date of Issue
326
+ if 'issue' in low or 'pagkakaloob' in low:
327
+ ahead = take_within(L, i, 3)
328
+ for t in ahead:
329
+ if re.search(r'\d{1,2}[A-Z]{3}\d{4}', t) or re.search(r'\d{1,2}\s+[A-Z]{3}\s+\d{4}', t):
330
+ date_of_issue = format_date(t)
331
+ dprint("Found date of issue", date_of_issue)
332
+ break
333
+ # Also look for "27JUN2016" directly
334
+ if not date_of_issue and re.search(r'\d{1,2}[A-Z]{3}\d{4}', line):
335
+ date_of_issue = format_date(line)
336
+ dprint("Found date of issue (direct)", date_of_issue)
337
+
338
+ # Extract Valid Until Date
339
+ if 'valid' in low or 'pagkawalang' in low:
340
+ ahead = take_within(L, i, 3)
341
+ for t in ahead:
342
+ if re.search(r'\d{1,2}\s+[A-Z]{3}\s+\d{4}', t):
343
+ valid_until = format_date(t)
344
+ dprint("Found valid until", valid_until)
345
+ break
346
+ # Also look for "26 JUN 2021" directly
347
+ if not valid_until and re.search(r'\d{1,2}\s+[A-Z]{3}\s+\d{4}', line):
348
+ valid_until = format_date(line)
349
+ dprint("Found valid until (direct)", valid_until)
350
+
351
+ # Extract Issuing Authority
352
+ if 'authority' in low or 'maykapangyarihang' in low:
353
+ ahead = take_within(L, i, 3)
354
+ for t in ahead:
355
+ if re.search(r'[A-Z]{2,}', t) and 'DFA' in t:
356
+ issuing_authority = t
357
+ dprint("Found issuing authority", issuing_authority)
358
+ break
359
+ # Also look for "DFAMANILA" directly
360
+ if not issuing_authority and 'DFA' in line:
361
+ issuing_authority = line
362
+ dprint("Found issuing authority (direct)", issuing_authority)
363
+
364
+ i += 1
365
+
+    # Compose full name from separate fields
+    if not full_name:
+        full_name = normalize_full_name(surname, given_names, middle_name)
+        dprint("Composed full name", {"surname": surname, "given": given_names, "middle": middle_name, "full": full_name})
+
+    # Return structured result
+    result = {
+        "id_type": "passport",
+        "passport_number": passport_number,
+        "id_number": passport_number,
+        "full_name": full_name,
+        "surname": surname,
+        "given_names": given_names,
+        "middle_name": middle_name,
+        "birth_date": birth_date,
+        "sex": sex,
+        "nationality": nationality,
+        "place_of_birth": place_of_birth,
+        "date_of_issue": date_of_issue,
+        "valid_until": valid_until,
+        "issuing_authority": issuing_authority
+    }
+    dprint("Final result", result)
+    return result
+
+ def extract_ocr_lines(image_path):
+     os.makedirs("output", exist_ok=True)
+     dprint("Initializing PaddleOCR")
+
+     # Ensure any internal downloader/progress writes go to stderr, not stdout
+     with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
+         ocr = PaddleOCR(
+             use_doc_orientation_classify=False,
+             use_doc_unwarping=False,
+             use_textline_orientation=False,
+             lang='en',
+             show_log=False
+         )
+     dprint("OCR initialized")
+     dprint("Running OCR on", image_path)
+     results = ocr.ocr(image_path, cls=False)
+     try:
+         count = len(results[0]) if results and isinstance(results[0], list) else len(results)
+     except Exception:
+         count = 0
+     dprint("OCR done, results_count", count)
+
+     # Process OCR results directly
+     all_text = []
+     try:
+         lines = results[0] if results and isinstance(results[0], list) else results
+         for item in lines:
+             if isinstance(item, (list, tuple)) and len(item) >= 2:
+                 meta = item[1]
+                 if isinstance(meta, (list, tuple)) and len(meta) >= 1:
+                     all_text.append(str(meta[0]))
+     except Exception as e:
+         dprint("Error processing OCR results", str(e))
+
+     dprint("All direct texts", all_text)
+     return extract_passport_info(all_text) if all_text else {
+         "id_type": "passport",
+         "passport_number": None,
+         "id_number": None,
+         "full_name": None,
+         "birth_date": None
+     }
+
+ if len(sys.argv) < 2:
+     sys.stdout = _ORIG_STDOUT
+     print(json.dumps({"error": "No image URL provided"}))
+     sys.exit(1)
+
+ image_url = sys.argv[1]
+ dprint("Processing image URL", image_url)
+ try:
+     image_path = download_image(image_url)
+     dprint("Image downloaded to", image_path)
+     ocr_results = extract_ocr_lines(image_path)
+     dprint("OCR results ready")
+     # Ensure only the final JSON goes to stdout
+     sys.stdout = _ORIG_STDOUT
+     print(json.dumps({"success": True, "ocr_results": ocr_results}))
+ except Exception as e:
+     import traceback
+     error_msg = str(e)
+     traceback_msg = traceback.format_exc()
+     dprint("Exception", error_msg)
+     dprint("Traceback", traceback_msg)
+     # Restore stdout so the error JSON is actually emitted there
+     sys.stdout = _ORIG_STDOUT
+     print(json.dumps({
+         "error": error_msg,
+         "traceback": traceback_msg,
+         "success": False
+     }))
+     sys.exit(1)
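Both extractor scripts rely on the same process contract: everything diagnostic is pushed to stderr, and exactly one JSON document reaches stdout, so the calling process can parse stdout blindly. A minimal, self-contained sketch of that contract (`run_tool` is a hypothetical stand-in for the OCR pipeline, not a function from the scripts):

```python
import io
import json
import sys
from contextlib import redirect_stdout

def run_tool():
    # Debug chatter goes to stderr; only the final JSON payload goes to stdout,
    # so a caller can json-parse stdout without filtering log lines.
    print("DEBUG: OCR finished", file=sys.stderr)
    return {"success": True, "ocr_results": {"id_type": "passport"}}

# Capture stdout the way a parent process would read the pipe
buf = io.StringIO()
with redirect_stdout(buf):
    print(json.dumps(run_tool()))

# The captured stream contains exactly one parseable JSON object
payload = json.loads(buf.getvalue())
```

Reassigning `sys.stdout` back to the saved handle, as the scripts do with `_ORIG_STDOUT` / `original_stdout`, achieves the same separation as the `redirect_stdout` context used here.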
extract_phic.py ADDED
@@ -0,0 +1,405 @@
+ import sys, json, os, glob, requests
+ import re
+ import time
+ from contextlib import redirect_stdout, redirect_stderr
+ from datetime import datetime
+
+ # Immediately redirect all output to stderr except for our final JSON
+ original_stdout = sys.stdout
+ sys.stdout = sys.stderr
+
+ # Suppress all PaddleOCR output
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+
+ # Import PaddleOCR after setting environment variables
+ from paddleocr import PaddleOCR
+
+ def download_image(url, output_path='temp_phic_image.jpg'):
+     # Remove any existing temp file
+     if os.path.exists(output_path):
+         os.remove(output_path)
+
+     # Add cache-busting parameters
+     timestamp = int(time.time())
+     if '?' in url:
+         url += f'&t={timestamp}'
+     else:
+         url += f'?t={timestamp}'
+
+     # Add headers to prevent caching
+     headers = {
+         'Cache-Control': 'no-cache, no-store, must-revalidate',
+         'Pragma': 'no-cache',
+         'Expires': '0'
+     }
+
+     response = requests.get(url, headers=headers)
+     response.raise_for_status()
+     image_data = response.content
+
+     # Save the image
+     with open(output_path, 'wb') as f:
+         f.write(image_data)
+
+     return output_path
+
+ def format_date(date_str):
+     """Format date from various formats to YYYY-MM-DD."""
+     if not date_str:
+         return None
+
+     date_str = date_str.strip()
+
+     # Fix common OCR misreads of month names first. Word-boundary matching
+     # keeps already-correct spellings intact ("March" is not turned into "Marchh")
+     ocr_month_fixes = [
+         ('VULY', 'JULY'), ('VUL', 'JUL'), ('Juy', 'July'),
+         ('Januay', 'January'), ('Februay', 'February'),
+         ('Marc', 'March'), ('Apil', 'April'),
+         ('Augu', 'August'), ('Septemb', 'September'),
+         ('Octob', 'October'), ('Novem', 'November'), ('Decemb', 'December'),
+     ]
+     for bad, good in ocr_month_fixes:
+         date_str = re.sub(rf'\b{bad}\b', good, date_str)
+
+     # Fix period instead of comma: "10.2003" -> "10, 2003"
+     date_str = re.sub(r'(\d{1,2})\.(\d{4})', r'\1, \2', date_str)
+
+     # Handle format like "JULY 10, 2003" or "July 10, 2003"
+     match = re.match(r'([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})', date_str)
+     if match:
+         month_str, day, year = match.groups()
+         try:
+             # Parse the month name: try the three-letter abbreviation first,
+             # then fall back to the full month name
+             try:
+                 month = datetime.strptime(month_str[:3], '%b').month
+             except ValueError:
+                 month = datetime.strptime(month_str, '%B').month
+             return f"{year}-{month:02d}-{int(day):02d}"
+         except Exception as e:
+             print(f"DEBUG: Date parsing error: {e}", file=sys.stderr)
+
+     # Try other common formats
+     for fmt in ["%B %d, %Y", "%b %d, %Y", "%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]:
+         try:
+             dt = datetime.strptime(date_str, fmt)
+             return dt.strftime("%Y-%m-%d")
+         except ValueError:
+             continue
+
+     return date_str
+
+ def format_name(name):
+     """Format name: capitalize properly and fix spacing."""
+     if not name:
+         return None
+
+     # Ensure comma spacing: "FERNANDEZ,CARL" -> "FERNANDEZ, CARL"
+     name = re.sub(r',([A-Z])', r', \1', name)
+     name = re.sub(r',\s*([A-Z])', r', \1', name)
+
+     # Fix missing spaces in name parts
+     # Add space before capital letters that follow lowercase
+     name = re.sub(r'([a-z])([A-Z])', r'\1 \2', name)
+     # Add space between consecutive capitals followed by lowercase (name parts)
+     name = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', name)
+
+     # Remove extra spaces and normalize
+     name = ' '.join(name.split())
+
+     # Capitalize each word properly
+     name = ' '.join([word.capitalize() for word in name.split()])
+
+     return name.strip()
+
+ def format_address(address_lines):
+     """Format address from multiple lines and fix OCR errors."""
+     if not address_lines:
+         return None
+
+     # Join address lines and clean up
+     address = ' '.join([line.strip() for line in address_lines if line.strip()])
+
+     # Fix OCR errors in address ("AZAL-" must be handled before "AZAL")
+     address = address.replace('DYLABONFACIO', 'BONIFACIO')
+     address = address.replace('AVENJE', 'AVENUE')
+     address = address.replace('SANIODOMINGO', 'SANTO DOMINGO')
+     address = address.replace('SANIO', 'SANTO')
+     address = address.replace('AZAL-', 'RIZAL ')
+     address = address.replace('AZAL', 'RIZAL')
+
+     # Fix missing spaces: "071A" -> "071 A"
+     address = re.sub(r'(\d+)([A-Z])', r'\1 \2', address)
+
+     # Fix missing spaces before city/province names
+     address = re.sub(r'([A-Z])([A-Z][a-z]+)', r'\1 \2', address)
+
+     # Remove extra spaces
+     address = ' '.join(address.split())
+
+     return address.strip()
+
+ def extract_phic_details(lines):
+     details = {
+         'id_number': None,
+         'full_name': None,
+         'birth_date': None,
+         'sex': None,
+         'address': None,
+         'membership_category': None,
+         'success': False
+     }
+
+     # Clean lines - convert to strings and strip
+     cleaned_lines = [str(line).strip() for line in lines if str(line).strip()]
+
+     for i, line in enumerate(cleaned_lines):
+         line_upper = line.upper().strip()
+         line_stripped = line.strip()
+
+         # Extract ID Number - Format: XX-XXXXXXXXX-X (e.g., "03-026765383-2")
+         if not details['id_number']:
+             # Look for pattern: digits-digits-digits with hyphens
+             id_match = re.search(r'(\d{2}-\d{9}-\d)', line_stripped)
+             if id_match:
+                 details['id_number'] = id_match.group(1)
+             # Also check for a 12-digit run whose hyphens were lost in OCR
+             elif re.match(r'^\d{12}$', line_stripped):
+                 # Format: 030267653832 -> 03-026765383-2
+                 details['id_number'] = f"{line_stripped[:2]}-{line_stripped[2:11]}-{line_stripped[11]}"
+
+         # Extract Full Name - usually appears as "LASTNAME, FIRST MIDDLE"
+         if not details['full_name']:
+             # Look for name pattern with comma (LASTNAME, FIRST MIDDLE)
+             if ',' in line_stripped and re.match(r'^[A-Z\s,]+$', line_stripped):
+                 # Make sure it's not a label
+                 if not any(label in line_upper for label in ['REPUBLIC', 'PHILIPPINE', 'HEALTH', 'INSURANCE', 'CORPORATION', 'PHILHEALTH', 'DATE', 'BIRTH', 'ADDRESS', 'MEMBERSHIP', 'CATEGORY']):
+                     details['full_name'] = line_stripped
+             # Also check for name parts on separate lines
+             elif re.match(r'^[A-Z\s,]+$', line_stripped) and len(line_stripped.split()) >= 2:
+                 # Make sure it's not a label
+                 if not any(label in line_upper for label in ['REPUBLIC', 'PHILIPPINE', 'HEALTH', 'INSURANCE', 'CORPORATION', 'PHILHEALTH', 'DATE', 'BIRTH', 'ADDRESS', 'MEMBERSHIP', 'CATEGORY', 'INFORMAL', 'ECONOMY']):
+                     # Check if the previous line might be part of the name
+                     if i > 0:
+                         prev_line = cleaned_lines[i-1].strip()
+                         if ',' in prev_line and re.match(r'^[A-Z\s,]+$', prev_line):
+                             details['full_name'] = prev_line
+                         elif re.match(r'^[A-Z\s,]+$', line_stripped):
+                             # Try to combine with the previous line if it looks like a name
+                             if re.match(r'^[A-Z\s,]+$', prev_line) and ',' in prev_line:
+                                 details['full_name'] = prev_line
+
+         # Extract Date of Birth and Sex - e.g. "JULY 10, 2003 - MALE" or "VULY10.2003-MALE" (OCR errors)
+         if not details['birth_date'] or not details['sex']:
+             # Month name (possibly garbled) + day (period or comma) + year + sex
+             date_sex_match = re.search(r'([A-Za-z]+)\s*(\d{1,2})[.,]\s*(\d{4})\s*[-]?\s*(MALE|FEMALE)', line_upper)
+             if date_sex_match:
+                 month_str = date_sex_match.group(1)
+                 day = date_sex_match.group(2)
+                 year = date_sex_match.group(3)
+                 sex_str = date_sex_match.group(4)
+
+                 # Fix OCR misreads of the month token. A whole-token lookup
+                 # leaves an already-correct month name untouched
+                 month_fixes = {
+                     'VULY': 'JULY', 'VUL': 'JUL', 'JANUA': 'JANUARY',
+                     'FEBRUA': 'FEBRUARY', 'MARC': 'MARCH', 'APIL': 'APRIL',
+                     'AUGU': 'AUGUST', 'SEPTEM': 'SEPTEMBER', 'OCTO': 'OCTOBER',
+                     'NOVEM': 'NOVEMBER', 'DECEM': 'DECEMBER',
+                 }
+                 month_str = month_fixes.get(month_str, month_str)
+
+                 if not details['birth_date']:
+                     details['birth_date'] = f"{month_str} {day}, {year}"
+                 if not details['sex']:
+                     details['sex'] = sex_str.capitalize()
+             # Also check for a date pattern without the sex suffix
+             elif not details['birth_date']:
+                 date_match = re.search(r'([A-Za-z]+)\s*(\d{1,2})[.,]\s*(\d{4})', line_upper)
+                 if date_match:
+                     month_str = date_match.group(1)
+                     day = date_match.group(2)
+                     year = date_match.group(3)
+                     # Fix OCR errors
+                     month_str = {'VULY': 'JULY', 'VUL': 'JUL'}.get(month_str, month_str)
+                     # Make sure it's not part of an address or other field
+                     if "AVENUE" not in line_upper and "STREET" not in line_upper and "RIZAL" not in line_upper:
+                         details['birth_date'] = f"{month_str} {day}, {year}"
+             # Check for sex alone if the date was found separately
+             if not details['sex']:
+                 if "MALE" in line_upper and "FEMALE" not in line_upper:
+                     details['sex'] = "Male"
+                 elif "FEMALE" in line_upper:
+                     details['sex'] = "Female"
+
+         # Extract Address - usually a long line with street, city, province, zip
+         if not details['address']:
+             # Look for address indicators (but exclude ID numbers and names)
+             if (any(indicator in line_upper for indicator in ['AVENUE', 'AVENJE', 'STREET', 'ROAD', 'BLVD', 'BOULEVARD', 'CAINTA', 'RIZAL', 'AZAL', 'SANTO', 'SANIO', 'DOMINGO', 'BONIFACIO', 'DYLABONFACIO']) or
+                     (re.match(r'^\d+', line_stripped) and len(line_stripped) > 5)):
+                 # Make sure it's not an ID number or name
+                 is_id = bool(re.search(r'\d{2}-\d{9}-\d', line_stripped))
+                 is_name = bool(',' in line_stripped and re.match(r'^[A-Z\s,]+$', line_stripped) and len(line_stripped.split()) <= 5)
+
+                 if not is_id and not is_name:
+                     address_lines = []
+                     # Collect address lines forward
+                     for j in range(0, min(3, len(cleaned_lines) - i)):
+                         addr_line = cleaned_lines[i+j].strip()
+                         addr_upper = addr_line.upper()
+                         # Stop at a membership category, ID number, name, or other label
+                         if (any(label in addr_upper for label in ['MEMBERSHIP', 'CATEGORY', 'INFORMAL', 'ECONOMY', 'DATE', 'BIRTH', 'MALE', 'FEMALE']) or
+                                 re.search(r'\d{2}-\d{9}-\d', addr_line) or
+                                 (',' in addr_line and re.match(r'^[A-Z\s,]+$', addr_line) and len(addr_line.split()) <= 5)):
+                             break
+                         # Add if it looks like address content
+                         if addr_line and len(addr_line) > 2:
+                             address_lines.append(addr_line)
+                     if address_lines:
+                         details['address'] = format_address(address_lines)
+
+         # Extract Membership Category - usually "INFORMAL ECONOMY" or similar
+         if not details['membership_category']:
+             if "MEMBERSHIP" in line_upper and "CATEGORY" in line_upper:
+                 # The category is likely on the next line
+                 if i + 1 < len(cleaned_lines):
+                     next_line = cleaned_lines[i+1].strip()
+                     if next_line:
+                         details['membership_category'] = next_line
+             elif "INFORMAL ECONOMY" in line_upper or ("INFORMAL" in line_upper and "ECONOMY" in line_upper):
+                 details['membership_category'] = "INFORMAL ECONOMY"
+             elif any(cat in line_upper for cat in ['INFORMAL', 'EMPLOYED', 'SELF-EMPLOYED', 'VOLUNTARY', 'SPONSORED', 'DEPENDENT']):
+                 # Check it's a membership category (not part of an address or name)
+                 if not any(label in line_upper for label in ['AVENUE', 'STREET', 'RIZAL', 'CAINTA', 'FERNANDEZ']):
+                     details['membership_category'] = line_stripped
+
+     # Format extracted fields
+     if details['full_name']:
+         details['full_name'] = format_name(details['full_name'])
+     if details['birth_date']:
+         details['birth_date'] = format_date(details['birth_date'])
+
+     if details['id_number'] or details['full_name']:
+         details['success'] = True
+
+     return details
+
+ def extract_ocr_lines(image_path):
+     # Check that the file exists
+     if not os.path.exists(image_path):
+         return {'success': False, 'error': 'File not found'}
+
+     file_size = os.path.getsize(image_path)
+     print(f"DEBUG: Image file size: {file_size} bytes", file=sys.stderr)
+
+     with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
+         # Try a simple configuration first
+         ocr = PaddleOCR(
+             use_doc_orientation_classify=False,
+             use_doc_unwarping=False,
+             use_textline_orientation=False,
+             lang='en'
+         )
+         try:
+             results = ocr.ocr(image_path)
+         except Exception as e:
+             print(f"DEBUG: ocr() failed: {e}, trying predict()", file=sys.stderr)
+             if hasattr(ocr, 'predict'):
+                 results = ocr.predict(image_path)
+             else:
+                 results = None
+
+     # Debug: print the raw results structure
+     print(f"DEBUG: Raw OCR results type: {type(results)}", file=sys.stderr)
+
+     all_text = []
+     try:
+         # Handle both the old format (list) and the new format (OCRResult object)
+         if results and isinstance(results, list) and len(results) > 0:
+             first_item = results[0]
+             item_type_name = type(first_item).__name__
+             is_ocr_result = 'OCRResult' in item_type_name or 'ocr_result' in str(type(first_item)).lower()
+
+             if is_ocr_result:
+                 print(f"DEBUG: Detected OCRResult object format (type: {item_type_name})", file=sys.stderr)
+                 # Access the OCRResult like a dictionary
+                 try:
+                     if hasattr(first_item, 'keys'):
+                         ocr_dict = dict(first_item)
+                         # Look for the rec_texts key
+                         if 'rec_texts' in ocr_dict:
+                             rec_texts = ocr_dict['rec_texts']
+                             if isinstance(rec_texts, list):
+                                 all_text = [str(t) for t in rec_texts if t]
+                                 print(f"DEBUG: Extracted {len(all_text)} text lines from rec_texts", file=sys.stderr)
+                 except Exception as e:
+                     print(f"DEBUG: Error accessing OCRResult: {e}", file=sys.stderr)
+             else:
+                 # Old format - list of lists
+                 lines = results[0] if results and isinstance(results[0], list) else results
+                 for item in lines:
+                     if isinstance(item, (list, tuple)) and len(item) >= 2:
+                         meta = item[1]
+                         if isinstance(meta, (list, tuple)) and len(meta) >= 1:
+                             all_text.append(str(meta[0]))
+     except Exception as e:
+         print(f"DEBUG: Error processing OCR results: {str(e)}", file=sys.stderr)
+         import traceback
+         print(f"DEBUG: Traceback: {traceback.format_exc()}", file=sys.stderr)
+
+     print(f"DEBUG: Extracted text lines: {all_text}", file=sys.stderr)
+
+     return extract_phic_details(all_text) if all_text else {
+         'id_number': None,
+         'full_name': None,
+         'birth_date': None,
+         'sex': None,
+         'address': None,
+         'membership_category': None,
+         'success': False
+     }
+
+ # Main Execution
+ if len(sys.argv) < 2:
+     sys.stdout = original_stdout
+     print(json.dumps({"success": False, "error": "No image URL provided"}))
+     sys.exit(1)
+
+ image_url = sys.argv[1]
+ print(f"DEBUG: Processing PhilHealth ID image URL: {image_url}", file=sys.stderr)
+
+ try:
+     image_path = download_image(image_url, 'temp_phic_image.jpg')
+     print(f"DEBUG: Image downloaded to: {image_path}", file=sys.stderr)
+
+     ocr_results = extract_ocr_lines(image_path)
+     print(f"DEBUG: OCR results: {ocr_results}", file=sys.stderr)
+
+     # Clean up
+     if os.path.exists(image_path):
+         os.remove(image_path)
+
+     response = {
+         "success": ocr_results['success'],
+         "data": ocr_results
+     }
+
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps(response))
+     sys.stdout.flush()
+
+ except Exception as e:
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"success": False, "error": str(e)}))
+     sys.stdout.flush()
+     sys.exit(1)
+ finally:
+     try:
+         if os.path.exists('temp_phic_image.jpg'):
+             os.remove('temp_phic_image.jpg')
+     except OSError:
+         pass
+
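The PHIC date handling above hinges on three normalizations: whole-word fixes for OCR-garbled month names, repairing a period misread for a comma, and parsing the month via `strptime`. A standalone sketch of that idea (`normalize_date` is illustrative, not the script's exact `format_date`):

```python
import re
from datetime import datetime

def normalize_date(date_str):
    # Repair a common OCR misread ("VULY" for "JULY"); \b keeps correct text intact
    date_str = re.sub(r'\bVULY\b', 'JULY', date_str.strip())
    # Turn a period misread for a comma into one: "10.2003" -> "10, 2003"
    date_str = re.sub(r'(\d{1,2})\.(\d{4})', r'\1, \2', date_str)
    # Expect "MONTH DD, YYYY" (comma optional)
    m = re.match(r'([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})', date_str)
    if not m:
        return date_str
    month_str, day, year = m.groups()
    try:
        # strptime month matching is case-insensitive, so 'JUL' parses fine
        month = datetime.strptime(month_str[:3], '%b').month
    except ValueError:
        month = datetime.strptime(month_str, '%B').month
    return f"{year}-{month:02d}-{int(day):02d}"

print(normalize_date('VULY 10.2003'))   # OCR-garbled PhilHealth date
print(normalize_date('July 10, 2003'))  # clean input, same ISO result
```

Doing the token repair with word boundaries matters: a plain `str.replace` chain corrupts months that are already spelled correctly (e.g. `'Marc' -> 'March'` turns `'March'` into `'Marchh'`).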
extract_police_ocr.py ADDED
@@ -0,0 +1,812 @@
+ import sys, json, os, glob, requests
+ import re
+ import time
+ from contextlib import redirect_stdout, redirect_stderr
+
+ # Immediately redirect all output to stderr except for our final JSON
+ original_stdout = sys.stdout
+ sys.stdout = sys.stderr
+
+ # Suppress all PaddleOCR output
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+
+ # Import PaddleOCR after setting environment variables
+ from paddleocr import PaddleOCR
+
+ def download_image(url, output_path='temp_police_image.jpg'):
+     # Remove any existing temp file
+     if os.path.exists(output_path):
+         os.remove(output_path)
+
+     # Add cache-busting parameters
+     timestamp = int(time.time())
+     if '?' in url:
+         url += f'&t={timestamp}'
+     else:
+         url += f'?t={timestamp}'
+
+     # Add headers to prevent caching
+     headers = {
+         'Cache-Control': 'no-cache, no-store, must-revalidate',
+         'Pragma': 'no-cache',
+         'Expires': '0'
+     }
+
+     response = requests.get(url, headers=headers)
+     response.raise_for_status()
+     image_data = response.content
+
+     # Save the image
+     with open(output_path, 'wb') as f:
+         f.write(image_data)
+
+     return output_path
+
+ def format_name(name):
+     """Format name: add proper spacing and commas (generic for all police clearances).
+
+     Handles common OCR issues like missing spaces between name parts and
+     missing comma spacing. Works with any name format, not specific to one document.
+     """
+     if not name:
+         return None
+
+     # Remove extra spaces and normalize
+     name = ' '.join(name.split())
+
+     # First, ensure comma spacing: "JAVA,ALBERT" -> "JAVA, ALBERT"
+     name = re.sub(r',([A-Z])', r', \1', name)
+     name = re.sub(r',\s*([A-Z])', r', \1', name)
+
+     # Split by comma if present
+     if ',' in name:
+         parts = name.split(',')
+         formatted_parts = []
+         for part in parts:
+             part = part.strip()
+             # Split where a run of capitals is followed by a capital + lowercase
+             part = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', part)
+             # Handle remaining run-together capitalized words
+             part = re.sub(r'([A-Z][a-z]+)([A-Z][a-z]+)', r'\1 \2', part)
+             formatted_parts.append(part)
+         name = ', '.join(formatted_parts)
+     else:
+         # No comma: try to add spaces between name parts
+         # Add space before capital letters that follow lowercase
+         name = re.sub(r'([a-z])([A-Z])', r'\1 \2', name)
+         # Split where multiple capitals are followed by a capital + lowercase,
+         # so run-together parts split but "BAUTISTA" stays together
+         name = re.sub(r'([A-Z]{2,})([A-Z][a-z])', r'\1 \2', name)
+         # Also handle a single capital run followed by a capitalized word
+         name = re.sub(r'([A-Z]+)([A-Z][a-z]+)', r'\1 \2', name)
+
+     # Clean up multiple spaces
+     name = ' '.join(name.split())
+
+     return name.strip()
+
+ def format_address(address):
+     """Format address: add proper spacing (generic for all police clearances)."""
+     if not address:
+         return None
+
+     # Remove extra spaces
+     address = ' '.join(address.split())
+
+     # Handle #BLK/#BLOCK pattern: ensure a space between letters and numbers
+     # "#BLK11" -> "#BLK 11", "#BLOCK5" -> "#BLOCK 5"
+     address = re.sub(r'#([A-Z]+)(\d+)', r'#\1 \2', address)
+
+     # Add a space where a capital runs into a capitalized word
+     address = re.sub(r'([A-Z])([A-Z][a-z]+)', r'\1 \2', address)
+
+     # Ensure comma spacing: "CITY,NUEVA" -> "CITY, NUEVA"
+     address = re.sub(r',([A-Z])', r', \1', address)
+     address = re.sub(r',\s*([A-Z])', r', \1', address)
+
+     # Clean up multiple spaces
+     address = ' '.join(address.split())
+
+     return address.strip()
+
+ def format_birth_place(place):
+     """Format birth place: add proper spacing (generic for all police clearances)."""
+     if not place:
+         return None
+
+     # Remove extra spaces
+     place = ' '.join(place.split())
+
+     # Ensure comma spacing: "DILASAG,AURORA" -> "DILASAG, AURORA"
+     place = re.sub(r',([A-Z])', r', \1', place)
+     place = re.sub(r',\s*([A-Z])', r', \1', place)
+
+     # Add a space before province/region names that were run together
+     place = re.sub(r'([A-Z])([A-Z][a-z]+)', r'\1 \2', place)
+
+     # Clean up multiple spaces
+     place = ' '.join(place.split())
+
+     return place.strip()
+
+ def format_birth_date(date):
+     """Format birth date: fix common OCR errors (generic for all police clearances)."""
+     if not date:
+         return None
+
+     # Fix common OCR misreads of month names. Word-boundary matching keeps
+     # correct spellings intact ("June" is not turned into "Junee")
+     month_fixes = [
+         ('Juy', 'July'), ('Januay', 'January'), ('Februay', 'February'),
+         ('Marc', 'March'), ('Apil', 'April'), ('Jun', 'June'),
+         ('Augu', 'August'), ('Septemb', 'September'), ('Octob', 'October'),
+         ('Novemb', 'November'), ('Decemb', 'December'),
+     ]
+     for bad, good in month_fixes:
+         date = re.sub(rf'\b{bad}\b', good, date)
+
+     # Fix year errors: "July 1905, 1991" -> "July 05, 1991" where "1905"
+     # is the day misread with a leading "19"
+     match = re.search(r'(\w+)\s+19(\d{2}),\s*(\d{4})', date)
+     if match:
+         day = match.group(2)
+         # If the trailing two digits are 00-31, treat them as the day, not a year
+         if 0 <= int(day) <= 31:
+             date = re.sub(r'(\w+)\s+19(\d{2}),\s*(\d{4})', rf'\1 {day}, \3', date)
+
+     # Ensure proper date format: "July 05, 1991"
+     date = re.sub(r'(\w+)\s+(\d{1,2})\s*,\s*(\d{4})', r'\1 \2, \3', date)
+
+     # Clean up multiple spaces
+     date = ' '.join(date.split())
+
+     return date.strip()
+
+ def extract_police_details(lines):
+     details = {
+         'id_number': None,
+         'full_name': None,
+         'address': None,
+         'birth_date': None,
+         'birth_place': None,
+         'citizenship': None,
+         'gender': None,
+         'status': None,
+         'success': False
+     }
+
+     for i, line in enumerate(lines):
+         if not isinstance(line, str):
+             continue
+
+         line_upper = line.upper().strip()
+         line_stripped = line.strip()
+
+         # Extract Name - handle cases where NAME and its value are on separate lines
+         if "NAME" in line_upper and not details['full_name']:
+             if ":" in line:
+                 parts = line.split(':', 1)
+                 if len(parts) > 1:
+                     name_part = parts[1].strip()
+                     if name_part and len(name_part) > 2:
+                         details['full_name'] = name_part
+             elif i + 1 < len(lines):
+                 # Check the next few lines for the name value
+                 for j in range(1, min(3, len(lines) - i)):
+                     next_line = lines[i+j].strip()
+                     if next_line.startswith(':') and len(next_line) > 1:
+                         name_part = next_line[1:].strip()
+                         if name_part and len(name_part) > 2 and "ADDRESS" not in name_part.upper():
+                             details['full_name'] = name_part
+                             break
+                     elif not next_line.startswith(('ADDRESS', 'BIRTH', 'CITIZEN', 'GENDER', 'ID')) and len(next_line) > 2:
+                         if ":" not in next_line or (":" in next_line and next_line.index(':') < 3):
+                             name_part = next_line.replace(':', '').strip()
+                             if name_part and len(name_part) > 2:
+                                 details['full_name'] = name_part
+                                 break
+
+         # Also check for name patterns that start with a colon (OCR sometimes splits the NAME label)
+         if not details['full_name'] and line_stripped.startswith(':') and len(line_stripped) > 5:
+             name_candidate = line_stripped[1:].strip()
+             # Check if it looks like a name (commas, multiple words, etc.)
+             if ',' in name_candidate or (len(name_candidate.split()) >= 2 and name_candidate.isupper()):
+                 # Make sure the previous line wasn't ADDRESS or another label
+                 if i > 0:
+                     prev_line = lines[i-1].strip().upper()
+                     if "ADDRESS" not in prev_line and "BIRTH" not in prev_line:
+                         details['full_name'] = name_candidate
231
+
232
+ # Extract Address
233
+ if "ADDRESS" in line_upper and not details['address']:
234
+ if ":" in line:
235
+ parts = line.split(':')
236
+ if len(parts) > 1:
237
+ addr_part = parts[1].strip()
238
+ if addr_part:
239
+ details['address'] = addr_part
240
+ elif i + 1 < len(lines):
241
+ # Check next few lines for address value
242
+ addr_parts = []
243
+ for j in range(1, min(4, len(lines) - i)):
244
+ next_line = lines[i+j].strip()
245
+ if next_line.startswith(':') and len(next_line) > 1:
246
+ addr_parts.append(next_line[1:].strip())
247
+ elif "BIRTH" not in next_line.upper() and "CITIZEN" not in next_line.upper():
248
+ if ":" in next_line:
249
+ parts = next_line.split(':', 1)
250
+ if len(parts) > 1:
251
+ addr_parts.append(parts[1].strip())
252
+ elif len(next_line) > 2:
253
+ addr_parts.append(next_line)
254
+ else:
255
+ break
256
+ if addr_parts:
257
+ details['address'] = ' '.join(addr_parts).strip()
258
+
259
+ # Extract Birth Date - handle OCR errors and combined patterns
260
+ if ("BIRTH DATE" in line_upper or "BIRTHDATE" in line_upper) and not details['birth_date']:
261
+ if ":" in line:
262
+ parts = line.split(':', 1)
263
+ if len(parts) > 1:
264
+ date_part = parts[1].strip()
265
+ # Fix common OCR errors
+ date_part = date_part.replace('Juy', 'July')
+ # Fix year errors (1001 -> 1991, etc.)
+ date_part = re.sub(r'\b1001\b', '1991', date_part)
+ # Expand only a trailing two-digit year (", 91" -> ", 1991"), so that
+ # day numbers such as "05" are left untouched
+ date_part = re.sub(r',\s*(\d{2})$', lambda m: ', 19' + m.group(1) if int(m.group(1)) > 50 else ', 20' + m.group(1), date_part)
270
+ if date_part:
271
+ details['birth_date'] = date_part
272
+ elif i + 1 < len(lines):
273
+ next_line = lines[i+1].strip()
274
+ if ":" in next_line:
275
+ parts = next_line.split(':', 1)
276
+ if len(parts) > 1:
277
+ date_part = parts[1].strip()
278
+ date_part = date_part.replace('Juy', 'July')
279
+ date_part = re.sub(r'\b1001\b', '1991', date_part)
280
+ if date_part:
281
+ details['birth_date'] = date_part
282
+
283
+ # Also look for date patterns in lines that might have been OCR'd incorrectly
284
+ if not details['birth_date']:
285
+ # Look for patterns like "Juy 05, 1001" or "July 03, 1991" - match
+ # case-insensitively against the original line, since the uppercased
+ # copy can never match capitalized month names
+ date_pattern = re.search(r'(January|February|March|April|May|June|July|August|September|October|November|December|Juy|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}[,\s]+\d{4}', line_stripped, re.IGNORECASE)
287
+ if date_pattern:
288
+ date_part = date_pattern.group()
289
+ date_part = date_part.replace('Juy', 'July')
290
+ date_part = re.sub(r'\b1001\b', '1991', date_part)
291
+ details['birth_date'] = date_part
292
+
293
+ # Extract Birth Place
294
+ if "BIRTH PLACE" in line_upper and not details['birth_place']:
295
+ if ":" in line:
296
+ parts = line.split(':', 1)
297
+ if len(parts) > 1:
298
+ details['birth_place'] = parts[1].strip()
299
+ elif i + 1 < len(lines):
300
+ next_line = lines[i+1].strip()
301
+ if next_line.startswith(':') and len(next_line) > 1:
302
+ details['birth_place'] = next_line[1:].strip()
303
+ elif ":" in next_line and "CITIZEN" not in next_line.upper():
304
+ parts = next_line.split(':', 1)
305
+ if len(parts) > 1:
306
+ details['birth_place'] = parts[1].strip()
307
+
308
+ # Extract Citizenship
309
+ if "CITIZENSHIP" in line_upper and not details['citizenship']:
310
+ if ":" in line:
311
+ parts = line.split(':', 1)
312
+ if len(parts) > 1:
313
+ details['citizenship'] = parts[1].strip()
314
+ elif i + 1 < len(lines):
315
+ next_line = lines[i+1].strip()
316
+ if next_line.startswith(':') and len(next_line) > 1:
317
+ details['citizenship'] = next_line[1:].strip()
318
+ elif ":" in next_line:
319
+ parts = next_line.split(':', 1)
320
+ if len(parts) > 1:
321
+ details['citizenship'] = parts[1].strip()
322
+
323
+ # Extract Gender - handle cases where GENDER and value are on separate lines
324
+ if "GENDER" in line_upper and not details['gender']:
325
+ if ":" in line:
326
+ parts = line.split(':', 1)
327
+ if len(parts) > 1:
328
+ details['gender'] = parts[1].strip()
329
+ elif i + 1 < len(lines):
330
+ next_line = lines[i+1].strip()
331
+ if next_line.startswith(':') and len(next_line) > 1:
332
+ gender_part = next_line[1:].strip()
333
+ if gender_part in ['MALE', 'FEMALE', 'M', 'F']:
334
+ details['gender'] = gender_part
335
+ elif ":" in next_line:
336
+ parts = next_line.split(':', 1)
337
+ if len(parts) > 1:
338
+ gender_part = parts[1].strip()
339
+ if gender_part in ['MALE', 'FEMALE', 'M', 'F']:
340
+ details['gender'] = gender_part
341
+
342
+ # Extract ID Number (Usually "ID No.:" or near QR code)
343
+ if "ID NO" in line_upper or "ID NO." in line_upper:
344
+ parts = line.split(':')
345
+ if len(parts) > 1:
346
+ details['id_number'] = parts[1].strip()
347
+
348
+ # Fallback ID extraction looking for specific patterns if not found by label
349
+ if not details['id_number']:
350
+ # Look for pattern like TRARH + digits
351
+ id_match = re.search(r'\b[A-Z]{4,5}\d{10,15}\b', line_upper)
352
+ if id_match:
353
+ details['id_number'] = id_match.group()
354
+
355
+ # Extract Status (e.g., "NO RECORD ON FILE")
356
+ if "NO RECORD ON FILE" in line_upper:
357
+ details['status'] = "NO RECORD ON FILE"
358
+ elif "HAS A RECORD" in line_upper or "WITH RECORD" in line_upper:
359
+ details['status'] = "HAS RECORD"
360
+
361
+ if details['full_name'] or details['id_number']:
362
+ details['success'] = True
363
+
364
+ # Format the extracted fields
365
+ if details['full_name']:
366
+ details['full_name'] = format_name(details['full_name'])
367
+ if details['address']:
368
+ details['address'] = format_address(details['address'])
369
+ if details['birth_place']:
370
+ details['birth_place'] = format_birth_place(details['birth_place'])
371
+ if details['birth_date']:
372
+ details['birth_date'] = format_birth_date(details['birth_date'])
373
+
374
+ return details
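The extraction loop above repeatedly applies one pattern: the value either follows the label on the same line after a colon, or arrives on the next line prefixed with ':'. A condensed sketch of that pattern (`value_for_label` is a hypothetical helper, not part of the script):

```python
def value_for_label(lines, label):
    # Value on the same line ("NAME: JUAN") or on the next line (": JUAN")
    for i, line in enumerate(lines):
        if label in line.upper():
            if ':' in line and line.split(':', 1)[1].strip():
                return line.split(':', 1)[1].strip()
            if i + 1 < len(lines) and lines[i + 1].strip().startswith(':'):
                return lines[i + 1].strip()[1:].strip()
    return None
```

The real function layers additional guards on top of this (skipping other labels, length checks), but the two-branch lookup is the core.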
375
+
376
+ def extract_ocr_lines(image_path):
377
+ # Check if file exists
378
+ if not os.path.exists(image_path):
379
+ return {'success': False, 'error': 'File not found'}
380
+
381
+ file_size = os.path.getsize(image_path)
382
+ print(f"DEBUG: Image file size: {file_size} bytes", file=sys.stderr)
383
+
384
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
385
+ # Try simple configuration first (matching NBI script primary method)
386
+ ocr = PaddleOCR(
387
+ use_doc_orientation_classify=False,
388
+ use_doc_unwarping=False,
389
+ use_textline_orientation=False,
390
+ lang='en'
391
+ )
392
+ try:
393
+ results = ocr.ocr(image_path)
394
+ except Exception as e:
395
+ print(f"DEBUG: ocr() failed: {e}, trying predict()", file=sys.stderr)
396
+ if hasattr(ocr, 'predict'):
397
+ results = ocr.predict(image_path)
398
+ else:
399
+ results = None
400
+
401
+ # Debug: Print raw results structure
402
+ print(f"DEBUG: Raw OCR results type: {type(results)}", file=sys.stderr)
403
+ print(f"DEBUG: Raw OCR results is None: {results is None}", file=sys.stderr)
404
+ if results is not None:
405
+ print(f"DEBUG: Raw OCR results length: {len(results) if isinstance(results, list) else 'N/A'}", file=sys.stderr)
406
+ if isinstance(results, list) and len(results) > 0:
407
+ print(f"DEBUG: First level item type: {type(results[0])}", file=sys.stderr)
408
+ print(f"DEBUG: First level item: {str(results[0])[:200] if results[0] else 'None'}", file=sys.stderr)
409
+ if isinstance(results[0], list) and len(results[0]) > 0:
410
+ print(f"DEBUG: Second level first item: {str(results[0][0])[:200] if results[0][0] else 'None'}", file=sys.stderr)
411
+
412
+ # Process OCR results - handle both old format (list) and new format (OCRResult object)
413
+ all_text = []
414
+ try:
415
+ # Check if results contain OCRResult objects (new PaddleX format)
416
+ if results and isinstance(results, list) and len(results) > 0:
417
+ first_item = results[0]
418
+ # Check if it's an OCRResult object by type name
419
+ item_type_name = type(first_item).__name__
420
+ is_ocr_result = 'OCRResult' in item_type_name or 'ocr_result' in str(type(first_item)).lower()
421
+
422
+ if is_ocr_result:
423
+ print(f"DEBUG: Detected OCRResult object format (type: {item_type_name})", file=sys.stderr)
424
+ # Inspect attributes
425
+ attrs = dir(first_item)
426
+ print(f"DEBUG: OCRResult attributes: {[a for a in attrs if not a.startswith('_')]}", file=sys.stderr)
427
+
428
+ for ocr_result in results:
429
+ # Try various possible attribute names for text
430
+ text_found = False
431
+
432
+ # First, try accessing as dictionary (OCRResult is dict-like)
433
+ try:
434
+ if hasattr(ocr_result, 'keys'):
435
+ ocr_dict = dict(ocr_result)
436
+ print(f"DEBUG: OCRResult as dict keys: {list(ocr_dict.keys())}", file=sys.stderr)
437
+
438
+ # Look for common OCR result keys (rec_texts is the actual key in PaddleX OCRResult)
439
+ for key in ['rec_texts', 'rec_text', 'dt_polys', 'ocr_text', 'text', 'texts', 'result', 'results', 'ocr_result', 'dt_boxes']:
440
+ if key in ocr_dict:
441
+ val = ocr_dict[key]
442
+ print(f"DEBUG: Found key '{key}': {type(val)}, length: {len(val) if isinstance(val, list) else 'N/A'}", file=sys.stderr)
443
+ if isinstance(val, list):
444
+ # rec_texts is a list of strings directly
445
+ if key == 'rec_texts':
446
+ for text_item in val:
447
+ if isinstance(text_item, str) and text_item.strip():
448
+ all_text.append(text_item.strip())
449
+ elif text_item:
450
+ all_text.append(str(text_item))
451
+ if val:
452
+ text_found = True
453
+ else:
454
+ # For other keys, try to extract text from nested structures
455
+ for item in val:
456
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
457
+ # Format: [[coords], (text, confidence)]
458
+ text_part = item[1]
459
+ if isinstance(text_part, (list, tuple)) and len(text_part) >= 1:
460
+ all_text.append(str(text_part[0]))
461
+ elif isinstance(item, str):
462
+ all_text.append(item)
463
+ if val:
464
+ text_found = True
465
+ elif isinstance(val, str) and val:
466
+ all_text.append(val)
467
+ text_found = True
468
+ if text_found:
469
+ break
470
+ except Exception as e:
471
+ print(f"DEBUG: Error accessing OCRResult as dict: {e}", file=sys.stderr)
472
+
473
+ # Try json() method
474
+ if not text_found:
475
+ try:
476
+ if hasattr(ocr_result, 'json'):
477
+ json_data = ocr_result.json()
478
+ print(f"DEBUG: OCRResult.json() type: {type(json_data)}", file=sys.stderr)
479
+ if isinstance(json_data, dict):
480
+ print(f"DEBUG: OCRResult.json() keys: {list(json_data.keys())}", file=sys.stderr)
481
+ # Look for text in JSON (rec_texts is the actual key)
482
+ for key in ['rec_texts', 'rec_text', 'dt_polys', 'ocr_text', 'text', 'texts', 'result', 'results']:
483
+ if key in json_data:
484
+ val = json_data[key]
485
+ if isinstance(val, list):
486
+ # rec_texts is a list of strings directly
487
+ if key == 'rec_texts':
488
+ for text_item in val:
489
+ if isinstance(text_item, str) and text_item.strip():
490
+ all_text.append(text_item.strip())
491
+ elif text_item:
492
+ all_text.append(str(text_item))
493
+ if val:
494
+ text_found = True
495
+ else:
496
+ for item in val:
497
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
498
+ text_part = item[1]
499
+ if isinstance(text_part, (list, tuple)) and len(text_part) >= 1:
500
+ all_text.append(str(text_part[0]))
501
+ elif isinstance(item, str):
502
+ all_text.append(item)
503
+ if val:
504
+ text_found = True
505
+ elif isinstance(val, str) and val:
506
+ all_text.append(val)
507
+ text_found = True
508
+ if text_found:
509
+ break
510
+ except Exception as e:
511
+ print(f"DEBUG: Error calling json(): {e}", file=sys.stderr)
512
+
513
+ # Try rec_text attribute
514
+ if not text_found and hasattr(ocr_result, 'rec_text'):
515
+ rec_text = ocr_result.rec_text
516
+ print(f"DEBUG: Found rec_text attribute: {type(rec_text)}", file=sys.stderr)
517
+ if isinstance(rec_text, list):
518
+ all_text.extend([str(t) for t in rec_text if t])
519
+ text_found = True
520
+ elif rec_text:
521
+ all_text.append(str(rec_text))
522
+ text_found = True
523
+
524
+ # Try text attribute
525
+ if not text_found and hasattr(ocr_result, 'text'):
526
+ text = ocr_result.text
527
+ print(f"DEBUG: Found text attribute: {type(text)}", file=sys.stderr)
528
+ if isinstance(text, list):
529
+ all_text.extend([str(t) for t in text if t])
530
+ text_found = True
531
+ elif text:
532
+ all_text.append(str(text))
533
+ text_found = True
534
+
535
+ # If still no text, print full structure for debugging
536
+ if not text_found:
537
+ print(f"DEBUG: Could not find text in OCRResult, trying to inspect structure", file=sys.stderr)
538
+ try:
539
+ print(f"DEBUG: OCRResult repr: {repr(ocr_result)[:500]}", file=sys.stderr)
540
+ # Try to get all keys/items
541
+ if hasattr(ocr_result, 'keys'):
542
+ try:
543
+ all_keys = list(ocr_result.keys())
544
+ print(f"DEBUG: All OCRResult keys: {all_keys}", file=sys.stderr)
545
+ for key in all_keys:
546
+ try:
547
+ val = ocr_result[key]
548
+ print(f"DEBUG: Key '{key}' type: {type(val)}, value preview: {str(val)[:100]}", file=sys.stderr)
549
+ except:
550
+ pass
551
+ except:
552
+ pass
553
+ except Exception as e:
554
+ print(f"DEBUG: Error inspecting structure: {e}", file=sys.stderr)
555
+ else:
556
+ # Old format - list of lists
557
+ lines = results[0] if results and isinstance(results[0], list) else results
558
+ print(f"DEBUG: Processing lines (old format), count: {len(lines) if isinstance(lines, list) else 'N/A'}", file=sys.stderr)
559
+ for item in lines:
560
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
561
+ meta = item[1]
562
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
563
+ all_text.append(str(meta[0]))
564
+ except Exception as e:
565
+ print(f"DEBUG: Error processing OCR results: {str(e)}", file=sys.stderr)
566
+ import traceback
567
+ print(f"DEBUG: Traceback: {traceback.format_exc()}", file=sys.stderr)
568
+ # Try to inspect the object attributes
569
+ if results and isinstance(results, list) and len(results) > 0:
570
+ first_item = results[0]
571
+ print(f"DEBUG: First item attributes: {dir(first_item)}", file=sys.stderr)
572
+ if hasattr(first_item, '__dict__'):
573
+ print(f"DEBUG: First item dict: {first_item.__dict__}", file=sys.stderr)
574
+
575
+ print(f"DEBUG: Extracted text lines: {all_text}", file=sys.stderr)
576
+
577
+ return extract_police_details(all_text) if all_text else {'id_number': None, 'full_name': None, 'address': None, 'birth_date': None, 'birth_place': None, 'citizenship': None, 'gender': None, 'status': None, 'success': False}
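The branching above boils down to two result shapes. A compact sketch of just that dispatch, under the assumption that the new-style result is dict-like with a `rec_texts` key (a plain dict stands in for the PaddleX OCRResult here):

```python
def texts_from_result(result):
    # New PaddleX style: dict-like object whose 'rec_texts' holds the strings;
    # old PaddleOCR style: [[box, (text, confidence)], ...]
    if hasattr(result, 'keys') and 'rec_texts' in result:
        return [str(t).strip() for t in result['rec_texts'] if str(t).strip()]
    return [str(item[1][0]) for item in result
            if isinstance(item, (list, tuple)) and len(item) >= 2]
```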
578
+
579
+ def extract_ocr_lines_simple(image_path):
580
+ # Fallback method with advanced features (matching NBI script fallback)
581
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
582
+ ocr = PaddleOCR(
583
+ use_doc_orientation_classify=True,
584
+ use_doc_unwarping=True,
585
+ use_textline_orientation=True,
586
+ lang='en'
587
+ )
588
+ results = ocr.ocr(image_path)
589
+
590
+ # Debug: Print raw results structure for fallback method
591
+ print(f"DEBUG (fallback): Raw OCR results type: {type(results)}", file=sys.stderr)
592
+ print(f"DEBUG (fallback): Raw OCR results is None: {results is None}", file=sys.stderr)
593
+ if results is not None:
594
+ print(f"DEBUG (fallback): Raw OCR results length: {len(results) if isinstance(results, list) else 'N/A'}", file=sys.stderr)
595
+ if isinstance(results, list) and len(results) > 0:
596
+ print(f"DEBUG (fallback): First level item type: {type(results[0])}", file=sys.stderr)
597
+ if isinstance(results[0], list) and len(results[0]) > 0:
598
+ print(f"DEBUG (fallback): Second level first item: {str(results[0][0])[:200] if results[0][0] else 'None'}", file=sys.stderr)
599
+
600
+ all_text = []
601
+ try:
602
+ # Check if results contain OCRResult objects (new PaddleX format)
603
+ if results and isinstance(results, list) and len(results) > 0:
604
+ first_item = results[0]
605
+ # Check if it's an OCRResult object by type name
606
+ item_type_name = type(first_item).__name__
607
+ is_ocr_result = 'OCRResult' in item_type_name or 'ocr_result' in str(type(first_item)).lower()
608
+
609
+ if is_ocr_result:
610
+ print(f"DEBUG (fallback): Detected OCRResult object format (type: {item_type_name})", file=sys.stderr)
611
+ # Inspect attributes
612
+ attrs = dir(first_item)
613
+ print(f"DEBUG (fallback): OCRResult attributes: {[a for a in attrs if not a.startswith('_')]}", file=sys.stderr)
614
+
615
+ for ocr_result in results:
616
+ # Try various possible attribute names for text
617
+ text_found = False
618
+
619
+ # First, try accessing as dictionary (OCRResult is dict-like)
620
+ try:
621
+ if hasattr(ocr_result, 'keys'):
622
+ ocr_dict = dict(ocr_result)
623
+ print(f"DEBUG (fallback): OCRResult as dict keys: {list(ocr_dict.keys())}", file=sys.stderr)
624
+
625
+ # Look for common OCR result keys (rec_texts is the actual key in PaddleX OCRResult)
626
+ for key in ['rec_texts', 'rec_text', 'dt_polys', 'ocr_text', 'text', 'texts', 'result', 'results', 'ocr_result', 'dt_boxes']:
627
+ if key in ocr_dict:
628
+ val = ocr_dict[key]
629
+ print(f"DEBUG (fallback): Found key '{key}': {type(val)}, length: {len(val) if isinstance(val, list) else 'N/A'}", file=sys.stderr)
630
+ if isinstance(val, list):
631
+ # rec_texts is a list of strings directly
632
+ if key == 'rec_texts':
633
+ for text_item in val:
634
+ if isinstance(text_item, str) and text_item.strip():
635
+ all_text.append(text_item.strip())
636
+ elif text_item:
637
+ all_text.append(str(text_item))
638
+ if val:
639
+ text_found = True
640
+ else:
641
+ # For other keys, try to extract text from nested structures
642
+ for item in val:
643
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
644
+ # Format: [[coords], (text, confidence)]
645
+ text_part = item[1]
646
+ if isinstance(text_part, (list, tuple)) and len(text_part) >= 1:
647
+ all_text.append(str(text_part[0]))
648
+ elif isinstance(item, str):
649
+ all_text.append(item)
650
+ if val:
651
+ text_found = True
652
+ elif isinstance(val, str) and val:
653
+ all_text.append(val)
654
+ text_found = True
655
+ if text_found:
656
+ break
657
+ except Exception as e:
658
+ print(f"DEBUG (fallback): Error accessing OCRResult as dict: {e}", file=sys.stderr)
659
+
660
+ # Try json() method
661
+ if not text_found:
662
+ try:
663
+ if hasattr(ocr_result, 'json'):
664
+ json_data = ocr_result.json()
665
+ print(f"DEBUG (fallback): OCRResult.json() type: {type(json_data)}", file=sys.stderr)
666
+ if isinstance(json_data, dict):
667
+ print(f"DEBUG (fallback): OCRResult.json() keys: {list(json_data.keys())}", file=sys.stderr)
668
+ # Look for text in JSON (rec_texts is the actual key)
669
+ for key in ['rec_texts', 'rec_text', 'dt_polys', 'ocr_text', 'text', 'texts', 'result', 'results']:
670
+ if key in json_data:
671
+ val = json_data[key]
672
+ if isinstance(val, list):
673
+ # rec_texts is a list of strings directly
674
+ if key == 'rec_texts':
675
+ for text_item in val:
676
+ if isinstance(text_item, str) and text_item.strip():
677
+ all_text.append(text_item.strip())
678
+ elif text_item:
679
+ all_text.append(str(text_item))
680
+ if val:
681
+ text_found = True
682
+ else:
683
+ for item in val:
684
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
685
+ text_part = item[1]
686
+ if isinstance(text_part, (list, tuple)) and len(text_part) >= 1:
687
+ all_text.append(str(text_part[0]))
688
+ elif isinstance(item, str):
689
+ all_text.append(item)
690
+ if val:
691
+ text_found = True
692
+ elif isinstance(val, str) and val:
693
+ all_text.append(val)
694
+ text_found = True
695
+ if text_found:
696
+ break
697
+ except Exception as e:
698
+ print(f"DEBUG (fallback): Error calling json(): {e}", file=sys.stderr)
699
+
700
+ # Try rec_text attribute
701
+ if not text_found and hasattr(ocr_result, 'rec_text'):
702
+ rec_text = ocr_result.rec_text
703
+ print(f"DEBUG (fallback): Found rec_text attribute: {type(rec_text)}", file=sys.stderr)
704
+ if isinstance(rec_text, list):
705
+ all_text.extend([str(t) for t in rec_text if t])
706
+ text_found = True
707
+ elif rec_text:
708
+ all_text.append(str(rec_text))
709
+ text_found = True
710
+
711
+ # Try text attribute
712
+ if not text_found and hasattr(ocr_result, 'text'):
713
+ text = ocr_result.text
714
+ print(f"DEBUG (fallback): Found text attribute: {type(text)}", file=sys.stderr)
715
+ if isinstance(text, list):
716
+ all_text.extend([str(t) for t in text if t])
717
+ text_found = True
718
+ elif text:
719
+ all_text.append(str(text))
720
+ text_found = True
721
+
722
+ # If still no text, print full structure for debugging
723
+ if not text_found:
724
+ print(f"DEBUG (fallback): Could not find text in OCRResult, trying to inspect structure", file=sys.stderr)
725
+ try:
726
+ print(f"DEBUG (fallback): OCRResult repr: {repr(ocr_result)[:500]}", file=sys.stderr)
727
+ # Try to get all keys/items
728
+ if hasattr(ocr_result, 'keys'):
729
+ try:
730
+ all_keys = list(ocr_result.keys())
731
+ print(f"DEBUG (fallback): All OCRResult keys: {all_keys}", file=sys.stderr)
732
+ for key in all_keys:
733
+ try:
734
+ val = ocr_result[key]
735
+ print(f"DEBUG (fallback): Key '{key}' type: {type(val)}, value preview: {str(val)[:100]}", file=sys.stderr)
736
+ except:
737
+ pass
738
+ except:
739
+ pass
740
+ except Exception as e:
741
+ print(f"DEBUG (fallback): Error inspecting structure: {e}", file=sys.stderr)
742
+ else:
743
+ # Old format - list of lists
744
+ lines = results[0] if results and isinstance(results[0], list) else results
745
+ print(f"DEBUG (fallback): Processing lines (old format), count: {len(lines) if isinstance(lines, list) else 'N/A'}", file=sys.stderr)
746
+ for item in lines:
747
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
748
+ meta = item[1]
749
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
750
+ all_text.append(str(meta[0]))
751
+ except Exception as e:
752
+ print(f"DEBUG (fallback): Error processing OCR results: {str(e)}", file=sys.stderr)
753
+ import traceback
754
+ print(f"DEBUG (fallback): Traceback: {traceback.format_exc()}", file=sys.stderr)
755
+ # Try to inspect the object attributes
756
+ if results and isinstance(results, list) and len(results) > 0:
757
+ first_item = results[0]
758
+ print(f"DEBUG (fallback): First item attributes: {dir(first_item)}", file=sys.stderr)
759
+ if hasattr(first_item, '__dict__'):
760
+ print(f"DEBUG (fallback): First item dict: {first_item.__dict__}", file=sys.stderr)
761
+
762
+ print(f"DEBUG (fallback): Extracted text lines: {all_text}", file=sys.stderr)
763
+
764
+ return extract_police_details(all_text) if all_text else {'id_number': None, 'full_name': None, 'address': None, 'birth_date': None, 'birth_place': None, 'citizenship': None, 'gender': None, 'status': None, 'success': False}
765
+
766
+ # Main Execution
767
+ if len(sys.argv) < 2:
768
+ sys.stdout = original_stdout
769
+ print(json.dumps({"success": False, "error": "No image URL provided"}))
770
+ sys.exit(1)
771
+
772
+ image_url = sys.argv[1]
773
+ print(f"DEBUG: Processing Police Clearance image URL: {image_url}", file=sys.stderr)
774
+
775
+ try:
776
+ image_path = download_image(image_url, 'temp_police_image.jpg')
777
+ print(f"DEBUG: Image downloaded to: {image_path}", file=sys.stderr)
778
+
779
+ # Try the original OCR method first
780
+ ocr_results = extract_ocr_lines(image_path)
781
+ print(f"DEBUG: OCR results from extract_ocr_lines: {ocr_results}", file=sys.stderr)
782
+
783
+ # If original method fails, try simple method with advanced features
784
+ if not ocr_results['success']:
785
+ print("DEBUG: Original method failed, trying simple method with advanced features", file=sys.stderr)
786
+ ocr_results = extract_ocr_lines_simple(image_path)
787
+ print(f"DEBUG: OCR results from extract_ocr_lines_simple: {ocr_results}", file=sys.stderr)
788
+
789
+ # Clean up
790
+ if os.path.exists(image_path):
791
+ os.remove(image_path)
792
+
793
+ response = {
794
+ "success": ocr_results['success'],
795
+ "data": ocr_results
796
+ }
797
+
798
+ sys.stdout = original_stdout
799
+ sys.stdout.write(json.dumps(response))
800
+ sys.stdout.flush()
801
+
802
+ except Exception as e:
803
+ sys.stdout = original_stdout
804
+ sys.stdout.write(json.dumps({"success": False, "error": str(e)}))
805
+ sys.stdout.flush()
806
+ sys.exit(1)
807
+ finally:
808
+ try:
809
+ if os.path.exists('temp_police_image.jpg'):
810
+ os.remove('temp_police_image.jpg')
811
+ except OSError:
+ pass
extract_postal.py ADDED
@@ -0,0 +1,419 @@
1
+ import sys, json, os, glob, requests
2
+ import re
3
+ import time
4
+ from contextlib import redirect_stdout, redirect_stderr
5
+ from datetime import datetime
6
+
7
+ # Immediately redirect all output to stderr except for our final JSON
8
+ original_stdout = sys.stdout
9
+ sys.stdout = sys.stderr
10
+
11
+ # Suppress all PaddleOCR output
12
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
13
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
14
+ os.environ['DISPLAY'] = ':99'
15
+
16
+ # Import PaddleOCR after setting environment variables
17
+ from paddleocr import PaddleOCR
18
+
19
+ def download_image(url, output_path='temp_postal_image.jpg'):
20
+ # Remove any existing temp file
21
+ if os.path.exists(output_path):
22
+ os.remove(output_path)
23
+
24
+ # Add cache-busting parameters
25
+ timestamp = int(time.time())
26
+ if '?' in url:
27
+ url += f'&t={timestamp}'
28
+ else:
29
+ url += f'?t={timestamp}'
30
+
31
+ # Add headers to prevent caching
32
+ headers = {
33
+ 'Cache-Control': 'no-cache, no-store, must-revalidate',
34
+ 'Pragma': 'no-cache',
35
+ 'Expires': '0'
36
+ }
37
+
38
+ response = requests.get(url, headers=headers, timeout=30)
39
+ response.raise_for_status()
40
+ image_data = response.content
41
+
42
+ # Save the image
43
+ with open(output_path, 'wb') as f:
44
+ f.write(image_data)
45
+
46
+ return output_path
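`download_image` defeats caching in two ways: a timestamp query parameter plus no-cache request headers. The query-string step alone, as a sketch (`cache_busted` is an illustrative name):

```python
import time

def cache_busted(url: str) -> str:
    # Append a timestamp parameter so a stale cached copy is never served
    sep = '&' if '?' in url else '?'
    return f"{url}{sep}t={int(time.time())}"
```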
47
+
48
+ def format_date(date_str):
49
+ """Format date from various formats to YYYY-MM-DD"""
50
+ if not date_str:
51
+ return None
52
+
53
+ date_str = date_str.strip()
54
+
55
+ # Fix common OCR errors first
+ date_str = date_str.replace('Ol', '01').replace('O1', '01').replace('O0', '00').replace('OO', '00')
+ # lowercase L -> 1 only when adjacent to a digit, so month names like "Jul" survive
+ date_str = re.sub(r'(?<=\d)l|l(?=\d)', '1', date_str)
58
+
59
+ # Handle format like "14 Aug 88" or "14 Aug88" -> "1988-08-14"
+ # Allow a missing space and an over-long month name ("Augu", "Sept", ...)
+ match = re.match(r'(\d{1,2})\s*([A-Za-z]{3,9})\s*(\d{2,4})', date_str)
+ if match:
+ day, month_str, year = match.groups()
+ try:
+ # Normalize OCR misreads whose first three letters are not a valid
+ # abbreviation; valid prefixes are truncated by month_str[:3] below
+ month_str = {'Juy': 'Jul', 'Apil': 'Apr'}.get(month_str, month_str)
73
+
74
+ # Convert 2-digit year to 4-digit (assume 1900s for years > 50, 2000s for <= 50)
75
+ if len(year) == 2:
76
+ year_int = int(year)
77
+ year = f"19{year}" if year_int > 50 else f"20{year}"
78
+
79
+ # Parse month abbreviation (use first 3 chars)
80
+ month = datetime.strptime(month_str[:3], '%b').month
81
+ return f"{year}-{month:02d}-{int(day):02d}"
82
+ except Exception as e:
83
+ print(f"DEBUG: Date parsing error: {e}", file=sys.stderr)
84
+ pass
85
+
86
+ # Try other common formats
87
+ for fmt in ["%d %b %Y", "%d %B %Y", "%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%d%b%Y", "%d%B%Y"]:
88
+ try:
89
+ dt = datetime.strptime(date_str, fmt)
90
+ return dt.strftime("%Y-%m-%d")
91
+ except Exception:
92
+ continue
93
+
94
+ return date_str
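The two-digit-year rule above pivots at 50: larger values map to the 1900s, the rest to the 2000s. In isolation (`expand_year` is an illustrative name):

```python
def expand_year(two_digit: str) -> str:
    # "88" -> "1988" (> 50 => 1900s), "14" -> "2014" (<= 50 => 2000s)
    return ('19' if int(two_digit) > 50 else '20') + two_digit
```

Any pivot is a heuristic; 50 simply matches the likely birth-year range on these IDs.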
95
+
96
+ def format_name(name):
97
+ """Format name: capitalize properly"""
98
+ if not name:
99
+ return None
100
+
101
+ # Remove extra spaces and normalize
102
+ name = ' '.join(name.split())
103
+
104
+ # Capitalize each word properly
105
+ name = ' '.join([word.capitalize() for word in name.split()])
106
+
107
+ return name.strip()
108
+
109
+ def format_address(address_lines):
110
+ """Format address from multiple lines"""
111
+ if not address_lines:
112
+ return None
113
+
114
+ # Join address lines and clean up
115
+ address = ' '.join([line.strip() for line in address_lines if line.strip()])
116
+
117
+ # Fix missing spaces: "585Gen." -> "585 Gen."
118
+ address = re.sub(r'(\d+)([A-Z])', r'\1 \2', address)
119
+
120
+ # Fix missing space after a period: "Brgy.Rivera" -> "Brgy. Rivera"
+ address = re.sub(r'\.(?=[A-Z])', '. ', address)
+ # Split glued lower/upper boundaries: "RiveraPasay" -> "Rivera Pasay"
+ address = re.sub(r'([a-z])([A-Z])', r'\1 \2', address)
122
+
123
+ # Remove extra spaces
124
+ address = ' '.join(address.split())
125
+
126
+ return address.strip()
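The digit-to-letter spacing fix above can be demonstrated on its own (`space_digit_letter` is a hypothetical name for it):

```python
import re

def space_digit_letter(s: str) -> str:
    # Insert the space OCR dropped between a number and a capitalized word
    return re.sub(r'(\d+)([A-Z])', r'\1 \2', s)
```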
127
+
128
+ def extract_postal_details(lines):
129
+ details = {
130
+ 'prn': None,
131
+ 'full_name': None,
132
+ 'address': None,
133
+ 'birth_date': None,
134
+ 'nationality': None,
135
+ 'issuing_post_office': None,
136
+ 'valid_until': None,
137
+ 'success': False
138
+ }
139
+
140
+ # Clean lines - convert to strings and strip
141
+ cleaned_lines = [str(line).strip() for line in lines if str(line).strip()]
142
+
143
+ for i, line in enumerate(cleaned_lines):
144
+ line_upper = line.upper().strip()
145
+ line_stripped = line.strip()
146
+
147
+ # Extract PRN (Postal Registration Number)
148
+ # Format: "PRN 100141234567 P POSTAL" or "PRN100141234567P" or "PAN100141234567P" (OCR might misread PRN as PAN)
149
+ if not details['prn']:
150
+ # Look for PRN followed by digits (may have P POSTAL after)
151
+ prn_match = re.search(r'PRN\s*(\d{10,15})', line_upper)
152
+ if prn_match:
153
+ details['prn'] = prn_match.group(1)
154
+ # Also check for PAN (common OCR error where PRN is misread as PAN)
155
+ elif re.search(r'PAN\s*(\d{10,15})', line_upper):
156
+ pan_match = re.search(r'PAN\s*(\d{10,15})', line_upper)
157
+ if pan_match:
158
+ details['prn'] = pan_match.group(1)
159
+
160
+ # Extract Full Name - combine separate name parts
161
+ # Look for label "First Name Middle Name Surname, Suffix" or name parts
162
+ if not details['full_name']:
163
+ # Check if this line is the label
164
+ if ("FIRST NAME" in line_upper or "FINT NAME" in line_upper) and ("SURNAME" in line_upper or "SUMAME" in line_upper):
165
+ # Collect name parts from next few lines
166
+ name_parts = []
167
+ for j in range(1, min(5, len(cleaned_lines) - i)):
168
+ next_line = cleaned_lines[i+j].strip()
169
+ next_upper = next_line.upper()
170
+ # Stop if we hit address or other labels
171
+ if any(label in next_upper for label in ['ADDRESS', 'DATE', 'BIRTH', 'NATIONALITY', 'ISSUING', 'VALID', 'GEN', 'TUAZON', 'BLVD', 'BRGY', '585', 'PASAY']):
172
+ break
173
+ # Add if it looks like a name part (all caps, letters and spaces only, not too short)
174
+ if next_line and re.match(r'^[A-Z\s,]+$', next_line) and len(next_line) > 1:
175
+ # Skip if it's clearly not a name (like "ID", "C", etc.)
176
+ if next_line not in ['ID', 'C', 'P', 'POSTAL']:
177
+ name_parts.append(next_line)
178
+ if name_parts:
179
+ details['full_name'] = ' '.join(name_parts)
180
+ # Also check if line is a name part (all caps, not a label)
181
+ elif re.match(r'^[A-Z\s,]+$', line_stripped) and len(line_stripped) > 2:
182
+ # Make sure it's not a label or common words
183
+ if not any(label in line_upper for label in ['FIRST NAME', 'MIDDLE NAME', 'SURNAME', 'ADDRESS', 'DATE', 'BIRTH', 'NATIONALITY', 'ISSUING', 'VALID', 'POSTAL', 'IDENTITY', 'CARD', 'PHCPOST', 'PHILIPPINE', 'PREMIUM']):
184
+ # Check if previous line is the name label
185
+ if i > 0:
186
+ prev_line = cleaned_lines[i-1].strip().upper()
187
+ if "FIRST NAME" in prev_line or "FINT NAME" in prev_line or "SUMAME" in prev_line or "SURNAME" in prev_line:
188
+ # Collect consecutive name parts
189
+ name_parts = [line_stripped]
190
+ for j in range(1, min(4, len(cleaned_lines) - i)):
191
+ next_line = cleaned_lines[i+j].strip()
192
+ if (next_line and re.match(r'^[A-Z\s,]+$', next_line) and
193
+ len(next_line) > 2 and
194
+ not any(label in next_line.upper() for label in ['ADDRESS', 'DATE', 'BIRTH', 'GEN', 'TUAZON', 'BLVD', 'BRGY', '585', 'PASAY', 'ID', 'POSTAL', 'PREMIUM'])):
195
+ name_parts.append(next_line)
196
+ else:
197
+ break
198
+ if len(name_parts) >= 2:
199
+ details['full_name'] = ' '.join(name_parts)
200
+ elif len(name_parts) == 1 and len(name_parts[0].split()) >= 2:
201
+ details['full_name'] = name_parts[0]
202
+
203
+ # Extract Address - look for address parts (street numbers, Gen., Blvd., Brgy., City)
204
+ if not details['address']:
205
+ # Look for address indicators
206
+ if any(indicator in line_upper for indicator in ['GEN', 'TUAZON', 'BLVD', 'BRGY', 'PASAY', 'CITY']) or (re.match(r'^\d+', line_stripped) and len(line_stripped) > 2):
207
+ address_lines = []
208
+ # Check backwards a bit to see if we missed address start
209
+ start_idx = max(0, i - 1)
210
+
211
+ # Collect address lines forward
212
+ for j in range(0, min(7, len(cleaned_lines) - start_idx)):
213
+ idx = start_idx + j
214
+ if idx >= len(cleaned_lines):
215
+ break
216
+ addr_line = cleaned_lines[idx].strip()
217
+ addr_upper = addr_line.upper()
218
+
219
+ # Stop if we hit date, nationality, or other labels
220
+ if any(label in addr_upper for label in ['DATE', 'BIRTH', 'NATIONALITY', 'FILIPINO', 'ISSUING', 'VALID', 'PAN', 'NOCON']):
221
+ break
222
+
223
+ # Skip very short lines that are likely OCR noise (like "101", "o00")
224
+ if len(addr_line) <= 2 and not re.match(r'^\d+$', addr_line):
225
+ continue
226
+
227
+ # Add if it looks like address content
228
+ if addr_line and len(addr_line) > 1:
229
+ # Check if it's a number, street name, barangay, city, etc.
230
+ if (re.match(r'^\d+', addr_line) or
231
+ any(indicator in addr_upper for indicator in ['GEN', 'TUAZON', 'BLVD', 'BRGY', 'PASAY', 'CITY', 'STREET', 'AVE', 'BOULEVARD']) or
232
+ len(address_lines) > 0): # Continue if we've started collecting
233
+ # Skip obvious OCR errors like "o00"
234
+ if addr_line.lower() not in ['o00', 'o0', '00']:
235
+ address_lines.append(addr_line)
236
+
237
+ if address_lines:
238
+ details['address'] = format_address(address_lines)
239
+
240
+ # Extract Date of Birth - handle OCR errors
241
+ if not details['birth_date']:
242
+ # Look for date patterns: "14 Aug88" or "14 Aug 88"
243
+ date_match = re.search(r'(\d{1,2})\s*([A-Za-z]{3})\s*(\d{2,4})', line_stripped)
244
+ if date_match:
245
+ # Check if it's not the valid until date
246
+ if "VALID" not in line_upper and "UNTIL" not in line_upper:
247
+ # Fix spacing
248
+ day, month, year = date_match.groups()
249
+ details['birth_date'] = f"{day} {month} {year}"
250
+
251
+ # Extract Nationality
252
+ if not details['nationality']:
253
+ if "NATIONALITY" in line_upper or line_upper == "FILIPINO":
254
+ if line_upper == "FILIPINO":
255
+ details['nationality'] = "Filipino"
256
+ elif i + 1 < len(cleaned_lines):
257
+ next_line = cleaned_lines[i+1].strip()
258
+ if next_line and len(next_line) < 20:
259
+ details['nationality'] = next_line
260
+
261
+ # Extract Issuing Post Office - handle OCR errors like "IssungPostOmce"
262
+ if not details['issuing_post_office']:
263
+ if ("ISSUING POST OFFICE" in line_upper or "ISSUING POST" in line_upper or
264
+ "ISSUINGPOST" in line_upper or "ISSUINGPOSTOMCE" in line_upper):
265
+ if i + 1 < len(cleaned_lines):
266
+ next_line = cleaned_lines[i+1].strip()
267
+ if next_line and len(next_line) < 20:
268
+ # Fix OCR errors: MNL.QE -> MNL-QE
269
+ next_line = next_line.replace('.', '-')
270
+ details['issuing_post_office'] = next_line
271
+
272
+ # Extract Valid Until - handle OCR errors like "Vald Urt" and "OlDec17"
273
+ if not details['valid_until']:
274
+ if ("VALID UNTIL" in line_upper or "VALIDUNTIL" in line_upper or
275
+ "VALD URT" in line_upper or "VALDURT" in line_upper):
276
+ if i + 1 < len(cleaned_lines):
277
+ next_line = cleaned_lines[i+1].strip()
278
+ # Fix OCR errors: OlDec17 -> 01 Dec 17
279
+ # Replace common OCR errors
280
+ next_line = next_line.replace('Ol', '01').replace('O1', '01')
281
+ next_line = next_line.replace('O0', '00').replace('OO', '00')
282
+ # Try to extract date pattern
283
+ date_match = re.search(r'(\d{1,2})\s*([A-Za-z]{3})\s*(\d{2,4})', next_line)
284
+ if date_match:
285
+ day, month, year = date_match.groups()
286
+ details['valid_until'] = f"{day} {month} {year}"
287
+ elif next_line:
288
+ details['valid_until'] = next_line
289
+
290
+ # Format extracted fields
291
+ if details['full_name']:
292
+ details['full_name'] = format_name(details['full_name'])
293
+ if details['birth_date']:
294
+ details['birth_date'] = format_date(details['birth_date'])
295
+ if details['valid_until']:
296
+ details['valid_until'] = format_date(details['valid_until'])
297
+
298
+ if details['prn'] or details['full_name']:
299
+ details['success'] = True
300
+
301
+ return details
302
+
303
+ def extract_ocr_lines(image_path):
304
+ # Check if file exists
305
+ if not os.path.exists(image_path):
306
+ return {'success': False, 'error': 'File not found'}
307
+
308
+ file_size = os.path.getsize(image_path)
309
+ print(f"DEBUG: Image file size: {file_size} bytes", file=sys.stderr)
310
+
311
+ with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
312
+ # Try simple configuration first
313
+ ocr = PaddleOCR(
314
+ use_doc_orientation_classify=False,
315
+ use_doc_unwarping=False,
316
+ use_textline_orientation=False,
317
+ lang='en'
318
+ )
319
+ try:
320
+ results = ocr.ocr(image_path)
321
+ except Exception as e:
322
+ print(f"DEBUG: ocr() failed: {e}, trying predict()", file=sys.stderr)
323
+ if hasattr(ocr, 'predict'):
324
+ results = ocr.predict(image_path)
325
+ else:
326
+ results = None
327
+
328
+ # Debug: Print raw results structure
329
+ print(f"DEBUG: Raw OCR results type: {type(results)}", file=sys.stderr)
330
+
331
+ all_text = []
332
+ try:
333
+ # Handle both old format (list) and new format (OCRResult object)
334
+ if results and isinstance(results, list) and len(results) > 0:
335
+ first_item = results[0]
336
+ item_type_name = type(first_item).__name__
337
+ is_ocr_result = 'OCRResult' in item_type_name or 'ocr_result' in str(type(first_item)).lower()
338
+
339
+ if is_ocr_result:
340
+ print(f"DEBUG: Detected OCRResult object format (type: {item_type_name})", file=sys.stderr)
341
+ # Access OCRResult as dictionary
342
+ try:
343
+ if hasattr(first_item, 'keys'):
344
+ ocr_dict = dict(first_item)
345
+ # Look for rec_texts key
346
+ if 'rec_texts' in ocr_dict:
347
+ rec_texts = ocr_dict['rec_texts']
348
+ if isinstance(rec_texts, list):
349
+ all_text = [str(t) for t in rec_texts if t]
350
+ print(f"DEBUG: Extracted {len(all_text)} text lines from rec_texts", file=sys.stderr)
351
+ except Exception as e:
352
+ print(f"DEBUG: Error accessing OCRResult: {e}", file=sys.stderr)
353
+ else:
354
+ # Old format - list of lists
355
+ lines = results[0] if results and isinstance(results[0], list) else results
356
+ for item in lines:
357
+ if isinstance(item, (list, tuple)) and len(item) >= 2:
358
+ meta = item[1]
359
+ if isinstance(meta, (list, tuple)) and len(meta) >= 1:
360
+ all_text.append(str(meta[0]))
361
+ except Exception as e:
362
+ print(f"DEBUG: Error processing OCR results: {str(e)}", file=sys.stderr)
363
+ import traceback
364
+ print(f"DEBUG: Traceback: {traceback.format_exc()}", file=sys.stderr)
365
+
366
+ print(f"DEBUG: Extracted text lines: {all_text}", file=sys.stderr)
367
+
368
+ return extract_postal_details(all_text) if all_text else {
369
+ 'prn': None,
370
+ 'full_name': None,
371
+ 'address': None,
372
+ 'birth_date': None,
373
+ 'nationality': None,
374
+ 'issuing_post_office': None,
375
+ 'valid_until': None,
376
+ 'success': False
377
+ }
378
+
379
+ # Main Execution
380
+ if len(sys.argv) < 2:
381
+ sys.stdout = original_stdout
382
+ print(json.dumps({"success": False, "error": "No image URL provided"}))
383
+ sys.exit(1)
384
+
385
+ image_url = sys.argv[1]
386
+ print(f"DEBUG: Processing Postal ID image URL: {image_url}", file=sys.stderr)
387
+
388
+ try:
389
+ image_path = download_image(image_url, 'temp_postal_image.jpg')
390
+ print(f"DEBUG: Image downloaded to: {image_path}", file=sys.stderr)
391
+
392
+ ocr_results = extract_ocr_lines(image_path)
393
+ print(f"DEBUG: OCR results: {ocr_results}", file=sys.stderr)
394
+
395
+ # Clean up
396
+ if os.path.exists(image_path):
397
+ os.remove(image_path)
398
+
399
+ response = {
400
+ "success": ocr_results['success'],
401
+ "data": ocr_results
402
+ }
403
+
404
+ sys.stdout = original_stdout
405
+ sys.stdout.write(json.dumps(response))
406
+ sys.stdout.flush()
407
+
408
+ except Exception as e:
409
+ sys.stdout = original_stdout
410
+ sys.stdout.write(json.dumps({"success": False, "error": str(e)}))
411
+ sys.stdout.flush()
412
+ sys.exit(1)
413
+ finally:
414
+ try:
415
+ if os.path.exists('temp_postal_image.jpg'):
416
+ os.remove('temp_postal_image.jpg')
417
+ except:
418
+ pass
419
+
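The address cleanup above can be tried in isolation. This is a standalone sketch (the helper name `normalize_address` is hypothetical, mirroring `format_address`); note it includes an explicit period rule, since a plain lowercase-then-uppercase pattern cannot split "Brgy.Rivera" across the dot:

```python
import re

def normalize_address(lines):
    """Join OCR address fragments and repair common spacing errors."""
    address = ' '.join(part.strip() for part in lines if part.strip())
    # "585Gen." -> "585 Gen." (digit glued to a capitalized word)
    address = re.sub(r'(\d+)([A-Z])', r'\1 \2', address)
    # "Brgy.Rivera" -> "Brgy. Rivera" (missing space after an abbreviation)
    address = re.sub(r'\.([A-Z])', r'. \1', address)
    # Collapse any remaining runs of whitespace
    return ' '.join(address.split())

print(normalize_address(["585Gen.", "Tuazon Blvd,", "Brgy.Rivera", "Pasay City"]))
# -> 585 Gen. Tuazon Blvd, Brgy. Rivera Pasay City
```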
extract_prc.py ADDED
@@ -0,0 +1,377 @@
+ #!/usr/bin/env python3
+ """
+ Philippine PRC (Professional Regulation Commission) License Information Extraction Script
+ 
+ Purpose:
+     Extracts structured information from PRC license images using OCR.
+     Handles various PRC license formats, including UMID-style cards.
+ 
+ Why this script exists:
+     - PRC licenses have complex layouts with multiple information fields
+     - Profession-specific information must be extracted
+     - Handles both traditional PRC licenses and UMID-style PRC cards
+     - Required for professional verification workflows
+ 
+ Key Features:
+     - Extracts the CRN (Common Reference Number) - 12-digit format
+     - Processes registration numbers and dates
+     - Extracts profession information
+     - Handles GSIS/SSS number extraction
+     - Supports validity date tracking
+ 
+ Dependencies:
+     - PaddleOCR: High-accuracy OCR engine (https://github.com/PaddlePaddle/PaddleOCR)
+     - Pillow (PIL): Image processing (https://pillow.readthedocs.io/)
+     - requests: HTTP library (https://docs.python-requests.org/)
+ 
+ Usage:
+     python extract_prc.py "https://example.com/prc_license.jpg"
+ 
+ Output:
+     JSON with extracted information: crn, registration_number, profession, valid_until, etc.
+ """
+ 
+ import sys, json, os, re, requests
+ from PIL import Image
+ from io import BytesIO
+ from datetime import datetime
+ from contextlib import redirect_stdout, redirect_stderr
+ 
+ # Immediately redirect all output to stderr except for our final JSON
+ original_stdout = sys.stdout
+ sys.stdout = sys.stderr
+ 
+ # Suppress all PaddleOCR output
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+ 
+ # Import PaddleOCR after setting environment variables
+ from paddleocr import PaddleOCR
+ 
+ def dprint(msg, obj=None):
+     """
+     Debug print helper that safely handles object serialization.
+ 
+     Args:
+         msg (str): Debug message
+         obj (any): Object to print (optional)
+ 
+     Why this approach:
+         - Centralized debug logging
+         - Safe object serialization
+         - Consistent debug output format
+     """
+     try:
+         print(f"DEBUG: {msg}" + (f": {obj}" if obj is not None else ""), file=sys.stderr)
+     except Exception:
+         pass
+ 
+ def clean_cache():
+     cache_files = ['temp_image.jpg', 'temp_image_ocr_res_img.jpg', 'temp_image_preprocessed_img.jpg', 'temp_image_res.json']
+     for f in cache_files:
+         if os.path.exists(f):
+             os.remove(f)
+             dprint("Removed cache file", f)
+     if os.path.exists("output"):
+         import shutil
+         shutil.rmtree("output")
+         dprint("Removed output directory")
+ 
+ def download_image(url, output_path='temp_image.jpg'):
+     dprint("Starting download", url)
+     clean_cache()
+     r = requests.get(url)
+     dprint("HTTP status", r.status_code)
+     r.raise_for_status()
+     img = Image.open(BytesIO(r.content))
+     if img.mode == 'RGBA':
+         bg = Image.new('RGB', img.size, (255, 255, 255))
+         bg.paste(img, mask=img.split()[-1])
+         img = bg
+     elif img.mode != 'RGB':
+         img = img.convert('RGB')
+     img.save(output_path, 'JPEG', quality=95)
+     dprint("Saved image", output_path)
+     return output_path
+ 
+ def format_date(s):
+     if not s:
+         return None
+     raw = ' '.join(s.split())
+     t = raw.replace(' ', '').replace('\\', '/').replace('.', '/')
+     if re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', t):
+         return t.replace('/', '-')
+     # Accept mm/dd/yyyy style
+     if re.match(r'^\d{2}/\d{2}/\d{4}$', raw):
+         m, d, y = raw.split('/')
+         return f"{y}-{int(m):02d}-{int(d):02d}"
+     # Month-name variants, e.g. "January 5, 2020" or "Jan 5, 2020"
+     if re.match(r'^[A-Za-z]+\s*\d{1,2},\s*\d{4}$', raw):
+         for fmt in ("%B %d, %Y", "%b %d, %Y"):
+             try:
+                 return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
+             except ValueError:
+                 pass
+     return raw
+ 
+ def cap_words(name):
+     return None if not name else ' '.join(w.capitalize() for w in name.split())
+ 
+ def normalize_name_from_parts(last, first_block):
+     last = (last or '').strip()
+     tokens = [t for t in (first_block or '').strip().split(' ') if t]
+     given_kept = tokens[:2]  # keep up to two given names
+     composed = ' '.join(given_kept + [last]).strip()
+     return cap_words(composed) if composed else None
+ 
+ def normalize_full_name_from_three(first, middle, last):
+     # Keep the first + optional second token from the "first" block; ignore middle completely
+     tokens = [t for t in (first or '').strip().split(' ') if t]
+     given_kept = tokens[:2]
+     composed = ' '.join(given_kept + [last or '']).strip()
+     return cap_words(composed) if composed else None
+ 
+ def take_within(lines, i, k=5):
+     out = []
+     for j in range(1, k + 1):
+         if i + j < len(lines):
+             t = str(lines[i + j]).strip()
+             if t:
+                 out.append(t)
+     return out
+ 
+ def is_numeric_id(t):
+     return bool(re.match(r'^\d{5,}$', str(t).replace(' ', '')))
+ 
+ def is_crn(t):
+     # A UMID CRN is commonly 12 digits
+     return bool(re.match(r'^\d{12}$', t.replace(' ', '')))
+ 
+ def is_date(t):
+     t1 = t.replace(' ', '').replace('\\', '/').replace('.', '/')
+     return (bool(re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', t1))
+             or bool(re.match(r'^\d{2}/\d{2}/\d{4}$', t))
+             or bool(re.match(r'^[A-Za-z]+\s*\d{1,2},\s*\d{4}$', t)))
+ 
+ def extract_prc_info(lines):
+     """
+     Extract PRC license information from OCR text lines.
+ 
+     Args:
+         lines (list): List of text lines from OCR processing
+ 
+     Returns:
+         dict: Extracted PRC information with keys: crn, registration_number, profession, etc.
+ 
+     Why this approach:
+         - PRC licenses have complex layouts with multiple fields
+         - Handles both traditional PRC licenses and UMID-style PRC cards
+         - Extracts profession-specific information
+         - Uses lookahead pattern matching for field extraction
+     """
+     dprint("Lines to extract", lines)
+ 
+     # Initialize variables for extracted information
+     crn = None
+     full_name = None
+     birth_date = None
+     gsis_number = None
+     sss_number = None
+     registration_number = None
+     registration_date = None
+     valid_until = None
+     profession = None
+ 
+     # Collect name parts separately for composition
+     last_name_txt = None
+     first_name_txt = None
+ 
+     L = [str(x or '').strip() for x in lines]
+     i = 0
+     while i < len(L):
+         line = L[i]
+         low = line.lower()
+         dprint("Line", {"i": i, "text": line})
+ 
+         # Extract CRN (UMID format) - 12 digits
+         if crn is None and is_crn(line):
+             crn = line.replace(' ', '')
+             dprint("Found CRN", crn)
+ 
+         # Extract Last Name using a lookahead pattern
+         if 'last name' in low:
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 tl = t.lower()
+                 if not any(k in tl for k in ['first', 'middle', 'registration', 'valid', 'date', 'no']):
+                     last_name_txt = t
+                     break
+ 
+         # Extract First Name
+         if 'firstname' in low or 'first name' in low:
+             if i + 1 < len(L):
+                 first_name_txt = L[i + 1]
+ 
+         # Extract Date of Birth
+         if ('date of birth' in low) or ('birth' in low and 'date' in low):
+             ahead = take_within(L, i, 4)
+             for t in ahead:
+                 if is_date(t):
+                     birth_date = format_date(t)
+                     break
+ 
+         # Extract Registration Number - handles split labels
+         if low == 'registration' and i + 1 < len(L) and L[i + 1].lower() in ('no', 'no.', 'number'):
+             ahead = take_within(L, i + 1, 4)
+             for t in ahead:
+                 if is_numeric_id(t):
+                     registration_number = t.replace(' ', '')
+                     break
+ 
+         # Also handle fused label forms
+         if ('registration no' in low) or ('registration number' in low):
+             ahead = take_within(L, i, 4)
+             for t in ahead:
+                 if is_numeric_id(t):
+                     registration_number = t.replace(' ', '')
+                     break
+ 
+         # Extract Registration Date
+         if low == 'registration' and i + 1 < len(L) and L[i + 1].lower() == 'date':
+             ahead = take_within(L, i + 1, 4)
+             for t in ahead:
+                 if is_date(t):
+                     registration_date = format_date(t)
+                     break
+         if 'registration date' in low:
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 if is_date(t):
+                     registration_date = format_date(t)
+                     break
+ 
+         # Extract Valid Until Date
+         if 'valid until' in low or 'validity' in low:
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 if is_date(t):
+                     valid_until = format_date(t)
+                     break
+ 
+         # Extract Profession from bold lines
+         if any(k in low for k in ['occupational', 'technician', 'engineer', 'teacher', 'nurse']):
+             if len(line.split()) >= 2:
+                 profession = cap_words(line)
+                 dprint("Found profession", profession)
+ 
+         # Extract SSS Number
+         if sss_number is None and ('sss' in low or 'social security' in low):
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 if is_numeric_id(t):
+                     sss_number = t.replace(' ', '')
+                     dprint("Found sss_number", sss_number)
+                     break
+ 
+         # Extract GSIS Number
+         if gsis_number is None and ('gsis' in low):
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 if is_numeric_id(t):
+                     gsis_number = t.replace(' ', '')
+                     dprint("Found gsis_number", gsis_number)
+                     break
+ 
+         i += 1
+ 
+     # Compose the full name from its parts
+     if full_name is None:
+         full_name = normalize_name_from_parts(last_name_txt, first_name_txt)
+ 
+     # Return structured result
+     result = {
+         "id_type": "prc",
+         "crn": crn,
+         "id_number": registration_number or crn,  # Frontend expects id_number
+         "registration_number": registration_number,
+         "registration_date": registration_date,
+         "valid_until": valid_until,
+         "full_name": full_name,
+         "birth_date": birth_date,
+         "sss_number": sss_number,
+         "gsis_number": gsis_number,
+         "profession": profession
+     }
+     dprint("Final result", result)
+     return result
+ 
+ def extract_ocr_lines(image_path):
+     os.makedirs("output", exist_ok=True)
+     dprint("Initializing PaddleOCR")
+ 
+     with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
+         ocr = PaddleOCR(
+             use_doc_orientation_classify=False,
+             use_doc_unwarping=False,
+             use_textline_orientation=False,
+             lang='en',
+             show_log=False
+         )
+         dprint("OCR initialized")
+         dprint("Running OCR predict", image_path)
+         results = ocr.ocr(image_path, cls=False)
+         dprint("OCR predict done, results_count", len(results))
+ 
+     # Process OCR results directly
+     all_text = []
+     try:
+         lines = results[0] if results and isinstance(results[0], list) else results
+         for item in lines:
+             if isinstance(item, (list, tuple)) and len(item) >= 2:
+                 meta = item[1]
+                 if isinstance(meta, (list, tuple)) and len(meta) >= 1:
+                     all_text.append(str(meta[0]))
+     except Exception as e:
+         dprint("Error processing OCR results", str(e))
+ 
+     dprint("All direct texts", all_text)
+     return extract_prc_info(all_text) if all_text else {
+         "id_type": "prc",
+         "crn": None,
+         "full_name": None,
+         "birth_date": None
+     }
+ 
+ if len(sys.argv) < 2:
+     sys.stdout = original_stdout
+     print(json.dumps({"error": "No image URL provided"}))
+     sys.exit(1)
+ 
+ image_url = sys.argv[1]
+ dprint("Processing image URL", image_url)
+ try:
+     image_path = download_image(image_url)
+     dprint("Image downloaded to", image_path)
+     ocr_results = extract_ocr_lines(image_path)
+     dprint("OCR results ready")
+ 
+     # Restore stdout and print only the JSON response
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"success": True, "ocr_results": ocr_results}))
+     sys.stdout.flush()
+ 
+ except Exception as e:
+     dprint("Exception", str(e))
+     # Restore stdout for the error JSON
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"error": str(e)}))
+     sys.stdout.flush()
+     sys.exit(1)
+ finally:
+     # Clean up
+     try:
+         clean_cache()
+     except Exception:
+         pass
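The name-composition rule used by `normalize_name_from_parts` (keep at most two given names, append the surname, then title-case) can be sketched standalone; the helper name `compose_name` and the sample name are illustrative, not from the card data:

```python
def compose_name(last, given_block):
    """Keep up to two given names, append the surname, then title-case each word."""
    given = [t for t in (given_block or '').split() if t][:2]
    composed = ' '.join(given + [(last or '').strip()]).strip()
    # Title-case word by word so multi-word surnames are handled too
    return ' '.join(w.capitalize() for w in composed.split()) or None

print(compose_name("DELA CRUZ", "JUAN MIGUEL SANTOS"))  # -> Juan Miguel Dela Cruz
```

Extra given names beyond the second (here "SANTOS") are dropped, matching the script's decision to ignore the middle-name block entirely.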
extract_sss.py ADDED
@@ -0,0 +1,216 @@
1
+ import sys, json, os, glob, re, requests
2
+ from PIL import Image
3
+ from io import BytesIO
4
+ from datetime import datetime
5
+ from contextlib import redirect_stdout, redirect_stderr
6
+
7
+ # Immediately redirect all output to stderr except for our final JSON
8
+ original_stdout = sys.stdout
9
+ sys.stdout = sys.stderr
10
+
11
+ # Suppress all PaddleOCR output
12
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
13
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
14
+ os.environ['DISPLAY'] = ':99'
15
+
16
+ # Import PaddleOCR after setting environment variables
17
+ from paddleocr import PaddleOCR
18
+
19
+ def dprint(msg, obj=None):
20
+ try:
21
+ print(f"DEBUG: {msg}" + (f": {obj}" if obj is not None else ""), file=sys.stderr)
22
+ except Exception:
23
+ pass
24
+
25
+ def clean_cache():
26
+ files = ['temp_image.jpg', 'temp_image_ocr_res_img.jpg', 'temp_image_preprocessed_img.jpg', 'temp_image_res.json']
27
+ for f in files:
28
+ if os.path.exists(f):
29
+ os.remove(f)
30
+ dprint("Removed cache file", f)
31
+ if os.path.exists("output"):
32
+ import shutil
33
+ shutil.rmtree("output")
34
+ dprint("Removed output directory")
35
+
36
+ def download_image(url, output_path='temp_image.jpg'):
37
+ dprint("Starting download", url)
38
+ clean_cache()
39
+ r = requests.get(url)
40
+ dprint("HTTP status", r.status_code)
41
+ r.raise_for_status()
42
+ img = Image.open(BytesIO(r.content))
43
+ if img.mode == 'RGBA':
44
+ bg = Image.new('RGB', img.size, (255, 255, 255))
45
+ bg.paste(img, mask=img.split()[-1])
46
+ img = bg
47
+ elif img.mode != 'RGB':
48
+ img = img.convert('RGB')
49
+ img.save(output_path, 'JPEG', quality=95)
50
+ dprint("Saved image", output_path)
51
+ return output_path
52
+
53
+ def cap_words(s):
54
+ return None if not s else ' '.join(w.capitalize() for w in s.split())
55
+
56
+ def normalize_full_name(s):
57
+ if not s:
58
+ return None
59
+ raw = ' '.join(s.split())
60
+
61
+ if ',' in raw:
62
+ parts = [p.strip() for p in raw.split(',')]
63
+ last = parts[0] if parts else ''
64
+ given_block = ','.join(parts[1:]).strip() if len(parts) > 1 else ''
65
+ tokens = [t for t in given_block.split(' ') if t]
66
+ first = tokens[0] if tokens else ''
67
+ second = tokens[1] if len(tokens) > 1 else ''
68
+ if second:
69
+ return f"{first} {second} {last}".strip()
70
+ else:
71
+ return f"{first} {last}".strip()
72
+ else:
73
+ tokens = [t for t in raw.split(' ') if t]
74
+ if len(tokens) >= 3:
75
+ return f"{tokens[0]} {tokens[1]} {tokens[-1]}".strip()
76
+ elif len(tokens) == 2:
77
+ return f"{tokens[0]} {tokens[1]}".strip()
78
+ else:
79
+ return raw
80
+
81
+ def extract_sss_number(text):
82
+ sss_pattern = r'\b(\d{2}-\d{7}-\d{1})\b'
83
+ match = re.search(sss_pattern, text)
84
+ if match:
85
+ return match.group(1)
86
+ return None
87
+
88
+ def extract_sss_info(lines):
89
+ dprint("Lines to extract", lines)
90
+
91
+ full_name = None
92
+ sss_number = None
93
+ sss_id_number = None
94
+ name_parts = []
95
+
96
+ L = [str(x or '').strip() for x in lines]
97
+ i = 0
98
+ while i < len(L):
99
+ line = L[i]
100
+ low = line.lower()
101
+ dprint("Line", {"i": i, "text": line})
102
+
103
+ # Look for SSS number pattern (XX-XXXXXXX-X)
104
+ if not sss_number:
105
+ sss_num = extract_sss_number(line)
106
+ if sss_num:
107
+ sss_number = sss_num
108
+ sss_id_number = sss_num
109
+ dprint("Found SSS number", sss_number)
110
+
111
+ # Collect potential name parts (single words that look like names)
112
+ if re.search(r'^[A-Z]{2,}$', line) and not re.search(r'[0-9]', line):
113
+ skip_words = ['REPUBLIC', 'PHILIPPINES', 'SOCIAL', 'SECURITY', 'SYSTEM', 'SSS', 'PRESIDENT', 'PROUD', 'FILIPINO', 'CORAZON']
114
+ if not any(word in line.upper() for word in skip_words):
115
+ name_parts.append(line)
116
+ dprint("Added name part", line)
117
+
118
+ # Look for multi-word names but don't set full_name yet
119
+ if len(line.split()) >= 2 and re.search(r'[A-Z]{2,}', line) and not re.search(r'[0-9]{3,}', line):
120
+ skip_words = ['REPUBLIC', 'PHILIPPINES', 'SOCIAL', 'SECURITY', 'SYSTEM', 'SSS', 'PRESIDENT', 'PROUD', 'FILIPINO']
121
+ if not any(word in line.upper() for word in skip_words):
122
+ # Add multi-word names to name_parts instead of setting full_name directly
123
+ name_parts.append(line)
124
+ dprint("Added multi-word name part", line)
125
+
126
+ i += 1
127
+
128
+ # Now compose the full name from all collected parts
129
+ if name_parts:
130
+     # Combine all name parts
+     combined_name = ' '.join(name_parts)
+     full_name = normalize_full_name(combined_name)
+     dprint("Composed name from all parts", {"parts": name_parts, "result": full_name})
+
+     result = {
+         "id_type": "sss",
+         "sss_number": sss_number,
+         "id_number": sss_id_number,
+         "full_name": full_name,
+         "birth_date": None,
+         "address": None,
+         "sex": None,
+         "nationality": "Filipino"
+     }
+     dprint("Final result", result)
+     return result
+
+ def extract_ocr_lines(image_path):
+     os.makedirs("output", exist_ok=True)
+     dprint("Initializing PaddleOCR")
+
+     with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
+         ocr = PaddleOCR(
+             use_doc_orientation_classify=False,
+             use_doc_unwarping=False,
+             use_textline_orientation=False,
+             lang='en',
+             show_log=False
+         )
+         dprint("OCR initialized")
+         dprint("Running OCR predict", image_path)
+         results = ocr.ocr(image_path, cls=False)
+     dprint("OCR predict done, results_count", len(results))
+
+     # Process OCR results directly
+     all_text = []
+     try:
+         lines = results[0] if results and isinstance(results[0], list) else results
+         for item in lines:
+             if isinstance(item, (list, tuple)) and len(item) >= 2:
+                 meta = item[1]
+                 if isinstance(meta, (list, tuple)) and len(meta) >= 1:
+                     all_text.append(str(meta[0]))
+     except Exception as e:
+         dprint("Error processing OCR results", str(e))
+
+     dprint("All direct texts", all_text)
+     return extract_sss_info(all_text) if all_text else {
+         "id_type": "sss",
+         "sss_number": None,
+         "id_number": None,
+         "full_name": None,
+         "birth_date": None
+     }
+
+ if len(sys.argv) < 2:
+     sys.stdout = original_stdout
+     print(json.dumps({"error": "No image URL provided"}))
+     sys.exit(1)
+
+ image_url = sys.argv[1]
+ dprint("Processing image URL", image_url)
+ try:
+     image_path = download_image(image_url)
+     dprint("Image downloaded to", image_path)
+     ocr_results = extract_ocr_lines(image_path)
+     dprint("OCR results ready")
+
+     # Restore stdout and print only the JSON response
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"success": True, "ocr_results": ocr_results}))
+     sys.stdout.flush()
+
+ except Exception as e:
+     dprint("Exception", str(e))
+     # Restore stdout for error JSON
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"error": str(e)}))
+     sys.stdout.flush()
+     sys.exit(1)
+ finally:
+     # Clean up
+     try:
+         clean_cache()
+     except:
+         pass
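The extractor scripts in this commit share one contract: every incidental message (debug lines, library chatter) goes to stderr, so stdout carries exactly one JSON document for the caller to parse. A minimal standard-library sketch of that pattern; `run_noisy_task` is an illustrative stand-in for the OCR call, not a name from this repo:

```python
import json
import sys
from contextlib import redirect_stdout

def run_noisy_task():
    # Simulated library chatter that must stay off the JSON channel
    print("loading model...")
    return {"success": True, "ocr_results": {"id_type": "sss"}}

# While the task runs, anything print()ed lands on stderr instead of stdout
with redirect_stdout(sys.stderr):
    result = run_noisy_task()

# stdout now carries a single, machine-parseable JSON line
payload = json.dumps(result)
print(payload)
```

A caller (e.g. a Node.js server spawning this script) can then `json.loads` stdout without filtering out log noise.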
extract_tesda_ocr.py ADDED
@@ -0,0 +1,331 @@
+ import sys, json, os, glob, re, requests
+ from PIL import Image
+ from io import BytesIO
+ from datetime import datetime
+ from contextlib import redirect_stdout, redirect_stderr
+
+ # Immediately redirect all output to stderr except for our final JSON
+ original_stdout = sys.stdout
+ sys.stdout = sys.stderr
+
+ # Suppress all PaddleOCR output
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+
+ # Import PaddleOCR after setting environment variables
+ from paddleocr import PaddleOCR
+
+ def dprint(msg, obj=None):
+     try:
+         print(f"DEBUG: {msg}" + (f": {obj}" if obj is not None else ""), file=sys.stderr)
+     except Exception:
+         pass
+
+ def clean_cache():
+     cache_files = ['temp_image.jpg', 'temp_image_ocr_res_img.jpg', 'temp_image_preprocessed_img.jpg', 'temp_image_res.json']
+     for f in cache_files:
+         if os.path.exists(f):
+             os.remove(f)
+             dprint("Removed cache file", f)
+     if os.path.exists("output"):
+         import shutil
+         shutil.rmtree("output")
+         dprint("Removed output directory")
+
+ def download_image(url, output_path='temp_image.jpg'):
+     dprint("Starting download", url)
+     clean_cache()
+     r = requests.get(url)
+     dprint("HTTP status", r.status_code)
+     r.raise_for_status()
+     img = Image.open(BytesIO(r.content))
+     if img.mode == 'RGBA':
+         bg = Image.new('RGB', img.size, (255, 255, 255))
+         bg.paste(img, mask=img.split()[-1])
+         img = bg
+     elif img.mode != 'RGB':
+         img = img.convert('RGB')
+     img.save(output_path, 'JPEG', quality=95)
+     dprint("Saved image", output_path)
+     return output_path
+
+ def format_date(s):
+     if not s:
+         return None
+     raw = s.strip()
+
+     # Handle formats like "july 22,2022" (no space after comma)
+     raw = raw.replace(',', ', ')
+
+     t = raw.replace(' ', '').replace('\\', '/').replace('.', '/')
+     if re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', t):
+         return t.replace('/', '-')
+     # Accept mm/dd/yyyy style
+     if re.match(r'^\d{2}/\d{2}/\d{4}$', raw):
+         m, d, y = raw.split('/')
+         return f"{y}-{int(m):02d}-{int(d):02d}"
+     # Month name variants - try different formats
+     date_formats = [
+         "%B %d, %Y",  # July 22, 2022
+         "%b %d, %Y",  # Jul 22, 2022
+         "%B %d %Y",   # July 22 2022
+         "%b %d %Y",   # Jul 22 2022
+     ]
+
+     for fmt in date_formats:
+         try:
+             return datetime.strptime(raw.replace('  ', ' '), fmt).strftime("%Y-%m-%d")
+         except Exception:
+             continue
+
+     return raw
+
+ def cap_words(name):
+     return None if not name else ' '.join(w.capitalize() for w in name.split())
+
+ def normalize_name_from_parts(last, first_block):
+     last = (last or '').strip()
+     tokens = [t for t in (first_block or '').strip().split(' ') if t]
+     given_kept = tokens[:2]  # keep up to two given names
+     composed = ' '.join(given_kept + [last]).strip()
+     return cap_words(composed) if composed else None
+
+ def take_within(lines, i, k=5):
+     out = []
+     for j in range(1, k+1):
+         if i+j < len(lines):
+             t = str(lines[i+j]).strip()
+             if t:
+                 out.append(t)
+     return out
+
+ def extract_number_from_text(text):
+     # Remove all non-digit characters and return the result
+     return ''.join(c for c in text if c.isdigit())
+
+ def extract_tesda_info(lines):
+     dprint("Lines to extract", lines)
+
+     certificate_number = None
+     uli_number = None
+     cln_nq_number = None
+     full_name = None
+
+     # Collect name pieces
+     last_name_txt = None
+     first_name_txt = None
+
+     # Initialize other variables
+     issued_date = None
+     valid_until = None
+     qualification = None
+     qualification_level = None
+
+     L = [str(x or '').strip() for x in lines]
+     i = 0
+     while i < len(L):
+         line = L[i]
+         low = line.lower()
+         dprint("Line", {"i": i, "text": line})
+
+         # Certificate Number - more flexible pattern matching
+         if certificate_number is None:
+             # Try different patterns
+             cert_patterns = [
+                 r'certificate\s*no\.?\s*(\d{14})',
+                 r'certificate\s*number[:\s]*(\d{14})',
+                 r'cert\s*no\.?\s*(\d{14})',
+                 r'(\d{14})'
+             ]
+
+             for pattern in cert_patterns:
+                 match = re.search(pattern, low)
+                 if match:
+                     certificate_number = match.group(1)
+                     dprint("Found certificate number", certificate_number)
+                     break
+
+             # If not found in current line, check next lines
+             if not certificate_number:
+                 ahead = take_within(L, i, 3)
+                 for t in ahead:
+                     # Look for 14-digit number
+                     nums = extract_number_from_text(t)
+                     if len(nums) == 14:
+                         certificate_number = nums
+                         dprint("Found certificate number in next lines", certificate_number)
+                         break
+
+         # ULI Number
+         if uli_number is None and ('uli' in low or 'ops-' in low):
+             uli_pattern = r'(?:uli:?)?\s*([a-zA-Z]{3}-\d{2}-\d{3}-\d{5}-\d{3})'
+             match = re.search(uli_pattern, low, re.IGNORECASE)
+             if match:
+                 uli_number = match.group(1).upper()
+                 dprint("Found ULI number", uli_number)
+
+         # CLN-NQ Number
+         if cln_nq_number is None and ('cln' in low or 'nq' in low):
+             cln_pattern = r'(?:cln-nq-?)?(\d{7})'
+             match = re.search(cln_pattern, low)
+             if match:
+                 cln_nq_number = match.group(1)
+                 dprint("Found CLN-NQ number", cln_nq_number)
+
+         # Name appears after "is awarded to"
+         if 'awarded to' in low:
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 if t and not any(k in t.lower() for k in ['awarded', 'certificate', 'valid', 'for having']):
+                     # Clean up the name - remove periods and collapse double spaces
+                     cleaned_name = t.replace('.', ' ').replace('  ', ' ').strip()
+                     full_name = cap_words(cleaned_name)
+
+                     # Try to split into components
+                     parts = full_name.split()
+                     if len(parts) >= 2:
+                         last_name_txt = parts[-1]
+                         first_name_txt = ' '.join(parts[:-1])
+                     dprint("Found full name", full_name)
+                     break
+
+         # Qualification Level
+         if qualification_level is None and ('national certificate' in low or 'nc' in low):
+             qualification_level = cap_words(line)
+             dprint("Found qualification level", qualification_level)
+
+         # Qualification/Specialization
+         if qualification is None and ('in' in low and len(line.split()) > 1):
+             if i+1 < len(L):
+                 qualification = cap_words(L[i+1])
+                 dprint("Found qualification", qualification)
+
+         # Issued Date
+         if issued_date is None and ('issued' in low):
+             date_pattern = r'(?:issued\s*(?:on|:)?\s*)?([A-Za-z]+\s+\d{1,2},?\s*\d{4})'
+             match = re.search(date_pattern, low)
+             if match:
+                 issued_date = format_date(match.group(1))
+                 dprint("Found issued date", issued_date)
+
+         # Valid Until Date
+         if valid_until is None and ('valid' in low):
+             date_pattern = r'(?:valid\s*until\s*)?([A-Za-z]+\s+\d{1,2},?\s*\d{4})'
+             match = re.search(date_pattern, low)
+             if match:
+                 valid_until = format_date(match.group(1))
+                 dprint("Found valid until date", valid_until)
+
+         i += 1
+
+     # Compose name at the end
+     if full_name is None:
+         full_name = normalize_name_from_parts(last_name_txt, first_name_txt)
+
+     # Get first and last 4 digits of certificate number if available
+     cert_first_four = certificate_number[:4] if certificate_number else None
+     cert_last_four = certificate_number[-4:] if certificate_number else None
+
+     result = {
+         "id_type": "tesda",
+         "certificate_number": certificate_number,
+         "cert_first_four": cert_first_four,
+         "cert_last_four": cert_last_four,
+         "uli_number": uli_number,
+         "cln_nq_number": cln_nq_number,
+         "full_name": full_name,
+         "first_name": first_name_txt,
+         "last_name": last_name_txt,
+         "qualification_level": qualification_level,
+         "qualification": qualification,
+         "issued_date": issued_date,
+         "valid_until": valid_until
+     }
+     dprint("Final result", result)
+     return result
+
+ def extract_ocr_lines(image_path):
+     os.makedirs("output", exist_ok=True)
+     dprint("Initializing PaddleOCR")
+
+     # Redirect both stdout and stderr during PaddleOCR operations
+     with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
+         ocr = PaddleOCR(
+             use_doc_orientation_classify=False,
+             use_doc_unwarping=False,
+             use_textline_orientation=False,
+             lang='en',
+             show_log=False
+         )
+         dprint("OCR initialized")
+         dprint("Running OCR", image_path)
+         results = ocr.ocr(image_path)
+
+     dprint("OCR done, results_count", len(results))
+
+     all_text = []
+     try:
+         lines = results[0] if results and isinstance(results[0], list) else results
+         for item in lines:
+             if isinstance(item, (list, tuple)) and len(item) >= 2:
+                 meta = item[1]
+                 if isinstance(meta, (list, tuple)) and len(meta) >= 1:
+                     all_text.append(str(meta[0]))
+     except Exception as e:
+         dprint("Error processing OCR results", str(e))
+
+     dprint("All direct texts", all_text)
+     return extract_tesda_info(all_text) if all_text else {
+         "id_type": "tesda",
+         "certificate_number": None,
+         "cert_first_four": None,
+         "cert_last_four": None,
+         "uli_number": None,
+         "cln_nq_number": None,
+         "full_name": None,
+         "first_name": None,
+         "last_name": None,
+         "qualification_level": None,
+         "qualification": None,
+         "issued_date": None,
+         "valid_until": None
+     }
+
+ if len(sys.argv) < 2:
+     sys.stdout = original_stdout
+     print(json.dumps({"success": False, "error": "No image URL provided"}))
+     sys.exit(1)
+
+ image_url = sys.argv[1]
+ dprint("Processing image URL", image_url)
+
+ try:
+     image_path = download_image(image_url)
+     dprint("Image downloaded to", image_path)
+     ocr_results = extract_ocr_lines(image_path)
+     dprint("OCR results ready", ocr_results)
+
+     # Create the response object
+     response = {
+         "success": True,
+         "ocr_results": ocr_results
+     }
+
+     # Restore stdout and print only the JSON response
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps(response))
+     sys.stdout.flush()
+
+ except Exception as e:
+     dprint("Exception", str(e))
+     # Restore stdout for error JSON
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"success": False, "error": str(e)}))
+     sys.stdout.flush()
+     sys.exit(1)
+ finally:
+     # Clean up
+     try:
+         clean_cache()
+     except:
+         pass
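The `format_date` helper above folds several OCR'd date spellings ("july 22,2022", "01/28/1960", "1960/01/28") into ISO 8601. The same approach can be sketched standalone; `to_iso` is an illustrative name, not a function from this repo:

```python
import re
from datetime import datetime

def to_iso(raw):
    """Best-effort normalization of OCR'd date strings to YYYY-MM-DD."""
    raw = raw.strip().replace(',', ', ')  # tolerate "july 22,2022"
    compact = raw.replace(' ', '').replace('\\', '/').replace('.', '/')
    if re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', compact):
        return compact.replace('/', '-')          # already yyyy/mm/dd
    if re.match(r'^\d{2}/\d{2}/\d{4}$', raw):
        m, d, y = raw.split('/')                  # mm/dd/yyyy
        return f"{y}-{int(m):02d}-{int(d):02d}"
    for fmt in ("%B %d, %Y", "%b %d, %Y", "%B %d %Y", "%b %d %Y"):
        try:
            # strptime matches month names case-insensitively
            return datetime.strptime(' '.join(raw.split()), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # leave unrecognized strings untouched for the caller

print(to_iso("july 22,2022"))  # -> 2022-07-22
```

Returning the raw string on failure, rather than None, lets the caller surface what the OCR actually read.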
extract_umid.py ADDED
@@ -0,0 +1,317 @@
+ import sys, json, os, glob, re, requests
+ from PIL import Image
+ from io import BytesIO
+ from datetime import datetime
+ from contextlib import redirect_stdout, redirect_stderr
+
+ # Immediately redirect all output to stderr except for our final JSON
+ original_stdout = sys.stdout
+ sys.stdout = sys.stderr
+
+ # Suppress all PaddleOCR output
+ os.environ['PADDLEOCR_LOG_LEVEL'] = 'ERROR'
+ os.environ['QT_QPA_PLATFORM'] = 'offscreen'
+ os.environ['DISPLAY'] = ':99'
+
+ # Import PaddleOCR after setting environment variables
+ from paddleocr import PaddleOCR
+
+ def dprint(msg, obj=None):
+     try:
+         print(f"DEBUG: {msg}" + (f": {obj}" if obj is not None else ""), file=sys.stderr)
+     except Exception:
+         pass
+
+ def clean_cache():
+     files = ['temp_image.jpg', 'temp_image_ocr_res_img.jpg', 'temp_image_preprocessed_img.jpg', 'temp_image_res.json']
+     for f in files:
+         if os.path.exists(f):
+             os.remove(f)
+             dprint("Removed cache file", f)
+     if os.path.exists("output"):
+         import shutil
+         shutil.rmtree("output")
+         dprint("Removed output directory")
+
+ def download_image(url, output_path='temp_image.jpg'):
+     dprint("Starting download", url)
+     clean_cache()
+     r = requests.get(url)
+     dprint("HTTP status", r.status_code)
+     r.raise_for_status()
+     img = Image.open(BytesIO(r.content))
+     if img.mode == 'RGBA':
+         bg = Image.new('RGB', img.size, (255, 255, 255))
+         bg.paste(img, mask=img.split()[-1])
+         img = bg
+     elif img.mode != 'RGB':
+         img = img.convert('RGB')
+     img.save(output_path, 'JPEG', quality=95)
+     dprint("Saved image", output_path)
+     return output_path
+
+ def format_date(s):
+     if not s:
+         return None
+     raw = str(s).strip()
+     t = raw.replace(' ', '').replace('\\', '/').replace('.', '/')
+     # 1960/01/28 or 1960-01-28
+     if re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', t):
+         return t.replace('/', '-')
+     # 01/28/1960
+     if re.match(r'^\d{2}/\d{2}/\d{4}$', raw):
+         m, d, y = raw.split('/')
+         return f"{y}-{int(m):02d}-{int(d):02d}"
+     # Month name variants
+     try:
+         return datetime.strptime(raw, "%B %d, %Y").strftime("%Y-%m-%d")
+     except Exception:
+         pass
+     try:
+         return datetime.strptime(raw, "%b %d, %Y").strftime("%Y-%m-%d")
+     except Exception:
+         pass
+     return raw
+
+ def cap_words(s):
+     return None if not s else ' '.join(w.capitalize() for w in s.split())
+
+ def take_within(lines, i, k=5):
+     out = []
+     for j in range(1, k+1):
+         if i+j < len(lines):
+             t = str(lines[i+j]).strip()
+             if t:
+                 out.append(t)
+     return out
+
+ def is_crn_text(t):
+     # UMID CRN like 0028-1215160-9 or 002812151609
+     z = str(t).strip()
+     return bool(re.match(r'^\d{4}-\d{7}-\d$', z)) or bool(re.match(r'^\d{12,13}$', z))
+
+ def normalize_name(last, given):
+     last = (last or '').strip()
+     tokens = [t for t in (given or '').strip().split(' ') if t]
+     kept = tokens[:2]  # First [+Second], ignore middles beyond that
+     name = ' '.join(kept + [last]).strip()
+     return cap_words(name) if name else None
+
+ def glue_address(lines, start_idx):
+     parts = []
+     stop_labels = ['crn', 'surname', 'given', 'middle', 'sex', 'date', 'birth']
+     for k in range(1, 6):  # take up to 5 lines after ADDRESS
+         idx = start_idx + k
+         if idx >= len(lines):
+             break
+         t = str(lines[idx]).strip()
+         if not t:
+             continue
+         low = t.lower()
+         if any(lbl in low for lbl in stop_labels):
+             break
+         parts.append(t)
+     # collapse extra spaces and commas
+     address = ', '.join(parts)
+     address = re.sub(r'\s{2,}', ' ', address)
+     address = address.replace(' ,', ',')
+     return address or None
+
+ def extract_umid_info(lines):
+     dprint("Lines to extract", lines)
+
+     crn = None
+     full_name = None
+     birth_date = None
+     sex = None
+     address = None
+
+     last_name_txt = None
+     given_name_txt = None
+
+     L = [str(x or '').strip() for x in lines]
+     i = 0
+     while i < len(L):
+         line = L[i]
+         low = line.lower()
+         dprint("Line", {"i": i, "text": line})
+
+         # CRN
+         crn_candidate = extract_crn_from_text(line)
+         if crn is None and crn_candidate:
+             crn = crn_candidate
+             dprint("Found CRN", crn)
+         elif crn is None and i+1 < len(L):
+             crn_candidate = extract_crn_from_text(L[i+1])
+             if crn_candidate:
+                 crn = crn_candidate
+                 dprint("Found CRN (next)", crn)
+
+         # Surname / Given Name / Middle Name (ignore middle)
+         if 'surname' in low:
+             ahead = take_within(L, i, 3)
+             for t in ahead:
+                 tl = t.lower()
+                 if not any(k in tl for k in ['given', 'middle', 'sex', 'date', 'birth', 'address', 'crn']):
+                     last_name_txt = t
+                     dprint("Captured last_name", last_name_txt)
+                     break
+
+         if 'given name' in low or 'given' in low:
+             if i+1 < len(L):
+                 # sometimes value is same line (rare), often next line
+                 val = L[i+1] if L[i+1] else line.replace('given name', '').strip()
+                 given_name_txt = val if val else None
+                 dprint("Captured given_name", given_name_txt)
+
+         # Sex and Date of Birth (handles same-line case like "SEXM DATEOFBIRTH 196O/O1/28")
+         if 'sex' in low:
+             # sex inline
+             if re.search(r'\bF(EMALE)?\b', line, flags=re.I): sex = 'F'
+             if re.search(r'\bM(ALE)?\b', line, flags=re.I): sex = 'M'
+             # lookahead fallback
+             if sex is None:
+                 for t in take_within(L, i, 2):
+                     tt = t.strip().upper()
+                     if tt in ('M', 'F', 'MALE', 'FEMALE'):
+                         sex = 'M' if tt.startswith('M') else 'F'
+                         break
+
+         # Date of Birth inline or in next tokens
+         if 'date' in low and 'birth' in low:
+             # inline date
+             m = re.search(r'(\d[0-9OIl]{3}[/\-][0-1OIl]\d[/\-]\d{2,4})', line)
+             if m:
+                 birth_date = format_date(normalize_digits(m.group(1)))
+                 dprint("Found birth_date (inline)", birth_date)
+             if birth_date is None:
+                 for t in take_within(L, i, 3):
+                     # fix digits then try
+                     cand = normalize_digits(t)
+                     if re.search(r'\d', cand) and format_date(cand):
+                         birth_date = format_date(cand)
+                         dprint("Found birth_date (lookahead)", birth_date)
+                         break
+
+         # Also catch standalone yyyy/mm/dd line
+         if birth_date is None and re.match(r'^\d{4}[-/]\d{2}[-/]\d{2}$', normalize_digits(line)):
+             birth_date = format_date(normalize_digits(line))
+             dprint("Found standalone birth_date", birth_date)
+
+         # Address block
+         if 'address' in low and address is None:
+             address = glue_address(L, i)
+             dprint("Found address", address)
+
+         i += 1
+
+     # Compose final name
+     if full_name is None:
+         full_name = normalize_name(last_name_txt, given_name_txt)
+         dprint("Composed full_name", {"last": last_name_txt, "given": given_name_txt, "full": full_name})
+
+     result = {
+         "id_type": "umid",
+         "crn": crn,
+         "id_number": crn,  # frontend expects this
+         "full_name": full_name,
+         "birth_date": birth_date,
+         "sex": sex,
+         "address": address
+     }
+     dprint("Final result", result)
+     return result
+
+ def extract_ocr_lines(image_path):
+     os.makedirs("output", exist_ok=True)
+     dprint("Initializing PaddleOCR")
+
+     with redirect_stdout(sys.stderr), redirect_stderr(sys.stderr):
+         ocr = PaddleOCR(
+             use_doc_orientation_classify=False,
+             use_doc_unwarping=False,
+             use_textline_orientation=False,
+             lang='en',
+             show_log=False
+         )
+         dprint("OCR initialized")
+         dprint("Running OCR predict", image_path)
+         results = ocr.ocr(image_path, cls=False)
+     dprint("OCR predict done, results_count", len(results))
+
+     # Process OCR results directly
+     all_text = []
+     try:
+         lines = results[0] if results and isinstance(results[0], list) else results
+         for item in lines:
+             if isinstance(item, (list, tuple)) and len(item) >= 2:
+                 meta = item[1]
+                 if isinstance(meta, (list, tuple)) and len(meta) >= 1:
+                     all_text.append(str(meta[0]))
+     except Exception as e:
+         dprint("Error processing OCR results", str(e))
+
+     dprint("All direct texts", all_text)
+     return extract_umid_info(all_text) if all_text else {
+         "id_type": "umid",
+         "crn": None,
+         "id_number": None,
+         "full_name": None,
+         "birth_date": None
+     }
+
+ def normalize_digits(s):
+     # Fix common OCR digit confusions: O→0, o→0, I/l→1, S→5, B→8
+     return (
+         str(s)
+         .replace('O', '0').replace('o', '0')
+         .replace('I', '1').replace('l', '1')
+         .replace('S', '5')
+         .replace('B', '8')
+     )
+
+ def extract_crn_from_text(t):
+     # Accept "CRN-0028-1215160-9" or plain digits/hyphens
+     m = re.search(r'crn[^0-9]*([0-9OIl\-]{10,})', t, flags=re.IGNORECASE)
+     if m:
+         val = normalize_digits(m.group(1))
+         # Keep hyphens; also accept compact digits
+         if re.match(r'^\d{4}-\d{7}-\d$', val) or re.match(r'^\d{12,13}$', val):
+             return val
+     # Or whole token is the number
+     val = normalize_digits(t.strip())
+     if re.match(r'^\d{4}-\d{7}-\d$', val) or re.match(r'^\d{12,13}$', val):
+         return val
+     return None
+
+ if len(sys.argv) < 2:
+     sys.stdout = original_stdout
+     print(json.dumps({"error": "No image URL provided"}))
+     sys.exit(1)
+
+ image_url = sys.argv[1]
+ dprint("Processing image URL", image_url)
+ try:
+     image_path = download_image(image_url)
+     dprint("Image downloaded to", image_path)
+     ocr_results = extract_ocr_lines(image_path)
+     dprint("OCR results ready")
+
+     # Restore stdout and print only the JSON response
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"success": True, "ocr_results": ocr_results}))
+     sys.stdout.flush()
+
+ except Exception as e:
+     dprint("Exception", str(e))
+     # Restore stdout for error JSON
+     sys.stdout = original_stdout
+     sys.stdout.write(json.dumps({"error": str(e)}))
+     sys.stdout.flush()
+     sys.exit(1)
+ finally:
+     # Clean up
+     try:
+         clean_cache()
+     except:
+         pass
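The UMID script above leans on two small tricks: repairing characters OCR commonly mistakes for digits (O→0, I/l→1, S→5, B→8) and then validating the CRN shape with a regex. A self-contained sketch of that combination; `looks_like_crn` is an illustrative name, not a function from this repo:

```python
import re

# Characters PaddleOCR commonly misreads for digits, mapped to the digit
CONFUSIONS = str.maketrans({'O': '0', 'o': '0', 'I': '1', 'l': '1', 'S': '5', 'B': '8'})

def normalize_digits(s):
    # Repair common digit confusions before pattern matching
    return str(s).translate(CONFUSIONS)

def looks_like_crn(s):
    # UMID CRN: 0028-1215160-9, or the same 12 digits run together
    z = normalize_digits(s).strip()
    return bool(re.fullmatch(r'\d{4}-\d{7}-\d|\d{12,13}', z))

print(looks_like_crn("OO28-1215160-9"))  # True: O→0 repair makes it match
```

Normalizing first means one strict pattern can accept both clean and lightly garbled reads, instead of loosening the regex itself.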
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ paddleocr>=2.7.0
+ paddlepaddle>=2.5.0
+ opencv-python-headless>=4.8.0
+ pillow>=10.0.0
+ numpy>=1.24.0
+ requests>=2.31.0
+ flask>=2.3.0
+ gunicorn>=21.2.0
+ shapely>=2.0.0
+ pyclipper>=1.3.0
+ imgaug>=0.4.0
+ lmdb>=1.4.0
+ tqdm>=4.65.0
+
test_api.py ADDED
@@ -0,0 +1,145 @@
+ #!/usr/bin/env python3
+ """
+ Test script for HandyHome OCR API
+ Tests all endpoints with sample document URLs
+ """
+
+ import requests
+ import json
+ import sys
+
+ # Configuration
+ BASE_URL = "http://localhost:7860"  # Change to your Hugging Face Space URL for remote testing
+
+ # Sample test URLs (replace with your actual test image URLs)
+ TEST_URLS = {
+     'national_id': 'https://example.com/national_id.jpg',
+     'drivers_license': 'https://example.com/drivers_license.jpg',
+     'prc': 'https://example.com/prc.jpg',
+     'umid': 'https://example.com/umid.jpg',
+     'sss': 'https://example.com/sss.jpg',
+     'passport': 'https://example.com/passport.jpg',
+     'postal': 'https://example.com/postal.jpg',
+     'phic': 'https://example.com/phic.jpg',
+     'nbi': 'https://example.com/nbi.jpg',
+     'police': 'https://example.com/police.jpg',
+     'tesda': 'https://example.com/tesda.jpg',
+ }
+
+ def test_health():
+     """Test health endpoint"""
+     print("\n" + "="*60)
+     print("Testing Health Endpoint")
+     print("="*60)
+
+     try:
+         response = requests.get(f"{BASE_URL}/health", timeout=10)
+         print(f"Status Code: {response.status_code}")
+         print(f"Response: {json.dumps(response.json(), indent=2)}")
+         return response.status_code == 200
+     except Exception as e:
+         print(f"Error: {str(e)}")
+         return False
+
+ def test_index():
+     """Test index/documentation endpoint"""
+     print("\n" + "="*60)
+     print("Testing Index/Documentation Endpoint")
+     print("="*60)
+
+     try:
+         response = requests.get(f"{BASE_URL}/", timeout=10)
+         print(f"Status Code: {response.status_code}")
+         data = response.json()
+         print(f"Service: {data.get('service')}")
+         print(f"Version: {data.get('version')}")
+         print(f"Available Endpoints: {len(data.get('endpoints', {}))}")
+         return response.status_code == 200
+     except Exception as e:
+         print(f"Error: {str(e)}")
+         return False
+
+ def test_extraction_endpoint(endpoint, document_type, document_url):
+     """Test an OCR extraction endpoint"""
+     print("\n" + "="*60)
+     print(f"Testing: {document_type}")
+     print("="*60)
+     print(f"Endpoint: {endpoint}")
+     print(f"Document URL: {document_url}")
+
+     try:
+         response = requests.post(
+             f"{BASE_URL}{endpoint}",
+             json={'document_url': document_url},
+             timeout=300
+         )
+         print(f"Status Code: {response.status_code}")
+
+         try:
+             data = response.json()
+             print(f"Response: {json.dumps(data, indent=2)}")
+             return response.status_code == 200 and data.get('success', False)
+         except:
+             print(f"Response (non-JSON): {response.text[:200]}")
+             return False
+
+     except requests.exceptions.Timeout:
+         print("Error: Request timed out (>5 minutes)")
+         return False
+     except Exception as e:
+         print(f"Error: {str(e)}")
+         return False
+
+ def main():
+     """Run all tests"""
+     print("\n" + "="*60)
+     print("HandyHome OCR API Test Suite")
+     print("="*60)
+     print(f"Base URL: {BASE_URL}")
+
+     results = {}
+
+     # Test utility endpoints
+     results['health'] = test_health()
+     results['index'] = test_index()
+
+     # Test OCR endpoints
+     endpoints_to_test = [
+         ('/api/extract-national-id', 'National ID', TEST_URLS['national_id']),
+         ('/api/extract-drivers-license', "Driver's License", TEST_URLS['drivers_license']),
+         ('/api/extract-prc', 'PRC ID', TEST_URLS['prc']),
+         ('/api/extract-umid', 'UMID', TEST_URLS['umid']),
+         ('/api/extract-sss', 'SSS ID', TEST_URLS['sss']),
+         ('/api/extract-passport', 'Passport', TEST_URLS['passport']),
+         ('/api/extract-postal', 'Postal ID', TEST_URLS['postal']),
+         ('/api/extract-phic', 'PhilHealth ID', TEST_URLS['phic']),
+         ('/api/extract-nbi', 'NBI Clearance', TEST_URLS['nbi']),
+         ('/api/extract-police-clearance', 'Police Clearance', TEST_URLS['police']),
+         ('/api/extract-tesda', 'TESDA Certificate', TEST_URLS['tesda']),
+     ]
+
+     for endpoint, doc_type, url in endpoints_to_test:
+         key = doc_type.lower().replace(' ', '_').replace("'", '')
+         results[key] = test_extraction_endpoint(endpoint, doc_type, url)
+
+     # Print summary
+     print("\n" + "="*60)
+     print("Test Summary")
+     print("="*60)
+
+     passed = sum(1 for v in results.values() if v)
+     total = len(results)
+
+     for test_name, passed_test in results.items():
+         status = "✓ PASS" if passed_test else "✗ FAIL"
+         print(f"{status} - {test_name}")
+
+     print("\n" + "="*60)
+     print(f"Results: {passed}/{total} tests passed")
+     print("="*60)
+
+     return 0 if passed == total else 1
+
+ if __name__ == '__main__':
+     sys.exit(main())
+