metascraper/doc/ARCHITECTURE.md
2025-11-23 14:25:09 +03:00

MetaScraper Architecture Documentation

🏗️ System Architecture Overview

MetaScraper is a Node.js library designed for extracting metadata from Netflix title pages. The architecture emphasizes reliability, performance, and maintainability through a modular design.

Core Design Principles

  1. Dual-Mode Operation: Static HTML parsing with Playwright fallback
  2. Graceful Degradation: Continue operation even when optional dependencies fail
  3. Localization-Aware: Built-in support for Turkish Netflix interfaces
  4. Error Resilience: Comprehensive error handling with Turkish error messages
  5. Modern JavaScript: ES6+ modules with Node.js 18+ compatibility

🔄 System Flow

Input URL → URL Normalization → Static HTML Fetch → HTML Parsing → Success?
    ↓                                               ↓
   Error                                      Headless Fallback
    ↓                                               ↓
  Return ← HTML Parsing ← Browser Execution ← Playwright Launch

Detailed Flow Analysis

1. URL Normalization (src/index.js:21-48)

  • Validates Netflix URL format
  • Extracts Netflix title ID from various URL patterns
  • Normalizes to standard format: https://www.netflix.com/title/{id}

Supported URL Patterns:

  • https://www.netflix.com/tr/title/82123114?s=i&trkid=264356104&vlang=tr
  • https://www.netflix.com/title/80189685
  • https://www.netflix.com/tr/title/70195800?trackId=12345
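The normalization step above can be sketched roughly as follows. This is a minimal illustration, not the code in src/index.js: the return shape (`{ id, url }`), the error messages, and the locale-prefix regex are assumptions made for the example.

```javascript
// Hypothetical sketch of URL normalization; the actual logic lives in src/index.js.
function normalizeNetflixUrl(inputUrl) {
  let parsed;
  try {
    parsed = new URL(inputUrl);
  } catch {
    throw new Error(`Invalid URL: ${inputUrl}`);
  }

  // Accept only netflix.com and its subdomains.
  const host = parsed.hostname.toLowerCase();
  if (host !== 'netflix.com' && !host.endsWith('.netflix.com')) {
    throw new Error(`Not a Netflix URL: ${inputUrl}`);
  }

  // Match /title/{id}, with an optional locale segment such as /tr or /en-gb.
  // Query parameters (trkid, vlang, ...) are not part of pathname, so they are dropped.
  const match = parsed.pathname.match(/^\/(?:[a-z]{2}(?:-[a-z]{2})?\/)?title\/(\d+)\/?$/i);
  if (!match) {
    throw new Error(`No Netflix title ID found in: ${inputUrl}`);
  }

  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}
```

Parsing with the built-in `URL` class rather than a single regex over the whole string keeps tracking parameters and locale prefixes out of the ID match.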

2. Static HTML Fetch (src/index.js:99-128)

  • Uses native fetch API with undici polyfill support
  • Configurable timeout and User-Agent
  • Comprehensive error handling for network issues
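A static fetch with a timeout might look like the sketch below. The first three parameters follow the `fetchStaticHtml(url, userAgent, timeoutMs)` signature listed later in this document; the fourth parameter (`fetchImpl`), the default timeout, and the error messages are assumptions added so the example can be exercised without network access.

```javascript
// Hypothetical sketch; assumes Node.js 18+ where fetch and AbortController are globals.
async function fetchStaticHtml(url, userAgent, timeoutMs = 15000, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  // Abort the request if it has not completed within timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetchImpl(url, {
      headers: { 'User-Agent': userAgent },
      redirect: 'follow',
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Request failed with HTTP ${response.status}`);
    }
    return await response.text();
  } catch (err) {
    // An aborted fetch rejects with an error named "AbortError".
    if (err.name === 'AbortError') {
      throw new Error(`Request timed out after ${timeoutMs} ms`);
    }
    throw err;
  } finally {
    clearTimeout(timer); // never leave the timer pending
  }
}
```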

3. HTML Parsing (src/parser.js:134-162)

  • Primary Strategy: JSON-LD structured data extraction
  • Fallback Strategy: Meta tags and title element parsing
  • Title Cleaning: Removes Turkish UI text and Netflix suffixes

4. Headless Fallback (src/headless.js:9-41)

  • Optional Playwright integration
  • Chromium browser automation
  • Network idle detection for complete page loads
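A rough sketch of the headless path, assuming Playwright's `chromium.launch`, `browser.newContext`, `page.goto`, `page.waitForLoadState`, and `page.content` calls. The option names, the default timeout, the English error message, and the injectable `chromium` option (used here only so the flow can be exercised without a real browser) are assumptions, not the contents of src/headless.js.

```javascript
async function fetchPageContentWithPlaywright(url, options = {}) {
  const { headless = true, timeoutMs = 15000, userAgent, chromium: injected } = options;

  // Playwright is optional: load it lazily and fail with a readable message if absent.
  let chromium = injected;
  if (!chromium) {
    try {
      ({ chromium } = await import('playwright'));
    } catch {
      throw new Error('Playwright is not installed; the headless fallback is unavailable.');
    }
  }

  const browser = await chromium.launch({ headless });
  try {
    const context = await browser.newContext({
      userAgent,
      viewport: { width: 1280, height: 720 },
    });
    const page = await context.newPage();
    // Wait for the DOM first, then for the network to go quiet.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    await page.waitForLoadState('networkidle', { timeout: timeoutMs });
    return await page.content();
  } finally {
    // Runs on success and on error, so the browser process is always closed.
    await browser.close();
  }
}
```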

🧩 Module Architecture

Core Modules

src/index.js - Main Orchestrator

export async function scraperNetflix(inputUrl, options = {})

Responsibilities:

  • URL validation and normalization
  • Fetch strategy selection (static vs headless)
  • Error orchestration and Turkish localization
  • Result aggregation and formatting

Key Functions:

  • normalizeNetflixUrl(inputUrl) - URL processing
  • fetchStaticHtml(url, userAgent, timeoutMs) - HTTP client
  • ensureFetchGlobals() - Polyfill management

src/parser.js - HTML Processing Engine

export function parseNetflixHtml(html)

Responsibilities:

  • JSON-LD extraction and parsing
  • Title cleaning and localization
  • Year extraction from multiple fields
  • Season information detection

Key Functions:

  • parseJsonLdObject(obj) - Structured data processing
  • cleanTitle(title) - UI text removal
  • extractYear(value) - Multi-format year parsing

Turkish Localization Patterns:

const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,           // "waiting for you to watch"
  /\s+izleyin$/i,                      // "watch"
  /\s+devam et$/i,                     // "continue"
  /\s+başla$/i,                        // "start"
  /\s+izlemeye devam$/i,               // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,  // "Season X ... continue watching"
  /\s+Sezon\s+\d+.*başla$/i,           // "Season X ... start"
];
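One way such patterns could be applied is sketched below. How `cleanTitle` actually iterates over the list is not shown in this document, so the loop, the `NETFLIX_SUFFIX` regex, and the reordered `UI_PATTERNS` copy are assumptions. The season-specific patterns are placed first on purpose: applied in sequence, the generic `izlemeye devam` pattern would otherwise strip the tail first and leave "Sezon 2" in the title.

```javascript
// Same patterns as above, reordered so the most specific ones run first.
const UI_PATTERNS = [
  /\s+Sezon\s+\d+.*izlemeye devam$/i,
  /\s+Sezon\s+\d+.*başla$/i,
  /\s+izlemenizi bekliyor$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+başla$/i,
  /\s+izlemeye devam$/i,
];

// Hypothetical suffix pattern for titles such as "Dark | Netflix".
const NETFLIX_SUFFIX = /\s*[|-]\s*Netflix\s*$/i;

function cleanTitle(title) {
  if (!title) return undefined;
  let cleaned = title.replace(NETFLIX_SUFFIX, '').trim();
  for (const pattern of UI_PATTERNS) {
    cleaned = cleaned.replace(pattern, '').trim();
  }
  // An empty result is reported as "no title" rather than an empty string.
  return cleaned || undefined;
}
```

Every pattern requires leading whitespace (`\s+`), so a title that consists only of a UI word is left untouched instead of being emptied.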

src/headless.js - Browser Automation

export async function fetchPageContentWithPlaywright(url, options)

Responsibilities:

  • Playwright browser management
  • Page navigation and content extraction
  • Resource cleanup and error handling

Browser Configuration:

  • Viewport: 1280x720 (standard desktop)
  • Wait Strategy: domcontentloaded + networkidle
  • Launch Mode: Headless (configurable)

src/polyfill.js - Compatibility Layer

// File/Blob polyfill for Node.js undici compatibility

Responsibilities:

  • File API polyfill for undici fetch
  • Node.js 18+ compatibility
  • Minimal footprint

📊 Data Flow Architecture

Input Processing

interface Input {
  url: string;           // Netflix URL
  options?: {
    headless?: boolean;    // Enable/disable Playwright
    timeoutMs?: number;    // Request timeout
    userAgent?: string;    // Custom User-Agent
  };
}

Output Schema

interface NetflixMetadata {
  url: string;                    // Normalized URL
  id: string;                     // Netflix title ID
  name: string;                   // Clean title
  year: string | number | undefined; // Release year
  seasons: string | null;         // Season info for series
}

Internal Data Structures

JSON-LD Processing

const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
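A sketch of how these constants might drive year and series detection. The helpers `findYear` and `isSeries` are hypothetical names, and returning the year as a number restricted to 1900-2099 is an assumption (the output schema also allows strings).

```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated',
];
const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];

// Accepts a number (2010), a date string ("2010-07-16"), or a nested
// object carrying startDate (as releasedEvent sometimes does).
function extractYear(value) {
  if (typeof value === 'number') {
    return Number.isInteger(value) && value >= 1900 && value <= 2099 ? value : undefined;
  }
  if (typeof value === 'string') {
    const match = value.match(/\b(19|20)\d{2}\b/);
    return match ? Number(match[0]) : undefined;
  }
  if (value && typeof value === 'object') {
    return extractYear(value.startDate);
  }
  return undefined;
}

// Returns the first usable year, checking fields in YEAR_FIELDS order.
function findYear(jsonLd) {
  for (const field of YEAR_FIELDS) {
    const year = extractYear(jsonLd?.[field]);
    if (year !== undefined) return year;
  }
  return undefined;
}

// "@type" may be a single string or an array of strings.
function isSeries(jsonLd) {
  const types = [].concat(jsonLd?.['@type'] ?? []);
  return types.some((type) => SEASON_TYPES.includes(type));
}
```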

Error Handling

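// Wraps a lower-level failure together with the context in which it occurred
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.name = 'NetflixScrapingError'; // so logs and stack traces show the class name
    this.originalError = originalError; // underlying fetch, parse, or browser error
    this.context = context;             // caller-supplied details about the failure
  }
}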
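As an illustration of how such an error could be raised, the sketch below wraps one pipeline stage. The `runStage` helper, the shape of `context`, and the English message are assumptions; the library itself reports errors in Turkish.

```javascript
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.name = 'NetflixScrapingError';
    this.originalError = originalError;
    this.context = context;
  }
}

// Hypothetical helper: run one stage and tag any failure with where it happened.
async function runStage(stage, url, task) {
  try {
    return await task();
  } catch (err) {
    // Errors already wrapped by an inner stage pass through unchanged.
    if (err instanceof NetflixScrapingError) throw err;
    throw new NetflixScrapingError(`${stage} failed for ${url}`, err, { stage, url });
  }
}
```

Callers can then branch on `err.context.stage` while the original stack trace stays available on `err.originalError`.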

🔧 Technical Implementation Details

Fetch Strategy Selection Algorithm

function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}

Decision Logic:

  1. Static First: Always try static parsing first (faster and lighter)
  2. Missing Data: If the title or year is missing, fall back to headless mode
  3. Configurable: Headless mode can be forced or disabled entirely
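The decision logic can be sketched as a small orchestration function. `scrapeWithFallback` and its injected `deps` (`fetchStatic`, `fetchHeadless`, `parse`) are hypothetical stand-ins for the real modules, and the merge rule at the end is an assumption; only the `headless` boolean from the input options is modeled.

```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}

async function scrapeWithFallback(url, deps, options = {}) {
  const { fetchStatic, fetchHeadless, parse } = deps;
  const { headless = true } = options;

  // 1. Static first: one HTTP request, no browser.
  const staticMeta = parse(await fetchStatic(url));
  if (!needsHeadless(staticMeta) || !headless) return staticMeta;

  // 2. Title or year missing: render the page in a browser and parse again.
  const renderedMeta = parse(await fetchHeadless(url));

  // 3. Keep static values for any field the rendered pass did not produce.
  const merged = { ...staticMeta };
  for (const [key, value] of Object.entries(renderedMeta ?? {})) {
    if (value !== undefined && value !== null) merged[key] = value;
  }
  return merged;
}
```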

Error Recovery Patterns

Network Errors

  • Timeout handling with AbortController
  • HTTP status code validation
  • Retry logic for transient failures
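Retry handling for transient failures could take the following shape. The `withRetry` helper, its option names, and the exponential backoff schedule are assumptions for illustration; the document does not specify how retries are implemented.

```javascript
// Hypothetical helper: retry an async operation with exponential backoff.
async function withRetry(operation, options = {}) {
  const { retries = 2, baseDelayMs = 250, isTransient = () => true } = options;
  let lastError;

  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Give up on the last attempt or when the error is not worth retrying.
      if (attempt === retries || !isTransient(err)) break;
      // Delay doubles each time: baseDelayMs, 2x, 4x, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

A caller would typically pass an `isTransient` predicate that accepts timeouts and 5xx responses but rejects 404s, since retrying a missing title only adds latency.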

Parsing Errors

  • Graceful JSON-LD error handling
  • Multiple title extraction strategies
  • Fallback to basic meta tags

Browser Errors

  • Playwright detection and graceful messaging
  • Browser process cleanup on errors
  • Memory leak prevention

🎯 Performance Optimizations

Static Mode Optimizations

  • Single HTTP Request: Minimal network overhead
  • String Parsing: Fast regex-based title cleaning
  • Memory Efficient: No browser overhead
  • Cache-Friendly: Deterministic output

Headless Mode Optimizations

  • Browser Pooling: Reuse browser instances (future enhancement)
  • Selective Resources: Block unnecessary requests
  • Early Termination: Stop when required data found
  • Timeout Protection: Prevent hanging operations
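Selective resource loading could be done with Playwright's `page.route` request interception, as in this sketch. The helper name `blockHeavyResources` and the choice of blocked resource types are assumptions; whether the library currently blocks anything is not stated here.

```javascript
// Resource types that are not needed to read metadata from the DOM.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Hypothetical helper: abort heavy requests, let everything else through.
async function blockHeavyResources(page) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    return BLOCKED_RESOURCE_TYPES.has(type) ? route.abort() : route.continue();
  });
}
```

The route must be registered before `page.goto` is called, otherwise the initial requests are already in flight.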

Memory Management

// Always cleanup browser resources
try {
  return await page.content();
} finally {
  await browser.close();
}

🔒 Security Architecture

Input Validation

  • URL format validation with regex patterns
  • Netflix domain verification
  • Path traversal prevention

Request Security

  • Configurable User-Agent strings
  • Rate limiting considerations
  • Request header standardization

Data Sanitization

  • HTML entity decoding
  • XSS prevention in title extraction
  • Structured data validation
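HTML entity decoding for extracted titles might be implemented along these lines. This is a minimal sketch covering numeric entities and a handful of named ones; the `decodeHtmlEntities` name and the entity table are assumptions, and a production parser would need the full HTML entity list.

```javascript
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'],
  ['quot', '"'], ['apos', "'"], ['nbsp', '\u00a0'],
]);

function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range code points untouched instead of throwing.
      return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown named entities are preserved as written.
    return NAMED_ENTITIES.get(body.toLowerCase()) ?? whole;
  });
}
```

Decoding yields plain text, not markup: consumers that render the title into HTML still need to escape it on output.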

🔮 Extensibility Points

Future Enhancements

1. Multiple Language Support

// Architecture ready for additional languages
const LOCALIZATION_PATTERNS = {
  tr: TURKISH_UI_PATTERNS,
  es: SPANISH_UI_PATTERNS, // hypothetical; not defined yet
  // ... future languages
};

2. Caching Layer

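// Hook points for caching integration.
// `cache` and `ttl` are placeholders for a cache client and an expiry supplied by the integrator.
const cacheMiddleware = {
  get: (url) => cache.get(url),
  set: (url, data) => cache.set(url, data, ttl),
};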
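A self-contained version of this idea, using an in-memory `Map` with expiry, could look as follows. `createTtlCache` and `cachedScrape` are hypothetical names, and the injectable `now` clock exists only to make expiry testable; nothing here is part of the current codebase.

```javascript
// Hypothetical in-memory cache whose entries expire after ttlMs.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // drop stale entries on read
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Wraps any scrape function: serve from cache on a hit, populate on a miss.
async function cachedScrape(url, scrape, cache) {
  const hit = cache.get(url);
  if (hit !== undefined) return hit;
  const data = await scrape(url);
  cache.set(url, data);
  return data;
}
```

Keying on the normalized URL rather than the raw input would let `/tr/title/{id}` and `/title/{id}` share one cache entry.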

3. Browser Pool Management

// Scalable browser resource management
class BrowserPool {
  constructor(maxSize = 5) {
    this.maxSize = maxSize;
    this.pool = [];
  }
}
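The skeleton above leaves acquisition and release open. One possible shape is sketched below; the injected `launchBrowser` factory, the `active` and `waiters` fields, and the `acquire`, `release`, and `closeAll` methods are all assumptions about a feature that does not exist yet.

```javascript
class BrowserPool {
  constructor(launchBrowser, maxSize = 5) {
    this.launchBrowser = launchBrowser; // e.g. () => chromium.launch()
    this.maxSize = maxSize;
    this.pool = [];    // idle browsers ready for reuse
    this.active = 0;   // browsers launched and not yet closed
    this.waiters = []; // resolvers for callers waiting on a free browser
  }

  async acquire() {
    if (this.pool.length > 0) return this.pool.pop();
    if (this.active < this.maxSize) {
      this.active += 1;
      try {
        return await this.launchBrowser();
      } catch (err) {
        this.active -= 1; // a failed launch must not consume a slot
        throw err;
      }
    }
    // Pool exhausted: wait until another caller releases a browser.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(browser) {
    const next = this.waiters.shift();
    if (next) next(browser); // hand over directly to the oldest waiter
    else this.pool.push(browser);
  }

  // Closes idle browsers only; browsers still in use stay open.
  async closeAll() {
    const idle = this.pool.splice(0);
    this.active -= idle.length;
    await Promise.all(idle.map((browser) => browser.close()));
  }
}
```

Callers would pair `acquire` with `release` in a `try`/`finally`, mirroring the cleanup pattern used for single browsers.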

4. Netflix API Integration

// Potential Netflix API integration point
class NetflixAPIClient {
  async getMetadata(titleId) {
    // Direct API calls when available
  }
}

📈 Monitoring & Observability

Logging Strategy

  • Progress Logs: Pass/fail indicators
  • Error Logs: Detailed error context with Turkish messages
  • Performance Logs: Timing information (future enhancement)

Metrics Collection

  • Success/failure rates per mode
  • Response time distributions
  • Error categorization
  • Resource utilization

🧪 Testing Architecture

Test Categories

  1. Unit Tests: Individual function testing
  2. Integration Tests: Full workflow testing
  3. Live Tests: Real Netflix URL testing
  4. Performance Tests: Benchmarking

Test Data Management

tests/fixtures/
├── sample-title.html     # Static test HTML
├── turkish-ui.json       # Turkish UI patterns
└── test-urls.json        # Test URL collection

Architecture documentation last updated: 2025-11-23