# MetaScraper Architecture Documentation

## 🏗️ System Architecture Overview

MetaScraper is a Node.js library designed for extracting metadata from Netflix title pages. The architecture emphasizes reliability, performance, and maintainability through a modular design.

### Core Design Principles

1. **Dual-Mode Operation**: Static HTML parsing with Playwright fallback
2. **Graceful Degradation**: Continue operation even when optional dependencies fail
3. **Localization-Aware**: Built-in support for Turkish Netflix interfaces
4. **Error Resilience**: Comprehensive error handling with Turkish error messages
5. **Modern JavaScript**: ES6+ modules with Node.js 18+ compatibility

## 🔄 System Flow

```
Input URL → URL Normalization → Static HTML Fetch → HTML Parsing → Success?
                  ↓                                                   ↓
                Error                                         Headless Fallback
                  ↓                                                   ↓
               Return ← HTML Parsing ← Browser Execution ← Playwright Launch
```

### Detailed Flow Analysis

#### 1. URL Normalization (`src/index.js:21-48`)

- Validates Netflix URL format
- Extracts Netflix title ID from various URL patterns
- Normalizes to standard format: `https://www.netflix.com/title/{id}`

**Supported URL Patterns:**

- `https://www.netflix.com/tr/title/82123114?s=i&trkid=264356104&vlang=tr`
- `https://www.netflix.com/title/80189685`
- `https://www.netflix.com/tr/title/70195800?trackId=12345`

#### 2. Static HTML Fetch (`src/index.js:99-128`)

- Uses native `fetch` API with undici polyfill support
- Configurable timeout and User-Agent
- Comprehensive error handling for network issues

#### 3. HTML Parsing (`src/parser.js:134-162`)

- **Primary Strategy**: JSON-LD structured data extraction
- **Fallback Strategy**: Meta tags and title element parsing
- **Title Cleaning**: Removes Turkish UI text and Netflix suffixes

#### 4. Headless Fallback (`src/headless.js:9-41`)

- Optional Playwright integration
- Chromium browser automation
- Network idle detection for complete page loads

## 🧩 Module Architecture

### Core Modules

#### `src/index.js` - Main Orchestrator

```javascript
export async function scraperNetflix(inputUrl, options = {})
```

**Responsibilities:**

- URL validation and normalization
- Fetch strategy selection (static vs headless)
- Error orchestration and Turkish localization
- Result aggregation and formatting

**Key Functions:**

- `normalizeNetflixUrl(inputUrl)` - URL processing
- `fetchStaticHtml(url, userAgent, timeoutMs)` - HTTP client
- `ensureFetchGlobals()` - Polyfill management

#### `src/parser.js` - HTML Processing Engine

```javascript
export function parseNetflixHtml(html)
```

**Responsibilities:**

- JSON-LD extraction and parsing
- Title cleaning and localization
- Year extraction from multiple fields
- Season information detection

**Key Functions:**

- `parseJsonLdObject(obj)` - Structured data processing
- `cleanTitle(title)` - UI text removal
- `extractYear(value)` - Multi-format year parsing

**Turkish Localization Patterns:**

```javascript
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,           // "waiting for you to watch"
  /\s+izleyin$/i,                       // "watch"
  /\s+devam et$/i,                      // "continue"
  /\s+başla$/i,                         // "start"
  /\s+izlemeye devam$/i,                // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,   // "Sezon X izlemeye devam"
  /\s+Sezon\s+\d+.*başla$/i,            // "Sezon X başla"
];
```

#### `src/headless.js` - Browser Automation

```javascript
export async function fetchPageContentWithPlaywright(url, options)
```

**Responsibilities:**

- Playwright browser management
- Page navigation and content extraction
- Resource cleanup and error handling

**Browser Configuration:**

- Viewport: 1280x720 (standard desktop)
- Wait Strategy: `domcontentloaded` + `networkidle`
- Launch Mode: Headless (configurable)

#### `src/polyfill.js` - Compatibility Layer

```javascript
// File/Blob polyfill for Node.js undici compatibility
```

**Responsibilities:**

- File API polyfill for undici fetch
- Node.js 18+ compatibility
- Minimal footprint

## 📊 Data Flow Architecture

### Input Processing

```typescript
interface Input {
  url: string;              // Netflix URL
  options?: {
    headless?: boolean;     // Enable/disable Playwright
    timeoutMs?: number;     // Request timeout
    userAgent?: string;     // Custom User-Agent
  };
}
```

### Output Schema

```typescript
interface NetflixMetadata {
  url: string;                          // Normalized URL
  id: string;                           // Netflix title ID
  name: string;                         // Clean title
  year: string | number | undefined;    // Release year
  seasons: string | null;               // Season info for series
}
```

### Internal Data Structures

#### JSON-LD Processing

```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
```

#### Error Handling

```javascript
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.originalError = originalError;
    this.context = context;
  }
}
```

## 🔧 Technical Implementation Details

### Fetch Strategy Selection Algorithm

```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}
```

**Decision Logic:**

1. **Static First**: Always try static parsing (faster, lighter)
2. **Missing Data**: If title or year missing, trigger headless
3. **Configurable**: Can force headless or disable entirely

### Error Recovery Patterns

#### Network Errors

- Timeout handling with AbortController
- HTTP status code validation
- Retry logic for transient failures

#### Parsing Errors

- Graceful JSON-LD error handling
- Multiple title extraction strategies
- Fallback to basic meta tags

#### Browser Errors

- Playwright detection and graceful messaging
- Browser process cleanup on errors
- Memory leak prevention

## 🎯 Performance Optimizations

### Static Mode Optimizations

- **Single HTTP Request**: Minimal network overhead
- **String Parsing**: Fast regex-based title cleaning
- **Memory Efficient**: No browser overhead
- **Cache-Friendly**: Deterministic output

### Headless Mode Optimizations

- **Browser Pooling**: Reuse browser instances (future enhancement)
- **Selective Resources**: Block unnecessary requests
- **Early Termination**: Stop when required data found
- **Timeout Protection**: Prevent hanging operations

### Memory Management

```javascript
// Always cleanup browser resources
try {
  return await page.content();
} finally {
  await browser.close();
}
```

## 🔒 Security Architecture

### Input Validation

- URL format validation with regex patterns
- Netflix domain verification
- Path traversal prevention

### Request Security

- Configurable User-Agent strings
- Rate limiting considerations
- Request header standardization

### Data Sanitization

- HTML entity decoding
- XSS prevention in title extraction
- Structured data validation

## 🔮 Extensibility Points

### Future Enhancements

#### 1. Multiple Language Support

```javascript
// Architecture ready for additional languages
const LOCALIZATION_PATTERNS = {
  tr: TURKISH_UI_PATTERNS,
  es: SPANISH_UI_PATTERNS,
  // ... future languages
};
```

#### 2. Caching Layer

```javascript
// Hook points for caching integration
const cacheMiddleware = {
  get: (url) => cache.get(url),
  set: (url, data) => cache.set(url, data, ttl)
};
```

#### 3. Browser Pool Management

```javascript
// Scalable browser resource management
class BrowserPool {
  constructor(maxSize = 5) {
    this.maxSize = maxSize;
    this.pool = [];
  }
}
```

#### 4. Netflix API Integration

```javascript
// Potential Netflix API integration point
class NetflixAPIClient {
  async getMetadata(titleId) {
    // Direct API calls when available
  }
}
```

## 📈 Monitoring & Observability

### Logging Strategy

- **Progress Logs**: ✅ Pass/fail indicators
- **Error Logs**: Detailed error context with Turkish messages
- **Performance Logs**: Timing information (future enhancement)

### Metrics Collection

- Success/failure rates per mode
- Response time distributions
- Error categorization
- Resource utilization

## 🧪 Testing Architecture

### Test Categories

1. **Unit Tests**: Individual function testing
2. **Integration Tests**: Full workflow testing
3. **Live Tests**: Real Netflix URL testing
4. **Performance Tests**: Benchmarking

### Test Data Management

```
tests/fixtures/
├── sample-title.html    # Static test HTML
├── turkish-ui.json      # Turkish UI patterns
└── test-urls.json       # Test URL collection
```

---

*Architecture documentation last updated: 2025-11-23*
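
To make the URL normalization step from the Detailed Flow Analysis more concrete, here is a minimal sketch of how the three supported URL patterns could be reduced to `https://www.netflix.com/title/{id}`. It is an illustration under stated assumptions, not the code in `src/index.js`: the function name `normalizeNetflixUrlSketch`, the host allow-list, the locale-prefix regex, the `{ id, url }` return shape, and the English error messages are all hypothetical (the library itself reports errors in Turkish).

```javascript
// Hypothetical sketch of URL normalization; the real logic in src/index.js may differ.

// Assumed allow-list of accepted hostnames.
const NETFLIX_HOSTS = new Set(['netflix.com', 'www.netflix.com']);

// Assumed path shape: optional locale prefix (e.g. /tr or /en-gb), then /title/{numeric id}.
const TITLE_PATH = /^\/(?:[a-z]{2}(?:-[a-z]{2})?\/)?title\/(\d+)\/?$/i;

function normalizeNetflixUrlSketch(inputUrl) {
  let parsed;
  try {
    // The WHATWG URL parser rejects malformed input by throwing.
    parsed = new URL(inputUrl);
  } catch {
    throw new Error(`Invalid URL: ${inputUrl}`);
  }

  // Domain verification: only Netflix hosts are accepted.
  if (!NETFLIX_HOSTS.has(parsed.hostname.toLowerCase())) {
    throw new Error(`Not a Netflix URL: ${inputUrl}`);
  }

  // Query strings (trkid, vlang, trackId, ...) live outside pathname,
  // so they are dropped simply by not copying them over.
  const match = parsed.pathname.match(TITLE_PATH);
  if (!match) {
    throw new Error(`No title ID found in: ${inputUrl}`);
  }

  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}

// Usage with one of the documented patterns:
const result = normalizeNetflixUrlSketch(
  'https://www.netflix.com/tr/title/70195800?trackId=12345'
);
console.log(result.url); // https://www.netflix.com/title/70195800
```

Matching against `pathname` rather than the raw string keeps tracking parameters out of the picture and avoids accepting a Netflix-looking path on a foreign host.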