MetaScraper Architecture Documentation
🏗️ System Architecture Overview
MetaScraper is a Node.js library designed for extracting metadata from Netflix title pages. The architecture emphasizes reliability, performance, and maintainability through a modular design.
Core Design Principles
- Dual-Mode Operation: Static HTML parsing with Playwright fallback
- Graceful Degradation: Continue operation even when optional dependencies fail
- Localization-Aware: Built-in support for Turkish Netflix interfaces
- Error Resilience: Comprehensive error handling with Turkish error messages
- Modern JavaScript: ES6+ modules with Node.js 18+ compatibility
🔄 System Flow
Input URL → URL Normalization → Static HTML Fetch → HTML Parsing → Success?
↓ ↓
Error Headless Fallback
↓ ↓
Return ← HTML Parsing ← Browser Execution ← Playwright Launch
Detailed Flow Analysis
1. URL Normalization (src/index.js:21-48)
- Validates Netflix URL format
- Extracts Netflix title ID from various URL patterns
- Normalizes to standard format:
https://www.netflix.com/title/{id}
Supported URL Patterns:
https://www.netflix.com/tr/title/82123114?s=i&trkid=264356104&vlang=tr
https://www.netflix.com/title/80189685
https://www.netflix.com/tr/title/70195800?trackId=12345
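The normalization step described above can be sketched roughly as follows. This is an illustration, not the code in src/index.js: the return shape ({ id, url }), the error messages, and the locale-prefix regex are all assumptions.

```javascript
// Hypothetical sketch of URL normalization; the real normalizeNetflixUrl may differ.
function normalizeNetflixUrl(inputUrl) {
  let parsed;
  try {
    parsed = new URL(inputUrl);
  } catch {
    throw new Error(`Invalid URL: ${inputUrl}`);
  }

  // Accept netflix.com and its subdomains only.
  const host = parsed.hostname.toLowerCase();
  if (host !== 'netflix.com' && !host.endsWith('.netflix.com')) {
    throw new Error(`Not a Netflix URL: ${inputUrl}`);
  }

  // Match /title/{id}, with an optional locale prefix such as /tr or /en-gb.
  const match = parsed.pathname.match(/^\/(?:[a-z]{2}(?:-[a-z]{2})?\/)?title\/(\d+)\/?$/i);
  if (!match) {
    throw new Error(`No Netflix title ID found in: ${inputUrl}`);
  }

  // Query parameters (tracking IDs, language hints) are dropped on purpose.
  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}
```

Parsing with the built-in URL class rather than a single regex keeps the domain check separate from the path check, which makes look-alike hosts easier to reject.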
2. Static HTML Fetch (src/index.js:99-128)
- Uses native fetch API with undici polyfill support
- Configurable timeout and User-Agent
- Comprehensive error handling for network issues
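A minimal sketch of a static fetch with a timeout, assuming an AbortController-based approach. The fourth parameter (fetchImpl) is not part of the documented signature; it is added here only so the example can run without network access.

```javascript
// Sketch only: fetchImpl is a hypothetical injection point, not part of the library API.
async function fetchStaticHtml(url, userAgent, timeoutMs, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      headers: { 'User-Agent': userAgent },
      redirect: 'follow',
      signal: controller.signal,
    });
    // Treat any non-2xx status as a failure.
    if (!response.ok) {
      throw new Error(`Unexpected HTTP status ${response.status} for ${url}`);
    }
    return await response.text();
  } catch (err) {
    // fetch rejects with an AbortError when the signal fires.
    if (err.name === 'AbortError') {
      throw new Error(`Request timed out after ${timeoutMs} ms: ${url}`);
    }
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```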
3. HTML Parsing (src/parser.js:134-162)
- Primary Strategy: JSON-LD structured data extraction
- Fallback Strategy: Meta tags and title element parsing
- Title Cleaning: Removes Turkish UI text and Netflix suffixes
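The two parsing strategies might look something like the sketch below. The regexes and helper names (extractJsonLd, extractFallbackTitle, parseTitle) are assumptions for illustration; the actual parser in src/parser.js may use different patterns or a real HTML parser.

```javascript
// Primary strategy: collect every parseable JSON-LD block in the page.
function extractJsonLd(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const results = [];
  for (const match of html.matchAll(pattern)) {
    try {
      results.push(JSON.parse(match[1]));
    } catch {
      // Malformed JSON-LD is skipped so the fallback strategy can still run.
    }
  }
  return results;
}

// Fallback strategy: og:title meta tag first, then the <title> element.
// Assumes the property attribute appears before content in the tag.
function extractFallbackTitle(html) {
  const og = html.match(/<meta[^>]*property=["']og:title["'][^>]*content=(["'])(.*?)\1/i);
  if (og) return og[2];
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return title ? title[1].trim() : null;
}

function parseTitle(html) {
  const withName = extractJsonLd(html).find((item) => item && typeof item.name === 'string');
  return withName ? withName.name : extractFallbackTitle(html);
}
```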
4. Headless Fallback (src/headless.js:9-41)
- Optional Playwright integration
- Chromium browser automation
- Network idle detection for complete page loads
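A sketch of the headless fallback, assuming Playwright is loaded lazily so that a missing install produces a clear error instead of a crash at import time. The loadPlaywright option and the 15000 ms default are hypothetical; the Playwright calls themselves (chromium.launch, browser.newPage, page.goto, page.waitForLoadState, page.content) are part of its public API.

```javascript
// Sketch only: loadPlaywright is a hypothetical injection point for testing.
async function fetchPageContentWithPlaywright(url, options = {}) {
  const {
    timeoutMs = 15000,
    headless = true,
    loadPlaywright = () => import('playwright'),
  } = options;

  let chromium;
  try {
    ({ chromium } = await loadPlaywright());
  } catch (err) {
    throw new Error('Playwright is not installed; headless fallback is unavailable.', { cause: err });
  }

  const browser = await chromium.launch({ headless });
  try {
    const page = await browser.newPage({ viewport: { width: 1280, height: 720 } });
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    // Wait for network activity to settle so late-rendered metadata is present.
    await page.waitForLoadState('networkidle', { timeout: timeoutMs });
    return await page.content();
  } finally {
    // Always release the browser process, even when navigation fails.
    await browser.close();
  }
}
```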
🧩 Module Architecture
Core Modules
src/index.js - Main Orchestrator
export async function scraperNetflix(inputUrl, options = {})
Responsibilities:
- URL validation and normalization
- Fetch strategy selection (static vs headless)
- Error orchestration and Turkish localization
- Result aggregation and formatting
Key Functions:
- normalizeNetflixUrl(inputUrl) - URL processing
- fetchStaticHtml(url, userAgent, timeoutMs) - HTTP client
- ensureFetchGlobals() - Polyfill management
src/parser.js - HTML Processing Engine
export function parseNetflixHtml(html)
Responsibilities:
- JSON-LD extraction and parsing
- Title cleaning and localization
- Year extraction from multiple fields
- Season information detection
Key Functions:
- parseJsonLdObject(obj) - Structured data processing
- cleanTitle(title) - UI text removal
- extractYear(value) - Multi-format year parsing
Turkish Localization Patterns:
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,            // "waiting for you to watch"
  /\s+izleyin$/i,                        // "watch"
  /\s+devam et$/i,                       // "continue"
  /\s+başla$/i,                          // "start"
  /\s+izlemeye devam$/i,                 // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,    // "Sezon X izlemeye devam"
  /\s+Sezon\s+\d+.*başla$/i,             // "Sezon X başla"
];
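One way these patterns could be applied is sketched below. Note the ordering: if the patterns run sequentially, the short "izlemeye devam" pattern would strip its suffix before the "Sezon X" pattern gets a chance to match, leaving "Sezon 2" in the title. This sketch therefore tries the season-specific patterns first. The " | Netflix" suffix pattern is an assumption about the page title format.

```javascript
// Season-specific patterns come first so the whole "Sezon X ..." tail is removed.
const TURKISH_UI_PATTERNS = [
  /\s+Sezon\s+\d+.*izlemeye devam$/i,
  /\s+Sezon\s+\d+.*başla$/i,
  /\s+izlemenizi bekliyor$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+izlemeye devam$/i,
  /\s+başla$/i,
];

// Assumed suffix format, e.g. "Title | Netflix" or "Title - Netflix".
const NETFLIX_SUFFIX = /\s*[|\-]\s*Netflix\s*$/i;

function cleanTitle(title) {
  if (!title) return '';
  let cleaned = title.trim().replace(NETFLIX_SUFFIX, '');
  for (const pattern of TURKISH_UI_PATTERNS) {
    cleaned = cleaned.replace(pattern, '');
  }
  return cleaned.trim();
}
```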
src/headless.js - Browser Automation
export async function fetchPageContentWithPlaywright(url, options)
Responsibilities:
- Playwright browser management
- Page navigation and content extraction
- Resource cleanup and error handling
Browser Configuration:
- Viewport: 1280x720 (standard desktop)
- Wait Strategy: domcontentloaded + networkidle
- Launch Mode: Headless (configurable)
src/polyfill.js - Compatibility Layer
// File/Blob polyfill for Node.js undici compatibility
Responsibilities:
- File API polyfill for undici fetch
- Node.js 18+ compatibility
- Minimal footprint
📊 Data Flow Architecture
Input Processing
interface Input {
  url: string;              // Netflix URL
  options?: {
    headless?: boolean;     // Enable/disable Playwright
    timeoutMs?: number;     // Request timeout
    userAgent?: string;     // Custom User-Agent
  };
}
Output Schema
interface NetflixMetadata {
  url: string;                          // Normalized URL
  id: string;                           // Netflix title ID
  name: string;                         // Clean title
  year: string | number | undefined;    // Release year
  seasons: string | null;               // Season info for series
}
Internal Data Structures
JSON-LD Processing
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];
const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
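A sketch of how these constants might drive year extraction and series detection. It assumes that a field value can be a number, a date string, or a nested object (as releasedEvent usually is in schema.org data), and it returns the year as a number; findYear and isSeries are hypothetical helper names.

```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated',
];
const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];

// Accepts a number, a string containing a four-digit year, or a nested object.
function extractYear(value) {
  if (typeof value === 'number') {
    return Number.isInteger(value) && value >= 1900 && value <= 2100 ? value : undefined;
  }
  if (typeof value === 'string') {
    const match = value.match(/\b(?:19|20)\d{2}\b/);
    return match ? Number(match[0]) : undefined;
  }
  if (value && typeof value === 'object') {
    return findYear(value);
  }
  return undefined;
}

// Returns the first usable year, checking fields in YEAR_FIELDS order.
function findYear(jsonLd) {
  for (const field of YEAR_FIELDS) {
    const year = extractYear(jsonLd[field]);
    if (year !== undefined) return year;
  }
  return undefined;
}

// @type may be a single string or an array of strings.
function isSeries(jsonLd) {
  return [].concat(jsonLd['@type'] ?? []).some((type) => SEASON_TYPES.includes(type));
}
```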
Error Handling
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.originalError = originalError;
    this.context = context;
  }
}
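A possible way to use this class is to wrap each pipeline stage so that low-level failures surface with a localized message and their original cause attached. The wrapper name (withScrapingContext), the name property, and the Turkish message text are illustrative assumptions, not strings taken from the library.

```javascript
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.name = 'NetflixScrapingError'; // assumption: not shown in the original class
    this.originalError = originalError;
    this.context = context;
  }
}

// Runs a task and converts unexpected failures into NetflixScrapingError.
async function withScrapingContext(context, task) {
  try {
    return await task();
  } catch (err) {
    // Errors that are already wrapped pass through unchanged.
    if (err instanceof NetflixScrapingError) throw err;
    // Illustrative message: "The Netflix page could not be fetched."
    throw new NetflixScrapingError('Netflix sayfası alınamadı.', err, context);
  }
}
```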
🔧 Technical Implementation Details
Fetch Strategy Selection Algorithm
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}
Decision Logic:
- Static First: Always try static parsing (faster, lighter)
- Missing Data: If title or year missing, trigger headless
- Configurable: Can force headless or disable entirely
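The decision logic above might be wired together as in this sketch. The scrapeWithFallback name and its injected dependencies (fetchStatic, fetchHeadless, parse) are hypothetical; it covers the static-first and disable cases, and leaves out forcing headless.

```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}

// Sketch: static first, headless only when data is missing and it is allowed.
async function scrapeWithFallback(url, { fetchStatic, fetchHeadless, parse, headless = true }) {
  const staticMeta = parse(await fetchStatic(url));
  if (!headless || !needsHeadless(staticMeta)) return staticMeta;

  const headlessMeta = parse(await fetchHeadless(url));
  // Keep static values and let non-empty headless values override them.
  const merged = { ...staticMeta };
  for (const [key, value] of Object.entries(headlessMeta ?? {})) {
    if (value !== undefined && value !== null && value !== '') merged[key] = value;
  }
  return merged;
}
```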
Error Recovery Patterns
Network Errors
- Timeout handling with AbortController
- HTTP status code validation
- Retry logic for transient failures
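Retry logic for transient failures could take the form of a small wrapper with exponential backoff. This is a generic sketch; the option names (retries, baseDelayMs, isTransient) and defaults are assumptions rather than the library's configuration.

```javascript
// Retries a task up to `retries` extra times, doubling the delay each attempt.
async function withRetry(task, { retries = 2, baseDelayMs = 200, isTransient = () => true } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Stop when attempts are exhausted or the error is not worth retrying.
      if (attempt === retries || !isTransient(err)) break;
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

A caller would typically pass an isTransient predicate that accepts timeouts and 5xx responses but rejects 404s, so that permanently missing titles fail fast.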
Parsing Errors
- Graceful JSON-LD error handling
- Multiple title extraction strategies
- Fallback to basic meta tags
Browser Errors
- Playwright detection and graceful messaging
- Browser process cleanup on errors
- Memory leak prevention
🎯 Performance Optimizations
Static Mode Optimizations
- Single HTTP Request: Minimal network overhead
- String Parsing: Fast regex-based title cleaning
- Memory Efficient: No browser overhead
- Cache-Friendly: Deterministic output
Headless Mode Optimizations
- Browser Pooling: Reuse browser instances (future enhancement)
- Selective Resources: Block unnecessary requests
- Early Termination: Stop when required data found
- Timeout Protection: Prevent hanging operations
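Selective resource loading can be sketched with Playwright's page.route, which intercepts requests before they are sent. The set of blocked resource types here is an assumption; metadata extraction should not need images, media, fonts, or stylesheets, but that would need checking against real pages.

```javascript
// Assumed list of resource types that are not needed for metadata extraction.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Registers a route handler that aborts requests for blocked resource types.
async function blockUnneededResources(page) {
  await page.route('**/*', (route) => {
    if (BLOCKED_RESOURCE_TYPES.has(route.request().resourceType())) {
      return route.abort();
    }
    return route.continue();
  });
}
```

The function would be called once on a new page, before page.goto, so the handler is in place for the first navigation.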
Memory Management
// Always cleanup browser resources
try {
  return await page.content();
} finally {
  await browser.close();
}
🔒 Security Architecture
Input Validation
- URL format validation with regex patterns
- Netflix domain verification
- Path traversal prevention
Request Security
- Configurable User-Agent strings
- Rate limiting considerations
- Request header standardization
Data Sanitization
- HTML entity decoding
- XSS prevention in title extraction
- Structured data validation
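HTML entity decoding for extracted titles might look like the sketch below. It handles numeric references and a small assumed set of named entities; a production decoder would need the full HTML entity table. Decoding runs in a single pass, so an escaped sequence such as &amp;lt; becomes the text &lt; rather than a live angle bracket.

```javascript
// Small, assumed subset of named entities; unknown names are left untouched.
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'], ['quot', '"'], ['apos', "'"], ['nbsp', '\u00a0'],
]);

function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === '#') {
      const isHex = body[1].toLowerCase() === 'x';
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Leave out-of-range code points as written.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : whole;
  });
}
```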
🔮 Extensibility Points
Future Enhancements
1. Multiple Language Support
// Architecture ready for additional languages
const LOCALIZATION_PATTERNS = {
  tr: TURKISH_UI_PATTERNS,
  es: SPANISH_UI_PATTERNS,
  // ... future languages
};
2. Caching Layer
// Hook points for caching integration
const cacheMiddleware = {
  get: (url) => cache.get(url),
  set: (url, data) => cache.set(url, data, ttl)
};
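Since caching is a future enhancement, the following is only one possible shape for it: an in-memory cache with a fixed time-to-live, plus a wrapper that consults it before scraping. createTtlCache, cachedScrape, and the injectable clock are all hypothetical names introduced for this sketch.

```javascript
// In-memory cache whose entries expire ttlMs after being set.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // expired entries are evicted lazily
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Returns a cached result when present, otherwise scrapes and stores it.
async function cachedScrape(url, scrape, cache) {
  const hit = cache.get(url);
  if (hit !== undefined) return hit;
  const result = await scrape(url);
  cache.set(url, result);
  return result;
}
```

Keying on the normalized URL rather than the raw input would let differently tracked links to the same title share one cache entry.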
3. Browser Pool Management
// Scalable browser resource management
class BrowserPool {
  constructor(maxSize = 5) {
    this.maxSize = maxSize;
    this.pool = [];
  }
}
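To show how such a pool might behave once acquire and release are added, here is a speculative extension of the skeleton above. The launch callback, the idle and waiters fields, and the method names are assumptions; callers beyond maxSize simply wait until a browser is released.

```javascript
// Speculative sketch: `launch` is a hypothetical async factory for browsers.
class BrowserPool {
  constructor(maxSize = 5, launch) {
    this.maxSize = maxSize;
    this.launch = launch;
    this.idle = [];     // browsers ready for reuse
    this.waiters = [];  // resolvers for callers waiting on a browser
    this.total = 0;     // browsers launched and not yet closed
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.total < this.maxSize) {
      this.total += 1;
      try {
        return await this.launch();
      } catch (err) {
        this.total -= 1; // a failed launch must not consume a slot
        throw err;
      }
    }
    // Pool is full: wait until release() hands over a browser.
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(browser) {
    const waiter = this.waiters.shift();
    if (waiter) waiter(browser);
    else this.idle.push(browser);
  }

  async closeAll() {
    const idle = this.idle.splice(0);
    this.total -= idle.length;
    await Promise.all(idle.map((browser) => browser.close()));
  }
}
```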
4. Netflix API Integration
// Potential Netflix API integration point
class NetflixAPIClient {
  async getMetadata(titleId) {
    // Direct API calls when available
  }
}
📈 Monitoring & Observability
Logging Strategy
- Progress Logs: ✅ Pass/fail indicators
- Error Logs: Detailed error context with Turkish messages
- Performance Logs: Timing information (future enhancement)
Metrics Collection
- Success/failure rates per mode
- Response time distributions
- Error categorization
- Resource utilization
🧪 Testing Architecture
Test Categories
- Unit Tests: Individual function testing
- Integration Tests: Full workflow testing
- Live Tests: Real Netflix URL testing
- Performance Tests: Benchmarking
Test Data Management
tests/fixtures/
├── sample-title.html # Static test HTML
├── turkish-ui.json # Turkish UI patterns
└── test-urls.json # Test URL collection
Architecture documentation last updated: 2025-11-23