# MetaScraper Architecture Documentation

## 🏗️ System Architecture Overview

MetaScraper is a Node.js library designed for extracting metadata from Netflix title pages. The architecture emphasizes reliability, performance, and maintainability through a modular design.

### Core Design Principles

1. **Dual-Mode Operation**: Static HTML parsing with Playwright fallback
2. **Graceful Degradation**: Continue operation even when optional dependencies fail
3. **Localization-Aware**: Built-in support for Turkish Netflix interfaces
4. **Error Resilience**: Comprehensive error handling with Turkish error messages
5. **Modern JavaScript**: ES6+ modules with Node.js 18+ compatibility

## 🔄 System Flow

```
Input URL → URL Normalization → Static HTML Fetch → HTML Parsing → Success?
                     ↓                                                 ↓
                   Error                                       Headless Fallback
                     ↓                                                 ↓
                   Return ← HTML Parsing ← Browser Execution ← Playwright Launch
```

### Detailed Flow Analysis

#### 1. URL Normalization (`src/index.js:21-48`)
- Validates Netflix URL format
- Extracts Netflix title ID from various URL patterns
- Normalizes to standard format: `https://www.netflix.com/title/{id}`

**Supported URL Patterns:**
- `https://www.netflix.com/tr/title/82123114?s=i&trkid=264356104&vlang=tr`
- `https://www.netflix.com/title/80189685`
- `https://www.netflix.com/tr/title/70195800?trackId=12345`

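
The normalization steps above can be sketched as follows. This is a minimal illustration, not the code in `src/index.js`: the return shape (`{ id, url }`), the error messages, and the locale-segment regex are assumptions made for the example.

```javascript
// Hypothetical sketch of URL normalization; the real normalizeNetflixUrl may differ.
function normalizeNetflixUrl(inputUrl) {
  let parsed;
  try {
    parsed = new URL(inputUrl);
  } catch {
    throw new Error(`Invalid URL: ${inputUrl}`);
  }

  // Accept netflix.com and its subdomains only.
  const host = parsed.hostname.toLowerCase();
  if (host !== 'netflix.com' && !host.endsWith('.netflix.com')) {
    throw new Error(`Not a Netflix URL: ${inputUrl}`);
  }

  // Optional locale segment (e.g. "tr" or "en-gb"), then /title/{numeric id}.
  const match = parsed.pathname.match(/^\/(?:[a-z]{2}(?:-[a-z]{2})?\/)?title\/(\d+)\/?$/i);
  if (!match) {
    throw new Error(`No Netflix title ID found in: ${inputUrl}`);
  }

  // Query parameters (tracking IDs, language hints) are dropped on purpose.
  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}
```

Under these assumptions, all three supported patterns listed above reduce to the same `https://www.netflix.com/title/{id}` form, and URLs on other hosts are rejected before any network request is made.
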
#### 2. Static HTML Fetch (`src/index.js:99-128`)
- Uses native `fetch` API with undici polyfill support
- Configurable timeout and User-Agent
- Comprehensive error handling for network issues

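
A static fetch with a timeout and a custom User-Agent might look like the sketch below. The first three parameters follow the `fetchStaticHtml(url, userAgent, timeoutMs)` signature named in this document; the fourth parameter (`fetchImpl`) and the error messages are assumptions added so the example can be exercised without network access.

```javascript
// Hypothetical sketch; fetchImpl is an injected dependency added for this example only.
async function fetchStaticHtml(url, userAgent, timeoutMs, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  // Abort the request if it runs longer than timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      headers: { 'User-Agent': userAgent },
      signal: controller.signal,
      redirect: 'follow',
    });
    // Treat non-2xx responses as errors instead of parsing an error page.
    if (!response.ok) {
      throw new Error(`HTTP ${response.status} for ${url}`);
    }
    return await response.text();
  } catch (err) {
    if (err.name === 'AbortError') {
      throw new Error(`Request timed out after ${timeoutMs} ms: ${url}`);
    }
    throw err;
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

Clearing the timer in `finally` covers both the success path and every error path.
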
#### 3. HTML Parsing (`src/parser.js:134-162`)
- **Primary Strategy**: JSON-LD structured data extraction
- **Fallback Strategy**: Meta tags and title element parsing
- **Title Cleaning**: Removes Turkish UI text and Netflix suffixes

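
The primary and fallback strategies above could be combined roughly as follows. The helper names (`extractJsonLd`, `extractFallbackTitle`, `parseTitle`) are hypothetical, and the regex-based extraction is a simplification that assumes `property` precedes `content` in the `og:title` tag; the actual parser may use a different approach.

```javascript
// Hypothetical sketch of JSON-LD extraction with a meta/title fallback.
function extractJsonLd(html) {
  const blocks = [];
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Malformed JSON-LD is skipped so the fallback strategy can still run.
    }
  }
  return blocks;
}

function extractFallbackTitle(html) {
  // Prefer og:title, then the <title> element.
  const og = html.match(/<meta[^>]*property=["']og:title["'][^>]*content=(["'])(.*?)\1/i);
  if (og) return og[2].trim();
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return title ? title[1].trim() : undefined;
}

function parseTitle(html) {
  // Use the first JSON-LD object that carries a string `name`.
  const withName = extractJsonLd(html).find((obj) => obj && typeof obj.name === 'string');
  return withName ? withName.name : extractFallbackTitle(html);
}
```

Swallowing the `JSON.parse` error is deliberate here: a broken structured-data block should degrade to the meta-tag path rather than abort the whole parse.
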
#### 4. Headless Fallback (`src/headless.js:9-41`)
- Optional Playwright integration
- Chromium browser automation
- Network idle detection for complete page loads

## 🧩 Module Architecture

### Core Modules

#### `src/index.js` - Main Orchestrator
```javascript
export async function scraperNetflix(inputUrl, options = {})
```

**Responsibilities:**
- URL validation and normalization
- Fetch strategy selection (static vs headless)
- Error orchestration and Turkish localization
- Result aggregation and formatting

**Key Functions:**
- `normalizeNetflixUrl(inputUrl)` - URL processing
- `fetchStaticHtml(url, userAgent, timeoutMs)` - HTTP client
- `ensureFetchGlobals()` - Polyfill management

#### `src/parser.js` - HTML Processing Engine
```javascript
export function parseNetflixHtml(html)
```

**Responsibilities:**
- JSON-LD extraction and parsing
- Title cleaning and localization
- Year extraction from multiple fields
- Season information detection

**Key Functions:**
- `parseJsonLdObject(obj)` - Structured data processing
- `cleanTitle(title)` - UI text removal
- `extractYear(value)` - Multi-format year parsing

**Turkish Localization Patterns:**
```javascript
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,          // "waiting for you to watch"
  /\s+izleyin$/i,                      // "watch"
  /\s+devam et$/i,                     // "continue"
  /\s+başla$/i,                        // "start"
  /\s+izlemeye devam$/i,               // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,  // "Season X ... continue watching"
  /\s+Sezon\s+\d+.*başla$/i,           // "Season X ... start"
];
```

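
One way these patterns could be applied is sketched below. The `NETFLIX_SUFFIX` regex is an assumption (the document does not show the exact suffix format), and the patterns are reordered so the season-specific ones run before the generic ones; otherwise a generic pattern could strip the tail first and leave a dangling "Sezon 2".

```javascript
// Same patterns as listed above, with the season-specific ones moved to the front.
const TURKISH_UI_PATTERNS = [
  /\s+Sezon\s+\d+.*izlemeye devam$/i,
  /\s+Sezon\s+\d+.*başla$/i,
  /\s+izlemenizi bekliyor$/i,
  /\s+izlemeye devam$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+başla$/i,
];

// Assumed suffix shape such as " | Netflix" or " - Netflix"; not confirmed by the source.
const NETFLIX_SUFFIX = /\s*[|-]\s*Netflix.*$/i;

// Hypothetical cleanTitle sketch: strip the site suffix, then trailing UI phrases.
function cleanTitle(title) {
  if (typeof title !== 'string') return undefined;
  let cleaned = title.replace(NETFLIX_SUFFIX, '').trim();
  for (const pattern of TURKISH_UI_PATTERNS) {
    cleaned = cleaned.replace(pattern, '');
  }
  return cleaned.trim();
}
```

Because every pattern is anchored with `$`, only trailing UI text is removed; a title that merely contains one of these words in the middle is left alone.
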
#### `src/headless.js` - Browser Automation
```javascript
export async function fetchPageContentWithPlaywright(url, options)
```

**Responsibilities:**
- Playwright browser management
- Page navigation and content extraction
- Resource cleanup and error handling

**Browser Configuration:**
- Viewport: 1280x720 (standard desktop)
- Wait Strategy: `domcontentloaded` + `networkidle`
- Launch Mode: Headless (configurable)

#### `src/polyfill.js` - Compatibility Layer
```javascript
// File/Blob polyfill for Node.js undici compatibility
```

**Responsibilities:**
- File API polyfill for undici fetch
- Node.js 18+ compatibility
- Minimal footprint

## 📊 Data Flow Architecture

### Input Processing
```typescript
interface Input {
  url: string;                  // Netflix URL
  options?: {
    headless?: boolean;         // Enable/disable Playwright
    timeoutMs?: number;         // Request timeout
    userAgent?: string;         // Custom User-Agent
  };
}
```

### Output Schema
```typescript
interface NetflixMetadata {
  url: string;                          // Normalized URL
  id: string;                           // Netflix title ID
  name: string;                         // Clean title
  year: string | number | undefined;    // Release year
  seasons: string | null;               // Season info for series
}
```

### Internal Data Structures

#### JSON-LD Processing
```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
```

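
As an illustration of how `YEAR_FIELDS` might drive year extraction, the sketch below walks the fields in order and accepts numbers, date strings, and nested event objects. The accepted year range, the nested-object handling, and the `findYear` helper are assumptions; this version always returns a number, whereas the output schema also allows strings.

```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

// Hypothetical extractYear sketch: pull a four-digit year out of several value shapes.
function extractYear(value) {
  if (typeof value === 'number' && Number.isInteger(value)) {
    return value >= 1900 && value <= 2100 ? value : undefined;
  }
  if (typeof value === 'string') {
    const match = value.match(/\b(19|20)\d{2}\b/);
    return match ? Number(match[0]) : undefined;
  }
  // Nested objects such as releasedEvent: { startDate: "..." } (assumed shape).
  if (value && typeof value === 'object') {
    return extractYear(value.startDate ?? value.datePublished);
  }
  return undefined;
}

// Hypothetical helper: return the first usable year, in YEAR_FIELDS order.
function findYear(jsonLd) {
  for (const field of YEAR_FIELDS) {
    const year = extractYear(jsonLd?.[field]);
    if (year !== undefined) return year;
  }
  return undefined;
}
```

Field order doubles as priority here: `datePublished` wins over `dateCreated` when both are present.
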
#### Error Handling
```javascript
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.originalError = originalError;
    this.context = context;
  }
}
```

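
A possible way to use this error class is to wrap each pipeline step so that failures carry the step name and request context. The `withContext` helper and the `name` property below are illustrative additions, not part of the documented API.

```javascript
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.name = 'NetflixScrapingError'; // added here for clearer stack traces
    this.originalError = originalError;
    this.context = context;
  }
}

// Hypothetical helper: run one pipeline step and wrap any failure with context.
async function withContext(step, context, fn) {
  try {
    return await fn();
  } catch (err) {
    // Errors that are already wrapped pass through unchanged.
    if (err instanceof NetflixScrapingError) throw err;
    throw new NetflixScrapingError(`${step} failed: ${err.message}`, err, context);
  }
}
```

Callers can then branch on `err.context` (for example the URL or mode) while still reaching the underlying failure through `err.originalError`.
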
## 🔧 Technical Implementation Details

### Fetch Strategy Selection Algorithm
```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}
```

**Decision Logic:**
1. **Static First**: Always try static parsing (faster, lighter)
2. **Missing Data**: If title or year missing, trigger headless
3. **Configurable**: Can force headless or disable entirely

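
The decision logic above can be sketched as a small orchestration function. Everything except `needsHeadless` is hypothetical: the fetchers and parser are injected as `deps`, the `headless` option is assumed to default to `true`, and the `{ meta, mode }` return shape exists only for this example. Forcing headless mode is not shown.

```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}

// Hypothetical orchestration sketch: static first, headless only when data is missing.
async function scrapeWithFallback(url, deps, options = {}) {
  const { fetchStatic, fetchHeadless, parse } = deps;
  const { headless = true } = options;

  const staticMeta = parse(await fetchStatic(url));
  if (!needsHeadless(staticMeta) || !headless) {
    return { meta: staticMeta, mode: 'static' };
  }

  const renderedMeta = parse(await fetchHeadless(url));
  // Only defined values from the rendered page override the static result.
  const defined = Object.fromEntries(
    Object.entries(renderedMeta ?? {}).filter(([, value]) => value != null)
  );
  return { meta: { ...staticMeta, ...defined }, mode: 'headless' };
}
```

Merging only defined values means a title found statically is kept even if the rendered page fails to yield one.
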
### Error Recovery Patterns

#### Network Errors
- Timeout handling with AbortController
- HTTP status code validation
- Retry logic for transient failures

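
Retry logic for transient failures is commonly written as a loop with exponential backoff. The helper below is a generic sketch under that assumption; the name `withRetry`, its option names, and its defaults are hypothetical and do not describe the library's actual retry policy.

```javascript
// Hypothetical retry helper with exponential backoff.
async function withRetry(fn, { retries = 2, baseDelayMs = 250, isTransient = () => true } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Stop when attempts are exhausted or the error is not worth retrying.
      if (attempt === retries || !isTransient(err)) break;
      // Wait baseDelayMs, then 2x, then 4x, and so on.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

An `isTransient` predicate might, for instance, return `true` for timeouts and 5xx responses and `false` for a 404, so that permanent failures surface immediately.
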
#### Parsing Errors
- Graceful JSON-LD error handling
- Multiple title extraction strategies
- Fallback to basic meta tags

#### Browser Errors
- Playwright detection and graceful messaging
- Browser process cleanup on errors
- Memory leak prevention

## 🎯 Performance Optimizations

### Static Mode Optimizations
- **Single HTTP Request**: Minimal network overhead
- **String Parsing**: Fast regex-based title cleaning
- **Memory Efficient**: No browser overhead
- **Cache-Friendly**: Deterministic output

### Headless Mode Optimizations
- **Browser Pooling**: Reuse browser instances (future enhancement)
- **Selective Resources**: Block unnecessary requests
- **Early Termination**: Stop when required data found
- **Timeout Protection**: Prevent hanging operations

### Memory Management
```javascript
// Always cleanup browser resources
try {
  return await page.content();
} finally {
  await browser.close();
}
```

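
Putting the cleanup pattern together with the browser configuration described earlier (1280x720 viewport, `domcontentloaded` followed by `networkidle`), a headless fetch could be structured as below. The function name and the injected `browserType` parameter are assumptions made so the cleanup behavior can be checked without launching a real browser; the method calls mirror Playwright's `launch`, `newPage`, `goto`, `waitForLoadState`, `content`, and `close`.

```javascript
// Hypothetical sketch; browserType is expected to behave like Playwright's `chromium`.
async function fetchRenderedHtml(browserType, url, { timeoutMs = 15000, headless = true } = {}) {
  const browser = await browserType.launch({ headless });
  try {
    const page = await browser.newPage({ viewport: { width: 1280, height: 720 } });
    // First wait for the DOM, then for the network to go quiet.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    await page.waitForLoadState('networkidle', { timeout: timeoutMs });
    return await page.content();
  } finally {
    // Runs on success and on failure, so the browser process never leaks.
    await browser.close();
  }
}
```

With Playwright installed, the real browser type can be obtained via `const { chromium } = await import('playwright')` and passed as the first argument.
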
## 🔒 Security Architecture

### Input Validation
- URL format validation with regex patterns
- Netflix domain verification
- Path traversal prevention

### Request Security
- Configurable User-Agent strings
- Rate limiting considerations
- Request header standardization

### Data Sanitization
- HTML entity decoding
- XSS prevention in title extraction
- Structured data validation

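
A minimal sketch of the sanitization steps above is shown here. The entity table is intentionally small and the function names are hypothetical; a production implementation would likely rely on a full entity list. Note that the result is plain text: any consumer that renders it into HTML still needs to escape it at render time.

```javascript
// Small, assumed subset of named entities; not a complete HTML entity table.
const NAMED_ENTITIES = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0' };

// Decode named, decimal, and hexadecimal entities in a single pass.
function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === '#') {
      const code = body[1].toLowerCase() === 'x'
        ? parseInt(body.slice(2), 16)
        : parseInt(body.slice(1), 10);
      return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown names are left untouched rather than guessed.
    return NAMED_ENTITIES[body.toLowerCase()] ?? whole;
  });
}

// Hypothetical helper: strip markup and control characters from an extracted title.
function sanitizeTitle(raw) {
  const withoutTags = raw.replace(/<[^>]*>/g, '');
  return decodeHtmlEntities(withoutTags).replace(/[\u0000-\u001f\u007f]/g, '').trim();
}
```

Tags are stripped before decoding, and decoding happens once, so an input such as `&amp;lt;` becomes the literal text `&lt;` rather than being decoded twice.
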
## 🔮 Extensibility Points

### Future Enhancements

#### 1. Multiple Language Support
```javascript
// Architecture ready for additional languages
const LOCALIZATION_PATTERNS = {
  tr: TURKISH_UI_PATTERNS,
  es: SPANISH_UI_PATTERNS,
  // ... future languages
};
```

#### 2. Caching Layer
```javascript
// Hook points for caching integration
const cacheMiddleware = {
  get: (url) => cache.get(url),
  set: (url, data) => cache.set(url, data, ttl)
};
```

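
To make the hook points concrete, the sketch below shows one possible in-memory backend with a time-to-live, plus a wrapper that consults it before scraping. `TtlCache`, `cachedScrape`, and the injectable clock are hypothetical names for this example; unlike the hook above, the TTL is fixed per cache instance rather than passed to `set`.

```javascript
// Hypothetical in-memory cache with per-instance TTL and an injectable clock.
class TtlCache {
  constructor(ttlMs = 60_000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    // Expired entries are evicted lazily on read.
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Hypothetical wrapper: return a cached result or scrape and store it.
async function cachedScrape(url, scrape, cache) {
  const hit = cache.get(url);
  if (hit !== undefined) return hit;
  const data = await scrape(url);
  cache.set(url, data);
  return data;
}
```

Keying the cache on the normalized URL rather than the raw input would let the different URL patterns for one title share a single entry.
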
#### 3. Browser Pool Management
```javascript
// Scalable browser resource management
class BrowserPool {
  constructor(maxSize = 5) {
    this.maxSize = maxSize;
    this.pool = [];
  }
}
```

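
Since browser pooling is listed as a future enhancement, the following is only one possible direction. It extends the skeleton with acquire/release semantics; the `launch` constructor argument, the waiter queue, and the `use` helper are all assumptions for illustration.

```javascript
// Hypothetical pool sketch: at most maxSize browsers, extra callers wait in a queue.
class BrowserPool {
  constructor(launch, maxSize = 5) {
    this.launch = launch;   // async factory that returns a browser-like object
    this.maxSize = maxSize;
    this.idle = [];         // browsers ready for reuse
    this.size = 0;          // browsers created so far
    this.waiters = [];      // resolvers for callers waiting on a free browser
  }

  async acquire() {
    if (this.idle.length > 0) return this.idle.pop();
    if (this.size < this.maxSize) {
      this.size += 1;
      try {
        return await this.launch();
      } catch (err) {
        this.size -= 1; // a failed launch must not consume a slot
        throw err;
      }
    }
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  release(browser) {
    // Hand the browser straight to a waiting caller if there is one.
    const waiter = this.waiters.shift();
    if (waiter) waiter(browser);
    else this.idle.push(browser);
  }

  async use(fn) {
    const browser = await this.acquire();
    try {
      return await fn(browser);
    } finally {
      this.release(browser);
    }
  }
}
```

This sketch does not close browsers or replace crashed ones; a real pool would also need a shutdown path and health checks.
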
#### 4. Netflix API Integration
```javascript
// Potential Netflix API integration point
class NetflixAPIClient {
  async getMetadata(titleId) {
    // Direct API calls when available
  }
}
```

## 📈 Monitoring & Observability

### Logging Strategy
- **Progress Logs**: ✅ Pass/fail indicators
- **Error Logs**: Detailed error context with Turkish messages
- **Performance Logs**: Timing information (future enhancement)

### Metrics Collection
- Success/failure rates per mode
- Response time distributions
- Error categorization
- Resource utilization

## 🧪 Testing Architecture

### Test Categories
1. **Unit Tests**: Individual function testing
2. **Integration Tests**: Full workflow testing
3. **Live Tests**: Real Netflix URL testing
4. **Performance Tests**: Benchmarking

### Test Data Management
```
tests/fixtures/
├── sample-title.html    # Static test HTML
├── turkish-ui.json      # Turkish UI patterns
└── test-urls.json       # Test URL collection
```

---

*Architecture documentation last updated: 2025-11-23*