first commit
doc/ARCHITECTURE.md
@@ -0,0 +1,321 @@
# MetaScraper Architecture Documentation

## 🏗️ System Architecture Overview

MetaScraper is a Node.js library designed for extracting metadata from Netflix title pages. The architecture emphasizes reliability, performance, and maintainability through a modular design.

### Core Design Principles

1. **Dual-Mode Operation**: Static HTML parsing with Playwright fallback
2. **Graceful Degradation**: Continue operation even when optional dependencies fail
3. **Localization-Aware**: Built-in support for Turkish Netflix interfaces
4. **Error Resilience**: Comprehensive error handling with Turkish error messages
5. **Modern JavaScript**: ES6+ modules with Node.js 18+ compatibility

## 🔄 System Flow

```
Input URL → URL Normalization → Static HTML Fetch → HTML Parsing → Success?
                                        ↓                             ↓
                                      Error                   Headless Fallback
                                        ↓                             ↓
                  Return ← HTML Parsing ← Browser Execution ← Playwright Launch
```

### Detailed Flow Analysis

#### 1. URL Normalization (`src/index.js:21-48`)
- Validates Netflix URL format
- Extracts Netflix title ID from various URL patterns
- Normalizes to standard format: `https://www.netflix.com/title/{id}`

**Supported URL Patterns:**
- `https://www.netflix.com/tr/title/82123114?s=i&trkid=264356104&vlang=tr`
- `https://www.netflix.com/title/80189685`
- `https://www.netflix.com/tr/title/70195800?trackId=12345`
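
The normalization step could look roughly like the sketch below. It is not the code from `src/index.js`: the `TITLE_PATH` regex, the returned `{ id, url }` shape, and the English error messages are assumptions made for illustration (the library itself reports errors in Turkish).

```javascript
// Hypothetical sketch of URL normalization; the real implementation may differ.
// Matches "/title/<digits>" with an optional locale prefix such as "/tr/".
const TITLE_PATH = /^\/(?:[a-z]{2}(?:-[a-z]{2})?\/)?title\/(\d+)\/?$/i;

function normalizeNetflixUrl(inputUrl) {
  let parsed;
  try {
    parsed = new URL(inputUrl);
  } catch {
    throw new Error(`Invalid URL: ${inputUrl}`);
  }

  // Accept netflix.com and its subdomains only.
  const host = parsed.hostname.toLowerCase();
  if (host !== 'netflix.com' && !host.endsWith('.netflix.com')) {
    throw new Error(`Not a Netflix URL: ${inputUrl}`);
  }

  const match = TITLE_PATH.exec(parsed.pathname);
  if (!match) {
    throw new Error(`No title ID found in: ${inputUrl}`);
  }

  // Query parameters (tracking IDs, language hints) are dropped on purpose.
  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}
```

Parsing with the built-in `URL` class before applying the regex keeps the host check separate from the path check, so a look-alike domain is rejected even when its path resembles a title path.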

#### 2. Static HTML Fetch (`src/index.js:99-128`)
- Uses native `fetch` API with undici polyfill support
- Configurable timeout and User-Agent
- Comprehensive error handling for network issues
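
A minimal sketch of a static fetch with a timeout follows, assuming the Node.js 18+ global `fetch` and `AbortController`. The fourth parameter `fetchImpl` is a hypothetical addition that makes the function testable without network access, and the error messages are English placeholders.

```javascript
// Hypothetical sketch; `fetchImpl` is not part of the documented signature.
async function fetchStaticHtml(url, userAgent, timeoutMs, fetchImpl = globalThis.fetch) {
  const controller = new AbortController();
  // Abort the request if it takes longer than timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetchImpl(url, {
      headers: { 'User-Agent': userAgent },
      signal: controller.signal,
    });
    // Treat non-2xx responses as failures instead of parsing an error page.
    if (!response.ok) {
      throw new Error(`HTTP ${response.status} for ${url}`);
    }
    return await response.text();
  } catch (err) {
    if (err.name === 'AbortError') {
      throw new Error(`Request timed out after ${timeoutMs} ms`);
    }
    throw err;
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```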

#### 3. HTML Parsing (`src/parser.js:134-162`)
- **Primary Strategy**: JSON-LD structured data extraction
- **Fallback Strategy**: Meta tags and title element parsing
- **Title Cleaning**: Removes Turkish UI text and Netflix suffixes
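
The two parsing strategies can be illustrated with the sketch below. The helper names `extractJsonLdObjects` and `extractMetaTitle` are hypothetical, and the regex-based matching is a simplification (it only handles double-quoted attributes in the `og:title` tag); the actual parser may work differently.

```javascript
// Hypothetical sketch of the primary strategy: collect every parseable
// JSON-LD object embedded in the page.
function extractJsonLdObjects(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const objects = [];
  for (const match of html.matchAll(pattern)) {
    try {
      const data = JSON.parse(match[1].trim());
      // A block may hold a single object or an array of objects.
      objects.push(...(Array.isArray(data) ? data : [data]));
    } catch {
      // Malformed block: skip it so the fallback strategy can still run.
    }
  }
  return objects;
}

// Hypothetical sketch of the fallback strategy: og:title first, then <title>.
function extractMetaTitle(html) {
  const og = /<meta[^>]+property="og:title"[^>]+content="([^"]*)"/i.exec(html);
  if (og) return og[1];
  const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return title ? title[1].trim() : undefined;
}
```

Skipping malformed JSON-LD blocks instead of throwing is what lets a page with one broken block still yield metadata from another block or from the meta tags.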

#### 4. Headless Fallback (`src/headless.js:9-41`)
- Optional Playwright integration
- Chromium browser automation
- Network idle detection for complete page loads

## 🧩 Module Architecture

### Core Modules

#### `src/index.js` - Main Orchestrator
```javascript
export async function scraperNetflix(inputUrl, options = {})
```

**Responsibilities:**
- URL validation and normalization
- Fetch strategy selection (static vs headless)
- Error orchestration and Turkish localization
- Result aggregation and formatting

**Key Functions:**
- `normalizeNetflixUrl(inputUrl)` - URL processing
- `fetchStaticHtml(url, userAgent, timeoutMs)` - HTTP client
- `ensureFetchGlobals()` - Polyfill management

#### `src/parser.js` - HTML Processing Engine
```javascript
export function parseNetflixHtml(html)
```

**Responsibilities:**
- JSON-LD extraction and parsing
- Title cleaning and localization
- Year extraction from multiple fields
- Season information detection

**Key Functions:**
- `parseJsonLdObject(obj)` - Structured data processing
- `cleanTitle(title)` - UI text removal
- `extractYear(value)` - Multi-format year parsing

**Turkish Localization Patterns:**
```javascript
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,          // "waiting for you to watch"
  /\s+izleyin$/i,                      // "watch"
  /\s+devam et$/i,                     // "continue"
  /\s+başla$/i,                        // "start"
  /\s+izlemeye devam$/i,               // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,  // "Sezon X izlemeye devam"
  /\s+Sezon\s+\d+.*başla$/i,           // "Sezon X başla"
];
```
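
One way `cleanTitle` might apply these patterns is sketched below. Two things here are assumptions rather than documented behavior: the `NETFLIX_SUFFIX` regex, and the ordering, which places the season-specific patterns first. In a single sequential pass using the order listed above, the generic `izlemeye devam` pattern would strip only the tail and leave `Sezon 2` attached to the title.

```javascript
// Hypothetical sketch; same patterns as above, season-specific ones first.
const TURKISH_UI_PATTERNS = [
  /\s+Sezon\s+\d+.*izlemeye devam$/i,
  /\s+Sezon\s+\d+.*başla$/i,
  /\s+izlemenizi bekliyor$/i,
  /\s+izlemeye devam$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+başla$/i,
];

// Assumed suffix format, e.g. "Dark | Netflix" or "Dark - Netflix".
const NETFLIX_SUFFIX = /\s*[|\-–]\s*Netflix\s*$/i;

function cleanTitle(title) {
  if (typeof title !== 'string') return undefined;
  // Remove the site suffix first so the UI patterns can anchor on the end.
  let cleaned = title.trim().replace(NETFLIX_SUFFIX, '');
  for (const pattern of TURKISH_UI_PATTERNS) {
    cleaned = cleaned.replace(pattern, '');
  }
  cleaned = cleaned.trim();
  // An empty result is reported as "no title" rather than "".
  return cleaned || undefined;
}
```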

#### `src/headless.js` - Browser Automation
```javascript
export async function fetchPageContentWithPlaywright(url, options)
```

**Responsibilities:**
- Playwright browser management
- Page navigation and content extraction
- Resource cleanup and error handling

**Browser Configuration:**
- Viewport: 1280x720 (standard desktop)
- Wait Strategy: `domcontentloaded` + `networkidle`
- Launch Mode: Headless (configurable)

#### `src/polyfill.js` - Compatibility Layer
```javascript
// File/Blob polyfill for Node.js undici compatibility
```

**Responsibilities:**
- File API polyfill for undici fetch
- Node.js 18+ compatibility
- Minimal footprint

## 📊 Data Flow Architecture

### Input Processing
```typescript
interface Input {
  url: string;                    // Netflix URL
  options?: {
    headless?: boolean;           // Enable/disable Playwright
    timeoutMs?: number;           // Request timeout
    userAgent?: string;           // Custom User-Agent
  };
}
```

### Output Schema
```typescript
interface NetflixMetadata {
  url: string;                         // Normalized URL
  id: string;                          // Netflix title ID
  name: string;                        // Clean title
  year: string | number | undefined;   // Release year
  seasons: string | null;              // Season info for series
}
```

### Internal Data Structures

#### JSON-LD Processing
```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
```
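
The sketch below shows how a year could be pulled from these fields in priority order. `findYear` is a hypothetical helper, the sketch always returns a number (the output schema also allows strings), and the handling of a nested `releasedEvent` object via its `startDate` is an assumption about the data shape.

```javascript
// Hypothetical sketch of multi-format year parsing.
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

function extractYear(value) {
  // Numeric fields such as copyrightYear are used as-is.
  if (typeof value === 'number' && Number.isInteger(value)) return value;
  // Date strings: take the first four-digit year starting with 19 or 20.
  if (typeof value === 'string') {
    const match = /\b(?:19|20)\d{2}\b/.exec(value);
    if (match) return Number(match[0]);
  }
  // Assumed shape: nested event objects carry their date in startDate.
  if (value && typeof value === 'object' && !Array.isArray(value)) {
    return extractYear(value.startDate);
  }
  return undefined;
}

// Walk the fields in priority order and return the first usable year.
function findYear(jsonLd) {
  for (const field of YEAR_FIELDS) {
    const year = extractYear(jsonLd?.[field]);
    if (year !== undefined) return year;
  }
  return undefined;
}
```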

#### Error Handling
```javascript
class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.originalError = originalError;
    this.context = context;
  }
}
```

## 🔧 Technical Implementation Details

### Fetch Strategy Selection Algorithm
```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}
```

**Decision Logic:**
1. **Static First**: Always try static parsing first (faster, lighter)
2. **Missing Data**: If the title or year is missing, trigger the headless fallback
3. **Configurable**: Headless mode can be forced or disabled entirely
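
The decision logic can be sketched as a small orchestration function. Everything apart from `needsHeadless` is hypothetical: `resolveMetadata`, the injected `deps` object, and the `headless = true` default are illustration-only choices, and the sketch returns the rendered result as-is without merging it with the static one.

```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}

// Hypothetical orchestration sketch. `deps` supplies the fetchers and the
// parser so the flow can be exercised without network or browser access.
async function resolveMetadata(url, { headless = true } = {}, deps) {
  const { fetchStatic, fetchHeadless, parse } = deps;

  // 1. Static first: one HTTP request, no browser.
  const staticMeta = parse(await fetchStatic(url));

  // 2. Stop here if the data is complete or the fallback is disabled.
  if (!needsHeadless(staticMeta) || !headless) {
    return staticMeta;
  }

  // 3. Otherwise render the page in a browser and parse again.
  return parse(await fetchHeadless(url));
}
```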

### Error Recovery Patterns

#### Network Errors
- Timeout handling with AbortController
- HTTP status code validation
- Retry logic for transient failures
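
The document does not specify how retries are implemented. As one possible shape, the sketch below retries a task with exponential backoff; the name `withRetry`, its options, and the default values are all assumptions.

```javascript
// Hypothetical retry helper with exponential backoff.
async function withRetry(task, { retries = 2, baseDelayMs = 250, isTransient = () => true } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      // Give up on the last attempt or when the error is not worth retrying.
      if (attempt === retries || !isTransient(err)) break;
      // Wait baseDelayMs, then 2x, 4x, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}
```

A caller would typically pass an `isTransient` predicate that accepts timeouts and 5xx responses but rejects 404s, since retrying a missing title cannot succeed.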

#### Parsing Errors
- Graceful JSON-LD error handling
- Multiple title extraction strategies
- Fallback to basic meta tags

#### Browser Errors
- Playwright detection and graceful messaging
- Browser process cleanup on errors
- Memory leak prevention
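
The sketch below illustrates the first two points under stated assumptions: `loadChromium` and `fetchRenderedHtml` are hypothetical names, the `importer` and `chromium` parameters exist only to make the code testable, and the error message is an English placeholder. The Playwright calls themselves (`chromium.launch`, `browser.newPage`, `page.goto`, `page.waitForLoadState`, `page.content`, `browser.close`) are part of the public Playwright API.

```javascript
// Hypothetical sketch: detect the optional dependency at call time.
async function loadChromium(importer = () => import('playwright')) {
  try {
    const { chromium } = await importer();
    return chromium;
  } catch (err) {
    const error = new Error(
      'Playwright is not installed. Run "npm install playwright" and ' +
      '"npx playwright install chromium" to enable the headless fallback.'
    );
    error.cause = err;
    throw error;
  }
}

// Hypothetical sketch: the browser is closed even when navigation fails.
async function fetchRenderedHtml(url, { timeoutMs = 15000, headless = true } = {}, chromium) {
  const browser = await chromium.launch({ headless });
  try {
    const page = await browser.newPage({ viewport: { width: 1280, height: 720 } });
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    await page.waitForLoadState('networkidle', { timeout: timeoutMs });
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

A caller would combine the two as `fetchRenderedHtml(url, options, await loadChromium())`.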

## 🎯 Performance Optimizations

### Static Mode Optimizations
- **Single HTTP Request**: Minimal network overhead
- **String Parsing**: Fast regex-based title cleaning
- **Memory Efficient**: No browser overhead
- **Cache-Friendly**: Deterministic output

### Headless Mode Optimizations
- **Browser Pooling**: Reuse browser instances (future enhancement)
- **Selective Resources**: Block unnecessary requests
- **Early Termination**: Stop when required data found
- **Timeout Protection**: Prevent hanging operations
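
Selective resource loading could be done with Playwright's request interception, as in the sketch below. The helper name `blockHeavyResources` and the set of blocked types are assumptions; `page.route`, `route.request().resourceType()`, `route.abort()`, and `route.continue()` are existing Playwright APIs.

```javascript
// Assumed list of resource types that metadata extraction does not need.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Hypothetical helper: call before page.goto() so every request is filtered.
async function blockHeavyResources(page) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    return BLOCKED_RESOURCE_TYPES.has(type) ? route.abort() : route.continue();
  });
}
```

Documents, scripts, and XHR/fetch requests still pass through, so client-side rendering of the title data is unaffected.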

### Memory Management
```javascript
// Always cleanup browser resources
try {
  return await page.content();
} finally {
  await browser.close();
}
```

## 🔒 Security Architecture

### Input Validation
- URL format validation with regex patterns
- Netflix domain verification
- Path traversal prevention

### Request Security
- Configurable User-Agent strings
- Rate limiting considerations
- Request header standardization

### Data Sanitization
- HTML entity decoding
- XSS prevention in title extraction
- Structured data validation
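
As an illustration of the entity-decoding step, the sketch below handles numeric entities and a small set of named ones. The function name and the entity table are assumptions; a full implementation would need the complete HTML named-entity list.

```javascript
// Hypothetical, deliberately small entity table.
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'],
  ['quot', '"'], ['apos', "'"], ['nbsp', '\u00a0'],
]);

function decodeHtmlEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === '#') {
      // Numeric entity: "#351" (decimal) or "#xFC" (hexadecimal).
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range code points untouched instead of throwing.
      return Number.isInteger(code) && code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Named entity: unknown names are left as written.
    return NAMED_ENTITIES.get(body.toLowerCase()) ?? whole;
  });
}
```

The decoded title is plain text; any consumer that inserts it into HTML still has to escape it again.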

## 🔮 Extensibility Points

### Future Enhancements

#### 1. Multiple Language Support
```javascript
// Architecture ready for additional languages
const LOCALIZATION_PATTERNS = {
  tr: TURKISH_UI_PATTERNS,
  es: SPANISH_UI_PATTERNS,  // placeholder, not defined yet
  // ... future languages
};
```

#### 2. Caching Layer
```javascript
// Hook points for caching integration
// (`cache` and `ttl` are placeholders for a future cache backend)
const cacheMiddleware = {
  get: (url) => cache.get(url),
  set: (url, data) => cache.set(url, data, ttl)
};
```
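
Since caching is listed as a future enhancement, no backend exists yet. As one possible backend, the sketch below is an in-memory cache with a time-to-live; the class name `TtlCache` and the injectable `now` clock are assumptions made for illustration.

```javascript
// Hypothetical in-memory cache with per-entry expiry.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;            // injectable clock, handy for tests
    this.entries = new Map();  // url -> { data, expiresAt }
  }

  get(url) {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    // Expired entries are removed lazily on read.
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(url);
      return undefined;
    }
    return entry.data;
  }

  set(url, data) {
    this.entries.set(url, { data, expiresAt: this.now() + this.ttlMs });
  }
}
```

Keying by the normalized URL means that the same title reached through different tracking parameters shares one cache entry.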

#### 3. Browser Pool Management
```javascript
// Scalable browser resource management
class BrowserPool {
  constructor(maxSize = 5) {
    this.maxSize = maxSize;
    this.pool = [];
  }
}
```

#### 4. Netflix API Integration
```javascript
// Potential Netflix API integration point
class NetflixAPIClient {
  async getMetadata(titleId) {
    // Direct API calls when available
  }
}
```

## 📈 Monitoring & Observability

### Logging Strategy
- **Progress Logs**: ✅ Pass/fail indicators
- **Error Logs**: Detailed error context with Turkish messages
- **Performance Logs**: Timing information (future enhancement)

### Metrics Collection
- Success/failure rates per mode
- Response time distributions
- Error categorization
- Resource utilization

## 🧪 Testing Architecture

### Test Categories
1. **Unit Tests**: Individual function testing
2. **Integration Tests**: Full workflow testing
3. **Live Tests**: Real Netflix URL testing
4. **Performance Tests**: Benchmarking

### Test Data Management
```
tests/fixtures/
├── sample-title.html     # Static test HTML
├── turkish-ui.json       # Turkish UI patterns
└── test-urls.json        # Test URL collection
```

---

*Architecture documentation last updated: 2025-11-23*