first commit

2025-11-23 14:25:09 +03:00
commit 46d75b64d5
18 changed files with 4749 additions and 0 deletions

doc/ARCHITECTURE.md Normal file

@@ -0,0 +1,321 @@
# MetaScraper Architecture Documentation
## 🏗️ System Architecture Overview
MetaScraper is a Node.js library designed for extracting metadata from Netflix title pages. The architecture emphasizes reliability, performance, and maintainability through a modular design.
### Core Design Principles
1. **Dual-Mode Operation**: Static HTML parsing with Playwright fallback
2. **Graceful Degradation**: Continue operation even when optional dependencies fail
3. **Localization-Aware**: Built-in support for Turkish Netflix interfaces
4. **Error Resilience**: Comprehensive error handling with Turkish error messages
5. **Modern JavaScript**: ES6+ modules with Node.js 18+ compatibility
## 🔄 System Flow
```
Input URL → URL Normalization → Static HTML Fetch → HTML Parsing → Success?
                    ↓                                                 ↓
                  Error                                       Headless Fallback
                    ↓                                                 ↓
                  Return ← HTML Parsing ← Browser Execution ← Playwright Launch
```
### Detailed Flow Analysis
#### 1. URL Normalization (`src/index.js:21-48`)
- Validates Netflix URL format
- Extracts Netflix title ID from various URL patterns
- Normalizes to standard format: `https://www.netflix.com/title/{id}`
**Supported URL Patterns:**
- `https://www.netflix.com/tr/title/82123114?s=i&trkid=264356104&vlang=tr`
- `https://www.netflix.com/title/80189685`
- `https://www.netflix.com/tr/title/70195800?trackId=12345`
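To make the normalization step concrete, here is a minimal sketch that accepts the patterns listed above and reduces them to the canonical form. It is not the code in `src/index.js`: the name `normalizeNetflixUrlSketch`, the locale regex, the returned `{ id, url }` shape, and the English error messages are all assumptions made for illustration.

```javascript
// Hypothetical sketch; the real normalizeNetflixUrl may be implemented differently.
function normalizeNetflixUrlSketch(inputUrl) {
  let parsed;
  try {
    parsed = new URL(inputUrl);
  } catch {
    throw new Error(`Invalid URL: ${inputUrl}`);
  }

  // Accept netflix.com and its subdomains only.
  if (!/(^|\.)netflix\.com$/i.test(parsed.hostname)) {
    throw new Error(`Not a Netflix URL: ${inputUrl}`);
  }

  // Optional locale segment (e.g. /tr or /en-gb), then /title/{numeric id}.
  const match = parsed.pathname.match(
    /^(?:\/[a-z]{2}(?:-[a-z]{2})?)?\/title\/(\d+)\/?$/i
  );
  if (!match) {
    throw new Error(`No Netflix title ID found in: ${inputUrl}`);
  }

  // Query parameters such as trkid or vlang are dropped on purpose.
  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}
```

Parsing with the built-in `URL` class rather than a single regex keeps the domain check separate from the path check, which makes each rejection reason easier to report.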
#### 2. Static HTML Fetch (`src/index.js:99-128`)
- Uses the native `fetch` API, with a File/Blob polyfill (`src/polyfill.js`) for undici compatibility
- Configurable timeout and User-Agent
- Comprehensive error handling for network issues
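The points above could be combined roughly as follows. This is a sketch under assumptions, not the project's `fetchStaticHtml`: the default timeout, the error message, and the injectable `fetchImpl` parameter (added here so the function can be exercised without network access) are illustrative.

```javascript
// Hypothetical sketch of a static fetch with a timeout and a custom User-Agent.
async function fetchStaticHtmlSketch(
  url,
  userAgent,
  timeoutMs = 15000,
  fetchImpl = globalThis.fetch
) {
  const controller = new AbortController();
  // Abort the request if it has not completed within timeoutMs.
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    const response = await fetchImpl(url, {
      headers: { 'User-Agent': userAgent },
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Request failed with HTTP ${response.status}`);
    }
    return await response.text();
  } finally {
    // Clear the timer on success and on failure so it cannot fire later.
    clearTimeout(timer);
  }
}
```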
#### 3. HTML Parsing (`src/parser.js:134-162`)
- **Primary Strategy**: JSON-LD structured data extraction
- **Fallback Strategy**: Meta tags and title element parsing
- **Title Cleaning**: Removes Turkish UI text and Netflix suffixes
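The primary and fallback strategies might look like the sketch below. The function names, the regular expressions, and the assumption that `property` precedes `content` in the `og:title` tag are illustrative and not taken from `src/parser.js`.

```javascript
// Hypothetical sketch: collect JSON-LD objects, skipping malformed blocks.
function extractJsonLdSketch(html) {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const objects = [];
  for (const match of html.matchAll(pattern)) {
    try {
      const data = JSON.parse(match[1]);
      objects.push(...(Array.isArray(data) ? data : [data]));
    } catch {
      // Malformed JSON-LD is ignored so the fallback strategy can still run.
    }
  }
  return objects;
}

// Fallback: og:title meta tag first, then the <title> element.
function fallbackTitleSketch(html) {
  const og = html.match(/<meta[^>]*property="og:title"[^>]*content="([^"]*)"/i);
  if (og) return og[1];
  const title = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);
  return title ? title[1].trim() : undefined;
}

// Primary strategy first; fall back only when no JSON-LD object has a name.
function parseTitleSketch(html) {
  const withName = extractJsonLdSketch(html).find(
    (obj) => obj && typeof obj.name === 'string'
  );
  return withName ? withName.name : fallbackTitleSketch(html);
}
```

Swallowing the `JSON.parse` error is deliberate in this sketch: one broken script tag should not prevent the meta tag fallback from producing a title.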
#### 4. Headless Fallback (`src/headless.js:9-41`)
- Optional Playwright integration
- Chromium browser automation
- Network idle detection for complete page loads
## 🧩 Module Architecture
### Core Modules
#### `src/index.js` - Main Orchestrator
```javascript
export async function scraperNetflix(inputUrl, options = {})
```
**Responsibilities:**
- URL validation and normalization
- Fetch strategy selection (static vs headless)
- Error orchestration and Turkish localization
- Result aggregation and formatting
**Key Functions:**
- `normalizeNetflixUrl(inputUrl)` - URL processing
- `fetchStaticHtml(url, userAgent, timeoutMs)` - HTTP client
- `ensureFetchGlobals()` - Polyfill management
#### `src/parser.js` - HTML Processing Engine
```javascript
export function parseNetflixHtml(html)
```
**Responsibilities:**
- JSON-LD extraction and parsing
- Title cleaning and localization
- Year extraction from multiple fields
- Season information detection
**Key Functions:**
- `parseJsonLdObject(obj)` - Structured data processing
- `cleanTitle(title)` - UI text removal
- `extractYear(value)` - Multi-format year parsing
**Turkish Localization Patterns:**
```javascript
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,          // "waiting for you to watch"
  /\s+izleyin$/i,                      // "watch"
  /\s+devam et$/i,                     // "continue"
  /\s+başla$/i,                        // "start"
  /\s+izlemeye devam$/i,               // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,  // "Season X ... continue watching"
  /\s+Sezon\s+\d+.*başla$/i,           // "Season X ... start"
];
```
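One way these patterns could be applied is sketched below. It is an assumption about how `cleanTitle` works, not a copy of it. The season-specific patterns are tried first here because, if the list is applied sequentially in the order shown above, the shorter `izlemeye devam` and `başla` patterns would strip their suffix first and leave `Sezon X` behind. The `| Netflix` suffix format is also assumed.

```javascript
// Hypothetical sketch of title cleaning; names and ordering are assumptions.
const UI_PATTERNS_SKETCH = [
  // Season-specific patterns run first so the generic suffix patterns
  // below cannot remove only part of the UI text.
  /\s+Sezon\s+\d+.*izlemeye devam$/i,
  /\s+Sezon\s+\d+.*başla$/i,
  /\s+izlemenizi bekliyor$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+başla$/i,
  /\s+izlemeye devam$/i,
];

function cleanTitleSketch(title) {
  // Assumed suffix format: "Title | Netflix".
  let cleaned = title.replace(/\s*\|\s*Netflix\s*$/i, '');
  for (const pattern of UI_PATTERNS_SKETCH) {
    cleaned = cleaned.replace(pattern, '');
  }
  return cleaned.trim();
}
```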
#### `src/headless.js` - Browser Automation
```javascript
export async function fetchPageContentWithPlaywright(url, options)
```
**Responsibilities:**
- Playwright browser management
- Page navigation and content extraction
- Resource cleanup and error handling
**Browser Configuration:**
- Viewport: 1280x720 (standard desktop)
- Wait Strategy: `domcontentloaded` + `networkidle`
- Launch Mode: Headless (configurable)
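A minimal sketch of a headless fetch using this configuration follows. It relies only on documented Playwright calls (`chromium.launch`, `browser.newPage`, `page.goto`, `page.waitForLoadState`, `page.content`, `browser.close`), but the function name, the option names, the default timeout, and the injectable `loadPlaywright` loader are assumptions and may not match `fetchPageContentWithPlaywright`.

```javascript
// Hypothetical sketch of the headless fetch; option names are assumptions.
async function fetchWithPlaywrightSketch(url, options = {}) {
  const {
    timeoutMs = 30000,
    headless = true,
    userAgent,
    // Loaded lazily because Playwright is an optional dependency.
    loadPlaywright = () => import('playwright'),
  } = options;

  let chromium;
  try {
    ({ chromium } = await loadPlaywright());
  } catch (error) {
    throw new Error(
      'Playwright is not installed; install it to enable headless mode.'
    );
  }

  const browser = await chromium.launch({ headless });
  try {
    const page = await browser.newPage({
      viewport: { width: 1280, height: 720 },
      ...(userAgent ? { userAgent } : {}),
    });
    // Wait for the DOM first, then for the network to go quiet.
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout: timeoutMs });
    await page.waitForLoadState('networkidle', { timeout: timeoutMs });
    return await page.content();
  } finally {
    // Close the browser even if navigation or extraction throws.
    await browser.close();
  }
}
```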
#### `src/polyfill.js` - Compatibility Layer
```javascript
// File/Blob polyfill for Node.js undici compatibility
```
**Responsibilities:**
- File API polyfill for undici fetch
- Node.js 18+ compatibility
- Minimal footprint
## 📊 Data Flow Architecture
### Input Processing
```typescript
interface Input {
  url: string;                // Netflix URL
  options?: {
    headless?: boolean;       // Enable/disable Playwright
    timeoutMs?: number;       // Request timeout
    userAgent?: string;       // Custom User-Agent
  };
}
```
### Output Schema
```typescript
interface NetflixMetadata {
  url: string;                        // Normalized URL
  id: string;                         // Netflix title ID
  name: string;                       // Clean title
  year: string | number | undefined;  // Release year
  seasons: string | null;             // Season info for series
}
```
### Internal Data Structures
#### JSON-LD Processing
```javascript
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];
const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
```
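As an illustration of how these fields might be scanned, the sketch below walks `YEAR_FIELDS` in order and returns the first four-digit year it finds. The function names, the year regex, and the handling of a nested `releasedEvent.startDate` are assumptions; the real `extractYear` may accept other shapes.

```javascript
const YEAR_FIELDS_SKETCH = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated',
];

// Hypothetical sketch: read a year from a number, a date string, or a nested object.
function extractYearSketch(value) {
  if (typeof value === 'number' && Number.isInteger(value)) return value;
  if (typeof value === 'string') {
    const match = value.match(/\b(?:19|20)\d{2}\b/);
    return match ? Number(match[0]) : undefined;
  }
  if (value && typeof value === 'object') {
    // Assumed shape, e.g. releasedEvent: { startDate: '2019-05-01' }
    return extractYearSketch(value.startDate);
  }
  return undefined;
}

// Return the year from the first field that yields one.
function findYearSketch(jsonLd) {
  for (const field of YEAR_FIELDS_SKETCH) {
    const year = extractYearSketch(jsonLd[field]);
    if (year !== undefined) return year;
  }
  return undefined;
}
```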
#### Error Handling
```javascript
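class NetflixScrapingError extends Error {
  constructor(message, originalError, context) {
    super(message);
    this.name = 'NetflixScrapingError'; // Shown in stack traces and logs
    this.originalError = originalError; // Underlying error, if any
    this.context = context;             // Extra diagnostic details
  }
}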
```
## 🔧 Technical Implementation Details
### Fetch Strategy Selection Algorithm
```javascript
function needsHeadless(meta) {
  return !meta?.name || !meta?.year;
}
```
**Decision Logic:**
1. **Static First**: Always try static parsing (faster, lighter)
2. **Missing Data**: If the title or year is missing, fall back to headless mode
3. **Configurable**: Headless mode can be forced or disabled entirely
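The decision logic can be sketched as a small orchestration function. The injected `fetchStatic`, `fetchHeadless`, and `parse` callbacks are stand-ins for the real modules, and the merge rule (headless values fill in or override static ones only when defined) is an assumption rather than documented behavior. Forcing headless mode is left out of this sketch.

```javascript
// Same check as needsHeadless in the section above.
const needsHeadlessSketch = (meta) => !meta?.name || !meta?.year;

// Hypothetical orchestration sketch with injected dependencies.
async function scrapeWithFallbackSketch(
  url,
  { headless = true, fetchStatic, fetchHeadless, parse }
) {
  // 1. Static first: cheaper and needs no browser.
  const staticMeta = parse(await fetchStatic(url));
  if (!needsHeadlessSketch(staticMeta) || !headless) {
    return staticMeta;
  }

  // 2. Missing data: retry through the headless browser.
  const headlessMeta = parse(await fetchHeadless(url));

  // 3. Keep what the static pass found; add defined headless values.
  const merged = { ...staticMeta };
  for (const [key, value] of Object.entries(headlessMeta)) {
    if (value !== undefined && value !== null) merged[key] = value;
  }
  return merged;
}
```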
### Error Recovery Patterns
#### Network Errors
- Timeout handling with AbortController
- HTTP status code validation
- Retry logic for transient failures
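The document does not show how retries are implemented, so the following is only one possible shape: a generic wrapper with exponential backoff. The name `withRetrySketch`, the retry count, the delays, and the `isTransient` predicate are all hypothetical.

```javascript
// Hypothetical retry helper; attempt counts and delays are illustrative.
async function withRetrySketch(
  task,
  { retries = 2, baseDelayMs = 250, isTransient = () => true } = {}
) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      // Give up when attempts are exhausted or the error is not retryable.
      if (attempt >= retries || !isTransient(error)) throw error;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, and so on.
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A caller would wrap the static fetch in `task` and pass an `isTransient` predicate that returns `true` for timeouts and 5xx responses but `false` for an invalid URL, so permanent failures are reported immediately.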
#### Parsing Errors
- Graceful JSON-LD error handling
- Multiple title extraction strategies
- Fallback to basic meta tags
#### Browser Errors
- Playwright detection and graceful messaging
- Browser process cleanup on errors
- Memory leak prevention
## 🎯 Performance Optimizations
### Static Mode Optimizations
- **Single HTTP Request**: Minimal network overhead
- **String Parsing**: Fast regex-based title cleaning
- **Memory Efficient**: No browser overhead
- **Cache-Friendly**: Deterministic output
### Headless Mode Optimizations
- **Browser Pooling**: Reuse browser instances (future enhancement)
- **Selective Resources**: Block unnecessary requests
- **Early Termination**: Stop when required data found
- **Timeout Protection**: Prevent hanging operations
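Selective resource loading can be done with Playwright's request interception. The sketch below uses the documented `page.route`, `route.request().resourceType()`, `route.abort()`, and `route.continue()` calls; the helper name and the list of blocked types are assumptions, and whether the project blocks these exact types is not stated in this document.

```javascript
// Resource types assumed unnecessary for metadata extraction.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Hypothetical helper: call on a Playwright Page before page.goto().
async function blockHeavyResourcesSketch(page) {
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    // Abort heavy assets; let documents, scripts, and XHR through.
    return BLOCKED_RESOURCE_TYPES.has(type) ? route.abort() : route.continue();
  });
}
```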
### Memory Management
```javascript
// Always cleanup browser resources
try {
  return await page.content();
} finally {
  await browser.close();
}
```
## 🔒 Security Architecture
### Input Validation
- URL format validation with regex patterns
- Netflix domain verification
- Path traversal prevention
### Request Security
- Configurable User-Agent strings
- Rate limiting considerations
- Request header standardization
### Data Sanitization
- HTML entity decoding
- XSS prevention in title extraction
- Structured data validation
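HTML entity decoding, for example, could be handled with a single-pass replace as sketched below. The function name and the small table of named entities are assumptions; a production parser would normally use a complete entity table.

```javascript
// Small illustrative table; not a complete list of HTML named entities.
const NAMED_ENTITIES = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0',
};

// Hypothetical sketch: decode numeric entities and a few named ones.
function decodeHtmlEntitiesSketch(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : whole;
    }
    // Unknown names are left as-is rather than guessed.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : whole;
  });
}
```

Because the replacement runs in one pass, `&amp;lt;` becomes `&lt;` and is not decoded a second time, which avoids turning escaped text back into markup.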
## 🔮 Extensibility Points
### Future Enhancements
#### 1. Multiple Language Support
```javascript
// Architecture ready for additional languages
const LOCALIZATION_PATTERNS = {
  tr: TURKISH_UI_PATTERNS,
  es: SPANISH_UI_PATTERNS,
  // ... future languages
};
```
#### 2. Caching Layer
```javascript
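// Hook points for caching integration (in-memory sketch; ttlMs is an assumed value)
const store = new Map();
const ttlMs = 60_000;
const cacheMiddleware = {
  get: (url) => {
    const hit = store.get(url);
    return hit && hit.expires > Date.now() ? hit.data : undefined;
  },
  set: (url, data) => store.set(url, { data, expires: Date.now() + ttlMs }),
};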
```
#### 3. Browser Pool Management
```javascript
// Scalable browser resource management
class BrowserPool {
  constructor(maxSize = 5) {
    this.maxSize = maxSize;
    this.pool = [];
  }
}
```
#### 4. Netflix API Integration
```javascript
// Potential Netflix API integration point
class NetflixAPIClient {
  async getMetadata(titleId) {
    // Direct API calls when available
  }
}
```
## 📈 Monitoring & Observability
### Logging Strategy
- **Progress Logs**: ✅ Pass/fail indicators
- **Error Logs**: Detailed error context with Turkish messages
- **Performance Logs**: Timing information (future enhancement)
### Metrics Collection
- Success/failure rates per mode
- Response time distributions
- Error categorization
- Resource utilization
## 🧪 Testing Architecture
### Test Categories
1. **Unit Tests**: Individual function testing
2. **Integration Tests**: Full workflow testing
3. **Live Tests**: Real Netflix URL testing
4. **Performance Tests**: Benchmarking
### Test Data Management
```
tests/fixtures/
├── sample-title.html # Static test HTML
├── turkish-ui.json # Turkish UI patterns
└── test-urls.json # Test URL collection
```
---
*Architecture documentation last updated: 2025-11-23*