# MetaScraper API Reference
## 🎯 Main API
### `scraperNetflix(inputUrl, options?)`
Netflix metadata extraction function with automatic fallback and Turkish localization.
#### Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `inputUrl` | `string` | ✅ | - | Netflix title URL (any format) |
| `options` | `object` | ❌ | `{}` | Configuration options |
#### Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `headless` | `boolean` | `true` | Enable Playwright fallback for missing data |
| `timeoutMs` | `number` | `15000` | Request timeout in milliseconds |
| `userAgent` | `string` | Chrome 118 User-Agent | Custom User-Agent string |
#### Returns
```typescript
Promise<{
  url: string;                        // Normalized Netflix URL
  id: string;                         // Netflix title ID
  name: string;                       // Clean title (Turkish UI text removed)
  year: string | number | undefined;  // Release year
  seasons: string | null;             // Season info for TV series
}>
```
#### Examples
**Basic Usage**
```javascript
import { scraperNetflix } from 'metascraper';
const result = await scraperNetflix('https://www.netflix.com/tr/title/82123114');
console.log(result);
// {
// "url": "https://www.netflix.com/title/82123114",
// "id": "82123114",
// "name": "ONE SHOT with Ed Sheeran",
// "year": "2025",
// "seasons": null
// }
```
**Advanced Configuration**
```javascript
import { scraperNetflix } from 'metascraper';
const result = await scraperNetflix(
  'https://www.netflix.com/title/80189685',
  {
    headless: false,  // Disable browser fallback
    timeoutMs: 30000, // 30 second timeout
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  }
);
```
**Error Handling**
```javascript
import { scraperNetflix } from 'metascraper';
try {
  const result = await scraperNetflix('https://www.netflix.com/title/80189685');
  console.log('Success:', result);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Turkish error messages for Turkish users
  // "Netflix scraping başarısız: Netflix URL'i gereklidir."
}
```
## 🧩 Internal APIs
### `parseNetflixHtml(html)` - Parser API
Parse Netflix HTML content to extract metadata without network requests.
#### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `html` | `string` | ✅ | Raw HTML content from Netflix page |
#### Returns
```typescript
{
  name?: string;            // Clean title
  year?: string | number;   // Release year
  seasons?: string | null;  // Season information
}
```
#### Examples
```javascript
import { readFileSync } from 'node:fs';
import { parseNetflixHtml } from 'metascraper/parser';
// With cached HTML
const html = readFileSync('netflix-page.html', 'utf8');
const metadata = parseNetflixHtml(html);
console.log(metadata);
// {
// "name": "The Witcher",
// "year": "2025",
// "seasons": "4 Sezon"
// }
```
### `fetchPageContentWithPlaywright(url, options)` - Headless API
Fetch Netflix page content using Playwright browser automation.
#### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `url` | `string` | ✅ | Complete URL to fetch |
| `options` | `object` | ✅ | Browser configuration |
#### Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `timeoutMs` | `number` | `15000` | Page load timeout |
| `userAgent` | `string` | Chrome 118 | Browser User-Agent |
| `headless` | `boolean` | `true` | Run browser in headless mode |
#### Returns
```typescript
Promise<string> // HTML content of the page
```
#### Examples
```javascript
import { fetchPageContentWithPlaywright } from 'metascraper/headless';
import { parseNetflixHtml } from 'metascraper/parser';
try {
  const html = await fetchPageContentWithPlaywright(
    'https://www.netflix.com/title/80189685',
    {
      timeoutMs: 30000,
      headless: false // Show browser (useful for debugging)
    }
  );
  // Process the HTML with the parser
  const metadata = parseNetflixHtml(html);
  console.log(metadata);
} catch (error) {
  console.error('Browser automation failed:', error.message);
}
```
## 🔧 URL Processing
### Supported URL Formats
The `scraperNetflix` function automatically normalizes various Netflix URL formats:
| Input Format | Normalized Output | Notes |
|--------------|-------------------|-------|
| `https://www.netflix.com/title/80189685` | `https://www.netflix.com/title/80189685` | Standard format |
| `https://www.netflix.com/tr/title/80189685` | `https://www.netflix.com/title/80189685` | Turkish locale |
| `https://www.netflix.com/tr/title/80189685?s=i&trkid=264356104&vlang=tr` | `https://www.netflix.com/title/80189685` | With parameters |
| `https://www.netflix.com/title/80189685?trackId=12345` | `https://www.netflix.com/title/80189685` | With tracking |
### URL Validation
The function validates URLs with these rules:
1. **Format**: Must be a valid URL
2. **Domain**: Must contain `netflix.com`
3. **Path**: Must contain `title/` followed by numeric ID
4. **ID Extraction**: Uses regex to extract title ID
```javascript
// These will work:
'https://www.netflix.com/title/80189685'
'https://www.netflix.com/tr/title/80189685?s=i&vlang=tr'
// These will fail:
'https://google.com' // Wrong domain
'https://www.netflix.com/browse' // No title ID
'not-a-url' // Invalid format
'https://www.netflix.com/title/abc' // Non-numeric ID
```
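The validation rules and the normalization table above can be sketched as one small pure function. This is only an illustration, not the library's actual implementation: the name `normalizeNetflixUrl`, the regex, and the choice to check the hostname (stricter than a plain "contains `netflix.com`" test) are assumptions. The error messages are the ones listed under "Common Error Scenarios" in this document, and treating a non-numeric ID as a missing ID is also an assumption.

```javascript
// Hypothetical helper: mirrors the documented rules, not the library's real code.
function normalizeNetflixUrl(inputUrl) {
  let parsed;
  try {
    parsed = new URL(inputUrl); // Rule 1: must be a valid URL
  } catch {
    throw new Error('Geçersiz URL sağlandı.');
  }

  // Rule 2: host must be netflix.com or a subdomain such as www.netflix.com
  const host = parsed.hostname.toLowerCase();
  if (host !== 'netflix.com' && !host.endsWith('.netflix.com')) {
    throw new Error('URL netflix.com adresini göstermelidir.');
  }

  // Rules 3 and 4: path must contain "title/" followed by a numeric ID
  const match = parsed.pathname.match(/\/title\/(\d+)(?:\/|$)/);
  if (!match) {
    throw new Error("URL'de Netflix başlık ID'si bulunamadı.");
  }

  // The locale prefix and query parameters are dropped in the normalized form
  const id = match[1];
  return { id, url: `https://www.netflix.com/title/${id}` };
}
```

Returning both `id` and `url` keeps the sketch aligned with the `id` and `url` fields of the documented result object.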
## 🌍 Localization Features
### Turkish UI Text Removal
The parser automatically removes Turkish Netflix UI text from titles:
| Original Title | Cleaned Title | Removed Pattern |
|----------------|---------------|-----------------|
| "The Witcher izlemenizi bekliyor" | "The Witcher" | `izlemenizi bekliyor` |
| "Stranger Things izleyin" | "Stranger Things" | `izleyin` |
| "Sezon 4 devam et" | "Sezon 4" | `devam et` |
| "Dark başla" | "Dark" | `başla` |
| "The Crown izlemeye devam" | "The Crown" | `izlemeye devam` |
### Supported Turkish Patterns
```javascript
const TURKISH_UI_PATTERNS = [
/\s+izlemenizi bekliyor$/i, // "waiting for you to watch"
/\s+izleyin$/i, // "watch"
/\s+devam et$/i, // "continue"
/\s+başla$/i, // "start"
/\s+izlemeye devam$/i, // "continue watching"
/\s+Sezon\s+\d+.*izlemeye devam$/i, // "Sezon X izlemeye devam"
/\s+Sezon\s+\d+.*başla$/i, // "Sezon X başla"
];
```
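One plausible way to apply this list is a single pass that strips each matching suffix in order. The sketch below assumes that approach. The helper name `cleanTitle` is hypothetical, and the document does not say whether the parser applies the patterns once, repeatedly, or in a different order.

```javascript
// Patterns copied from the list above; the helper itself is hypothetical.
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+başla$/i,
  /\s+izlemeye devam$/i,
  /\s+Sezon\s+\d+.*izlemeye devam$/i,
  /\s+Sezon\s+\d+.*başla$/i,
];

function cleanTitle(rawTitle) {
  let title = rawTitle.trim();
  // Apply every pattern once, in list order
  for (const pattern of TURKISH_UI_PATTERNS) {
    title = title.replace(pattern, '');
  }
  return title.trim();
}

console.log(cleanTitle('The Witcher izlemenizi bekliyor')); // "The Witcher"
console.log(cleanTitle('Dark başla'));                      // "Dark"
```

In a single ordered pass like this, the generic `izlemeye devam` and `başla` patterns fire before the season-specific ones, so an input such as "Stranger Things Sezon 4 izlemeye devam" keeps its "Sezon 4" part. If the season marker should also be removed, the season-specific patterns would need to come first.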
### English UI Pattern Removal
The parser also removes common English UI text:
| Original Title | Cleaned Title | Removed Pattern |
|----------------|---------------|-----------------|
| "Watch Now The Witcher" | "The Witcher" | `Watch Now` |
| "The Witcher Continue Watching" | "The Witcher" | `Continue Watching` |
| "Season 4 Play" | "Season 4" | `Season X Play` |
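
The English pattern list is not shown in this document, so the sketch below infers three regular expressions from the table rows above. Both the patterns and the helper name `stripEnglishUi` are assumptions made for illustration only.

```javascript
// Assumed patterns, inferred from the table rows above (not the library's actual list).
function stripEnglishUi(rawTitle) {
  return rawTitle
    .replace(/^Watch Now\s+/i, '')             // leading "Watch Now"
    .replace(/\s+Continue Watching$/i, '')     // trailing "Continue Watching"
    .replace(/(Season\s+\d+)\s+Play$/i, '$1')  // "Season X Play" keeps "Season X"
    .trim();
}

console.log(stripEnglishUi('Watch Now The Witcher')); // "The Witcher"
console.log(stripEnglishUi('Season 4 Play'));         // "Season 4"
```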
## 📊 Data Extraction Patterns
### JSON-LD Processing
The parser extracts metadata from JSON-LD structured data:
```javascript
// Looks for these JSON-LD fields:
const YEAR_FIELDS = [
'datePublished', 'startDate', 'uploadDate',
'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];
const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];
```
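As a rough illustration of how these fields might be consulted, the sketch below walks `YEAR_FIELDS` in order and returns the first four-digit year it finds. The helper name `extractYear` and the year regex are assumptions, and the real parser may treat nested values such as `releasedEvent` differently.

```javascript
// Field list copied from above; the helper is a hypothetical sketch.
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

function extractYear(jsonLd) {
  for (const field of YEAR_FIELDS) {
    const value = jsonLd[field];
    if (value === undefined || value === null) continue;
    // Values may be numbers (2019), plain years ("2019"), or ISO dates ("2019-12-20").
    // Objects stringify to "[object Object]" and are skipped by the regex.
    const match = String(value).match(/\b(19|20)\d{2}\b/);
    if (match) return match[0];
  }
  return undefined;
}

console.log(extractYear({ datePublished: '2019-12-20' })); // "2019"
```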
### Meta Tag Fallbacks
If JSON-LD is unavailable, the parser falls back to HTML meta tags:
```html
<meta property="og:title" content="The Witcher izlemenizi bekliyor | Netflix">
<meta name="title" content="The Witcher | Netflix">
<title>The Witcher izlemenizi bekliyor | Netflix</title>
```
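A fallback over these three tags could look roughly like the sketch below. It uses regular expressions purely for brevity. The helper name `extractTitleFromMeta`, the priority order (`og:title`, then `name="title"`, then `<title>`), and the assumption that `property`/`name` precedes `content` are all illustrative, and a production parser would more likely use a real HTML parser.

```javascript
// Hypothetical helper: tries og:title, then <meta name="title">, then <title>.
function extractTitleFromMeta(html) {
  const candidates = [
    /<meta[^>]+property=["']og:title["'][^>]*content=["']([^"']*)["']/i,
    /<meta[^>]+name=["']title["'][^>]*content=["']([^"']*)["']/i,
    /<title[^>]*>([^<]*)<\/title>/i,
  ];
  for (const pattern of candidates) {
    const match = html.match(pattern);
    if (match && match[1].trim()) {
      // Drop the trailing " | Netflix" site suffix
      return match[1].replace(/\s*\|\s*Netflix\s*$/i, '').trim();
    }
  }
  return undefined;
}

console.log(extractTitleFromMeta('<title>The Witcher | Netflix</title>')); // "The Witcher"
```

The returned string can still carry UI text such as "izlemenizi bekliyor", which is why title cleanup is a separate step.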
### Season Detection
For TV series, the parser extracts season information:
```javascript
// Example JSON-LD for TV series:
{
  "@type": "TVSeries",
  "name": "The Witcher",
  "numberOfSeasons": 4,
  "datePublished": "2025"
}
// Result: "4 Sezon"
```
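The mapping from that JSON-LD object to the `"4 Sezon"` string could be sketched as follows. The helper name `extractSeasons` is hypothetical, and returning `null` for non-series types or missing counts is an assumption based on the `seasons: string | null` return type documented earlier.

```javascript
// Type list copied from SEASON_TYPES above; the helper is a hypothetical sketch.
const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];

function extractSeasons(jsonLd) {
  // In JSON-LD, @type may be a single string or an array of strings
  const types = [].concat(jsonLd['@type'] ?? []);
  if (!types.some((type) => SEASON_TYPES.includes(type))) return null;

  const count = Number(jsonLd.numberOfSeasons);
  return Number.isInteger(count) && count > 0 ? `${count} Sezon` : null;
}

console.log(extractSeasons({ '@type': 'TVSeries', numberOfSeasons: 4 })); // "4 Sezon"
```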
## ⚡ Performance Characteristics
### Response Times by Mode
| Mode | Typical Response | Success Rate | Resource Usage |
|------|------------------|--------------|----------------|
| Static Only | 200-500ms | ~85% | Very Low |
| Static + Headless Fallback | 2-5s | ~95% | Medium |
| Headless Only | 2-3s | ~90% | High |
### Resource Requirements
**Static Mode:**
- CPU: Low (< 5%)
- Memory: < 20MB
- Network: 1 HTTP request
**Headless Mode:**
- CPU: Medium (10-20%)
- Memory: 100-200MB
- Network: Multiple requests
- Browser: Chromium instance
## 🚨 Error Types & Handling
### Common Error Scenarios
#### 1. Invalid URL
```javascript
await scraperNetflix('invalid-url');
// Throws: "Geçersiz URL sağlandı."
```
#### 2. Non-Netflix URL
```javascript
await scraperNetflix('https://google.com');
// Throws: "URL netflix.com adresini göstermelidir."
```
#### 3. Missing Title ID
```javascript
await scraperNetflix('https://www.netflix.com/browse');
// Throws: "URL'de Netflix başlık ID'si bulunamadı."
```
#### 4. Network Timeout
```javascript
await scraperNetflix('https://www.netflix.com/title/80189685', { timeoutMs: 1 });
// Throws: "Request timed out while reaching Netflix."
```
#### 5. 404 Not Found
```javascript
await scraperNetflix('https://www.netflix.com/title/99999999');
// Throws: "Netflix title not found (404)."
```
#### 6. Playwright Not Available
```javascript
// When headless mode needed but Playwright not installed
// Throws: "Playwright is not installed. Install the optional dependency..."
```
#### 7. Parsing Failed
```javascript
// When HTML cannot be parsed for metadata
// Throws: "Netflix sayfa meta verisi parse edilemedi."
```
### Error Object Structure
```javascript
{
  name: "Error",
  message: "Netflix scraping başarısız: Geçersiz URL sağlandı.",
  stack: "Error: Netflix scraping başarısız: Geçersiz URL sağlandı.\n    at scraperNetflix...",
  // Additional context for debugging
}
```
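Because the error is a plain `Error` with no documented code field, callers who need to branch on the failure type have to inspect `message`. The sketch below shows one way to do that using only the message fragments listed in this section. The helper name `classifyScrapeError` and the kind labels are hypothetical, and matching on message text is brittle if the wording changes.

```javascript
// Message fragments copied from the scenarios above; labels are hypothetical.
const ERROR_KINDS = [
  ['invalid-url', 'Geçersiz URL sağlandı.'],
  ['not-netflix', 'URL netflix.com adresini göstermelidir.'],
  ['missing-id', "URL'de Netflix başlık ID'si bulunamadı."],
  ['timeout', 'Request timed out while reaching Netflix.'],
  ['not-found', 'Netflix title not found (404).'],
  ['playwright-missing', 'Playwright is not installed'],
  ['parse-failed', 'Netflix sayfa meta verisi parse edilemedi.'],
];

function classifyScrapeError(error) {
  const message = String(error && error.message);
  // includes() tolerates a prefix such as "Netflix scraping başarısız: "
  const found = ERROR_KINDS.find(([, text]) => message.includes(text));
  return found ? found[0] : 'unknown';
}

console.log(classifyScrapeError(new Error('Netflix scraping başarısız: Geçersiz URL sağlandı.'))); // "invalid-url"
```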
## 🔧 Advanced Usage Patterns
### Batch Processing
```javascript
import { scraperNetflix } from 'metascraper';
const urls = [
'https://www.netflix.com/title/80189685',
'https://www.netflix.com/title/82123114',
'https://www.netflix.com/title/70177057'
];
const results = await Promise.allSettled(
  urls.map(url => scraperNetflix(url))
);
results.forEach((result, index) => {
  if (result.status === 'fulfilled') {
    console.log(`${urls[index]}:`, result.value.name);
  } else {
    console.log(`${urls[index]}:`, result.reason.message);
  }
});
```
### Custom User-Agent Rotation
```javascript
import { scraperNetflix } from 'metascraper';
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];
// Pick one User-Agent at random for each request
const getRandomUA = () => userAgents[Math.floor(Math.random() * userAgents.length)];
const url = 'https://www.netflix.com/title/80189685';
const result = await scraperNetflix(url, {
  userAgent: getRandomUA()
});
```
### Retry Logic Implementation
```javascript
import { scraperNetflix } from 'metascraper';
async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await scraperNetflix(url);
    } catch (error) {
      // Give up after the last attempt and surface the original error
      if (attempt === maxRetries) throw error;
      // Linear backoff: wait 1s, then 2s, and so on
      console.log(`Attempt ${attempt} failed, retrying in ${attempt * 1000}ms...`);
      await new Promise(resolve => setTimeout(resolve, attempt * 1000));
    }
  }
}
```
### Caching Integration
```javascript
import { scraperNetflix } from 'metascraper';
const cache = new Map();
async function scrapeWithCache(url) {
  const cacheKey = `netflix:${url}`;
  if (cache.has(cacheKey)) {
    console.log('Cache hit for:', url);
    return cache.get(cacheKey);
  }
  const result = await scraperNetflix(url);
  cache.set(cacheKey, result);
  // Optional: cache expiration after 30 minutes.
  // unref() keeps the pending timer from holding the Node.js process open.
  setTimeout(() => cache.delete(cacheKey), 30 * 60 * 1000).unref();
  return result;
}
```
---
*API documentation last updated: 2025-11-23*