wisecolt/metascraper

Fork 0

Files

sbilketay fefa6627e9 amazon prime scrap özelliği eklendi

2025-11-23 16:09:39 +03:00

19 KiB

Raw Blame History

MetaScraper API Reference

🎯 Main API

`scraperNetflix(inputUrl, options?)`

Netflix metadata extraction function with automatic fallback and Turkish localization.

`scraperPrime(inputUrl, options?)`

Amazon Prime Video metadata extraction function with automatic fallback and Turkish localization.

Parameters

Parameter	Type	Required	Default	Description
`inputUrl`	`string`	✅	-	Netflix title URL (any format)
`options`	`object`	❌	`{}`	Configuration options

Options

Option	Type	Default	Description
`headless`	`boolean`	`true`	Enable Playwright fallback for missing data
`timeoutMs`	`number`	`15000`	Request timeout in milliseconds
`userAgent`	`string`	Chrome 118 User-Agent	Custom User-Agent string

Returns

Promise<{
  url: string;                    // Normalized Netflix URL
  id: string;                     // Netflix title ID
  name: string;                   // Clean title (Turkish UI removed)
  year: string \| number \| undefined; // Release year
  seasons: string \| null;         // Season info for TV series
  thumbnail: string \| null;       // Poster/thumbnail image URL
  info: string \| null;            // Content description/summary
  genre: string \| null;           // Genre (Turkish normalized)
}>

Examples

Basic Usage

import { scraperNetflix } from 'metascraper';

const result = await scraperNetflix('https://www.netflix.com/tr/title/82123114');
console.log(result);
// {
//   "url": "https://www.netflix.com/title/82123114",
//   "id": "82123114",
//   "name": "ONE SHOT with Ed Sheeran",
//   "year": "2025",
//   "seasons": null,
//   "thumbnail": "https://occ-0-7335-778.1.nflxso.net/dnm/api/v6/6AYY37jfdO6hpXcMjf9Yu5cnmO0/AAAABSkrIGPSyEfSWYQzc8rEFo6EtVV6Ls8WtPpNwR42MSKSNPNomZWV5P_l2MxGuJEkoPm71UT_eBK_SsTEH8pRslQr0sjpdhVHjxh4.jpg",
//   "info": "Ed Sheeran, matematiğin mucizevi gücünü ve müziğin birleştirici gücünü sergileyen benzersiz bir performansla sahneye çıkıyor.",
//   "genre": "Belgesel"
// }

Advanced Configuration

import { scraperNetflix } from 'metascraper';

const result = await scraperNetflix(
  'https://www.netflix.com/title/80189685',
  {
    headless: false,        // Disable browser fallback
    timeoutMs: 30000,       // 30 second timeout
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  }
);

Error Handling

import { scraperNetflix } from 'metascraper';

try {
  const result = await scraperNetflix('https://www.netflix.com/title/80189685');
  console.log('Success:', result);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Turkish error messages for Turkish users
  // "Netflix scraping başarısız: Netflix URL'i gereklidir."
}

`scraperPrime(inputUrl, options?)`

Amazon Prime Video metadata extraction function with automatic fallback and Turkish localization.

Parameters

Parameter	Type	Required	Default	Description
`inputUrl`	`string`	✅	-	Amazon Prime Video URL (any format)
`options`	`object`	❌	`{}`	Configuration options

Options

Option	Type	Default	Description
`headless`	`boolean`	`true`	Enable Playwright fallback for missing data
`timeoutMs`	`number`	`15000`	Request timeout in milliseconds
`userAgent`	`string`	Chrome 118 User-Agent	Custom User-Agent string

Returns

Promise<{
  url: string;                    // Normalized Prime Video URL
  id: string;                     // Prime Video content ID
  name: string;                   // Clean title (Amazon UI removed)
  year: string | number | undefined; // Release year
  seasons: string | null;         // Season info for TV series (null for movies)
  thumbnail: string | null;       // Poster/thumbnail image URL
  info: string | null;            // Content description/summary
  genre: string | null;           // Genre (Turkish normalized)
}>

Examples

Basic Usage

import { scraperPrime } from 'metascraper';

const result = await scraperPrime('https://www.primevideo.com/-/tr/detail/0NHIN3TGAI9L7VZ45RS52RHUPL/ref=share_ios_movie');
console.log(result);
// {
//   "url": "https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL",
//   "id": "0NHIN3TGAI9L7VZ45RS52RHUPL",
//   "name": "Little Women",
//   "year": "2020",
//   "seasons": null,
//   "thumbnail": "https://m.media-amazon.com/images/S/pv-target-images/c1b08ebea5ba29c47145c623e7d1c586290221ec12fa93850029e581f54049c4.jpg",
//   "info": "In the years after the Civil War, Jo March lives in New York and makes her living as a writer...",
//   "genre": "Dram"
// }

Advanced Configuration

import { scraperPrime } from 'metascraper';

const result = await scraperPrime(
  'https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL',
  {
    headless: false,        // Disable browser fallback
    timeoutMs: 30000,       // 30 second timeout
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  }
);

Error Handling

import { scraperPrime } from 'metascraper';

try {
  const result = await scraperPrime('https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL');
  console.log('Success:', result);
} catch (error) {
  console.error('Scraping failed:', error.message);
  // Turkish error messages for Turkish users
  // "Amazon Prime scraping başarısız: Amazon Prime URL'i gereklidir."
}

🧩 Internal APIs

`parseNetflixHtml(html)` - Parser API

Parse Netflix HTML content to extract metadata without network requests.

Parameters

Parameter	Type	Required	Description
`html`	`string`	✅	Raw HTML content from Netflix page

Returns

{
  name?: string;                    // Clean title
  year?: string \| number;          // Release year
  seasons?: string \| null;         // Season information
  thumbnail?: string \| null;       // Thumbnail image URL
  info?: string \| null;            // Content description
  genre?: string \| null;           // Genre information
}

Examples

import { parseNetflixHtml } from 'metascraper/parser';

// With cached HTML
const fs = await import('node:fs');
const html = fs.readFileSync('netflix-page.html', 'utf8');
const metadata = parseNetflixHtml(html);

console.log(metadata);
// {
//   "name": "The Witcher",
//   "year": "2025",
//   "seasons": "4 Sezon",
//   "thumbnail": "https://occ-0-7335-778.1.nflxso.net/dnm/api/v6/6AYY37jfdO6hpXcMjf9Yu5cnmO0/AAAABSkrIGPSyEfSWYQzc8rEFo6EtVV6Ls8WtPpNwR42MSKSNPNomZWV5P_l2MxGuJEkoPm71UT_eBK_SsTEH8pRslQr0sjpdhVHjxh4.jpg",
//   "info": "Mutasyona uğramış bir canavar avcısı olan Rivyalı Geralt, insanların çoğunlukla yaratıklardan daha uğursuz olduğu, karmaşa içindeki bir dünyada kaderine doğru yol alıyor.",
//   "genre": "Aksiyon"
// }

`fetchPageContentWithPlaywright(url, options)` - Headless API

Fetch Netflix page content using Playwright browser automation.

Parameters

Parameter	Type	Required	Description
`url`	`string`	✅	Complete URL to fetch
`options`	`object`	✅	Browser configuration

Options

Option	Type	Default	Description
`timeoutMs`	`number`	`15000`	Page load timeout
`userAgent`	`string`	Chrome 118	Browser User-Agent
`headless`	`boolean`	`true`	Run browser in headless mode

Returns

Promise<string>  // HTML content of the page

Examples

import { fetchPageContentWithPlaywright } from 'metascraper/headless';

try {
  const html = await fetchPageContentWithPlaywright(
    'https://www.netflix.com/title/80189685',
    {
      timeoutMs: 30000,
      headless: false  // Show browser (useful for debugging)
    }
  );

  // Process the HTML with parser
  const metadata = parseNetflixHtml(html);
  console.log(metadata);
} catch (error) {
  console.error('Browser automation failed:', error.message);
}

`parsePrimeHtml(html)` - Prime Video Parser API

Parse Amazon Prime Video HTML content to extract metadata without network requests.

Parameters

Parameter	Type	Required	Description
`html`	`string`	✅	Raw HTML content from Prime Video page

Returns

{
  name?: string;                    // Clean title
  year?: string | number;          // Release year
  seasons?: string | null;         // Season information
  thumbnail?: string | null;       // Thumbnail image URL
  info?: string | null;            // Content description
  genre?: string | null;           // Genre information
}

Examples

import { parsePrimeHtml } from 'metascraper/parser';

// With cached HTML
const fs = await import('node:fs');
const html = fs.readFileSync('prime-page.html', 'utf8');
const metadata = parsePrimeHtml(html);

console.log(metadata);
// {
//   "name": "Little Women",
//   "year": "2020",
//   "seasons": null,
//   "thumbnail": "https://m.media-amazon.com/images/S/pv-target-images/...",
//   "info": "In the years after the Civil War, Jo March lives in New York...",
//   "genre": "Dram"
// }

🔧 URL Processing

Supported URL Formats

The scraperNetflix function automatically normalizes various Netflix URL formats:

Input Format	Normalized Output	Notes
`https://www.netflix.com/title/80189685`	`https://www.netflix.com/title/80189685`	Standard format
`https://www.netflix.com/tr/title/80189685`	`https://www.netflix.com/title/80189685`	Turkish locale
`https://www.netflix.com/tr/title/80189685?s=i&trkid=264356104&vlang=tr`	`https://www.netflix.com/title/80189685`	With parameters
`https://www.netflix.com/title/80189685?trackId=12345`	`https://www.netflix.com/title/80189685`	With tracking

URL Validation

The function validates URLs with these rules:

Format: Must be a valid URL
Domain: Must contain netflix.com
Path: Must contain title/ followed by numeric ID
ID Extraction: Uses regex to extract title ID

// These will work:
'https://www.netflix.com/title/80189685'
'https://www.netflix.com/tr/title/80189685?s=i&vlang=tr'

// These will fail:
'https://google.com'                    // Wrong domain
'https://www.netflix.com/browse'        // No title ID
'not-a-url'                            // Invalid format
'https://www.netflix.com/title/abc'    // Non-numeric ID

Amazon Prime Video URL Formats

The scraperPrime function automatically normalizes various Prime Video URL formats:

Input Format	Normalized Output	Notes
`https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL`	`https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL`	Standard format
`https://www.primevideo.com/-/tr/detail/0NHIN3TGAI9L7VZ45RS52RHUPL/ref=share_ios_movie`	`https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL`	Turkish locale with tracking
`https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL?ref_=atv_dp`	`https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL`	With parameters

Prime Video URL Validation

The function validates URLs with these rules:

Format: Must be a valid URL
Domain: Must contain primevideo.com
Path: Must contain detail/ followed by content ID
ID Extraction: Uses path parsing to extract content ID

// These will work:
'https://www.primevideo.com/detail/0NHIN3TGAI9L7VZ45RS52RHUPL'
'https://www.primevideo.com/-/tr/detail/0NHIN3TGAI9L7VZ45RS52RHUPL/ref=share_ios_movie'

// These will fail:
'https://google.com'                    // Wrong domain
'https://www.primevideo.com/browse'    // No content ID
'not-a-url'                            // Invalid format

🌍 Localization Features

Turkish UI Text Removal

The parser automatically removes Turkish Netflix UI text from titles:

Original Title	Cleaned Title	Removed Pattern
"The Witcher izlemenizi bekliyor"	"The Witcher	`izlemenizi bekliyor`
"Stranger Things izleyin"	"Stranger Things"	`izleyin`
"Sezon 4 devam et"	"Sezon 4"	`devam et`
"Dark başla"	"Dark"	`başla`
"The Crown izlemeye devam"	"The Crown"	`izlemeye devam`

Supported Turkish Patterns

const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,           // "waiting for you to watch"
  /\s+izleyin$/i,                      // "watch"
  /\s+devam et$/i,                     // "continue"
  /\s+başla$/i,                        // "start"
  /\s+izlemeye devam$/i,               // "continue watching"
  /\s+Sezon\s+\d+.*izlemeye devam$/i,  // "Sezon X izlemeye devam"
  /\s+Sezon\s+\d+.*başla$/i,           // "Sezon X başla"
];

English UI Pattern Removal

Also removes universal English UI text:

Original Title	Cleaned Title	Removed Pattern
"Watch Now The Witcher"	"The Witcher"	`Watch Now`
"The Witcher Continue Watching"	"The Witcher"	`Continue Watching`
"Season 4 Play"	"Season 4"	`Season X Play`

📊 Data Extraction Patterns

JSON-LD Processing

The parser extracts metadata from JSON-LD structured data:

// Looks for these JSON-LD fields:
const YEAR_FIELDS = [
  'datePublished', 'startDate', 'uploadDate',
  'copyrightYear', 'releasedEvent', 'releaseYear', 'dateCreated'
];

const SEASON_TYPES = ['TVSeries', 'TVShow', 'Series'];

Meta Tag Fallbacks

If JSON-LD is unavailable, falls back to HTML meta tags:

<meta property="og:title" content="The Witcher izlemenizi bekliyor | Netflix">
<meta name="title" content="The Witcher | Netflix">
<title>The Witcher izlemenizi bekliyor | Netflix</title>

Thumbnail Image Extraction

The parser automatically extracts poster/thumbnail images from Netflix meta tags:

// Thumbnail selectors in priority order:
const THUMBNAIL_SELECTORS = [
  'meta[property="og:image"]',           // Open Graph image (primary)
  'meta[name="twitter:image"]',        // Twitter card image
  'meta[property="og:image:secure_url"]', // Secure image URL
  'link[rel="image_src"]',             // Image source link
  'meta[itemprop="image"]'              // Schema.org image
];

Example Netflix HTML:

<meta property="og:image" content="https://occ-0-7335-778.1.nflxso.net/dnm/api/v6/6AYY37jfdO6hpXcMjf9Yu5cnmO0/AAAABSkrIGPSyEfSWYQzc8rEFo6EtVV6Ls8WtPpNwR42MSKSNPNomZWV5P_l2MxGuJEkoPm71UT_eBK_SsTEH8pRslQr0sjpdhVHjxh4.jpg">

URL Validation:

Only Netflix CDN domains are accepted (nflxso.net, nflximg.net, etc.)
Image file extensions are verified (.jpg, .jpeg, .png, .webp)
Query parameters are cleaned for stability

Fallback Strategy:

Try Open Graph image first (most reliable)
Fall back to Twitter card image
Try other meta tags if needed
Return null if no valid thumbnail found

Season Detection

For TV series, extracts season information:

// Example JSON-LD for TV series:
{
  "@type": "TVSeries",
  "name": "The Witcher",
  "numberOfSeasons": 4,
  "datePublished": "2025"
}

// Result: "4 Sezon"

⚡ Performance Characteristics

Response Times by Mode

Mode	Typical Response	Success Rate	Resource Usage
Static Only	200-500ms	~85%	Very Low
Static + Headless Fallback	2-5s	~95%	Medium
Headless Only	2-3s	~90%	High

Resource Requirements

Static Mode:

CPU: Low (< 5%)
Memory: < 20MB
Network: 1 HTTP request

Headless Mode:

CPU: Medium (10-20%)
Memory: 100-200MB
Network: Multiple requests
Browser: Chromium instance

🚨 Error Types & Handling

Common Error Scenarios

1. Invalid URL

await scraperNetflix('invalid-url');
// Throws: "Geçersiz URL sağlandı."

2. Non-Netflix URL

await scraperNetflix('https://google.com');
// Throws: "URL netflix.com adresini göstermelidir."

3. Missing Title ID

await scraperNetflix('https://www.netflix.com/browse');
// Throws: "URL'de Netflix başlık ID'si bulunamadı."

4. Network Timeout

await scraperNetflix('https://www.netflix.com/title/80189685', { timeoutMs: 1 });
// Throws: "Request timed out while reaching Netflix."

5. 404 Not Found

await scraperNetflix('https://www.netflix.com/title/99999999');
// Throws: "Netflix title not found (404)."

6. Playwright Not Available

// When headless mode needed but Playwright not installed
// Throws: "Playwright is not installed. Install the optional dependency..."

7. Parsing Failed

// When HTML cannot be parsed for metadata
// Throws: "Netflix sayfa meta verisi parse edilemedi."

Error Object Structure

{
  name: "Error",
  message: "Netflix scraping başarısız: Geçersiz URL sağlandı.",
  stack: "Error: Netflix scraping başarısız: Geçersiz URL sağlandı.\n    at scraperNetflix...",
  // Additional context for debugging
}

🔧 Advanced Usage Patterns

Batch Processing

import { scraperNetflix } from 'metascraper';

const urls = [
  'https://www.netflix.com/title/80189685',
  'https://www.netflix.com/title/82123114',
  'https://www.netflix.com/title/70177057'
];

const results = await Promise.allSettled(
  urls.map(url => scraperNetflix(url))
);

results.forEach((result, index) => {
  if (result.status === 'fulfilled') {
    console.log(`✅ ${urls[index]}:`, result.value.name);
  } else {
    console.log(`❌ ${urls[index]}:`, result.reason.message);
  }
});

Custom User-Agent Rotation

const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
];

const getRandomUA = () => userAgents[Math.floor(Math.random() * userAgents.length)];

const result = await scraperNetflix(url, {
  userAgent: getRandomUA()
});

Retry Logic Implementation

async function scrapeWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await scraperNetflix(url);
    } catch (error) {
      if (attempt === maxRetries) throw error;

      console.log(`Attempt ${attempt} failed, retrying in ${attempt * 1000}ms...`);
      await new Promise(resolve => setTimeout(resolve, attempt * 1000));
    }
  }
}

Caching Integration

const cache = new Map();

async function scrapeWithCache(url) {
  const cacheKey = `netflix:${url}`;

  if (cache.has(cacheKey)) {
    console.log('Cache hit for:', url);
    return cache.get(cacheKey);
  }

  const result = await scraperNetflix(url);
  cache.set(cacheKey, result);

  // Optional: Cache expiration
  setTimeout(() => cache.delete(cacheKey), 30 * 60 * 1000); // 30 minutes

  return result;
}

API documentation last updated: 2025-11-23

19 KiB Raw Blame History Unescape Escape

MetaScraper API Reference

🎯 Main API

scraperNetflix(inputUrl, options?)

scraperPrime(inputUrl, options?)

Parameters

Options

Returns

Examples

scraperPrime(inputUrl, options?)

Parameters

Options

Returns

Examples

🧩 Internal APIs

parseNetflixHtml(html) - Parser API

Parameters

Returns

Examples

fetchPageContentWithPlaywright(url, options) - Headless API

Parameters

Options

Returns

Examples

parsePrimeHtml(html) - Prime Video Parser API

Parameters

Returns

Examples

🔧 URL Processing

Supported URL Formats

URL Validation

Amazon Prime Video URL Formats

Prime Video URL Validation

🌍 Localization Features

Turkish UI Text Removal

Supported Turkish Patterns

English UI Pattern Removal

📊 Data Extraction Patterns

JSON-LD Processing

Meta Tag Fallbacks

Thumbnail Image Extraction

Season Detection

⚡ Performance Characteristics

Response Times by Mode

Resource Requirements

🚨 Error Types & Handling

Common Error Scenarios

1. Invalid URL

2. Non-Netflix URL

3. Missing Title ID

4. Network Timeout

5. 404 Not Found

6. Playwright Not Available

7. Parsing Failed

Error Object Structure

🔧 Advanced Usage Patterns

Batch Processing

Custom User-Agent Rotation

Retry Logic Implementation

Caching Integration

19 KiB

Raw Blame History

`scraperNetflix(inputUrl, options?)`

`scraperPrime(inputUrl, options?)`

`scraperPrime(inputUrl, options?)`

`parseNetflixHtml(html)` - Parser API

`fetchPageContentWithPlaywright(url, options)` - Headless API

`parsePrimeHtml(html)` - Prime Video Parser API