# MetaScraper Frequently Asked Questions (FAQ)

## 🚀 Getting Started
### Q: How do I install MetaScraper?

```bash
npm install metascraper
```
### Q: What are the system requirements?

- Node.js: 18+ (20+ recommended)
- Memory: minimum 50MB for static mode, 200MB+ for headless mode
- Network: internet connection to reach Netflix

```bash
# Check your Node.js version
node --version  # Should be 18.x or higher
```
### Q: Does MetaScraper work with TypeScript?

Yes! MetaScraper provides TypeScript support out of the box:

```typescript
import { scraperNetflix } from 'metascraper';

interface NetflixMetadata {
  url: string;
  id: string;
  name: string;
  year: string | number | undefined;
  seasons: string | null;
}

const result: Promise<NetflixMetadata> = scraperNetflix('https://www.netflix.com/title/80189685');
```
## 🔧 Technical Questions
### Q: What's the difference between static and headless mode?

**Static Mode (default):**
- ✅ Faster (200-500ms)
- ✅ Lower memory usage
- ✅ No browser required
- ⚠️ 85% success rate

**Headless Mode (fallback):**
- ✅ Higher success rate (99%)
- ✅ Handles JavaScript-rendered content
- ❌ Slower (2-5 seconds)
- ❌ Requires Playwright

```javascript
// Force static mode only
await scraperNetflix(url, { headless: false });

// Enable headless fallback
await scraperNetflix(url, { headless: true });
```
### Q: Do I need to install Playwright?

No, Playwright is optional. MetaScraper works without it using static HTML parsing.

Install Playwright only if:
- You need higher success rates
- Static mode fails for specific titles
- You want JavaScript-rendered content

```bash
# Optional: Install for better success rates
npm install playwright
npx playwright install chromium
```
### Q: Can MetaScraper work in the browser?

Not currently. MetaScraper is designed for Node.js environments due to:
- CORS restrictions in browsers
- Netflix's bot protection
- Node.js-only dependencies such as cheerio

For browser usage, consider the options below (a browser-side sketch follows this list):
- Creating a proxy API server
- Using serverless functions
- Implementing browser-based scraping separately
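For the proxy route, the browser side only needs a plain `fetch` call to your own backend. This is a minimal sketch assuming you expose a `POST /scrape` endpoint (for example, the Express server shown under the REST API question below); the hostname and response shape are placeholders, not part of MetaScraper:

```javascript
// Browser-side sketch: call your own proxy endpoint instead of Netflix directly.
// Assumes your server exposes POST /scrape and returns MetaScraper's result as JSON.
async function scrapeViaProxy(url) {
  const response = await fetch('https://your-api.example.com/scrape', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });
  if (!response.ok) {
    throw new Error(`Proxy request failed: ${response.status}`);
  }
  return response.json(); // e.g. { url, id, name, year, seasons }
}

// Usage
scrapeViaProxy('https://www.netflix.com/title/80189685')
  .then(metadata => console.log(metadata.name))
  .catch(console.error);
```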
### Q: How does MetaScraper handle Netflix's bot protection?

MetaScraper uses several techniques:
- Realistic User-Agent strings that mimic regular browsers
- Proper HTTP headers, including Accept-Language
- Rate limiting considerations to avoid detection
- JavaScript rendering (when needed) to appear more human

```javascript
const result = await scraperNetflix(url, {
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
});
```
## 🌍 Localization & Turkish Support
### Q: What Turkish UI patterns does MetaScraper remove?

MetaScraper removes these Turkish Netflix UI patterns (an illustrative stripping sketch follows the table):

| Pattern | English Equivalent | Example |
|---|---|---|
| `izlemenizi bekliyor` | "waiting for you to watch" | "The Witcher izlemenizi bekliyor" |
| `izleyin` | "watch" | "Dark izleyin" |
| `devam et` | "continue" | "Money Heist devam et" |
| `başla` | "start" | "Stranger Things başla" |
| `izlemeye devam` | "continue watching" | "The Crown izlemeye devam" |
### Q: Does MetaScraper support other languages?

Currently optimized for Turkish Netflix interfaces, but it also removes universal English patterns:
- ✅ Turkish: Full support with specific patterns
- ✅ English: Basic UI text removal
- 🔄 Other languages: Can be extended (file an issue); a client-side workaround is sketched below
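Until other languages are supported natively, you can post-process the returned `name` yourself. A minimal sketch, assuming you supply your own patterns (the German examples below are purely illustrative and not shipped with the library):

```javascript
import { scraperNetflix } from 'metascraper';

// Illustrative only: extra UI suffixes for a language MetaScraper does not yet handle.
const EXTRA_UI_PATTERNS = [
  /\s+jetzt ansehen$/i,   // assumed German "watch now" pattern
  /\s+weiterschauen$/i    // assumed German "continue watching" pattern
];

async function scrapeAndClean(url) {
  const result = await scraperNetflix(url);
  let name = result.name;
  for (const pattern of EXTRA_UI_PATTERNS) {
    name = name.replace(pattern, '').trim();
  }
  return { ...result, name };
}
```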
### Q: What about regional Netflix content?

MetaScraper works globally, but:
- Content availability varies by region
- Some titles may be region-locked
- URL formats work universally

```javascript
// Test different regional URLs
const regionalUrls = [
  'https://www.netflix.com/title/80189685',    // Global
  'https://www.netflix.com/tr/title/80189685', // Turkey
  'https://www.netflix.com/us/title/80189685'  // US
];
```
## ⚡ Performance & Usage
### Q: How fast is MetaScraper?

**Response Times:**
- Static mode: 200-500ms
- Headless fallback: 2-5 seconds
- Batch processing: 10-50 URLs per second (static mode)

**Resource Usage:**
- Memory: <50MB (static), 100-200MB (headless)
- CPU: Low impact for normal usage
- Network: 1 HTTP request per title

```javascript
// Performance monitoring
import { performance } from 'node:perf_hooks';

const start = performance.now();
await scraperNetflix(url);
const duration = performance.now() - start;

console.log(`Scraping took ${duration}ms`);
```
### Q: Can I use MetaScraper for bulk scraping?

Yes, but pace your requests and keep concurrency low:

```javascript
// Good: Sequential processing with delays
async function bulkScrape(urls) {
  const results = [];
  for (const url of urls) {
    const result = await scraperNetflix(url);
    results.push(result);
    // Be respectful: add delay between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  return results;
}

// Better: Concurrent processing with limits
async function concurrentScrape(urls, concurrency = 5) {
  const chunks = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    chunks.push(urls.slice(i, i + concurrency));
  }

  const results = [];
  for (const chunk of chunks) {
    const chunkResults = await Promise.allSettled(
      chunk.map(url => scraperNetflix(url, { headless: false }))
    );
    results.push(...chunkResults);
    // Delay between chunks
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
  return results;
}
```
### Q: Does MetaScraper cache results?

No built-in caching, but it is easy to implement your own:

```javascript
// Simple in-memory cache implementation
const cache = new Map();
const CACHE_TTL = 30 * 60 * 1000; // 30 minutes

async function scrapeWithCache(url, options = {}) {
  const cacheKey = `${url}:${JSON.stringify(options)}`;

  if (cache.has(cacheKey)) {
    const { data, timestamp } = cache.get(cacheKey);
    if (Date.now() - timestamp < CACHE_TTL) {
      return data;
    }
  }

  const result = await scraperNetflix(url, options);
  cache.set(cacheKey, { data: result, timestamp: Date.now() });
  return result;
}
```
## 🛠️ Troubleshooting
### Q: Why am I getting "File is not defined" errors?

This happens on Node.js 18 without proper polyfills:

```bash
# Solution 1: Update to Node.js 20+
nvm install 20
nvm use 20

# Solution 2: Use the latest MetaScraper version
npm update metascraper
```
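If you must stay on Node.js 18, a small polyfill at the top of your entry file usually works. This is a hedged sketch assuming Node 18.13+, where `File` is exported from `node:buffer` but not yet exposed as a global:

```javascript
// polyfill-file.js — import this before anything that expects a global File.
// Assumption: Node.js 18.13+ (File was added to node:buffer in that release).
import { File } from 'node:buffer';

if (typeof globalThis.File === 'undefined') {
  globalThis.File = File;
}
```

Then import it first in your application: `import './polyfill-file.js';`.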
### Q: Why does scraping fail for some titles?

Common reasons:
- Region restrictions: Title not available in your location
- Invalid URL: Netflix URL format changed or incorrect
- Netflix changes: HTML structure updated
- Network issues: Connection problems or timeouts

Debug steps:

```javascript
// Assumes both helpers are exported from the package root
import { scraperNetflix, normalizeNetflixUrl } from 'metascraper';

async function debugScraping(url) {
  try {
    console.log('Testing URL:', url);

    // Test URL normalization
    const normalized = normalizeNetflixUrl(url);
    console.log('Normalized:', normalized);

    // Test with different configurations
    const configs = [
      { headless: false, timeoutMs: 30000 },
      { headless: true, timeoutMs: 30000 },
      { headless: false, userAgent: 'different-ua' }
    ];

    for (const config of configs) {
      try {
        const result = await scraperNetflix(url, config);
        console.log('✅ Success with config:', config, result.name);
        return result;
      } catch (error) {
        console.log('❌ Failed with config:', config, error.message);
      }
    }
  } catch (error) {
    console.error('Debug error:', error);
  }
}
```
### Q: How do I handle rate limiting?

MetaScraper doesn't include built-in rate limiting, but you can implement it:

```javascript
class RateLimiter {
  constructor(requestsPerSecond = 1) {
    this.delay = 1000 / requestsPerSecond;
    this.lastRequest = 0;
  }

  async wait() {
    const now = Date.now();
    const timeSinceLastRequest = now - this.lastRequest;
    if (timeSinceLastRequest < this.delay) {
      const waitTime = this.delay - timeSinceLastRequest;
      await new Promise(resolve => setTimeout(resolve, waitTime));
    }
    this.lastRequest = Date.now();
  }
}

const rateLimiter = new RateLimiter(0.5); // 0.5 requests per second (one every 2s)

async function rateLimitedScrape(url) {
  await rateLimiter.wait();
  return await scraperNetflix(url);
}
```
## 🔒 Legal & Ethical Questions
### Q: Is scraping Netflix legal?

**Important**: Web scraping exists in a legal gray area. Consider:

✅ **Generally Acceptable:**
- Personal use and research
- Educational purposes
- Non-commercial applications
- Respectful scraping (low frequency)

⚠️ **Potentially Problematic:**
- Commercial use without permission
- High-frequency scraping
- Competing with Netflix's services
- Violating Netflix's Terms of Service

📋 **Best Practices:**
- Be respectful with request frequency
- Don't scrape at commercial scale
- Use results for personal/educational purposes
- Consider Netflix's ToS
### Q: Does MetaScraper respect robots.txt?

MetaScraper doesn't automatically check robots.txt, but you can do it yourself:

```javascript
// robots-parser exposes a factory function as its default export
import robotsParser from 'robots-parser';

async function scrapeWithRobotsCheck(url) {
  const robotsUrl = new URL('/robots.txt', url).href;
  // Fetch the site's actual robots.txt instead of hard-coding rules
  const robotsTxt = await fetch(robotsUrl).then(res => res.text());
  const robots = robotsParser(robotsUrl, robotsTxt);

  if (robots.isAllowed(url, 'MetaScraper')) {
    return await scraperNetflix(url);
  } else {
    throw new Error('Scraping disallowed by robots.txt');
  }
}
```
## 📦 Development & Contributing
### Q: How can I contribute to MetaScraper?

- **Report Issues**: Found bugs or new Turkish UI patterns
- **Suggest Features**: Ideas for improvement
- **Submit Pull Requests**: Code contributions
- **Improve Documentation**: Better examples and guides

```bash
# Development setup
git clone https://github.com/username/flixscaper.git
cd flixscaper
npm install
npm test
npm run demo
```
### Q: How do I add new Turkish UI patterns?

If you discover new Turkish Netflix UI text patterns:

1. Create an issue with examples:

   ```
   **New Pattern**: "yeni bölüm"
   **Example**: "Dizi Adı yeni bölüm | Netflix"
   **Expected**: "Dizi Adı"
   ```

2. Or submit a PR adding the pattern:

   ```javascript
   // src/parser.js
   const TURKISH_UI_PATTERNS = [
     // ... existing patterns
     /\s+yeni bölüm$/i, // Add new pattern
   ];
   ```
### Q: How can I test MetaScraper locally?

```bash
# Clone repository
git clone https://github.com/username/flixscaper.git
cd flixscaper

# Install dependencies
npm install

# Run tests
npm test

# Test with demo
npm run demo

# Test your own URLs
node -e "
import('./src/index.js').then(async (m) => {
  const result = await m.scraperNetflix('https://www.netflix.com/title/80189685');
  console.log(result);
})
"
```
## 🔮 Future Questions
### Q: Will MetaScraper support other streaming platforms?
Currently focused on Netflix, but the architecture could be adapted. If you're interested in other platforms, create an issue to discuss:
- YouTube metadata extraction
- Amazon Prime scraping
- Disney+ integration
- Multi-platform support
### Q: Is there a REST API version available?

Not currently, but you could easily create one:

```javascript
// Example Express.js server
import express from 'express';
import { scraperNetflix } from 'metascraper';

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  try {
    const { url, options } = req.body;
    const result = await scraperNetflix(url, options);
    res.json(result);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => console.log('API server running on port 3000'));
```
## 📞 Still Have Questions?

- **Documentation**: Check the `/doc` directory for detailed guides
- **Issues**: GitHub Issues
- **Examples**: See `local-demo.js` for usage patterns
- **Testing**: Run `npm test` to see functionality in action
FAQ last updated: 2025-11-23