# MetaScraper Frequently Asked Questions (FAQ)
## 🚀 Getting Started
### Q: How do I install MetaScraper?
```bash
npm install metascraper
```
### Q: What are the system requirements?
- **Node.js**: 18+ (20+ recommended)
- **Memory**: Minimum 50MB for static mode, 200MB+ for headless mode
- **Network**: Internet connection to Netflix
```bash
# Check your Node.js version
node --version # Should be 18.x or higher
```
### Q: Does MetaScraper work with TypeScript?
Yes! MetaScraper provides TypeScript support out of the box:
```typescript
import { scraperNetflix } from 'metascraper';

interface NetflixMetadata {
  url: string;
  id: string;
  name: string;
  year: string | number | undefined;
  seasons: string | null;
}

const result: Promise<NetflixMetadata> = scraperNetflix('https://www.netflix.com/title/80189685');
```
## 🔧 Technical Questions
### Q: What's the difference between static and headless mode?
**Static Mode** (default):
- ✅ Faster (200-500ms)
- ✅ Lower memory usage
- ✅ No browser required
- ⚠️ 85% success rate

**Headless Mode** (fallback):
- ✅ Higher success rate (99%)
- ✅ Handles JavaScript-rendered content
- ❌ Slower (2-5 seconds)
- ❌ Requires Playwright
```javascript
// Force static mode only
await scraperNetflix(url, { headless: false });
// Enable headless fallback
await scraperNetflix(url, { headless: true });
```
### Q: Do I need to install Playwright?
**No**, Playwright is optional. MetaScraper works without it using static HTML parsing.
Install Playwright only if:
- You need higher success rates
- Static mode fails for specific titles
- You want JavaScript-rendered content
```bash
# Optional: Install for better success rates
npm install playwright
npx playwright install chromium
```
### Q: Can MetaScraper work in the browser?
**Not currently**. MetaScraper is designed for Node.js environments due to:
- CORS restrictions in browsers
- Netflix's bot protection
- Server-side dependencies (Node.js `fetch`, cheerio)

For browser usage, consider:
- Creating a proxy API server
- Using serverless functions
- Implementing browser-based scraping separately
### Q: How does MetaScraper handle Netflix's bot protection?
MetaScraper uses several techniques:
- **Realistic User-Agent strings** that mimic regular browsers
- **Proper HTTP headers** including Accept-Language
- **Rate limiting considerations** to avoid detection
- **JavaScript rendering** (when needed) to appear more human
```javascript
const result = await scraperNetflix(url, {
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
});
```
## 🌍 Localization & Turkish Support
### Q: What Turkish UI patterns does MetaScraper remove?
MetaScraper removes these Turkish Netflix UI patterns:
| Pattern | English Equivalent | Example |
|---------|-------------------|---------|
| `izlemenizi bekliyor` | "waiting for you to watch" | "The Witcher izlemenizi bekliyor" |
| `izleyin` | "watch" | "Dark izleyin" |
| `devam et` | "continue" | "Money Heist devam et" |
| `başla` | "start" | "Stranger Things başla" |
| `izlemeye devam` | "continue watching" | "The Crown izlemeye devam" |
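The table above can be expressed as a small suffix-stripping helper. This is a minimal sketch rather than MetaScraper's actual parser: the `cleanTitle` function and the exact regular expressions are assumptions for illustration, modeled on the suffixes listed in the table.

```javascript
// Hypothetical patterns: one regex per UI suffix from the table above.
const TURKISH_UI_PATTERNS = [
  /\s+izlemenizi bekliyor$/i,
  /\s+izlemeye devam$/i,
  /\s+izleyin$/i,
  /\s+devam et$/i,
  /\s+başla$/i,
];

// Hypothetical helper: strip any trailing UI text from a raw title.
function cleanTitle(rawTitle) {
  let title = rawTitle.trim();
  for (const pattern of TURKISH_UI_PATTERNS) {
    title = title.replace(pattern, '');
  }
  return title.trim();
}

console.log(cleanTitle('The Witcher izlemenizi bekliyor')); // "The Witcher"
console.log(cleanTitle('Money Heist devam et'));            // "Money Heist"
```

Anchoring each pattern with `$` keeps the helper from touching titles that merely contain one of these words in the middle.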
### Q: Does MetaScraper support other languages?
MetaScraper is currently optimized for the Turkish Netflix interface and also removes common English UI text:
- **Turkish**: Full support with specific patterns
- **English**: Basic UI text removal
- 🔄 **Other languages**: Can be extended (file an issue)
### Q: What about regional Netflix content?
MetaScraper works globally but:
- **Content availability** varies by region
- **Some titles** may be region-locked
- **URL formats** work universally
```javascript
// Test different regional URLs
const regionalUrls = [
  'https://www.netflix.com/title/80189685',    // Global
  'https://www.netflix.com/tr/title/80189685', // Turkey
  'https://www.netflix.com/us/title/80189685'  // US
];
```
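Since all three URLs point at the same title, it can help to reduce them to a shared title ID before caching or comparing results. The sketch below is an illustration only: `extractTitleId` and `toCanonicalUrl` are hypothetical helpers, not part of MetaScraper's API, and the region-prefix pattern is an assumption based on the URL shapes shown above.

```javascript
// Hypothetical helper: pull the numeric title ID out of a Netflix title URL.
function extractTitleId(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return null; // not a parseable URL
  }
  const isNetflix =
    parsed.hostname === 'netflix.com' || parsed.hostname.endsWith('.netflix.com');
  if (!isNetflix) return null;
  // Accepts /title/<id> with or without a leading region segment such as /tr
  const match = parsed.pathname.match(/^(?:\/[a-z]{2}(?:-[a-z]{2})?)?\/title\/(\d+)\/?$/i);
  return match ? match[1] : null;
}

// Hypothetical helper: rebuild a region-free URL from any regional variant.
function toCanonicalUrl(input) {
  const id = extractTitleId(input);
  return id ? `https://www.netflix.com/title/${id}` : null;
}

console.log(extractTitleId('https://www.netflix.com/tr/title/80189685')); // "80189685"
```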
## ⚡ Performance & Usage
### Q: How fast is MetaScraper?
**Response Times**:
- **Static mode**: 200-500ms
- **Headless fallback**: 2-5 seconds
- **Batch processing**: 10-50 URLs per second (static mode)

**Resource Usage**:
- **Memory**: <50MB (static), 100-200MB (headless)
- **CPU**: Low impact for normal usage
- **Network**: 1 HTTP request per title
```javascript
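// Performance monitoring
import { performance } from 'node:perf_hooks';
import { scraperNetflix } from 'metascraper';

const url = 'https://www.netflix.com/title/80189685';
const start = performance.now();
await scraperNetflix(url);
const duration = performance.now() - start;
console.log(`Scraping took ${duration.toFixed(0)}ms`);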
```
### Q: Can I use MetaScraper for bulk scraping?
**Yes**, but consider:
```javascript
// Good: Sequential processing with delays
async function bulkScrape(urls) {
  const results = [];
  for (const url of urls) {
    const result = await scraperNetflix(url);
    results.push(result);
    // Be respectful: add delay between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  return results;
}

// Better: Concurrent processing with limits
async function concurrentScrape(urls, concurrency = 5) {
  const chunks = [];
  for (let i = 0; i < urls.length; i += concurrency) {
    chunks.push(urls.slice(i, i + concurrency));
  }
  const results = [];
  for (const chunk of chunks) {
    const chunkResults = await Promise.allSettled(
      chunk.map(url => scraperNetflix(url, { headless: false }))
    );
    results.push(...chunkResults);
    // Delay between chunks
    await new Promise(resolve => setTimeout(resolve, 2000));
  }
  return results;
}
```
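Note that `concurrentScrape` returns `Promise.allSettled` outcomes (objects with `status`, `value`, or `reason`) rather than metadata objects, in the same order as the input URLs. A minimal sketch of separating successes from failures follows; `partitionSettled` is a hypothetical helper name, not something MetaScraper provides.

```javascript
// Hypothetical helper: pair each settled outcome with its URL and split by status.
// Assumes `settled[i]` corresponds to `urls[i]`, which holds for Promise.allSettled.
function partitionSettled(urls, settled) {
  const succeeded = [];
  const failed = [];
  settled.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      succeeded.push({ url: urls[index], data: outcome.value });
    } else {
      const reason = outcome.reason?.message ?? String(outcome.reason);
      failed.push({ url: urls[index], reason });
    }
  });
  return { succeeded, failed };
}

// Usage (inside an async function):
// const settled = await concurrentScrape(urls);
// const { succeeded, failed } = partitionSettled(urls, settled);
```

The `failed` list can then be retried later, for example with headless mode enabled.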
### Q: Does MetaScraper cache results?
**No built-in caching**, but easy to implement:
```javascript
// Simple cache implementation
const cache = new Map();
const CACHE_TTL = 30 * 60 * 1000; // 30 minutes

async function scrapeWithCache(url, options = {}) {
  const cacheKey = `${url}:${JSON.stringify(options)}`;
  if (cache.has(cacheKey)) {
    const { data, timestamp } = cache.get(cacheKey);
    if (Date.now() - timestamp < CACHE_TTL) {
      return data;
    }
  }
  const result = await scraperNetflix(url, options);
  cache.set(cacheKey, { data: result, timestamp: Date.now() });
  return result;
}
```
## 🛠️ Troubleshooting
### Q: Why am I getting "File is not defined" errors?
This happens on Node.js 18, which does not expose `File` as a global (it became one in Node.js 20), when no polyfill is in place:
```bash
# Solution 1: Update to Node.js 20+
nvm install 20
nvm use 20
# Solution 2: Use latest MetaScraper version
npm update metascraper
```
### Q: Why does scraping fail for some titles?
Common reasons:
1. **Region restrictions**: Title not available in your location
2. **Invalid URL**: Netflix URL format changed or incorrect
3. **Netflix changes**: HTML structure updated
4. **Network issues**: Connection problems or timeouts

**Debug steps**:
```javascript
// Note: normalizeNetflixUrl is assumed to be available in scope here;
// check the package exports for where it lives.
async function debugScraping(url) {
  try {
    console.log('Testing URL:', url);
    // Test URL normalization
    const normalized = normalizeNetflixUrl(url);
    console.log('Normalized:', normalized);
    // Test with different configurations
    const configs = [
      { headless: false, timeoutMs: 30000 },
      { headless: true, timeoutMs: 30000 },
      { headless: false, userAgent: 'different-ua' }
    ];
    for (const config of configs) {
      try {
        const result = await scraperNetflix(url, config);
        console.log('✅ Success with config:', config, result.name);
        return result;
      } catch (error) {
        console.log('❌ Failed with config:', config, error.message);
      }
    }
  } catch (error) {
    console.error('Debug error:', error);
  }
}
```
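For transient failures such as the network issues and timeouts listed above, retrying with an increasing delay is often enough. The following is a minimal sketch under that assumption: `withRetry` is a hypothetical helper, not a MetaScraper feature, and it will not help with region restrictions or invalid URLs, which fail the same way on every attempt.

```javascript
// Hypothetical helper: run an async task, retrying with exponential backoff.
// Delays are baseDelayMs, 2x, 4x, ... between attempts.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}

// Usage (inside an async function):
// const data = await withRetry(() => scraperNetflix(url, { timeoutMs: 30000 }));
```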
### Q: How do I handle rate limiting?
MetaScraper doesn't include built-in rate limiting, but you can implement it:
```javascript
class RateLimiter {
  constructor(requestsPerSecond = 1) {
    this.delay = 1000 / requestsPerSecond; // minimum gap between requests, in ms
    this.nextSlot = 0; // earliest time the next request may start
  }

  async wait() {
    const now = Date.now();
    // Reserve a slot before sleeping so concurrent callers queue up
    // instead of all waking at the same moment.
    const slot = Math.max(now, this.nextSlot);
    this.nextSlot = slot + this.delay;
    if (slot > now) {
      await new Promise(resolve => setTimeout(resolve, slot - now));
    }
  }
}

const rateLimiter = new RateLimiter(0.5); // 0.5 requests per second (one every 2 seconds)

async function rateLimitedScrape(url) {
  await rateLimiter.wait();
  return await scraperNetflix(url);
}
```
## 🔒 Legal & Ethical Questions
### Q: Is scraping Netflix legal?
**Important**: Web scraping exists in a legal gray area. Consider:

**✅ Generally Acceptable**:
- Personal use and research
- Educational purposes
- Non-commercial applications
- Respectful scraping (low frequency)

**⚠️ Potentially Problematic**:
- Commercial use without permission
- High-frequency scraping
- Competing with Netflix's services
- Violating Netflix's Terms of Service

**📋 Best Practices**:
- Be respectful with request frequency
- Don't scrape at commercial scale
- Use results for personal/educational purposes
- Consider Netflix's ToS
### Q: Does MetaScraper respect robots.txt?
MetaScraper doesn't automatically check robots.txt, but you can:
```javascript
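import robotsParser from 'robots-parser'; // default export, not a named one
import { scraperNetflix } from 'metascraper';

async function scrapeWithRobotsCheck(url) {
  const robotsUrl = new URL('/robots.txt', url).href;
  // Fetch the site's real robots.txt; treat a missing file as "no rules"
  const response = await fetch(robotsUrl);
  const robotsTxt = response.ok ? await response.text() : '';
  const robots = robotsParser(robotsUrl, robotsTxt);
  if (robots.isAllowed(url, 'MetaScraper')) {
    return await scraperNetflix(url);
  }
  throw new Error('Scraping disallowed by robots.txt');
}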
```
## 📦 Development & Contributing
### Q: How can I contribute to MetaScraper?
1. **Report Issues**: Bugs or newly discovered Turkish UI patterns
2. **Suggest Features**: Ideas for improvement
3. **Submit Pull Requests**: Code contributions
4. **Improve Documentation**: Better examples and guides
```bash
# Development setup
git clone https://github.com/username/flixscaper.git
cd flixscaper
npm install
npm test
npm run demo
```
### Q: How do I add new Turkish UI patterns?
If you discover new Turkish Netflix UI text patterns:
1. **Create an issue** with examples:
```markdown
**New Pattern**: "yeni bölüm"
**Example**: "Dizi Adı yeni bölüm | Netflix"
**Expected**: "Dizi Adı"
```
2. **Or submit a PR** adding the pattern:
```javascript
// src/parser.js
const TURKISH_UI_PATTERNS = [
  // ... existing patterns
  /\s+yeni bölüm$/i, // Add new pattern
];
```
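Before opening a PR, it is worth confirming that the new pattern turns the example input into the expected output. The check below is a standalone sketch: `stripNewPattern` is a hypothetical helper, and removing the trailing `| Netflix` suffix first is an assumption about the page-title format shown in the issue template, not a description of how `src/parser.js` actually works.

```javascript
// The candidate pattern from the PR example above.
const NEW_PATTERN = /\s+yeni bölüm$/i;

// Hypothetical helper: apply the site-suffix cleanup, then the new pattern.
function stripNewPattern(pageTitle) {
  // Drop the trailing " | Netflix" first so the pattern's $ anchor can match.
  const withoutSite = pageTitle.replace(/\s*\|\s*Netflix\s*$/i, '');
  return withoutSite.replace(NEW_PATTERN, '').trim();
}

console.log(stripNewPattern('Dizi Adı yeni bölüm | Netflix')); // "Dizi Adı"
```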
### Q: How can I test MetaScraper locally?
```bash
# Clone repository
git clone https://github.com/username/flixscaper.git
cd flixscaper
# Install dependencies
npm install
# Run tests
npm test
# Test with demo
npm run demo
# Test your own URLs
node -e "
import('./src/index.js').then(async (m) => {
  const result = await m.scraperNetflix('https://www.netflix.com/title/80189685');
  console.log(result);
})
"
```
## 🔮 Future Questions
### Q: Will MetaScraper support other streaming platforms?
Currently focused on Netflix, but the architecture could be adapted. If you're interested in other platforms, create an issue to discuss:
- YouTube metadata extraction
- Amazon Prime scraping
- Disney+ integration
- Multi-platform support
### Q: Is there a REST API version available?
Not currently, but you could easily create one:
```javascript
// Example Express.js server
import express from 'express';
import { scraperNetflix } from 'metascraper';

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  const { url, options } = req.body ?? {};
  if (typeof url !== 'string') {
    // Reject malformed requests before touching the scraper
    return res.status(400).json({ error: 'Request body must include a "url" string' });
  }
  try {
    const result = await scraperNetflix(url, options);
    res.json(result);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => console.log('API server running on port 3000'));
```
---
## 📞 Still Have Questions?
- **Documentation**: Check the `/doc` directory for detailed guides
- **Issues**: [GitHub Issues](https://github.com/username/flixscaper/issues)
- **Examples**: See `local-demo.js` for usage patterns
- **Testing**: Run `npm test` to see functionality in action
---
*FAQ last updated: 2025-11-23*