first commit

2025-11-23 14:25:09 +03:00
commit 46d75b64d5
18 changed files with 4749 additions and 0 deletions

doc/TROUBLESHOOTING.md Normal file

@@ -0,0 +1,561 @@
# MetaScraper Troubleshooting Guide
## 🚨 Common Issues & Solutions
### 1. Module Import Errors
#### ❌ Error: `Cannot resolve import 'flixscaper'`
**Problem**: Cannot import the library in your project
```javascript
import { scraperNetflix } from 'flixscaper';
// Throws: Cannot resolve import 'flixscaper'
```
**Causes & Solutions**:
1. **Not installed properly**
```bash
npm install flixscaper
# or
yarn add flixscaper
```
2. **Developing locally without the correct path**
```javascript
// Instead of this:
import { scraperNetflix } from 'flixscaper';
// Use this for local development:
import { scraperNetflix } from './src/index.js';
```
3. **TypeScript configuration issue**
```json
// tsconfig.json
{
  "compilerOptions": {
    "moduleResolution": "node",
    "allowSyntheticDefaultImports": true
  }
}
```
#### ❌ Error: `Failed to load url ../globals-polyfill.mjs`
**Problem**: Polyfill file missing after Node.js upgrade
**Solution**: The library has been updated to use a minimal polyfill. Ensure you're using the latest version:
```bash
npm update flixscaper
```
If still occurring, check your Node.js version:
```bash
node --version # Should be 18+
```
### 2. Network & Connection Issues
#### ❌ Error: `Request timed out while reaching Netflix`
**Problem**: Network requests are timing out
**Solutions**:
1. **Increase timeout**
```javascript
await scraperNetflix(url, {
  timeoutMs: 30000 // 30 seconds instead of 15
});
```
2. **Check internet connection**
```bash
# Test connectivity to Netflix
curl -I https://www.netflix.com
```
3. **Use a different User-Agent**
```javascript
await scraperNetflix(url, {
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
});
```
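Transient timeouts can sometimes be handled by retrying with a progressively longer timeout. The following is a minimal sketch, not part of the library: `retryWithTimeouts` is a hypothetical helper that takes the scrape function as a parameter and assumes it accepts the `timeoutMs` option shown above.

```javascript
// Hypothetical helper (not part of the library): retry a scrape call,
// giving each attempt a longer timeout than the previous one.
async function retryWithTimeouts(scrape, url, timeouts = [15000, 30000, 60000]) {
  let lastError;
  for (const timeoutMs of timeouts) {
    try {
      return await scrape(url, { timeoutMs });
    } catch (error) {
      lastError = error; // Remember the failure and try the next timeout
    }
  }
  throw lastError; // Every attempt failed
}
```

It would be called as `await retryWithTimeouts(scraperNetflix, url)`. This sketch retries on every error; in practice you would probably retry only on timeout errors, since retrying a 404 will not help.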
#### ❌ Error: `Netflix title not found (404)`
**Problem**: Title ID doesn't exist or is not available
**Solutions**:
1. **Verify URL is correct**
```javascript
// Test with known working URL
await scraperNetflix('https://www.netflix.com/title/80189685');
```
2. **Check title availability in your region**
```javascript
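// Some titles are region-locked, so a 404 can reflect your region rather than a bad ID
await scraperNetflix(url).catch((error) => {
  console.log('Title may not be available in your region:', error.message);
});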
```
3. **Use browser to verify**
- Open the URL in your browser
- If it shows 404 in browser, it's not a library issue
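Before blaming the network or the library, it can help to confirm that the URL has the expected shape. The sketch below is a hypothetical helper (`extractTitleId` is not a library function) and assumes title URLs look like the examples in this guide: `/title/<id>` with an optional locale prefix such as `/tr/`.

```javascript
// Hypothetical helper: pull the numeric title ID out of a Netflix URL,
// or return null when the URL does not look like a title page.
function extractTitleId(input) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return null; // Not a valid absolute URL
  }
  if (!/(^|\.)netflix\.com$/.test(parsed.hostname)) return null;
  // Accept an optional locale prefix such as /tr/ or /en-gb/ before /title/<id>
  const match = parsed.pathname.match(/^\/(?:[a-z]{2}(?:-[a-z]{2})?\/)?title\/(\d+)\/?$/i);
  return match ? match[1] : null;
}
```

A `null` result suggests the URL itself is the problem; a valid ID that still returns 404 points to availability or region.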
### 3. Parsing & Data Issues
#### ❌ Error: `Netflix sayfa meta verisi parse edilemedi`
**Problem**: Cannot extract metadata from the Netflix page (the Turkish error message means "Netflix page metadata could not be parsed")
**Causes & Solutions**:
1. **Netflix changed their HTML structure**
```javascript
// Enable headless mode to get JavaScript-rendered content
await scraperNetflix(url, { headless: true });
```
2. **Title has unusual formatting**
```javascript
// Debug by examining the raw HTML
// (assumes fetchStaticHtml is in scope; it is not imported in this snippet)
const html = await fetchStaticHtml(url);
console.log(html.slice(0, 1000)); // First 1000 chars
```
3. **Missing JSON-LD data**
- Netflix may have removed structured data
- Use headless mode as fallback
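To check whether a fetched page still carries structured data, you can scan the HTML for JSON-LD script blocks. This is a diagnostic sketch only: `extractJsonLd` is a hypothetical helper, and whether the library reads JSON-LD in this way is an assumption.

```javascript
// Hypothetical helper: collect every JSON-LD block found in an HTML string.
// A regex is enough for a quick diagnostic; it is not a full HTML parser.
function extractJsonLd(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  for (const match of html.matchAll(pattern)) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip blocks that are not valid JSON
    }
  }
  return blocks;
}
```

An empty array for a page that loads fine in the browser suggests the structured data is missing from the static HTML, which is the case where headless mode is worth trying.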
#### ❌ Problem: Turkish UI text not being removed
**Problem**: Titles still contain Turkish UI text like "izlemenizi bekliyor"
**Solutions**:
1. **Check if pattern is covered**
```javascript
import { cleanTitle } from 'flixscaper/parser';
const testTitle = "The Witcher izlemenizi bekliyor";
const cleaned = cleanTitle(testTitle);
console.log('Cleaned:', cleaned);
```
2. **Add new pattern if needed**
```javascript
// If Netflix added new UI text, file an issue with:
// 1. The problematic title
// 2. The expected cleaned title
// 3. The new UI pattern that needs to be added
```
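When preparing such a report, you can test a candidate pattern locally first. The sketch below is a hypothetical stand-in for the cleaning step, not the library's `cleanTitle`; `stripUiText` and `UI_PATTERNS` are illustrative names, and the two patterns are taken from examples in this guide.

```javascript
// Hypothetical stand-in for the library's cleaning step, for local experiments only.
const UI_PATTERNS = ['izlemenizi bekliyor', '| Netflix'];

function stripUiText(title, patterns = UI_PATTERNS) {
  let cleaned = title;
  for (const pattern of patterns) {
    // Escape regex metacharacters so each pattern is matched as plain text
    const escaped = pattern.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    cleaned = cleaned.replace(new RegExp(escaped, 'gi'), '');
  }
  // Collapse leftover whitespace
  return cleaned.replace(/\s+/g, ' ').trim();
}
```

Passing an extended list, for example `stripUiText(title, [...UI_PATTERNS, 'yeni başlık'])`, shows whether the new pattern yields the expected title. Note that the `i` flag is not locale-aware, so Turkish dotted and dotless I may need explicit variants.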
### 4. Playwright/Browser Issues
#### ❌ Error: `Playwright is not installed`
**Problem**: Headless mode not available
**Solutions**:
1. **Install Playwright**
```bash
npm install playwright
npx playwright install chromium
```
2. **Use library without headless mode**
```javascript
await scraperNetflix(url, { headless: false });
```
3. **Check if you really need headless mode**
- Most titles work with static mode
- Only use headless if static parsing fails
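The "static first, headless only on failure" advice can be sketched as a small wrapper. `scrapeWithFallback` is a hypothetical helper, not a library function; it takes the scrape function as a parameter and assumes it accepts the `headless` option used throughout this guide.

```javascript
// Hypothetical wrapper: try static mode first, then fall back to headless mode.
// `scrape` is expected to behave like scraperNetflix(url, options).
async function scrapeWithFallback(scrape, url, options = {}) {
  try {
    return await scrape(url, { ...options, headless: false });
  } catch (staticError) {
    try {
      return await scrape(url, { ...options, headless: true });
    } catch (headlessError) {
      // Surface both failures so neither is lost
      throw new AggregateError(
        [staticError, headlessError],
        'Static and headless scraping both failed'
      );
    }
  }
}
```

If Playwright is not installed, the second attempt fails as well, and the `errors` array on the thrown `AggregateError` shows both causes.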
#### ❌ Error: `Playwright chromium browser is unavailable`
**Problem**: Chromium browser not installed
**Solution**:
```bash
npx playwright install chromium
```
#### ❌ Error: Memory issues with Playwright
**Problem**: Browser automation using too much memory
**Solutions**:
1. **Limit concurrent requests**
```javascript
const urls = ['url1', 'url2', 'url3'];
// Process sequentially instead of parallel
for (const url of urls) {
  const result = await scraperNetflix(url);
  // Process result
}
```
2. **Close browser resources properly**
- The library handles this automatically
- Ensure you're not calling Playwright directly
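If fully sequential processing is too slow, a bounded level of concurrency is a middle ground. The following sketch is a generic helper, not part of the library; `mapWithConcurrency` is a hypothetical name, and a suitable limit depends on your machine's memory.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const index = next++; // Claim the next index (safe: JS is single-threaded)
      results[index] = await worker(items[index], index);
    }
  }
  // Start up to `limit` runners that pull work until the list is exhausted
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

For example, `await mapWithConcurrency(urls, 2, (url) => scraperNetflix(url, { headless: true }))` keeps at most two scrapes running at a time. If one worker rejects, the returned promise rejects while the calls already in flight run to completion.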
### 5. Environment & Compatibility Issues
#### ❌ Error: `File is not defined` (Node.js 18)
**Problem**: Node.js 18 does not expose the global `File` class that undici expects
**Solutions**:
1. **Use latest library version**
```bash
npm update flixscaper
```
2. **Upgrade Node.js**
```bash
# Upgrade to Node.js 20+ to avoid polyfill issues
nvm install 20
nvm use 20
```
3. **Manual polyfill (if needed)**
```javascript
import './src/polyfill.js'; // Include before library import
import { scraperNetflix } from './src/index.js';
```
#### ❌ Problem: Works on one machine but not another
**Diagnosis Steps**:
1. **Check Node.js versions**
```bash
node --version # Should be 18+
npm --version # Should be 8+
```
2. **Check Netflix accessibility**
```bash
curl -I "https://www.netflix.com/title/80189685"
```
3. **Compare User-Agent strings**
```javascript
console.log(navigator.userAgent); // Browser console (also defined in Node.js 21+)
console.log(process.version); // Node.js runtime version; note that process.userAgent does not exist
```
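The version check from step 1 can also run at startup, so that an unsupported machine fails with a clear message. This is a minimal sketch; `assertNodeVersion` is a hypothetical helper, and the minimum of 18 comes from this guide.

```javascript
// Hypothetical startup check: fail early when the Node.js major version is too old.
function assertNodeVersion(minMajor, version = process.versions.node) {
  const major = Number.parseInt(version.split('.')[0], 10);
  if (Number.isNaN(major) || major < minMajor) {
    throw new Error(`Node.js ${minMajor}+ is required, found ${version}`);
  }
  return major;
}
```

Calling `assertNodeVersion(18)` before the first scrape turns an obscure runtime failure into an explicit version error.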
## 🔍 Debugging Techniques
### 1. Enable Verbose Logging
```javascript
// Add debug logging to your code
async function debugScraping(url) {
  console.log('🚀 Starting scrape for:', url);
  try {
    const result = await scraperNetflix(url, {
      headless: false, // Try without browser first
      timeoutMs: 30000
    });
    console.log('✅ Success:', result);
    return result;
  } catch (error) {
    console.error('❌ Error details:', {
      message: error.message,
      stack: error.stack,
      url: url
    });
    throw error;
  }
}
```
### 2. Test with Known Working URLs
```javascript
// Test with URLs that should definitely work
const testUrls = [
  'https://www.netflix.com/title/80189685', // The Witcher
  'https://www.netflix.com/title/82123114' // ONE SHOT
];
for (const url of testUrls) {
  try {
    const result = await scraperNetflix(url);
    console.log(`✅ ${url}: ${result.name}`);
  } catch (error) {
    console.error(`❌ ${url}: ${error.message}`);
  }
}
```
### 3. Isolate the Problem
```javascript
// Test each component separately
import { normalizeNetflixUrl } from 'flixscaper/index';
import { parseNetflixHtml } from 'flixscaper/parser';
// Note: fetchStaticHtml is used below but not imported here;
// import it from wherever your copy of the library exposes it.
async function isolateProblem(url) {
  try {
    // 1. Test URL normalization
    const normalized = normalizeNetflixUrl(url);
    console.log('✅ URL normalized:', normalized);
    // 2. Test HTML fetching
    const html = await fetchStaticHtml(normalized);
    console.log('✅ HTML fetched, length:', html.length);
    // 3. Test parsing
    const parsed = parseNetflixHtml(html);
    console.log('✅ Parsed:', parsed);
  } catch (error) {
    console.error('❌ Step failed:', error.message);
  }
}
```
### 4. Browser Mode Debugging
```javascript
// Test with Playwright rendering enabled.
// Elsewhere in this guide, `headless: false` selects static mode (no browser),
// so browser-side debugging needs `headless: true`.
const result = await scraperNetflix(url, {
  headless: true,
  timeoutMs: 60000 // Longer timeout for slow page loads
});
console.log(result);
```
## 🌍 Regional & Language Issues
### Turkish Netflix Specific Issues
#### ❌ Problem: Turkish URLs not working
**Test different URL formats**:
```javascript
const turkishUrls = [
  'https://www.netflix.com/title/80189685', // Standard
  'https://www.netflix.com/tr/title/80189685', // Turkish locale path
  'https://www.netflix.com/tr/title/80189685?s=i', // With Turkish params
  'https://www.netflix.com/tr/title/80189685?vlang=tr' // Turkish language
];
for (const url of turkishUrls) {
  try {
    const result = await scraperNetflix(url);
    console.log(`✅ ${url}: ${result.name}`);
  } catch (error) {
    console.error(`❌ ${url}: ${error.message}`);
  }
}
```
#### ❌ Problem: New Turkish UI patterns not recognized
**Report the issue with**:
1. **Original title**: What Netflix returned
2. **Expected title**: What it should be after cleaning
3. **URL**: The Netflix URL where this occurs
4. **Region**: Your geographic location
Example issue report:
```markdown
**URL**: https://www.netflix.com/tr/title/12345678
**Original**: "Dizi Adı yeni başlık | Netflix"
**Expected**: "Dizi Adı"
**Pattern to add**: "yeni başlık"
**Region**: Turkey
```
## 📊 Performance Issues
### Slow Response Times
#### Diagnose the bottleneck:
```javascript
import { performance } from 'node:perf_hooks';
import { normalizeNetflixUrl } from 'flixscaper/index';
import { parseNetflixHtml } from 'flixscaper/parser';
// Note: fetchStaticHtml is used below but not imported here;
// import it from wherever your copy of the library exposes it.
async function profileScraping(url) {
  const steps = {};
  // URL Normalization
  steps.normStart = performance.now();
  const normalized = normalizeNetflixUrl(url);
  steps.normEnd = performance.now();
  // HTML Fetch
  steps.fetchStart = performance.now();
  const html = await fetchStaticHtml(normalized);
  steps.fetchEnd = performance.now();
  // Parsing
  steps.parseStart = performance.now();
  const parsed = parseNetflixHtml(html);
  steps.parseEnd = performance.now();
  // All durations are in milliseconds
  console.log('Performance breakdown:', {
    normalization: steps.normEnd - steps.normStart,
    fetch: steps.fetchEnd - steps.fetchStart,
    parsing: steps.parseEnd - steps.parseStart,
    htmlSize: html.length
  });
  return parsed;
}
```
#### Optimization Solutions:
1. **Disable headless mode** (if not needed)
```javascript
await scraperNetflix(url, { headless: false });
```
2. **Reduce timeout to fail fast** (if network is fast; a shorter timeout does not speed up successful requests)
```javascript
await scraperNetflix(url, { timeoutMs: 5000 });
```
3. **Cache results** (for repeated requests)
```javascript
const cache = new Map();
async function scrapeWithCache(url) {
  if (cache.has(url)) {
    return cache.get(url);
  }
  const result = await scraperNetflix(url);
  cache.set(url, result);
  return result;
}
```
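The simple cache above never expires, and two concurrent calls for the same URL both hit the network. A slightly more defensive variant is sketched below; `createCachedScraper` is a hypothetical helper, and the ten-minute default is an arbitrary assumption.

```javascript
// Hypothetical helper: cache results per URL for `ttlMs` and share in-flight requests.
function createCachedScraper(scrape, ttlMs = 10 * 60 * 1000) {
  const cache = new Map(); // url -> { promise, expiresAt }
  return function cachedScrape(url, options) {
    const now = Date.now();
    const hit = cache.get(url);
    if (hit && hit.expiresAt > now) return hit.promise;
    const entry = { expiresAt: now + ttlMs };
    entry.promise = Promise.resolve()
      .then(() => scrape(url, options))
      .catch((error) => {
        // Do not cache failures; only remove the entry if it is still ours
        if (cache.get(url) === entry) cache.delete(url);
        throw error;
      });
    cache.set(url, entry);
    return entry.promise;
  };
}
```

It would be used as `const cachedScrape = createCachedScraper(scraperNetflix);`. The cache key is the URL alone, so calls with different options share one entry; include the options in the key if that matters for your use.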
## 🔧 Common Fixes
### Quick Fix Checklist
1. **Update dependencies**
```bash
npm update flixscaper
npm update
```
2. **Clear npm cache and reinstall dependencies**
```bash
npm cache clean --force
rm -rf node_modules package-lock.json
npm install
```
3. **Check Node.js version**
```bash
node --version # Should be 18+
# If older, upgrade: nvm install 20 && nvm use 20
```
4. **Test with minimal example**
```javascript
import { scraperNetflix } from 'flixscaper';
scraperNetflix('https://www.netflix.com/title/80189685')
.then(result => console.log('Success:', result))
.catch(error => console.error('Error:', error.message));
```
5. **Try different options**
```javascript
// If failing, try with different configurations
const configs = [
  { headless: false },
  { headless: true, timeoutMs: 30000 },
  { headless: false, userAgent: 'different-ua' }
];
for (const config of configs) {
  try {
    const result = await scraperNetflix(url, config);
    console.log('✅ Working config:', config);
    break;
  } catch (error) {
    console.log('❌ Failed config:', config, error.message);
  }
}
```
## 📞 Getting Help
### When to Report an Issue
Report an issue when:
1. **Previously working URL suddenly fails**
2. **Error messages are unclear or unhelpful**
3. **Turkish UI patterns not being removed**
4. **Performance degrades significantly**
5. **Documentation is unclear or incomplete**
### Issue Report Template
````markdown
## Issue Description
Brief description of the problem
## Steps to Reproduce
1. URL used: ...
2. Code executed: ...
3. Expected result: ...
4. Actual result: ...
## Environment
- Node.js version: ...
- OS: ...
- flixscaper version: ...
- Browser (if relevant): ...
## Error Message
```
Paste full error message here
```
## Additional Context
Any additional information that might help
````
### Debug Information to Include
```javascript
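// Include this information in issue reports
import { createRequire } from 'node:module';
// `require` is not defined in ES modules, so create one to read package.json
const require = createRequire(import.meta.url);
const debugInfo = {
  nodeVersion: process.version,
  platform: process.platform,
  arch: process.arch,
  // May throw if the package's "exports" field does not expose package.json
  flixscaperVersion: require('flixscaper/package.json').version,
  timestamp: new Date().toISOString()
};
console.log('Debug Info:', JSON.stringify(debugInfo, null, 2));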
```
---
*Troubleshooting guide last updated: 2025-11-23*