wisecolt/metascraper

sbilketay 46d75b64d5 first commit

2025-11-23 14:25:09 +03:00

MetaScraper Testing Guide

🧪 Testing Philosophy

MetaScraper's testing strategy targets reliability, performance, and maintainability through five priorities:

Integration First: Focus on end-to-end functionality
Live Data Testing: Test against real Netflix pages
Performance Awareness: Monitor response times and resource usage
Error Coverage: Test failure scenarios and edge cases
Localization Testing: Verify Turkish UI text removal

📋 Test Structure

Test Categories

tests/
├── scrape.test.js           # Main integration tests
├── unit/                    # Unit tests (future)
│   ├── parser.test.js      # Parser function tests
│   ├── url-normalizer.test.js # URL normalization tests
│   └── title-cleaner.test.js   # Title cleaning tests
├── integration/             # Integration tests (current)
│   ├── live-scraping.test.js # Real Netflix URL tests
│   └── headless-fallback.test.js # Browser fallback tests
├── performance/             # Performance benchmarks (future)
│   ├── response-times.test.js # Timing tests
│   └── concurrent.test.js   # Multiple request tests
├── fixtures/                # Test data
│   ├── sample-title.html   # Sample Netflix HTML
│   ├── turkish-ui.json     # Turkish UI patterns
│   └── test-urls.json      # Test URL collection
└── helpers/                 # Test utilities (future)
    ├── mock-data.js        # Mock HTML generators
    └── test-utils.js       # Common test helpers

🏗️ Current Test Implementation

Main Test Suite: `tests/scrape.test.js`

import { beforeAll, describe, expect, it } from 'vitest';
import { scraperNetflix } from '../src/index.js';
import { parseNetflixHtml } from '../src/parser.js';

const TEST_URL = 'https://www.netflix.com/title/80189685'; // The Witcher
const UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36';

let liveHtml = '';

beforeAll(async () => {
  // Fetch real Netflix page for testing
  const res = await fetch(TEST_URL, {
    headers: {
      'User-Agent': UA,
      Accept: 'text/html,application/xhtml+xml'
    }
  });

  if (!res.ok) {
    throw new Error(`Live fetch failed: ${res.status}`);
  }

  liveHtml = await res.text();
}, 20000); // 20 second timeout for network requests

Test Coverage Areas

1. HTML Parsing Tests

describe('parseNetflixHtml (live page)', () => {
  it(
    'reads at least the name and year from static HTML',
    () => {
      const meta = parseNetflixHtml(liveHtml);
      expect(meta.name).toBeTruthy();
      expect(String(meta.name).toLowerCase()).toContain('witcher');
      expect(meta.year).toMatch(/\d{4}/);
    },
    20000
  );
});

2. End-to-End Scraping Tests

describe('scraperNetflix (live request)', () => {
  it(
    'returns the normalized URL, id, and metadata',
    async () => {
      const meta = await scraperNetflix(TEST_URL, { headless: false, userAgent: UA });
      expect(meta.url).toBe('https://www.netflix.com/title/80189685');
      expect(meta.id).toBe('80189685');
      expect(meta.name).toBeTruthy();
      expect(String(meta.name).toLowerCase()).toContain('witcher');
      expect(meta.year).toMatch(/\d{4}/);
    },
    20000
  );
});

🧪 Running Tests

Basic Test Commands

# Run all tests
npm test

# Run tests in watch mode
npm test -- --watch

# Run tests once
npm test -- --run

# Run tests with coverage
npm test -- --coverage

# Run specific test file
npm test scrape.test.js

# Run tests whose names match a pattern (Vitest uses -t / --testNamePattern)
npm test -- -t "Turkish"

Test Configuration

// vitest.config.js (if needed)
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    testTimeout: 30000,    // 30 second timeout for network tests
    hookTimeout: 30000,    // Timeout for beforeAll hooks
    environment: 'node',   // Node.js environment
    globals: true,         // Use global test functions
    coverage: {
      reporter: ['text', 'json', 'lcov'], // lcov writes coverage/lcov.info for CI upload
      exclude: [
        'node_modules/',
        'tests/',
        'doc/'
      ]
    }
  }
});

📊 Test Data Management

Live Test URLs

// tests/fixtures/test-urls.json
[
  {
    "name": "The Witcher (TV Series)",
    "url": "https://www.netflix.com/title/80189685",
    "expected": {
      "type": "series",
      "hasSeasons": true,
      "titleContains": "witcher"
    }
  },
  {
    "name": "ONE SHOT (Movie)",
    "url": "https://www.netflix.com/title/82123114",
    "expected": {
      "type": "movie",
      "hasSeasons": false,
      "titleContains": "one shot"
    }
  }
]
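
The `expected` block of each entry can drive assertions directly instead of hard-coding them per test. The following is a minimal sketch, assuming a scraped result exposes `name` and, for series only, a `seasons` field; `checkExpectation` is a hypothetical helper and not part of MetaScraper. The `type` field is left unchecked because this guide does not show how a result reports its type.

```javascript
// Hypothetical helper: compares a scraped result with the `expected` block
// of a test-urls.json entry and returns a list of human-readable mismatches.
function checkExpectation(meta, expected) {
  const problems = [];
  const name = String(meta.name ?? '').toLowerCase();

  if (expected.titleContains && !name.includes(expected.titleContains.toLowerCase())) {
    problems.push(`name "${meta.name}" does not contain "${expected.titleContains}"`);
  }

  // Assumption: a result carries `seasons` only when the title is a series.
  const hasSeasons = meta.seasons !== undefined && meta.seasons !== null;
  if (typeof expected.hasSeasons === 'boolean' && hasSeasons !== expected.hasSeasons) {
    problems.push(`expected hasSeasons=${expected.hasSeasons}, got ${hasSeasons}`);
  }

  return problems;
}

const entry = {
  name: 'The Witcher (TV Series)',
  expected: { type: 'series', hasSeasons: true, titleContains: 'witcher' }
};

// An empty array means the result matches the expectation.
console.log(checkExpectation({ name: 'The Witcher', seasons: '4 Sezon' }, entry.expected));
```

Returning a list of mismatches rather than a boolean keeps the failure message useful when a live page changes.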

Sample HTML Fixtures

<!-- tests/fixtures/sample-title.html -->
<!DOCTYPE html>
<html>
<head>
  <meta property="og:title" content="The Witcher izlemenizi bekliyor | Netflix">
  <meta name="title" content="The Witcher | Netflix">
  <title>The Witcher izlemenizi bekliyor | Netflix</title>
  <script type="application/ld+json">
  {
    "@type": "TVSeries",
    "name": "The Witcher izlemenizi bekliyor",
    "numberOfSeasons": 4,
    "datePublished": "2025"
  }
  </script>
</head>
<body>
  <!-- Netflix page content -->
</body>
</html>
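
A fixture like this allows parser checks to run offline. As a rough illustration of what such a check exercises, the sketch below pulls the JSON-LD payload out of an HTML string. `extractJsonLd` is a hypothetical stand-in written for this guide; it is not the project's `parseNetflixHtml`, whose internals are not shown here.

```javascript
// Sketch only: collects every <script type="application/ld+json"> payload
// from an HTML string. A regex is enough for a small fixture; it is not a
// general-purpose HTML parser.
function extractJsonLd(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  for (const match of html.matchAll(pattern)) {
    try {
      blocks.push(JSON.parse(match[1]));
    } catch {
      // Skip malformed JSON instead of failing the whole extraction.
    }
  }
  return blocks;
}

const sampleHtml = `
<html><head>
  <script type="application/ld+json">
  { "@type": "TVSeries", "name": "The Witcher izlemenizi bekliyor", "numberOfSeasons": 4, "datePublished": "2025" }
  </script>
</head><body></body></html>`;

const [jsonLd] = extractJsonLd(sampleHtml);
console.log(jsonLd['@type'], jsonLd.numberOfSeasons);
```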

Turkish UI Pattern Tests

// tests/fixtures/turkish-ui.json
{
  "title_cleaning_tests": [
    {
      "input": "The Witcher izlemenizi bekliyor | Netflix",
      "expected": "The Witcher",
      "removed": "izlemenizi bekliyor | Netflix"
    },
    {
      "input": "Stranger Things izleyin",
      "expected": "Stranger Things",
      "removed": "izleyin"
    },
    {
      "input": "Sezon 4 devam et",
      "expected": "Sezon 4",
      "removed": "devam et"
    }
  ]
}
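
To show how these cases could be consumed, here is a small sketch that runs them through a stand-in cleaner. `cleanTitleSketch` and `UI_SUFFIXES` are hypothetical names introduced for illustration; the project's real title cleaner may use different patterns and rules.

```javascript
// Stand-in for the project's title cleaner (the real implementation may differ).
// Strips the " | Netflix" suffix and the trailing Turkish UI phrases listed above.
const UI_SUFFIXES = ['izlemenizi bekliyor', 'izleyin', 'devam et'];

function cleanTitleSketch(raw) {
  let title = raw.replace(/\s*\|\s*Netflix\s*$/i, '').trim();
  for (const suffix of UI_SUFFIXES) {
    // Require a leading space so a suffix never cuts into the middle of a word.
    if (title.endsWith(` ${suffix}`)) {
      title = title.slice(0, -suffix.length).trimEnd();
    }
  }
  return title;
}

// Same shape as the "title_cleaning_tests" entries in the fixture.
const cases = [
  { input: 'The Witcher izlemenizi bekliyor | Netflix', expected: 'The Witcher' },
  { input: 'Stranger Things izleyin', expected: 'Stranger Things' },
  { input: 'Sezon 4 devam et', expected: 'Sezon 4' }
];

const failures = cases.filter(({ input, expected }) => cleanTitleSketch(input) !== expected);
console.log(`${cases.length - failures.length}/${cases.length} cases pass`);
```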

🔧 Test Utilities

Custom Test Helpers

// tests/helpers/test-utils.js
import fs from 'node:fs';
import path from 'node:path';
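import { fileURLToPath } from 'node:url';
import { expect } from 'vitest';
// Assumption: cleanTitle is exported from the parser module; adjust the path if it lives elsewhere.
import { cleanTitle } from '../../src/parser.js';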

const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

export function loadFixture(filename) {
  const fixturePath = path.join(__dirname, '../fixtures', filename);
  return fs.readFileSync(fixturePath, 'utf8');
}

export function loadJSONFixture(filename) {
  const content = loadFixture(filename);
  return JSON.parse(content);
}

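export async function withTimeout(promise, timeoutMs = 5000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Test timeout after ${timeoutMs}ms`)), timeoutMs);
  });

  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer); // do not leave a pending timer once the race has settled
  }
}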

export function expectTurkishTitleClean(input, expected) {
  const result = cleanTitle(input);
  expect(result).toBe(expected);
}

Mock Browser Automation

// tests/helpers/mock-playwright.js
import { vi } from 'vitest';

export function mockPlaywrightSuccess(html) {
  vi.doMock('playwright', () => ({
    chromium: {
      launch: vi.fn(() => ({
        newContext: vi.fn(() => ({
          newPage: vi.fn(() => ({
            goto: vi.fn().mockResolvedValue(undefined),
            content: vi.fn().mockResolvedValue(html),
            waitForLoadState: vi.fn().mockResolvedValue(undefined)
          }))
        })),
        close: vi.fn().mockResolvedValue(undefined)
      }))
    }
  }));
}

export function mockPlaywrightFailure() {
  vi.doMock('playwright', () => {
    throw new Error('Playwright not available');
  });
}

🎯 Test Scenarios

1. URL Normalization Tests

describe('URL Normalization', () => {
  const testCases = [
    {
      input: 'https://www.netflix.com/tr/title/80189685?s=i&vlang=tr',
      expected: 'https://www.netflix.com/title/80189685',
      description: 'Turkish URL with parameters'
    },
    {
      input: 'https://www.netflix.com/title/80189685?trackId=12345',
      expected: 'https://www.netflix.com/title/80189685',
      description: 'URL with tracking parameters'
    }
  ];

  testCases.forEach(({ input, expected, description }) => {
    it(description, () => {
      const result = normalizeNetflixUrl(input);
      expect(result).toBe(expected);
    });
  });
});
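
The behaviour these cases describe can be approximated with the standard `URL` API. The sketch below is an assumption about how such normalization might work, not the project's `normalizeNetflixUrl`; the function name and the English error messages are hypothetical.

```javascript
// Stand-in written only to illustrate the behaviour the test cases describe;
// the project's implementation and error messages may differ.
function normalizeNetflixUrlSketch(input) {
  const url = new URL(input); // throws TypeError on malformed input

  if (!/(^|\.)netflix\.com$/i.test(url.hostname)) {
    throw new Error('Not a netflix.com URL');
  }

  const match = url.pathname.match(/\/title\/(\d+)/);
  if (!match) {
    throw new Error('No title ID in URL');
  }

  // Dropping the locale prefix (/tr/), query string, and hash yields the canonical form.
  return `https://www.netflix.com/title/${match[1]}`;
}

console.log(normalizeNetflixUrlSketch('https://www.netflix.com/tr/title/80189685?s=i&vlang=tr'));
```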

2. Turkish UI Text Removal Tests

describe('Turkish UI Text Cleaning', () => {
  const turkishCases = [
    {
      input: 'The Witcher izlemenizi bekliyor',
      expected: 'The Witcher',
      pattern: 'waiting for you to watch'
    },
    {
      input: 'Dark izleyin',
      expected: 'Dark',
      pattern: 'watch'
    },
    {
      input: 'Money Heist devam et',
      expected: 'Money Heist',
      pattern: 'continue'
    }
  ];

  turkishCases.forEach(({ input, expected, pattern }) => {
    it(`removes Turkish UI text: ${pattern}`, () => {
      expect(cleanTitle(input)).toBe(expected);
    });
  });
});

3. JSON-LD Parsing Tests

describe('JSON-LD Metadata Extraction', () => {
  it('extracts movie metadata correctly', () => {
    const jsonLd = {
      '@type': 'Movie',
      'name': 'Inception',
      'datePublished': '2010',
      'copyrightYear': 2010
    };

    const result = parseJsonLdObject(jsonLd);
    expect(result.name).toBe('Inception');
    // The live tests match year against a regex (string-like), so normalise before comparing.
    expect(Number(result.year)).toBe(2010);
    expect(result.seasons).toBeUndefined();
  });

  it('extracts TV series metadata with seasons', () => {
    const jsonLd = {
      '@type': 'TVSeries',
      'name': 'Stranger Things',
      'numberOfSeasons': 4,
      'datePublished': '2016'
    };

    const result = parseJsonLdObject(jsonLd);
    expect(result.name).toBe('Stranger Things');
    expect(result.seasons).toBe('4 Sezon'); // Turkish for "4 Seasons"
  });
});
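
For readers without the parser source at hand, the mapping these tests expect can be sketched as follows. `parseJsonLdSketch` is a hypothetical stand-in; the precedence of `copyrightYear` over `datePublished` and the numeric `year` are assumptions chosen to satisfy the assertions above, not confirmed behaviour of `parseJsonLdObject`.

```javascript
// Hypothetical stand-in for parseJsonLdObject, shaped by the assertions above.
function parseJsonLdSketch(jsonLd) {
  const result = { name: jsonLd.name };

  // Assumption: prefer copyrightYear, fall back to the first 4 digits of datePublished.
  const yearSource = jsonLd.copyrightYear ?? jsonLd.datePublished;
  const yearMatch = String(yearSource ?? '').match(/\d{4}/);
  if (yearMatch) {
    result.year = Number(yearMatch[0]);
  }

  // Only series get a seasons label, formatted the way the test expects ("4 Sezon").
  if (jsonLd['@type'] === 'TVSeries' && Number.isInteger(jsonLd.numberOfSeasons)) {
    result.seasons = `${jsonLd.numberOfSeasons} Sezon`;
  }

  return result;
}

console.log(parseJsonLdSketch({
  '@type': 'TVSeries',
  name: 'Stranger Things',
  numberOfSeasons: 4,
  datePublished: '2016'
}));
```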

4. Error Handling Tests

describe('Error Handling', () => {
  it('throws error for invalid URL', async () => {
    // Message: "Invalid URL provided"
    await expect(scraperNetflix('invalid-url')).rejects.toThrow('Geçersiz URL sağlandı');
  });

  it('throws error for non-Netflix URL', async () => {
    // Message: "URL must point to netflix.com"
    await expect(scraperNetflix('https://google.com')).rejects.toThrow('URL netflix.com adresini göstermelidir');
  });

  it('throws error for URL without title ID', async () => {
    // Message: "Netflix title ID not found in URL"
    await expect(scraperNetflix('https://www.netflix.com/browse')).rejects.toThrow('URL\'de Netflix başlık ID\'si bulunamadı');
  });

  it('handles network timeouts gracefully', async () => {
    await expect(scraperNetflix(TEST_URL, { timeoutMs: 1 })).rejects.toThrow('Request timed out');
  });
});

5. Performance Tests

describe('Performance', () => {
  it('completes static scraping within 1 second', async () => {
    const start = performance.now();
    await scraperNetflix(TEST_URL, { headless: false });
    const duration = performance.now() - start;

    expect(duration).toBeLessThan(1000);
  }, 10000);

  it('handles concurrent requests efficiently', async () => {
    const urls = Array(5).fill(TEST_URL);
    const start = performance.now();

    const results = await Promise.allSettled(
      urls.map(url => scraperNetflix(url, { headless: false }))
    );

    const duration = performance.now() - start;
    const successful = results.filter(r => r.status === 'fulfilled').length;

    expect(duration).toBeLessThan(3000); // Should be faster than sequential
    expect(successful).toBeGreaterThan(0); // At least some should succeed
  }, 30000);
});
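
A single wall-clock measurement against a live site is noisy, so a fixed 1-second threshold can fail for reasons unrelated to the scraper. One option, sketched here under the assumption that a few repeated requests are acceptable, is to assert on the median of several runs. `measureMedian` is a hypothetical helper, not part of the test suite.

```javascript
// Hypothetical helper: runs an async task several times and returns the median
// duration in milliseconds, which is less sensitive to one slow response than
// a single measurement.
async function measureMedian(task, runs = 5) {
  const durations = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await task();
    durations.push(performance.now() - start);
  }
  durations.sort((a, b) => a - b);
  const mid = Math.floor(durations.length / 2);
  return durations.length % 2 === 1
    ? durations[mid]
    : (durations[mid - 1] + durations[mid]) / 2;
}

// Usage with a stand-in task; in a real test the task would call the scraper.
measureMedian(() => new Promise(resolve => setTimeout(resolve, 10)), 3)
  .then(ms => console.log(`median: ${ms.toFixed(1)}ms`));
```

Runs are sequential on purpose, so the measurement reflects per-request latency rather than how well requests overlap.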

🔍 Test Debugging

1. Visual HTML Inspection

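// Save HTML for manual debugging.
// Assumes `fs` is imported from 'node:fs' and fetchStaticHtml is importable from the project source.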
it('captures HTML for debugging', async () => {
  const html = await fetchStaticHtml(TEST_URL);
  fs.writeFileSync('debug-netflix-page.html', html);
  console.log('HTML saved to debug-netflix-page.html');

  expect(html).toContain('<html');
  expect(html).toContain('netflix');
});

2. Network Request Debugging

// Debug network requests
it('logs network request details', async () => {
  const originalFetch = global.fetch;

  // Default options to {} so calls without a second argument do not throw
  global.fetch = async (url, options = {}) => {
    console.log('🌐 Request URL:', url);
    console.log('📋 Headers:', options.headers);
    console.log('⏰ Time:', new Date().toISOString());

    const response = await originalFetch(url, options);
    console.log('📊 Response status:', response.status);
    console.log('📏 Response size:', response.headers.get('content-length'));

    return response;
  };

  try {
    const result = await scraperNetflix(TEST_URL, { headless: false });
    expect(result.name).toBeTruthy();
  } finally {
    // Restore the original fetch even if the scrape or the assertion fails
    global.fetch = originalFetch;
  }
});

3. Step-by-Step Processing

// Debug each step of the process
it('logs processing steps', async () => {
  console.log('🚀 Starting Netflix scraping test');

  // Step 1: URL normalization
  const normalized = normalizeNetflixUrl(TEST_URL);
  console.log('🔗 Normalized URL:', normalized);

  // Step 2: HTML fetch
  const html = await fetchStaticHtml(normalized);
  console.log('📄 HTML length:', html.length);

  // Step 3: Parsing
  const parsed = parseNetflixHtml(html);
  console.log('📊 Parsed metadata:', parsed);

  // Step 4: Full process
  const fullResult = await scraperNetflix(TEST_URL);
  console.log('✅ Full result:', fullResult);

  expect(fullResult.name).toBeTruthy();
});

📈 Continuous Testing

GitHub Actions Workflow

# .github/workflows/test.yml
name: Test Suite

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest

    strategy:
      matrix:
        node-version: [18.x, 20.x, 22.x]

    steps:
    - uses: actions/checkout@v3

    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'

    - name: Install dependencies
      run: npm ci

    - name: Install Playwright
      run: npx playwright install --with-deps chromium

    - name: Run tests
      run: npm test -- --coverage

    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage/lcov.info

Pre-commit Hooks

// package.json
{
  "husky": {
    "hooks": {
      "pre-commit": "npm test && npm run lint"
    }
  }
}

🚨 Test Environment Considerations

Network Dependencies

Live Tests: Require internet connection to Netflix
Timeouts: Extended timeouts for network requests (30s+)
Rate Limiting: Be respectful to Netflix's servers
Geographic: Tests may behave differently by region
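
One way to keep live tests polite is to run requests one at a time with a pause between them. The sketch below assumes a fixed delay is sufficient; `runSequentially` and `sleep` are hypothetical helpers, and the default delay is an arbitrary illustrative value.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: runs tasks one at a time with a pause between them,
// so a batch of live tests does not hit the remote server in a burst.
// Results mirror the shape of Promise.allSettled.
async function runSequentially(tasks, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await sleep(delayMs);
    try {
      results.push({ status: 'fulfilled', value: await tasks[i]() });
    } catch (reason) {
      results.push({ status: 'rejected', reason });
    }
  }
  return results;
}

// Usage with stand-in tasks; real tasks would call the scraper with live URLs.
runSequentially([() => Promise.resolve('a'), () => Promise.reject(new Error('boom'))], 10)
  .then(results => console.log(results.map(r => r.status)));
```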

Browser Dependencies

Playwright: Optional dependency for headless tests
Browser Installation: Requires npx playwright install
Memory: Browser tests use more memory
CI/CD: Need to install browsers in CI environment

Test Data Updates

Netflix Changes: UI changes may break tests
Pattern Updates: Turkish UI patterns may change
JSON-LD Structure: Netflix may modify structured data
URL Formats: New URL patterns may emerge

📊 Test Metrics

Success Criteria

Unit Tests: 90%+ code coverage
Integration Tests: 100% API coverage
Performance: <1s response time for static mode
Reliability: 95%+ success rate for known URLs

Test Monitoring

// Performance tracking
const testMetrics = {
  staticScrapingTimes: [],
  headlessScrapingTimes: [],
  successRates: {},
  errorCounts: {}
};

function recordMetric(type, value) {
  if (Array.isArray(testMetrics[type])) {
    testMetrics[type].push(value);
  } else {
    testMetrics[type][value] = (testMetrics[type][value] || 0) + 1;
  }
}
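
Collected timings and counts can then be checked against the success criteria. The helpers below are a hypothetical sketch: `percentile` uses the nearest-rank method, and the sample numbers are made up for illustration rather than measured.

```javascript
// Hypothetical summary helpers for recorded metrics.
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  // Nearest-rank method: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

function successRate(successes, failures) {
  const total = successes + failures;
  return total === 0 ? 0 : successes / total;
}

// Illustrative values only, in milliseconds.
const times = [420, 380, 950, 510, 460];
console.log(percentile(times, 95)); // 950
console.log(successRate(19, 1) >= 0.95); // true
```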

Testing guide last updated: 2025-11-23
