cotalks.dev

Designing Python APIs for Data You Don’t Control - 2026

Speakers: Saurav Jain

Summary

This talk addresses the challenge of building data pipelines (scrapers and APIs) that rely on the open web—data that is inherently unstable and outside of your control. Since popular websites change their structure frequently (often every 2-4 weeks), developers must adopt a

Key Takeaways

  • Shift your mindset from 'Does it work now?' to 'What happens when it breaks?' to prevent silent data failures.
  • Define a strict data contract (using Pydantic/data classes) before scraping to ensure predictable output schema.
  • Implement robust data cleaning and validation (e.g., regex, explicit type conversion) to handle inconsistent web formats.
  • Differentiate between temporary errors (e.g., rate limiting) and structural breaking changes to guide recovery logic.
  • Monitor data quality metrics (e.g., completeness percentage) as a critical safeguard against silent data degradation.

Sections

The Challenge of Open Web Data

The open web is a massive source of valuable data, but it lacks the stability and guarantees of a formal API. Unlike a controlled API, a website offers no documentation and no versioning, and its changes arrive without notice. The biggest risk is not a loud crash but a silent failure: the scraper keeps running while returning incorrect or null data, which leads to flawed downstream decisions.

Establishing a Data Contract

Instead of simply scraping and dumping data into a dictionary, the talk advocates defining a clear data model first, using tools like Python `dataclasses` or Pydantic. This model acts as a contract, specifying which fields are required (e.g., title, URL) and which are optional (e.g., rating, review count). This ensures the output data structure is predictable, regardless of the source website's variability.
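
As a rough illustration of such a contract, the sketch below uses a standard-library dataclass. The `Product` class and the `to_product` helper are hypothetical names chosen for this example; the field names follow the ones mentioned above, and the talk's actual model may differ.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    # Required fields: a record without these is not usable.
    title: str
    url: str
    # Optional fields: a page may legitimately omit them.
    rating: Optional[float] = None
    review_count: Optional[int] = None


def to_product(raw: dict) -> Product:
    """Map a loosely structured scraped dict onto the fixed contract."""
    return Product(
        title=raw["title"],  # KeyError here means a required field is missing
        url=raw["url"],
        rating=raw.get("rating"),
        review_count=raw.get("review_count"),
    )


if __name__ == "__main__":
    scraped = {"title": "Widget", "url": "https://example.com/w", "color": "red"}
    # Unknown keys such as "color" never reach the output schema.
    print(asdict(to_product(scraped)))
```

Whatever the source page contains, the output always has the same four keys, and a missing required field fails loudly instead of producing a half-empty record. Note that a plain dataclass does not check types at runtime; that is one reason to reach for Pydantic when validation matters.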

Handling Data Inconsistency and Cleaning

Web data is a 'disaster zone' for formatting. Pricing, for example, can appear in multiple formats (e.g., '$1,200.00' vs '€45.00'). Robust scraping requires careful handling: check for element existence before accessing it, use regular expressions for text cleaning, and use explicit type conversions (like `Decimal` instead of `float`) to maintain precision. If conversion fails, the system should log the raw text and return `None`, rather than crashing.
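
The cleaning steps above can be sketched roughly as follows. `parse_price` is a hypothetical helper written for this summary, and it assumes US-style number formatting (comma as thousands separator, dot as decimal point); prices such as '1.200,00' would need locale-aware handling that is not shown here.

```python
import logging
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

logger = logging.getLogger(__name__)

# A number with optional thousands separators and decimals, e.g. 1,200.00
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Return the price as a Decimal, or None if it cannot be parsed."""
    if text is None:  # the element was missing on the page
        return None
    match = _NUMBER.search(text)
    if match is None:
        logger.warning("No price found in raw text: %r", text)
        return None
    try:
        # Decimal keeps exact cents; float would introduce rounding error.
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        # Defensive: log the raw text and return None instead of crashing.
        logger.warning("Could not convert price from raw text: %r", text)
        return None


if __name__ == "__main__":
    for raw in ["$1,200.00", "€45.00", "Out of stock", None]:
        print(repr(raw), "->", parse_price(raw))
```

The function drops the currency symbol; if the currency matters downstream, it would need to be captured as a separate field.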

Advanced Error Handling and Validation

The system must differentiate between error types. Temporary errors (like rate limiting or CAPTCHAs) should trigger retries or proxy swaps. Structural breaking changes (e.g., a complete redesign of the product page) require human intervention. The talk also recommends Pydantic for automatic validation: it enforces rules (e.g., a rating must be between 1 and 5) and produces clear error messages when data types or constraints are violated. With the `extra='ignore'` setting, unknown fields are dropped rather than rejected, so a website that introduces new fields does not cause crashes.
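
One way to encode the distinction between temporary and structural errors is with separate exception types, as in the minimal sketch below. The class names, the `fetch_with_retry` helper, and the exponential backoff policy are all assumptions made for illustration, not something the talk prescribes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TemporaryError(Exception):
    """Transient problem (e.g. rate limiting); worth retrying."""


class StructuralError(Exception):
    """The page layout changed; retrying will not help."""


def fetch_with_retry(
    fetch: Callable[[], T], attempts: int = 3, base_delay: float = 1.0
) -> T:
    """Retry temporary errors with backoff; let structural errors propagate."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TemporaryError:
            if attempt == attempts:
                raise  # out of retries: surface the failure
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Because `StructuralError` is never caught, it reaches the caller on the first occurrence, where it can be routed to an alert for a human. A proxy swap, if one is available, would fit in the `except TemporaryError` branch before the sleep.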

API Best Practices and Monitoring

When providing data derived from external sources, treat your output as a formal contract. Changes to the schema (removing, renaming, or changing types) must be communicated to users well in advance (e.g., 14 business days). Finally, continuous data quality monitoring is crucial. Tracking metrics like the percentage of non-null fields is the fastest way to detect a silent failure before it impacts business intelligence.
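
The completeness metric can be computed in a few lines. The sketch below is illustrative only: the function names and the 90% alert threshold are assumptions, and a real pipeline would likely compare each batch against its own historical baseline.

```python
from typing import Dict, Iterable, List, Mapping


def completeness(
    records: Iterable[Mapping[str, object]], fields: List[str]
) -> Dict[str, float]:
    """Percentage of records in which each field is present and not None."""
    rows = list(records)
    if not rows:
        # An empty batch is treated as 0% complete so that it triggers an alert.
        return {f: 0.0 for f in fields}
    return {
        f: 100.0 * sum(1 for r in rows if r.get(f) is not None) / len(rows)
        for f in fields
    }


def degraded_fields(stats: Mapping[str, float], threshold: float = 90.0) -> List[str]:
    """Fields whose completeness fell below the alert threshold."""
    return [f for f, pct in stats.items() if pct < threshold]


if __name__ == "__main__":
    batch = [
        {"title": "a", "price": 1},
        {"title": "b", "price": None},
        {"title": "c"},
        {"title": "d", "price": 2},
    ]
    stats = completeness(batch, ["title", "price"])
    print(stats, degraded_fields(stats))
```

A sudden drop in one field's percentage between runs is a strong hint that a selector stopped matching, even though the scraper itself reported no errors.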

Keywords: web scraping, python api design, pydantic, data validation, data contract, open web data, silent failure, data quality monitoring
