Selecting Data

The accuracy of deduplication depends directly on the quality and scope of the data provided. To achieve the best results, please consider the following factors when selecting your data:

  • Date range

    Provide data across a sufficient time span. Our algorithms identify recurring patterns over time, which requires at least two to three weeks of event logs. Depending on data density, up to four weeks may be needed.

  • Data density

    Choose datasets with high activity from User Identifiers (UIDs). The more frequently UIDs appear, the faster and more accurately they can be deduplicated. Sparse datasets may slow resolution or reduce its accuracy.

  • Device mix

    Include traffic from a variety of devices (Connected TV, Mobile, PC, OTT) to maximize graphable opportunities. A broader device mix helps us uncover cross-device patterns, though our engine can also resolve identifiers within a single device type (intra-device).

  • User ID types

    Incorporate all available UID types. This may include Device IDs, IDFA (Identifier for Advertisers), Android IDs, cookies, or other identifiers. Persistent identifiers (non-cookie UIDs) strengthen our ability to build long-term links and improve deduplication precision.
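
Before sending data, it can help to check a candidate dataset against the four factors above. The following is a minimal sketch, not part of any Samba TV tooling: the function name `profile_events` and the record keys `timestamp`, `uid`, `uid_type`, and `device_type` are hypothetical and should be mapped to your own log schema.

```python
from collections import Counter
from datetime import datetime


def profile_events(events):
    """Summarize date span, UID density, device mix, and UID types.

    `events` is an iterable of dicts with hypothetical keys:
    'timestamp' (ISO 8601 string), 'uid', 'uid_type', 'device_type'.
    """
    events = list(events)
    if not events:
        raise ValueError("no events to profile")

    # Calendar days covered, counting both the first and the last day.
    days = [datetime.fromisoformat(e["timestamp"]).date() for e in events]
    # Number of events seen for each UID (a rough density measure).
    per_uid = Counter(e["uid"] for e in events)

    return {
        "span_days": (max(days) - min(days)).days + 1,
        "unique_uids": len(per_uid),
        "mean_events_per_uid": len(events) / len(per_uid),
        "device_types": sorted({e["device_type"] for e in events}),
        "uid_types": sorted({e["uid_type"] for e in events}),
    }


# Tiny illustrative sample; real logs would be read from your own storage.
sample = [
    {"timestamp": "2024-03-01T08:00:00", "uid": "a1", "uid_type": "idfa", "device_type": "mobile"},
    {"timestamp": "2024-03-10T21:30:00", "uid": "a1", "uid_type": "idfa", "device_type": "mobile"},
    {"timestamp": "2024-03-21T19:15:00", "uid": "c9", "uid_type": "cookie", "device_type": "pc"},
]
report = profile_events(sample)
print(report)
```

A span of fewer than about 14 days, a low mean number of events per UID, or a single entry under `device_types` or `uid_types` would each suggest widening the selection before submitting it.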

❗️ Sampling

If needed, data volume may be reduced by:

  1. Restricting logs to a specific country or region
  2. Extracting only required attributes
  3. Using a compact data format such as Parquet or ORC

⚠️ Any other type of sampling (e.g., random record drops, interval-based sampling) is strongly discouraged because it degrades deduplication quality. If data size is a concern, please contact Samba TV to discuss safe alternatives.
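
The first two reduction options can be sketched with the Python standard library alone. This is an illustration under stated assumptions: the column names, the `REQUIRED_FIELDS` list, and the `reduce_logs` helper are hypothetical, and the required attributes should be confirmed for your integration. Every record for the selected country is kept, so no random or interval-based sampling takes place.

```python
import csv
import io

# Hypothetical attribute names; substitute the fields your integration requires.
REQUIRED_FIELDS = ["timestamp", "uid", "uid_type", "device_type"]


def reduce_logs(rows, country, fields=REQUIRED_FIELDS):
    """Yield every row for `country`, keeping only the listed `fields`."""
    for row in rows:
        if row["country"] == country:
            yield {name: row[name] for name in fields}


# In-memory stand-in for a raw log file.
raw = io.StringIO(
    "timestamp,uid,uid_type,device_type,country,user_agent\n"
    "2024-03-01T08:00:00,a1,idfa,mobile,US,ua-1\n"
    "2024-03-01T09:00:00,b2,cookie,pc,DE,ua-2\n"
    "2024-03-02T10:00:00,a1,idfa,mobile,US,ua-1\n"
)
reduced = list(reduce_logs(csv.DictReader(raw), country="US"))

# Write the reduced rows back out with only the required columns.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=REQUIRED_FIELDS)
writer.writeheader()
writer.writerows(reduced)
print(out.getvalue())
```

The sketch writes CSV only to stay dependency-free. Converting the reduced output to Parquet or ORC (the third option) needs a columnar-format library and is not shown here.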