Selecting Data

The accuracy of deduplication depends directly on the quality and scope of the data provided. To achieve the best results, please consider the following factors when selecting your data:

Date range

Provide data that spans a sufficient period. Our algorithms identify recurring patterns over time, which requires at least two to three weeks of event logs. Depending on data density, up to four weeks may be needed.
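As a rough pre-flight check, the span covered by an export can be measured before it is shared. The sketch below is illustrative only: it assumes ISO 8601 timestamps, and the 14-day threshold is simply the lower end of the range stated above, not a value taken from any Samba TV tool.

```python
from datetime import datetime

# Assumption: 14 days is the lower bound of the two-to-three-week guidance.
MIN_DAYS = 14

def span_in_days(timestamps):
    """Return the number of calendar days covered, counting both end days."""
    days = [datetime.fromisoformat(ts).date() for ts in timestamps]
    return (max(days) - min(days)).days + 1

# Hypothetical sample of event timestamps.
sample = ["2024-03-01T08:15:00", "2024-03-09T21:40:00", "2024-03-21T12:00:00"]
covered = span_in_days(sample)
print(covered, "days covered:", "ok" if covered >= MIN_DAYS else "too short")
```

A span near the lower bound may still be insufficient when the data is sparse, so this check is best read together with the density guidance.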

Density in data

Choose datasets in which User Identifiers (UIDs) appear frequently. The more often a UID appears, the faster and more accurately it can be deduplicated. Sparse datasets may slow resolution or reduce its accuracy.
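One way to gauge density is to count how many events and how many distinct active days each UID has. This is a minimal sketch under assumptions: the `(uid, timestamp)` input shape and the function name `uid_density` are hypothetical, and no specific density threshold is implied by the guidance above.

```python
from collections import defaultdict
from statistics import median

def uid_density(events):
    """Map each UID to (event_count, active_days) from (uid, iso_timestamp) pairs."""
    counts = defaultdict(int)
    days = defaultdict(set)
    for uid, ts in events:
        counts[uid] += 1
        days[uid].add(ts[:10])  # date portion of an ISO 8601 timestamp
    return {uid: (counts[uid], len(days[uid])) for uid in counts}

# Hypothetical sample: UID "a" is active on two days, UID "b" appears once.
events = [
    ("a", "2024-03-01T08:00:00"),
    ("a", "2024-03-01T09:00:00"),
    ("a", "2024-03-02T08:00:00"),
    ("b", "2024-03-05T10:00:00"),
]
density = uid_density(events)
median_events = median(count for count, _ in density.values())
single_event_share = sum(1 for count, _ in density.values() if count == 1) / len(density)
print("median events per UID:", median_events)
print("share of UIDs seen only once:", single_event_share)
```

A large share of UIDs seen only once suggests a sparse dataset, in which case a longer date range may help.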

Device mix

Include traffic from a variety of devices (Connected TV, Mobile, PC, OTT) to maximize the number of links that can be graphed. A broader device mix helps us uncover cross-device patterns, though our engine can also resolve identities within a single device type (intra-device).

User ID types

Incorporate all available UID types. This may include Device IDs, IDFA (Identifier for Advertisers), Android IDs, cookies, or other identifiers. Persistent identifiers (non-cookie UIDs) strengthen our ability to build long-term links and improve deduplication precision.
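Device mix and UID type coverage can be summarized with simple counts before an export is finalized. The sketch below is only an illustration: the field names `device_type` and `uid_type` and the label values are assumptions, so substitute whatever your logs actually contain. Following the text above, any non-cookie UID type is treated as persistent.

```python
from collections import Counter

# Assumption: illustrative labels for non-cookie (persistent) UID types.
PERSISTENT_UID_TYPES = {"device_id", "idfa", "android_id"}

def coverage(events):
    """Count events per device type and per UID type."""
    devices = Counter(e["device_type"] for e in events)
    uid_types = Counter(e["uid_type"] for e in events)
    return devices, uid_types

def persistent_share(uid_types):
    """Fraction of events that carry a persistent (non-cookie) identifier."""
    total = sum(uid_types.values())
    persistent = sum(n for t, n in uid_types.items() if t in PERSISTENT_UID_TYPES)
    return persistent / total if total else 0.0

# Hypothetical sample events.
events = [
    {"device_type": "ctv", "uid_type": "device_id"},
    {"device_type": "mobile", "uid_type": "idfa"},
    {"device_type": "mobile", "uid_type": "android_id"},
    {"device_type": "pc", "uid_type": "cookie"},
]
devices, uid_types = coverage(events)
print("device mix:", dict(devices))
print("persistent UID share:", persistent_share(uid_types))
```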

❗️ Sampling

If needed, data volume may be reduced by:

Restricting logs to a specific country or region

Extracting only required attributes

Using a compact data format such as Parquet or ORC

⚠️ Any other type of sampling (e.g., random record drops, interval-based sampling) is strongly discouraged as it degrades deduplication quality. If data size is a concern, please contact Samba TV to discuss safe alternatives.
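The first two reduction methods listed above can be sketched as a streaming filter that keeps every record for one country and drops only unneeded columns. This is a hedged illustration: the CSV layout, the column names, and the `reduce_log` helper are hypothetical. Writing Parquet or ORC is not shown because it requires a third-party library.

```python
import csv
import io

# Assumption: hypothetical names for the attributes needed for deduplication.
KEEP_COLUMNS = ["uid", "uid_type", "device_type", "timestamp"]

def reduce_log(src, dst, country, columns=KEEP_COLUMNS):
    """Copy every row for one country, keeping only the listed columns."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        # No random or interval-based dropping: all rows in the region are kept.
        if row["country"] == country:
            writer.writerow(row)
            kept += 1
    return kept

# Hypothetical raw log with extra attributes and two countries.
raw = (
    "uid,uid_type,device_type,timestamp,country,user_agent\n"
    "u1,idfa,mobile,2024-03-01T08:00:00,US,x\n"
    "u2,cookie,pc,2024-03-01T09:00:00,DE,y\n"
    "u1,idfa,ctv,2024-03-02T08:00:00,US,z\n"
)
out = io.StringIO()
kept = reduce_log(io.StringIO(raw), out, "US")
print(kept, "rows kept")
print(out.getvalue())
```

Because the filter is deterministic and keeps each UID's full history within the chosen region, it avoids the gaps that random or interval-based sampling would introduce.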
