Samba TV uses probabilistic algorithms to link identifiers that belong to the same person or household. To validate accuracy, we compare these matches against deterministic identifiers (such as logins or email addresses) whenever they are available in client data.
This validation produces precision and recall metrics, so you can measure how accurate and complete the Samba TV Identity Graph is against deterministic data.
Precision
Precision measures the percentage of matches in the graph that are confirmed correct against the deterministic truth set.
Two validation approaches are used:
Pair-Based Validation
Looks at individual pairs of User IDs.
- Correct if both User IDs resolve to the same deterministic ID.
- Incorrect if they resolve to different deterministic IDs.
Precision = correctPairs / (correctPairs + incorrectPairs)
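The pair-based formula can be sketched in Python as follows. This is a minimal illustration rather than Samba TV's implementation: the function name `pair_precision` and the two input dictionaries are hypothetical, and the sketch assumes that a pair is only evaluated when both User IDs appear in the cleaned deterministic dataset.

```python
from itertools import combinations


def pair_precision(graph_clusters, deterministic_id):
    """Pair-based precision of a graph against a deterministic truth set.

    graph_clusters: dict mapping a graph ID (Person or Household ID)
        to the User IDs the graph placed in it (hypothetical input shape).
    deterministic_id: dict mapping a User ID to its deterministic ID.
    """
    correct_pairs = 0
    incorrect_pairs = 0
    for user_ids in graph_clusters.values():
        # Assumption: only User IDs with a deterministic ID can be validated.
        known = [u for u in set(user_ids) if u in deterministic_id]
        for a, b in combinations(known, 2):
            if deterministic_id[a] == deterministic_id[b]:
                correct_pairs += 1
            else:
                incorrect_pairs += 1
    total = correct_pairs + incorrect_pairs
    # Return None when no pair could be validated, instead of dividing by zero.
    return correct_pairs / total if total else None


clusters = {"p1": ["u1", "u2", "u3"], "p2": ["u4", "u5"]}
truth = {"u1": "d1", "u2": "d1", "u3": "d2", "u4": "d3", "u5": "d3"}
print(pair_precision(clusters, truth))
```

In this toy input, cluster `p1` contributes one correct pair and two incorrect pairs, and `p2` contributes one correct pair, so two of the four validated pairs are correct.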
Cluster-Based Validation
Looks at entire clusters (Person IDs or Household IDs).
- Correct if the cluster contains only one deterministic ID, and that deterministic ID is linked to at least two User IDs.
- Incorrect if multiple deterministic IDs are found within the same cluster.
Precision = correctClusters / (correctClusters + incorrectClusters)
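A comparable sketch for the cluster-based rules is shown below. The function name `cluster_precision` and the input shapes are hypothetical. The sketch also makes two assumptions that the rules above do not spell out: the "at least two User IDs" condition is checked within the cluster, and clusters that meet neither rule (for example, a cluster with only one validated User ID) are skipped rather than counted.

```python
from collections import Counter


def cluster_precision(graph_clusters, deterministic_id):
    """Cluster-based precision of a graph against a deterministic truth set.

    graph_clusters: dict mapping a graph ID to its User IDs (hypothetical shape).
    deterministic_id: dict mapping a User ID to its deterministic ID.
    """
    correct_clusters = 0
    incorrect_clusters = 0
    for user_ids in graph_clusters.values():
        # Count how many of the cluster's User IDs map to each deterministic ID.
        counts = Counter(
            deterministic_id[u] for u in set(user_ids) if u in deterministic_id
        )
        if len(counts) > 1:
            # Multiple deterministic IDs inside one cluster.
            incorrect_clusters += 1
        elif len(counts) == 1 and next(iter(counts.values())) >= 2:
            # Exactly one deterministic ID, backed by at least two User IDs.
            correct_clusters += 1
        # Assumption: any other cluster cannot be validated and is skipped.
    total = correct_clusters + incorrect_clusters
    return correct_clusters / total if total else None


clusters = {"p1": ["u1", "u2", "u3"], "p2": ["u4", "u5"], "p3": ["u6"]}
truth = {"u1": "d1", "u2": "d1", "u3": "d2", "u4": "d3", "u5": "d3"}
print(cluster_precision(clusters, truth))
```

Here `p1` is incorrect (it mixes `d1` and `d2`), `p2` is correct, and `p3` is skipped because none of its User IDs has a deterministic ID.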
Recall
Recall measures the percentage of deterministic matches that are also present in the graph.
In other words: Of all the “ground truth” matches in your data, how many did the graph capture?
Like precision, recall is computed only after cleaning the deterministic data to remove noise or inconsistencies.
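This page does not give an exact recall formula, so the following is only one plausible reading, computed over pairs: of all pairs of User IDs that share a deterministic ID, the fraction that the graph also places in the same cluster. The function name `pair_recall` and the input shapes are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pair_recall(cluster_of, deterministic_id):
    """Pair-based recall under the assumption described above.

    cluster_of: dict mapping a User ID to its graph ID (hypothetical shape).
    deterministic_id: dict mapping a User ID to its deterministic ID,
        taken from the cleaned deterministic dataset.
    """
    users_by_det = defaultdict(list)
    for user, det in deterministic_id.items():
        users_by_det[det].append(user)

    captured = 0
    total = 0
    for users in users_by_det.values():
        # Every pair sharing a deterministic ID is a "ground truth" match.
        for a, b in combinations(users, 2):
            total += 1
            cluster_a = cluster_of.get(a)
            # A pair is captured only if both User IDs sit in the same cluster.
            if cluster_a is not None and cluster_a == cluster_of.get(b):
                captured += 1
    return captured / total if total else None


truth = {"u1": "d1", "u2": "d1", "u3": "d1", "u4": "d2", "u5": "d2"}
graph = {"u1": "p1", "u2": "p1", "u3": "p2", "u4": "p3"}
print(pair_recall(graph, truth))
```

In this toy input there are four ground-truth pairs and the graph captures only `u1` and `u2`, so recall is one in four.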
Data Cleaning & Quality Controls
To ensure validation is meaningful and unbiased, the deterministic dataset is pre-processed before use:
- Only User IDs present in the source event data are included.
- Deterministic IDs linked to only one User ID are excluded (since they cannot verify matches).
- User IDs linked to multiple deterministic IDs are removed to avoid ambiguity.
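The three cleaning rules above could be applied roughly as follows. The function name `clean_truth_set` and the input shapes are hypothetical, and the order of the steps is an assumption: ambiguous User IDs are removed before single-User-ID deterministic IDs are excluded, so that the output satisfies both rules at once.

```python
from collections import Counter, defaultdict


def clean_truth_set(links, event_user_ids):
    """Pre-process deterministic links before validation.

    links: iterable of (user_id, deterministic_id) pairs (hypothetical shape).
    event_user_ids: User IDs present in the source event data.
    Returns a dict mapping each retained User ID to its deterministic ID.
    """
    event_user_ids = set(event_user_ids)

    # Rule 1: keep only User IDs present in the source event data.
    dets_by_user = defaultdict(set)
    for user, det in links:
        if user in event_user_ids:
            dets_by_user[user].add(det)

    # Rule 3: remove User IDs linked to multiple deterministic IDs.
    unambiguous = {
        user: next(iter(dets))
        for user, dets in dets_by_user.items()
        if len(dets) == 1
    }

    # Rule 2: exclude deterministic IDs linked to only one User ID,
    # since they cannot verify any match.
    users_per_det = Counter(unambiguous.values())
    return {u: d for u, d in unambiguous.items() if users_per_det[d] >= 2}


links = [
    ("u1", "d1"), ("u2", "d1"), ("u3", "d2"),
    ("u4", "d1"), ("u4", "d3"), ("u5", "d3"), ("u9", "d1"),
]
print(clean_truth_set(links, {"u1", "u2", "u3", "u4", "u5"}))
```

In this toy input, `u9` is dropped because it is not in the event data, `u4` is dropped as ambiguous, and `d2` and `d3` are excluded because each is left with a single User ID.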
📘 Note: Deterministic datasets themselves are not flawless. For example, a single individual may use multiple logins, such as two different email addresses. Samba TV may correctly unify these identifiers, but deterministic validation may treat them as separate, creating the appearance of an “error.”
Coverage vs. Precision and Recall
These three metrics complement each other:
- Coverage: The share of all User IDs or events in your data that are linked in the graph. (See Coverage & Match Rate)
- Precision: Of the matches in the graph, how many are correct?
- Recall: Of the deterministic matches, how many did the graph capture?
Together they provide a complete view of graph quality:
- Coverage shows breadth — how much of your data is connected.
- Precision shows accuracy — how correct those matches are.
- Recall shows completeness — how much of the ground truth the graph recovers.
📘 Key takeaway: Precision and recall validation allows Samba TV’s probabilistic graphing methodology to be benchmarked against deterministic truth sets when they are available. This provides a transparent and measurable way to demonstrate the quality and reliability of our Household and Person ID solutions.