Selecting Data

Selecting data in order to start deduplication with Screen6

Because the amount of deduplication depends on the data provided to us, it is important to select the right data. Keep the following in mind: date range, density, device mix, and user ID types.

* **Date range**
Data should cover a long enough time range, because our pattern-matching algorithms work by spotting patterns over time. We need *at least* two to three weeks of data; up to four weeks may be required, depending on density.

* **Density in data**
The data you select should be **dense** in terms of User Identifier (UID) activity. The more frequently we encounter a UID, the sooner we can deduplicate it.

* **Device mix**
Including traffic from a mix of devices (mobile, PC, TV, OTT) will increase the level of deduplication: we will spot more patterns, and there will be more devices that we can deduplicate. Of course, we can also make **intra-device** connections.

* **User ID types**
The data should contain all available UID types. If you have access to Device IDs, IDFAs (Identifier for Advertisers), Android IDs, or any other type of UID, include them all in the data. These non-cookie UIDs are more persistent than cookies, which helps deduplication.

[block:callout]
{
  "type": "danger",
  "title": "Sampling",
  "body": "Data volume may be reduced by providing the data for a single country or region. Extract only the required attributes, and use a compact data format (such as Parquet or ORC). \n\nWe very strongly recommend against any other type of data sampling. If volume is an issue, please discuss options with the Screen6 Technical Ops team."
}
[/block]
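Before sending data, it can help to check a candidate extract against the criteria above. The following is a minimal sketch, not a Screen6 tool: the record layout and the field names (`timestamp`, `uid`, `uid_type`, `device_type`, `country`) are assumptions made for illustration, and your logs will likely use different names. It reports the covered date range, a rough density figure, and the device and UID types present, and it shows one way to restrict an extract to a single country and to the required attributes.

```python
from collections import Counter
from datetime import datetime


def summarize(events):
    """Summarize an extract against the selection criteria.

    `events` is a list of dicts; all field names here are hypothetical.
    """
    timestamps = [datetime.fromisoformat(e["timestamp"]) for e in events]
    # Number of calendar-ish days covered, counting the first day.
    span_days = (max(timestamps) - min(timestamps)).days + 1
    events_per_uid = Counter(e["uid"] for e in events)
    return {
        "span_days": span_days,
        "unique_uids": len(events_per_uid),
        # Rough density indicator: how often each UID is seen on average.
        "mean_events_per_uid": len(events) / len(events_per_uid),
        "device_types": sorted({e["device_type"] for e in events}),
        "uid_types": sorted({e["uid_type"] for e in events}),
    }


def select_region(events, country, fields):
    """Keep only events for one country, and only the listed attributes."""
    return [
        {field: e[field] for field in fields}
        for e in events
        if e["country"] == country
    ]


# Tiny made-up sample, for illustration only.
events = [
    {"timestamp": "2017-11-01T08:00:00", "uid": "cookie-1",
     "uid_type": "cookie", "device_type": "pc", "country": "NL"},
    {"timestamp": "2017-11-10T09:30:00", "uid": "idfa-1",
     "uid_type": "idfa", "device_type": "mobile", "country": "NL"},
    {"timestamp": "2017-11-21T20:15:00", "uid": "cookie-1",
     "uid_type": "cookie", "device_type": "pc", "country": "NL"},
    {"timestamp": "2017-11-21T21:00:00", "uid": "cookie-2",
     "uid_type": "cookie", "device_type": "pc", "country": "DE"},
]

summary = summarize(events)
nl_only = select_region(events, "NL", ["timestamp", "uid", "uid_type"])

for key, value in summary.items():
    print(f"{key}: {value}")
print(f"events kept for NL: {len(nl_only)}")
```

Note that `select_region` filters whole events by country rather than sampling them, which matches the only kind of volume reduction described in the callout above. Writing the result to Parquet or ORC is left out here because it requires a third-party library.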