{"_id":"5a3254fec049430012f55874","category":{"_id":"5a3254fdc049430012f5586e","version":"5a3254fdc049430012f5586d","project":"5587ff91b3bcf52b0051314f","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2015-06-22T12:29:06.930Z","from_sync":false,"order":0,"slug":"proof-of-concept-documentation","title":"Screen6 Documentation"},"user":"5587ff84b3bcf52b0051314e","project":"5587ff91b3bcf52b0051314f","parentDoc":null,"version":{"_id":"5a3254fdc049430012f5586d","project":"5587ff91b3bcf52b0051314f","__v":3,"createdAt":"2017-12-14T10:39:57.964Z","releaseDate":"2017-12-14T10:39:57.964Z","categories":["5a3254fdc049430012f5586e","5a3255199a6f2000125c0d61","5bbc98ba817d5b00038e914a"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"","version_clean":"1.6.0","version":"1.6"},"githubsync":"","__v":0,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-01-19T09:41:36.143Z","link_external":false,"link_url":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":5,"body":"[block:html]\n{\n  \"html\": \"<div>\\n<ol >\\n  <li> Did I include <b>all parameters</b>:\\n  <ul>\\n    <li> Date-time or Timestamp </li>\\n  \\t<li> IP Address - can be hashed but not truncated </li>\\n  \\t<li> User Agent or equivalent <sup>(1)</sup></li>\\n  \\t<li> All available types of User IDs, preferably in separate columns <sup>(2)</sup> </li>\\n    </ul>\\n  </li>\\n  <li> Are <b>IP addresses</b> complete, not truncated? <sup>(3)</sup> </li>\\n  <li> Are the <b>User Agents</b> complete, not truncated? 
</li>\\n  <li> Is the data in the correct format?\\n <ul type='a'>\\n    <li> Structured formats: Parquet, ORC, NDJSON, Avro</li>\\n    <li> Delimited formats - CSV/TSV\\n      <ol type='i'> \\n        <li> Are User Agents <b>quoted</b> where they contain a separator character?</li>\\n        <li> Are <b>empty values</b> clearly indicated?</li>\\n   </ol></li>\\n  </ul> \\n </li>\\n\\t<li> Are my files either <b>gzip compressed</b> or not compressed at all?</li>\\n\\t<li> Are file sizes always <b>between 100MB and 2GB</b>?</li>\\n  <li> Do file paths include the <b>date stamp</b> YYYYMMDD, e.g. <code>20180122/datasource_20180122_part123.log.gz</code>?</li>\\n</ol>\\n\\n  <p>&nbsp;</p>\\n\\n<p><small>(1)</small> If a full browser-like user agent isn't available (for example, in data from mobile applications), please discuss alternatives with Screen6.</p>\\n<p><small>(2)</small> If different types of user identifiers (cookies, MAIDs such as IFA or GAID, Vendor ID, Impression ID) are combined, it's essential to clearly identify the type of each ID, either by using separate columns or by including an ID type column.</p>\\n<p><small>(3)</small> Truncation is sometimes done by replacing the final octet with a zero or other placeholder. Please contact us if any truncated IP data will be included.</p>\\n</div>\\n\\n<style>\\nol { line-height: 2em; }\\nul { line-height: 1.5em; }\\n</style>\"\n}\n[/block]\n\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"Please avoid:\",\n  \"body\": \"* Any kind of sampling, except region-based sampling.\\n* Alteration of ID values or other modification of source events.\\n* Providing event/impression IDs instead of cookie IDs, user IDs, or MAIDs.\\n\\nScreen6 algorithms must process the entire dataset to produce meaningful results.\"\n}\n[/block]\nIf you have any questions, please don't hesitate to contact us. Thanks for reading!","excerpt":"","slug":"log-file-checklist","type":"basic","title":"Data checklist"}
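Several of the checklist items above can be checked automatically before delivery. The following is a minimal sketch in Python, not an official Screen6 tool. All function names are hypothetical, "100MB" and "2GB" are assumed to be decimal units, and the 50% threshold in the IP check is an arbitrary assumption for illustration.

```python
import csv
import io
import re
from datetime import datetime

# Assumption: "100MB" and "2GB" are read as decimal units.
MIN_BYTES = 100 * 10**6
MAX_BYTES = 2 * 10**9

# An 8-digit run that is not part of a longer run of digits.
DATE_STAMP = re.compile(r"(?<!\d)\d{8}(?!\d)")

# Leading bytes of gzip, and of bzip2, xz, zip and zstd respectively.
GZIP_MAGIC = b"\x1f\x8b"
OTHER_COMPRESSION_MAGIC = (b"BZh", b"\xfd7zXZ\x00", b"PK\x03\x04", b"\x28\xb5\x2f\xfd")


def has_date_stamp(path):
    """True if the path contains an 8-digit run that is a valid YYYYMMDD date."""
    for candidate in DATE_STAMP.findall(path):
        try:
            datetime.strptime(candidate, "%Y%m%d")
            return True
        except ValueError:
            continue
    return False


def size_in_range(size_bytes):
    """True if the file size lies between the assumed 100MB and 2GB limits."""
    return MIN_BYTES <= size_bytes <= MAX_BYTES


def compression_ok(head):
    """Accept gzip or uncompressed data, judged by the first bytes of the file."""
    if head.startswith(GZIP_MAGIC):
        return True
    return not any(head.startswith(magic) for magic in OTHER_COMPRESSION_MAGIC)


def looks_truncated(ips, threshold=0.5):
    """Heuristic: flag the sample if many IPv4 addresses end in a zero octet."""
    ipv4 = [ip for ip in ips if ip.count(".") == 3]
    if not ipv4:
        return False  # hashed or non-IPv4 values cannot be judged this way
    zeros = sum(ip.endswith(".0") for ip in ipv4)
    return zeros / len(ipv4) >= threshold


def csv_line(fields):
    """Serialize one row, quoting any field that contains the separator."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue()
```

In practice, `size_in_range` could be fed from `os.path.getsize(path)` and `compression_ok` from the first six bytes of each file. The IP check only covers the zero-octet form of truncation mentioned in footnote (3); other placeholders would need their own rule.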