  1. Did I include all of the following parameters?
    • Date-time or Timestamp
    • IP Address - can be hashed but not truncated
    • User Agent or equivalent (1)
    • All available types of User IDs, preferably in separate columns (2)
  2. Are IP addresses complete, not truncated? (3)
  3. Are the User Agents complete, not truncated?
  4. Is the data in the correct format?
    • Structured formats: Parquet, ORC, NDJSON, Avro
    • Delimited formats: CSV, TSV
      1. Are User Agents quoted where they contain the delimiter character?
      2. Are empty values clearly indicated?
  5. Are my files either gzip compressed or not compressed at all?
  6. Are file sizes always between 100MB and 2GB?
  7. Do file paths include the date stamp YYYYMMDD, e.g. `20180122/datasource_20180122_part123.log.gz`?
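
As a rough illustration of items 4 to 7, the sketch below writes a gzip-compressed CSV payload and builds a date-stamped path using only the Python standard library. The column layout (timestamp, IP, user agent, ID type, user ID), the `datasource` prefix, and the helper names `build_path` and `rows_to_gzip_csv` are assumptions made for this example, not a Screen6 specification.

```python
import csv
import gzip
import io
from datetime import date


def build_path(source: str, day: date, part: int) -> str:
    """Return a path that embeds the YYYYMMDD date stamp (hypothetical naming)."""
    stamp = day.strftime("%Y%m%d")
    return f"{stamp}/{source}_{stamp}_part{part}.log.gz"


def rows_to_gzip_csv(rows) -> bytes:
    """Serialize rows as CSV and gzip-compress the result."""
    buf = io.StringIO()
    # QUOTE_MINIMAL quotes only fields that contain the delimiter, a quote
    # character, or a line break, e.g. user agents with commas.
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in rows:
        # None is written as an empty field, so a missing value stays
        # visible as two adjacent delimiters.
        writer.writerow(row)
    return gzip.compress(buf.getvalue().encode("utf-8"))


# Illustrative rows: timestamp, IP, user agent, ID type, user ID.
rows = [
    ["2018-01-22T10:15:00Z", "203.0.113.7",
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)",
     "cookie", "abc123"],
    ["2018-01-22T10:15:02Z", "198.51.100.24", None,
     "gaid", "38400000-8cf0-11bd-b23e-10b96e40000d"],
]
path = build_path("datasource", date(2018, 1, 22), 123)
payload = rows_to_gzip_csv(rows)
```

The first user agent contains a comma, so it is written in quotes, while the missing user agent in the second row becomes an empty field. Whether an empty field or an explicit marker is the preferred way to flag a missing value is not stated here, so treat that convention as an assumption.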

 

(1) If a full browser-like user agent isn't available (for example, in data from mobile applications), please discuss alternatives with Screen6.

(2) If several types of User Identifiers are provided together (cookies, MAIDs such as IFA or GAID, Vendor ID, Impression ID), it's essential to clearly identify the type of each ID, either by using a separate column per type or by including an ID type column.

(3) Truncation is sometimes done by replacing the final octet with a zero or other placeholder. Please let us know in advance if any truncated IP data will be included.
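
To illustrate the notes on IP addresses, here is a minimal sketch that hashes the full address instead of truncating it, together with a rough heuristic for spotting the kind of truncation described in (3). SHA-256, the optional `salt` parameter, and the function names `hash_ip` and `share_ending_in_zero` are assumptions for illustration only; the actual hashing scheme should be agreed with Screen6.

```python
import hashlib
import ipaddress


def hash_ip(ip: str, salt: str = "") -> str:
    """Hash the complete IP address (hypothetical scheme: salted SHA-256)."""
    # Normalize first so equivalent spellings of one address hash identically.
    normalized = str(ipaddress.ip_address(ip))
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


def share_ending_in_zero(ips) -> float:
    """Return the fraction of IPv4 addresses whose final octet is zero."""
    addresses = [ipaddress.ip_address(ip) for ip in ips]
    v4 = [a for a in addresses if a.version == 4]
    if not v4:
        return 0.0
    zeroed = sum(1 for a in v4 if (int(a) & 0xFF) == 0)
    return zeroed / len(v4)
```

The same salt has to be used for every event, otherwise identical addresses no longer produce identical hashes. The zero-octet check only works on unhashed data, and it is a heuristic: a small share of genuine addresses end in `.0`, so only a clearly elevated share hints at truncation.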


Please avoid:

  • Any kind of sampling, except for region-based.
  • Alteration of ID values or any other modification of source events.
  • Providing event/impression IDs instead of Cookie/User ID/MAIDs.

Screen6 algorithms must process the entire dataset to produce meaningful results.

If you have any questions, please don't hesitate to contact us. Thanks for reading!