File Format

Requirements and formatting specifications for batch log files

Data Format

Samba TV supports the following data file formats:

  • Apache Parquet
  • Optimized Row Columnar (ORC)
  • Newline-delimited JSON (NDJSON)
  • Apache Avro
  • Tab- or comma-separated values (TSV/CSV), or similar custom-delimited formats

Preferred format: For delimited files, tab-separated values (TSV) are recommended, because fields that contain commas (e.g., UserAgent strings) then do not need to be quoted.

All files must be encoded in UTF-8 or ISO-8859-1.
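As an illustration of the TSV recommendation, the following minimal sketch uses Python's standard csv module to serialize rows with a tab delimiter and then encodes the result as UTF-8. The function name rows_to_tsv and the sample columns are hypothetical and are not part of any Samba TV specification.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize an iterable of rows to a tab-separated string."""
    buf = io.StringIO()
    # With a tab delimiter, a comma inside a field is ordinary data
    # and does not trigger quoting.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample row: timestamp, device ID, UserAgent string.
rows = [
    ["2015-01-22T10:00:00Z", "device-1", "Mozilla/5.0 (X11; Linux x86_64), like Gecko"],
]
tsv_text = rows_to_tsv(rows)
tsv_bytes = tsv_text.encode("utf-8")  # UTF-8 is one of the two accepted encodings
```

The UserAgent field contains a comma but is written unquoted, which is the advantage of TSV described above.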

Empty values

If your logs contain missing values (for example, when a DeviceID is unavailable), Samba TV requires you to explicitly indicate these empty fields.

Commonly accepted representations:

  • empty string (the field must still be delimited by the separator character, even though it contains nothing)
  • null
  • \N
  • 0
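The handling of missing values in a delimited file can be sketched as follows. The helper to_tsv_line and the three-column layout are hypothetical. The sketch assumes missing values are represented as None in the producing application.

```python
def to_tsv_line(values, empty=""):
    """Join values with tabs, writing `empty` in place of any missing (None) value."""
    return "\t".join(empty if v is None else str(v) for v in values)


# Hypothetical row: date, DeviceID (unavailable), country.
line = to_tsv_line(["2015-01-22", None, "US"])

# The same row using the \N representation instead of an empty string.
line_with_marker = to_tsv_line(["2015-01-22", None, "US"], empty="\\N")
```

With the empty-string representation, the missing DeviceID appears as two adjacent tab characters, so the row still splits into three columns.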

Size, compression and naming

To ensure smooth ingestion, please follow these requirements:

  • File Size: Individual files must not exceed 2 GB. Avoid large numbers of very small files (<50 MB), because many small files slow down ingestion.
  • Compression: Use gzip compression or provide files uncompressed.
  • File Naming: Include a date stamp in YYYYMMDD format in the filename.
    • Example: client_20150122_part123.log.gz
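The requirements above can be sketched in Python as follows. The helper names build_filename and write_gzip are hypothetical, and the name pattern simply mirrors the example filename. The sketch also makes two assumptions that this page does not confirm: it reads 2 GB as 2 × 10^9 bytes, and it checks the size of the file as written to disk (that is, after compression).

```python
import gzip
import os
import tempfile
from datetime import date

# Assumption: "2 GB" read conservatively as 2 * 10**9 bytes.
MAX_BYTES = 2 * 10**9


def build_filename(prefix, day, part):
    """Build a name like client_20150122_part123.log.gz (YYYYMMDD date stamp)."""
    return f"{prefix}_{day:%Y%m%d}_part{part}.log.gz"


def write_gzip(directory, name, text):
    """Write text as a gzip-compressed UTF-8 file and check its size on disk."""
    path = os.path.join(directory, name)
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        f.write(text)
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"{name} is {size} bytes, above the assumed limit")
    return path


name = build_filename("client", date(2015, 1, 22), 123)
with tempfile.TemporaryDirectory() as tmp:
    path = write_gzip(tmp, name, "2015-01-22\tdevice-1\tUS\n")
    # Read the file back to confirm that it decompresses to the same text.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        round_trip = f.read()
```

A real export job would typically also split its output into multiple parts before any single file approaches the size limit, rather than raising an error after writing.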