Requirements and formatting specifications for batch log files
Data Format
Samba TV supports the following data file formats:
- Apache Parquet
- Optimized Row Columnar (ORC)
- Newline-delimited JSON (NDJSON)
- Apache Avro
- Tab- or Comma-separated values (TSV/CSV) or similar custom-delimited formats
Preferred format: If using delimited files, Tab-separated (TSV) is recommended because it avoids the need to quote fields containing commas (e.g., UserAgent strings).
All files must be encoded in UTF-8 or ISO-8859-1.
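As a minimal sketch of the TSV recommendation above, the following Python snippet serializes the same row with a tab and with a comma as the delimiter, then encodes the result as UTF-8. The function name `to_delimited` and the sample record are hypothetical and only illustrate the point; they are not part of any Samba TV tooling.

```python
import csv
import io


def to_delimited(rows, delimiter):
    """Serialize rows (lists of strings) using the given field delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical record: timestamp, device ID, UserAgent (which contains a comma).
rows = [[
    "2015-01-22T10:15:00Z",
    "device-123",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)",
]]

tsv_text = to_delimited(rows, "\t")  # the comma is not special, so no quoting
csv_text = to_delimited(rows, ",")   # the UserAgent field has to be quoted

# Encode as UTF-8, one of the two encodings named above.
tsv_bytes = tsv_text.encode("utf-8")
```

With a tab delimiter the UserAgent field is written as-is, whereas the comma-delimited output wraps it in double quotes.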
Empty values
If your logs contain missing values (for example, when a DeviceID is unavailable), Samba TV requires you to explicitly indicate these empty fields.
Commonly accepted representations:
- Empty string (the field must still be delimited by the separator character)
- null
- \N
- 0
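The point about keeping the separator around an empty value can be sketched as follows. This is an illustrative Python helper, not an official utility: the name `format_record`, the `missing` parameter, and the sample record are assumptions made for the example.

```python
def format_record(fields, missing=""):
    """Join fields with tabs, writing `missing` in place of any None value.

    The position of a missing field is preserved, so every line keeps the
    same number of columns.
    """
    return "\t".join(missing if value is None else str(value) for value in fields)


# Hypothetical record whose DeviceID (second column) is unavailable.
record = ["2015-01-22T10:15:00Z", None, "203.0.113.7"]

as_empty = format_record(record)                  # two tabs in a row mark the gap
as_marker = format_record(record, missing=r"\N")  # explicit \N marker instead
```

Either way the line still splits into three columns, so a downstream parser can tell which field was missing.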
Size, compression and naming
To ensure smooth ingestion, please follow these requirements:
- File Size: Individual files must not exceed 2 GB. Avoid large numbers of very small files (under 50 MB), as this slows down ingestion.
- Compression: Use gzip compression or provide files uncompressed.
- File Naming: Include a date stamp in YYYYMMDD format in the filename.
- Example: client_20150122_part123.log.gz
- Example:
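The size, compression, and naming requirements above could be combined in a small export helper along these lines. This is a sketch under stated assumptions: the helper names, the `client` prefix, and the `_partN` suffix mirror the example filename but are otherwise hypothetical, 2 GB is read conservatively as 2 * 10^9 bytes, and the limit is checked against the file as written to disk.

```python
import gzip
import os
from datetime import date

# Assumption: "2 GB" interpreted as 2 * 10**9 bytes (the stricter reading).
MAX_FILE_BYTES = 2 * 10**9


def build_filename(prefix, day, part):
    """Build a name with a YYYYMMDD date stamp, e.g. client_20150122_part123.log.gz."""
    return f"{prefix}_{day:%Y%m%d}_part{part}.log.gz"


def write_gzip_log(path, lines):
    """Write lines to a gzip-compressed UTF-8 file and return its size in bytes."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        raise ValueError(f"{path} is {size} bytes, above the 2 GB limit")
    return size


name = build_filename("client", date(2015, 1, 22), 123)
```

In practice an exporter would also batch records so that each part stays well above the small-file threshold, which this sketch does not attempt.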