Skip to main content

CSV Format

Feldera can ingest and output data in the CSV format

Here we document the CSV format and how it interacts with different SQL types.

The CSV format expects comma-separated columns and rows separated by a newline (\n). The egress and expected ingress character encoding is UTF-8.

The rows must appear in the same order as the program table definition specified the fields. For example, consider the following table:

create table git_commit (
commit_id varchar not null,
commit_date timestamp not null,
);

An example CSV row would look like this:

c6d16e61,2024-02-25 12:12:33

Any unspecified rows following are discarded on ingress: This line c6d16e61,2024-02-25 12:12:33,invalid would be equivalent to the row above for the given table.

For nullable types that are not strings, a value can be set to NULL by either leaving the field empty, or writing NULL or null as the column value. For string types (e.g., CHAR, VARCHAR), leaving the field empty will be interpreted as an empty string instead of NULL.

Types

TypeExample
BOOLEANtrue, false
TINYINT,SMALLINT, INTEGER, BIGINT1, -9
FLOAT, DOUBLE, DECIMAL-1.40, 12.53, 1e20, NaN
CHAR, VARCHAR, STRING, TEXTabc
TIME12:12:33, 23:59:29.483, 23:59:09.483221092
TIMESTAMP2024-02-25 12:12:33
DATE2024-02-25
BIGINT ARRAY[1, 2]
VARCHAR ARRAY ARRAY[[ 'abc', '123'], ['c', 'sql']]
VARIANT"{""foo"": ""bar""}" (see VARIANT documetation)

BOOLEAN

The accepted values are true or false.

Integers (TINYINT, SMALLINT, INTEGER, BIGINT`)

Must be a valid integer and fit the range of the type (see SQL Types), otherwise an error is returned on ingress.

Decimals (DECIMAL / NUMERIC)

Either scientific notation (e.g., 3e234) or standard floating point numbers are valid 1.23. The provided value must fit within the specified range or precision, otherwise an error is returned.

Floating point numbers (FLOAT, DOUBLE)

Either scientific notation (e.g., 3e234), or standard floating point numbers are valid 1.23.

Not a number can be specified using NaN (case insensitive). Infinity can be specified using -Inf or Inf, or +Inf (case insensitive). If a floating point value is provided that is outside of the valid range for that type, it is set to NaN.

If a value is provided inside the maximum range of the type but still can't be represented by the type it is rounded to the nearest representable floating point value.

Strings (CHAR, VARCHAR, TEXT, STRING)

Accepts strings with any number of characters.

note

The CSV parser does not currently enforce limits on the number of characters in a string. Strings that exceed the length specified in the SQL table declaration are ingested without truncation.

If a string contains commas, it can be quoted with ": "string, with, commas".

Note that a string of just null or NULL for a nullable column gets translated to the NULL value in SQL.

TIME

Specifies times using the HH:MM:SS.fffffffff format where:

  • HH is hours from 00-23.
  • MM is minutes from 00-59.
  • SS is seconds from 00-59.
  • fffffffff is the sub-second precision up to 9 digits from 0 to 999999999

A leading 0 can be skipped in hour, minutes and seconds. Specifying the subsecond precision is optional and can have any number of digits from 0 to 9. Leading and trailing whitespaces are ignored for ingress.

DATE

Specifies dates using the YYYY-MM-DD format.

  • YYYY is the year from 0001-9999
  • MM is the month from 01-12
  • DD is the day from 01-31

Invalid dates (e.g., 1997-02-29) are rejected with an error during ingress. Leading zeros can be skipped, e.g., 0001-1-01, 1-1-1, 0000-1-1 are all equal and valid. Leading and trailing whitespaces are ignored for ingress.

TIMESTAMP

Specifies dates using the YYYY-MM-DD HH:MM:SS.fff format.

  • YYYY is the year from 0001-9999
  • MM is the month from 01-12
  • DD is the day from 01-31
  • HH is hours from 00-23.
  • MM is minutes from 00-59.
  • SS is seconds from 00-59.
  • fff is the sub-second precision up to 3 digits from 0 to 999

Note that the same rules as specified in the Date and Time sections apply, except that the sub-second precision is limited to three digits (microseconds). Specifying more digits for the subsecond precision on ingress will trim the fraction to microseconds. Leading and trailing whitespaces are ignored for ingress.

ARRAY

The CSV format does not have native support for arrays. Arrays are expected to be represented in the form of a string that is a valid JSON array. e.g., a value for ARRAY BIGINT can be expressed as '[1,2,3]'.

Configuring CSV event streams

Configure connectors

When adding a new input or output connector on a table or view, the data format is specified in the format field of the connector configuration:

create table FAILED_BANKS (
name varchar,
city varchar,
state varchar,
cert bigint,
acquirer varchar,
closing varchar, -- needs to be translated from 'DD-MMM-YY' format
fund bigint
) with (
'connectors' = '[{
"transport": {
"name": "url_input",
"config": { "path": "https://www.fdic.gov/system/files/2024-07/banklist.csv" }
},
"format": {
"name": "csv",
"config": { "headers": true }
}
}]'
);
  • delimiter: A single character that delimits fields. The default is ",".
  • headers: Whether the first line of input is a header line. If this is set to true, Feldera ignores the first line. This applies only to input connectors (Feldera never writes a header line for CSV output). The default is false.
  • quote: The quote character used to wrap fields that contain the delimiter or newlines. Must be an ASCII character. The default is '"'. Set quoting to false to disable quoting entirely.
  • escape: An escape character that can precede a quote inside a quoted field to produce a literal quote byte (e.g., "\\" for backslash). When unset (the default), the double_quote convention is used instead.
  • double_quote: When true (the default), a quote character inside a quoted field may be escaped by doubling it (e.g., "" represents a single "). Setting this to false disables that convention; use escape to escape quotes in that case.
  • quoting: When true (the default), the quote and escape characters have their special meaning. When false, every newline terminates a record regardless of context, and neither quote nor escape is treated specially.
  • comment: A single ASCII character that marks a line as a comment. Lines whose first byte matches this character are skipped entirely. Unset by default.
  • flexible: When true (the default), records may have fewer or more fields than expected. Missing fields receive empty values. When false, every record must have exactly the same number of fields as the first record; a mismatch is reported as an error.
  • trim: Controls whitespace trimming. Allowed values:
    • "none" (default) — no trimming.
    • "fields" — strip leading and trailing ASCII whitespace from every field value.
    • "headers" — strip whitespace from header names only (effective only when headers is true).
    • "all" — strip whitespace from both field values and header names.
  • buffer_size_records: Maximum number of records to batch into a single output message (output connectors only). When this threshold is reached, the encoder flushes the buffer to the transport. The default is 10000.