CSV Format
Feldera can ingest and output data in the CSV format
- via
ingressandegressREST endpoints by specifying?format=csvin the URL - as a payload received from or sent to a connector
Here we document the CSV format and how it interacts with different SQL types.
The CSV format expects comma-separated columns and rows
separated by a newline (\n). The egress and expected ingress character
encoding is UTF-8.
The rows must appear in the same order as the program table definition specified the fields. For example, consider the following table:
create table git_commit (
commit_id varchar not null,
commit_date timestamp not null,
);
An example CSV row would look like this:
c6d16e61,2024-02-25 12:12:33
Any unspecified rows following are discarded on ingress: This line
c6d16e61,2024-02-25 12:12:33,invalid would be equivalent to the row above for
the given table.
For nullable types that are not strings, a value can be set to NULL by either
leaving the field empty, or writing NULL or null as the column value. For
string types (e.g., CHAR, VARCHAR), leaving the field empty will be
interpreted as an empty string instead of NULL.
Types
| Type | Example |
|---|---|
| BOOLEAN | true, false |
| TINYINT,SMALLINT, INTEGER, BIGINT | 1, -9 |
| FLOAT, DOUBLE, DECIMAL | -1.40, 12.53, 1e20, NaN |
| CHAR, VARCHAR, STRING, TEXT | abc |
| TIME | 12:12:33, 23:59:29.483, 23:59:09.483221092 |
| TIMESTAMP | 2024-02-25 12:12:33 |
| DATE | 2024-02-25 |
| BIGINT ARRAY | [1, 2] |
| VARCHAR ARRAY ARRAY | [[ 'abc', '123'], ['c', 'sql']] |
| VARIANT | "{""foo"": ""bar""}" (see VARIANT documetation) |
BOOLEAN
The accepted values are true or false.
Integers (TINYINT, SMALLINT, INTEGER, BIGINT`)
Must be a valid integer and fit the range of the type (see SQL Types), otherwise an error is returned on ingress.
Decimals (DECIMAL / NUMERIC)
Either scientific notation (e.g., 3e234) or standard floating point numbers
are valid 1.23. The provided value must fit within the specified range or
precision, otherwise an error is returned.
Floating point numbers (FLOAT, DOUBLE)
Either scientific notation (e.g., 3e234), or standard floating point numbers
are valid 1.23.
Not a number can be specified using NaN (case insensitive). Infinity
can be specified using -Inf or Inf, or +Inf (case insensitive).
If a floating point value is provided that is outside of the valid range
for that type, it is set to NaN.
If a value is provided inside the maximum range of the type but still can't be represented by the type it is rounded to the nearest representable floating point value.
Strings (CHAR, VARCHAR, TEXT, STRING)
Accepts strings with any number of characters.
The CSV parser does not currently enforce limits on the number of characters in a string. Strings that exceed the length specified in the SQL table declaration are ingested without truncation.
If a string contains commas, it can be quoted with ": "string, with, commas".
Note that a string of just null or NULL for a nullable column gets
translated to the NULL value in SQL.
TIME
Specifies times using the HH:MM:SS.fffffffff format where:
HHis hours from00-23.MMis minutes from00-59.SSis seconds from00-59.fffffffffis the sub-second precision up to 9 digits from0to999999999
A leading 0 can be skipped in hour, minutes and seconds. Specifying the subsecond precision is optional and can have any number of digits from 0 to 9. Leading and trailing whitespaces are ignored for ingress.
DATE
Specifies dates using the YYYY-MM-DD format.
YYYYis the year from0001-9999MMis the month from01-12DDis the day from01-31
Invalid dates (e.g., 1997-02-29) are rejected with an error during ingress.
Leading zeros can be skipped, e.g., 0001-1-01, 1-1-1, 0000-1-1 are all
equal and valid. Leading and trailing whitespaces are ignored for ingress.
TIMESTAMP
Specifies dates using the YYYY-MM-DD HH:MM:SS.fff format.
YYYYis the year from0001-9999MMis the month from01-12DDis the day from01-31HHis hours from00-23.MMis minutes from00-59.SSis seconds from00-59.fffis the sub-second precision up to 3 digits from0to999
Note that the same rules as specified in the Date and Time sections apply, except that the sub-second precision is limited to three digits (microseconds). Specifying more digits for the subsecond precision on ingress will trim the fraction to microseconds. Leading and trailing whitespaces are ignored for ingress.
ARRAY
The CSV format does not have native support for arrays. Arrays are expected to
be represented in the form of a string that is a valid JSON array. e.g., a value
for ARRAY BIGINT can be expressed as '[1,2,3]'.
Configuring CSV event streams
Configure connectors
When adding a new input or output connector on a table or view,
the data format is specified in the format field of the connector configuration:
create table FAILED_BANKS (
name varchar,
city varchar,
state varchar,
cert bigint,
acquirer varchar,
closing varchar, -- needs to be translated from 'DD-MMM-YY' format
fund bigint
) with (
'connectors' = '[{
"transport": {
"name": "url_input",
"config": { "path": "https://www.fdic.gov/system/files/2024-07/banklist.csv" }
},
"format": {
"name": "csv",
"config": { "headers": true }
}
}]'
);
delimiter: A single character that delimits fields. The default is",".headers: Whether the first line of input is a header line. If this is set totrue, Feldera ignores the first line. This applies only to input connectors (Feldera never writes a header line for CSV output). The default isfalse.quote: The quote character used to wrap fields that contain the delimiter or newlines. Must be an ASCII character. The default is'"'. Setquotingtofalseto disable quoting entirely.escape: An escape character that can precede a quote inside a quoted field to produce a literal quote byte (e.g.,"\\"for backslash). When unset (the default), thedouble_quoteconvention is used instead.double_quote: Whentrue(the default), a quote character inside a quoted field may be escaped by doubling it (e.g.,""represents a single"). Setting this tofalsedisables that convention; useescapeto escape quotes in that case.quoting: Whentrue(the default), thequoteandescapecharacters have their special meaning. Whenfalse, every newline terminates a record regardless of context, and neitherquotenorescapeis treated specially.comment: A single ASCII character that marks a line as a comment. Lines whose first byte matches this character are skipped entirely. Unset by default.flexible: Whentrue(the default), records may have fewer or more fields than expected. Missing fields receive empty values. Whenfalse, every record must have exactly the same number of fields as the first record; a mismatch is reported as an error.trim: Controls whitespace trimming. Allowed values:"none"(default) — no trimming."fields"— strip leading and trailing ASCII whitespace from every field value."headers"— strip whitespace from header names only (effective only whenheadersistrue)."all"— strip whitespace from both field values and header names.
buffer_size_records: Maximum number of records to batch into a single output message (output connectors only). When this threshold is reached, the encoder flushes the buffer to the transport. The default is10000.