SWAP |
swap <column1> <column2> |
Swaps the column names of two columns. |
ENCODE |
encode <base32 |
base64 |
XPATH |
xpath <column> <destination> <xpath> |
Extract a single XML element or attribute using XPath. |
GENERATE-UUID |
generate-uuid <column> |
Populates a column with a universally unique identifier (UUID) of the record. |
LOWERCASE |
lowercase <column> |
Changes the column values to lowercase. |
WRITE-AS-CSV |
write-as-csv <column> |
Writes the records files as well-formatted CSV |
PARSE-AS-PROTOBUF |
parse-as-protobuf <column> <schema-id> <record-name> [version] |
Parses column as protobuf encoded memory representations. |
HASH |
hash <column> <algorithm> [<encode=true |
false>] |
JSON-PATH |
json-path <source> <destination> <json-path-expression> |
Parses JSON elements using a DSL (a JSON path expression). |
MASK-NUMBER |
mask-number <column> <pattern> |
Masks a column value using the specified masking pattern. |
TEXT-DISTANCE |
text-distance <method> <column1> <column2> <destination> |
Calculates a text distance measure between two columns containing string. |
PARSE-XML-TO-JSON |
parse-xml-to-json <column> [<depth>] |
Parses a XML document to JSON representation. |
PARSE-AS-HL7 |
parse-as-hl7 <column> [<depth>] |
Parses <column> for Health Level 7 Version 2 (HL7 V2) messages; <depth> indicates at which point JSON object enumeration terminates. |
FIND-AND-REPLACE |
find-and-replace <column> <sed-expression> |
Finds and replaces text in column values using a sed-format expression. |
RENAME |
rename <old> <new> |
Renames an existing column. |
PARSE-AS-AVRO |
parse-as-avro <column> <schema-id> <json |
binary> [version] |
FILL-NULL-OR-EMPTY |
fill-null-or-empty <column> <fixed-value> |
Fills a value of a column with a fixed value if it is either null or empty. |
SET-TYPE |
set-type <column> <type> |
Converting data type of a column. |
RTRIM |
rtrim <column> |
Trimming whitespace from right side of a string. |
INVOKE-HTTP |
invoke-http <url> <column>[,<column>] <header>[,<header>] |
[EXPERIMENTAL] Invokes an HTTP endpoint, passing columns as a JSON map (potentially slow). |
COLUMNS-REPLACE |
columns-replace <sed-expression> |
Modifies column names in bulk using a sed-format expression. |
SEND-TO-ERROR |
send-to-error <condition> |
Send records that match condition to the error collector. |
SET-RECORD-DELIM |
set-record-delim <column> <delimiter> [<limit>] |
Sets the record delimiter. |
SET-VARIABLE |
set-variable <variable> <expression> |
Sets the value for a transient variable for the record being processed. |
SET-CHARSET |
set-charset <column> <charset> |
Sets the character set decoding to UTF-8. |
WRITE-AS-JSON-OBJECT |
write-as-json-object <dest-column> [<src-column>[,<src-column>] |
Creates a JSON object based on source columns specified. JSON object is written into dest-column. |
KEEP |
keep <column>[,<column>*] |
Keeps the specified columns and drops all others. |
CUT-CHARACTER |
cut-character <source> <destination> <type> <range |
indexes> |
SPLIT-TO-ROWS |
split-to-rows <column> <separator> |
Splits a column into multiple rows, copies the rest of the columns. |
XPATH-ARRAY |
xpath-array <column> <destination> <xpath> |
Extract XML element or attributes as JSON array using XPath. |
FAIL |
fail <condition> |
Fails when the condition is evaluated to true. |
INCREMENT-VARIABLE |
increment-variable <variable> <value> <expression> |
Wrangler - A interactive tool for data cleansing and transformation. |
PARSE-AS-XML |
parse-as-xml <column> |
Parses a column as XML. |
PARSE-AS-FIXED-LENGTH |
parse-as-fixed-length <column> <width>[,<width>*] [<padding-character>] |
Parses fixed-length records using the specified widths and padding-character. |
CHANGE-COLUMN-CASE |
change-column-case lower |
upper |
SPLIT-EMAIL |
split-email <column> |
Split a email into account and domain. |
URL-ENCODE |
url-encode <column> |
URL encode a column value. |
WRITE-AS-JSON-MAP |
write-as-json-map <column> |
Writes all record columns as JSON map. |
MASK-SHUFFLE |
mask-shuffle <column> |
Masks a column value by shuffling characters while maintaining the same length. |
DROP |
drop <column>[,<column>*] |
Drop one or more columns. |
DECODE |
decode <base32 |
base64 |
SPLIT |
split <source> <delimiter> <new-column-1> <new-column-2> |
[DEPRECATED] Use 'split-to-columns' or 'split-to-rows'. |
PARSE-AS-SIMPLE-DATE |
parse-as-simple-date <column> <format> |
Parses a column as date using format. |
DIFF-DATE |
diff-date <column1> <column2> <destination> |
Calculates the difference in milliseconds between two Date objects.Positive if <column2> earlier. Must use 'parse-as-date' or 'parse-as-simple-date' first. |
INDEXSPLIT |
indexsplit <source> <start> <end> <destination> |
[DEPRECATED] Use the 'split-to-columns' or 'parse-as-fixed-length' directives instead. |
PARSE-AS-AVRO-FILE |
parse-as-avro-file <column> |
parse-as-avro-file <column>. |
FILTER-ROW-IF-TRUE |
filter-row-if-true <condition> |
[DEPRECATED] Filters rows if condition is evaluated to true. Use 'filter-rows-on' instead. |
SPLIT-URL |
split-url <column> |
Split a url into it's components host,protocol,port,etc. |
FORMAT-DATE |
format-date <column> <format> |
Formats a column using a date-time format. Use 'parse-as-date` beforehand. |
QUANTIZE |
quantize <source> <destination> <[range1:range2)=value>,[<range1:range2=value>]* |
Quanitize the range of numbers into label values. |
PARSE-AS-EXCEL |
parse-as-excel <column> [<sheet number |
sheet name>] |
PARSE-AS-DATE |
parse-as-date <column> [<timezone>] |
Parses column values as dates using natural language processing and automatically identifying the format (expensive in terms of time consumed). |
TABLE-LOOKUP |
table-lookup <column> <table> |
Uses the given column as a key to perform a lookup into the specified table. |
FILTER-ROWS-ON |
filter-rows-on empty-or-null-columns <column>[,<column>*] |
Filters row that have empty or null columns. |
TRIM |
trim <column> |
Trimming whitespace from both sides of a string. |
URL-DECODE |
url-decode <column> |
URL decode a column value. |
FLATTEN |
flatten <column>[,<column>*] |
Separates array elements of one or more columns into indvidual records, copying the other columns. |
UPPERCASE |
uppercase <column> |
Changes the column values to uppercase. |
CATALOG-LOOKUP |
catalog-lookup <catalog> <column> |
Looks-up values from pre-loaded (static) catalogs. |
PARSE-AS-LOG |
parse-as-log <column> <format> |
Parses Apache HTTPD and NGINX logs. |
LTRIM |
ltrim <column> |
Trimming whitespace from left side of a string. |
EXTRACT-REGEX-GROUPS |
extract-regex-groups <column> <regex-with-groups> |
Extracts data from a regex group into its own column. |
PARSE-AS-CSV |
parse-as-csv <column> <delimiter> [<header=true |
false>] |
FILTER-ROW-IF-MATCHED |
filter-row-if-matched <column> <regex> |
[DEPRECATED] Filters rows if the regex is matched. Use 'filter-rows-on' instead. |
PARSE-AS-JSON |
parse-as-json <column> [<depth>] |
Parses a column as JSON. |
SET COLUMN |
set column <column> <jexl-expression> |
Sets a column by evaluating a JEXL expression. |
STEMMING |
stemming <column> |
Apply Porter Stemming on the column value. |
COPY |
copy <source> <destination> [<force=true |
false>] |
SET-COLUMN |
set-column <column> <expression> |
Sets a column the result of expression execution. |
SPLIT-TO-COLUMNS |
split-to-columns <column> <regex> |
Splits a column into one or more columns around matches of the specified regular expression. |
CLEANSE-COLUMN-NAME |
cleanse-column-names |
Sanatizes column names: trims, lowercases, and replaces all but [A-Z][a-z][0-9].with an underscore ''. |
SET COLUMNS |
set columns <columm>[,<column>*] |
Sets the name of columns, in the order they are specified. |
TITLECASE |
titlecase <column> |
Changes the column values to title case. |
MERGE |
merge <column1> <column2> <new-column> <separator> |
Merges values from two columns using a separator into a new column. |
TEXT-METRIC |
text-metric <method> <column1> <column2> <destination> |
Calculates the metric for comparing two string values. |
SET FORMAT |
set format csv <delimiter> <skip empty lines> |
[DEPRECATED] Parses the predefined column as CSV. Use 'parse-as-csv' instead. |
FORMAT-UNIX-TIMESTAMP |
format-unix-timestamp <column> <format> |
Formats a UNIX timestamp using the specified format |
FILTER-ROW-IF-NOT-MATCHED |
filter-row-if-not-matched <column> <regex> |
Filters rows if the regex does not match |
FILTER-ROW-IF-FALSE |
filter-row-if-false <condition> |
Filters rows if the condition evaluates to false |