Skip to content

Latest commit

 

History

History
83 lines (82 loc) · 10.1 KB

cheatsheet.md

File metadata and controls

83 lines (82 loc) · 10.1 KB

Cheatsheet

Name Usage Description
SWAP swap <column1> <column2> Swaps the column names of two columns.
ENCODE encode <base32 base64
XPATH xpath <column> <destination> <xpath> Extract a single XML element or attribute using XPath.
GENERATE-UUID generate-uuid <column> Populates a column with a universally unique identifier (UUID) of the record.
LOWERCASE lowercase <column> Changes the column values to lowercase.
WRITE-AS-CSV write-as-csv <column> Writes the records files as well-formatted CSV
PARSE-AS-PROTOBUF parse-as-protobuf <column> <schema-id> <record-name> [version] Parses column as protobuf encoded memory representations.
HASH hash <column> <algorithm> [<encode=true false>]
JSON-PATH json-path <source> <destination> <json-path-expression> Parses JSON elements using a DSL (a JSON path expression).
MASK-NUMBER mask-number <column> <pattern> Masks a column value using the specified masking pattern.
TEXT-DISTANCE text-distance <method> <column1> <column2> <destination> Calculates a text distance measure between two columns containing string.
PARSE-XML-TO-JSON parse-xml-to-json <column> [<depth>] Parses a XML document to JSON representation.
PARSE-AS-HL7 parse-as-hl7 <column> [<depth>] Parses <column> for Health Level 7 Version 2 (HL7 V2) messages; <depth> indicates at which point JSON object enumeration terminates.
FIND-AND-REPLACE find-and-replace <column> <sed-expression> Finds and replaces text in column values using a sed-format expression.
RENAME rename <old> <new> Renames an existing column.
PARSE-AS-AVRO parse-as-avro <column> <schema-id> <json binary> [version]
FILL-NULL-OR-EMPTY fill-null-or-empty <column> <fixed-value> Fills a value of a column with a fixed value if it is either null or empty.
SET-TYPE set-type <column> <type> Converting data type of a column.
RTRIM rtrim <column> Trimming whitespace from right side of a string.
INVOKE-HTTP invoke-http <url> <column>[,<column>] <header>[,<header>] [EXPERIMENTAL] Invokes an HTTP endpoint, passing columns as a JSON map (potentially slow).
COLUMNS-REPLACE columns-replace <sed-expression> Modifies column names in bulk using a sed-format expression.
SEND-TO-ERROR send-to-error <condition> Send records that match condition to the error collector.
SET-RECORD-DELIM set-record-delim <column> <delimiter> [<limit>] Sets the record delimiter.
SET-VARIABLE set-variable <variable> <expression> Sets the value for a transient variable for the record being processed.
SET-CHARSET set-charset <column> <charset> Sets the character set decoding to UTF-8.
WRITE-AS-JSON-OBJECT write-as-json-object <dest-column> [<src-column>[,<src-column>] Creates a JSON object based on source columns specified. JSON object is written into dest-column.
KEEP keep <column>[,<column>*] Keeps the specified columns and drops all others.
CUT-CHARACTER cut-character <source> <destination> <type> <range indexes>
SPLIT-TO-ROWS split-to-rows <column> <separator> Splits a column into multiple rows, copies the rest of the columns.
XPATH-ARRAY xpath-array <column> <destination> <xpath> Extract XML element or attributes as JSON array using XPath.
FAIL fail <condition> Fails when the condition is evaluated to true.
INCREMENT-VARIABLE increment-variable <variable> <value> <expression> Wrangler - A interactive tool for data cleansing and transformation.
PARSE-AS-XML parse-as-xml <column> Parses a column as XML.
PARSE-AS-FIXED-LENGTH parse-as-fixed-length <column> <width>[,<width>*] [<padding-character>] Parses fixed-length records using the specified widths and padding-character.
CHANGE-COLUMN-CASE change-column-case lower upper
SPLIT-EMAIL split-email <column> Split a email into account and domain.
URL-ENCODE url-encode <column> URL encode a column value.
WRITE-AS-JSON-MAP write-as-json-map <column> Writes all record columns as JSON map.
MASK-SHUFFLE mask-shuffle <column> Masks a column value by shuffling characters while maintaining the same length.
DROP drop <column>[,<column>*] Drop one or more columns.
DECODE decode <base32 base64
SPLIT split <source> <delimiter> <new-column-1> <new-column-2> [DEPRECATED] Use 'split-to-columns' or 'split-to-rows'.
PARSE-AS-SIMPLE-DATE parse-as-simple-date <column> <format> Parses a column as date using format.
DIFF-DATE diff-date <column1> <column2> <destination> Calculates the difference in milliseconds between two Date objects.Positive if <column2> earlier. Must use 'parse-as-date' or 'parse-as-simple-date' first.
INDEXSPLIT indexsplit <source> <start> <end> <destination> [DEPRECATED] Use the 'split-to-columns' or 'parse-as-fixed-length' directives instead.
PARSE-AS-AVRO-FILE parse-as-avro-file <column> parse-as-avro-file <column>.
FILTER-ROW-IF-TRUE filter-row-if-true <condition> [DEPRECATED] Filters rows if condition is evaluated to true. Use 'filter-rows-on' instead.
SPLIT-URL split-url <column> Split a url into it's components host,protocol,port,etc.
FORMAT-DATE format-date <column> <format> Formats a column using a date-time format. Use 'parse-as-date` beforehand.
QUANTIZE quantize <source> <destination> <[range1:range2)=value>,[<range1:range2=value>]* Quanitize the range of numbers into label values.
PARSE-AS-EXCEL parse-as-excel <column> [<sheet number sheet name>]
PARSE-AS-DATE parse-as-date <column> [<timezone>] Parses column values as dates using natural language processing and automatically identifying the format (expensive in terms of time consumed).
TABLE-LOOKUP table-lookup <column> <table> Uses the given column as a key to perform a lookup into the specified table.
FILTER-ROWS-ON filter-rows-on empty-or-null-columns <column>[,<column>*] Filters row that have empty or null columns.
TRIM trim <column> Trimming whitespace from both sides of a string.
URL-DECODE url-decode <column> URL decode a column value.
FLATTEN flatten <column>[,<column>*] Separates array elements of one or more columns into indvidual records, copying the other columns.
UPPERCASE uppercase <column> Changes the column values to uppercase.
CATALOG-LOOKUP catalog-lookup <catalog> <column> Looks-up values from pre-loaded (static) catalogs.
PARSE-AS-LOG parse-as-log <column> <format> Parses Apache HTTPD and NGINX logs.
LTRIM ltrim <column> Trimming whitespace from left side of a string.
EXTRACT-REGEX-GROUPS extract-regex-groups <column> <regex-with-groups> Extracts data from a regex group into its own column.
PARSE-AS-CSV parse-as-csv <column> <delimiter> [<header=true false>]
FILTER-ROW-IF-MATCHED filter-row-if-matched <column> <regex> [DEPRECATED] Filters rows if the regex is matched. Use 'filter-rows-on' instead.
PARSE-AS-JSON parse-as-json <column> [<depth>] Parses a column as JSON.
SET COLUMN set column <column> <jexl-expression> Sets a column by evaluating a JEXL expression.
STEMMING stemming <column> Apply Porter Stemming on the column value.
COPY copy <source> <destination> [<force=true false>]
SET-COLUMN set-column <column> <expression> Sets a column the result of expression execution.
SPLIT-TO-COLUMNS split-to-columns <column> <regex> Splits a column into one or more columns around matches of the specified regular expression.
CLEANSE-COLUMN-NAME cleanse-column-names Sanatizes column names: trims, lowercases, and replaces all but [A-Z][a-z][0-9].with an underscore ''.
SET COLUMNS set columns <columm>[,<column>*] Sets the name of columns, in the order they are specified.
TITLECASE titlecase <column> Changes the column values to title case.
MERGE merge <column1> <column2> <new-column> <separator> Merges values from two columns using a separator into a new column.
TEXT-METRIC text-metric <method> <column1> <column2> <destination> Calculates the metric for comparing two string values.
SET FORMAT set format csv <delimiter> <skip empty lines> [DEPRECATED] Parses the predefined column as CSV. Use 'parse-as-csv' instead.
FORMAT-UNIX-TIMESTAMP format-unix-timestamp <column> <format> Formats a UNIX timestamp using the specified format
FILTER-ROW-IF-NOT-MATCHED filter-row-if-not-matched <column> <regex> Filters rows if the regex does not match
FILTER-ROW-IF-FALSE filter-row-if-false <condition> Filters rows if the condition evaluates to false