-
Notifications
You must be signed in to change notification settings - Fork 2
Parsing common data formats via LPeg
License
spc476/LPeg-Parsers
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
The code herein contains LPeg [1] routines for parsing some common data formats. The current formats are: abnf The core ruleset from RFC-5234. These rules are used often in RFCs. ascii ascii.char ascii.control ascii.ctrl Match a single ASCII character. The top level module will match both graphical charaters and control characters. The "ascii.char" module only matches the graphical characters; the "ascii.control" module only matches the control codes. The "ascii.ctrl" will return the name of a control character, or nil if not a control character. iso iso.char iso.control iso.ctrl Match a single ISO character. The top level module will match both graphical, control characters and control sequences. The "iso.char" only matches the ISO graphical characters; the iso.control" module mathces only control characters (for example, <ESC>E or \133). The "iso.ctrl" will return the name of the control character, plus any associated data as appropriate. For example: <ESC>[32;40m will return a name of "SGR" and a two element array containing 32 and 40. NOTE: These modules only deal with the ISO defined characters, and will NOT match those defined by ASCII. To match a graphical character that matches both ASCII and ISO: char = require "org.conman.parsers.ascii.char + require "org.conman.parsers.iso.char email Parses email headers as defined in: RFC-0822 Internet Message Format RFC-1036 Standard for Interchange of USENET Messages RFC-2045 Multipurpose Internet Mail Extensions I RFC-2046 Multipurpose Internet Mail Extensions II RFC-2047 Multipurpose Internet Mail Extensions III RFC-2048 Multipurpose Internet Mail Extensions IV RFC-2369 The Use of URLs as Meta-Syntax for Core Mail List Commands and their Transport through Message Header Fields RFC-2822 Internet Message Format RFC-2919 A Structured Field and Namespace for the Identification of Mailing Lists RFC-5064 The Archived-At Message Header Field RFC-5322 Internet Message Format Headers are returned in a Lua table. For example, the following headers: Return-Path: <[email protected]> Received: from brevard.conman.org (brevard.conman.org [66.252.224.242]) by mail.example.com (Postfix) with ESMTP id 538562EA5D07 for <[email protected]>; Fri, 28 Dec 2012 21:40:11 -0500 From: Sean Conner <[email protected]> To: Sherlock Holmes <[email protected]>, the-scooby-gang: Fred <[email protected]>, Daphne <[email protected]>, Velma <[email protected]>, Shaggy <[email protected]>, Scobby-Doo <[email protected]>;, The Batman <[email protected]> Subject: I know who did it! Date: Fri, 28 Dec 2012 21:40:11 -0500 Message-ID: <[email protected]> Will return the following Lua table: { received = { [1] = { with = "ESMTP", from = "brevard.conman.org", id = "538562EA5D07", when = { min = 0.000000, zone = -18000.000000, day = 28.000000, month = 12.000000, year = 2012.000000, sec = 1.000000, hour = 1.000000, weekday = "Fri", }, for = { address = "[email protected]", }, by = "mail.example.com", }, }, to = { [1] = { name = "Sherlock Holmes", address = "[email protected]", }, [2] = { ['the-scooby-gang'] = { [1] = { name = "Fred", address = "[email protected]", }, [2] = { name = "Daphne", address = "[email protected]", }, [3] = { name = "Velma", address = "[email protected]", }, [4] = { name = "Shaggy", address = "[email protected]", }, [5] = { name = "Scobby-Doo", address = "[email protected]", }, }, }, [3] = { name = "The Batman", address = "[email protected]", }, }, from = { [1] = { name = "Sean Conner", address = "[email protected]", }, }, date = { min = 0.000000, zone = -18000.000000, day = 28.000000, month = 12.000000, year = 2012.000000, sec = 1.000000, hour = 1.000000, weekday = "Fri", }, return_path = { [1] = { address = "[email protected]", }, }, message_id = "[email protected]", subject = "I know who did it!", } The only fields not supported are the Resent-* fields; they are rarely used and the semantics are particularly hard to support via parsing only. These fields, as well as any other fields not otherwise understood or parsable will end up on a field called 'generic' with the key being the raw header name. json Implements a JSON parser. It requires some additional modules [2] to run. This will parse a JSON file into a Lua table. The full grammar is supported, but it expects the input to be valid UTF-8. A JSON null value will be converted to nil. If you won't want this behavior, define a global variable called "null" to be the value you want for a JSON null. jsons Another implementation of a JSON parser. This one "streams" the input, meaning it will handle large JSON files the other one won't, and is a drop in replacement. You can also pass in a function that will return more data so you can actually "stream" data into the parser. ip Provides two LPeg patterns, IPv4 and IPv6 which parse and convert said addresses directly into their network-order binary formats. ip-text Provides two LPeg patterns, IPv4 and IPv6 which parse and return said addresses as text, unlike the ip module above. ini Provides a INI file parser that returns a Lua table from a INI file. A sample INI file such as: ; we allow "default" values default = ok [section1] var1 = foo var2 = 12,23,34,54,44 VAR3 = "var3=foo",33,44,55 var2 = apple Var4 = 55 [section2] # another comment ; and so is this one var1 = foo bar baz ; this is a comment var2 = "foo bar baz ; this is not a comment" [section1] var4=this is a test var5= this is also a test var2 = pear var3 = 88,99 will result in a Lua table of: { section1 = { var1 = "foo", var5 = "this is also a test", var4 = { [1] = "55", [2] = "this is a test", }, var3 = { [1] = "var3=foo", [2] = "33", [3] = "44", [4] = "55", [5] = "88", [6] = "99", }, var2 = { [1] = "12", [2] = "23", [3] = "34", [4] = "54", [5] = "44", [6] = "apple", [7] = "pear", }, }, default = "ok", section2 = { var1 = "foo bar baz ", var2 = "foo bar baz ; this is not a comment", }, } strftime Parses the format string for strftime() (or os.date() for Lua) and returns an LPeg expression that can parse that format, with the exceptions of "%c", "%x" and "%X" (all system specific formats). For example, the format, "%A, %d %B %Y @ %H:%M:%S" will return the LPeg expression to parse this: Monday, 02 July 2018 @ 16:02:48 into { min = 2.000000, wday = 2.000000, day = 2.000000, month = 7.000000, sec = 48.000000, hour = 16.000000, year = 2018.000000, } This will even work with other locales, such as "se_NO.UTF-8", which will allow LPeg to parse: vuossárga, 02 suoidnemánu 2018 @ 16:02:48 into { min = 2,000000, wday = 2,000000, day = 2,000000, month = 7,000000, sec = 48,000000, hour = 16,000000, year = 2018,000000, } url Parses URLs per RFC-3986. By default, it will handle the following URL types: http: https: file: ftp: Given the following URL: http://www.conman.org/people/spc/index.cgi?one=1%3F&two=2&three=3#target1 It will be broken down into a Lua table as follows: { scheme = "http", host = "www.conman.org", port = 80, path = "/people/spc/index.cgi", query = "one=1%3F&two=2&three=3", fragment = "target1", } Other URLs can be parsed, but a URL like: mailto:[email protected][email protected],[email protected]&subject=Current%20Mystery will be broken down as: { scheme = "mailto", path = "[email protected]", query = "[email protected],[email protected]&subject=Current%20Mystery", } which may require more parsing than provided here. url.gopher Parses "gopher:" URLs per RFC-4266. Given this URL: gopher://gopher.conman.org/0foobar%09search%20String%09plus it will be broken down as: { scheme = "gopher", host = "gopher.conman.org", port = 70.000000, type = "file", selector = "foobar", search = "search String", plus = "plus", } If you need to parse other URLs in addition to "gopher:" types, you can do: gopher = require "org.conman.parsers.url.gopher" url = require "org.conman.parsers.url" url = gopher + url info = url:match(my_url) url.siptel Parses "sip:" and "sips:" URIs per RFC-3261. Parses "tel:" URIs per RFC-3966. Examples: sip = require "org.conman.parsers.url.sip" u = sip:match [[sip:[email protected];play=file://fs.example.net//clips/my-intro.dvi;content-type=video/mpeg%3bencode%d3314M-25/625-50]] results in: { host = "example.com", port = 5060.000000, user = "annc", scheme = "sip", parameters = { play = "file://fs.example.net//clips/my-intro.dvi", ["content-type"] = "video/mpeg%3bencode%d3314M-25/625-50", }, } and u = sip:match [[sip:+1-(555)-555-1212;[email protected];user=phone]] results in: { host = "example.net", port = 5060.000000, user = { number = "15555551212", global = true, parameters = { ext = "1234", }, }, scheme = "sip", parameters = { user = "phone", }, } and tel = require "org.conman.parsers.url.tel" u = tel:match "tel:+1-(555)-555-1212;ext=1234" results in: { scheme = "tel", number = "15555551212", global = true, parameters = { ext = "1234", }, } If you need to parse other URLs in addition to these types, you can do: siptel = require "org.conman.parsers.url.sip" url = require "org.conman.parsers.url" url = siptel + url info = url:match(my_url) soundex.lua Implements the Soundex algorithm. [1] http://www.inf.puc-rio.br/~roberto/lpeg/ [2] https://github.com/spc476/lua-conmanorg
About
Parsing common data formats via LPeg
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published