Aggregators For PaSh

The aggregators are currently a work in progress. The new ones are in cpp/bin. They are built automatically by setup_pash.sh, and the unit tests in cpp/tests are run by run_tests.sh. The interface is as follows:

aggregator inputFile1 inputFile2 args

Here, args are the arguments that were passed to the command that produced the input files. The aggregator writes its result to stdout.
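To make the calling convention concrete, here is a minimal sketch of a concatenating aggregator written in Python. It is only an illustration of the `aggregator inputFile1 inputFile2 args` shape: the real aggregators are the C++ ones in cpp/bin, and the function names `aggregate` and `main` are hypothetical.

```python
import sys

def aggregate(path1, path2, args):
    """Concatenate two partial outputs.

    `args` holds the arguments of the original command; this simple
    concatenating sketch accepts them but does not need them.
    """
    with open(path1) as first, open(path2) as second:
        return first.read() + second.read()

def main(argv):
    # argv mirrors the interface: aggregator inputFile1 inputFile2 args
    if len(argv) < 3:
        sys.stderr.write("usage: aggregator inputFile1 inputFile2 [args...]\n")
        return 1
    # The result goes to stdout, as the interface requires.
    sys.stdout.write(aggregate(argv[1], argv[2], argv[3:]))
    return 0
```

A script wrapping this sketch would call `main(sys.argv)` and exit with its return value.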

Adding new aggregators

Let's assume that the aggregator being implemented is for a command called cmd.

1. Create a folder named cmd inside cpp/aggregators.
2. For each OS supported by PaSh:

2.1. Create a file named OS-agg.h inside that folder.

2.2. Implement the aggregator inside that file, following the instructions provided in cpp/common/main.h or using a different aggregator as an example. Remember to add an include guard.

2.3. You may create additional files in the aggregator directory. This can be used to share code between aggregator implementations for different OSes. When writing #include directives, assume that the aggregator directory is in the include path.
3. Add unit tests for the new aggregator in cpp/tests/test-OS.sh for each OS. Consult the instructions in that file. Remember to test all options and flags of the aggregator.

Note: after completing these steps the aggregator will automatically be built by the Makefile with no changes to it required.

Automated Synthesis of Scalable Aggregators

Overview

Command-specific aggregators for POSIX commands
/agg-synthesis:
- Aggregators:
  - grep, wc, sort, uniq
  - /tail_head: tail, head (under development)
  - /grep-n: under development -- not used by any current benchmark scripts
- Util Functions: read, write, settings (locale and padding length)
- /Benchmarks: test correctness and identify implemented/not implemented aggregators
  - covid-mts
  - nlp
  - oneliners
  - unix50
/agg-mult-input:
- draft of aggregators for combining results when a single cmd takes multiple inputs
Development Journal

Single File Argument Aggregators

Overview

Aggregates parallel results when a command is applied to a single input file (e.g. wc hi.txt)
How to run: ./s_wc.py -c [parallel output result 1] [parallel output result 2] ...
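As a rough illustration of what combining `wc` results involves, the sketch below sums the count columns of several partial outputs and right-justifies the totals. It is not the code of `s_wc.py`; the name `combine_wc` and the `pad` parameter are assumptions, and the right padding width depends on the platform's `wc`.

```python
def combine_wc(partial_outputs, pad=0):
    """Sum the count columns of partial `wc` outputs read from stdin.

    Each partial output is one line of whitespace-separated counts,
    e.g. "3 10 50" for lines, words, and bytes. `pad` is the assumed
    width each total is right-justified to (0 means no padding).
    """
    rows = [[int(tok) for tok in out.split()]
            for out in partial_outputs if out.strip()]
    # Add the values column by column across all partial results.
    totals = [sum(column) for column in zip(*rows)]
    return " ".join(str(total).rjust(pad) for total in totals)

# Example: two halves of a file processed with `wc`.
print(combine_wc(["3 10 50\n", "2 5 20\n"]))
```

Adding the columns is only valid when the input was split on line boundaries, so that no line or word is cut in half.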

| Script | Additional info needed | Description | Notes |
| --- | --- | --- | --- |
| `./s_wc.py` | No | Combines count results by adding the corresponding values and pads the totals to match the `wc` output format. Supports flags `-l`, `-c`, `-w`, `-m` | |
| `./s_grep.py` | No | Combines `grep` results. Supports flag `-c` and flags that don't change the concatenating nature of the output (`-i`, `-e`...) | |
| `./s_uniq.py` | No | Combines `uniq` results, merging identical lines at the end of one file and the beginning of the next. Supports flag `-c` | |
| `./s_sort.py` | No | Combines `sort` results. Supports flags `-n`, `-k`, `-r`, `-u`, `-f` | |
| `./s_head.py` | No | Combines `head` results by always returning the earlier split document when given multiple split documents | Under development |
| `./s_tail.py` | No | Combines `tail` results by always returning the later split document when given multiple split documents | Under development |
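The boundary merge described for `uniq` can be sketched as follows for the `-c` case. This is an assumption-laden illustration rather than the code of `s_uniq.py`: the name `combine_uniq_c` is hypothetical, and the count width of 7 is an assumed default that may differ between platforms.

```python
def combine_uniq_c(first, second, width=7):
    """Merge two `uniq -c` outputs from adjacent splits of one input.

    If the last line of `first` and the first line of `second` carry the
    same text, they are the same run cut by the split, so their counts
    are added. `width` is the assumed count padding.
    """
    def parse(text):
        rows = []
        for line in text.splitlines():
            # "      2 some text" -> count 2, text "some text"
            count, _, rest = line.lstrip().partition(" ")
            rows.append([int(count), rest])
        return rows

    head, tail = parse(first), parse(second)
    if head and tail and head[-1][1] == tail[0][1]:
        head[-1][0] += tail[0][0]
        tail = tail[1:]
    return "".join(f"{count:>{width}} {text}\n" for count, text in head + tail)
```

Only the two boundary lines need to be compared, because `uniq` already collapsed every run inside each split.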

Benchmarks

Structure of each benchmark suite:

- ./inputs.sh: retrieve all required inputs
- ./run.sh: run benchmark scripts with bash and agg
- ./verify.sh --generate: generate hashes for all outputs to verify correctness
- ./cleanup.sh: remove all output and intermediate files generated by the current run
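The hash-based correctness check can be pictured with the short sketch below. It does not reproduce `verify.sh`: the use of SHA-256, the directory layout, and the names `generate_hashes` and `mismatches` are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path

def generate_hashes(output_dir, pattern="*"):
    """Map each output file name to the SHA-256 hex digest of its contents."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(output_dir).glob(pattern))
        if path.is_file()
    }

def mismatches(output_dir, expected):
    """Return the names whose current hash differs from the recorded one."""
    actual = generate_hashes(output_dir)
    return sorted(name for name, digest in expected.items()
                  if actual.get(name) != digest)
```

Recording the hashes of a trusted sequential run and then calling `mismatches` on the aggregated run gives a list of outputs that differ, with an empty list meaning the two runs agree.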

| Directory | Description | Notes |
| --- | --- | --- |
| unix50 | Collection of oneline scripts to run on input `txt` files | use `--reg` flag for current available inputs retrieved by input script |
| oneliners | Collection of oneline scripts to run on input `txt` files | some scripts involving `mkfifo` cannot be tested currently due to current parsing simplicity |
| covid-mts | Script to process data on covid mass-transports | |
| nlp | Collection of oneline scripts to run on input `txt` files | |

Automation of using aggregators in benchmarks:

./agg_run.sh [script] [input]: applies the available aggregators to the individual commands of a script, which are parsed out with | as the delimiter
- parses the script into CMDLIST and runs the following for each cmd:
  - if the current cmd has an implemented agg, split the input file into SIZE=2 parts, use ./test-par-driver.sh to run cmd on each part, and apply the agg
```
Parallel:
cat file-0 | $CMD > file-0-par
cat file-1 | $CMD > file-1-par
agg file-0-par file-1-par > file-par.txt
```
  - if the current cmd doesn't have an implemented agg, run it sequentially with ./test-seq-driver.sh
```
Sequential:
cat file | $CMD > file-seq.txt
```
  - the output becomes the input of the next iteration (the next command in CMDLIST)
- records the script and input that were run, and whether each cmd has an agg, to log.txt
./find-missing.sh [log.txt]: outputs the cmds that don't have an agg implemented; log.txt is produced by each run of an entire benchmark suite
run-all.sh: Run all current benchmark suites through one script (check script for flags)
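The control flow that `agg_run.sh` follows can be sketched in Python as below. This is a simplified model under stated assumptions, not a port of the script: `split_pipeline`, `split_lines`, `run_shell`, and `run_with_aggs` are hypothetical names, the aggregators are passed in as plain callables, and the naive split on `|` ignores quoting just as the current parsing does.

```python
import subprocess

def split_pipeline(script):
    """Split a one-line script into commands, using `|` as the only delimiter."""
    return [cmd.strip() for cmd in script.split("|") if cmd.strip()]

def split_lines(text, parts=2):
    """Cut text into `parts` contiguous chunks on line boundaries."""
    lines = text.splitlines(keepends=True)
    size = -(-len(lines) // parts)  # ceiling division
    return ["".join(lines[i * size:(i + 1) * size]) for i in range(parts)]

def run_shell(cmd, data):
    """Run one command through the shell, feeding `data` on stdin."""
    done = subprocess.run(cmd, shell=True, input=data,
                          capture_output=True, text=True, check=True)
    return done.stdout

def run_with_aggs(script, data, run_cmd, aggregators, log):
    """Run each command in parallel-with-agg or sequential mode."""
    for cmd in split_pipeline(script):
        agg = aggregators.get(cmd.split()[0])
        if agg is None:
            data = run_cmd(cmd, data)            # sequential path
        else:
            partials = [run_cmd(cmd, chunk) for chunk in split_lines(data)]
            data = agg(partials, cmd)            # parallel path, then aggregate
        log.append((cmd, agg is not None))       # which cmds had an agg
    return data
```

Passing `run_shell` as `run_cmd` runs real commands, while a stub callable makes the driver easy to test without touching the shell. The `log` list plays the role of log.txt, so filtering it for `False` entries gives the commands that still lack an aggregator.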

Platforms

Linux Distributions: Ubuntu, Debian
BSD Utils: MacOS

[DRAFT] Multiple File Argument Aggregators

A command run on multiple input files often produces different output than when it is run on a single file, because the file name is often appended to each result line.
A multiple-input invocation looks like wc hi.txt bye.txt and produces output that looks like:

     559    4281   25733 inputs/hi.txt
     354    2387   14041 inputs/bye.txt
     913    6668   39774 total

The scripts take their inputs directly from the command-line arguments; for example, enter python m_wc.py [parallel output file 1] [parallel output file 2] in your terminal.
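A minimal sketch of combining multi-file `wc` results follows. It assumes that every parallel output lists the same files in the same order followed by a total row, and that the full file names come from a separate list; the name `combine_wc_multi` and the column width of 8 (chosen to match the sample above) are assumptions, not the behavior of `m_wc.py`.

```python
def combine_wc_multi(partial_outputs, file_list, width=8):
    """Add per-file and total rows of several multi-file `wc` outputs.

    Rows are matched by position, so each partial output must list the
    same files in the same order. File names containing spaces are not
    handled by this sketch.
    """
    sums = None
    for out in partial_outputs:
        rows = [line.split() for line in out.splitlines() if line.strip()]
        counts = [[int(tok) for tok in row[:-1]] for row in rows]
        if sums is None:
            sums = counts
        else:
            sums = [[x + y for x, y in zip(a, b)] for a, b in zip(sums, counts)]
    # Relabel the rows with the full file names, then the total row.
    labels = list(file_list) + ["total"]
    return "".join(
        "".join(str(n).rjust(width) for n in row) + " " + label + "\n"
        for row, label in zip(sums or [], labels)
    )
```

The total row needs no special handling: adding the partial totals gives the same numbers as totalling the combined per-file rows.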

| File To Run | Additional info needed | Description | Notes |
| --- | --- | --- | --- |
| `m_wc.py` | N/A | Combines count results, appends the source file name to the end of each line, and includes the total count. Supports flags `-l`, `-c`, `-w`, `-m` | Discrepancy when combining byte sizes (might be due to manually splitting the file to create parallel input in testing) |
| `m_grep.py` | after parallel output args: `full [path to original file 1] [path to original file 2] <more if needed>` | Combines `grep` results and sorts the output by source file | |
| `m_grep_c.py` | N/A | Combines `grep -c` results, prefixes each count with the source file name, and includes the total count | |
| `m_grep_n.py` | Yes | Combines `grep -n` results and corrects the line numbers for each file. Requires information about the entire file before splitting for the line number correction | Still needs to be refactored |

Note: all multiple-argument combiners require a [file_list] argument that lists all the full files used in the call
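The line number correction needed for `grep -n` can be illustrated with the sketch below. It handles the simpler `number:text` form for one file and leaves out the file-name prefix that multi-file output carries; `combine_grep_n` and `chunk_line_counts` are hypothetical names, and the line counts of the splits are assumed to be known in advance.

```python
def combine_grep_n(partial_outputs, chunk_line_counts):
    """Shift `grep -n` line numbers so they refer to the unsplit file.

    `chunk_line_counts[i]` is the number of lines in the i-th split of
    the original file; every match in a later split is offset by the
    total number of lines in the splits before it.
    """
    merged = []
    offset = 0
    for out, n_lines in zip(partial_outputs, chunk_line_counts):
        for line in out.splitlines():
            # Only the first colon separates the number from the text.
            number, sep, text = line.partition(":")
            merged.append(f"{int(number) + offset}{sep}{text}\n")
        offset += n_lines
    return "".join(merged)

# Example: the first split had 5 lines, so matches in the second shift by 5.
print(combine_grep_n(["2:foo\n", "1:foo bar\n3:x:foo\n"], [5, 4]), end="")
```

This is why the combiner needs information about the whole file: the partial outputs alone do not say how many lines each split contained.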

Testing

The testing scripts take the files in /inputs, produce sequential and parallel results from them, and write all relevant files to /outputs.
Run ./test-mult.sh in the test-old directory:
1. manually split each of the (multiple) files into 2 and put the parts in /input
2. apply the command to the entire file for the sequential output (expected)
3. apply the command to file-1 > output/output-1 and to file-2 > output/output-2
4. apply the aggregators to combine output/output-1 and output/output-2 into the parallel output (requires the paths of the full files for functions such as line correction in grep -n)
5. check by eye that the parallel output equals the sequential output. NOTE: use m_combine from the [cmd].py file as the aggregator
