The munge function

Author : Ashley G Truter
Updated: 22-Jun-2016
Purpose: This document describes the munge function.

The munge script patches the remove-each function in R3 to return the modified series.


USAGE:
    MUNGE data spec /sheet name
                    /update action /delete /part columns /where condition /compact
                    /only
                    /group having
                    /order
                    /save file /list

DESCRIPTION:
    Load and/or manipulate a block of tabular (column and row) values.
    MUNGE is a function value.

ARGUMENTS:
    data -- REBOL block, CSV or Excel file (block! file! binary! vector!)
    spec -- Size of each record or block of heading words (none! gets cols? file) (integer! block! none!)

REFINEMENTS:
    /sheet -- Excel worksheet (default is 1)
        name (string! word! integer!)
    /update -- Offset/value pairs (returns original block)
        action (block!)
    /delete -- Delete matching rows (returns original block)
    /part -- Offset position(s) to retrieve
        columns (block! integer! word!)
    /where -- Expression that can reference columns as c1, c2, etc
        condition (block!)
    /compact -- Remove blank rows
    /only -- Remove duplicate rows
    /group -- One of count, max, min or sum
        having -- Word or expression that can reference the initial result set column as count, max, etc (word! block!)
    /order -- Sort result set
    /save -- Write result to a delimited file
        file -- csv or txt (file!)
    /list -- Return new-line records

Reading from Excel uses Powershell and is dependant upon the relevant software being installed. For further options when loading data see the load-excel function.

Arguments

data

Must be a block or delimited file.

>> munge [a 1 b 2] 2
== [a 1 b 2]
>> munge %test.csv 2
== [a 1 b 2]

spec

An integer specifying the number of columns or a block of column name words. Specifying none for a file will use the cols? function to determine the number of columns.

>> munge [a 1 b 2] 2
>> munge [a 1 b 2] [A B]
>> munge %test.xlsx none

If an integer is used then column names will default to c1, c2, etc up to the number of columns specified. Column names can be referenced in the /where condition, /update action and /group having blocks.

Refinements

Refinements are evaluated in the order presented so, for example, a /where/group will exclude rows prior to grouping.

File

/sheet

Expects a sheet number or name. Default is 1.

>> munge/sheet %test.xlsx 2 2
== [a b 1 2]
>> munge/sheet %test.xlsx 2 "Sheet 2"
== [a b 1 2]

Base

/update and /update/where

This refinement alters the contents of the original block and returns once completed.

>> munge/update [a 1 b 2] 2 [2 0]
== [a 0 b 0]
>> munge/update [a 1 b 2] 2 [2 [c2 * 2]]
== [a 2 b 4]
>> munge/update/where [a 1 b 2] 2 [2 0] [c1 = 'b]
== [a 1 b 0]

Note that:

>> munge/update ["1"] 1 [1 integer!]
== [1]

is shorthand for:

>> munge/update ["1"] 1 [1 [to integer! c1]]
== [1]

/delete and /delete/where

This refinement alters the contents of the original block and returns once completed.

>> munge/delete [a 1 b 2] 2
== []
>> munge/delete/where [a 1 b 2] 2 [c1 = 'b]
== [a 1]

/part and/or /where

/part

Specifies the column(s) to extract via one or more integers and/or column name words.

>> munge/part [a 1 b 2] 2 1
== [a b]
>> munge/part [a 1 b 2] 2 'c1
== [a b]
>> munge/part [a 1 b 2] [A B] 'A
== [a b]
>> munge/part [a 1 b 2] 2 [2 1]
== [1 a 2 b]
>> munge/part [a 1 b 2] 2 [c2 c1]
== [1 a 2 b]
>> munge/part [a 1 b 2] [A B] [B A]
== [1 a 2 b]

/where

Expects a block that can include references to column names.

>> munge/where [a 1 b 2] 2 [odd? c2]
== [a 1]
>> munge/where [a 1 b 2] [A B] [B = 2]
== [b 2]
>> munge/where [a 1 b 2] 2 [c2 > 1 c2 < 3]
== [b 2]

This last condition takes advantage of the fact that munge wraps the condition within an all [] block.

/compact

This refinement removes blank rows and alters the contents of the original block.

>> munge/compact ["" "" a 1 b 2] 2
== [a 1 b 2]

Unique

/only

This refinement removes duplicate rows and alters the contents of the original block.

>> munge/only [a 1 b 2 a 1] 2
== [a 1 b 2]

Group

Each of the following operations "groups" the data according to the specified having expression.

'count

>> munge/group [a b a b c d] 1 'count
== [a 2 b 2 c 1 d 1]
>> munge/group [a b a b c d] 2 'count
== [a b 2 c d 1]
>> munge/group [a b a b c d] 2 [count > 1]
== [a b 2]

This refinement sorts the original block if neither /part or /where are specified.

'avg

>> munge/group [1 2 3 4 5 6] 1 'avg
== 3.5
>> munge/group [a 1 a 2 c 3 c 4] 2 'avg
== [a 1.5 c 3.5]
>> munge/group [a 1 a 2 c 3 c 4] 2 [avg > 2]
== [c 3.5]

This refinement sorts the original block if neither /part or /where are specified.

'max

>> munge/group [1 2 3 4 5 6] 1 'max
== [6]
>> munge/group [a 1 a 2 c 3 c 4] 2 'max
== [a 2 c 4]
>> munge/group [a 1 a 2 c 3 c 4] 2 [max = 4]
== [c 4]

This refinement sorts the original block if neither /part or /where are specified.

'min

>> munge/group [1 2 3 4 5 6] 1 'min
== [1]
>> munge/group [a 1 a 2 c 3 c 4] 2 'min
== [a 1 c 3]
>> munge/group [a 1 a 2 c 3 c 4] 2 [min = 1]
== [a 1]

This refinement sorts the original block if neither /part or /where are specified.

'sum

>> munge/group [1 2 3 4 5 6] 1 'sum
== [21]
>> munge/group [a 1 a 2 c 3 c 4] 2 'sum
== [a 3 c 7]
>> munge/group [a 1 a 2 c 3 c 4] 2 [sum > 5]
== [c 7]

This refinement sorts the original block if neither /part or /where are specified.

Sort

/order

>> munge/order [b 1 a 2] 2
== [a 2 b 1]

Emit

/save

>> munge/save [a 1 b 2 c 3] 2 %test.csv ; CSV delimited file
>> munge/save [a 1 b 2 c 3] 2 %test.txt ; tab delimited file

/list

>> munge/list [a 1 b 2 c 3] 2
== [
    a 1
    b 2
    c 3
]