-
Notifications
You must be signed in to change notification settings - Fork 0
/
README
74 lines (52 loc) · 2.71 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
Method Enumerable#reduce provides functionality for
reducing streams of sorted records (google "MapReduce").
Let's imagine a container with records [key, count].
Reducing several records with same key to one record can
be done like this:
[ [key, c1], [key, c2], [key, c3] ] #=> [key, c1 + c2 + c3]
There is the way to do it:
records.sort.reduce{|a,b| (a||=0) + b}
Generally speaking reducing is injecting values with same key.
The reduce operation is equivalent to group-and-inject-values:
records.inject{|groups,r| (groups[r.key]||=[]) << r.value; groups}.
map{|key,values| [key, values.inject{|a,b| (a||=0) + b}}
Reduce algorithm implementation is different from this version.
It's based on the fact that input records are already grouped by key.
Reduce can by combined with map, but before each reduce operation
records should be sorted (or grouped) by key.
If one has records [phrase, frequency] and wants to get
records [word, frequency] he should run
records.
map{|phrase, frequency| phrase.split.map{|word| [word, frequency]}}.
inject([]){|new_records, records| new_records.push(*records)}.
sort.
reduce{|a,b| (a||0) + b}
One can make calculations lazy just by converting container to lazy one:
records.to_lazy. ...
LazyEnumerable allows to dramatically reduce memory usage when
using piped transfromations: map, select, uniq, pipe, and reduce.
The only problem is the method +sort+, which requires all records to be
loaded to memory. But one can use any scalable external sort:
cat records.txt | \
ruby -r reduce -e 'STDIN.to_lazy.map(&:to_record).each{|p,f| p.split.each{|w| puts [w,f].to_line}}' | \
sort | \
ruby -r reduce -e 'STDIN.to_lazy.map(&:to_record).reduce{|a,b| (a||0) + b.to_i}.each{|r| puts r.to_line}'
Method String#to_record is defined as <tt>spit("\t")</tt>.
Redefine it (together with Object#to_line) on you taste.
Some shortcuts:
* <tt>records</tt> for <tt>STDIN.to_lazy.map(&:to_record)</tt>
* <tt>r.put_record</tt> for <tt>puts r.to_line</tt>
* <tt>enum.put_records</tt> for <tt>enum.each{|r| puts r.to_line}</tt>
Example:
cat records.txt | \
ruby -r reduce -e 'records.each{|p,f| p.split.each{|w| puts [w,f].to_line}}' | \
sort | \
ruby -r reduce -e 'records.reduce{|a,b| (a||0) + b.to_i}.put_records'
One can use pipe operator +|+, defined for LazyEnumerable
and put +sort+ command inside ruby code:
cat records.txt | ruby -r reduce -e \
'records.map_out{|o,r| r[0].split.each{|w| o << [w,r[1]]}}.|("sort"). \
reduce{|a,b| (a||0) + b.to_i}}.put_records'
Methods Object#to_line and String#to_record are used for converting records to line
and backwards around pipe command.
Iterator +map_out+ sends to block output container as first argument.