-
Notifications
You must be signed in to change notification settings - Fork 0
avorozhtsov/mr_shell
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Method Enumerable#reduce provides functionality for reducing streams of sorted records (google "MapReduce"). Let's imagine a container with records [key, count]. Reducing several records with same key to one record can be done like this: [ [key, c1], [key, c2], [key, c3] ] #=> [key, c1 + c2 + c3] There is the way to do it: records.sort.reduce{|a,b| (a||=0) + b} Generally speaking reducing is injecting values with same key. The reduce operation is equivalent to group-and-inject-values: records.inject{|groups,r| (groups[r.key]||=[]) << r.value; groups}. map{|key,values| [key, values.inject{|a,b| (a||=0) + b}} Reduce algorithm implementation is different from this version. It's based on the fact that input records are already grouped by key. Reduce can by combined with map, but before each reduce operation records should be sorted (or grouped) by key. If one has records [phrase, frequency] and wants to get records [word, frequency] he should run records. map{|phrase, frequency| phrase.split.map{|word| [word, frequency]}}. inject([]){|new_records, records| new_records.push(*records)}. sort. reduce{|a,b| (a||0) + b} One can make calculations lazy just by converting container to lazy one: records.to_lazy. ... LazyEnumerable allows to dramatically reduce memory usage when using piped transfromations: map, select, uniq, pipe, and reduce. The only problem is the method +sort+, which requires all records to be loaded to memory. But one can use any scalable external sort: cat records.txt | \ ruby -r reduce -e 'STDIN.to_lazy.map(&:to_record).each{|p,f| p.split.each{|w| puts [w,f].to_line}}' | \ sort | \ ruby -r reduce -e 'STDIN.to_lazy.map(&:to_record).reduce{|a,b| (a||0) + b.to_i}.each{|r| puts r.to_line}' Method String#to_record is defined as <tt>spit("\t")</tt>. Redefine it (together with Object#to_line) on you taste. Some shortcuts: * <tt>records</tt> for <tt>STDIN.to_lazy.map(&:to_record)</tt> * <tt>r.put_record</tt> for <tt>puts r.to_line</tt> * <tt>enum.put_records</tt> for <tt>enum.each{|r| puts r.to_line}</tt> Example: cat records.txt | \ ruby -r reduce -e 'records.each{|p,f| p.split.each{|w| puts [w,f].to_line}}' | \ sort | \ ruby -r reduce -e 'records.reduce{|a,b| (a||0) + b.to_i}.put_records' One can use pipe operator +|+, defined for LazyEnumerable and put +sort+ command inside ruby code: cat records.txt | ruby -r reduce -e \ 'records.map_out{|o,r| r[0].split.each{|w| o << [w,r[1]]}}.|("sort"). \ reduce{|a,b| (a||0) + b.to_i}}.put_records' Methods Object#to_line and String#to_record are used for converting records to line and backwards around pipe command. Iterator +map_out+ sends to block output container as first argument.
About
MapReduce and functional programming utils, including lazy container
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published