Map(flatMap)

October 29, 2018    Spark

Map 예제

  • lines = [[“$w_1 \quad w_2 \quad w_3$”], [“$w_4 \quad w_5 \quad w_6$”]]
  • lines를 map/flatmap을 이용하여 split하게 되면 아래와 같다.
    • map: one2one mapping
      • Array(Array($w_1, w_2, w_3$),Array($w_4, w_5, w_6$))
    • flatmap: one example $\rightarrow$ one result(flatten)
      • Array($w_1, w_2, w_3, w_4, w_5, w_6$)


코드 예시(notebook 코드 참조)

import org.apache.spark._


val lines = sc.textFile("./data/redfox.txt")
lines: org.apache.spark.rdd.RDD[String] = ./data/redfox.txt MapPartitionsRDD[9] at textFile at <console>:34


println(lines.collect().toList)
List(The quick red fox jumped over the lazy brown dogs, I am donghwa)


  • 대문자로 변환해주는 RDD생성
val rageCaps = lines.map(x => x.toUpperCase)
rageCaps: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at map at <console>:36
println(rageCaps.collect().toList(0))
THE QUICK RED FOX JUMPED OVER THE LAZY BROWN DOGS


flatMap

  • flatMap는 아웃풋이 single value가 아니라 list(the set of values)로 반환 해준다.
    • map: Array(Array(a),Array(b),Array(c),Array(d))
    • flatMap: Array((a,b,c,d))
  • 사용된 목적은 아웃풋의 shape이 flatten되다는 특징이 있다. 만약 문장마다 parsed 단어를 유지하고 싶으면 flatMap 대신 map을 사용하자.


1) map을 사용할 때

val _words = lines.map(x => x.split(" "))
_words: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[11] at map at <console>:36


_words.collect()
res18: Array[Array[String]] = Array(Array(The, quick, red, fox, jumped, over, the, lazy, brown, dogs), Array(I, am, donghwa))


2) flatMap을 사용할 때

val words = lines.flatMap(x => x.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:36


words.collect()
res19: Array[String] = Array(The, quick, red, fox, jumped, over, the, lazy, brown, dogs, I, am, donghwa)


val input = sc.textFile("./data/book.txt")
input: org.apache.spark.rdd.RDD[String] = ./data/book.txt MapPartitionsRDD[16] at textFile at <console>:34


input.take(8).foreach(println)
Self-Employment: Building an Internet Business of One
Achieving Financial and Personal Freedom through a Lifestyle Technology Business
By Frank Kane



Copyright � 2015 Frank Kane. 
All rights reserved worldwide.


val words = input.flatMap( x => x.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at flatMap at <console>:36


  • 문장단위의 구분없이 map 아웃풋이 flatten되었다는 것을 확인할 수 있다.
words.take(15).foreach(println)
Self-Employment:
Building
an
Internet
Business
of
One
Achieving
Financial
and
Personal
Freedom
through
a
Lifestyle


val wordCounts= words.countByValue()
wordCounts: scala.collection.Map[String,Long] = Map(foolproof -> 1, precious -> 1, inflammatory -> 1, referrer, -> 1, hourly -> 3, embedded -> 1, way). -> 1, touch, -> 1, of. -> 3, salesperson -> 5, Leeches -> 1, expansion. -> 1, rate -> 7, appropriate. -> 2, CPA�s -> 1, 2014 -> 2, WELL-MEANING -> 1, Talk -> 5, Much -> 1, Builder" -> 1, plugin -> 3, headache -> 1, purchasing -> 9, China" -> 1, looks -> 2, site, -> 7, ranking -> 2, scare -> 1, hard-earned -> 1, freedom� -> 1, Seattle, -> 3, PULLING -> 1, action. -> 1, accident -> 3, scale. -> 2, looking. -> 1, physically -> 1, 27, -> 1, call. -> 1, contracts -> 3, twofold. -> 1, scenario -> 1, Advertising -> 3, way? -> 2, nudge -> 1, gamble -> 1, ideas -> 19, sketches -> 1, static -> 1, freelancer. -> 1, �PR:� -> 1, joining -> 1, particu...


wordCounts.take(10).foreach(println)
(foolproof,1)
(precious,1)
(inflammatory,1)
(referrer,,1)
(hourly,3)
(embedded,1)
(way).,1)
(touch,,1)
(of.,3)
(salesperson,5)

DSBA