Map(flatMap)

October 29, 2018 Spark

Map 예제

lines = [[“$w_1 \quad w_2 \quad w_3$”], [“$w_4 \quad w_5 \quad w_6$”]]
lines를 map/flatmap을 이용하여 split하게 되면 아래와 같다.
- map: one2one mapping
  - Array(Array($w_1, w_2, w_3$),Array($w_4, w_5, w_6$))
- flatmap: one example $\rightarrow$ one result(flatten)
  - Array($w_1, w_2, w_3, w_4, w_5, w_6$)

코드 예시(notebook 코드 참조)

import org.apache.spark._

val lines = sc.textFile("./data/redfox.txt")

lines: org.apache.spark.rdd.RDD[String] = ./data/redfox.txt MapPartitionsRDD[9] at textFile at <console>:34

println(lines.collect().toList)

List(The quick red fox jumped over the lazy brown dogs, I am donghwa)

대문자로 변환해주는 RDD생성

val rageCaps = lines.map(x => x.toUpperCase)

rageCaps: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[10] at map at <console>:36

println(rageCaps.collect().toList(0))

THE QUICK RED FOX JUMPED OVER THE LAZY BROWN DOGS

flatMap

flatMap는 아웃풋이 single value가 아니라 list(the set of values)로 반환 해준다.
- map: Array(Array(a),Array(b),Array(c),Array(d))
- flatMap: Array((a,b,c,d))
사용된 목적은 아웃풋의 shape이 flatten되다는 특징이 있다. 만약 문장마다 parsed 단어를 유지하고 싶으면 flatMap 대신 map을 사용하자.

1) map을 사용할 때

val _words = lines.map(x => x.split(" "))

_words: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[11] at map at <console>:36

_words.collect()

res18: Array[Array[String]] = Array(Array(The, quick, red, fox, jumped, over, the, lazy, brown, dogs), Array(I, am, donghwa))

2) flatMap을 사용할 때

val words = lines.flatMap(x => x.split(" "))

words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[12] at flatMap at <console>:36

words.collect()

res19: Array[String] = Array(The, quick, red, fox, jumped, over, the, lazy, brown, dogs, I, am, donghwa)

val input = sc.textFile("./data/book.txt")

input: org.apache.spark.rdd.RDD[String] = ./data/book.txt MapPartitionsRDD[16] at textFile at <console>:34

input.take(8).foreach(println)

Self-Employment: Building an Internet Business of One
Achieving Financial and Personal Freedom through a Lifestyle Technology Business
By Frank Kane



Copyright � 2015 Frank Kane. 
All rights reserved worldwide.

val words = input.flatMap( x => x.split(" "))

words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at flatMap at <console>:36

문장단위의 구분없이 map 아웃풋이 flatten되었다는 것을 확인할 수 있다.

words.take(15).foreach(println)

Self-Employment:
Building
an
Internet
Business
of
One
Achieving
Financial
and
Personal
Freedom
through
a
Lifestyle

val wordCounts= words.countByValue()

wordCounts: scala.collection.Map[String,Long] = Map(foolproof -> 1, precious -> 1, inflammatory -> 1, referrer, -> 1, hourly -> 3, embedded -> 1, way). -> 1, touch, -> 1, of. -> 3, salesperson -> 5, Leeches -> 1, expansion. -> 1, rate -> 7, appropriate. -> 2, CPA�s -> 1, 2014 -> 2, WELL-MEANING -> 1, Talk -> 5, Much -> 1, Builder" -> 1, plugin -> 3, headache -> 1, purchasing -> 9, China" -> 1, looks -> 2, site, -> 7, ranking -> 2, scare -> 1, hard-earned -> 1, freedom� -> 1, Seattle, -> 3, PULLING -> 1, action. -> 1, accident -> 3, scale. -> 2, looking. -> 1, physically -> 1, 27, -> 1, call. -> 1, contracts -> 3, twofold. -> 1, scenario -> 1, Advertising -> 3, way? -> 2, nudge -> 1, gamble -> 1, ideas -> 19, sketches -> 1, static -> 1, freelancer. -> 1, �PR:� -> 1, joining -> 1, particu...

wordCounts.take(10).foreach(println)

(foolproof,1)
(precious,1)
(inflammatory,1)
(referrer,,1)
(hourly,3)
(embedded,1)
(way).,1)
(touch,,1)
(of.,3)
(salesperson,5)

Map(flatMap)

October 29, 2018 Spark

Related Posts

July 25, 2021

Fairseq 코드리뷰 Wav2vec 2.0 (Finetune)

July 24, 2021

Fairseq 코드리뷰 Wav2vec 2.0 (Pretrain)

July 5, 2021

Docker container를 vscode로 원격제어하기