Introduction to Apache Spark
1. Introduction to Apache Spark
Spark is a general-purpose parallel computing framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab.
Quoting Wikipedia: "Apache Spark is an open-source cluster computing framework."
Its core abstraction is the Resilient Distributed Dataset (RDD). The Apache Spark stack includes:
Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.
1.1 Spark Core handles task dispatching, scheduling, and basic I/O, and exposes the application API for working with RDDs.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wiki_test") // Create a Spark configuration object
val sc = new SparkContext(conf) // Create a Spark context
val data = sc.textFile("/path/to/somedir") // Read the files under "somedir" into an RDD of lines
val tokens = data.flatMap(_.split(" ")) // Split each line into individual tokens (words)
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _) // Pair each token with a count of one, then sum the counts per word
wordFreq.map(x => (x._2, x._1)).top(10) // Swap word and count so top(10) orders by count, then take the 10 most frequent words
1.2 Spark SQL provides a friendlier way to query data through an abstraction called DataFrames, built on top of RDDs.
import org.apache.spark.sql.SQLContext
val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername&password=yourPassword" // JDBC URL for your database server; MySQL separates URL parameters with "&"
val sqlContext = new SQLContext(sc) // Create a SQL context from the existing SparkContext
val df = sqlContext
.read
.format("jdbc")
.option("url", url)
.option("dbtable", "people")
.load()
df.printSchema() // Print the schema of this DataFrame
val countsByAge = df.groupBy("age").count() // Count people by age
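The same aggregation can also be expressed directly in SQL. A minimal sketch against the Spark 1.x SQLContext API used above (the temporary table name "people" is arbitrary):

```scala
// Register the DataFrame loaded above as a temporary table so SQL queries can reference it.
df.registerTempTable("people")

// Run the same per-age count as a SQL query and display the result.
val countsByAgeSql = sqlContext.sql("SELECT age, COUNT(*) AS cnt FROM people GROUP BY age")
countsByAgeSql.show()
```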
1.3 Spark Streaming provides stream processing with second-level latency (by contrast, Storm operates at millisecond-level latency).
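As an illustration of those second-level micro-batches, here is a minimal DStream word-count sketch. It assumes a text source listening on localhost:9999 (e.g. started with `nc -lk 9999`); host, port, and app name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming_sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999) // DStream of incoming text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // Print the word counts for each 1-second batch

ssc.start()
ssc.awaitTermination()
```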
1.4 MLlib is Spark's machine learning library.
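As a taste of the RDD-based MLlib API, the sketch below clusters a handful of hard-coded 2-D points with k-means; the data and parameters are purely illustrative, and `sc` is the SparkContext created earlier:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Four toy points forming two obvious clusters.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

// Train a k-means model with k = 2 clusters and at most 20 iterations.
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println) // Print the two learned cluster centers
```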
1.5 GraphX is Spark's library for graph-parallel computation.
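A GraphX graph is built from a vertex RDD and an edge RDD. The following sketch constructs a tiny, made-up "follows" graph and computes in-degrees; vertex names and edge labels are illustrative, and `sc` is the SparkContext created earlier:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (Long id, attribute) pairs; edges carry a source id, destination id, and attribute.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
graph.inDegrees.collect().foreach(println) // Print each vertex id with its in-degree
```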