Spark Streaming 实时词频统计
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Spark Streaming 实时词频统计
# 本示例演示如何使用Spark Streaming处理实时数据流
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# 初始化SparkContext
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# 创建DStream从TCP源接收数据
lines = ssc.socketTextStream("localhost", 9999)
# 分割行成单词
words = lines.flatMap(lambda line: line.split(" "))
# 将每个单词映射为(word, 1)
pairs = words.map(lambda word: (word, 1))
# 按单词聚合计数
wordCounts = pairs.reduceByKey(lambda x, y: x + y)
# 打印结果
wordCounts.pprint()
# 启动流计算
ssc.start()
ssc.awaitTermination()
$
spark-submit word_count.py
22:15:30 INFO SparkContext: Running Spark version 3.4.1
22:15:30 INFO SparkContext: Submitted application: NetworkWordCount
22:15:31 INFO SecurityManager: Changing view acls to: student
22:15:32 INFO StreamingContext: StreamingContext started
✓ Application started successfully!
-------------------------------------------
Time: 1707651331000 ms
-------------------------------------------
('hello', 15)
('world', 12)
('spark', 10)
('big', 6)
('data', 4)
$