I have two DataFrames in Spark (Scala) that are registered as tables. These are the two tables:
Table 1:
+-----+--------+
|id   |values  |
+-----+--------+
| 0   | v1     |
| 0   | v2     |
| 1   | v3     |
| 1   | v1     |
+-----+--------+
Table 2:
+-----+------+------+------+
|id   | v1   | v2   | v3   |
+-----+------+------+------+
| 0   | a1   | b1   | -    |
| 1   | a2   | -    | c2   |
+-----+------+------+------+
I want to generate a new table from the two tables above, where each row of Table 1 picks up the value from the Table 2 column named in its values field.
Table 3:
+-----+--------+--------+
|id |values | field |
+-----+--------+--------+
| 0 | v1 | a1 |
| 0 | v2 | b1 |
| 1 | v3 | c2 |
| 1 | v1 | a2 |
+-----+--------+--------+
Here v1 has the following schema:
|-- v1: struct (nullable = true)
|    |-- level1: string (nullable = true)
|    |-- level2: string (nullable = true)
|    |-- level3: string (nullable = true)
|    |-- level4: string (nullable = true)
|    |-- level5: string (nullable = true)
I am using Spark SQL in Scala.
Is it possible to do this by writing a SQL query or by using Spark functions on the DataFrames?
Here is sample code you can use. It joins the two DataFrames on id and then, for each row, reads the column whose name is stored in values:
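// Run in spark-shell, where `sc` and the implicits needed by toDF are already in scope.
val df1 = sc.parallelize(Seq((0, "v1"), (0, "v2"), (1, "v3"), (1, "v1"))).toDF("id", "values")
val df2 = sc.parallelize(Seq((0, "a1", "b1", "-"), (1, "a2", "-", "b2"))).toDF("id", "v1", "v2", "v3")

// Join on id so that every row carries its own v1, v2 and v3 columns.
val joinedDF = df1.join(df2, "id")

// For each row, use the string stored in `values` as the name of the column to read.
val resultDF = joinedDF.rdd.map { row =>
  val id = row.getAs[Int]("id")
  val values = row.getAs[String]("values")
  val feilds = row.getAs[String](values)
  (id, values, feilds)
}.toDF("id", "values", "feilds")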
When tested in the console:
scala> val df1=sc.parallelize(Seq((0,"v1"),(0,"v2"),(1,"v3"),(1,"v1"))).toDF("id","values")
df1: org.apache.spark.sql.DataFrame = [id: int, values: string]
scala> df1.show
+---+------+
| id|values|
+---+------+
| 0| v1|
| 0| v2|
| 1| v3|
| 1| v1|
+---+------+
scala> val df2=sc.parallelize(Seq((0,"a1","b1","-"),(1,"a2","-","b2"))).toDF("id","v1","v2","v3")
df2: org.apache.spark.sql.DataFrame = [id: int, v1: string ... 2 more fields]
scala> df2.show
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| 0| a1| b1| -|
| 1| a2| -| b2|
+---+---+---+---+
scala> val joinedDF=df1.join(df2,"id")
joinedDF: org.apache.spark.sql.DataFrame = [id: int, values: string ... 3 more fields]
scala> joinedDF.show
+---+------+---+---+---+
| id|values| v1| v2| v3|
+---+------+---+---+---+
| 1| v3| a2| -| b2|
| 1| v1| a2| -| b2|
| 0| v1| a1| b1| -|
| 0| v2| a1| b1| -|
+---+------+---+---+---+
scala> val resultDF=joinedDF.rdd.map{row=>
| val id=row.getAs[Int]("id")
| val values=row.getAs[String]("values")
| val feilds=row.getAs[String](values)
| (id,values,feilds)
| }.toDF("id","values","feilds")
resultDF: org.apache.spark.sql.DataFrame = [id: int, values: string ... 1 more field]
scala>
scala> resultDF.show
+---+------+------+
| id|values|feilds|
+---+------+------+
| 1| v3| b2|
| 1| v1| a2|
| 0| v1| a1|
| 0| v2| b1|
+---+------+------+
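The same lookup can probably be done without dropping down to the RDD API, using only column functions. The following is a minimal sketch under some assumptions: it is written as a standalone program that creates its own local SparkSession, the object name ColumnLookup and the helper pickField are hypothetical names introduced here for illustration, and every column of df2 other than id is treated as a lookup candidate. It builds one when(...) per candidate column and lets coalesce keep the one that matched. Because it works on column expressions rather than on getAs[String], it should also carry over to non-string columns such as the struct described in the question, provided all candidate columns share the same type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, when}

object ColumnLookup {
  // Hypothetical helper (not part of the original answer): for each row of df1,
  // pick the df2 column whose name equals the string stored in `values`.
  def pickField(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Assumption: every df2 column except `id` is a lookup candidate.
    val lookupCols = df2.columns.filter(_ != "id")
    // when(...) without otherwise(...) yields null when the name does not match,
    // so coalesce returns the value of the single matching column.
    val picked = coalesce(lookupCols.map(c => when(col("values") === c, col(c))): _*)
    df1.join(df2, "id").select(col("id"), col("values"), picked.as("field"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("column-lookup").getOrCreate()
    import spark.implicits._

    // Same sample data as in the answer above.
    val df1 = Seq((0, "v1"), (0, "v2"), (1, "v3"), (1, "v1")).toDF("id", "values")
    val df2 = Seq((0, "a1", "b1", "-"), (1, "a2", "-", "b2")).toDF("id", "v1", "v2", "v3")

    pickField(df1, df2).show()
    spark.stop()
  }
}
```

With the sample data this should give the same four rows as resultDF above, although the row order may differ. A row whose values string names no existing column gets a null field here, whereas the RDD version would throw on the unknown field name.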
I hope this solves your problem. Thanks!
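Since the question also asks whether a plain SQL query could do this, here is a hedged sketch of one way to write it. It assumes the two DataFrames are registered as temporary views named table1 and table2 (hypothetical names, since the question does not give the real ones) and that the set of lookup columns (v1, v2, v3) is known in advance, because a CASE expression has to list them explicitly. The object name SqlLookup and the method run are likewise made up for this example. The values column is wrapped in backticks because VALUES is also a SQL keyword.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SqlLookup {
  // Hypothetical entry point: registers the sample data as views and runs the query.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Assumed view names; the question does not say what the tables are called.
    Seq((0, "v1"), (0, "v2"), (1, "v3"), (1, "v1"))
      .toDF("id", "values").createOrReplaceTempView("table1")
    Seq((0, "a1", "b1", "-"), (1, "a2", "-", "b2"))
      .toDF("id", "v1", "v2", "v3").createOrReplaceTempView("table2")

    // The CASE expression maps the string in `values` to the column of the same name.
    spark.sql(
      """SELECT t1.id,
        |       t1.`values`,
        |       CASE t1.`values`
        |         WHEN 'v1' THEN t2.v1
        |         WHEN 'v2' THEN t2.v2
        |         WHEN 'v3' THEN t2.v3
        |       END AS field
        |FROM table1 t1
        |JOIN table2 t2 ON t1.id = t2.id""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sql-lookup").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```

The trade-off against the RDD version is that the column names are hard-coded in the query. If the set of columns changes, the WHEN branches would have to be generated from df2.columns before the query string is built.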