At the end of the talk, around the 48:00 mark, an audience question generated an interesting answer. I am sharing it as a possibly useful idea for Kamanja, which is also implemented in Scala and integrates with Python and R.
q) How is data transferred between the different language APIs?
a) Spark is implemented in Scala. For the DataFrame API, the data stays on the Scala side. When you need data in Python, the Py4J library connects to the JVM through sockets and serializes the data from the JVM to the Python side. If you use only DataFrame operations, nothing happens on the Python side and you get native performance; the only overhead comes when you pull the data out to inspect it in Python. If you define UDFs in Python, every row must go through data serialization, and that is where the overhead appears. We try to provide common UDFs as built-ins so you don't have to write your own.
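To see why a Python UDF is costlier than a built-in DataFrame operation, here is a rough pure-Python sketch (not actual Spark or Py4J code) of the pattern described above: the "native" path transforms the data where it lives, while the "UDF" path ships every row across a serialization boundary, simulated here with `pickle`. The row layout and the `python_udf` helper are invented for illustration.

```python
import pickle

# Hypothetical rows standing in for data held on the JVM side.
rows = [{"id": i, "value": i * 2} for i in range(1000)]

# "Native" path: the transformation runs directly where the data lives,
# analogous to a DataFrame operation executing inside the JVM.
native_result = [r["value"] + 1 for r in rows]

# "UDF" path: each row is serialized, shipped to the other runtime,
# deserialized, transformed, then serialized back -- simulated with pickle.
def python_udf(row):
    return row["value"] + 1

udf_result = []
for row in rows:
    shipped = pickle.dumps(row)                      # JVM -> Python
    result = python_udf(pickle.loads(shipped))       # runs in Python
    udf_result.append(pickle.loads(pickle.dumps(result)))  # Python -> JVM

assert native_result == udf_result  # same answer, very different cost
```

Both paths produce identical results; the per-row serialization round-trips are pure overhead, which is why staying inside built-in DataFrame operations (or using provided common UDFs) is faster.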
See the SparkR documentation for an example of Scala-to-R integration using the DataFrame API (the RDD API is not used for R integration).