Kamanja Forums › Enhancement Requests & Contributions › Efficient language integration from Scala to Python and R

This topic contains 0 replies, has 1 voice, and was last updated by Greg Makowski 1 year, 3 months ago.


    Greg Makowski


    As part of creating MLlib PMML examples for Kamanja, I went through the video "Spark MLlib: Making Practical Machine Learning Easy and Scalable", presented by Xiangrui Meng of Databricks in April 2015.

    At the end of the talk, around time index 48:00, an audience question generated an interesting answer. I am sharing it here as a possible direction for Kamanja, which is also implemented in Scala and integrates with Python and R.

    Q: How is data transferred between the different language APIs?

    A: Spark is implemented in Scala. With the DataFrames API, the data stays there, on the Scala side. When you need data in Python, the Py4J library connects to the JVM through sockets and serializes data from the JVM to the Python side. If you use only DataFrames operations, nothing happens on the Python side, so you get native performance; the only overhead is pulling data out to inspect it in Python. If you have UDFs defined in Python, then the data has to go through serialization, and there is overhead. We try to provide common UDFs so you don't have to create them.
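    To make the trade-off in that answer concrete, here is a small pure-Python sketch (this is not Spark or Py4J code; the `Engine` class, `native_sum`, and `map_with_udf` are hypothetical names). It mimics the idea that built-in operations run entirely inside the engine with no cross-language data movement, while a user-defined function forces every row through a serialize/deserialize round trip, here simulated with `pickle`:

    ```python
    import pickle

    class Engine:
        """Hypothetical stand-in for data held on the JVM/Scala side."""
        def __init__(self, rows):
            self._rows = rows          # data stays "inside" the engine
            self.bytes_serialized = 0  # running count of boundary-crossing bytes

        def native_sum(self):
            # A built-in, DataFrame-style operation: it runs entirely in
            # the engine, so no data crosses the language boundary.
            return sum(self._rows)

        def map_with_udf(self, udf):
            # A UDF defined in the "other" language: each row must be
            # serialized out to the UDF and the result serialized back.
            out = []
            for row in self._rows:
                payload = pickle.dumps(row)            # engine -> Python
                self.bytes_serialized += len(payload)
                result = udf(pickle.loads(payload))
                back = pickle.dumps(result)            # Python -> engine
                self.bytes_serialized += len(back)
                out.append(pickle.loads(back))
            self._rows = out
            return self

    engine = Engine(list(range(1000)))
    total = engine.native_sum()           # built-in op: zero serialization
    assert engine.bytes_serialized == 0

    engine.map_with_udf(lambda x: x * 2)  # UDF: pays the round-trip cost
    assert engine.bytes_serialized > 0
    ```

    The design point carries over directly: as long as a pipeline sticks to engine-provided operations, results like `native_sum` come at native speed, which is why the speaker suggests shipping common UDFs with the engine rather than letting users define them on the Python side.
    
    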

    See the SparkR documentation for an example of Scala-to-R integration using the DataFrames API (the RDD API is not used for R integration).

