From Pandas to pySpark Dataframes
This short post is intended for Pandas users who experience some initial struggle in transitioning to Spark dataframes. I too felt a certain animosity towards Spark concepts at first: if it is pySpark, why is it so different from the usual Python pandas after all? Below I will highlight a few key differences between the two dataframe implementations.
While most Spark resources point to lazy evaluation as the main difference, I find that immutability and distributed execution explain the difference best when narrowing it down to dataframe operations.
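For reference, lazy evaluation simply means that transformations only build an execution plan, and nothing is computed until an action is called. A minimal sketch (assuming a local SparkSession and a made-up toy table):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])

filtered = df.filter(df.id > 1)  # transformation: nothing is computed yet
print(filtered.count())          # action: triggers the actual computation -> 2
```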
Pandas dataframe: single node, mutable data structure. Since everything happens on a single node, Python "sees and owns" the whole table. Row indexing and all the goodies that come with it are possible as a result.
Since it is a mutable type, one can modify the table right on the spot, as shown in the sketch below. Simple as that.
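A minimal pandas sketch (with a made-up toy table) illustrating both row indexing and in-place modification:

```python
import pandas as pd

df = pd.DataFrame({"price": [10, 20, 30]}, index=["a", "b", "c"])

# Row indexing: the whole table lives in local memory,
# so label- and position-based lookups are cheap.
print(df.loc["b"])   # select row by label
print(df.iloc[0])    # select row by position

# In-place modification: pandas mutates the existing object directly.
df.loc["b", "price"] = 25
df["price"] *= 2
```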
pySpark dataframe: distributed across several worker nodes, immutable type. Because data processing is distributed across several nodes, pySpark does not "own" the data it is working with, cannot reliably track changes to it, and as a consequence cannot maintain row indices.
Because pySpark tables are immutable, it is not possible to modify them on the spot as in Pandas. This is likely the price paid for the ability to handle Big Data. Spark dataframes are essentially an abstraction implemented on top of RDDs, so any transformation gives you just a new view of the data. Hence the new syntax for the API functions.
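By contrast, a minimal pySpark sketch (again a toy table, assuming a local SparkSession) shows that a transformation such as withColumn returns a new dataframe rather than changing the original:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(10,), (20,), (30,)], ["price"])

# There is no in-place assignment: withColumn returns a *new* dataframe
# (a view over the underlying RDD), leaving the original untouched.
df2 = df.withColumn("price_doubled", F.col("price") * 2)

df.show()    # still only the original "price" column
df2.show()   # has the new column
```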
Below is a short cheatsheet comparing pandas and pySpark commands (applicable for Spark 2.3+).