Apache Spark / Wick: Type-Safe Spark API
Whether you prefer writing queries in Spark SQL or you are more used to calling functions via the DataFrame API, it’s only a matter of time before you run into a classic problem: if you make a typo in a column name or accidentally try to add a number to text, the code compiles without any issues, and you only find out about the error at runtime. In practice, this often means a long wait for the build to finish and for resources to be allocated on the cluster, only to watch the job fail after a few seconds because of a trivial typo.
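To make the problem concrete, here is a minimal sketch using the plain DataFrame API (the DataFrame df and its columns are assumed for illustration). The misspelled column name compiles without complaint and only fails once Spark analyzes the query on the cluster:

import org.apache.spark.sql.functions.col

// The column name is just a string, so the typo ("entty" instead of
// "entity") is invisible to the compiler...
df.select(col("entty")).show()
// ...and surfaces only at runtime, typically as an AnalysisException
// complaining that the column cannot be resolved.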
The Wick library, developed by Netflix engineers, offers an elegant solution to this problem. It is a type-safe API for Spark that moves error detection from runtime to the compilation phase, saving hours of unnecessary work and frustration.
Hand in hand with type safety comes better support from the development environment: thanks to the typed schema, the IDE automatically suggests existing columns and their data types. If you are familiar with the older Frameless library (designed for Scala 2), Wick serves as its modern successor, built on the Named Tuples feature available since Scala 3.7. Unlike the standard Dataset API, Wick also works as a “zero-cost” abstraction: all type checking happens exclusively at compile time and brings no runtime performance penalty. One current limitation is worth mentioning, though: the library does not yet support Spark 4.
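For readers who have not met Named Tuples yet, a tiny standalone sketch (independent of Spark and Wick) shows the idea: the field labels become part of the tuple’s type, so the compiler can check every access.

// Scala 3.7+: the labels are part of the tuple's type.
val point: (x: Int, y: Int) = (x = 1, y = 2)

// Field access is checked at compile time...
val shifted = (x = point.x + 10, y = point.y)

// ...so a misspelled label is rejected outright:
// point.z  // error: value z is not a member of (x: Int, y: Int)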
The starting point for working with the Wick library is a simple case class that describes the data structure. The classic untyped DataFrame from Spark is then wrapped in a DataSeq, which brings the desired type safety into play.
case class Livestock(
  entity: String,
  code: String,
  year: Int,
  asses: Double,
  buffalo: Double,
  cattle: Double,
  goats: Double,
  horses: Double,
  mules: Double,
  pigs: Double,
  sheep: Double,
  turkeys: Double,
  chickens: Double
)
val df = sparkSession.read
  .option("header", value = true)
  .schema(summon[Encoder[Livestock]].schema)
  .csv("data/livestock.csv")

val livestock = DataSeq[Livestock](df)
Once the data is wrapped in DataSeq, we can work with it using functions very similar to the standard DataFrame API. When accessing values (e.g., row.entity), the IDE automatically suggests which properties are available. The addition operator (+) is defined only for numeric types, which guarantees that calculating the total number of livestock in millions cannot fail on incompatible data.
livestock
  .filter(_.year === 2014)
  .select(row =>
    (
      entity = row.entity,
      total_millions = (row.cattle + row.sheep + row.chickens) / 1_000_000
    )
  )
  .orderBy(row => desc(row.total_millions))
  .show()
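This is where the approach pays off: the compiler rejects exactly the mistakes described at the beginning. The following lines are deliberately broken variants of the query above, shown as comments since they do not compile (a hypothetical sketch, not output from the library):

// livestock.select(row => (total = row.catle + row.sheep))
//   does not compile: the Livestock schema has no field "catle"

// livestock.select(row => (mixed = row.entity + row.cattle))
//   does not compile: with + defined only for numeric types,
//   a String column cannot be added to a Double column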
For comparison, let’s look at what the same query would look like written in pure untyped Spark SQL. A potential typo in the word “cattle” would go unnoticed here, and the error would only pop up at runtime.
sparkSession
  .sql(
    """SELECT entity
      |     , (cattle + sheep + chickens) / 1000000 AS total_millions
      |  FROM livestock
      | WHERE year = 2014
      | ORDER BY total_millions DESC
      |""".stripMargin
  )
  .show()
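One detail the snippet leaves implicit: for sparkSession.sql to resolve the livestock table, the DataFrame presumably has to be registered as a temporary view first, along these lines:

// Register the DataFrame under the name referenced in the SQL query.
df.createOrReplaceTempView("livestock")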
Type safety, of course, also extends to more complex operations such as aggregations. The agg method requires its inputs to form scalar expressions. If you omitted the sum(...) aggregation function in the following code and tried to divide the row.cattle value directly, the compiler would immediately report an error: it simply won’t let an unaggregated value through.
livestock
  .groupBy(row => (entity = row.entity))
  .agg(row => (cattle_total_millions = sum(row.cattle) / 1_000_000))
  .orderBy(row => desc(row.cattle_total_millions))
  .show()
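And the rejected variant the paragraph above warns about, again shown as a comment because it never gets past the compiler:

// livestock
//   .groupBy(row => (entity = row.entity))
//   .agg(row => (cattle_total_millions = row.cattle / 1_000_000))
//   does not compile: row.cattle is not wrapped in an aggregation
//   function such as sum, so it cannot appear in agg's result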
And again, the SQL alternative for the same case. The result is identical, but any guarantee of correct names and types before execution is missing here.
sparkSession
  .sql(
    """SELECT entity
      |     , SUM(cattle) / 1000000 AS cattle_total_millions
      |  FROM livestock
      | GROUP BY entity
      | ORDER BY cattle_total_millions DESC
      |""".stripMargin
  )
  .show()
Although type-checked approaches to Spark existed before, they often meant compromises in convenience or performance. Wick, on the other hand, takes full advantage of Named Tuples in Scala 3 and brings a modern, zero-cost abstraction. The result is not only cleaner code, but above all a much faster and more reliable development cycle.
The entire project with examples is available on GitHub.