
Spark SAS Data Source (sas7bdat)

A library for reading SAS data (.sas7bdat) with Spark.


Requirements:

  • Spark 2.4.x (Scala 2.11) or Spark 3.0.x (Scala 2.12), as listed in the version table below.

Download:

The latest jar can be downloaded from spark-packages.

Version        Scala Version   Spark Version
3.0.0-s_2.11   2.11.x          2.4.x
3.0.0-s_2.12   2.12.x          3.0.x

Features:

  • This package allows reading SAS files from local and distributed filesystems into Spark DataFrames.
  • Schema is automatically inferred from metadata embedded in the SAS file. (Behaviour can be customised, see parameters below)
  • The SAS format is splittable when not file-system compressed, thus we are able to convert a 200GB (1.5Bn rows) .sas7bdat file to .csv files using 2000 executors in under 2 minutes.
  • This library uses parso for parsing, as it is the only publicly available parser that handles both forms of SAS compression (CHAR and BINARY).

NOTE: this package does not support writing sas7bdat files; reading is illustrated in the sketch below.
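A minimal read, relying entirely on the inferred schema (a sketch; cars.sas7bdat is a placeholder path, local or on a distributed filesystem):

// Read a SAS file and inspect the schema inferred from its embedded metadata.
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("cars.sas7bdat")

df.printSchema()  // column names and types come straight from the sas7bdat metadata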

Docs:

Parameters:

  • extractLabel (Default: false)
    • Boolean: extract column labels as column comments for Parquet/Hive
  • forceLowercaseNames (Default: false)
    • Boolean: force column names to lower case
  • inferDecimal (Default: false)
    • Boolean: infer numeric columns with format width >0 and format precision >0, as Decimal(Width, Precision)
  • inferDecimalScale (Default: each column's format width)
    • Int: scale of inferred decimals
  • inferFloat (Default: false)
    • Boolean: infer numeric columns with <=4 bytes, as Float
  • inferInt (Default: false)
    • Boolean: infer numeric columns with <=4 bytes, format width >0 and format precision =0, as Int
  • inferLong (Default: false)
    • Boolean: infer numeric columns with <=8 bytes, format width >0 and format precision =0, as Long
  • inferShort (Default: false)
    • Boolean: infer numeric columns with <=2 bytes, format width >0 and format precision =0, as Short
  • metadataTimeout (Default: 60)
    • Int: number of seconds to allow reading of file metadata (stops corrupt files hanging)
  • minSplitSize (Default: mapred.min.split.size)
    • Long: minimum byte length of input splits (splits are always at least 1MB, to ensure correct reads)
  • maxSplitSize (Default: mapred.max.split.size)
    • Long: maximum byte length of input splits (can be decreased to force higher parallelism)

NOTE:

  • the order of precedence for numeric type inference is: Long -> Int -> Short -> Decimal -> Float -> Double
  • SAS doesn't have a concept of Long/Int/Short; instead, people typically use column formats with 0 precision
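A sketch combining several of these options on one read (the option values here are illustrative, not recommendations):

// Combine type-inference and split-size options on a single read.
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .option("extractLabel", true)               // carry SAS column labels into column comments
  .option("inferLong", true)                  // width > 0, precision = 0, <= 8 bytes -> Long
  .option("inferDecimal", true)               // width > 0, precision > 0 -> Decimal
  .option("metadataTimeout", 120)             // allow slow filesystems more time on metadata
  .option("maxSplitSize", 64 * 1024 * 1024L)  // 64MB splits to force higher parallelism
  .load("cars.sas7bdat")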

Scala API

val df = {
  spark.read
    .format("com.github.saurfang.sas.spark")
    .option("forceLowercaseNames", true)
    .option("inferLong", true)
    .load("cars.sas7bdat")
}
df.write.format("csv").option("header", "true").save("newcars.csv")

You can also use the implicit readers:

import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")

(Note: you cannot use parameters like inferLong with the implicit readers.)

Python API

df = spark.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)

R API

df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)

SQL API

SAS data can be queried in pure SQL by registering the data as a (temporary) table.

CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path="cars.sas7bdat")
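Once registered, the view behaves like any other table. For instance, from Scala, using the standard spark.sql entry point (a sketch):

// Query the temporary view registered above with plain Spark SQL.
val preview = spark.sql("SELECT * FROM cars LIMIT 10")
preview.show()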

SAS Export Runner

We include a simple SasExport Spark program that converts .sas7bdat files to .csv or .parquet:

sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"

To achieve more parallelism, use the spark-submit script to run it on a Spark cluster. If you don't have a Spark cluster, you can still run it in local mode and take advantage of multiple cores.
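A standalone equivalent of such a converter is only a few lines on top of this data source; this sketch hard-codes paths where the real program takes command-line arguments:

import org.apache.spark.sql.SparkSession

// Minimal sas7bdat-to-parquet converter built on this data source (a sketch).
object SasToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SasToParquet").getOrCreate()
    spark.read
      .format("com.github.saurfang.sas.spark")
      .load("input.sas7bdat")        // placeholder input path
      .write
      .parquet("output.parquet")     // placeholder output path
    spark.stop()
  }
}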

Spark Shell

spark-shell --master local[4] --packages saurfang:spark-sas7bdat:3.0.0-s_2.12

Caveats

  1. spark-csv writes out null as the literal string "null" in CSV text output. If you read such a file back, a string column that was null may come back as the string "null" instead of an actual null. The safest option is to export in parquet format, where null is properly recorded (sketched below). See databricks/spark-csv#147 for an alternative solution.
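A sketch of the safer parquet round trip, assuming the df from the examples above:

// Parquet records nulls exactly, so the round trip is lossless.
df.write.mode("overwrite").parquet("newcars.parquet")
val restored = spark.read.parquet("newcars.parquet")
restored.filter(restored("name").isNull).count()  // "name" is a hypothetical column; real nulls survive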

Related Work

Acknowledgements

This project would not be possible without parso's continued improvements and the generous contributions from @mulya, @thesuperzapper, and many others. We are honored to be a recipient of the 2020 WiseWithData ELEVATE Awards and appreciate their generous donations.
