Welcome to Sparser
A DataSource API V1 implementation for parsing and querying ASN.1 encoded data with Apache Spark, for Spark SQL and DataFrames. Head over to the Introduction to learn more, or jump straight to our Getting started with ASN.1 DataSource guide.
What is ASN.1 DataSource?
ASN.1 DataSource helps teams save time and parse ASN.1 encoded data efficiently by offering a direct connector to the raw files. The processing job reads the data directly and applies business logic, instead of depending on a preprocessing phase that parses the data.
Why would I use ASN.1 DataSource?
To parse ASN.1 encoded data efficiently. Teams use ASN.1 DataSource to:
- Connect Spark to your own data source
- Reduce preprocessing time and cost
- Efficiently parse your data
Generate decoding classes
The project includes a compiler that creates Java classes from ASN.1 syntax. The generated classes can then be used together with the ASN.1 DataSource to efficiently decode messages using the Basic Encoding Rules (BER). The encoded bytes also conform to the Distinguished Encoding Rules (DER), which is a subset of BER. The generated classes contain methods to infer the schema and return each record in Row format.
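To make the BER terminology concrete, here is a minimal sketch of how a single BER tag-length-value (TLV) element can be split into its parts. It is for illustration only and is not the code the compiler generates: the `BerSketch` object and `decodeTlv` function are hypothetical names, and the sketch only handles single-byte tags and definite lengths of up to three length bytes.

```scala
object BerSketch {
  // One decoded element: its tag byte, its value bytes, and the offset of the next element.
  final case class Tlv(tag: Int, value: Array[Byte], next: Int)

  def decodeTlv(bytes: Array[Byte], offset: Int = 0): Tlv = {
    val tag = bytes(offset) & 0xFF
    // A low five-bit pattern of 11111 announces a multi-byte tag.
    require((tag & 0x1F) != 0x1F, "multi-byte tags are not handled in this sketch")

    val first = bytes(offset + 1) & 0xFF
    val (length, valueStart) =
      if (first < 0x80) {
        // Short form: the length fits in this single byte.
        (first, offset + 2)
      } else {
        // Long form: the low seven bits give the number of length bytes that follow.
        val n = first & 0x7F
        require(n >= 1 && n <= 3, "indefinite or very large lengths are not handled")
        val len = (0 until n).foldLeft(0) { (acc, i) =>
          (acc << 8) | (bytes(offset + 2 + i) & 0xFF)
        }
        (len, offset + 2 + n)
      }

    Tlv(tag, bytes.slice(valueStart, valueStart + length), valueStart + length)
  }
}
```

For example, the DER encoding of the INTEGER 5 is the three bytes 02 01 05: tag 0x02, length 1, value 0x05. Calling `BerSketch.decodeTlv(Array[Byte](0x02, 0x01, 0x05))` yields that tag, a one-byte value, and a next offset of 3.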
The DataSource can automatically infer the schema (every field is assumed to be a string) using the schema inference methods from the generated classes.
Verify correctness of the data
When reading ASN.1 encoded files with a specified schema, it is possible that the data in the files does not match the schema. The consequences depend on the mode that the parser runs in:
- PERMISSIVE (default): nulls are inserted for fields that could not be parsed correctly
- DROPMALFORMED: drops records that contain fields that could not be parsed
- FAILFAST: aborts the reading if any malformed data is found
In PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, add a _corrupt_record column to the schema.
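The difference between the three modes can be illustrated with a small, Spark-independent sketch. It is not the DataSource's implementation: the records here are plain strings that should parse as integers, and `ParseModeSketch`, `Parsed`, and `parseAll` are hypothetical names chosen for the example. The `corruptRecord` field plays the role of the _corrupt_record column.

```scala
import scala.util.{Failure, Success, Try}

object ParseModeSketch {
  // A parsed value, plus the raw input kept only when parsing failed.
  final case class Parsed(value: Option[Int], corruptRecord: Option[String])

  def parseAll(raw: Seq[String], mode: String): Seq[Parsed] = {
    // Try never throws here; failures are captured per record.
    val attempts = raw.map(r => r -> Try(r.trim.toInt))

    mode.toUpperCase match {
      case "PERMISSIVE" =>
        // Keep every record; unparsable fields become empty and the raw input is retained.
        attempts.map {
          case (_, Success(v)) => Parsed(Some(v), None)
          case (r, Failure(_)) => Parsed(None, Some(r))
        }
      case "DROPMALFORMED" =>
        // Keep only the records that parsed.
        attempts.collect { case (_, Success(v)) => Parsed(Some(v), None) }
      case "FAILFAST" =>
        // Throw on the first malformed record encountered.
        attempts.map {
          case (_, Success(v)) => Parsed(Some(v), None)
          case (r, Failure(e)) =>
            throw new IllegalArgumentException(s"Malformed record: $r", e)
        }
      case other =>
        throw new IllegalArgumentException(s"Unknown mode: $other")
    }
  }
}
```

With the input `Seq("1", "x", "3")`, PERMISSIVE returns three entries (the middle one empty, with "x" as its corrupt record), DROPMALFORMED returns two, and FAILFAST throws.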
Most of the time when using Apache Spark, existing data source connectors will serve our purpose. Unfortunately, Spark doesn’t have a native data source connector for ASN.1 files.
To handle ASN.1 encoded data files, a preprocessing step is necessary to parse the raw data into a format that Apache Spark can handle (JSON, CSV, etc.).
The preprocessing results in an extra cost that can be removed with the ASN.1 DataSource.
With this custom data source you can parse the data in parallel in an efficient way.
How do I get started?
Check out Getting started with ASN.1 DataSource to use the datasource, and learn important concepts along the way.
If you’re interested in a paid support contract or consulting services for ASN.1 DataSource, please see the options here.
For other questions and resources, please visit Community resources.
This section contains a collection of practical step-by-step instructions to help you use ASN.1 DataSource.
These tutorials will teach you the basics of what you need to know to get up and running with ASN.1 DataSource.
- If you’re the impatient type, head to Quick start to get going with no fuss or explanation.
- If you’re a new user of ASN.1 DataSource, check out Getting started with ASN.1 DataSource to learn important concepts of the DataSource. We recommend you get started here!
- If you already have a working deployment of ASN.1 DataSource and want a deep dive into how to use the DataSource, go to How to use the ASN.1 DataSource.
You can link against this library in your program at the following coordinates:
Spark compiled with Scala 2.11
- groupId: fr.databeans
- artifactId: spark-asn1_2.11
- version: 1.0.0
Spark compiled with Scala 2.12
- groupId: fr.databeans
- artifactId: spark-asn1_2.12
- version: 1.0.0
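For sbt users, the coordinates above might translate to a single build line like the one below. This is a sketch that assumes the artifact can be resolved from a repository already configured in your build, which this page does not state.

```scala
// build.sbt
// %% appends the Scala binary version (_2.11 or _2.12) to the artifact name,
// based on the scalaVersion set in this build.
libraryDependencies += "fr.databeans" %% "spark-asn1" % "1.0.0"
```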
Using with Spark shell
This package can be added to Spark using the --packages command line option. For example, to include it when starting the Spark shell:
Spark compiled with Scala 2.11
$SPARK_HOME/bin/spark-shell --packages fr.databeans:spark-asn1_2.11:1.0.0
Spark compiled with Scala 2.12
$SPARK_HOME/bin/spark-shell --packages fr.databeans:spark-asn1_2.12:1.0.0
After that, if you want to start using the library, check out Getting started with ASN.1 DataSource.
Getting started with ASN.1 DataSource
Welcome to ASN.1 DataSource! This tutorial will help you generate decoding classes and read your first ASN.1 encoded data with Spark. We’ll also introduce important concepts, with links to detailed material you can dig into later.
The tutorial will walk you through the following steps:
- First, we will introduce you to decoding class generation.
- Then you will learn how to use the DataSource to connect to your data.
- Finally, in the optional section, you will learn how to tune the DataSource for your specific use case.
The how-to guides in this section contain (mostly bite-size) instructions that will explain specific aspects of using ASN.1 DataSource.
How to generate decoding classes
The decoding and schema inference logic is generated by a compiler that creates Java classes from ASN.1 syntax. Compiling multiple interdependent ASN.1 modules defined in multiple files is supported. The Compiler has a static method called generateDecodingClasses that takes the following parameters:
- GeneratedSrcDir: the base directory for the generated Java classes. The class files will be saved in subfolders of the base directory corresponding to the name of the defined modules.
- RootPackageName: the base package name. Added to this will be a name generated from the module name.
- FilesPaths: ASN.1 files defining one or more modules.
Use the generateDecodingClasses() method of the Compiler class (fr.databeans.compiler.Compiler) to generate the classes.
The generated classes include a class representing the module that defines the main record.
The name of this main class is required when using the DataSource.
How to use the ASN.1 DataSource
When reading files, the API accepts the following options:
- path: location of ASN.1 encoded files. It can accept standard Hadoop globbing expressions.
- mainDecodingClass: the fully qualified name of the generated main class decoder.
- inferSchema: automatically infers the schema. It requires one extra pass over the data and is false by default.
The fully-qualified name for a class or interface is the package name followed by the class/interface name, separated by a period “.”.
For example: fr.databeans.utils.MyClass
Either schema inference or an explicit schema definition is required.
The fully qualified name of the generated main decoder class (mainDecodingClass) is required for schema inference.
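A read with schema inference might look like the sketch below. It uses only the standard Spark DataFrameReader API and the option names listed above. The DataSource's format identifier is not given on this page, so it is passed in as the `format` parameter rather than guessed; `Asn1InferSchemaSketch` and `readWithInferredSchema` are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object Asn1InferSchemaSketch {
  // `format` is the DataSource's format identifier. It is a parameter because
  // this page does not state it; take it from the project's own documentation.
  def readWithInferredSchema(
      spark: SparkSession,
      format: String,
      path: String,
      mainDecodingClass: String): DataFrame =
    spark.read
      .format(format)
      .option("mainDecodingClass", mainDecodingClass) // generated main decoder class
      .option("inferSchema", "true")                  // costs one extra pass over the data
      .load(path)                                     // sets the "path" option
}
```

In a Spark shell, `spark` is already defined, so a call would pass it along with the format identifier, a path or glob such as a directory of encoded files, and the fully qualified name of your own generated class.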
Explicit schema definition
You can manually specify the schema when reading data:
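As a hedged sketch, an explicit schema can be passed through the standard `schema` method of Spark's DataFrameReader. The field names below are hypothetical and must be replaced by the fields of your own ASN.1 module. The format identifier is again a parameter because this page does not state it, and passing mainDecodingClass here assumes the decoder class is needed for decoding even when the schema is given.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Asn1ExplicitSchemaSketch {
  // Hypothetical field names, chosen only for illustration.
  val schema: StructType = StructType(Seq(
    StructField("callingNumber", StringType, nullable = true),
    StructField("calledNumber", StringType, nullable = true)
  ))

  def readWithSchema(
      spark: SparkSession,
      format: String, // the DataSource's format identifier, not stated on this page
      path: String,
      mainDecodingClass: String): DataFrame =
    spark.read
      .format(format)
      .schema(schema)                                 // skips the inference pass
      .option("mainDecodingClass", mainDecodingClass) // assumed to be needed for decoding
      .load(path)
}
```

Supplying the schema up front avoids the extra pass over the data that inferSchema requires.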
The InputFormat describes the input-specification for a Spark job. Spark relies on the InputFormat of the job to do the following:
- Validate the input-specification of the job.
- Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual executor.
- Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the executor.
An input split is a chunk of the input data stored in HDFS. Each input split is processed by an executor. The RecordReader works on an input split and presents its records in key-value format.
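The idea of logical splits can be sketched without Hadoop: a file is cut into consecutive byte ranges, and each range is handed to one worker. The code below is an illustration under simplifying assumptions, not this library's or Hadoop's actual split logic; `InputSplitSketch`, `Split`, and `computeSplits` are hypothetical names.

```scala
object InputSplitSketch {
  // A logical split: a byte range within one file.
  final case class Split(start: Long, length: Long)

  // Cut a file of `fileLength` bytes into consecutive ranges of at most `splitSize` bytes.
  def computeSplits(fileLength: Long, splitSize: Long): Seq[Split] = {
    require(fileLength >= 0, "fileLength must not be negative")
    require(splitSize > 0, "splitSize must be positive")
    (0L until fileLength by splitSize).map { start =>
      // The last split may be shorter than splitSize.
      Split(start, math.min(splitSize, fileLength - start))
    }
  }
}
```

For a 10-byte file and a split size of 4, this gives the ranges (0, 4), (4, 4), and (8, 2). Split boundaries chosen this way need not fall on record boundaries, so a RecordReader has to locate the first complete record in its range; how this library does that is not described here.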
The library contains an implementation of Hadoop's InputFormat for processing ASN.1 encoded data, which you may use directly as follows: