A Spark DataSource (API V1) for parsing and querying ASN.1 encoded data with Apache Spark, for Spark SQL and DataFrames. Head over to the Introduction to learn more, or jump straight to our Getting started with ASN.1 DataSource guide.
What is ASN.1 DataSource? #
ASN.1 DataSource helps teams save time and parse ASN.1 encoded data efficiently by offering a direct connector to the raw files. The processing job reads the data directly and applies business logic, instead of depending on a preprocessing phase that parses the data.
Why would I use ASN.1 DataSource? #
To parse ASN.1 encoded data efficiently. Teams use ASN.1 DataSource to
Key features #
Generate decoding classes #
A compiler which creates Java classes from ASN.1 syntax. The generated classes can then be used together with the ASN.1 DataSource to efficiently decode messages using the Basic Encoding Rules (BER). The encoded bytes also conform to the Distinguished Encoding Rules (DER) which is a subset of BER. The generated classes contain methods to infer schema and return the record in Row format.
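To make the BER terminology concrete, here is a minimal hand-written sketch of decoding a single tag-length-value (TLV) triple. It is not the code produced by the compiler; the class name BerTlvSketch and its fields are hypothetical, and the sketch only handles single-byte tags and definite lengths.

```java
import java.util.Arrays;

// Illustrative only: decodes one BER tag-length-value (TLV) triple by hand.
// This is NOT the code generated by the ASN.1 DataSource compiler.
final class BerTlvSketch {
    final int tag;       // identifier octet (single-byte tags only)
    final byte[] value;  // content octets
    final int end;       // offset of the first byte after this TLV

    BerTlvSketch(int tag, byte[] value, int end) {
        this.tag = tag;
        this.value = value;
        this.end = end;
    }

    static BerTlvSketch decode(byte[] buf, int offset) {
        int pos = offset;
        int tag = buf[pos++] & 0xFF;
        if ((tag & 0x1F) == 0x1F) {
            throw new IllegalArgumentException("multi-byte tags are not handled in this sketch");
        }
        int first = buf[pos++] & 0xFF;
        int length;
        if (first < 0x80) {
            length = first;                 // short form: the octet is the length itself
        } else if (first == 0x80) {
            throw new IllegalArgumentException("indefinite length is not handled in this sketch");
        } else {
            int count = first & 0x7F;       // long form: number of length octets that follow
            if (count > 4) {
                throw new IllegalArgumentException("length too large for this sketch");
            }
            length = 0;
            for (int i = 0; i < count; i++) {
                length = (length << 8) | (buf[pos++] & 0xFF);
            }
        }
        if (length < 0 || length > buf.length - pos) {
            throw new IllegalArgumentException("declared length exceeds the buffer");
        }
        byte[] value = Arrays.copyOfRange(buf, pos, pos + length);
        return new BerTlvSketch(tag, value, pos + length);
    }

    public static void main(String[] args) {
        // 0x02 = INTEGER tag, 0x01 = length 1, 0x2A = value 42
        byte[] encoded = {0x02, 0x01, 0x2A};
        BerTlvSketch tlv = decode(encoded, 0);
        System.out.println("tag=" + tlv.tag + " length=" + tlv.value.length + " value=" + tlv.value[0]);
        // prints: tag=2 length=1 value=42
    }
}
```

Generated decoders perform this kind of walk for every field of the ASN.1 module, which is why they need to be produced from the syntax definition rather than written by hand.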
Schema inference #
Automatically infers the schema using the schema inference methods from the generated classes. Every field is inferred as a string.
Verify correctness of the data #
When reading ASN.1 encoded files with a specified schema, it is possible that the data in the files does not match the schema. The consequences depend on the mode that the parser runs in:
In the PERMISSIVE mode it is possible to inspect the rows that could not be parsed correctly. To do that, you can add a _corrupt_record column to the schema.
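The idea behind parse modes can be sketched without Spark. The example below is conceptual and does not call the ASN.1 DataSource: only PERMISSIVE is named in this documentation, while DROPMALFORMED and FAILFAST are the mode names used by Spark's built-in CSV and JSON readers and are assumed here for illustration. The class ParseModeSketch and the integer "decoder" are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch of parse modes; it does not call Spark or the ASN.1 DataSource.
final class ParseModeSketch {
    // PERMISSIVE comes from the documentation; the other two names are borrowed
    // from Spark's built-in CSV/JSON readers and are an assumption for this library.
    enum Mode { PERMISSIVE, DROPMALFORMED, FAILFAST }

    // A "row" here is {parsed value, corrupt record}; one of the two is always null.
    static List<String[]> parseAll(List<String> rawRecords, Mode mode) {
        List<String[]> rows = new ArrayList<>();
        for (String raw : rawRecords) {
            try {
                // Stand-in for real decoding: a record is valid if it is an integer.
                int parsed = Integer.parseInt(raw);
                rows.add(new String[] {String.valueOf(parsed), null});
            } catch (NumberFormatException e) {
                switch (mode) {
                    case PERMISSIVE:
                        // Keep the row and store the raw input in the corrupt-record slot.
                        rows.add(new String[] {null, raw});
                        break;
                    case DROPMALFORMED:
                        // Skip the malformed record.
                        break;
                    case FAILFAST:
                        throw new IllegalStateException("malformed record: " + raw, e);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1", "oops", "3");
        for (String[] row : parseAll(input, Mode.PERMISSIVE)) {
            System.out.println("value=" + row[0] + " _corrupt_record=" + row[1]);
        }
    }
}
```

In PERMISSIVE mode the job completes and the second row carries the unparsed input, which is what makes later inspection of bad records possible.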
Workflow advantages #
Most of the time when using Apache Spark, existing data source connectors will serve our purpose. Unfortunately, Spark doesn’t have a native data source connector for ASN.1 files.
To handle ASN.1 encoded data files, a preprocessing step is necessary to parse the raw data into a format that Apache Spark can handle (JSON, CSV, etc.).
The preprocessing results in an extra cost that can be removed with the ASN.1 DataSource.
With this custom data source you can parse the data in parallel in an efficient way.
How do I get started? #
Check out Getting started with ASN.1 DataSource to use the datasource, and learn important concepts along the way.
If you’re interested in a paid support contract or consulting services for ASN.1 DataSource, please see options here
For other questions and resources, please visit Community resources.
This section contains a collection of practical step-by-step instructions to help you use ASN.1 DataSource.
These tutorials will teach you the basics of what you need to know to get up and running with ASN.1 DataSource.
Quick start #
You can link against this library in your program at the following coordinates:
Scala 2.11 #
Scala 2.12 #
Using with Spark shell #
This package can be added to Spark using the --packages command line option. For example, to include it when starting the Spark shell:
Spark compiled with Scala 2.11 #
$SPARK_HOME/bin/spark-shell --packages fr.databeans:spark-asn1_2.11:1.0.0
Spark compiled with Scala 2.12 #
$SPARK_HOME/bin/spark-shell --packages fr.databeans:spark-asn1_2.12:1.0.0
After that, if you want to start using the library, check out Getting started with ASN.1 DataSource.
Getting started with ASN.1 DataSource #
Welcome to ASN.1 DataSource! This tutorial will help you generate decoding classes and read your first ASN.1 encoded files with Apache Spark. We’ll also introduce important concepts, with links to detailed material you can dig into later.
The tutorial will walk you through the following steps:
- First, we will introduce you to decoding class generation.
- Then you will learn how to use the DataSource to connect to your data.
- Finally, in the optional section, you will learn how to tune the DataSource for your specific use case.
How-to guides #
The how-to guides in this section contain (mostly bite-size) instructions that will explain specific aspects of using ASN.1 DataSource.
How to generate decoding classes #
The decoding and schema inference logic are generated by a compiler that creates Java classes from ASN.1 syntax. Compiling multiple inter-dependent ASN.1 modules defined in multiple files is supported. The Compiler has a static method called generateDecodingClasses that takes the following parameters:
Use the generateDecodingClasses() method of the Compiler class (fr.databeans.compiler.Compiler) to generate the classes.
The generated classes include a class for the module that represents the main record.
The name of this main module class is required when using the DataSource.
How to use the ASN.1 DataSource #
When reading files the API accepts the following options:
The fully-qualified name for a class or interface is the package name followed by the class/interface name, separated by a period “.”.
For example: fr.databeans.utils.MyClass
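As a small illustration of the naming rule, the sketch below splits a fully-qualified name into its package and class parts and loads a class by name through reflection. Whether ASN.1 DataSource loads the decoding class exactly this way is an assumption; QualifiedNameSketch is a hypothetical helper, and java.util.ArrayList stands in for a generated class that is on the classpath.

```java
// Illustrative only: how a fully-qualified name maps to a package, a class name,
// and a class that can be loaded at runtime.
final class QualifiedNameSketch {
    // Splits "pkg.sub.ClassName" into {package, simple class name}.
    static String[] split(String fullyQualifiedName) {
        int lastDot = fullyQualifiedName.lastIndexOf('.');
        if (lastDot < 0) {
            return new String[] {"", fullyQualifiedName};
        }
        return new String[] {
            fullyQualifiedName.substring(0, lastDot),
            fullyQualifiedName.substring(lastDot + 1)
        };
    }

    // Loads a class by its fully-qualified name; fails if it is not on the classpath.
    static Class<?> load(String fullyQualifiedName) {
        try {
            return Class.forName(fullyQualifiedName);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("class not found: " + fullyQualifiedName, e);
        }
    }

    public static void main(String[] args) {
        String[] parts = split("fr.databeans.utils.MyClass");
        System.out.println("package=" + parts[0] + " class=" + parts[1]);
        // prints: package=fr.databeans.utils class=MyClass

        // java.util.ArrayList stands in for a generated decoding class.
        System.out.println(load("java.util.ArrayList").getSimpleName());
        // prints: ArrayList
    }
}
```

A misspelled package or a class missing from the job's classpath surfaces as a lookup failure, so it is worth checking the name before submitting a job.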
Either schema inference or an explicit schema definition is required:
Schema inference #
The fully qualified name of the generated main decoding class (mainDecodingClass) is required for schema inference.
Explicit schema definition #
You can manually specify the schema when reading data:
Hadoop InputFormat #
The InputFormat describes the input-specification for a Spark job. Spark relies on the InputFormat of the job to do the following:
An input split is a chunk of the data stored in HDFS. Each input split is processed by one task running on an executor. The RecordReader reads an input split and presents its records in key-value format.
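The role of a RecordReader can be sketched in plain Java without the Hadoop API. The example below walks a buffer of back-to-back BER records and returns each one as a key-value pair. Using the byte offset as the key mirrors Hadoop's text input format and is an assumption for this library; RecordReaderSketch is hypothetical and only handles one-byte tags with short-form lengths.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Conceptual sketch of what a RecordReader does; it does not use the Hadoop API.
final class RecordReaderSketch {
    // Returns (byte offset, record bytes) pairs for the records found in the split.
    // Simplification: one-byte tags and short-form (single-octet) lengths only.
    static List<Map.Entry<Long, byte[]>> readRecords(byte[] split) {
        List<Map.Entry<Long, byte[]>> records = new ArrayList<>();
        int pos = 0;
        while (pos < split.length) {
            if (pos + 2 > split.length) {
                throw new IllegalArgumentException("truncated header at offset " + pos);
            }
            int length = split[pos + 1] & 0xFF;
            if (length >= 0x80) {
                throw new IllegalArgumentException("long-form length not handled in this sketch");
            }
            int end = pos + 2 + length;   // tag octet + length octet + content octets
            if (end > split.length) {
                throw new IllegalArgumentException("truncated record at offset " + pos);
            }
            records.add(new SimpleEntry<>((long) pos, Arrays.copyOfRange(split, pos, end)));
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) {
        // Two records: INTEGER 42, then a 2-byte OCTET STRING.
        byte[] split = {0x02, 0x01, 0x2A, 0x04, 0x02, 0x68, 0x69};
        for (Map.Entry<Long, byte[]> record : readRecords(split)) {
            System.out.println("key=" + record.getKey() + " valueLength=" + record.getValue().length);
        }
        // prints: key=0 valueLength=3
        //         key=3 valueLength=4
    }
}
```

A real reader also has to deal with records that straddle split boundaries, which this sketch leaves out.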
The library contains an implementation of the Hadoop InputFormat for processing ASN.1 encoded data, which you may use directly as follows: