Spark ASN.1 DataSource for Telecom Mediation


We are moving into an era in which information can, and must, be converted into actionable insight in real time, so that companies can respond immediately to changes in customer behavior or to threats from competitors. This is exactly where Big Data analysis can win the battle against traditional BI tools. Meanwhile, many telecom companies are unaware of the volume of data they hold which, properly analyzed, could yield deeper insights into customer behavior, preferences, interests and service usage patterns. In this article we focus on the mediation use case and how to approach it with Big Data solutions.

Sam’s Precognition

Sam is a skilled data engineer working for one of the leading telecom companies in the world. In the telecom industry the big stars are mainly the systems handling money or subscriber information, including OCS (online charging systems), billing, revenue assurance, analytics and customer touch points. All those services are part of the business support systems (BSS). Together with operations support systems (OSS), they are used to support various end-to-end telecommunication services. BSS and OSS have their own data and service responsibilities. But the truth is that none of these actors would accomplish much without mediation working behind the scenes. Mediation services are the responsibility of Sam and his team. Mediation collects network and usage data across a wide variety of networks for business intelligence as well as for charging, billing, and policy management. As defined in industry standards, mediation is a key telecom node that sits between the data generators (such as network or IT nodes) and the data consumers, which are the downstream operations and business support systems. When Sam started working with the mediation service, the team's objective was mainly to collect call detail records (CDRs) with usage information from fixed network switches, process them and distribute them to billing systems so the service provider could charge its subscribers.

Behind the curtains, Sam's team was collecting the raw CDRs in batches of files via plain and simple file transfer protocol (FTP), with some internal logic in the middle. The team was responsible for parsing the CDRs, filtering out the non-relevant data, aggregating the partial data records, combining the necessary information and transforming the data into the format required by the data consumers (such as Rating Engine, Interconnect System, Roaming Clearinghouse, RA, FMS, EDW, Reporting, etc.). Finally, the processed CDRs are distributed to northbound systems. Sam was responsible for parsing the CDR files, which were encoded in ASN.1. Every landline or cellular telephone call made over a Public Land Mobile Network (PLMN) creates one or more call records. These Call Detail Records (CDRs) or Usage Detail Records (UDRs) are generated by the mobile switching center (MSC). The MSC is the primary GSM/CDMA service delivery node responsible for routing voice calls, SMS, and other services.

CDRs contain information that the network operator uses for subscriber identification, call charging, services obtained, call routing, etc. After a "data collector" in the network switch captures the CDRs, a Java program that Sam created converts the ASN.1 encoded CDRs from a binary format into a flat file format (JSON, CSV, etc.). Flat CDRs are used by the downstream billing and analytic applications that need to consume that data. Indeed, most data integration (ETL) and telco application service provider (ASP) operations have relied on mediation to convert and enrich the data first, starting with Sam's parsing job, because they cannot natively process the raw, binary ASN.1 formats themselves. That is because ASN.1 is designed to strictly encode machine-generated data for communication to a non-specific downstream processor. Mediation has been necessary because the processors are unknown in advance, and because the records are structured, macro-based, and not human-readable.

In raw form, CDR data can have more than 700 fields, any of which may or may not appear in the actual runtime stream. Again, the character strings and values are encoded in octets that are not human-readable. Sam's solution, a batch process executed regularly on massive amounts of data, was very efficient at the time, but the industry is moving toward real time.
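To make the idea of octet-encoded records more concrete, here is a minimal Python sketch of the kind of work a parser has to do: walk the tag-length-value (TLV) triples of a BER-encoded record and flatten them into named fields. It is an illustration only, not Sam's Java program. The three-field record layout, the field names and the sample bytes are all made up for this example, and the sketch handles only single-byte tags and definite lengths. Real CDR schemas come from the vendor's ASN.1 definition files and typically use context-specific tags and richer types.

```python
import json

# Universal BER tag values used by this sketch.
SEQUENCE, INTEGER, IA5STRING = 0x30, 0x02, 0x16


def read_tlv(data, offset=0):
    """Read one tag-length-value triple; return (tag, value, next_offset)."""
    tag = data[offset]
    length = data[offset + 1]
    offset += 2
    if length & 0x80:
        # Long form: the low 7 bits give the number of length octets that follow.
        num_octets = length & 0x7F
        length = int.from_bytes(data[offset:offset + num_octets], "big")
        offset += num_octets
    value = data[offset:offset + length]
    return tag, value, offset + length


def decode_record(record, field_names):
    """Flatten one SEQUENCE of primitive fields into a dict (hypothetical layout)."""
    tag, body, _ = read_tlv(record)
    if tag != SEQUENCE:
        raise ValueError("expected a SEQUENCE at the top level")
    row, pos = {}, 0
    for name in field_names:
        if pos >= len(body):
            break  # trailing fields are absent from this record
        tag, value, pos = read_tlv(body, pos)
        if tag == INTEGER:
            row[name] = int.from_bytes(value, "big", signed=True)
        elif tag == IA5STRING:
            row[name] = value.decode("ascii")
        else:
            row[name] = value.hex()  # unknown type: keep the raw octets as hex
    return row


# Made-up sample: SEQUENCE { IA5String "55501", IA5String "55502", INTEGER 42 }
RAW_RECORD = bytes.fromhex("3011" "16053535353031" "16053535353032" "02012a")
FIELD_NAMES = ["calling_number", "called_number", "duration_seconds"]

print(json.dumps(decode_record(RAW_RECORD, FIELD_NAMES)))
```

The raw record is 19 octets that mean nothing to a human reader or to a billing system, while the flattened output is a row that downstream tools can consume directly.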

Evolving challenges

Then 2G, 3G and 4G mobile networks came along. They not only added complexity to the source data records through the introduction of new services (for example, data and SMS), but also dramatically increased the number of CDRs to be processed. At the same time, the evolution of the telecom market meant there were more consumers of the information passing through the mediation platforms, including service assurance, fraud detection, performance monitoring, business intelligence and analytics, big data and external partners. Market dynamics demanded that information be transferred more quickly, and new business models were introduced faster than ever before. All this relied heavily on mediation functionality to bridge all network elements between Operations and Business Support Systems (OSS/BSS).

To keep up with this rapid evolution, the mediation team started searching for a solution and found two success stories of Big Data in telecommunication use cases: China Unicom and SK Telecom.

China Unicom 

The world's fourth largest telco, China Unicom, is deep into a broad rebuild of its internal software stack, built largely around open source. Its booming business in 4G and emerging 5G networks was already putting a heavy load on the legacy network data processing system. Open source software is helping China Unicom expand services and improve performance for its more than 320 million subscribers. The key to the project's success was the emergence of new big data frameworks like Apache Kafka, Spark, Hive, and Alluxio, which allowed the company to re-imagine its software stack to support batch and stream processing business requirements using very similar open source-based architectures orchestrated by Kubernetes. The new open source-based architecture now supports seven different lines of business services at China Unicom, according to Ce, who estimates that his team runs about 3 terabytes of data daily in Alluxio memory to support the batch and streaming requirements of the business workloads, or more than 200,000 daily Spark jobs.

SK Telecom

SK Telecom is the largest wireless telecommunications provider in South Korea, with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, totaling 60 TB and 120 billion records per day. Their data engineering team presented at the 2019 Spark + AI Summit to showcase how they analyzed, predicted, and visualized network quality data, as a Spark AI use case in a telecommunications company.

Wind of change

After studying these two cases closely, the team saw that the common thread was the use of open source Big Data solutions. After several meetings and various feasibility tests, an open source stack with Apache Spark at its center replaced the old system that processed data for mediation. The results were very good: the team was satisfied with the benchmark comparing the old and new systems, and massively parallel processing proved a perfect fit for their use case. Unfortunately, not everyone benefited from the change. Most of the time, existing data source connectors serve the purpose, but sometimes we need to work with legacy data stores that they do not cover. Spark doesn't have a native data source connector for ASN.1 files, so Sam was not able to use it. He kept his old program, with some tuning to adapt to the situation, but he was sure that the end of his solution was near.

When 4G came into the picture, it enabled very fast mobile broadband communication. To be successful, 4G networks required proven technologies capable of delivering massive amounts of user data and network control information in fractions of a second. Spark handled the change well thanks to its high scalability, but Sam was struggling to keep up with the flow. It was a real wake-up call: Sam was aware that the ASN.1 parsing solution needed to evolve to meet the required standards.

The future is now

From online to offline, from fixed to mobile networks, from voice to SMS, and from hundreds to billions of records – telecom mediation systems have been through a lot, and they’ve always been obligated to deliver. 

5G, the fifth generation technology of cellular mobile communications, raises the bar even further. It targets a higher data rate, reduced latency, energy efficiency, cost reduction, increased system capacity, and massive device connectivity. Because ASN.1 can meet all these requirements, 3GPP has elected to continue using ASN.1 for 5G. Consequently, many of the 3GPP 5G and LTE protocols such as RRC, S1AP, X2AP, NGAP, XnAP, E1AP, F1AP, and LPPa are defined using ASN.1. 5G is not only bringing faster and better-quality connections to the consumer market but also revolutionizing the IoT business by enabling new use cases that are only possible with 5G capabilities, such as network slicing, ultra-high speed and ultra-low latency. This means the number of connected devices will grow exponentially. As a result, many more sources will send data through mediation, increasing the amount of information to be processed. The nature of these use cases in new industries will also demand more of the mediation logic, which will need to be integrated with more consumers.

Spark of hope

Most of the time, existing data source connectors serve the purpose, but sometimes we need to work with legacy data stores that they do not cover. Unfortunately, at the time Sam's team integrated Apache Spark, it didn't have a native data source connector for ASN.1 files.

CDR data analysis involves large amounts of data. The Hadoop Distributed File System (HDFS) is chosen to store the vast amount of CDR data because it provides an easy and flexible storage platform. We get the raw CDR data from HDFS and parse it in parallel tasks. When Spark reads a file from HDFS, it creates a single partition for each input split. The input split is defined by the DataSource. The parsing results in a DataFrame composed of partitions. Each partition contains data from a single HDFS block.

The new ASN.1 DataSource allows reading ASN.1 encoded files in a local or distributed file system as Spark DataFrames. When reading files, the solution takes the ASN.1 encoded files and their definition files as input.
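The relationship between input splits and partitions described above can be sketched with a little arithmetic. The Python snippet below assumes, for illustration only, that each split corresponds to one HDFS block of 128 MiB (a common default, not a value stated in this article) and that a file can be cut at block boundaries. The function name is hypothetical, and real input formats may apply additional rules when computing splits.

```python
import math

MIB = 1024 * 1024


def count_partitions(file_size_bytes, split_size_bytes=128 * MIB):
    """Estimate the partitions for one file: one partition per input split."""
    if file_size_bytes < 0 or split_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and the split size positive")
    # Every started split becomes a partition, even if it is only partly full.
    return math.ceil(file_size_bytes / split_size_bytes)


for size_mib in (1, 300, 1024):
    print(size_mib, "MiB ->", count_partitions(size_mib * MIB), "partition(s)")
```

Under these assumptions, a large file is spread over several parallel tasks, while a file smaller than one split still occupies a whole partition of its own.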

Sam was extremely happy with the results: he now has the opportunity to work in step with the rest of his team. This is the beginning of a new era, one in which we handle massive data efficiently and concentrate on delivering more business value.

New Era, New Challenges

Powered by the new ASN.1 DataSource, the journey started smoothly, but the real purpose of this solution is to handle the 5G flow. Sam handled his old scalability challenges well, but he was about to get a taste of what was coming next. He started to notice performance issues in his pipeline and found that the bottleneck was inefficient reading of the files: the CDR files are small, but they are collected in large numbers. Sam remembered that his friend Bob, who works as a data engineer, had told him about a similar problem, a known performance issue caused by reading many small files. Sam contacted him to learn more about this problem and how to solve it.
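To see why many small files hurt, consider that each small file produces at least one partition, and therefore one task, of its own. The Python sketch below contrasts that with a simple greedy grouping of files into bins of a target size, which is one common way to reduce the number of tasks. It is an assumption made for illustration, not necessarily the approach Bob used. The file names, the 1 MiB file size and the 128 MiB target are made up.

```python
MIB = 1024 * 1024


def pack_files(files, target_bytes):
    """Greedily group (name, size) pairs, in order, into bins of at most target_bytes.

    A file larger than target_bytes ends up alone in its own bin.
    """
    bins, current, current_size = [], [], 0
    for name, size in files:
        # Close the current bin if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return bins


# Hypothetical workload: 10,000 CDR files of 1 MiB each.
files = [(f"cdr_{i:05d}.ber", 1 * MIB) for i in range(10_000)]
groups = pack_files(files, target_bytes=128 * MIB)

print("tasks with one partition per file:", len(files))
print("tasks after grouping:", len(groups))
```

With one partition per file, the scheduler has to launch ten thousand tiny tasks, and the per-task overhead dominates the useful parsing work. Grouping keeps the same data but spreads it over far fewer, larger tasks.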

If you want to know more about the challenges that Bob faced, including the many-small-files challenge, visit …
