Posts

Source code for Data profiling using pyspark

Profiling plays a major role in today's data science world: understanding the data is required before you can identify and apply machine learning techniques. I hope the code below helps some people profile their data. In this code I use PySpark to process the data, and pandas plus HTML to render the output into an HTML file. First, I read the file with the Spark read command and assign it to a DataFrame. I use this DataFrame to filter the data and display the profiling attributes, and I use pandas in some places for data formatting/reporting. The file I used is a CSV file with a header. For data types I used df.dtypes, so date columns will be shown as string. The profiling results are rendered to an HTML report. For any custom enhancements/feedback, please contact me at dileep.psdk@gmail.com
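As a minimal sketch of the approach described above (the file path and the specific per-column metrics are assumptions for illustration, not the original post's exact code), the profiling can be wired up like this:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

spark = SparkSession.builder.appName("data-profiling").getOrCreate()

# Read the CSV file with its header row into a DataFrame
df = spark.read.option("header", "true").csv("/path/to/input.csv")

total_rows = df.count()
profile = []
for col_name, col_type in df.dtypes:  # dtypes returns (column name, type) pairs
    null_count = df.filter(F.col(col_name).isNull()).count()
    distinct_count = df.select(col_name).distinct().count()
    profile.append({
        "column": col_name,
        "data_type": col_type,  # date columns read from CSV show up as string
        "null_count": null_count,
        "distinct_count": distinct_count,
        "null_pct": round(100.0 * null_count / total_rows, 2) if total_rows else 0.0,
    })

# pandas is used only for formatting/reporting: render the profile as an HTML file
html = pd.DataFrame(profile).to_html(index=False)
with open("profile_report.html", "w") as f:
    f.write(html)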

How to read CSV file using Pyspark?

Reading data files with PySpark is straightforward with a simple command; however, you may need to add options depending on the format and content of the file. The simplest read just passes the header option. Other options you can add are:
escape : "\""
multiLine : "true"
delimiter : "|"
nullValue : "\\N"
inferSchema : "true"
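A minimal sketch of both reads (the file path and option values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-read").getOrCreate()

# Simple read: only pass the header option
df = spark.read.option("header", "true").csv("/path/to/file.csv")

# Read with the additional options listed above for trickier files
df = (spark.read
      .option("header", "true")
      .option("escape", "\"")
      .option("multiLine", "true")
      .option("delimiter", "|")
      .option("nullValue", "\\N")
      .option("inferSchema", "true")
      .csv("/path/to/file.csv"))

df.printSchema()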

How to get Azure Storage metrics through REST API?

I have spent some time getting storage metrics from the Azure REST APIs. Though the Azure documentation is a bit tricky at first, I was able to figure out the storage space utilized under each storage account. Note: the storage metrics below are at the Storage Account/Blob/Table/Queue level.

AVAILABILITY: The following example shows how to read metric data at the account level:

GET "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Storage/storageAccounts/{storageAccountName}/providers/microsoft.insights/metrics?metricnames=Availability&api-version=2018-01-01&aggregation=Average"

Response:
{
  "cost": 0,
  "timespan": "2017-09-07T17:27:41Z/2017-09-07T18:27:41Z",
  "interval": "PT1H",
  "value": [
    {
      "id": "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Stor
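A minimal Python sketch of calling the endpoint above with the requests library (the bearer token acquisition and the resource identifiers are assumptions; in practice the token would come from Azure AD, e.g. via the azure-identity package):

import requests

subscription_id = "<subscriptionId>"
resource_group = "<resourceGroupName>"
account_name = "<storageAccountName>"
bearer_token = "<azure-ad-access-token>"

url = (
    "https://management.azure.com/subscriptions/" + subscription_id
    + "/resourceGroups/" + resource_group
    + "/providers/Microsoft.Storage/storageAccounts/" + account_name
    + "/providers/microsoft.insights/metrics"
)
params = {
    "metricnames": "Availability",
    "api-version": "2018-01-01",
    "aggregation": "Average",
}
headers = {"Authorization": "Bearer " + bearer_token}

resp = requests.get(url, params=params, headers=headers)
resp.raise_for_status()
# Each entry in "value" is one metric; print its name and the aggregated time series
for metric in resp.json().get("value", []):
    print(metric["name"]["value"], metric.get("timeseries", []))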

How to remove a string from a Python list

Below is a snippet to remove a substring from each string in a Python list of strings:

a = ['fname_encr', 'lname_encr', 'address_encr']  # encr = encrypted value column
a = list(map(lambda x: x.replace('_encr', ''), a))
print(a)  # Python 3
> ['fname', 'lname', 'address']

Measuring the Data Quality

In today’s information-driven world, implementing an effective data quality management (DQM) strategy cannot be overlooked. DQM refers to a business principle that requires a combination of the right people, processes and technologies, all with the common goal of improving the measures of data quality. The subject is the single most important concept in the modern data quality approach: it is the entity that will be the target of the data quality investigation at the most granular level. Before we begin any data quality initiative we must discover what the subject of the study is. Like most concepts in our approach, the subject is reflected in the data but not attached to any technical object. For example, Employee Status, Hours and Earnings belong to the subject "Employee". If we implement a telecom data warehouse, subject areas can be Subscriber, Finance and Marketing. Once identified, the subject becomes more than a concept and will define the

How to Connect to Databricks Delta table using JDBC?

Connect to Databricks Delta tables using JDBC (Microsoft Azure). This post covers Databricks Delta JDBC connection configuration.

Step 1: Download the Databricks Spark JDBC driver from the link below. This might require filling in basic information before the download; after you fill in the required fields you will receive the driver download links at the email you provided in the form. Keep the jar in C:\Downloads (or any other location; this jar needs to be added to the classpath).

Step 2: Open the Databricks URL, navigate to the Clusters tab and click on the cluster. On that page, copy any one of the JDBC URLs (you may need to click on advanced settings under the configuration tab on the cluster configuration page).

Step 3: Navigate to the profile page in the top right corner and click on profile. Generate a token and keep it safely on your local machine.

Step 4: Below is the Scala program to connect to Databricks Delta from outside the Azure/cloud environment. I have us
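The original post continues with a Scala program; as a rough Python analogue (not the post's code), the same connection can be sketched with the jaydebeapi package. The driver class name, JDBC URL shape, and jar path below are assumptions based on the Simba Spark JDBC driver and should be replaced with the URL copied in Step 2 and the token generated in Step 3:

import jaydebeapi

# JDBC URL copied from the cluster configuration page, with the personal access token appended
jdbc_url = (
    "jdbc:spark://<databricks-host>:443/default;"
    "transportMode=http;ssl=1;"
    "httpPath=<http-path-from-cluster-config>;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>"
)

conn = jaydebeapi.connect(
    "com.simba.spark.jdbc.Driver",          # class inside the downloaded driver jar (assumed name)
    jdbc_url,
    jars="C:\\Downloads\\SparkJDBC42.jar",  # the jar kept from Step 1 (file name may differ)
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_delta_table LIMIT 10")  # my_delta_table is a placeholder
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()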

Convert ORC file into Text using Pig

Script to convert an ORC file into text:

REGISTER '/usr/hdp/current/hive-client/lib/hive-exec-1.2.1000.2.4.2.0-258.jar';
target = LOAD '/user/hdfspath/000000_0' USING OrcStorage('-c SNAPPY');
STORE target INTO '/user/hdfspath/myfile' USING PigStorage('|');

The REGISTER statement adds the hive-exec jar needed to read ORC files (including Snappy-compressed ones). Pass '-c SNAPPY' if the ORC file was created with Snappy compression; if not, you can simply remove the quoted part.