Showing posts from 2021

Source code for Data profiling using pyspark

Profiling plays a major role in today's data science world. You need to understand your data before you can choose and apply machine learning techniques. I hope this code helps some people profile their data.

In this code I use PySpark to process the data, and HTML and pandas to render the output into an HTML file. First, I read the file with the Spark read command and assign it to a DataFrame. I then filter this DataFrame to compute and display the profiling attributes. I use pandas in a few places for data formatting and reporting. The input file is a CSV file with a header row. To get the data types I use df.dtypes, so a date column is shown as string. For any custom enhancements or feedback, please contact me at dileep.psdk@gmail.com
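As a rough illustration of the steps described above (read a CSV with a header, collect per-column attributes from the Spark DataFrame, render them with pandas), here is a minimal sketch. It is not the original source code of this post. The function names `profile_columns` and `main`, the file paths, and the choice of attributes (row count, null count, distinct count) are assumptions made for the example.

```python
def profile_columns(df):
    """Return one dict per column of a Spark DataFrame.

    Hypothetical helper: the attributes collected here are an assumed
    minimal set, not the full list from the original post.
    """
    total = df.count()
    rows = []
    # df.dtypes is a list of (column name, type name) pairs
    for name, dtype in df.dtypes:
        quoted = f"`{name}`"  # backticks protect names containing dots or spaces
        nulls = df.filter(f"{quoted} IS NULL").count()
        # distinct() treats NULL as one distinct value
        distinct = df.select(quoted).distinct().count()
        rows.append({
            "column": name,
            "type": dtype,
            "rows": total,
            "nulls": nulls,
            "distinct": distinct,
        })
    return rows


def main(csv_path, out_path):
    # Imported here so profile_columns stays importable without Spark installed.
    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("data-profiling").getOrCreate()
    # header=True takes column names from the first line of the CSV
    df = spark.read.csv(csv_path, header=True)
    report = pd.DataFrame(profile_columns(df))
    report.to_html(out_path, index=False)  # pandas writes the HTML table
    spark.stop()
```

Calling `main("input.csv", "profile.html")` with your own paths would write the report. This sketch runs several Spark jobs per column, which keeps the code simple but is slow on wide tables; a single aggregation over all columns would be faster.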