Source code for Data profiling using pyspark




Profiling plays a major role in today's data science world. You need to understand your data before you can choose and apply machine learning techniques. I hope the code below helps some people profile their data.

In this code I use PySpark to process the data, and HTML and pandas to render the output into an HTML file.
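To give a rough idea of the rendering half on its own, here is a minimal sketch that turns already-computed profiling numbers into an HTML page with pandas. The function name `render_profile_html`, the output file name `profile_report.html`, and the sample rows are my assumptions for illustration, not the original code or real results.

```python
import pandas as pd


def render_profile_html(rows, title="Data profile"):
    """Return an HTML page with one table row per profiled column."""
    # DataFrame.to_html builds the <table> markup; index=False hides the row index.
    table = pd.DataFrame(rows).to_html(index=False, border=1)
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1>{table}</body></html>"
    )


# Hypothetical profiling results, made up for illustration only.
sample_rows = [
    {"column": "id", "dtype": "string", "nulls": 0, "distinct": 3},
    {"column": "city", "dtype": "string", "nulls": 1, "distinct": 2},
]

if __name__ == "__main__":
    with open("profile_report.html", "w", encoding="utf-8") as fh:
        fh.write(render_profile_html(sample_rows))
```

Keeping the rendering in a function that only needs a list of dicts means the report can be tested without starting Spark.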

First, I read the file with the Spark read command and assign it to a dataframe. I then filter this dataframe to compute and display the profiling attributes. I use pandas in a few places for data formatting and reporting.

The file I used is a CSV file with a header row. To get the data types I used df.dtypes, so a date column will be shown as string.




The output will look like this.



For any custom enhancements or feedback, please contact me at dileep.psdk@gmail.com.

