Skip to main content

How to profile Datasets of Amorphic?

headerImage

info

Create Dataset with data profiling option

  • Click on 'DATASETS' --> 'Datasets' from left navigation-bar.
  • Click on ➕ icon at the top right corner.
  • Enter the following information.
{
"Dataset Name": "covid19_daily_us_profile_<your_userid>"
"Description": "Covid-19 daily report of US as of 5/31/2021. The target location is Redshift."
"Domain": "workshop(workshop)"
"Data Classifications":
"Keywords": "Retail"
"Connection Type": "API (default)"
"File Type": "csv"
"Target Location": "Redshift"
"Update Method": "Append"
"My Data Files Have Headers": "Yes"
"Custom Delimiter": ","
"Enable Malware Detection": "No"
"Enable Data Profiling": "Yes"
}

Create Dataset

  • Click on 'Register' button at the bottom to move to the next step.
  • Click on the following CSV file to download it to your computer.

COVID-19 CSV File

  • Click on 'Click to upload' to upload the covid-19 csv file.
  • Click on 'Extract Schema' as shown below.
  • You will get a message 'File uploaded successfully'. Click OK.
  • A new screen will appear with the schema extracted as shown below.
  • Verify the columns and data types.
  • Change the 'Sort Key Type' to None.
  • Click on 'Publish Dataset'. You will get a 'completed the registration process successfully' message. Click OK.
  • Click on 'Files' tab. Upload the same covid-19 csv file again.
  • Data profiling jobs run at 12 AM UTC everyday.
  • Under the 'Profile' tab, you will see 'Schema & Data Profile'. If the data profiling option is disabled, you'd see 'Schema' only. Data profile is updated in this section as shown below.

Create Dataset

Enable data profiling for existing datasets.

  • You may enable data profiling for existing datasets with target as Redshift or S3Athena. S3 datasets cannot be profiled.
  • Data profiling is enabled by clicking 'edit' ✏️ icon and change 'Enable Data Profiling' as 'Yes'.


tip
  • How long does it take for all data profiles to get updated?
    • Lets say there are 100 datasets to be profiled, and each dataset takes about 3 minutes (depends on the dataset size) for the data profile to be updated. With a concurrency factor of 5, the total time taken will be 20*3min = 60min.
  • What happens in case of failures?
    • In cases where a data profile fails to be extracted the error is displayed on the profile tab, and an email alert is sent to the subscribed user.