How to migrate data from an S3 bucket to an Amorphic dataset

Tidbits

  • S3 connections are used to migrate data from a remote S3 bucket to an Amorphic dataset.
  • The remote S3 bucket can be in a different AWS account.

Create a source connection

  • Click the 'Connections' widget on the home screen, click INGESTION --> Connections in the left navigation bar, or click Navigator in the top right corner and search for Connections.
  • Click the ➕ icon at the top right corner.
  • Enter the following details and click Create Connection.
{
"Connection Name": "remote-s3-bkt-2-amorphic-<your-userid>"
"Connection Type": "S3"
"Description": "This connection transfers the data from a remote S3 bucket to Amorphic's dataset."
"Authorized Users": "Select your user name and any other user names you want to grant permission"
"Keywords": "Add relevant keywords like 'S3'. This will be useful for search"
"Version": "1.2"
"S3 Bucket": "amd-workshop-s3"
"Connection Access Type": "Bucket Policy"
"S3 Bucket Region": "us-east-1"
}

Update bucket policy and test connection

  • Once the connection is created, the bucket policy and KMS key policy are available on the Details tab.
  • Update the source bucket's policy with the bucket policy shown on the Details tab.
  • If the source bucket has a custom KMS key attached, also update the source KMS key policy with the KMS key policy shown on the Details tab.
  • For this workshop, the source bucket amd-workshop-s3 already has the necessary permissions.
  • Test the connection by clicking the ⚡ icon.
  • You should see the message 'Connection tested successfully'.
  • If the connection test fails, correct the bucket policy of the source bucket and test again.
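
For orientation, a cross-account read policy on a source bucket generally has the shape sketched below. This is only an illustration: the policy displayed on the connection's Details tab is the one to apply. The principal ARN `arn:aws:iam::111122223333:role/example-amorphic-role`, the statement IDs, and the exact list of actions are assumptions made for the example, not values taken from Amorphic.

```python
import json


def build_read_only_bucket_policy(bucket: str, principal_arn: str) -> dict:
    """Return a bucket policy that lets `principal_arn` list and read `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level permissions apply to the bucket ARN itself.
                "Sid": "AllowListBucket",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level permissions apply to the keys under the bucket.
                "Sid": "AllowReadObjects",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical principal; replace it with the one from the Details tab.
policy = build_read_only_bucket_policy(
    "amd-workshop-s3",
    "arn:aws:iam::111122223333:role/example-amorphic-role",
)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

If you manage the source bucket with boto3, a JSON string like this can be applied from the source account with `boto3.client("s3").put_bucket_policy(Bucket="amd-workshop-s3", Policy=policy_json)`.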

Create a target dataset

  • Click 'DATASETS' --> 'Datasets' in the left navigation bar.
  • Click the ➕ icon at the top right corner.
  • Enter the following information and click 'Register'.
{
"Dataset Name": "remote_s3_2_amd_ds_<your_userid>"
"Description": "This dataset is a destination for S3 connection remote-s3-bkt-2-amorphic-<your-userid>"
"Domain": "workshop(workshop)"
"Data Classifications":
"Keywords": "S3"
"Connection Type": "S3"
"File Type": "csv"
"Target Location": "S3"
"Update Method": "Append"
"Connection": "remote-s3-bkt-2-amorphic-<your-userid>"
"Directory Path": <-- leave it blank to pull all files.
"Enable Malware Detection": "No"
"Enable AI Services": "No"
"Enable Data Cleanup": "No"
}

Set up a schedule

  • Click 'SCHEDULES' in the left navigation bar.
  • Click the ➕ icon at the top right corner.
  • Enter the following information and click 'Create'.
{
"Schedule Name": "remote_s3_2_ds_sched_<your_userid>"
"Description": "This schedule runs every 5 minutes to pull data from a remote S3 bucket to the Amorphic dataset."
"Type Of Job": "Data Ingestion"
"Select Dataset": "remote_s3_2_amd_ds_<your_userid> | s3" <-- Click ↩️ icon to refresh the list
"Keywords": "your_userid"
"Allocated Capacity":
"Schedule Type": "Time Based"
"Schedule Expression": "rate(5 minutes)"
}
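
The schedule expression above has the same form as an AWS EventBridge rate expression, `rate(value unit)`. The sketch below shows how such an expression maps to an interval, assuming Amorphic accepts the EventBridge rules (units of minute(s), hour(s), or day(s), with the singular unit required when the value is 1). The helper name `parse_rate_expression` is hypothetical and is not part of Amorphic.

```python
import re
from datetime import timedelta

# Matches expressions such as "rate(5 minutes)" or "rate(1 hour)".
_RATE_PATTERN = re.compile(r"^rate\((\d+) (minute|minutes|hour|hours|day|days)\)$")

_SECONDS_PER_UNIT = {"minute": 60, "hour": 3600, "day": 86400}


def parse_rate_expression(expression: str) -> timedelta:
    """Convert a rate expression into the interval between two runs."""
    match = _RATE_PATTERN.match(expression.strip())
    if match is None:
        raise ValueError(f"not a rate expression: {expression!r}")

    value = int(match.group(1))
    unit = match.group(2)
    if value < 1:
        raise ValueError("the rate value must be a positive integer")

    # A value of 1 takes the singular unit; any other value takes the plural.
    is_singular = not unit.endswith("s")
    if (value == 1) != is_singular:
        raise ValueError(f"unit {unit!r} does not agree with value {value}")

    return timedelta(seconds=value * _SECONDS_PER_UNIT[unit.rstrip("s")])


interval = parse_rate_expression("rate(5 minutes)")
print(interval.total_seconds())
```

With a 5-minute rate, a file added to the source bucket can take up to roughly five minutes, plus the job's own run time, to appear in the dataset.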

Add a file to source bucket

  • Press the Ctrl key twice or click Navigator at the top right corner.
  • Type add_files_to_bucket in the Navigator's search bar.
  • Click the matching job. This takes you to the job's details page. If you cannot access it, contact your admin.
  • Click the Run Job ▶️ icon and then click Submit.
  • Go to the Executions tab to monitor the status of the job. Once the job finishes, a file has been added to the source S3 bucket.
  • 💡 This job has been pre-configured to save time for you.
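
The implementation of the add_files_to_bucket job is not shown in this workshop. As a rough sketch of what a job like it might do, the code below builds a small CSV file in memory and writes it to a bucket through an S3 client. The function names, the `id,name` columns, and the timestamped key format are all assumptions for illustration; the client is passed in, so any object with a boto3-style `put_object` method works.

```python
import csv
import io
from datetime import datetime, timezone


def build_sample_csv(rows) -> str:
    """Serialize rows into CSV text with a fixed, illustrative header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "name"])  # hypothetical columns
    writer.writerows(rows)
    return buffer.getvalue()


def upload_sample_file(s3_client, bucket: str, rows, prefix: str = "") -> str:
    """Upload the CSV under a timestamped key and return that key."""
    timestamp = datetime.now(timezone.utc)
    key = f"{prefix}sample_{timestamp:%Y%m%dT%H%M%SZ}.csv"
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=build_sample_csv(rows).encode("utf-8"),
    )
    return key
```

With boto3 installed and credentials that can write to the source bucket, this could be called as `upload_sample_file(boto3.client("s3"), "amd-workshop-s3", [(1, "alice")])`. Because the dataset's File Type is csv, the sketch writes a CSV file.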

Check data transfer

  • The Execution Status tab of the schedule shows the status of each execution.
  • Hover over the message icon ✅ to see the number of files transferred.
  • For more details, click the 'three dots' menu and check the output logs.
  • Check the Files tab of the dataset. The files added to the source bucket should appear there.

Disable schedule

  • Disable the schedule once you have verified the transfer, so that it does not keep running every 5 minutes.
  • Click the Disable Schedule icon on the schedule page.
  • Click Yes.

You can do more...

  • Create a new schedule to check the behaviour of data transfer.