2.3.1.6 Copying Data from Data Boutique's AWS S3 to Azure Blob Storage


Data Boutique provides access to datasets through a dedicated AWS S3 bucket, allowing you to retrieve the latest data directly. This guide covers two methods for transferring data from Data Boutique’s S3 bucket to Microsoft Azure Blob Storage: a Full Transfer method that copies all files, and an Incremental Transfer method that copies only new files when run on a regular schedule.

Method 1: Full Transfer of All Data

A full transfer is ideal for first-time setups or when you occasionally need to refresh the entire dataset in Azure Blob Storage. This approach copies all files from Data Boutique’s S3 bucket to Azure Blob Storage.

Option A: Using the AzCopy Command-Line Tool

AzCopy is a command-line utility from Microsoft that provides fast data transfer from AWS S3 to Azure Blob Storage.

Prerequisites

  1. Azure CLI and AzCopy: Install the Azure CLI and AzCopy on your machine.
  2. AWS CLI: Install the AWS CLI for authenticating with AWS (a quick way to verify that all three tools are installed is shown after this list).
  3. Access Permissions:
    • Ensure you have read permissions for your Data Boutique S3 bucket.
    • Ensure write permissions for your Azure Blob Storage container.
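
Before continuing, you can confirm that all three tools are on your PATH; the version commands below are the standard checks for each CLI:

    az --version        # Azure CLI
    azcopy --version    # AzCopy
    aws --version       # AWS CLI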

Steps for Full Transfer

  1. Configure AWS CLI:
    • Run aws configure and enter your AWS credentials from Data Boutique.
  2. Authenticate with Azure:
    • Log in with Azure by running:
      az login
  3. Run AzCopy for Full Transfer:
    • Use the following command to copy all files from Data Boutique’s S3 bucket to your Azure Blob Storage container:
      azcopy copy "https://databoutique.com/buyers/YOUR_AWS_ACCESS_KEY/?AWSAccessKeyId=YOUR_AWS_ACCESS_KEY&Signature=YOUR_SIGNATURE" "https://<your-storage-account-name>.blob.core.windows.net/<your-container-name>" --recursive
    • Replace YOUR_AWS_ACCESS_KEY and YOUR_SIGNATURE with the S3 access credentials provided by Data Boutique, and substitute your own storage account and container names in the Azure Blob container URL.
  4. Verify the Transfer:
    • Once the transfer completes, check your Azure Blob container to confirm the data was copied correctly (a command-line check is shown after these steps).
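
If you prefer to verify from the command line instead of the Azure portal, a blob listing along these lines works as a quick check; the storage account and container names below are placeholders for your own:

    # List blobs in the destination container, authenticating with your Azure login
    az storage blob list \
      --account-name <your-storage-account-name> \
      --container-name <your-container-name> \
      --auth-mode login \
      --output table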

Option B: Using AWS DataSync with Azure Blob Storage

AWS DataSync can be set up to automate data transfers from AWS S3 to Azure Blob Storage. This approach is well suited to large transfers and frequent updates.

Prerequisites

  1. AWS DataSync: Deploy an AWS DataSync agent with access to both the S3 bucket and your Azure Blob Storage account.
  2. Azure Blob Storage: Create a container in Azure for the destination.

Steps for Full Transfer with AWS DataSync

  1. Create a DataSync Task in AWS:
    • In AWS DataSync, create a new transfer task and specify your Data Boutique S3 bucket as the source location.
  2. Configure Azure Blob Storage as the Destination:
    • Add your Azure Blob Storage container as the destination location and supply the credentials DataSync needs to write to it (typically a SAS token).
  3. Run the Full Transfer Task:
    • Start the DataSync task to transfer all files from S3 to Azure. You can monitor progress in the DataSync console, or from the command line as sketched after these steps.
  4. Verify the Transfer:
    • Check the Azure Blob container to confirm that all files were transferred.
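
If you also have the AWS CLI configured, a task created in the console can be started and monitored from the command line as well; the ARNs below are placeholders for your own task and execution:

    # Find the task created in the DataSync console
    aws datasync list-tasks

    # Start a transfer run for that task (replace the placeholder ARN with your task ARN)
    aws datasync start-task-execution --task-arn arn:aws:datasync:REGION:ACCOUNT_ID:task/task-EXAMPLE

    # Check the status of the run (use the execution ARN returned by the previous command)
    aws datasync describe-task-execution --task-execution-arn arn:aws:datasync:REGION:ACCOUNT_ID:task/task-EXAMPLE/execution/exec-EXAMPLE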

Method 2: Incremental Transfer (Only New Files)

Incremental transfers are useful when you need to copy only new or modified files on a regular schedule. This approach saves bandwidth and avoids re-transferring data you already have.

Option A: Using AzCopy with File Tracking

The AzCopy command-line tool can be adapted for incremental transfers by keeping a local record of previously transferred files.

Steps for Incremental Transfer

  1. Create a State File:
    • Set up a local file, transferred_files.txt, to store a list of files that have already been transferred.
  2. Use AzCopy Script for Incremental Transfer:
    • Use the following script to copy only new files from Data Boutique’s S3 bucket to Azure Blob Storage.
    # Define paths (replace the placeholders with your own values)
    S3_URI="s3://databoutique.com/buyers/YOUR_AWS_ACCESS_KEY/"      # used for listing with the AWS CLI
    S3_BASE_URL="https://databoutique.com/"                         # used as the AzCopy source URL
    AZURE_CONTAINER="https://<your-storage-account-name>.blob.core.windows.net/<your-container-name>"
    STATE_FILE="transferred_files.txt"

    # Ensure the state file exists
    touch "$STATE_FILE"

    # List objects under the buyer prefix and copy any that are not yet in the state file
    aws s3 ls "$S3_URI" --recursive | awk '{print $4}' | while read -r file; do
        if ! grep -qxF "$file" "$STATE_FILE"; then
            # Copy the file to Azure Blob Storage and record it only if the copy succeeds
            if azcopy copy "$S3_BASE_URL$file" "$AZURE_CONTAINER"; then
                echo "$file" >> "$STATE_FILE"
                echo "Transferred new file: $file"
            fi
        fi
    done
  3. Schedule Regular Incremental Transfers:
    • Use a cron job (Linux/macOS) or Task Scheduler (Windows) to run this script on a regular basis so that only new files are transferred (an example cron entry follows).
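
As an example, a crontab entry along these lines runs the script every night at 2:00 AM; the script and log paths are placeholders for wherever you saved them:

    # Run the incremental transfer script daily at 02:00 and append its output to a log file
    0 2 * * * /bin/bash /path/to/incremental_transfer.sh >> /path/to/incremental_transfer.log 2>&1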

Option B: Using Python Script with State Tracking

For custom workflows, you can use Python to track previously transferred files and handle incremental updates.

Python Script for Incremental Transfer

import boto3
from azure.storage.blob import BlobServiceClient
import os

# AWS and Azure configuration
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_SECRET_KEY'
S3_BUCKET = 'databoutique.com'
BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/'
AZURE_CONTAINER = 'your-container-name'
AZURE_CONNECTION_STRING = 'Your-Azure-Connection-String'
STATE_FILE = 'transferred_files.txt'

# Initialize AWS S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY
)

# Initialize Azure Blob Service client
blob_service_client = BlobServiceClient.from_connection_string(AZURE_CONNECTION_STRING)
container_client = blob_service_client.get_container_client(AZURE_CONTAINER)

# Load the set of previously transferred files from the state file
def load_transferred_files():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'r') as f:
            return set(line.strip() for line in f)
    return set()

# Append newly transferred files to the state file
def save_transferred_files(files):
    with open(STATE_FILE, 'a') as f:
        for file in files:
            f.write(f"{file}\n")

def transfer_new_files():
    transferred_files = load_transferred_files()
    new_transfers = []

    # List objects under the buyer prefix (paginated, in case there are more than 1,000 keys)
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=BUYER_PATH):
        for obj in page.get('Contents', []):
            s3_key = obj['Key']

            if s3_key not in transferred_files:
                # Download the object from S3
                file_data = s3_client.get_object(Bucket=S3_BUCKET, Key=s3_key)['Body'].read()

                # Upload to Azure Blob Storage (overwrite covers a prior run that uploaded but did not record the file)
                blob_client = container_client.get_blob_client(s3_key)
                blob_client.upload_blob(file_data, overwrite=True)
                print(f"Uploaded {s3_key} to Azure Blob Storage")

                # Record the new transfer
                new_transfers.append(s3_key)

    # Update the state file with new transfers
    save_transferred_files(new_transfers)

# Run the transfer function
transfer_new_files()

This script:

  1. Loads a list of previously transferred files from transferred_files.txt.
  2. Downloads and uploads only files that are not in the list, minimizing redundant transfers.
  3. Updates transferred_files.txt with newly transferred files.
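
The script depends on the boto3 and azure-storage-blob packages, and it can be scheduled just like the shell script in Option A; the file name incremental_transfer.py below is a placeholder for wherever you saved the script:

    # Install the required Python packages
    pip install boto3 azure-storage-blob

    # Example cron entry: run the incremental transfer every night at 02:00
    0 2 * * * /usr/bin/python3 /path/to/incremental_transfer.py >> /path/to/transfer.log 2>&1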

Conclusion

This guide provides methods for both full and incremental data transfers from Data Boutique’s AWS S3 bucket to Azure Blob Storage. Whether you need a one-time full transfer or a scheduled incremental setup, these approaches allow you to move data efficiently across cloud providers, ensuring that your Azure Blob Storage always contains the latest datasets from Data Boutique.

