2.3.1.5 Copying Data from Data Boutique's AWS S3 to Google Cloud Storage

Updated by Andrea Squatrito

Data Boutique provides access to datasets through a dedicated AWS S3 bucket, allowing you to retrieve the latest data directly. This guide covers two methods to transfer data from Data Boutique’s S3 bucket to Google Cloud Storage (GCS): a Full Transfer method for transferring all files and an Incremental Transfer method to copy only new files on a regular schedule.

Method 1: Full Transfer of All Data

A full transfer is ideal when you want to copy the entire dataset from your Data Boutique S3 bucket to Google Cloud Storage. This approach is simple and efficient if you’re setting up the data for the first time or occasionally need to refresh the entire dataset.

Option A: Using gsutil Command

The gsutil command-line tool from Google Cloud SDK makes it easy to copy all files from AWS S3 to GCS.

Prerequisites

  1. Google Cloud SDK: Install the Google Cloud SDK, which includes gsutil.
  2. AWS CLI: Install the AWS CLI for authentication with AWS.
  3. Access Permissions: Ensure you have read permissions for your Data Boutique S3 bucket and write permissions for your GCS bucket.
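
A quick way to confirm these prerequisites are in place is to check that each tool responds on the command line:

    # Verify the Google Cloud SDK (including gsutil) and the AWS CLI are installed
    gcloud --version
    gsutil version
    aws --version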

Steps for Full Transfer

  1. Configure AWS CLI:
    • Run aws configure and enter your AWS credentials from Data Boutique.
  2. Authenticate gsutil:
    • Log in with Google Cloud by running:
      gcloud auth login
  3. Execute Full Transfer:
    • Run the following command to copy all files from the Data Boutique S3 bucket to your GCS bucket:
      gsutil -m cp -r s3://databoutique.com/buyers/[Your_AWS_Access_Key]/* gs://your-gcs-bucket-name/
    • Replace [Your_AWS_Access_Key] with your unique AWS access key from Data Boutique.
    • -m enables parallel transfers, and -r copies recursively so every file under your buyer prefix is included.
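
Putting the three steps above together, a minimal full-transfer session might look like the sketch below. The bucket name and access-key placeholder are values to substitute with your own, and it assumes gsutil can pick up the AWS credentials you entered with aws configure, as described above:

    # 1. Provide the Data Boutique AWS credentials (interactive prompts)
    aws configure

    # 2. Authenticate the Google Cloud SDK / gsutil with your Google account
    gcloud auth login

    # 3. Copy everything under your buyer prefix into the GCS bucket
    #    (-m = parallel transfers, -r = recursive copy)
    gsutil -m cp -r "s3://databoutique.com/buyers/YOUR_AWS_ACCESS_KEY/*" "gs://your-gcs-bucket-name/"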

Option B: Using Google Cloud Storage Transfer Service

Google Cloud’s Storage Transfer Service allows for fully automated, scheduled transfers from AWS S3 to GCS, ideal for one-time or recurring full transfers.

Prerequisites

  1. AWS IAM User: Ensure you have an IAM user with AmazonS3ReadOnlyAccess permissions.
  2. Access Key and Secret Key: Use the access key and secret key provided by Data Boutique.
  3. GCS Bucket: Create a Google Cloud Storage bucket as your destination.
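
If the destination bucket does not exist yet, it can be created ahead of time with gsutil; the bucket name below is a placeholder to replace with your own:

    # Create the destination Google Cloud Storage bucket
    gsutil mb gs://your-gcs-bucket-name/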

Steps for Full Transfer with Storage Transfer Service

  1. Open Storage Transfer Service:
    • In Google Cloud Console, go to Storage Transfer Service.
  2. Create a Transfer Job:
    • Click Create transfer.
    • Under Source, select Amazon S3 bucket and enter the bucket name databoutique.com.
    • Enter your AWS Access Key ID and Secret Access Key from Data Boutique.
  3. Configure Destination:
    • Select your Google Cloud Storage bucket as the destination.
  4. Define Transfer Settings:
    • Choose options like overwriting files, deleting source files post-transfer, and email notifications.
  5. Start the Transfer Job:
    • Review and confirm settings, then start the job. Storage Transfer Service will handle the full transfer.
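
The same job can also be created from the command line with the gcloud CLI's transfer commands instead of the console. The sketch below is indicative only: the creds.json file name is an assumption, and flag names vary between gcloud releases, so confirm them with gcloud transfer jobs create --help before relying on it:

    # creds.json holds the Data Boutique keys, for example:
    # {"accessKeyId": "YOUR_AWS_ACCESS_KEY", "secretAccessKey": "YOUR_SECRET_KEY"}
    gcloud transfer jobs create \
        s3://databoutique.com gs://your-gcs-bucket-name \
        --source-creds-file=creds.json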

Method 2: Incremental Transfer (Only New Files)

Incremental transfers are optimal when you need to copy only new or modified files regularly. This approach saves bandwidth and storage space by only copying files that aren’t already in GCS.

Option A: Using gsutil with State Tracking

The gsutil command can be adapted for incremental transfers by keeping a record of previously transferred files.

Steps for Incremental Transfer

  1. Create a State File:
    • Create a local file, transferred_files.txt, to store a list of already transferred files. This file will be used to track new files in future transfers.
  2. Run Incremental Transfer Script:
    • Use the following script to copy only files not listed in transferred_files.txt.
    # Define source bucket, buyer prefix, destination, and state file
    S3_BUCKET="databoutique.com"
    S3_PREFIX="buyers/YOUR_AWS_ACCESS_KEY/"
    GCS_BUCKET="gs://your-gcs-bucket-name/"
    STATE_FILE="transferred_files.txt"

    # Ensure the state file exists
    touch "$STATE_FILE"

    # List every object under the buyer prefix (keys are returned relative to the bucket root)
    aws s3 ls "s3://$S3_BUCKET/$S3_PREFIX" --recursive | awk '{print $4}' | while read -r file; do
        if ! grep -qxF "$file" "$STATE_FILE"; then
            # Copy the file to GCS; record it in the state file only if the copy succeeded
            if gsutil cp "s3://$S3_BUCKET/$file" "$GCS_BUCKET"; then
                echo "$file" >> "$STATE_FILE"
                echo "Transferred new file: $file"
            fi
        fi
    done
    • Explanation: The script lists every object under your buyer prefix, skips any key already recorded in transferred_files.txt, copies the rest to GCS, and appends each successfully copied key to the state file so future runs skip it.
  3. Schedule Regular Incremental Transfers:
    • Use cron (Linux/Mac) or Task Scheduler (Windows) to run the script regularly for incremental updates.
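
For example, on Linux or macOS, a crontab entry along these lines runs the transfer every night at 02:00; the script path and log file are placeholders for wherever you saved the script above:

    # Edit the crontab with: crontab -e
    0 2 * * * /path/to/incremental_transfer.sh >> /var/log/databoutique_transfer.log 2>&1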

Option B: Using Storage Transfer Service with Filters

Google Cloud’s Storage Transfer Service can filter objects by their last-modification time, so each run picks up only files added or changed since the previous run, enabling efficient incremental transfers.

Steps for Incremental Transfer with Storage Transfer Service

  1. Create a Transfer Job:
    • In the Storage Transfer Service, create a new transfer job and select your Data Boutique S3 bucket as the source and your GCS bucket as the destination.
  2. Add Filters for Incremental Transfer:
    • Under Source Options, set a filter based on the file’s last modification time (e.g., to transfer only files modified in the last day). This ensures that only new files are transferred each time the job runs.
  3. Schedule Incremental Transfers:
    • Set the job to run on a regular schedule, such as daily or weekly. The filter settings will ensure that only new or modified files are transferred.
  4. Review Logs and Status:
    • After each run, review logs to ensure that only new files were transferred.
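
If you prefer the command line for this check, recent gcloud releases expose transfer job runs as operations; the job name below is a placeholder, and the exact flags should be confirmed with gcloud transfer operations list --help:

    # List recent runs of a transfer job and their status
    gcloud transfer operations list --job-names=YOUR_TRANSFER_JOB_NAME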

Option C: Using Python Script with State Tracking

For more control, a Python script can be used to track previously transferred files and handle incremental updates.

Python Script for Incremental Transfer

import boto3
from google.cloud import storage
import os

# AWS and GCS configuration (replace the placeholder values with your own)
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_SECRET_KEY'
S3_BUCKET = 'databoutique.com'
BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/'
GCS_BUCKET = 'your-gcs-bucket-name'
LOCAL_DOWNLOAD_PATH = '/tmp/databoutique_files/'
STATE_FILE = 'transferred_files.txt'

# Initialize AWS S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY
)

# Initialize Google Cloud Storage client
gcs_client = storage.Client()
gcs_bucket = gcs_client.bucket(GCS_BUCKET)

# Load the set of previously transferred files from the state file
def load_transferred_files():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'r') as f:
            return set(line.strip() for line in f)
    return set()

# Append newly transferred files to the state file
def save_transferred_files(files):
    with open(STATE_FILE, 'a') as f:
        for file in files:
            f.write(f"{file}\n")

def transfer_new_files():
    transferred_files = load_transferred_files()
    new_transfers = []

    # Make sure the local staging directory exists
    os.makedirs(LOCAL_DOWNLOAD_PATH, exist_ok=True)

    # List objects under the buyer prefix (returns up to 1,000 keys per call;
    # use a paginator if your dataset is larger)
    s3_objects = s3_client.list_objects_v2(Bucket=S3_BUCKET, Prefix=BUYER_PATH)

    for obj in s3_objects.get('Contents', []):
        s3_key = obj['Key']

        # Skip folder placeholder keys
        if s3_key.endswith('/'):
            continue

        if s3_key not in transferred_files:
            local_file_path = os.path.join(LOCAL_DOWNLOAD_PATH, os.path.basename(s3_key))

            # Download from S3
            s3_client.download_file(S3_BUCKET, s3_key, local_file_path)
            print(f"Downloaded {s3_key} from Data Boutique S3 bucket")

            # Upload to GCS under the same key
            blob = gcs_bucket.blob(s3_key)
            blob.upload_from_filename(local_file_path)
            print(f"Uploaded {s3_key} to Google Cloud Storage")

            # Record the new transfer
            new_transfers.append(s3_key)

            # Delete the local copy to save space
            os.remove(local_file_path)

    # Update the state file with new transfers
    save_transferred_files(new_transfers)

# Run the transfer
if __name__ == '__main__':
    transfer_new_files()

This script:

  1. Loads a list of previously transferred files from transferred_files.txt.
  2. Downloads and uploads only files that aren’t in this list.
  3. Updates transferred_files.txt with newly transferred files.
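
To run the script, the two client libraries need to be installed and the GCS client needs credentials. A minimal setup sketch, assuming you save the code as databoutique_transfer.py (a hypothetical file name) and authenticate GCS with a service-account key file, could be:

    # Install the AWS and Google Cloud Storage client libraries
    pip install boto3 google-cloud-storage

    # Point the GCS client at a service-account key
    # (alternatively, run: gcloud auth application-default login)
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

    # Run the incremental transfer
    python databoutique_transfer.py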

Conclusion

By following this guide, you can transfer data from Data Boutique’s AWS S3 bucket to Google Cloud Storage using either a full transfer or an incremental transfer for new files only. Each approach supports flexible, efficient data movement for one-time setups or regular updates, minimizing bandwidth and storage costs across cloud platforms.

