2.3.1.5 Copying Data from Data Boutique's AWS S3 to Google Cloud Storage
Data Boutique provides access to datasets through a dedicated AWS S3 bucket, allowing you to retrieve the latest data directly. This guide covers two methods to transfer data from Data Boutique’s S3 bucket to Google Cloud Storage (GCS): a Full Transfer method for transferring all files and an Incremental Transfer method to copy only new files on a regular schedule.
Method 1: Full Transfer of All Data
A full transfer is ideal when you want to copy the entire dataset from your Data Boutique S3 bucket to Google Cloud Storage. This approach is simple and efficient if you’re setting up the data for the first time or occasionally need to refresh the entire dataset.
Option A: Using the gsutil Command
The gsutil command-line tool from the Google Cloud SDK makes it easy to copy all files from AWS S3 to GCS.
Prerequisites
- Google Cloud SDK: Install the Google Cloud SDK, which includes gsutil.
- AWS CLI: Install the AWS CLI for authentication with AWS.
- Access Permissions: Ensure you have read permissions for your Data Boutique S3 bucket and write permissions for your GCS bucket.
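If either tool is missing, the commands below are one common way to install them on Linux or macOS (the AWS CLI installer shown is the Linux x86_64 build; adjust the URL and steps for your platform):

```bash
# Install the Google Cloud SDK (includes gsutil) via the official installer script
curl https://sdk.cloud.google.com | bash
exec -l $SHELL   # reload the shell so gcloud and gsutil are on PATH

# Install the AWS CLI v2 (Linux x86_64 build shown)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Verify both tools are available
gcloud --version
aws --version
```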
Steps for Full Transfer
- Configure AWS CLI: Run aws configure and enter the AWS credentials provided by Data Boutique.
- Authenticate gsutil: Log in to Google Cloud by running gcloud auth login.
- Execute Full Transfer: Run the following command to copy all files from the Data Boutique S3 bucket to your GCS bucket:

```bash
gsutil -m cp -r s3://databoutique.com/buyers/[Your_AWS_Access_Key]/* gs://your-gcs-bucket-name/
```

Replace [Your_AWS_Access_Key] with your unique AWS access key from Data Boutique. The -m flag enables parallel processing, and -r copies recursively so every file under your buyer path is included.
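If gsutil cannot pick up your AWS credentials from the AWS CLI configuration, one documented alternative is to add them to the [Credentials] section of gsutil's boto configuration file. The sketch below assumes the default ~/.boto location; adjust the path if your setup differs:

```bash
# Append the Data Boutique AWS keys to the gsutil/boto config so gsutil can read s3:// URLs
cat >> ~/.boto <<'EOF'
[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY
aws_secret_access_key = YOUR_AWS_SECRET_KEY
EOF
```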
Option B: Using Google Cloud Storage Transfer Service
Google Cloud’s Storage Transfer Service allows for fully automated, scheduled transfers from AWS S3 to GCS, ideal for one-time or recurring full transfers.
Prerequisites
- AWS IAM User: Ensure you have an IAM user with AmazonS3ReadOnlyAccess permissions.
- Access Key and Secret Key: Use the access key and secret key provided by Data Boutique.
- GCS Bucket: Create a Google Cloud Storage bucket as your destination.
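If you have not created the destination bucket yet, it can also be created from the command line; the region below is only an example:

```bash
# Create the destination GCS bucket in a region of your choice
gsutil mb -l us-central1 gs://your-gcs-bucket-name/
```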
Steps for Full Transfer with Storage Transfer Service
- Open Storage Transfer Service:
- In Google Cloud Console, go to Storage Transfer Service.
- Create a Transfer Job:
- Click Create transfer.
- Under Source, select Amazon S3 bucket and enter the bucket name databoutique.com.
- Enter your AWS Access Key ID and Secret Access Key from Data Boutique.
- Configure Destination:
- Select your Google Cloud Storage bucket as the destination.
- Define Transfer Settings:
- Choose options like overwriting files, deleting source files post-transfer, and email notifications.
- Start the Transfer Job:
- Review and confirm settings, then start the job. Storage Transfer Service will handle the full transfer.
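If you prefer the command line over the Console, a similar full transfer can be created with the gcloud transfer commands. The sketch below assumes a recent Google Cloud SDK; the flag names and the credentials-file format are worth confirming against gcloud transfer jobs create --help before use:

```bash
# Store the Data Boutique AWS keys in a local JSON file for the transfer job
cat > aws-creds.json <<'EOF'
{"accessKeyId": "YOUR_AWS_ACCESS_KEY", "secretAccessKey": "YOUR_AWS_SECRET_KEY"}
EOF

# Create a one-off transfer job from the S3 bucket to the GCS bucket,
# limited to your buyer prefix
gcloud transfer jobs create \
  s3://databoutique.com gs://your-gcs-bucket-name \
  --source-creds-file=aws-creds.json \
  --include-prefixes=buyers/YOUR_AWS_ACCESS_KEY/
```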
Method 2: Incremental Transfer (Only New Files)
Incremental transfers are optimal when you need to copy only new or modified files regularly. This approach saves bandwidth and storage space by only copying files that aren’t already in GCS.
Option A: Using gsutil with State Tracking
The gsutil command can be adapted for incremental transfers by keeping a record of previously transferred files.
Steps for Incremental Transfer
- Create a State File: Create a local file, transferred_files.txt, to store a list of already transferred files. This file will be used to track new files in future transfers.
- Run Incremental Transfer Script: Use the following script to copy only files not listed in transferred_files.txt:
```bash
#!/bin/bash
# Incrementally copy objects from the Data Boutique S3 bucket to GCS,
# tracking already-transferred keys in a local state file.

# Define paths
S3_BUCKET="databoutique.com"
S3_PREFIX="buyers/YOUR_AWS_ACCESS_KEY/"
GCS_BUCKET="gs://your-gcs-bucket-name/"
STATE_FILE="transferred_files.txt"

# Ensure state file exists
touch "$STATE_FILE"

# List object keys under the buyer prefix and copy any key not yet in the state file
aws s3 ls "s3://$S3_BUCKET/$S3_PREFIX" --recursive | awk '{print $4}' | while read -r file; do
  if ! grep -qxF "$file" "$STATE_FILE"; then
    # Copy file to GCS and record it in the state file
    gsutil cp "s3://$S3_BUCKET/$file" "$GCS_BUCKET"
    echo "$file" >> "$STATE_FILE"
    echo "Transferred new file: $file"
  fi
done
```

Explanation: The script checks each object key in the S3 bucket against transferred_files.txt before copying it to GCS, then records newly transferred keys in the state file for future runs.
- Schedule Regular Incremental Transfers: Use cron (Linux/macOS) or Task Scheduler (Windows) to run the script regularly for incremental updates; an example crontab entry is shown below.
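For example, the following crontab entry (paths are illustrative) runs the script every day at 2:00 AM and appends its output to a log file:

```bash
# m h dom mon dow  command
0 2 * * * /home/youruser/scripts/incremental_transfer.sh >> /home/youruser/logs/databoutique_transfer.log 2>&1
```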
Option B: Using Storage Transfer Service with Filters
Google Cloud’s Storage Transfer Service can be set to filter out previously transferred files based on the last modification date, enabling efficient incremental transfers.
Steps for Incremental Transfer with Storage Transfer Service
- Create a Transfer Job:
- In the Storage Transfer Service, create a new transfer job and select your Data Boutique S3 bucket as the source and your GCS bucket as the destination.
- Add Filters for Incremental Transfer:
- Under Source Options, set a filter based on the file’s last modification time (e.g., to transfer only files modified in the last day). This ensures that only new files are transferred each time the job runs.
- Schedule Incremental Transfers:
- Set the job to run on a regular schedule, such as daily or weekly. The filter settings will ensure that only new or modified files are transferred.
- Review Logs and Status:
- After each run, review logs to ensure that only new files were transferred.
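An equivalent scheduled job can also be sketched from the command line, reusing the aws-creds.json file from the full-transfer example. The scheduling and modification-time flag names below are assumptions to confirm against gcloud transfer jobs create --help for your SDK version:

```bash
# Daily incremental job: only objects modified in the last day are transferred
gcloud transfer jobs create \
  s3://databoutique.com gs://your-gcs-bucket-name \
  --source-creds-file=aws-creds.json \
  --include-prefixes=buyers/YOUR_AWS_ACCESS_KEY/ \
  --schedule-repeats-every=1d \
  --include-modified-after-relative=1d
```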
Option C: Using Python Script with State Tracking
For more control, a Python script can be used to track previously transferred files and handle incremental updates.
Python Script for Incremental Transfer
```python
import os

import boto3
from google.cloud import storage

# AWS and GCS configuration
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_SECRET_KEY'
S3_BUCKET = 'databoutique.com'
BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/'
GCS_BUCKET = 'your-gcs-bucket-name'
LOCAL_DOWNLOAD_PATH = '/tmp/databoutique_files/'
STATE_FILE = 'transferred_files.txt'

# Initialize AWS S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY
)

# Initialize Google Cloud Storage client (uses your Application Default Credentials)
gcs_client = storage.Client()
gcs_bucket = gcs_client.bucket(GCS_BUCKET)


def load_transferred_files():
    """Load the set of already transferred keys from the state file."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'r') as f:
            return set(line.strip() for line in f)
    return set()


def save_transferred_files(files):
    """Append newly transferred keys to the state file."""
    with open(STATE_FILE, 'a') as f:
        for file in files:
            f.write(f"{file}\n")


def transfer_new_files():
    transferred_files = load_transferred_files()
    new_transfers = []

    # Make sure the local staging directory exists
    os.makedirs(LOCAL_DOWNLOAD_PATH, exist_ok=True)

    # List objects under the buyer prefix, paginating past the 1,000-object limit
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=BUYER_PATH):
        for obj in page.get('Contents', []):
            s3_key = obj['Key']

            # Skip folder placeholder keys and anything already transferred
            if s3_key.endswith('/') or s3_key in transferred_files:
                continue

            local_file_path = os.path.join(LOCAL_DOWNLOAD_PATH, os.path.basename(s3_key))

            # Download from S3
            s3_client.download_file(S3_BUCKET, s3_key, local_file_path)
            print(f"Downloaded {s3_key} from Data Boutique S3 bucket")

            # Upload to GCS, preserving the S3 key as the object name
            blob = gcs_bucket.blob(s3_key)
            blob.upload_from_filename(local_file_path)
            print(f"Uploaded {s3_key} to Google Cloud Storage")

            # Record the transfer and delete the local copy to save space
            new_transfers.append(s3_key)
            os.remove(local_file_path)

    # Update the state file with new transfers
    save_transferred_files(new_transfers)


if __name__ == '__main__':
    transfer_new_files()
```
This script:
- Loads the list of previously transferred files from transferred_files.txt.
- Downloads and uploads only files that aren't already in that list.
- Updates transferred_files.txt with the newly transferred files.
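To run the script, install the two client libraries and make Google Cloud credentials available to storage.Client(); one way to do this (assuming the Google Cloud SDK is installed) is:

```bash
# Install the AWS and GCS client libraries used by the script
pip install boto3 google-cloud-storage

# Provide Application Default Credentials for the google-cloud-storage client
gcloud auth application-default login
```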
Conclusion
By following this guide, you can transfer data from Data Boutique’s AWS S3 bucket to Google Cloud Storage using either a full transfer or an incremental transfer for new files only. Each approach supports flexible, efficient data movement for one-time setups or regular updates, minimizing bandwidth and storage costs across cloud platforms.