2.3.1.4 Copying Locally and Managing Local Storage of Data from AWS S3 on Data Boutique
Data Boutique delivers your purchased datasets to a dedicated AWS S3 bucket, giving you secure access to your files. For local processing and easy access in BI tools, you may want to copy these files from S3 to a specified local directory. This article provides step-by-step instructions for automating this download process, ensuring that your local storage is always synchronized with new datasets in your S3 bucket.
How Local Storage Management Works
This setup includes two key parts:
- Automate the Download of New Files: A Python script checks your AWS S3 bucket for new files and downloads them to a designated local directory.
- Schedule Regular Checks for Updates: Using system scheduling tools, you can automate the download process to ensure local storage is always up-to-date with the latest files from Data Boutique.
Step 1: Set Up Automated Download of New Files
Using Python and the boto3 library, you can create a script that checks for new files in your S3 bucket and downloads them to a specified directory on your local machine.
Python Code to Download New Files
- Install the boto3 library: This library enables you to interact with AWS services. Install it with pip install boto3.
- Create the Python Script: Use the following code to list and download new files from your S3 bucket to a local directory.
import boto3
import os

# AWS credentials and configuration
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_SECRET_KEY'
S3_BUCKET = 'databoutique.com'
BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/'

# Directory to save downloaded files
download_directory = './data_files/'

# Initialize S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name='eu-central-1'
)

def download_new_files():
    # Make sure the local download directory exists
    os.makedirs(download_directory, exist_ok=True)

    # List objects in the S3 bucket under the buyer prefix
    response = s3_client.list_objects_v2(Bucket=S3_BUCKET, Prefix=BUYER_PATH)
    current_files = set(item['Key'] for item in response.get('Contents', []))

    # Loop through files and download any that are not already present locally
    for file_key in current_files:
        if file_key.endswith('/'):
            continue  # skip the prefix "folder" placeholder itself
        local_file_path = os.path.join(download_directory, os.path.basename(file_key))
        if not os.path.exists(local_file_path):
            s3_client.download_file(S3_BUCKET, file_key, local_file_path)
            print(f"Downloaded new file: {local_file_path}")

# Run the function
download_new_files()
Explanation
- AWS Credentials: Replace YOUR_AWS_ACCESS_KEY and YOUR_SECRET_KEY with the credentials provided by Data Boutique.
- BUYER_PATH: Replace YOUR_BUYER_ACCESS_KEY with your buyer access key so the prefix points to your folder in the bucket.
- download_new_files(): This function lists the files under your prefix in the S3 bucket and downloads only those that are not already present locally (a variant that handles larger folders is sketched below).
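A note on larger folders: list_objects_v2 returns at most 1,000 keys per call, so once your buyer folder grows beyond that, a single call will miss files. The sketch below is a minimal variant of download_new_files() that uses boto3's built-in paginator; it assumes the same s3_client, S3_BUCKET, BUYER_PATH, and download_directory defined in the script above.

def download_new_files_paginated():
    # Variant of download_new_files() that handles more than 1,000 objects
    os.makedirs(download_directory, exist_ok=True)
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=BUYER_PATH):
        for item in page.get('Contents', []):
            file_key = item['Key']
            if file_key.endswith('/'):
                continue  # skip "folder" placeholder keys
            local_file_path = os.path.join(download_directory, os.path.basename(file_key))
            if not os.path.exists(local_file_path):
                s3_client.download_file(S3_BUCKET, file_key, local_file_path)
                print(f"Downloaded new file: {local_file_path}")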
Step 2: Schedule the Download Script to Run Regularly
To keep your local storage in sync with the S3 bucket, schedule the download script to run at regular intervals using your operating system’s scheduler.
On Windows
- Open Task Scheduler.
- Select Create Basic Task and name the task.
- Set the trigger (e.g., daily or weekly).
- For Action, choose Start a Program and enter:
  - Program/script: python
  - Add arguments: path\to\your\script.py
- Save and activate the task.
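If you prefer the command line to the Task Scheduler interface, the same task can be created with the schtasks tool. The command below is a sketch: the task name DataBoutiqueSync and the script path are placeholders to adapt, and it assumes python is on your PATH.

schtasks /Create /SC DAILY /ST 02:00 /TN "DataBoutiqueSync" /TR "python C:\path\to\your\script.py"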
On macOS or Linux
Use cron to schedule the script:
- Open the terminal.
- Run crontab -e to edit your cron jobs.
- Add a new line to schedule the script. For example, to run daily at 2 a.m.:
0 2 * * * python /path/to/your/script.py
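To keep a record of each run, you can also redirect the script's output to a log file. The line below is an example that assumes a log file at /path/to/sync.log; adjust both paths to your setup:

0 2 * * * python /path/to/your/script.py >> /path/to/sync.log 2>&1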
Step 3: Verify and Monitor Local Storage
To ensure the process is working smoothly:
- Check Logs: Add print statements or logging within the script to record which files were downloaded on each run.
- Clear Old Files (Optional): Periodically review the local directory and remove outdated files if your storage space is limited. A sketch covering both points follows this list.
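As a concrete example of both points, the sketch below swaps the print statement for Python's standard logging module and removes local files older than a chosen retention period. The sync.log file name and the 30-day retention window are assumptions; adjust them to your needs.

import logging
import os
import time

# Write a simple log file next to the script (assumed name)
logging.basicConfig(
    filename='sync.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'
)

RETENTION_DAYS = 30  # assumed retention period; adjust as needed
download_directory = './data_files/'

def clear_old_files():
    # Remove local files older than RETENTION_DAYS to limit disk usage
    cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60
    for name in os.listdir(download_directory):
        path = os.path.join(download_directory, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            logging.info("Removed old file: %s", path)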
This setup provides a streamlined and automated way to manage local storage of your Data Boutique datasets, readying them for further analysis or integration into BI tools.