2.3.1.3 Automating the Listing of New Files Delivered to Your AWS S3 Bucket in Data Boutique
Data Boutique delivers purchased datasets to a dedicated AWS S3 bucket, allowing secure access to your data files. To help track new files without manual effort, you can automate the listing process, ensuring you always have the latest datasets readily available. This guide provides instructions and code samples in Python, Java, Bash, and Node.js to list all files in your S3 bucket and identify any new files since the previous scan.
How Automated File Listing Works
Each script in this guide saves a record of previously scanned files in a local file (previous_files.txt). During each scan, the script will:
- Retrieve the current list of files from your S3 bucket.
- Compare this list to the previous record.
- Output any newly detected files since the last scan.
The local record is updated after each scan, ensuring that only genuinely new files are identified in future scans.
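At its core, each scan is a set difference between two listings. A minimal Python sketch of that compare step (the key names below are hypothetical):

previous_files = {"buyers/KEY/2024-05-01.csv"}
current_files = {"buyers/KEY/2024-05-01.csv", "buyers/KEY/2024-05-02.csv"}

new_files = current_files - previous_files  # keys present now but not before
print(new_files)  # {'buyers/KEY/2024-05-02.csv'}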
1. Listing New Files Using Python
In Python, the boto3 library provides an efficient way to interact with AWS S3. The script below will load the list of previously scanned files, retrieve the current list from S3, and output any new files.
Python Code Example
First, install the boto3 library:
pip install boto3
Use this Python script to detect new files:
import boto3
import os

# AWS credentials
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_SECRET_KEY'
S3_BUCKET = 'databoutique.com'
BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/'

# Path to save previous file list
previous_files_path = 'previous_files.txt'

# Initialize S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name='eu-central-1'
)

def load_previous_files():
    """Read the previous scan's file list, if one exists."""
    if os.path.exists(previous_files_path):
        with open(previous_files_path, 'r') as f:
            return set(f.read().splitlines())
    return set()

def save_current_files(file_list):
    """Persist the current scan's file list for the next run."""
    with open(previous_files_path, 'w') as f:
        for file in file_list:
            f.write(f"{file}\n")

def list_new_files_in_s3():
    # Load the previous file list
    previous_files = load_previous_files()

    # List current files in S3, paginating in case the prefix
    # holds more than 1,000 objects (the per-response limit)
    current_files = set()
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=BUYER_PATH):
        for item in page.get('Contents', []):
            current_files.add(item['Key'])

    # Identify new files
    new_files = current_files - previous_files
    if new_files:
        print("New files detected:")
        for file in new_files:
            print(file)
    else:
        print("No new files found.")

    # Save the current file list for future scans
    save_current_files(current_files)
    return new_files

# Run the function
list_new_files_in_s3()
Explanation
- load_previous_files(): Reads the previous scan’s file list from previous_files.txt.
- save_current_files(): Saves the current scan’s file list to previous_files.txt.
- list_new_files_in_s3(): Compares the previous and current lists, prints any new files, and returns the set of new keys.
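Once new files are detected, a common next step is to download them. The following is a minimal sketch that reuses s3_client and S3_BUCKET from the script above, together with the set returned by list_new_files_in_s3(); the downloads directory is a hypothetical choice, not part of the Data Boutique setup:

import os

DOWNLOAD_DIR = 'downloads'  # hypothetical local destination

def download_new_files(new_files):
    # Create the destination directory if it does not exist yet
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    for key in new_files:
        # Use the object key's basename as the local file name
        local_path = os.path.join(DOWNLOAD_DIR, os.path.basename(key))
        s3_client.download_file(S3_BUCKET, key, local_path)
        print(f"Downloaded {key} -> {local_path}")

For example, the final call in the script above could become download_new_files(list_new_files_in_s3()).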
2. Listing New Files Using Java
Java’s AWS SDK also supports listing files and identifying new additions in your S3 bucket.
Java Code Example
Add the AWS SDK dependency to your pom.xml:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
    <version>1.12.0</version>
</dependency>
Use the following Java code to identify new files:
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class S3NewFileDetector {
    private static final String AWS_ACCESS_KEY = "YOUR_AWS_ACCESS_KEY";
    private static final String AWS_SECRET_KEY = "YOUR_SECRET_KEY";
    private static final String S3_BUCKET = "databoutique.com";
    private static final String BUYER_PATH = "buyers/YOUR_BUYER_ACCESS_KEY/";
    private static final String PREVIOUS_FILES_PATH = "previous_files.txt";

    public static void main(String[] args) {
        BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY);
        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withRegion("eu-central-1")
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                .build();

        Set<String> previousFiles = loadPreviousFiles();
        Set<String> currentFiles = new HashSet<>();

        // Page through the listing; each response returns at most 1,000 objects
        ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(S3_BUCKET).withPrefix(BUYER_PATH);
        ListObjectsV2Result result;
        do {
            result = s3Client.listObjectsV2(req);
            for (S3ObjectSummary objectSummary : result.getObjectSummaries()) {
                String fileName = objectSummary.getKey();
                currentFiles.add(fileName);
                // Report keys that were not present in the previous scan
                if (!previousFiles.contains(fileName)) {
                    System.out.println("New file detected: " + fileName);
                }
            }
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());

        saveCurrentFiles(currentFiles);
    }

    // Read the previous scan's file list; an empty set means this is the first scan
    private static Set<String> loadPreviousFiles() {
        Set<String> previousFiles = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader(PREVIOUS_FILES_PATH))) {
            String line;
            while ((line = br.readLine()) != null) {
                previousFiles.add(line);
            }
        } catch (IOException e) {
            System.out.println("No previous files record found.");
        }
        return previousFiles;
    }

    // Persist the current scan's file list for the next run
    private static void saveCurrentFiles(Set<String> fileList) {
        try (PrintWriter pw = new PrintWriter(new FileWriter(PREVIOUS_FILES_PATH))) {
            for (String file : fileList) {
                pw.println(file);
            }
        } catch (IOException e) {
            System.err.println("Error saving current files list: " + e.getMessage());
        }
    }
}
3. Listing New Files Using Bash
A Bash script using the AWS CLI can quickly list and identify new files in your S3 bucket.
Bash Code Example
#!/bin/bash

# File paths
PREVIOUS_FILES="previous_files.txt"
CURRENT_FILES="current_files.txt"

# List current files in S3 (the fourth column of `aws s3 ls` output is the
# object key; this simple parsing assumes keys contain no spaces)
aws s3 ls s3://databoutique.com/buyers/YOUR_BUYER_ACCESS_KEY/ --recursive | awk '{print $4}' > "$CURRENT_FILES"

# Compare with the previous file list
if [ -f "$PREVIOUS_FILES" ]; then
    NEW_FILES=$(diff "$PREVIOUS_FILES" "$CURRENT_FILES" | grep "^>" | awk '{print $2}')
    if [ -n "$NEW_FILES" ]; then
        echo "New files detected:"
        echo "$NEW_FILES"
    else
        echo "No new files found."
    fi
else
    echo "No previous file list found. Saving current list as previous."
fi

# Update the previous file list
mv "$CURRENT_FILES" "$PREVIOUS_FILES"
Explanation
- The AWS CLI command lists files and writes them to current_files.txt.
- diff compares the new list to previous_files.txt, showing only new files.
- The script updates previous_files.txt with the latest list for future scans.
4. Listing New Files Using Node.js
The AWS SDK in Node.js also allows file listing and new file detection.
Node.js Code Example
const AWS = require('aws-sdk');
const fs = require('fs');

AWS.config.update({
    accessKeyId: 'YOUR_AWS_ACCESS_KEY',
    secretAccessKey: 'YOUR_SECRET_KEY',
    region: 'eu-central-1'
});

const s3 = new AWS.S3();
const S3_BUCKET = 'databoutique.com';
const BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/';
const previousFilesPath = 'previous_files.txt';

function loadPreviousFiles() {
    // Read the previous scan's file list, if one exists
    if (fs.existsSync(previousFilesPath)) {
        return new Set(fs.readFileSync(previousFilesPath, 'utf-8').split('\n').filter(Boolean));
    }
    return new Set();
}

function saveCurrentFiles(fileList) {
    // Persist the current scan's file list for the next run
    fs.writeFileSync(previousFilesPath, Array.from(fileList).join('\n'));
}

async function listNewFilesInS3() {
    const previousFiles = loadPreviousFiles();
    const currentFiles = new Set();

    // Page through the listing in case the prefix holds more than 1,000 objects
    const params = { Bucket: S3_BUCKET, Prefix: BUYER_PATH };
    let data;
    do {
        data = await s3.listObjectsV2(params).promise();
        (data.Contents || []).forEach(file => currentFiles.add(file.Key));
        params.ContinuationToken = data.NextContinuationToken;
    } while (data.IsTruncated);

    // Report only keys that were not present in the previous scan
    const newFiles = [...currentFiles].filter(key => !previousFiles.has(key));
    if (newFiles.length > 0) {
        console.log('New files detected:');
        newFiles.forEach(key => console.log(key));
    } else {
        console.log('No new files found.');
    }

    saveCurrentFiles(currentFiles);
}

listNewFilesInS3().catch(err => console.error('Error listing files:', err));
Conclusion
These scripts provide an efficient way to automate the listing and detection of new files in your Data Boutique AWS S3 bucket. This setup streamlines your data tracking, letting you access new datasets as soon as they're delivered and integrate them smoothly into your data workflows.
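If you want scans to run unattended, one simple option is to re-run the Python function on a fixed interval, as in the sketch below; the interval is an arbitrary example, and a scheduler such as cron is usually the better choice in production:

import time

SCAN_INTERVAL_SECONDS = 3600  # hypothetical interval: one scan per hour

while True:
    list_new_files_in_s3()  # the function from the Python example above
    time.sleep(SCAN_INTERVAL_SECONDS)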