2.3.1.3 Automating the Listing of New Files Delivered to Your AWS S3 Bucket in Data Boutique
Data Boutique delivers purchased datasets to a dedicated AWS S3 bucket, allowing secure access to your data files. To help track new files without manual effort, you can automate the listing process, ensuring you always have the latest datasets readily available. This guide provides instructions and code samples in Python, Java, Bash, and Node.js to list all files in your S3 bucket and identify any new files since the previous scan.
How Automated File Listing Works
Each script in this guide saves a record of previously scanned files in a local file (previous_files.txt). During each scan, the script will:
- Retrieve the current list of files from your S3 bucket.
- Compare this list to the previous record.
- Output any newly detected files since the last scan.
The local record is updated after each scan, ensuring that only genuinely new files are identified in future scans.
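At its core, each scan is a set difference between two listings. A minimal Python sketch of that compare step (the key names below are hypothetical):

previous_files = {"buyers/KEY/2024-05-01.csv"}
current_files = {"buyers/KEY/2024-05-01.csv", "buyers/KEY/2024-05-02.csv"}

new_files = current_files - previous_files  # keys present now but not before
print(new_files)  # {'buyers/KEY/2024-05-02.csv'}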
1. Listing New Files Using Python
In Python, the boto3 library provides an efficient way to interact with AWS S3. The script below will load the list of previously scanned files, retrieve the current list from S3, and output any new files.
Python Code Example
First, install the boto3 library:
pip install boto3
Use this Python script to detect new files:
import boto3
import os

# AWS credentials
AWS_ACCESS_KEY = 'YOUR_AWS_ACCESS_KEY'
AWS_SECRET_KEY = 'YOUR_SECRET_KEY'
S3_BUCKET = 'databoutique.com'
BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/'

# Path to save previous file list
previous_files_path = 'previous_files.txt'

# Initialize S3 client
s3_client = boto3.client(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name='eu-central-1'
)

def load_previous_files():
    """Read the previous scan's file list, if one exists."""
    if os.path.exists(previous_files_path):
        with open(previous_files_path, 'r') as f:
            return set(f.read().splitlines())
    return set()

def save_current_files(file_list):
    """Persist the current scan's file list for the next run."""
    with open(previous_files_path, 'w') as f:
        for file in file_list:
            f.write(f"{file}\n")

def list_new_files_in_s3():
    # Load the previous file list
    previous_files = load_previous_files()

    # List current files in S3, paginating in case the prefix
    # holds more than 1,000 objects (the per-response limit)
    current_files = set()
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=BUYER_PATH):
        for item in page.get('Contents', []):
            current_files.add(item['Key'])

    # Identify new files
    new_files = current_files - previous_files
    if new_files:
        print("New files detected:")
        for file in new_files:
            print(file)
    else:
        print("No new files found.")

    # Save the current file list for future scans
    save_current_files(current_files)
    return new_files

# Run the function
list_new_files_in_s3()
Explanation
- load_previous_files(): Reads the previous scan’s file list from previous_files.txt.
- save_current_files(): Saves the current scan’s file list to previous_files.txt.
- list_new_files_in_s3(): Compares the previous and current lists, prints any new files, and returns the set of new keys.
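Once new files are detected, a common next step is to download them. The following is a minimal sketch that reuses s3_client and S3_BUCKET from the script above, together with the set returned by list_new_files_in_s3(); the downloads directory is a hypothetical choice, not part of the Data Boutique setup:

import os

DOWNLOAD_DIR = 'downloads'  # hypothetical local destination

def download_new_files(new_files):
    # Create the destination directory if it does not exist yet
    os.makedirs(DOWNLOAD_DIR, exist_ok=True)
    for key in new_files:
        # Use the object key's basename as the local file name
        local_path = os.path.join(DOWNLOAD_DIR, os.path.basename(key))
        s3_client.download_file(S3_BUCKET, key, local_path)
        print(f"Downloaded {key} -> {local_path}")

For example, the final call in the script above could become download_new_files(list_new_files_in_s3()).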
2. Listing New Files Using Java
Java’s AWS SDK also supports listing files and identifying new additions in your S3 bucket.
Java Code Example
Add the AWS SDK dependency to your pom.xml:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-s3</artifactId>
    <version>1.12.0</version>
</dependency>
Use the following Java code to identify new files:
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class S3NewFileDetector {
    private static final String AWS_ACCESS_KEY = "YOUR_AWS_ACCESS_KEY";
    private static final String AWS_SECRET_KEY = "YOUR_SECRET_KEY";
    private static final String S3_BUCKET = "databoutique.com";
    private static final String BUYER_PATH = "buyers/YOUR_BUYER_ACCESS_KEY/";
    private static final String PREVIOUS_FILES_PATH = "previous_files.txt";

    public static void main(String[] args) {
        BasicAWSCredentials awsCreds = new BasicAWSCredentials(AWS_ACCESS_KEY, AWS_SECRET_KEY);
        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withRegion("eu-central-1")
                .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
                .build();

        Set<String> previousFiles = loadPreviousFiles();
        Set<String> currentFiles = new HashSet<>();

        // Page through the listing; each response returns at most 1,000 objects
        ListObjectsV2Request req = new ListObjectsV2Request().withBucketName(S3_BUCKET).withPrefix(BUYER_PATH);
        ListObjectsV2Result result;
        do {
            result = s3Client.listObjectsV2(req);
            for (S3ObjectSummary objectSummary : result.getObjectSummaries()) {
                String fileName = objectSummary.getKey();
                currentFiles.add(fileName);
                // Report keys that were not present in the previous scan
                if (!previousFiles.contains(fileName)) {
                    System.out.println("New file detected: " + fileName);
                }
            }
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());

        saveCurrentFiles(currentFiles);
    }

    // Read the previous scan's file list; an empty set means this is the first scan
    private static Set<String> loadPreviousFiles() {
        Set<String> previousFiles = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader(PREVIOUS_FILES_PATH))) {
            String line;
            while ((line = br.readLine()) != null) {
                previousFiles.add(line);
            }
        } catch (IOException e) {
            System.out.println("No previous files record found.");
        }
        return previousFiles;
    }

    // Persist the current scan's file list for the next run
    private static void saveCurrentFiles(Set<String> fileList) {
        try (PrintWriter pw = new PrintWriter(new FileWriter(PREVIOUS_FILES_PATH))) {
            for (String file : fileList) {
                pw.println(file);
            }
        } catch (IOException e) {
            System.err.println("Error saving current files list: " + e.getMessage());
        }
    }
}
3. Listing New Files Using Bash
A Bash script using the AWS CLI can quickly list and identify new files in your S3 bucket.
Bash Code Example
#!/bin/bash

# File paths
PREVIOUS_FILES="previous_files.txt"
CURRENT_FILES="current_files.txt"

# List current files in S3 (the fourth column of `aws s3 ls` output is the
# object key; this simple parsing assumes keys contain no spaces)
aws s3 ls s3://databoutique.com/buyers/YOUR_BUYER_ACCESS_KEY/ --recursive | awk '{print $4}' > "$CURRENT_FILES"

# Compare with the previous file list
if [ -f "$PREVIOUS_FILES" ]; then
    NEW_FILES=$(diff "$PREVIOUS_FILES" "$CURRENT_FILES" | grep "^>" | awk '{print $2}')
    if [ -n "$NEW_FILES" ]; then
        echo "New files detected:"
        echo "$NEW_FILES"
    else
        echo "No new files found."
    fi
else
    echo "No previous file list found. Saving current list as previous."
fi

# Update the previous file list
mv "$CURRENT_FILES" "$PREVIOUS_FILES"
Explanation
- The AWS CLI command lists files and writes them to current_files.txt.
- diff compares the new list to previous_files.txt, showing only new files.
- The script updates previous_files.txt with the latest list for future scans.
4. Listing New Files Using Node.js
The AWS SDK in Node.js also allows file listing and new file detection.
Node.js Code Example
const AWS = require('aws-sdk');
const fs = require('fs');

AWS.config.update({
    accessKeyId: 'YOUR_AWS_ACCESS_KEY',
    secretAccessKey: 'YOUR_SECRET_KEY',
    region: 'eu-central-1'
});

const s3 = new AWS.S3();
const S3_BUCKET = 'databoutique.com';
const BUYER_PATH = 'buyers/YOUR_BUYER_ACCESS_KEY/';
const previousFilesPath = 'previous_files.txt';

function loadPreviousFiles() {
    // Read the previous scan's file list, if one exists
    if (fs.existsSync(previousFilesPath)) {
        return new Set(fs.readFileSync(previousFilesPath, 'utf-8').split('\n').filter(Boolean));
    }
    return new Set();
}

function saveCurrentFiles(fileList) {
    // Persist the current scan's file list for the next run
    fs.writeFileSync(previousFilesPath, Array.from(fileList).join('\n'));
}

async function listNewFilesInS3() {
    const previousFiles = loadPreviousFiles();
    const currentFiles = new Set();

    // Page through the listing in case the prefix holds more than 1,000 objects
    const params = { Bucket: S3_BUCKET, Prefix: BUYER_PATH };
    let data;
    do {
        data = await s3.listObjectsV2(params).promise();
        (data.Contents || []).forEach(file => currentFiles.add(file.Key));
        params.ContinuationToken = data.NextContinuationToken;
    } while (data.IsTruncated);

    // Report only keys that were not present in the previous scan
    const newFiles = [...currentFiles].filter(key => !previousFiles.has(key));
    if (newFiles.length > 0) {
        console.log('New files detected:');
        newFiles.forEach(key => console.log(key));
    } else {
        console.log('No new files found.');
    }

    saveCurrentFiles(currentFiles);
}

listNewFilesInS3().catch(err => console.error('Error listing files:', err));
Conclusion
These scripts provide an efficient way to automate the listing and detection of new files in your Data Boutique AWS S3 bucket. This setup streamlines your data tracking, letting you access new datasets as soon as they're delivered and integrate them smoothly into your data workflows.
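If you want scans to run unattended, one simple option is to re-run the Python function on a fixed interval, as in the sketch below; the interval is an arbitrary example, and a scheduler such as cron is usually the better choice in production:

import time

SCAN_INTERVAL_SECONDS = 3600  # hypothetical interval: one scan per hour

while True:
    list_new_files_in_s3()  # the function from the Python example above
    time.sleep(SCAN_INTERVAL_SECONDS)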