2.3.1.8 Accessing and Querying Data from Data Boutique’s AWS S3 in Databricks

Data Boutique provides datasets stored in AWS S3, making it possible to access and analyze this data directly in Databricks. Rather than duplicating the data into Databricks storage, which can add cost, you can use one of several methods Databricks offers for accessing, querying, and processing data directly from S3. This guide covers those approaches.

Considerations for Accessing Data in Databricks

Databricks can process large-scale data efficiently, but reading data directly from AWS S3 rather than copying it into Databricks offers several benefits:

  • Cost Savings: Avoiding duplication of data in Databricks’ storage layer reduces storage costs.
  • Flexibility: Access data as needed from S3, allowing you to keep your Databricks storage streamlined.
  • Efficiency: Databricks integrates seamlessly with AWS S3, making it easy to query or analyze data without moving it into Databricks storage.

Note on Regions

Data Boutique’s AWS S3 data is stored in the eu1 region. If your Databricks workspace is in a different AWS region, be aware that cross-region data transfers may incur additional costs. To mitigate this, you can copy data to a bucket in your Databricks region, which can help optimize data access speed and costs.

Method 1: Mounting AWS S3 with Databricks File System (DBFS)

DBFS allows you to mount an S3 bucket as if it were a local directory, giving you easy access to S3 data. This is ideal for data you need to access frequently, as it integrates seamlessly with Databricks’ environment.

Steps to Mount S3 to DBFS

  1. Configure AWS Credentials in Databricks:
    • In your Databricks workspace, set up AWS credentials using Databricks secrets or directly in the notebook for testing purposes.
    # For quick testing only; in production, retrieve credentials from a Databricks
    # secret scope (e.g. dbutils.secrets.get(scope="...", key="...")) instead of hard-coding them.
    ACCESS_KEY = "YOUR_AWS_ACCESS_KEY"
    SECRET_KEY = "YOUR_AWS_SECRET_KEY"
  2. Mount S3 to DBFS:
    • Use the following command to mount the Data Boutique S3 bucket to a DBFS path. Replace [Your_AWS_Access_Key] with the AWS access key ID associated with your Data Boutique account.
    dbutils.fs.mount(
        source = "s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/",
        mount_point = "/mnt/databoutique_data",
        extra_configs = {"fs.s3a.access.key": ACCESS_KEY, "fs.s3a.secret.key": SECRET_KEY}
    )
  3. Access Data:
    • Once mounted, you can list files or read data directly from /mnt/databoutique_data/; a fuller read example follows this list.
    display(dbutils.fs.ls("/mnt/databoutique_data"))
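
The snippet below is a minimal sketch of working with the mount once it is in place. The wildcard file pattern is an assumption; adjust it to the files actually delivered under your bucket path.

    # Read the delivered CSV files from the mount into a DataFrame
    df = spark.read.format("csv") \
        .option("header", "true") \
        .load("/mnt/databoutique_data/*.csv")

    display(df)

    # Unmount when the data is no longer needed
    dbutils.fs.unmount("/mnt/databoutique_data")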

Method 2: Querying Data Directly from S3 Using Spark

For ad-hoc queries, you can load data from S3 directly into a Spark DataFrame in Databricks. This method is best for one-time analysis or temporary access to specific data.

Steps for Direct Querying with Spark

  1. Configure AWS Credentials:
    • In the notebook, set the AWS access keys as Spark configurations.
    spark.conf.set("fs.s3a.access.key", ACCESS_KEY)
    spark.conf.set("fs.s3a.secret.key", SECRET_KEY)
  2. Read Data into a Spark DataFrame:
    • Use the spark.read command to read data from S3 as a DataFrame.
    df = spark.read.format("csv") \
        .option("header", "true") \
        .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")

    display(df)
  3. Query and Analyze:
    • Perform your queries directly on this DataFrame, as sketched below. This approach lets you analyze data without persisting it in Databricks.
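
For example, a quick aggregation might look like the sketch below. The column names ("website", "price") are placeholders; substitute the columns of the dataset you purchased.

    from pyspark.sql import functions as F

    # Placeholder column names; adapt them to your dataset's schema
    summary = (df
        .filter(F.col("price").isNotNull())
        .groupBy("website")
        .agg(F.count("*").alias("rows"),
             F.avg(F.col("price").cast("double")).alias("avg_price")))

    display(summary)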

Method 3: Using Delta Lake for Incremental Data Loading

If your data in S3 is updated frequently, consider using Delta Lake to create a Delta Table on top of your S3 data. Delta Lake provides support for versioning and incremental updates, making it ideal for workflows that rely on regularly updated datasets.

Steps to Set Up Delta Lake on S3

  1. Define the Delta Table:
    • Load data from S3 into a Spark DataFrame, then save it as a Delta table in a location you can write to, such as your own S3 bucket or a DBFS path (the bucket name below is a placeholder).
    df = spark.read.format("csv") \
        .option("header", "true") \
        .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")

    df.write.format("delta").mode("overwrite").save("s3a://[your-delta-bucket]/databoutique/delta_table")
  2. Query the Delta Table:
    • After setting up, you can use Delta Lake’s built-in features to query data incrementally or perform updates and deletes as needed.
    delta_table = spark.read.format("delta").load("s3a://[your-delta-bucket]/databoutique/delta_table")
    display(delta_table)
  3. Use Incremental Updates:
    • Delta Lake supports MERGE statements for applying incremental updates, which helps keep your Delta table synchronized with new data arriving in S3; see the sketch after this list.
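
As an illustration, a MERGE from newly delivered files into the Delta table might look like the sketch below. The key column "id" and the bucket placeholders are assumptions; replace them with the actual identifier column and paths for your dataset.

    from delta.tables import DeltaTable

    # Location of the Delta table created in step 1 (placeholder bucket name)
    delta_path = "s3a://[your-delta-bucket]/databoutique/delta_table"

    # Latest files delivered to the Data Boutique bucket
    updates = spark.read.format("csv") \
        .option("header", "true") \
        .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")

    target = DeltaTable.forPath(spark, delta_path)

    # "id" is a placeholder; use whatever uniquely identifies a record in your data
    (target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())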

Method 4: Using Materialized Views or Persisted DataFrames

For data that you query frequently, you can create materialized views or persist specific DataFrames. Materialized views in Databricks are not maintained automatically the way they are in some databases; you create and refresh them yourself, either manually or through scheduled jobs.

Steps for Creating a Materialized View

  1. Load Data into a Spark DataFrame:
    • Use the spark.read method to load data from S3.
  2. Create a Temporary View:
    • Register the DataFrame as a temporary view, which allows you to query it as a table.
    df.createOrReplaceTempView("temp_databoutique_data")
  3. Persist the View for Frequent Access:
    • Optionally, persist this view to speed up access, especially for large datasets.
    spark.catalog.cacheTable("temp_databoutique_data")
  4. Refresh the View Regularly:
    • Set up a job that refreshes the cached view at regular intervals, depending on how often the underlying data changes; a minimal sketch follows this list.
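
A minimal sketch of such a refresh job, reusing the read path and view name from the steps above:

    # Reload the latest data from S3 (path as in the earlier examples)
    df = spark.read.format("csv") \
        .option("header", "true") \
        .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")

    # Re-register the view and cache the fresh data
    df.createOrReplaceTempView("temp_databoutique_data")
    spark.catalog.cacheTable("temp_databoutique_data")

    # Query the cached view with Spark SQL
    display(spark.sql("SELECT COUNT(*) AS row_count FROM temp_databoutique_data"))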

Additional Considerations for Cross-Region Data

Since Data Boutique’s S3 data is stored in the eu1 region, accessing it from a different AWS region may incur cross-region transfer fees. To optimize costs, consider:

  1. Copying Data to a Region-Specific S3 Bucket:
    • Use AWS DataSync or the AWS CLI to copy data from eu1 to an S3 bucket in your Databricks region; a scripted alternative using boto3 is sketched after this list.
    • This reduces data transfer latency and costs.
  2. Configuring Cross-Region Permissions:
    • If you choose to keep data in eu1, ensure that your S3 bucket policies allow cross-region access from your Databricks workspace.
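
If you prefer to script the copy from a notebook instead of using the CLI, a boto3 sketch along these lines could work. The destination bucket name is a placeholder, and the loop simply re-copies every object under your buyer prefix.

    import boto3

    SOURCE_BUCKET = "databoutique.com"
    SOURCE_PREFIX = "buyers/[Your_AWS_Access_Key]/"
    DEST_BUCKET = "your-regional-bucket"  # placeholder: a bucket in your Databricks region

    s3 = boto3.client("s3")

    # Copy every object under the Data Boutique prefix into the regional bucket
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            s3.copy({"Bucket": SOURCE_BUCKET, "Key": obj["Key"]}, DEST_BUCKET, obj["Key"])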

Conclusion

Databricks offers flexible and efficient ways to access and query Data Boutique’s AWS S3 data directly. Whether you use DBFS mounting, direct Spark DataFrames, Delta Lake, or materialized views, each method helps reduce unnecessary storage duplication while optimizing access and processing costs. For datasets with frequent updates, Delta Lake offers a powerful solution for incremental updates, while materialized views support efficient queries on regularly accessed data.

