2.3.1.8 Accessing and Querying Data from Data Boutique’s AWS S3 in Databricks
Data Boutique provides datasets stored in AWS S3, making it possible to access and analyze this data directly in Databricks. Rather than requiring you to duplicate the data into Databricks storage, which can incur additional costs, Databricks offers several methods to access, query, and process data directly from S3. This guide covers various approaches for accessing S3 data efficiently.
Considerations for Accessing Data in Databricks
Databricks can process large-scale data efficiently, and accessing data directly in AWS S3 provides additional benefits:
- Cost Savings: Avoiding duplication of data in Databricks’ storage layer reduces storage costs.
- Flexibility: Access data as needed from S3, allowing you to keep your Databricks storage streamlined.
- Efficiency: Databricks integrates seamlessly with AWS S3, making it easy to query or analyze data without moving it into Databricks storage.
Note on Regions
Data Boutique’s AWS S3 data is stored in the eu1 region. If your Databricks workspace is in a different AWS region, be aware that cross-region data transfers may incur additional costs. To mitigate this, you can copy the data to a bucket in your Databricks region, which can help optimize data access speed and costs.
Method 1: Mounting AWS S3 with Databricks File System (DBFS)
DBFS allows you to mount an S3 bucket as if it were a local directory, giving you easy access to S3 data. This is ideal for data you need to access frequently, as it integrates seamlessly with Databricks’ environment.
Steps to Mount S3 to DBFS
- Configure AWS Credentials in Databricks:
- In your Databricks workspace, set up AWS credentials using Databricks secrets, or define them directly in the notebook for testing purposes only (a secrets-based sketch follows these steps).
ACCESS_KEY = "YOUR_AWS_ACCESS_KEY"
SECRET_KEY = "YOUR_AWS_SECRET_KEY"
- Mount S3 to DBFS:
- Use the following command to mount the Data Boutique S3 bucket to a DBFS path. Replace the [Your_AWS_Access_Key] placeholder so the source points to your specific S3 path.
dbutils.fs.mount(
  source = "s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/",
  mount_point = "/mnt/databoutique_data",
  extra_configs = {"fs.s3a.access.key": ACCESS_KEY, "fs.s3a.secret.key": SECRET_KEY}
)
- Access Data:
- Once mounted, you can list files or read data directly from /mnt/databoutique_data/.
display(dbutils.fs.ls("/mnt/databoutique_data"))
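For anything beyond quick testing, it is safer to pull the keys from a Databricks secret scope rather than hard-coding them in the notebook. The following is a minimal sketch, assuming a secret scope named databoutique with keys aws_access_key and aws_secret_key (these names are illustrative, not part of Data Boutique's setup); it also skips the mount if it already exists:
# Retrieve AWS credentials from a Databricks secret scope instead of hard-coding them.
# The scope and key names are assumptions; create them beforehand with the Databricks CLI or Secrets API.
ACCESS_KEY = dbutils.secrets.get(scope="databoutique", key="aws_access_key")
SECRET_KEY = dbutils.secrets.get(scope="databoutique", key="aws_secret_key")

# Mount only if the mount point does not already exist, so the cell can be re-run safely.
if not any(m.mountPoint == "/mnt/databoutique_data" for m in dbutils.fs.mounts()):
    dbutils.fs.mount(
        source="s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/",
        mount_point="/mnt/databoutique_data",
        extra_configs={"fs.s3a.access.key": ACCESS_KEY, "fs.s3a.secret.key": SECRET_KEY}
    )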
Method 2: Querying Data Directly from S3 Using Spark
For ad-hoc queries, you can load data from S3 directly into a Spark DataFrame in Databricks. This method is best for one-time analysis or temporary access to specific data.
Steps for Direct Querying with Spark
- Configure AWS Credentials:
- In the notebook, set the AWS access keys as Spark configurations.
spark.conf.set("fs.s3a.access.key", ACCESS_KEY)
spark.conf.set("fs.s3a.secret.key", SECRET_KEY) - Read Data into a Spark DataFrame:
- Use the spark.read command to read data from S3 as a DataFrame.
df = spark.read.format("csv") \
  .option("header", "true") \
  .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")
display(df)
- Query and Analyze:
- Perform your queries directly on this DataFrame, as in the sketch below. This approach allows you to analyze data without persisting it in Databricks.
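As an illustration of this step, here is a minimal sketch of ad-hoc analysis on the DataFrame loaded above; the column names website and price are placeholders and will differ depending on the dataset you purchased:
from pyspark.sql import functions as F

# Example ad-hoc aggregation: row counts and an average numeric column per group.
# "website" and "price" are placeholder column names; adjust them to your dataset's schema.
summary = (
    df.groupBy("website")
      .agg(F.count("*").alias("rows"), F.avg("price").alias("avg_price"))
      .orderBy(F.desc("rows"))
)
display(summary)

# The same DataFrame can also be queried with SQL through a temporary view.
df.createOrReplaceTempView("databoutique_raw")
display(spark.sql("SELECT COUNT(*) AS total_rows FROM databoutique_raw"))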
Method 3: Using Delta Lake for Incremental Data Loading
If your data in S3 is updated frequently, consider using Delta Lake to create a Delta Table on top of your S3 data. Delta Lake provides support for versioning and incremental updates, making it ideal for workflows that rely on regularly updated datasets.
Steps to Set Up Delta Lake on S3
- Define the Delta Table:
- Load data from S3 into a Spark DataFrame, and then save it as a Delta Table.
df = spark.read.format("csv") \
.option("header", "true") \
.load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")
df.write.format("delta").mode("overwrite").save("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/delta_table") - Query the Delta Table:
- After setting up, you can use Delta Lake’s built-in features to query data incrementally or perform updates and deletes as needed.
delta_table = spark.read.format("delta").load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/delta_table")
display(delta_table)
- Use Incremental Updates:
- Delta Lake allows you to use MERGE statements to apply incremental updates, which can help keep your data synchronized with new updates in S3 (a sketch follows these steps).
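As a rough sketch of such an incremental update, the example below merges newly delivered CSV files into the Delta table using the Delta Lake Python API; the new/ prefix and the id join column are assumptions you would adapt to your dataset:
from delta.tables import DeltaTable

# Path of the Delta table created in the first step.
delta_path = "s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/delta_table"

# Read the newly delivered files; the "new/" prefix is only an example.
updates = spark.read.format("csv") \
    .option("header", "true") \
    .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/new/*.csv")

# Upsert: update rows that already exist (matched on "id") and insert the rest.
# Replace "id" with whatever column uniquely identifies a row in your dataset.
target = DeltaTable.forPath(spark, delta_path)
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())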
Method 4: Using Materialized Views or Persisted DataFrames
For data that you query frequently, you can create materialized views or persist specific DataFrames. Materialized views in Databricks aren’t automatic like in some databases, but you can create and refresh them manually or through scheduled jobs.
Steps for Creating a Materialized View
- Load Data into a Spark DataFrame:
- Use the spark.read method to load data from S3, as in Method 2.
- Create a Temporary View:
- Register the DataFrame as a temporary view, which allows you to query it as a table.
df.createOrReplaceTempView("temp_databoutique_data")
- Persist the View for Frequent Access:
- Optionally, persist this view to speed up access, especially for large datasets.
spark.catalog.cacheTable("temp_databoutique_data")
- Refresh the View Regularly:
- Set up a job to refresh this cached view at regular intervals, depending on your data access needs (a refresh sketch follows these steps).
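A minimal sketch of such a refresh routine is shown below: it reloads the data from S3, re-registers the temporary view, and re-caches it. Placed in a notebook, it can be scheduled with a Databricks Job; the function name is illustrative:
def refresh_databoutique_view():
    # Reload the latest files from S3, using the same path as in the steps above.
    refreshed = spark.read.format("csv") \
        .option("header", "true") \
        .load("s3a://databoutique.com/buyers/[Your_AWS_Access_Key]/*.csv")

    # Re-register the temporary view and re-cache it so subsequent queries
    # read the refreshed data from memory.
    refreshed.createOrReplaceTempView("temp_databoutique_data")
    spark.catalog.cacheTable("temp_databoutique_data")

# Run this from a notebook scheduled as a Databricks Job at the desired interval.
refresh_databoutique_view()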
Additional Considerations for Cross-Region Data
Since Data Boutique’s S3 data is stored in the eu1 region, accessing it from a different AWS region may incur cross-region transfer fees. To optimize costs, consider:
- Copying Data to a Region-Specific S3 Bucket:
- Use AWS DataSync or the AWS CLI to transfer data from eu1 to an S3 bucket in your Databricks region (see the sketch after this list). This reduces data transfer latency and costs.
- Configuring Cross-Region Permissions:
- If you choose to keep data in eu1, ensure that your S3 bucket policies allow cross-region access from your Databricks workspace.
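As an illustration of the first option, the sketch below copies every object under your buyer prefix into a bucket you own in your Databricks region using boto3; the destination bucket name is an assumption, and depending on your setup you may need to set region_name explicitly on the client:
import boto3

# Source: Data Boutique's bucket and your buyer prefix (placeholder access key).
SOURCE_BUCKET = "databoutique.com"
SOURCE_PREFIX = "buyers/[Your_AWS_Access_Key]/"

# Destination: a bucket you own in the same region as your Databricks workspace.
# "my-databricks-region-bucket" is an illustrative name.
DEST_BUCKET = "my-databricks-region-bucket"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Copy every object under the source prefix, preserving the key layout
# so the S3 paths used in the methods above only change bucket name.
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        s3.copy(
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            Bucket=DEST_BUCKET,
            Key=obj["Key"]
        )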
Conclusion
Databricks offers flexible and efficient ways to access and query Data Boutique’s AWS S3 data directly. Whether you use DBFS mounting, direct Spark DataFrames, Delta Lake, or materialized views, each method helps reduce unnecessary storage duplication while optimizing access and processing costs. For datasets with frequent updates, Delta Lake offers a powerful solution for incremental updates, while materialized views support efficient queries on regularly accessed data.