2.3.1.7 Accessing and Querying Data from Data Boutique's AWS S3 in Snowflake

Updated by Andrea Squatrito


Data Boutique provides datasets stored in AWS S3, allowing you to access and analyze this data in Snowflake. While it’s possible to load data into Snowflake tables for ongoing analysis, this isn’t always necessary or the most efficient approach. Snowflake offers several methods to directly query data in S3 without duplicating it, minimizing storage costs and simplifying data management.

Considerations: Should You Load Data into Snowflake?

Before loading Data Boutique’s AWS S3 data into Snowflake, consider these efficient alternatives that allow you to access and query the data directly:

  1. External Tables with External Stages:
    • External tables let you query data stored in S3 as if it were in Snowflake, without actually loading it into tables. This is ideal for large, static datasets or data you want to access infrequently.
  2. On-Demand Loading with External Stages and COPY INTO:
    • You can load specific data subsets into Snowflake as needed, keeping costs low by avoiding full data loading and maintaining control over what’s available in Snowflake.
  3. Direct Queries Using Snowflake Partner Connect:
    • For live or frequently updated data, use Snowflake’s Partner Connect integrations, or query S3 through services like Amazon Athena or AWS Glue for efficient real-time access.
  4. Materialized Views for Frequently Accessed Data:
    • Materialized views allow you to store the results of frequently run queries, reducing the need to repeatedly query large datasets.

If any of these approaches align with your data needs, they may offer a more cost-effective and flexible solution than loading data into Snowflake tables.

Method 1: Using External Tables with External Stages for Direct Access

Snowflake External Tables allow you to query S3 data directly without loading it into Snowflake storage. This is an ideal solution for large datasets or data that doesn’t change frequently.

Steps to Set Up an External Table

  1. Configure Snowflake Storage Integration:
    • Set up an AWS IAM role that grants Snowflake access to the S3 bucket, then create a storage integration that references it (skip this step if you already have one):
    CREATE STORAGE INTEGRATION my_aws_integration
    TYPE = EXTERNAL_STAGE
    STORAGE_PROVIDER = 'S3'
    ENABLED = TRUE
    STORAGE_AWS_ROLE_ARN = '<your-role-arn>'
    STORAGE_ALLOWED_LOCATIONS = ('s3://databoutique.com/buyers/[Your_AWS_Access_Key]/');
  2. Create an External Stage:
    • Create an external stage that points to your Data Boutique S3 bucket through the storage integration.
    CREATE STAGE my_data_boutique_stage
    URL = 's3://databoutique.com/buyers/[Your_AWS_Access_Key]/'
    STORAGE_INTEGRATION = my_aws_integration;
  3. Create an External Table:
    • Define the external table over the stage, mapping each CSV column to a virtual column (the column names below are illustrative; match them to your dataset). See the example queries after these steps.
    CREATE EXTERNAL TABLE my_external_table (
    col1 STRING AS (VALUE:c1::STRING),
    col2 STRING AS (VALUE:c2::STRING),
    col3 STRING AS (VALUE:c3::STRING)
    )
    WITH LOCATION = @my_data_boutique_stage
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');
    This setup allows you to run queries directly against data stored in S3 without copying it into Snowflake tables.
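
Before and after creating the external table, a few quick checks help confirm the setup. The following is a minimal sketch using the object names from the steps above (my_aws_integration, my_data_boutique_stage, my_external_table); the column names are illustrative:

    -- Show the IAM user ARN and external ID Snowflake uses; add them to your AWS role's trust policy
    DESC INTEGRATION my_aws_integration;

    -- Confirm the stage can see the Data Boutique files
    LIST @my_data_boutique_stage;

    -- Query the external table directly; METADATA$FILENAME identifies the source file in S3
    SELECT METADATA$FILENAME AS source_file, col1, col2, col3
    FROM my_external_table
    LIMIT 10;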

Method 2: On-Demand Data Loading with COPY INTO Command

If you only need data at specific times, you can use Snowflake’s external stages and the COPY INTO command to bring specific subsets of data into Snowflake on demand.

Steps for On-Demand Loading

  1. Set Up an External Stage (as above).
  2. Use COPY INTO for On-Demand Loading:
    • Load data into an existing Snowflake table as needed with the COPY INTO command.
    COPY INTO my_table
    FROM @my_data_boutique_stage
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');
    This allows you to bring specific datasets into Snowflake only when needed, rather than continuously storing the full dataset; see the sketch below for loading just a subset of files.
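
To load only part of the dataset, you can point COPY INTO at a sub-path of the stage and filter files with a pattern. The following is a minimal sketch; the target table definition and the 2024/ prefix are hypothetical and should be adjusted to your files:

    -- Hypothetical target table; align the columns with your Data Boutique dataset
    CREATE TABLE IF NOT EXISTS my_table (
        col1 STRING,
        col2 STRING,
        col3 STRING
    );

    -- Load only the CSV files under a specific prefix of the stage
    COPY INTO my_table
    FROM @my_data_boutique_stage/2024/
    PATTERN = '.*\.csv'
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"');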

Method 3: Direct Queries Using Snowflake Partner Connect for Live Data

For live or frequently updated data, you can use Snowflake Partner Connect to query data in S3 through services like Amazon Athena or AWS Glue.

Steps to Set Up Direct Queries

  1. Create a Data Source in Amazon Athena or AWS Glue:
    • Configure Athena or Glue to read from your Data Boutique S3 bucket and set up any transformations or aggregations as needed.
  2. Connect Snowflake to the Data Source:
    • Use Snowflake’s Partner Connect integrations to access data directly from Athena or Glue.

This is a good option for accessing real-time data or handling complex queries without loading the raw data into Snowflake tables.

Method 4: Using Materialized Views for Frequently Accessed Data

If you frequently run the same queries or need specific aggregations, Materialized Views (an Enterprise Edition feature) can store query results and are kept up to date automatically by Snowflake, giving you efficient access without repeatedly scanning the full dataset.

Steps to Set Up Materialized Views

  1. Create an External Table (as in Method 1).
  2. Define a Materialized View for frequent queries:
    CREATE MATERIALIZED VIEW my_materialized_view AS
    SELECT col1, COUNT(col2) AS count_col2
    FROM my_external_table
    GROUP BY col1;
  3. Keep the External Table Metadata Fresh:
    • Snowflake maintains the materialized view automatically; to pick up new files, refresh the external table’s metadata manually or via S3 event notifications, as shown below.

This is ideal for data that you access often in specific formats or aggregations.
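
As a minimal sketch, assuming the external table from Method 1: refreshing its metadata registers any new files, and Snowflake then updates the materialized view automatically.

    -- Re-scan the stage location and register new or removed files
    ALTER EXTERNAL TABLE my_external_table REFRESH;

    -- The materialized view now reflects the refreshed file set
    SELECT * FROM my_materialized_view;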

Additional Considerations for Cross-Region Data

Data Boutique’s AWS S3 data is stored in the eu1 region. If your Snowflake account is in a different AWS region, consider the following:

  • Cross-Region Transfer Costs: Cross-region access incurs AWS data transfer charges. To avoid these, copy the data from eu1 to an S3 bucket in the same region as your Snowflake account (see the sketch after this list).
  • Using AWS DataSync: Transfer data to a region-specific S3 bucket using AWS DataSync or the AWS CLI.
  • S3 Bucket Permissions: Ensure permissions are configured to allow cross-region access if you decide to query data directly from eu1.
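
If you copy the data into a bucket in your own region, you can point a stage at that copy and use it exactly as in Methods 1 and 2. A minimal sketch, assuming a hypothetical bucket (my-same-region-bucket) that is included in your storage integration’s STORAGE_ALLOWED_LOCATIONS:

    -- Stage over a same-region copy of the Data Boutique files
    CREATE STAGE my_local_stage
    URL = 's3://my-same-region-bucket/databoutique/'
    STORAGE_INTEGRATION = my_aws_integration;

    -- Verify the copied files are visible
    LIST @my_local_stage;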

Conclusion

When working with Data Boutique data, consider using Snowflake’s external tables, on-demand loading, Partner Connect, or materialized views for efficient access and analysis. Each of these options lets you query data from S3 in Snowflake without loading it fully, helping you optimize storage costs and streamline your workflows. Whether you need on-demand access, real-time querying, or frequently updated aggregations, these methods provide flexible solutions tailored to your specific data needs in Snowflake.

