If you are somebody who uses AWS Athena to query large highly partitioned tables on a daily basis you must know how difficult it is to maintain the partitions. As your partitions grow, you also need to update the metadata in Glue Data Catalog, or else the new data isn’t scanned. Some of us even have standalone setups for this purpose; to add partitions to the table’s metadata. AWS offers a feature called Partition Projections in Athena which automates partition management. This also speeds up query processing.

What is partition projection

Generally, when queries are processed in Athena, the required information is first fetched from the metadata store such as AWS Glue Data Catalog or the Hive Metastore. In cases when our tables have a large number of partitions, retrieving metadata can take a huge amount of time. To avoid this, we can use partition projection.

Partition projection allows us to specify configuration information such as the patterns that are commonly used to form partitions (for example, YYYY/MM/DD). This gives Athena the information necessary to build partitions without retrieving metadata information from your metadata store. Athena will read the partition values and locations from the table’s configuration, rather than from a metadata repository like the AWS Glue Data Catalog or Hive Metastore.

Partition projections can significantly reduce query runtime for queries that are constrained on partition metadata retrieval. However, it is to be noted that the speed of the data scan is unaffected by partition projection but the time to fetch partition info from metadata is eliminated. The key benefit is the automated partition management, removing the need to manually add partitions in a day to day Data lake operations.

How to use partition projections

For using partition projection, we need to specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or external Hive metastore. These custom properties on the table allow Athena to know what partition patterns to expect when it runs a query on the table. During query execution, Athena will use this information to project the partition values instead of retrieving them from the AWS Glue Data Catalog or external Hive metastore.

General Use Cases

  1. Queries that take a significant amount of time to run against highly partitioned tables.
  2. When we have highly partitioned data in Amazon S3 and we need to read only small parts of it. In this case, we can use partition projections.

Structures of Projectable Partition

Partition projection can be easily configured when your partitions follow an easily predictable pattern such as, the following:

  1. Integers
    Any continuous sequence of integers such as [1, 2, 3, 4, …, 1000] or [0500, 0550, 0600, …, 2500] etc.
  2. Dates
    Any continuous sequence of dates or datetimes such as
    [20200201, 20200202, …, 20200831] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, …, 12-31-2020 23:00:00].
  3. Enumerated values
    A finite set of enumerated values such as airport codes or AWS Regions.
  4. AWS service logs
    AWS service logs typically have a known structure whose partition scheme you can specify in AWS Glue and that Athena can therefore use for partition projection. 

Important things to consider

  1. Partition projection eliminates the need to specify partitions manually in AWS Glue or an external Hive metastore.
  2. When you enable partition projection on a table, Athena ignores any partition metadata in the AWS Glue Data Catalog or external Hive metastore for that table.
  3. If a particular projected partition does not exist in Amazon S3, Athena will still project the partition. Athena will not throw an error, but no data is returned. However, If there are too many of the empty partitions, performance can be slower compared to traditional AWS Glue partitions. If more than half of your projected partitions are empty, it is recommended to use traditional partitions.
  4. Partition projection is usable only when the table is queried through Athena. If the same table is read through another service such as Amazon Redshift Spectrum or Amazon EMR, the standard partition metadata will be used.
  5. Views in Athena do not use projection configuration properties.

Enabling partition projection using the AWS Glue console

Setting up partition projection in a table’s properties can be done via AWS Glue. To configure and enable partition projection using the AWS Glue console :

  1. Sign in to the AWS Management Console and open the AWS Glue console at  
https://console.aws.amazon.com/glue/.
  1. Choose the Tables tab.
  2. On the Tables tab, you can edit the already existing tables, or you can choose Add tables to create new ones in athena.
  3. In the list of tables, select the table that you want to edit. Click on “Edit table”
  1. In the Edit table details dialog box, in the Table properties section, for each partitioned column, add the following key-value pair:
    1. For Key, add projection.columnName.type.
    2. For Value, add one of the supported types: enum, integer, date, or injected. 
  2. To enable partition projection we need to add a key-value pair in the table properties. For Key, enter projection.enabled, and for its Value, enter true. and we can disable partition projection at any time just by changing the value to false.
  1. When finished, choose Apply.
  2. In the Athena Query Editor, test query for the columns that you configured for the table. 

Supported types for partition projections

  1. ENUM
  2. INTEGER
  3. Date
  4. Injected Type

ENUM:

For partition columns whose values are members of an enumerated set (for example, airport codes or AWS Regions), we can use “enum”.

Property name Example
projection.columnName.type  enum
projection.columnName.valuesA,B,C,D,E,F,G,Unknown

INTEGER:

For partition columns whose possible values are integers within a defined range, we can use “integer”.

Property name Example
projection.columnName.type  integer
projection.columnName.range 0,100  or  -1,86755  or 001,999 etc.
projection.columnName.interval1  or 5 or 10 etc.
projection.columnName.digits1  or 5 etc.

Date:

For partition columns whose values are dates within a defined range, we can use “date”. Projected date columns are generated in Coordinated Universal Time (UTC) at query execution time.

Property name Example
projection.columnName.type  date
projection.columnName.range 201701, 201812 or 01-01-2010, 12-31-2018 or NOW-3YEARS, NOW or 201801, NOW+1MONTH etc.
projection.columnName.formatyyyyMM or dd-MM-yyyy or dd-MM-yyyy-HH-mm-ss etc.
projection.columnName.interval1  or 5 etc.
projection.columnName.interval.unitYEARS or MONTHS or WEEKS or DAYS or HOURS or MINUTES or SECONDS or MILLISECONDS

Injected Type:

For partition columns with possible values that cannot be procedurally generated within some logical range but that are provided in a query’s WHERE clause as a single value, we can use Injected Type.

Property name Example
projection.columnName.type  injected

I hope it was helpful. Thank you!

This story is authored by Ramu D. He specializes in Cloud Services based development.

Last modified: September 11, 2020

Author

Comments

Write a Reply or Comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.