If you are somebody who uses AWS Athena to query large highly partitioned tables on a daily basis you must know how difficult it is to maintain the partitions. As your partitions grow, you also need to update the metadata in Glue Data Catalog, or else the new data isn’t scanned. Some of us even have standalone setups for this purpose; to add partitions to the table’s metadata. AWS offers a feature called Partition Projections in Athena which automates partition management. This also speeds up query processing.
What is partition projection
Generally, when queries are processed in Athena, the required information is first fetched from the metadata store such as AWS Glue Data Catalog or the Hive Metastore. In cases when our tables have a large number of partitions, retrieving metadata can take a huge amount of time. To avoid this, we can use partition projection.
Partition projection allows us to specify configuration information such as the patterns that are commonly used to form partitions (for example, YYYY/MM/DD). This gives Athena the information necessary to build partitions without retrieving metadata information from your metadata store. Athena will read the partition values and locations from the table’s configuration, rather than from a metadata repository like the AWS Glue Data Catalog or Hive Metastore.
Partition projections can significantly reduce query runtime for queries that are constrained on partition metadata retrieval. However, it is to be noted that the speed of the data scan is unaffected by partition projection but the time to fetch partition info from metadata is eliminated. The key benefit is the automated partition management, removing the need to manually add partitions in a day to day Data lake operations.
How to use partition projections
For using partition projection, we need to specify the ranges of partition values and projection types for each partition column in the table properties in the AWS Glue Data Catalog or external Hive metastore. These custom properties on the table allow Athena to know what partition patterns to expect when it runs a query on the table. During query execution, Athena will use this information to project the partition values instead of retrieving them from the AWS Glue Data Catalog or external Hive metastore.
General Use Cases
- Queries that take a significant amount of time to run against highly partitioned tables.
- When we have highly partitioned data in Amazon S3 and we need to read only small parts of it. In this case, we can use partition projections.
Structures of Projectable Partition
Partition projection can be easily configured when your partitions follow an easily predictable pattern such as, the following:
Any continuous sequence of integers such as [1, 2, 3, 4, …, 1000] or [0500, 0550, 0600, …, 2500] etc.
Any continuous sequence of dates or datetimes such as
[20200201, 20200202, …, 20200831] or [1-1-2020 00:00:00, 1-1-2020 01:00:00, …, 12-31-2020 23:00:00].
- Enumerated values
A finite set of enumerated values such as airport codes or AWS Regions.
- AWS service logs
AWS service logs typically have a known structure whose partition scheme you can specify in AWS Glue and that Athena can therefore use for partition projection.
Important things to consider
- Partition projection eliminates the need to specify partitions manually in AWS Glue or an external Hive metastore.
- When you enable partition projection on a table, Athena ignores any partition metadata in the AWS Glue Data Catalog or external Hive metastore for that table.
- If a particular projected partition does not exist in Amazon S3, Athena will still project the partition. Athena will not throw an error, but no data is returned. However, If there are too many of the empty partitions, performance can be slower compared to traditional AWS Glue partitions. If more than half of your projected partitions are empty, it is recommended to use traditional partitions.
- Partition projection is usable only when the table is queried through Athena. If the same table is read through another service such as Amazon Redshift Spectrum or Amazon EMR, the standard partition metadata will be used.
- Views in Athena do not use projection configuration properties.
Enabling partition projection using the AWS Glue console
Setting up partition projection in a table’s properties can be done via AWS Glue. To configure and enable partition projection using the AWS Glue console :
- Sign in to the AWS Management Console and open the AWS Glue console at
- Choose the Tables tab.
- On the Tables tab, you can edit the already existing tables, or you can choose Add tables to create new ones in athena.
- In the list of tables, select the table that you want to edit. Click on “Edit table”
- In the Edit table details dialog box, in the Table properties section, for each partitioned column, add the following key-value pair:
- For Key, add projection.columnName.type.
- For Value, add one of the supported types: enum, integer, date, or injected.
- To enable partition projection we need to add a key-value pair in the table properties. For Key, enter projection.enabled, and for its Value, enter true. and we can disable partition projection at any time just by changing the value to false.
- When finished, choose Apply.
- In the Athena Query Editor, test query for the columns that you configured for the table.
Supported types for partition projections
- Injected Type
For partition columns whose values are members of an enumerated set (for example, airport codes or AWS Regions), we can use “enum”.
For partition columns whose possible values are integers within a defined range, we can use “integer”.
|projection.columnName.range||0,100 or -1,86755 or 001,999 etc.|
|projection.columnName.interval||1 or 5 or 10 etc.|
|projection.columnName.digits||1 or 5 etc.|
For partition columns whose values are dates within a defined range, we can use “date”. Projected date columns are generated in Coordinated Universal Time (UTC) at query execution time.
|projection.columnName.range||201701, 201812 or 01-01-2010, 12-31-2018 or NOW-3YEARS, NOW or 201801, NOW+1MONTH etc.|
|projection.columnName.format||yyyyMM or dd-MM-yyyy or dd-MM-yyyy-HH-mm-ss etc.|
|projection.columnName.interval||1 or 5 etc.|
|projection.columnName.interval.unit||YEARS or MONTHS or WEEKS or DAYS or HOURS or MINUTES or SECONDS or MILLISECONDS|
For partition columns with possible values that cannot be procedurally generated within some logical range but that are provided in a query’s WHERE clause as a single value, we can use Injected Type.
I hope it was helpful. Thank you!
This story is authored by Ramu D. He specializes in Cloud Services based development.