In the previous tutorial, we manually wrote a CREATE EXTERNAL TABLE SQL statement in Amazon Athena to define the schema of our CSV files. However, what if you have a data lake with thousands of S3 buckets containing complex JSON, Parquet, and CSV files, and the schemas are constantly changing?
Writing DDL statements by hand does not scale to that situation. AWS Glue Crawlers automate the schema discovery instead.
An AWS Glue Crawler connects to a data store (like an S3 bucket or an RDS database), progresses through a prioritized list of classifiers to determine the schema of your data, and automatically creates metadata tables in the AWS Glue Data Catalog.
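To make the idea of a prioritized classifier list concrete, here is a toy sketch in plain Python. It is not Glue's implementation: the functions `classify_json`, `classify_csv`, and `detect_format` are hypothetical names invented for this illustration, and the real classifiers are considerably more sophisticated. It only shows the general pattern of trying classifiers in order and accepting the first one that recognizes the data.

```python
import csv
import io
import json


def classify_json(sample: str):
    """Return 'json' if the sample parses as a JSON object or array, else None."""
    try:
        parsed = json.loads(sample)
    except ValueError:
        return None
    return "json" if isinstance(parsed, (dict, list)) else None


def classify_csv(sample: str):
    """Return 'csv' if there are at least two rows with the same column count (> 1)."""
    rows = [row for row in csv.reader(io.StringIO(sample)) if row]
    if len(rows) < 2:
        return None
    widths = {len(row) for row in rows}
    return "csv" if len(widths) == 1 and widths.pop() > 1 else None


# Order matters: classifiers earlier in the list take priority.
CLASSIFIERS = [classify_json, classify_csv]


def detect_format(sample: str) -> str:
    """Try each classifier in priority order and return the first match."""
    for classifier in CLASSIFIERS:
        result = classifier(sample)
        if result is not None:
            return result
    return "unknown"
```

Calling `detect_format` on a small JSON document stops at the first classifier, while a CSV sample falls through to the second. Data that no classifier recognizes is reported as unknown.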
Amazon Athena uses this Data Catalog to understand the structure of the data when you run your SQL queries.
To set one up, give the crawler a name (for example, production_logs_crawler) and point it at the s3:// path to the folder containing your raw data files.

Once you run the crawler, it scans your S3 files, infers the column data types (such as string, int, and boolean), and creates the corresponding tables in the Data Catalog. You can then open Amazon Athena and start querying them.
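The same setup can also be scripted. The sketch below uses the boto3 Glue client's `create_crawler` and `start_crawler` calls. Several things here are assumptions not taken from this tutorial: the helper names `build_crawler_request` and `create_and_start_crawler`, the IAM role ARN, the database name, and the bucket path are all hypothetical placeholders that you would replace with your own values.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for the Glue create_crawler call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with 's3://'")
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes (placeholder)
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: dict) -> None:
    """Create the crawler and start a run. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without AWS access

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="production_logs_crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_logs_db",
    s3_path="s3://example-bucket/raw-logs/",
)
# create_and_start_crawler(request)  # uncomment to run against a real AWS account
```

Keeping the request construction separate from the AWS call makes it possible to inspect or validate the configuration before anything is created in your account.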