Extend Cloud Storage filtering with BigQuery Object Table
Cloud Storage is very popular and one of the first services you use when you move to the cloud, or when you want to import or export data to and from Google Cloud.
It’s a convenient, serverless, and virtually limitless object storage service.
Cloud Storage Limitations
Serverless services always come with limitations, because of opinionated choices made by the development team.
The most popular “limitation” (or misunderstood feature) is folder management: folders simply do not exist!
It’s a common misunderstanding among newcomers. Cloud Storage is a blob storage system with a flat namespace (it stores Objects, not Files), not a block storage system with a hierarchical file system like your own computer.
Another annoying limitation is the filtering capability. You can only filter on the object path prefix: no wildcard path filtering, no date range, no extension or content-type selection, and so on.
It’s really painful when you want to use the CLI to change the storage class of objects older than 90 days, or to select only the objects with a given extension.
How can we get around this Cloud Storage filtering limitation?
Filter your files as simply as a SQL query
The filtering by prefix comes from the solution underlying Cloud Storage: all the stored objects are indexed in a database, and Bigtable is used for that.
If you are familiar with Bigtable, you know that the row key design is very important, because you can only query by row key, and only by its prefix.
Exactly as you do with Cloud Storage.
To minimize the object filtering limitation, the answer comes from another very popular service on Google Cloud: BigQuery.
Yes, there is no immediate connection between Cloud Storage and BigQuery, but a brand new feature named “Object Table” allows listing the Cloud Storage objects’ metadata directly in BigQuery.
And because the Cloud Storage object metadata is accessible in BigQuery, your file (object) selection becomes as easy as a SQL query!
Let’s make this BigQuery connection!
BigQuery Object Table
An Object Table is a table with the capacity to read a Cloud Storage bucket. To achieve that, three main operations are required:
- Create a BigQuery Connection
- Grant the BigQuery Connection the permission to read the target bucket(s)
- Create the BigQuery table based on the BigQuery connection and the bucket(s) to read
BigQuery Connection creation
The BigLake feature lets you create a connection between BigQuery and a data lake, such as Cloud Storage.
From the command line, you can do that as follows:
bq mk --connection --location=us \
--connection_type=CLOUD_RESOURCE medium_conn
Take care of the location: it must be the same as your Cloud Storage bucket’s location. Use US or EU if you have multi-region buckets in that part of the world.
Permission on BigQuery Connection
It’s the trickiest part. When you create a BigQuery connection, a service account is automatically provisioned. Therefore, we have to retrieve this service account and grant it the roles/storage.objectViewer role so it can read the bucket metadata.
You can get the service account like this:
bq show --format json \
--connection us.medium_conn | jq -r .cloudResource.serviceAccountId
jq simplifies the JSON parsing and value extraction!
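To see what that jq filter does without calling the real API, you can run it on a sample payload; the JSON below is a hypothetical, trimmed-down example of what the connection metadata looks like (only the field used here is shown, and the service account value is made up):

```shell
# Hypothetical sample of the JSON returned by `bq show --format json --connection`
# (trimmed down; the real output contains more fields)
sample='{"name":"projects/123/locations/us/connections/medium_conn","cloudResource":{"serviceAccountId":"bqcx-123456789-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com"}}'

# Extract the service account with jq, exactly as in the command above;
# -r prints the raw string without surrounding quotes
echo "$sample" | jq -r .cloudResource.serviceAccountId
```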
To grant the service account access to the Cloud Storage bucket, you can combine the role grant and the service account extraction, like this:
#Ideal for least-privilege enforcement, but it does not seem to work for now.
gcloud storage buckets add-iam-policy-binding \
--member=serviceAccount:$(bq show --format json \
--connection us.medium_conn | jq -r .cloudResource.serviceAccountId) \
--role=roles/storage.objectViewer gs://<YOUR BUCKET>
#Grant at project level instead
gcloud projects add-iam-policy-binding \
--member=serviceAccount:$(bq show --format json \
--connection us.medium_conn | jq -r .cloudResource.serviceAccountId) \
--role=roles/storage.objectViewer <PROJECT_ID>
Of course, you can grant the service account at the project, folder, or organization level if required. The project level can be a good option if your table reads from several buckets (a tradeoff between gcloud CLI repetition and the least-privilege principle).
Note: you will have to wait 2–3 minutes for the permission grant to propagate. If you use it too quickly, you will get permission issues (even if the BigQuery error message mentions SLAs and an outage on BigQuery!)
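If you script the whole setup, that propagation delay can break your automation. A minimal sketch of a generic retry helper (names and delays are illustrative, not from the article) lets you poll until the grant is effective:

```shell
# Sketch of a generic retry helper: run a command up to N times,
# sleeping between attempts, and stop at the first success
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0   # success: stop retrying
    sleep "$delay"     # wait before the next attempt
  done
  return 1             # all attempts failed
}

# Usage idea: poll the Object Table until the grant has propagated, e.g.
# retry 10 30 bq query --use_legacy_sql=false \
#   'SELECT COUNT(*) FROM `medium_dataset.my-object-table`'
```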
Object Table Creation
Finally, you can create your BigQuery external table based on your connection and the bucket(s) to read:
#First, create the dataset if it does not exist
bq mk --dataset --location=us medium_dataset
#Create the table
bq mk --table \
--external_table_definition=gs://<YOUR BUCKET>/*@us.medium_conn \
--object_metadata=SIMPLE \
medium_dataset.my-object-table
I used the bq command line to create the table instead of the DDL, to be able to script this code and add it to a bash script if required!
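For reference, the same table can also be created with a DDL statement, still scriptable through bq query. This is a sketch based on the object table DDL syntax; the dataset, table, and connection names are the ones used above, and the bucket placeholder stays yours to fill in:

```shell
# DDL equivalent of the bq mk command above, run through bq query
bq query --use_legacy_sql=false \
'CREATE EXTERNAL TABLE `medium_dataset.my-object-table`
WITH CONNECTION `us.medium_conn`
OPTIONS (
  object_metadata = "SIMPLE",
  uris = ["gs://<YOUR BUCKET>/*"]
)'
```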
Extend command line capability
With your Object Table, you can now query the objects along different dimensions for a powerful search:
SELECT uri FROM `medium_dataset.my-object-table`
WHERE size > 100000
ORDER BY updated
But a SQL query is not really useful when you want to perform a CLI operation (manual or automated) on the objects.
One of the great parts of the Object Table feature is the capacity to query the objects with the bq CLI and pipe the result into operations on the objects.
Here are some examples that use the same pattern (jq is the JSON tool used to simplify the formatting and data selection; if you don’t have it, install it!):
- Run the SQL query with the bq CLI, and request the result in JSON format
- Use jq to extract the objects’ uri
- For each object, run the gcloud storage command, thanks to xargs
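To see the plumbing of that pattern without touching a real bucket, you can simulate the bq output locally. The bucket and object names below are made up, and echo stands in for the real gcloud storage command, so this is a safe dry run:

```shell
# Hypothetical sample of `bq query --format=json` output: an array of rows
rows='[{"uri":"gs://my-bucket/a.pdf"},{"uri":"gs://my-bucket/docs/b.pdf"}]'

# jq -r extracts the raw (unquoted) uri values, one per line;
# xargs appends them as arguments. `echo` makes it a dry run:
echo "$rows" | jq -r '.[].uri' | xargs echo gcloud storage rm
# Prints: gcloud storage rm gs://my-bucket/a.pdf gs://my-bucket/docs/b.pdf
```

Once the printed command looks right, drop the echo to actually run it.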
Remove objects with PDF extension
Here, we have to select all the objects with the .pdf extension. You could also select the objects with the application/pdf content-type.
The query is the following:
SELECT uri FROM `medium_dataset.my-object-table`
WHERE uri LIKE "%.pdf"
Combine that with the bq CLI, jq, xargs, and gcloud storage:
bq query --format=json --use_legacy_sql=false \
'SELECT uri FROM `medium_dataset.my-object-table`
WHERE uri LIKE "%.pdf"' \
| jq -r .[].uri | xargs gcloud storage rm
Change storage class of objects older than 90 days
This time, we have to select all the objects whose last update is more than 90 days in the past.
The query is the following:
SELECT uri FROM `medium_dataset.my-object-table`
WHERE updated < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
Combine that with the bq CLI, jq, xargs, and gcloud storage:
bq query --format=json --use_legacy_sql=false \
'SELECT uri FROM `medium_dataset.my-object-table`
WHERE updated < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)' \
| jq -r .[].uri | xargs \
gcloud storage objects update --storage-class=nearline
Limitless combination
Cloud Storage is a limitless storage service.
BigQuery is a limitless data warehouse and query engine.
The combination of both unlocks new features and creates an advanced object query solution!
By assembling bash pipes to link the CLI commands, you can run powerful, limitless queries against Cloud Storage to solve all your usual issues.
Use it without limits!