Google Cloud Storage (GCS)

Overview

This destination writes data to a GCS bucket.

The Airbyte GCS destination allows you to sync data to cloud storage buckets. Each stream is written to its own directory under the bucket.

Getting started

Requirements

  1. Allow connections from the Airbyte server to your GCS cluster (if they exist in separate VPCs).
  2. A GCP bucket with credentials (for the COPY strategy).

Setup guide

  • Fill in the GCS info
    • GCS Bucket Name
      • See this for instructions on how to create a GCS bucket. The bucket cannot have a retention policy. Set Protection Tools to none or Object versioning.
    • GCS Bucket Region
    • HMAC Key Access ID
      • See this on how to generate an access key. For more information on HMAC keys, refer to the GCP docs.
      • We recommend creating an Airbyte-specific user or service account. This user or account will require the following permissions for the bucket:
        storage.multipartUploads.abort
        storage.multipartUploads.create
        storage.objects.create
        storage.objects.delete
        storage.objects.get
        storage.objects.list
        You can set these by going to the Permissions tab of the GCS bucket, adding the email address of the service account or user, and granting the permissions listed above.
    • Secret Access Key
      • Corresponding key to the above access ID.
  • Make sure your GCS bucket is accessible from the machine running Airbyte. This depends on your networking setup. The easiest way to verify if Airbyte is able to connect to your GCS bucket is via the check connection tool in the UI.

Sync mode support

Features

| Feature | Support | Notes |
| --- | --- | --- |
| Full Refresh Sync | | Warning: this mode deletes all previously synced data in the configured bucket path. |
| Incremental - Append Sync | | Warning: Airbyte provides at-least-once delivery. Depending on your source, you may see duplicated data. Learn more here. |
| Incremental - Append + Deduped | | |
| Namespaces | | Setting a specific bucket path is equivalent to having separate namespaces. |

Configuration

| Parameter | Type | Notes |
| --- | --- | --- |
| GCS Bucket Name | string | Name of the bucket to sync data into. |
| GCS Bucket Path | string | Subdirectory under the above bucket to sync the data into. |
| GCS Region | string | See here for all region codes. |
| HMAC Key Access ID | string | The HMAC key access ID for the GCS bucket. When linked to a service account, this ID is 61 characters long; when linked to a user account, it is 24 characters long. See HMAC key for details. |
| HMAC Key Secret | string | The corresponding secret for the access ID. It is a 40-character base-64 encoded string. |
| Format | object | Format-specific configuration. See below for details. |
| Part Size | integer | Block size used for uploads. GCS allows at most 10,000 blocks, i.e. max stream size = blockSize * 10,000 blocks. |
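
The Part Size limit can be worked out with simple arithmetic. The sketch below is only an illustration: it assumes the part size is expressed in MB (the table does not state the unit), and the helper name is hypothetical rather than part of the connector.

```python
# GCS allows at most 10,000 blocks per upload, per the table above.
MAX_BLOCKS = 10_000


def max_stream_size_gb(part_size_mb: int) -> float:
    """Return the largest stream size in GB for a given part size.

    Assumption: the part size is given in MB. The helper name is hypothetical.
    """
    if part_size_mb <= 0:
        raise ValueError("part size must be a positive integer")
    return part_size_mb * MAX_BLOCKS / 1024


# With 5 MB parts, a single stream tops out at a little under 49 GB.
print(round(max_stream_size_gb(5), 1))
```

If a stream is expected to exceed this limit, a larger part size is needed.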

Currently, only HMAC key credentials are supported. More credential types will be added in the future; please submit an issue with your request.
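
Before saving the configuration, it can help to sanity-check the shape of the HMAC credentials against the lengths given in the table above. The following is a minimal sketch with hypothetical names. It only checks lengths and the character set, and it does not contact GCS or prove that the key is valid.

```python
import re
from typing import List

# Lengths taken from the parameter table above.
SERVICE_ACCOUNT_ID_LENGTH = 61
USER_ACCOUNT_ID_LENGTH = 24
SECRET_LENGTH = 40
# Standard base-64 alphabet, including the padding character.
BASE64_PATTERN = re.compile(r"^[A-Za-z0-9+/=]+$")


def check_hmac_credentials(access_id: str, secret: str) -> List[str]:
    """Return a list of problems; an empty list means the shapes look plausible."""
    problems = []
    if len(access_id) not in (SERVICE_ACCOUNT_ID_LENGTH, USER_ACCOUNT_ID_LENGTH):
        problems.append(
            f"access ID has {len(access_id)} characters, expected "
            f"{USER_ACCOUNT_ID_LENGTH} or {SERVICE_ACCOUNT_ID_LENGTH}"
        )
    if len(secret) != SECRET_LENGTH:
        problems.append(f"secret has {len(secret)} characters, expected {SECRET_LENGTH}")
    elif not BASE64_PATTERN.match(secret):
        problems.append("secret contains characters outside the base-64 alphabet")
    return problems
```

The check connection tool in the UI remains the authoritative test; this check only catches copy-and-paste mistakes early.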

Additionally, your bucket must be encrypted using a Google-managed encryption key (this is the default setting when creating a new bucket). We currently do not support buckets using customer-managed encryption keys (CMEK). You can view this setting under the "Configuration" tab of your GCS bucket, in the Encryption type row.

⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated GCS resource for this sync to prevent unexpected data deletion caused by misconfiguration. ⚠️

The full path of the output data is:

<bucket-name>/<bucket-path>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<upload-millis>_<partition-id>.<format-extension>

For example:

testing_bucket/data_output_path/public/users/2021_01_01_1609541171643_0.csv.gz
↑              ↑                ↑      ↑     ↑          ↑             ↑ ↑
|              |                |      |     |          |             | format extension
|              |                |      |     |          |             partition id
|              |                |      |     |          upload time in millis
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name

Please note that the stream name may contain a prefix, if it is configured on the connection.
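
The naming pattern can be sketched as a small helper. This is an illustration, not the connector's own code: the function name is hypothetical, the upload date is assumed to be derived from the upload time in UTC, and the filename parts are joined with underscores as in the example above.

```python
from datetime import datetime, timezone
from typing import Optional


def build_output_path(
    bucket_name: str,
    bucket_path: str,
    namespace: Optional[str],
    stream_name: str,
    upload_millis: int,
    partition_id: int,
    extension: str,
) -> str:
    """Assemble the full output path for one file (hypothetical helper)."""
    # Assumption: the date part is the UTC date of the upload time.
    upload_date = datetime.fromtimestamp(
        upload_millis // 1000, tz=timezone.utc
    ).strftime("%Y_%m_%d")
    file_name = f"{upload_date}_{upload_millis}_{partition_id}.{extension}"
    # The namespace is skipped when the source does not provide one.
    parts = [bucket_name, bucket_path, namespace, stream_name, file_name]
    return "/".join(part.strip("/") for part in parts if part)


print(build_output_path(
    "testing_bucket", "data_output_path", "public", "users",
    1609541171643, 0, "csv.gz",
))
```

Called with the values from the example above, the helper reproduces the same path.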

The rationale behind this naming pattern is:

  1. Each stream has its own directory.
  2. The data output files can be sorted by upload time.
  3. The upload time is composed of a date part and a millis part, so that it is both readable and unique.

A data sync may create multiple files, as the output files can be partitioned by size (targeting a size of 200 MB compressed or lower).

Output Schema

Each stream is written to its own dedicated directory according to the configuration. The complete datastore of each stream consists of all the output files under that directory. You can think of the directory as the equivalent of a table in a database.

  • Under Full Refresh Sync mode, old output files will be purged before new files are created.
  • Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
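
Treating each directory as a table can be sketched as follows. Given a listing of object keys (obtained however you list the bucket), the hypothetical helper below groups files by their stream directory and orders them by upload time. It relies on filenames starting with a fixed-width date and millisecond timestamp, so a plain string sort is enough.

```python
import posixpath
from collections import defaultdict
from typing import Dict, Iterable, List


def group_files_by_stream(object_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Map each stream directory to its output files, oldest upload first."""
    tables = defaultdict(list)
    for key in object_keys:
        directory, file_name = posixpath.split(key)
        tables[directory].append(file_name)
    # Fixed-width date and millis prefixes make lexicographic order
    # match upload-time order.
    return {directory: sorted(files) for directory, files in tables.items()}
```

Reading every file in one directory, in this order, yields the full contents of that stream.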

Avro

Apache Avro serializes data in a compact binary format. Currently, the Airbyte S3 Avro connector always uses the binary encoding, and assumes that all data records follow the same schema.

Configuration

Here are the available compression codecs:

  • No compression
  • deflate
    • Compression level
      • Range [0, 9]. Default to 0.
      • Level 0: no compression & fastest.
      • Level 9: best compression & slowest.
  • bzip2
  • xz
    • Compression level
      • Range [0, 9]. Default to 6.
      • Level 0-3 are fast with medium compression.
      • Level 4-6 are fairly slow with high compression.
      • Levels 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use presets 7, 8, or 9, respectively.
  • zstandard
    • Compression level
      • Range [-5, 22]. Default to 3.
      • Negative levels are 'fast' modes akin to lz4 or snappy.
      • Levels above 9 are generally for archival purposes.
      • Levels above 18 use a lot of memory.
    • Include checksum
      • If set to true, a checksum will be included in each data block.
  • snappy
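
The level ranges and defaults listed above can be captured in a small lookup. The sketch below is only an illustration: the dictionary keys and function name are hypothetical and are not the connector's actual configuration keys.

```python
from typing import Optional

# (minimum, maximum, default) compression levels from the list above.
# Codecs mapped to None take no compression level.
CODEC_LEVELS = {
    "no compression": None,
    "deflate": (0, 9, 0),
    "bzip2": None,
    "xz": (0, 9, 6),
    "zstandard": (-5, 22, 3),
    "snappy": None,
}


def resolve_compression_level(codec: str, level: Optional[int] = None) -> Optional[int]:
    """Return the effective level for a codec, falling back to its default."""
    if codec not in CODEC_LEVELS:
        raise ValueError(f"unknown codec: {codec}")
    bounds = CODEC_LEVELS[codec]
    if bounds is None:
        if level is not None:
            raise ValueError(f"{codec} does not take a compression level")
        return None
    minimum, maximum, default = bounds
    if level is None:
        return default
    if not minimum <= level <= maximum:
        raise ValueError(f"{codec} level must be in [{minimum}, {maximum}]")
    return level
```

For example, `resolve_compression_level("xz")` falls back to the default of 6, while a deflate level of 10 is rejected.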

Data schema

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then each JSON object is converted to an Avro record. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

CSV

As with most other Airbyte destination connectors, the output usually has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob into multiple columns.

| Column | Condition | Description |
| --- | --- | --- |
| _airbyte_ab_id | Always exists. | A UUID assigned by Airbyte to each processed record. |
| _airbyte_emitted_at | Always exists. | A timestamp representing when the event was pulled from the data source. |
| _airbyte_data | When no normalization (flattening) is needed. | All data reside under this column as a JSON blob. |
| root level fields | When root level normalization (flattening) is selected. | The root level fields are expanded. |

For example, given the following json object from a source:

{
"user_id": 123,
"name": {
"first": "John",
"last": "Doe"
}
}

With no normalization, the output CSV is:

| _airbyte_ab_id | _airbyte_emitted_at | _airbyte_data |
| --- | --- | --- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | { "user_id": 123, "name": { "first": "John", "last": "Doe" } } |

With root level normalization, the output CSV is:

| _airbyte_ab_id | _airbyte_emitted_at | user_id | name |
| --- | --- | --- | --- |
| 26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | 123 | { "first": "John", "last": "Doe" } |

Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .csv.gz).
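
The difference between the two CSV layouts can be illustrated with Python's standard `csv` module. This is a minimal sketch under stated assumptions, not the connector's implementation: the function names are hypothetical, and every record is assumed to share the root-level fields of the first one.

```python
import csv
import io
import json


def _cell(value):
    # Nested objects stay as JSON text; scalars are written as-is.
    return json.dumps(value) if isinstance(value, (dict, list)) else value


def to_csv(records, flatten_root: bool) -> str:
    """Render (ab_id, emitted_at, data) tuples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if flatten_root:
        # Assumption: all records share the first record's root-level fields.
        fields = list(records[0][2]) if records else []
        writer.writerow(["_airbyte_ab_id", "_airbyte_emitted_at", *fields])
        for ab_id, emitted_at, data in records:
            writer.writerow([ab_id, emitted_at, *(_cell(data.get(f)) for f in fields)])
    else:
        writer.writerow(["_airbyte_ab_id", "_airbyte_emitted_at", "_airbyte_data"])
        for ab_id, emitted_at, data in records:
            writer.writerow([ab_id, emitted_at, json.dumps(data)])
    return buffer.getvalue()


record = {"user_id": 123, "name": {"first": "John", "last": "Doe"}}
rows = [("26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", 1622135805000, record)]
print(to_csv(rows, flatten_root=False))
print(to_csv(rows, flatten_root=True))
```

Note that only the root level is expanded: the nested `name` object remains a JSON blob in its own column.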

JSON Lines (JSONL)

JSON Lines is a text format with one JSON object per line. Each line has the following structure:

{
"_airbyte_ab_id": "<uuid>",
"_airbyte_emitted_at": "<timestamp-in-millis>",
"_airbyte_data": "<json-data-from-source>"
}

For example, given the following two json objects from a source:

[
{
"user_id": 123,
"name": {
"first": "John",
"last": "Doe"
}
},
{
"user_id": 456,
"name": {
"first": "Jane",
"last": "Roe"
}
}
]

They appear in the output file as follows:

{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }

Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .jsonl.gz).
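
A gzip-compressed JSONL file of this shape can be produced and read back with the standard library alone. The sketch below is illustrative only: the function names are hypothetical, and the emission timestamp is written as a string of milliseconds to match the example lines above.

```python
import gzip
import json
import time
import uuid


def to_jsonl_gz(records) -> bytes:
    """Wrap each record in the Airbyte envelope and gzip the JSONL text."""
    emitted_at = str(int(time.time() * 1000))
    lines = []
    for record in records:
        envelope = {
            "_airbyte_ab_id": str(uuid.uuid4()),
            "_airbyte_emitted_at": emitted_at,
            "_airbyte_data": record,
        }
        lines.append(json.dumps(envelope))
    return gzip.compress(("\n".join(lines) + "\n").encode("utf-8"))


def read_jsonl_gz(payload: bytes):
    """Decompress a .jsonl.gz payload and parse one JSON object per line."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

A consumer of the bucket typically only needs the reading half. The writing half is included so the round trip can be tried locally.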

Parquet

Configuration

The following configuration is available to configure the Parquet output:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| compression_codec | enum | UNCOMPRESSED | Compression algorithm. Available candidates are: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. |
| block_size_mb | integer | 128 (MB) | Block size (row group size) in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
| max_padding_size_mb | integer | 8 (MB) | Max padding size in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
| page_size_kb | integer | 1024 (KB) | Page size in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
| dictionary_page_size_kb | integer | 1024 (KB) | Dictionary page size in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for the dictionary. |
| dictionary_encoding | boolean | true | Dictionary encoding. This parameter controls whether dictionary encoding is turned on. |

These parameters are related to the ParquetOutputFormat. See the Javadoc for more details. Also see the Parquet documentation for its recommended configurations (512 - 1024 MB block size, 8 KB page size).
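
The parameters and defaults from the table above can be modelled as a small validated structure. This is a hedged sketch: the class name is hypothetical, the field names mirror the table, and the validation rules (known codec, positive sizes) are assumptions rather than the connector's actual checks.

```python
from dataclasses import dataclass

ALLOWED_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD"}


@dataclass(frozen=True)
class ParquetOptions:
    """Parquet output settings with the defaults listed in the table."""

    compression_codec: str = "UNCOMPRESSED"
    block_size_mb: int = 128
    max_padding_size_mb: int = 8
    page_size_kb: int = 1024
    dictionary_page_size_kb: int = 1024
    dictionary_encoding: bool = True

    def __post_init__(self):
        if self.compression_codec not in ALLOWED_CODECS:
            raise ValueError(f"unsupported codec: {self.compression_codec}")
        # Assumption: every size parameter must be a positive number.
        for name in ("block_size_mb", "max_padding_size_mb",
                     "page_size_kb", "dictionary_page_size_kb"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


# Values in line with the Parquet recommendation quoted above.
tuned = ParquetOptions(compression_codec="SNAPPY", block_size_mb=512, page_size_kb=8)
```

Larger blocks improve read IO at the cost of write-time memory, so the block size should be raised only when the machine running the sync has memory to spare.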

Data schema

Under the hood, an Airbyte data stream in JSON schema is first converted to an Avro schema, then each JSON object is converted to an Avro record, and finally the Avro record is written in the Parquet format. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

Changelog

| Version | Date | Pull Request | Subject |
| --- | --- | --- | --- |
| 0.4.6 | 2024-02-15 | #35285 | Adopt CDK 0.20.8 |
| 0.4.5 | 2024-02-08 | #34745 | Adopt CDK 0.19.0 |
| 0.4.4 | 2023-07-14 | #28345 | Increment patch to trigger a rebuild |
| 0.4.3 | 2023-07-05 | #27936 | Internal code update |
| 0.4.2 | 2023-06-30 | #27891 | Internal code update |
| 0.4.1 | 2023-06-28 | #27268 | Internal code update |
| 0.4.0 | 2023-06-26 | #27725 | License Update: Elv2 |
| 0.3.0 | 2023-04-28 | #25570 | Fix: all integer schemas should be converted to Avro longs |
| 0.2.17 | 2023-04-27 | #25346 | Internal code cleanup |
| 0.2.16 | 2023-03-17 | #23788 | S3-Parquet: added handler to process null values in arrays |
| 0.2.15 | 2023-03-10 | #23466 | Changed S3 Avro type from Int to Long |
| 0.2.14 | 2023-11-23 | #21682 | Add support for buckets with Customer-Managed Encryption Key |
| 0.2.13 | 2023-01-18 | #21087 | Wrap Authentication Errors as Config Exceptions |
| 0.2.12 | 2022-10-18 | #17901 | Fix logging to GCS |
| 0.2.11 | 2022-09-01 | #16243 | Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields) |
| 0.2.10 | 2022-08-05 | #14801 | Fix multiple log bindings |
| 0.2.9 | 2022-06-24 | #14114 | Remove "additionalProperties": false from specs for connectors with staging |
| 0.2.8 | 2022-06-17 | #13753 | Deprecate and remove PART_SIZE_MB fields from connectors based on StreamTransferManager |
| 0.2.7 | 2022-06-14 | #13483 | Added support for int, long, float data types to Avro/Parquet formats. |
| 0.2.6 | 2022-05-17 | #12820 | Improved 'check' operation performance |
| 0.2.5 | 2022-05-04 | #12578 | In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging. |
| 0.2.4 | 2022-04-22 | #12167 | Add gzip compression option for CSV and JSONL formats. |
| 0.2.3 | 2022-04-22 | #11795 | Fix the connection check to verify the provided bucket path. |
| 0.2.2 | 2022-04-05 | #11728 | Properly clean-up bucket when running OVERWRITE sync mode |
| 0.2.1 | 2022-04-05 | #11499 | Updated spec and documentation. |
| 0.2.0 | 2022-04-04 | #11686 | Use serialized buffering strategy to reduce memory consumption; compress CSV and JSONL formats. |
| 0.1.22 | 2022-02-12 | #10256 | Add JVM flag to exit on OOME. |
| 0.1.21 | 2022-02-12 | #10299 | Fix connection check to require only the necessary permissions. |
| 0.1.20 | 2022-01-11 | #9367 | Avro & Parquet: support array field with unknown item type; default any improperly typed field to string. |
| 0.1.19 | 2022-01-10 | #9121 | Fixed check method for GCS mode to verify if all roles assigned to user |
| 0.1.18 | 2021-12-30 | #8809 | Update connector fields title/description |
| 0.1.17 | 2021-12-21 | #8574 | Added namespace to Avro and Parquet record types |
| 0.1.16 | 2021-12-20 | #8974 | Release a new version to ensure there is no excessive logging. |
| 0.1.15 | 2021-12-03 | #8386 | Add new GCP regions |
| 0.1.14 | 2021-12-01 | #7732 | Support timestamp in Avro and Parquet |
| 0.1.13 | 2021-11-03 | #7288 | Support Json additionalProperties. |
| 0.1.2 | 2021-09-12 | #5720 | Added configurable block size for stream. Each stream is limited to 10,000 by GCS |
| 0.1.1 | 2021-08-26 | #5296 | Added storing gcsCsvFileLocation property for CSV format. This is used by destination-bigquery (GCS Staging upload type) |
| 0.1.0 | 2021-07-16 | #4329 | Initial release. |