Google BigQuery Source

Google Cloud’s BigQuery is a fully managed enterprise data warehouse for managing and analyzing your data, with built-in features such as ML, geospatial analysis, and business intelligence. The Google BigQuery integration gets data from a Google BigQuery table via a provided query.
Data Source
The Google BigQuery integration fetches the results of a query using the BigQuery API.
Setup and Configuration
Follow the steps below to get the Service Account's credential JSON file, which is needed to run BigQuery jobs:
- Open IAM & Admin under Google Cloud Console.
- Select the Service Account tab.
- From the project dropdown button, select the project where you will run the BigQuery jobs.
- Click Create a Service Account and follow the instructions in the Google Cloud documentation, Create service accounts.
- Click on the email address provisioned during the creation and then click the KEYS tab.
- Click ADD KEY and choose Create new key.
- Select key type as JSON.
- Click Create. A JSON key file is downloaded to your computer.
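Before uploading the downloaded key file, it can help to confirm that it looks like a service account key. The following is a minimal sketch, not part of the integration: the helper name `check_service_account_key` is hypothetical, and the list of expected fields is an assumption based on the fields typically found in a Google service account key file.

```python
import json

# Fields typically present in a service account key file.
# This list is an assumption for illustration, not an exhaustive schema.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(raw_json: str) -> list[str]:
    """Return a list of problems found in the key file text (empty if none)."""
    try:
        key = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(key, dict):
        return ["top-level value is not a JSON object"]

    # Report every expected field that is absent or empty.
    problems = [
        f"missing field: {name}" for name in REQUIRED_KEY_FIELDS if not key.get(name)
    ]
    # A key created from the KEYS tab should identify itself as a service account.
    key_type = key.get("type")
    if key_type and key_type != "service_account":
        problems.append(f"unexpected type: {key_type}")
    return problems
```

To check a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the text to the function; an empty list means no problems were detected by this basic check.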
States
When you create a Google BigQuery Source, it goes through the following states:
- Pending. Once the Source is submitted, it is validated, stored, and placed in a Pending state.
- Started. A collection task is created on the Hosted Collector.
- Initialized. The task configuration is complete in Sumo Logic.
- Authenticated. The Source successfully authenticated with Google BigQuery.
- Collecting. The Source is actively collecting data from Google BigQuery.
If the Source has any issues during any one of these states, it is placed in an Error state.
When you delete the Source, it is placed in a Stopping state. When it has successfully stopped, it is deleted from your Hosted Collector. On the Collection page, the Health and Status for Sources are displayed. You can click the text in the Health column, such as Error, to open the issue in Health Events to investigate.
Create Google BigQuery Source
When you create a Google BigQuery Source, you add it to a Hosted Collector. Before creating the Source, identify the Hosted Collector you want to use or create a new Hosted Collector. For instructions, see Configure a Hosted Collector.
Before setting up the integration, test out the query with the checkpointing logic and a specific checkpoint value in the Google BigQuery console.
To configure a Google BigQuery Source:
- In Sumo Logic, select Manage Data > Collection > Collection.
- On the Collection page, click Add Source next to a Hosted Collector.
- Search for and select Google BigQuery.
- Enter a Name for the Source. The description is optional.
- (Optional) For Source Category, enter any string to tag the output collected from the Source. Category metadata is stored in a searchable field called _sourceCategory.
- (Optional) Fields. Click the +Add button to define the fields you want to associate. Each field needs a name (key) and value.
  - A green circle with a check mark is shown when the field exists in the Fields table schema.
  - An orange triangle with an exclamation point is shown when the field doesn't exist in the Fields table schema. In this case, an option to automatically add the nonexistent fields to the Fields table schema is provided. If a field that does not exist in the Fields schema is sent to Sumo Logic, it is ignored (dropped).
- Project ID. Enter the unique identifier for your BigQuery project. You can find it in the Google Cloud Console.
- Checkpoint Field. Enter the name of the field in the query result to be used for checkpointing. This field has to be increasing and of type number or timestamp.
- Checkpoint Start. Enter the first value for the checkpoint that the integration will plug into the query.
- (Optional) Time Field. Enter the name of the field in the query result to be parsed as timestamp. If not provided, the current time will be used.
- Query. Enter the query that you need to run. You must include the phrase %CHECKPOINT% and sort the results by the checkpoint field.
- (Optional) Query Interval. Enter the time interval to run the query in the format: Xm (for X minutes) or Xh (for X hours).
- Google BigQuery Credential. Upload the Credential JSON file downloaded from Google Cloud IAM & Admin.
- (Optional) Processing Rules for Logs. Configure any desired filters, such as allowlist, denylist, hash, or mask, as described in Create a Processing Rule.
- When you are finished configuring the Source, click Save.
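The Query Interval format from the steps above (Xm or Xh) can be checked before you save the Source. The sketch below is illustrative only: the helper name `interval_to_seconds` is hypothetical, and rejecting a zero interval is an assumption rather than documented behavior.

```python
import re

# Matches a positive integer followed by "m" (minutes) or "h" (hours).
_INTERVAL_RE = re.compile(r"(\d+)([mh])")


def interval_to_seconds(value: str) -> int:
    """Convert a Query Interval such as "5m" or "2h" to seconds."""
    match = _INTERVAL_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"expected Xm or Xh, got {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if amount == 0:
        # Assumption: an interval of zero is treated as invalid here.
        raise ValueError("interval must be greater than zero")
    return amount * (60 if unit == "m" else 3600)
```

For instance, `interval_to_seconds("2m")` returns 120 and `interval_to_seconds("1h")` returns 3600, while a value such as `"10s"` raises `ValueError`.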
Sample values for Query, Checkpoint, and Checkpoint Start fields
Each query must contain the phrase %CHECKPOINT%. The integration extracts and saves the current checkpoint and uses it in place of this phrase. The value of Checkpoint Start must be the same type as the Checkpoint Field.
Quote the phrase as "%CHECKPOINT%" if the Checkpoint Field is a timestamp string.
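To preview the query that will actually run, you can substitute a checkpoint value yourself and paste the result into the Google BigQuery console. The sketch below assumes the substitution is a plain text replacement of the placeholder; `render_query` is a hypothetical helper, not part of the integration.

```python
PLACEHOLDER = "%CHECKPOINT%"


def render_query(query: str, checkpoint: str) -> str:
    """Return the query with the placeholder replaced by a checkpoint value."""
    if PLACEHOLDER not in query:
        raise ValueError(f"query must contain {PLACEHOLDER}")
    # Any quotes around the placeholder stay in place, so a timestamp string
    # remains quoted and a numeric checkpoint remains bare.
    return query.replace(PLACEHOLDER, str(checkpoint))


# Timestamp checkpoint: the placeholder is quoted in the query text.
print(render_query(
    'Select * from MyProject.MyDataSet.MyTable where timestamp > "%CHECKPOINT%"',
    "2022-02-02 11:00:00.000+0700",
))
```

Running the rendered query by hand with the Checkpoint Start value is a quick way to confirm that the types line up before creating the Source.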
The following examples show which values to use for the Query, Checkpoint Field, Time Field, and Checkpoint Start fields.
Example 1: Checkpoint Field is a timestamp.
The placeholder is wrapped in double quotes because the timestamp is substituted as a string.
Select * from MyProject.MyDataSet.MyTable where timestamp > "%CHECKPOINT%"
Field | Value |
---|---|
Checkpoint Field | timestamp |
Checkpoint Start | 2022-02-02 11:00:00.000+0700 |
Time Field | timestamp |
Specific example on a public dataset:
SELECT base_url,source_url,collection_category,collection_number,timestamp(sensing_time) as sensing_time FROM bigquery-public-data.cloud_storage_geo_index.landsat_index where sensing_time > '%CHECKPOINT%' order by sensing_time asc LIMIT 100
Field | Value |
---|---|
Checkpoint Field | sensing_time |
Checkpoint Start | 2022-02-02 11:00:00.000+0700 |
Time Field | sensing_time |
Example 2: Checkpoint Field is a numeric field.
SELECT trip_id,subscriber_type,start_time,duration_minutes FROM bigquery-public-data.austin_bikeshare.bikeshare_trips where trip_id > %CHECKPOINT% order by trip_id asc LIMIT 100
Field | Value |
---|---|
Checkpoint Field | trip_id |
Checkpoint Start | 0 |
Time Field | start_time |
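The checkpointing behind these examples can be simulated locally. The sketch below is an illustration only: it assumes the integration keeps the largest checkpoint value seen in each batch and substitutes it into the next query. The function `run_poll` and the in-memory `trips` rows are hypothetical stand-ins for BigQuery.

```python
def run_poll(rows, checkpoint_field, checkpoint, limit):
    """Mimic: WHERE field > checkpoint ORDER BY field ASC LIMIT limit."""
    matching = [row for row in rows if row[checkpoint_field] > checkpoint]
    matching.sort(key=lambda row: row[checkpoint_field])
    batch = matching[:limit]
    # The next checkpoint is the largest value in the batch; if the batch
    # is empty, the checkpoint stays where it was.
    next_checkpoint = max(
        (row[checkpoint_field] for row in batch), default=checkpoint
    )
    return batch, next_checkpoint


# Unordered sample rows, polled two at a time starting from Checkpoint Start 0.
trips = [{"trip_id": i} for i in (5, 1, 4, 2, 3)]
checkpoint = 0
collected = []
while True:
    batch, checkpoint = run_poll(trips, "trip_id", checkpoint, limit=2)
    if not batch:
        break
    collected.extend(row["trip_id"] for row in batch)

print(collected, checkpoint)
```

Because each batch is sorted by the checkpoint field before the limit is applied, successive polls pick up every row exactly once in this simulation.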
Example 3: Query Gmail Logs
In the example below, replace MyProject and MyDataSet with values matching your environment.
SELECT gmail.message_info,gmail.event_info,gmail.event_info.timestamp_usec AS TIMESTAMP FROM `MyProject.MyDataSet.activity` WHERE gmail.event_info.timestamp_usec > %CHECKPOINT% order by TIMESTAMP LIMIT 30000
Field | Value |
---|---|
Checkpoint Field | TIMESTAMP |
Checkpoint Start | 1683053865563258 |
Time Field | TIMESTAMP |
Note that the value of Checkpoint Start above is an epoch timestamp in microseconds (16 digits) for May 2, 2023 06:57:45.563258 PM GMT, and the query also sorts by the checkpoint field (TIMESTAMP).
When setting up this source for Gmail logs for the first time and collecting historical Gmail logs, it is important to set the Checkpoint Start in epoch microseconds (16 digits) and to sort by the checkpoint field explicitly in your query. Also note that, depending on your Gmail log volume, the backfill might take a long time and require many BigQuery queries if the starting point is set far in the past.
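A Checkpoint Start value in epoch microseconds can be computed from a UTC date and time. The sketch below uses only the Python standard library; the helper names `to_epoch_micros` and `from_epoch_micros` are hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch microseconds."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    # Floor division of two timedeltas yields an exact integer.
    return (moment - EPOCH) // timedelta(microseconds=1)


def from_epoch_micros(value: int) -> datetime:
    """Convert epoch microseconds back to a UTC datetime."""
    return EPOCH + timedelta(microseconds=value)


# May 2, 2023 06:57:45.563258 PM GMT, the value used in Example 3.
start = datetime(2023, 5, 2, 18, 57, 45, 563258, tzinfo=timezone.utc)
print(to_epoch_micros(start))  # 1683053865563258
```

Converting a candidate Checkpoint Start back with `from_epoch_micros` is a simple way to confirm that the value is in microseconds rather than seconds or milliseconds.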
Error Types
When Sumo Logic detects an issue, it is tracked by Health Events. The following table shows the three possible error types, the reason for the error, whether the Source attempts to retry, and the name of the event log in the Health Event Index.
Type | Reason | Retries | Retry Behavior | Health Event Name |
---|---|---|---|---|
ThirdPartyConfig | Normally due to an invalid configuration. You'll need to review your Source configuration and make an update. | No retries are attempted until the Source is updated. | Not applicable | ThirdPartyConfigError |
ThirdPartyGeneric | Normally due to an error communicating with the third-party service APIs. | Yes | The Source will retry indefinitely. | ThirdPartyGenericError |
FirstPartyGeneric | Normally due to an error communicating with the internal Sumo Logic APIs. | Yes | The Source will retry indefinitely. | FirstPartyGenericError |
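If you handle Health Events in your own tooling, the table above can be encoded as a small lookup. This is a sketch only: the dictionary layout and the helper name `retries_automatically` are hypothetical, while the values come from the table.

```python
# Error type -> whether the Source retries, and the Health Event name.
ERROR_TYPES = {
    "ThirdPartyConfig": {"retries": False, "health_event": "ThirdPartyConfigError"},
    "ThirdPartyGeneric": {"retries": True, "health_event": "ThirdPartyGenericError"},
    "FirstPartyGeneric": {"retries": True, "health_event": "FirstPartyGenericError"},
}


def retries_automatically(error_type: str) -> bool:
    """Return True if the Source retries on its own for this error type."""
    return ERROR_TYPES[error_type]["retries"]


# Error types that need a configuration update before collection resumes.
needs_attention = [name for name in ERROR_TYPES if not retries_automatically(name)]
print(needs_attention)  # ['ThirdPartyConfig']
```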
Restarting your Source
If your Source encounters ThirdPartyConfig errors, you can restart it from either the Sumo Logic UI or Sumo Logic API.
UI
To restart your source in the Sumo Logic platform, follow the steps below:
- Go to Manage Data > Collection > Collection to open the Collection page.
- Select the source and click the information icon on the right side of the row.
- The API usage information popup is displayed. Click the Restart Source button on the bottom left.
- Click Confirm to send the restart request.
- A notification at the bottom left of the platform confirms that the request was successful.
API
To restart your source using the Sumo Management API, follow the instructions below:
- Method: POST
- Example endpoint: https://api.sumologic.com/api/v1/collectors/{collector_id}/sources/{source_id}/action/restart
Sumo Logic endpoints like api.sumologic.com are different in deployments outside us1. For example, an API endpoint in Europe would begin api.eu.sumologic.com. A service endpoint in us2 (Western U.S.) would begin service.us2.sumologic.com. For more information, see Sumo Logic Endpoints.
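The restart call can be prepared with the Python standard library. The sketch below only builds the request and does not send it. The helper name `build_restart_request` is hypothetical, and authenticating with HTTP Basic credentials made from an access ID and access key is an assumption not stated on this page; check the API documentation for your deployment.

```python
import base64
import urllib.request


def build_restart_request(collector_id, source_id, access_id, access_key,
                          api_host="api.sumologic.com"):
    """Build (but do not send) a POST request that restarts a Source."""
    url = (
        f"https://{api_host}/api/v1/collectors/{collector_id}"
        f"/sources/{source_id}/action/restart"
    )
    # Assumption: the API accepts HTTP Basic authentication.
    token = base64.b64encode(f"{access_id}:{access_key}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder IDs and credentials, using the European API host as an example.
request = build_restart_request("101", "202", "my-access-id", "my-access-key",
                                api_host="api.eu.sumologic.com")
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` and inspect the response status.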
JSON Configuration
Sources can be configured using UTF-8 encoded JSON files with the Collector Management API. See how to use JSON to configure Sources for details.
Parameter | Type | Required | Description | Access |
---|---|---|---|---|
config | JSON Object | Yes | Contains the configuration parameters of the Source. | not applicable |
schemaRef | JSON Object | Yes | Use {"type":"Google BigQuery"} for Google BigQuery Source. | not modifiable |
sourceType | String | Yes | Use Universal for Google BigQuery. | not modifiable |
Config Parameters
Parameter | Type | Required | Description | Access |
---|---|---|---|---|
name | String | Yes | Type the desired name of the Source. It must be unique per Collector. This value is assigned to the metadata field _source. | modifiable |
description | String | No | Type the description of the Source. | modifiable |
category | String | No | Type the category of the source. This value is assigned to the metadata field _sourceCategory. | modifiable |
fields | JSON Object | No | JSON map of key-value fields (metadata) to apply to the Collector or Source. Use the boolean field _siemForward to enable forwarding to SIEM. | modifiable |
projectId | String | Yes | The project ID is the globally unique identifier for your project. For example, pelagic-quanta-364805. | modifiable |
credentialsJson | String | Yes | This field contains the credential JSON of the Service Account used for accessing BigQuery service. | modifiable |
query | String | Yes | The query to be used in BigQuery. The special string %CHECKPOINT% will be replaced with the largest value seen in the checkpoint field. | modifiable |
timeField | String | No | The name of the column to be used to extract the timestamp. If not specified, the Source uses the current time for each row or record collected. The TIMESTAMP data type is recommended, but any number type will be converted into epoch milliseconds or epoch microseconds. | modifiable |
checkpointField | String | Yes | The column whose largest value will be used as the %CHECKPOINT% in the next search. The checkpoint field has to be of type number or timestamp. | modifiable |
checkpointStart | String | Yes | The very first value of the checkpoint to be used in the query. | modifiable |
JSON Example
{
  "api.version":"v1",
  "source":{
    "schemaRef":{
      "type":"Google BigQuery"
    },
    "config":{
      "name":"MyBigQuerySource",
      "checkpointField":"timestamp_usec",
      "timeField":"timestamp_usec",
      "checkpointStart":"0",
      "query":"select message_info,event_info,event_info.timestamp_usec as timestamp_usec from `bigquery-dev-382704.BigQueryTest.GmailTest` where event_info.timestamp_usec > %CHECKPOINT% LIMIT 2",
      "projectId":"********",
      "fields":{
        "_siemForward":false
      },
      "pollingInterval":"2m",
      "credentialsJson":"********"
    },
    "state":{
      "state":"Collecting"
    },
    "sourceType":"Universal"
  }
}
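A payload like the one above can be assembled and checked in code before it is sent to the Collector Management API. The following is a minimal sketch: `build_source_payload` is a hypothetical helper, the required keys follow the Config Parameters table, and the sample project and table names are placeholders.

```python
import json

# Config keys marked as required in the Config Parameters table.
REQUIRED_CONFIG_KEYS = (
    "name", "projectId", "credentialsJson",
    "query", "checkpointField", "checkpointStart",
)


def build_source_payload(config: dict) -> str:
    """Validate a config dict and wrap it in the Source JSON envelope."""
    missing = [key for key in REQUIRED_CONFIG_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")
    if "%CHECKPOINT%" not in config["query"]:
        raise ValueError("query must contain %CHECKPOINT%")
    payload = {
        "api.version": "v1",
        "source": {
            "schemaRef": {"type": "Google BigQuery"},
            "config": config,
            "sourceType": "Universal",
        },
    }
    return json.dumps(payload, indent=2)


# Placeholder values; replace them with your own project, table, and key.
example = build_source_payload({
    "name": "MyBigQuerySource",
    "projectId": "my-project-id",
    "credentialsJson": "<contents of the key file>",
    "query": "select id from `MyProject.MyDataSet.MyTable` "
             "where id > %CHECKPOINT% order by id",
    "checkpointField": "id",
    "checkpointStart": "0",
})
print(example)
```

The `state` object shown in the example above is omitted here on the assumption that it is reported by Sumo Logic rather than supplied when creating the Source.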