This is a note on downloading the GitHub event dataset, deploying a ClickHouse server container, and loading the data into the DBMS.
References
- https://ghe.clickhouse.tech/#how-this-dataset-is-created
- https://ghe.clickhouse.tech/#download-the-dataset
- https://hub.docker.com/r/clickhouse/clickhouse-server/
ClickHouse Server
It is quite easy to deploy the container with a docker-compose file. The only things to watch are the exposed ports and the mounted volumes.
```yaml
version: '3'
services:
  ch_gh_raw:
    image: clickhouse/clickhouse-server
    container_name: ch_gh_raw
    ports:
      - "19000:9000"   # native TCP protocol
      - "18123:8123"   # HTTP interface
    volumes:
      - /media/gh_event_data/clickhouse:/var/lib/clickhouse/
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
```
Note that the GitHub event files from 2011 through August 2023 total around 2.7 TB, so size the mounted volume accordingly.
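The note assumes the `.json.gz` files are already on disk. Assuming the GH Archive naming convention from the first reference (`https://data.gharchive.org/YYYY-MM-DD-H.json.gz`, hours 0–23 without zero padding), a sketch that lists the hourly URLs for a date range so they can be piped to a downloader. `gharchive_urls` is a helper name made up here, and GNU `date` is assumed:

```shell
#!/usr/bin/env bash
# Emit one GH Archive URL per hourly file between two dates (inclusive).
gharchive_urls() {
  local day="$1" end="$2"
  while [[ "$day" < "$end" || "$day" == "$end" ]]; do
    for hour in $(seq 0 23); do
      # GH Archive hours are unpadded: ...-0.json.gz through ...-23.json.gz
      echo "https://data.gharchive.org/${day}-${hour}.json.gz"
    done
    day=$(date -d "$day + 1 day" +%F)   # GNU date
  done
}

# Preview the first few URLs of a range:
gharchive_urls 2015-01-01 2015-01-01 | head -n 3
```

The list can then be fed to a few parallel workers, e.g. `gharchive_urls 2011-02-12 2023-08-31 | xargs -P4 -n1 wget -c -q`.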
ClickHouse Schema
Assuming the ClickHouse service is up, the following statement creates the table and specifies a data type for each field.
```sql
CREATE TABLE github_events
(
    file_time DateTime,
    event_type Enum(
        'CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3,
        'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6,
        'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9,
        'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11,
        'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14,
        'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17,
        'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19,
        'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
    actor_login LowCardinality(String),
    repo_name LowCardinality(String),
    created_at DateTime,
    updated_at DateTime,
    action Enum(
        'none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4,
        'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8,
        'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11,
        'review_requested' = 12, 'review_request_removed' = 13,
        'synchronize' = 14, 'started' = 15, 'published' = 16,
        'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
    comment_id UInt64,
    body String,
    path String,
    position Int32,
    line Int32,
    ref LowCardinality(String),
    ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4),
    creator_user_login LowCardinality(String),
    number UInt32,
    title String,
    labels Array(LowCardinality(String)),
    state Enum('none' = 0, 'open' = 1, 'closed' = 2),
    locked UInt8,
    assignee LowCardinality(String),
    assignees Array(LowCardinality(String)),
    comments UInt32,
    author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5),
    closed_at DateTime,
    merged_at DateTime,
    merge_commit_sha String,
    requested_reviewers Array(LowCardinality(String)),
    requested_teams Array(LowCardinality(String)),
    head_ref LowCardinality(String),
    head_sha String,
    base_ref LowCardinality(String),
    base_sha String,
    merged UInt8,
    mergeable UInt8,
    rebaseable UInt8,
    mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4),
    merged_by LowCardinality(String),
    review_comments UInt32,
    maintainer_can_modify UInt8,
    commits UInt32,
    additions UInt32,
    deletions UInt32,
    changed_files UInt32,
    diff_hunk String,
    original_position UInt32,
    commit_id String,
    original_commit_id String,
    push_size UInt32,
    push_distinct_size UInt32,
    member_login LowCardinality(String),
    release_tag_name String,
    release_name String,
    review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, created_at);
```
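Once some data has been inserted, a quick aggregate confirms it landed. This is a hypothetical sanity check; any of the queries from the referenced playground works just as well:

```sql
-- Hypothetical sanity check: event counts per type
SELECT event_type, count() AS c
FROM github_events
GROUP BY event_type
ORDER BY c DESC
LIMIT 10;
```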
Inserting records
The original command from ClickHouse’s example is:
```bash
find . -name '*.json.gz' | xargs -P$(nproc) -I{} bash -c "
gzip -cd {} | jq -c '
[
  (\"{}\" | scan(\"[0-9]+-[0-9]+-[0-9]+-[0-9]+\")),
  .type,
  .actor.login? // .actor_attributes.login? // (.actor | strings) // null,
  .repo.name? // (.repository.owner? + \"/\" + .repository.name?) // null,
  .created_at,
  .payload.updated_at? // .payload.comment?.updated_at? // .payload.issue?.updated_at? // .payload.pull_request?.updated_at? // null,
  .payload.action,
  .payload.comment.id,
  .payload.review.body // .payload.comment.body // .payload.issue.body? // .payload.pull_request.body? // .payload.release.body? // null,
  .payload.comment?.path? // null,
  .payload.comment?.position? // null,
  .payload.comment?.line? // null,
  .payload.ref? // null,
  .payload.ref_type? // null,
  .payload.comment.user?.login? // .payload.issue.user?.login? // .payload.pull_request.user?.login? // null,
  .payload.issue.number? // .payload.pull_request.number? // .payload.number? // null,
  .payload.issue.title? // .payload.pull_request.title? // null,
  [.payload.issue.labels?[]?.name // .payload.pull_request.labels?[]?.name],
  .payload.issue.state? // .payload.pull_request.state? // null,
  .payload.issue.locked? // .payload.pull_request.locked? // null,
  .payload.issue.assignee?.login? // .payload.pull_request.assignee?.login? // null,
  [.payload.issue.assignees?[]?.login? // .payload.pull_request.assignees?[]?.login?],
  .payload.issue.comments? // .payload.pull_request.comments? // null,
  .payload.review.author_association // .payload.issue.author_association? // .payload.pull_request.author_association? // null,
  .payload.issue.closed_at? // .payload.pull_request.closed_at? // null,
  .payload.pull_request.merged_at? // null,
  .payload.pull_request.merge_commit_sha? // null,
  [.payload.pull_request.requested_reviewers?[]?.login],
  [.payload.pull_request.requested_teams?[]?.name],
  .payload.pull_request.head?.ref? // null,
  .payload.pull_request.head?.sha? // null,
  .payload.pull_request.base?.ref? // null,
  .payload.pull_request.base?.sha? // null,
  .payload.pull_request.merged? // null,
  .payload.pull_request.mergeable? // null,
  .payload.pull_request.rebaseable? // null,
  .payload.pull_request.mergeable_state? // null,
  .payload.pull_request.merged_by?.login? // null,
  .payload.pull_request.review_comments? // null,
  .payload.pull_request.maintainer_can_modify? // null,
  .payload.pull_request.commits? // null,
  .payload.pull_request.additions? // null,
  .payload.pull_request.deletions? // null,
  .payload.pull_request.changed_files? // null,
  .payload.comment.diff_hunk? // null,
  .payload.comment.original_position? // null,
  .payload.comment.commit_id? // null,
  .payload.comment.original_commit_id? // null,
  .payload.size? // null,
  .payload.distinct_size? // null,
  .payload.member.login? // .payload.member? // null,
  .payload.release?.tag_name? // null,
  .payload.release?.name? // null,
  .payload.review?.state? // null
]' |
clickhouse-client --port 19000 --input_format_null_as_default 1 \
  --date_time_input_format best_effort \
  --query 'INSERT INTO github_events FORMAT JSONCompactEachRow' \
  || echo 'File {} has issues'
"
```
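One subtlety worth calling out: the first column, `file_time`, never appears in the JSON itself. jq's `scan("[0-9]+-[0-9]+-[0-9]+-[0-9]+")` pulls the date-hour stamp out of the `{}` filename that xargs substitutes into the script. The same extraction can be sanity-checked with plain grep; `extract_stamp` is a throwaway helper made up for this check:

```shell
#!/usr/bin/env bash
# Pull the YYYY-MM-DD-H stamp out of a GH Archive filename, the same
# way the jq scan() in the command above does.
extract_stamp() {
  basename "$1" | grep -oE '[0-9]+-[0-9]+-[0-9]+-[0-9]+'
}

extract_stamp ./2023/2023-08-01-15.json.gz   # prints 2023-08-01-15
```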
We can modify the command a bit to show a progress bar, so the long silence during loading doesn’t look like a hang. Step by step, the command:
- Finds all files in the current directory and its subdirectories with the extension .json.gz.
- Decompresses each file and processes its content with jq, a lightweight and flexible command-line JSON processor. The jq script extracts various pieces of information from each JSON object and formats them into an array.
- Inserts the processed data into a ClickHouse database table called github_events using the clickhouse-client.
- If any step fails, it prints a message indicating that the file has issues.
- The xargs command is used to process the files in parallel, with the -P option specifying the maximum number of parallel processes to run at a time. The $(nproc) command returns the number of available processors.
```bash
find . -name '*.json.gz' |
pv -l -s $(find . -name '*.json.gz' | wc -l) |
xargs -P$(nproc) -I{} bash -c "
gzip -cd {} | jq -c '
[
  (\"{}\" | scan(\"[0-9]+-[0-9]+-[0-9]+-[0-9]+\")),
  .type,
  .actor.login? // .actor_attributes.login? // (.actor | strings) // null,
  .repo.name? // (.repository.owner? + \"/\" + .repository.name?) // null,
  .created_at,
  .payload.updated_at? // .payload.comment?.updated_at? // .payload.issue?.updated_at? // .payload.pull_request?.updated_at? // null,
  .payload.action,
  .payload.comment.id,
  .payload.review.body // .payload.comment.body // .payload.issue.body? // .payload.pull_request.body? // .payload.release.body? // null,
  .payload.comment?.path? // null,
  .payload.comment?.position? // null,
  .payload.comment?.line? // null,
  .payload.ref? // null,
  .payload.ref_type? // null,
  .payload.comment.user?.login? // .payload.issue.user?.login? // .payload.pull_request.user?.login? // null,
  .payload.issue.number? // .payload.pull_request.number? // .payload.number? // null,
  .payload.issue.title? // .payload.pull_request.title? // null,
  [.payload.issue.labels?[]?.name // .payload.pull_request.labels?[]?.name],
  .payload.issue.state? // .payload.pull_request.state? // null,
  .payload.issue.locked? // .payload.pull_request.locked? // null,
  .payload.issue.assignee?.login? // .payload.pull_request.assignee?.login? // null,
  [.payload.issue.assignees?[]?.login? // .payload.pull_request.assignees?[]?.login?],
  .payload.issue.comments? // .payload.pull_request.comments? // null,
  .payload.review.author_association // .payload.issue.author_association? // .payload.pull_request.author_association? // null,
  .payload.issue.closed_at? // .payload.pull_request.closed_at? // null,
  .payload.pull_request.merged_at? // null,
  .payload.pull_request.merge_commit_sha? // null,
  [.payload.pull_request.requested_reviewers?[]?.login],
  [.payload.pull_request.requested_teams?[]?.name],
  .payload.pull_request.head?.ref? // null,
  .payload.pull_request.head?.sha? // null,
  .payload.pull_request.base?.ref? // null,
  .payload.pull_request.base?.sha? // null,
  .payload.pull_request.merged? // null,
  .payload.pull_request.mergeable? // null,
  .payload.pull_request.rebaseable? // null,
  .payload.pull_request.mergeable_state? // null,
  .payload.pull_request.merged_by?.login? // null,
  .payload.pull_request.review_comments? // null,
  .payload.pull_request.maintainer_can_modify? // null,
  .payload.pull_request.commits? // null,
  .payload.pull_request.additions? // null,
  .payload.pull_request.deletions? // null,
  .payload.pull_request.changed_files? // null,
  .payload.comment.diff_hunk? // null,
  .payload.comment.original_position? // null,
  .payload.comment.commit_id? // null,
  .payload.comment.original_commit_id? // null,
  .payload.size? // null,
  .payload.distinct_size? // null,
  .payload.member.login? // .payload.member? // null,
  .payload.release?.tag_name? // null,
  .payload.release?.name? // null,
  .payload.review?.state? // null
]' |
clickhouse-client --port 19000 --input_format_null_as_default 1 \
  --date_time_input_format best_effort \
  --query 'INSERT INTO github_events FORMAT JSONCompactEachRow' \
  || echo 'File {} has issues'
"
```
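Even with the `|| echo` guard, failure messages scroll past during a multi-day load. A small pattern sketch that logs failing inputs to a file so only they need a retry. The `load_file` stub below is made up to demonstrate the pattern; in the real pipeline the `gzip | jq | clickhouse-client` chain takes its place, and the guard becomes `|| echo "{}" >> failed_files.txt`:

```shell
#!/usr/bin/env bash
# Collect the names of inputs that fail to load, for a later retry pass.
rm -f failed_files.txt

load_file() {            # stand-in for the real per-file pipeline
  [[ "$1" != *bad* ]]    # pretend files containing "bad" fail to load
}

printf '%s\n' a.json.gz bad.json.gz c.json.gz |
while read -r f; do
  load_file "$f" || echo "$f" >> failed_files.txt
done

cat failed_files.txt     # prints bad.json.gz
```

A second pass then only needs `xargs ... < failed_files.txt` instead of re-walking the whole tree.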