Trying Out JSON Lines with S3 Select - Splash of waters

S3 Select supports several file formats: CSV, JSON (lines or document), and Parquet. Of these, JSON Lines looked the easiest to work with and the most flexible, so I took a closer look at it.

Sample file

Here is the file I prepared for this test.

{ "id": "1", "name": "suzuki", "age": 20 }
{ "id": "2", "name": "tanaka", "birthplace": "Tokyo", "hobby": ["baseball"] }
{ "id": "3", "name": "yamada", "age": 25, "birthplace": "Tokyo", "hobby": ["tennis", "soccer"] }
{ "id": "4", "name": "sato", "birthplace": "Saitama", "hobby": ["baseball"] }

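Two things to note about this file:

The set of fields differs from line to line (some lines lack fields that others have)
It contains an array-typed field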
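To make these two points concrete, here is a minimal Python sketch that parses the same four lines locally and reports which fields each record lacks and which fields hold arrays. It is only a local illustration and does not involve S3 Select; the names `SAMPLE` and `parse_json_lines` are made up for this example.

```python
import json

# The sample file from above, one JSON document per line (JSON Lines).
SAMPLE = """\
{ "id": "1", "name": "suzuki", "age": 20 }
{ "id": "2", "name": "tanaka", "birthplace": "Tokyo", "hobby": ["baseball"] }
{ "id": "3", "name": "yamada", "age": 25, "birthplace": "Tokyo", "hobby": ["tennis", "soccer"] }
{ "id": "4", "name": "sato", "birthplace": "Saitama", "hobby": ["baseball"] }
"""


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(SAMPLE)

# Union of every field name that appears anywhere in the file.
all_fields = sorted({key for rec in records for key in rec})

# For each record id, the fields that record does not have.
missing = {rec["id"]: [f for f in all_fields if f not in rec] for rec in records}

# Fields whose value is an array in at least one record.
array_fields = sorted(
    {key for rec in records for key, value in rec.items() if isinstance(value, list)}
)

print(all_fields)
print(missing)
print(array_fields)
```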

I ran the queries directly from the S3 Select screen in the Management Console (the output format may differ slightly when going through an SDK).

File format: JSON
JSON type: JSON Lines
Compression: None
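For reference, the same three settings could be expressed through an SDK roughly as follows. This is a minimal sketch built around boto3's `select_object_content`, not something verified against the console results in this post. The helper name `run_select` is made up, and the bucket and key in the trailing comment are placeholders. The client is passed in as an argument so the parsing logic can be exercised without AWS access.

```python
import json


def run_select(client, bucket, key, expression):
    """Run an S3 Select query on a JSON Lines object and return a list of dicts.

    `client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Mirrors the console settings: JSON / JSON Lines / no compression.
        InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "NONE"},
        # Ask for one JSON record per line in the output.
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )

    # The payload is an event stream. Record data arrives in "Records" events
    # and may be split across several chunks, so join them before parsing.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    body = b"".join(chunks).decode("utf-8")

    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Hypothetical usage (bucket and key are placeholders):
#   import boto3
#   rows = run_select(boto3.client("s3"), "my-bucket", "sample.jsonl",
#                     "select * from s3object s")
```

With this helper the result is always a Python list, whether the query matches zero, one, or many records.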

Fetch all records : `select * from s3object s`

[
    {
        "id": "1",
        "name": "suzuki",
        "age": 20
    },
    {
        "id": "2",
        "name": "tanaka",
        "birthplace": "Tokyo",
        "hobby": [
            "baseball"
        ]
    },
    {
        "id": "3",
        "name": "yamada",
        "age": 25,
        "birthplace": "Tokyo",
        "hobby": [
            "tennis",
            "soccer"
        ]
    },
    {
        "id": "4",
        "name": "sato",
        "birthplace": "Saitama",
        "hobby": [
            "baseball"
        ]
    }
]

All records and all attributes are returned. No problems here.

Filter on a numeric field (single result) : `select * from s3object s where s.age > 20`

{
    "id": "3",
    "name": "yamada",
    "age": 25,
    "birthplace": "Tokyo",
    "hobby": [
        "tennis",
        "soccer"
    ]
}

This time with a condition. Again, the result is as expected.

Filter on a numeric field (multiple results) : `select * from s3object s where s.age >= 20`

[
    {
        "id": "1",
        "name": "suzuki",
        "age": 20
    },
    {
        "id": "3",
        "name": "yamada",
        "age": 25,
        "birthplace": "Tokyo",
        "hobby": [
            "tennis",
            "soccer"
        ]
    }
]

Comparing this with the previous result, it looks like the console shows a single match as a document and multiple matches as a list.

Filter on an array field (does not work) : `select * from s3object s where s.hobby = 'baseball'`

[]

Apparently this is not the way to search an array-typed field.

Filter on an array field (works) : `select * from s3object s where s.hobby[0] = 'baseball'`

[
    {
        "id": "2",
        "name": "tanaka",
        "birthplace": "Tokyo",
        "hobby": [
            "baseball"
        ]
    },
    {
        "id": "4",
        "name": "sato",
        "birthplace": "Saitama",
        "hobby": [
            "baseball"
        ]
    }
]

Specifying an array index does the trick.
Incidentally, specifying an index that does not exist does not raise an error; an empty list is returned.

Also, for a contains-style search that ignores the index, it seems you write `'soccer' in s.hobby`.
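The three array conditions above can be pictured with a plain-Python analogy over the sample records. This is only a local illustration of the observed results, not how S3 Select evaluates queries internally, and the function names are made up for this sketch.

```python
# The sample records from this post, as Python dicts.
RECORDS = [
    {"id": "1", "name": "suzuki", "age": 20},
    {"id": "2", "name": "tanaka", "birthplace": "Tokyo", "hobby": ["baseball"]},
    {"id": "3", "name": "yamada", "age": 25, "birthplace": "Tokyo",
     "hobby": ["tennis", "soccer"]},
    {"id": "4", "name": "sato", "birthplace": "Saitama", "hobby": ["baseball"]},
]


def hobby_equals(rec, value):
    """Analogue of `s.hobby = 'baseball'`: compares the whole array to a string."""
    return rec.get("hobby") == value


def hobby_at(rec, index, value):
    """Analogue of `s.hobby[0] = 'baseball'`: an out-of-range index never matches."""
    hobby = rec.get("hobby", [])
    return index < len(hobby) and hobby[index] == value


def hobby_contains(rec, value):
    """Analogue of `'soccer' in s.hobby`: matches at any position."""
    return value in rec.get("hobby", [])


def matching_ids(predicate):
    """Return the ids of the records that satisfy the predicate."""
    return [rec["id"] for rec in RECORDS if predicate(rec)]


whole_array = matching_ids(lambda r: hobby_equals(r, "baseball"))
first_element = matching_ids(lambda r: hobby_at(r, 0, "baseball"))
bad_index = matching_ids(lambda r: hobby_at(r, 5, "baseball"))
contains = matching_ids(lambda r: hobby_contains(r, "soccer"))

print(whole_array, first_element, bad_index, contains)
```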

Filter on a string (LIKE) : `select * from s3object s where s.birthplace like '%o'`

[
    {
        "id": "2",
        "name": "tanaka",
        "birthplace": "Tokyo",
        "hobby": [
            "baseball"
        ]
    },
    {
        "id": "3",
        "name": "yamada",
        "age": 25,
        "birthplace": "Tokyo",
        "hobby": [
            "tennis",
            "soccer"
        ]
    }
]

LIKE searches are supported.
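As a reminder of what the pattern means, `%` matches any run of characters and `_` matches exactly one. The sketch below converts a LIKE pattern into a Python regular expression and applies it to the sample data locally. It is an approximation written for this post: it is case-sensitive, ignores the ESCAPE clause, and the helper names are made up.

```python
import re

# The sample records from this post, as Python dicts.
RECORDS = [
    {"id": "1", "name": "suzuki", "age": 20},
    {"id": "2", "name": "tanaka", "birthplace": "Tokyo", "hobby": ["baseball"]},
    {"id": "3", "name": "yamada", "age": 25, "birthplace": "Tokyo",
     "hobby": ["tennis", "soccer"]},
    {"id": "4", "name": "sato", "birthplace": "Saitama", "hobby": ["baseball"]},
]


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any sequence of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if value is a string that matches the whole LIKE pattern."""
    if not isinstance(value, str):
        return False  # a missing field (None here) never matches
    return like_to_regex(pattern).fullmatch(value) is not None


like_ids = [rec["id"] for rec in RECORDS if like(rec.get("birthplace"), "%o")]
print(like_ids)
```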

Filter on a string (NOT EQUAL) : `select * from s3object s where s.birthplace != 'Tokyo'`

{
    "id": "4",
    "name": "sato",
    "birthplace": "Saitama",
    "hobby": [
        "baseball"
    ]
}

Only records that actually have a birthplace field, and whose value is something other than Tokyo, are returned.

Find records where a field is missing : `select * from s3object s where s.birthplace IS NULL`

{
    "id": "1",
    "name": "suzuki",
    "age": 20
}

To fetch records that lack a field, IS NULL does the job.
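The behavior of `!=` and `IS NULL` on missing fields differs from what a naive dictionary lookup would give, so a small local comparison may help. The following is a plain-Python analogy of the results observed above, not a description of the S3 Select engine; the function names are made up for this sketch.

```python
# The sample records from this post, as Python dicts.
RECORDS = [
    {"id": "1", "name": "suzuki", "age": 20},
    {"id": "2", "name": "tanaka", "birthplace": "Tokyo", "hobby": ["baseball"]},
    {"id": "3", "name": "yamada", "age": 25, "birthplace": "Tokyo",
     "hobby": ["tennis", "soccer"]},
    {"id": "4", "name": "sato", "birthplace": "Saitama", "hobby": ["baseball"]},
]


def not_equal(rec, field, value):
    """Analogue of `s.<field> != value`: a missing field never satisfies it."""
    return rec.get(field) is not None and rec[field] != value


def is_null(rec, field):
    """Analogue of `s.<field> IS NULL` as observed here: true when the field is absent."""
    return rec.get(field) is None


# A naive Python comparison also picks up the record with no birthplace at all.
naive_ids = [r["id"] for r in RECORDS if r.get("birthplace") != "Tokyo"]

# The query in this post only returned the record that has a different birthplace.
not_equal_ids = [r["id"] for r in RECORDS if not_equal(r, "birthplace", "Tokyo")]

# The record without a birthplace is found with the IS NULL analogue instead.
null_ids = [r["id"] for r in RECORDS if is_null(r, "birthplace")]

print(naive_ids, not_equal_ids, null_ids)
```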

Searching below the root level : `select * from s3object[*].hobby s where s[1] = 'soccer'`

{
    "_1": [
        "tennis",
        "soccer"
    ]
}

I don't expect to need this any time soon, so I only scratched the surface; the details are in the official documentation.

https://docs.aws.amazon.com/ja_jp/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html#s3-glacier-select-sql-reference-attribute-access
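As a rough mental model of the query above, the alias `s` stands for each record's hobby value rather than for the record itself, and the selected value comes back under the key `_1`. The sketch below reproduces the observed output locally in plain Python. It is an analogy under that assumption, not a statement of how S3 Select resolves paths, and the function name is made up.

```python
# The sample records from this post, as Python dicts.
RECORDS = [
    {"id": "1", "name": "suzuki", "age": 20},
    {"id": "2", "name": "tanaka", "birthplace": "Tokyo", "hobby": ["baseball"]},
    {"id": "3", "name": "yamada", "age": 25, "birthplace": "Tokyo",
     "hobby": ["tennis", "soccer"]},
    {"id": "4", "name": "sato", "birthplace": "Saitama", "hobby": ["baseball"]},
]


def select_hobby_where_index_equals(records, index, value):
    """Local analogue of `select * from s3object[*].hobby s where s[index] = value`."""
    rows = []
    for rec in records:
        s = rec.get("hobby")  # `s` is the hobby array, not the whole record
        if isinstance(s, list) and index < len(s) and s[index] == value:
            rows.append({"_1": s})  # mimic the `_1` key seen in the console output
    return rows


rows = select_hobby_where_index_equals(RECORDS, 1, "soccer")
print(rows)
```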

Summary

It behaves almost exactly as I expected, so it looks like I can use it without any trouble.
