ClickHouse/src/Formats/ProtobufSchemas.h

#pragma once

#include "config.h"
#if USE_PROTOBUF

#include <memory>
#include <mutex>
#include <unordered_map>
#include <base/types.h>
#include <boost/noncopyable.hpp>


namespace google
{
namespace protobuf
{
    class Descriptor;
}
}

namespace DB
{
class FormatSchemaInfo;

/** Keeps Google protobuf schemas parsed from files.
  * This class is used to handle the "Protobuf", "ProtobufSingle" and "ProtobufList" input/output formats.
  */
class ProtobufSchemas : private boost::noncopyable
{
public:
    enum class WithEnvelope
    {
        // Return descriptor for a top-level message with a user-provided name.
        // Example: In protobuf schema
        //   message MessageType {
        //     string colA = 1;
        //     int32 colB = 2;
        //   }
        // message_name = "MessageType" returns a descriptor. Used by IO
        // formats Protobuf and ProtobufSingle.
        No,
        // Return descriptor for a message with a user-provided name one level
        // below a top-level message with the hardcoded name "Envelope".
        // Example: In protobuf schema
        //   message Envelope {
        //     message MessageType {
        //       string colA = 1;
        //       int32 colB = 2;
        //     }
        //   }
        // message_name = "MessageType" returns a descriptor. Used by IO format
        // ProtobufList.
        Yes
    };

    static ProtobufSchemas & instance();

    /// Parses the format schema, then parses the corresponding proto file, and returns the descriptor of the message type.
    /// The function never returns nullptr; it throws an exception if it cannot load or parse the file.
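    ///
    /// Usage sketch (illustrative only; `schema_info` is a hypothetical, already constructed FormatSchemaInfo):
    ///     const auto * descriptor = ProtobufSchemas::instance().getMessageTypeForFormatSchema(
    ///         schema_info, ProtobufSchemas::WithEnvelope::No);
    /// Pass WithEnvelope::Yes instead to look the message up one level below "Envelope" (format ProtobufList).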
    const google::protobuf::Descriptor * getMessageTypeForFormatSchema(const FormatSchemaInfo & info, WithEnvelope with_envelope);

private:
    class ImporterWithSourceTree;
    std::unordered_map<String, std::unique_ptr<ImporterWithSourceTree>> importers;
    std::mutex mutex;
};

}

#endif