Telemetry architecture

Telemetry is designed to ingest, store, and query large volumes of data efficiently, keeping access fast even at scale. This overview explains how Telemetry works, covering its data ingestion and buffering, real-time querying, schema evolution handling, fault tolerance, file compaction, and query optimization.

Data Ingestion and Buffering

Telemetry employs a Rust binary at the core of its data ingestion process. When a JSON blob arrives, the system first checks if the schema of the incoming JSON is compatible with the corresponding table. If the schema matches, the JSON data is placed into a buffer. This buffering strategy allows for efficient data management: the buffer holds data until either 15 minutes have elapsed or 10,000 rows have been accumulated. At this point, the buffer is uploaded to S3 for persistent storage.
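A minimal sketch of that buffering policy is shown below. The struct, the upload_to_s3 stub, and the way the thresholds are checked are illustrative assumptions rather than Telemetry's actual internals; a real implementation would also flush on a timer instead of only when a new row arrives.

```rust
use std::time::{Duration, Instant};

const MAX_AGE: Duration = Duration::from_secs(15 * 60); // flush after 15 minutes...
const MAX_ROWS: usize = 10_000;                          // ...or after 10,000 buffered rows

/// A row is an already schema-checked JSON object, kept as a string for brevity.
struct Buffer {
    rows: Vec<String>,
    opened_at: Instant,
}

impl Buffer {
    fn new() -> Self {
        Self { rows: Vec::new(), opened_at: Instant::now() }
    }

    /// Append a row and flush if either threshold has been crossed.
    /// (A real service would also flush on a timer, not only on arrival.)
    fn push(&mut self, row: String) {
        self.rows.push(row);
        if self.rows.len() >= MAX_ROWS || self.opened_at.elapsed() >= MAX_AGE {
            self.flush();
        }
    }

    /// Hand the accumulated rows to persistent storage and reset the buffer.
    fn flush(&mut self) {
        if self.rows.is_empty() {
            return;
        }
        upload_to_s3(&self.rows); // placeholder for the real S3 upload
        self.rows.clear();
        self.opened_at = Instant::now();
    }
}

fn upload_to_s3(rows: &[String]) {
    // The real system would serialize the rows (e.g. to Parquet) and PUT them to S3.
    println!("uploading {} rows", rows.len());
}

fn main() {
    let mut buf = Buffer::new();
    for i in 0..3 {
        buf.push(format!("{{\"event\": {i}}}"));
    }
    buf.flush(); // explicit flush, e.g. at shutdown
}
```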

Real-Time Querying

One of the key advantages of Telemetry’s buffering mechanism is its ability to perform real-time queries. Because the data is initially stored in a buffer, Telemetry can perform queries that union data from both S3 (or disk) and the buffer. This capability ensures that even the most recent data, which has not yet been flushed to disk, is included in query results. As a result, users can perform real-time analytics on their data without waiting for the next buffer flush.
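Conceptually, answering a query means taking the union of rows already persisted to S3 or disk and rows still sitting in the buffer. The sketch below illustrates that idea; the Row type and both data-source functions are placeholders, since the real engine operates on materialized Parquet files and record batches rather than strings.

```rust
/// A row as the query layer sees it; the real engine works on columnar record batches.
type Row = String;

/// Rows already flushed to S3/disk for the requested table (placeholder).
fn rows_from_storage(table: &str) -> Vec<Row> {
    vec![format!("{table}: persisted row")]
}

/// Rows still held in the in-memory buffer, i.e. not yet flushed (placeholder).
fn rows_from_buffer(table: &str) -> Vec<Row> {
    vec![format!("{table}: buffered row")]
}

/// Real-time view of a table: the union of persisted and buffered data.
fn query(table: &str) -> Vec<Row> {
    let mut rows = rows_from_storage(table);
    rows.extend(rows_from_buffer(table));
    rows
}

fn main() {
    for row in query("events") {
        println!("{row}");
    }
}
```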

Handling Schema Evolution

In dynamic environments, the structure of incoming data often evolves over time. For instance, you might have a JSON schema that gains additional fields as new features are added to your product. Telemetry is built to handle such schema evolution seamlessly. When new fields are added to a JSON schema, Telemetry automatically updates its internal schema management to accommodate these changes. This flexibility allows you to continuously extend your data model without disrupting existing queries or requiring manual schema adjustments. Telemetry ensures that your data remains accessible and usable even as its structure evolves.
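The sketch below shows the kind of schema merge this implies, using a simple name-to-type map: known fields must keep their type, and unknown fields are appended. The types, error handling, and merge rules are assumptions for illustration; the real system tracks Parquet/Arrow schemas rather than this toy representation.

```rust
use std::collections::btree_map::Entry;
use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Debug)]
enum FieldType {
    Int64,
    Utf8,
}

/// A table schema as a map from field name to type.
type Schema = BTreeMap<String, FieldType>;

#[derive(Debug)]
enum SchemaError {
    /// An existing field arrived with a different, incompatible type.
    TypeConflict { field: String },
}

/// Widen the table schema with any new fields found in an incoming record.
fn evolve(table: &mut Schema, incoming: &Schema) -> Result<(), SchemaError> {
    for (name, ty) in incoming {
        match table.entry(name.clone()) {
            Entry::Occupied(existing) => {
                // Known field: its type must stay the same.
                if existing.get() != ty {
                    return Err(SchemaError::TypeConflict { field: name.clone() });
                }
            }
            Entry::Vacant(slot) => {
                // New field: extend the table schema in place.
                slot.insert(*ty);
            }
        }
    }
    Ok(())
}

fn main() {
    let mut table: Schema = BTreeMap::from([("user_id".to_string(), FieldType::Int64)]);
    let incoming: Schema = BTreeMap::from([
        ("user_id".to_string(), FieldType::Int64),
        ("plan".to_string(), FieldType::Utf8), // a newly added field
    ]);
    evolve(&mut table, &incoming).expect("compatible schemas");
    println!("{table:?}"); // now contains both user_id and plan
}
```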

Fault Tolerance and Buffer Management

Telemetry’s design prioritizes data integrity and reliability. The buffer flushes its contents when the server is shutting down, such as when it receives a shutdown signal, so no data is lost during routine maintenance or server restarts. For catastrophic failures, Telemetry also maintains a dead letter queue: any data that could not be processed or flushed is captured there, allowing it to be recovered later and ensuring that no data is permanently lost. Together, these mechanisms let Telemetry handle unexpected scenarios while maintaining data integrity.
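A rough sketch of the shutdown and dead-letter path, using only the standard library: if a flush fails, the rows are appended to a dead letter queue for later replay. The local dead_letter_queue.jsonl file, the simulated S3 failure, and the function names are inventions for illustration; the real service reacts to an actual termination signal and persists its dead letter queue durably.

```rust
use std::fs::OpenOptions;
use std::io::Write;

struct Buffer {
    rows: Vec<String>,
}

impl Buffer {
    /// Try to flush to S3; on failure, route the rows to the dead letter queue
    /// so they can be replayed later instead of being lost.
    fn flush(&mut self) {
        if self.rows.is_empty() {
            return;
        }
        if let Err(err) = upload_to_s3(&self.rows) {
            eprintln!("flush failed ({err}), writing {} rows to DLQ", self.rows.len());
            write_to_dead_letter_queue(&self.rows);
        }
        self.rows.clear();
    }
}

/// Called from the shutdown path so the last partial buffer is not lost
/// during routine restarts.
fn shutdown(buffer: &mut Buffer) {
    buffer.flush();
}

fn upload_to_s3(_rows: &[String]) -> Result<(), String> {
    Err("simulated S3 outage".into()) // placeholder so the DLQ path is exercised
}

fn write_to_dead_letter_queue(rows: &[String]) {
    // Append-only local file standing in for the real dead letter queue.
    let mut dlq = OpenOptions::new()
        .create(true)
        .append(true)
        .open("dead_letter_queue.jsonl")
        .expect("open DLQ file");
    for row in rows {
        writeln!(dlq, "{row}").expect("write DLQ row");
    }
}

fn main() {
    let mut buffer = Buffer { rows: vec!["{\"event\": \"signup\"}".to_string()] };
    shutdown(&mut buffer); // e.g. after receiving a termination signal
}
```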

Compaction and File Management

Telemetry stores data in Parquet files under the hood. Parquet achieves excellent compression ratios, especially as files grow larger and the encoder has more data to exploit. To maintain and even improve these ratios, Telemetry runs a compaction process that merges smaller files or parts into larger, more optimized files. Reducing the number of small files improves query performance and maximizes storage efficiency by taking full advantage of Parquet's compression. As the amount of data grows, compaction keeps storage optimized and data retrieval times low.
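The sketch below shows the planning side of such a compaction pass: select files below a small-file threshold and group them into batches that will each be rewritten as one larger file. The thresholds, the StoredFile metadata, and the grouping heuristic are assumptions; the actual rewriting of Parquet row groups is left as a placeholder.

```rust
/// Metadata about one stored Parquet file (fields invented for illustration).
#[derive(Debug, Clone)]
struct StoredFile {
    path: String,
    size_bytes: u64,
}

/// Files smaller than this are candidates for compaction.
const SMALL_FILE_THRESHOLD: u64 = 64 * 1024 * 1024; // 64 MiB
/// Stop adding inputs once the combined size reaches the target output size.
const TARGET_OUTPUT_SIZE: u64 = 512 * 1024 * 1024; // 512 MiB

/// Group small files into batches that will each be rewritten as one larger file.
fn plan_compaction(files: &[StoredFile]) -> Vec<Vec<StoredFile>> {
    let mut small: Vec<StoredFile> = files
        .iter()
        .filter(|f| f.size_bytes < SMALL_FILE_THRESHOLD)
        .cloned()
        .collect();
    small.sort_by_key(|f| f.size_bytes);

    let mut batches = Vec::new();
    let mut current: Vec<StoredFile> = Vec::new();
    let mut current_size = 0;
    for file in small {
        if current_size + file.size_bytes > TARGET_OUTPUT_SIZE && !current.is_empty() {
            batches.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += file.size_bytes;
        current.push(file);
    }
    if current.len() > 1 {
        batches.push(current); // a lone small file has nothing to merge with
    }
    batches
}

fn main() {
    let files = vec![
        StoredFile { path: "part-0001.parquet".into(), size_bytes: 8 << 20 },
        StoredFile { path: "part-0002.parquet".into(), size_bytes: 12 << 20 },
        StoredFile { path: "part-0003.parquet".into(), size_bytes: 700 << 20 }, // already large
    ];
    for batch in plan_compaction(&files) {
        let inputs: Vec<&str> = batch.iter().map(|f| f.path.as_str()).collect();
        // A real compactor would read these inputs, concatenate their row groups,
        // and write one larger Parquet file back to S3.
        println!("compact {inputs:?} into one file");
    }
}
```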

Metadata and Query Optimization

To facilitate rapid querying, Telemetry maintains a metadata store. This store plays a critical role when users execute queries. When a query is submitted, it is parsed into an Abstract Syntax Tree (AST). The AST helps the system determine which tables to query, the relevant time range, and the filters applied. With this information, Telemetry identifies the specific files that need to be materialized from S3, ensuring that only the necessary data is retrieved and processed.
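A sketch of that metadata-driven pruning: each file entry records the table it belongs to and the time range it covers, so only files overlapping the query's range are materialized. The field names and the simplified query representation are assumptions for illustration.

```rust
/// Per-file metadata kept in the metadata store (fields are illustrative).
struct FileEntry {
    table: String,
    s3_key: String,
    min_ts: i64, // earliest event timestamp in the file (unix seconds)
    max_ts: i64, // latest event timestamp in the file
}

/// The parts of a parsed query (the AST) that matter for pruning.
struct QueryPlan {
    table: String,
    start_ts: i64,
    end_ts: i64,
}

/// Return the S3 keys of files whose time range overlaps the query's range.
fn files_to_materialize<'a>(store: &'a [FileEntry], plan: &QueryPlan) -> Vec<&'a str> {
    store
        .iter()
        .filter(|f| f.table == plan.table)
        .filter(|f| f.max_ts >= plan.start_ts && f.min_ts <= plan.end_ts)
        .map(|f| f.s3_key.as_str())
        .collect()
}

fn main() {
    let store = vec![
        FileEntry {
            table: "events".into(),
            s3_key: "events/2024-01-01.parquet".into(),
            min_ts: 1_704_067_200,
            max_ts: 1_704_153_599,
        },
        FileEntry {
            table: "events".into(),
            s3_key: "events/2024-01-02.parquet".into(),
            min_ts: 1_704_153_600,
            max_ts: 1_704_239_999,
        },
    ];
    // e.g. SELECT ... FROM events WHERE ts BETWEEN 1_704_100_000 AND 1_704_110_000
    let plan = QueryPlan { table: "events".into(), start_ts: 1_704_100_000, end_ts: 1_704_110_000 };
    for key in files_to_materialize(&store, &plan) {
        println!("fetch {key}"); // only the first file overlaps the query range
    }
}
```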

Performance Optimization

Telemetry achieves its high-speed querying through a combination of content-addressable disk caches and tenant-based sharding. Content-addressable disk caches ensure that once data is written to disk, it can be quickly retrieved based on its content rather than its location. Tenant-based sharding further enhances performance by routing queries to a server that already has the relevant data on disk. This approach ensures that over 99% of the time, the system can serve queries without needing to retrieve data from S3, drastically reducing latency.
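The sketch below illustrates both ideas with standard-library hashing. DefaultHasher stands in for a stable content hash such as SHA-256, and the cache path scheme and shard count are invented for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_of(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Content-addressable cache: the on-disk location is derived from the object's
/// content hash, so a cached copy can be found without knowing where it was written.
fn cache_path(content_hash: u64) -> String {
    format!("/var/cache/telemetry/{content_hash:016x}.parquet")
}

/// Tenant-based sharding: a tenant's queries always land on the same server,
/// which therefore tends to already hold that tenant's files on disk.
fn shard_for_tenant(tenant_id: &str, num_shards: u64) -> u64 {
    hash_of(tenant_id) % num_shards
}

fn main() {
    let object_hash = hash_of("events/2024-01-02.parquet");
    println!("cache file: {}", cache_path(object_hash));
    println!("tenant acme -> shard {}", shard_for_tenant("acme", 8));
}
```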

Data Querying

Once the data is available on disk, Telemetry leverages DataFusion, a powerful SQL query engine, to execute queries against the materialized files. DataFusion processes the data and returns the results to the user, ensuring that even complex queries are handled efficiently.
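A minimal example of querying a cached Parquet file with DataFusion is shown below; it assumes the datafusion and tokio crates as dependencies, and the table name, file path, and SQL are placeholders. In the real system, the files registered here would be the ones selected via the metadata store.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a locally cached Parquet file as a queryable table (path is illustrative).
    ctx.register_parquet(
        "events",
        "/var/cache/telemetry/events-2024-01-02.parquet",
        ParquetReadOptions::default(),
    )
    .await?;

    // Run SQL against the registered data and print the result set.
    let df = ctx
        .sql("SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type")
        .await?;
    df.show().await?;

    Ok(())
}
```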

Telemetry's architecture is designed to optimize both data ingestion and querying, making it a robust solution for handling large-scale data. With its intelligent buffering, support for real-time queries, seamless schema evolution handling, fault tolerance mechanisms, file compaction strategies, metadata-driven query optimization, and advanced caching and sharding techniques, Telemetry ensures that users can quickly and efficiently access the data they need. Embrace Telemetry's capabilities to unlock the full potential of your data.
