Kafka
Structured Streaming with Kafka:
- Data collection vs. data ingestion.
- Why are they key?
- Streaming data sources.
- Kafka overview.
- Integration of Kafka and Spark.
- Checkpointing.
- Kafka as sink.
- Delivery semantics.
- What next?
Data Collection:
- Happens where data is created.
- Varies across workload types: batch vs. streaming.
- Has different modes: pull vs. push (toy sketch after this list).
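A toy sketch of the two modes, with made-up names (the collector buffer and source.read_new() are hypothetical, purely for illustration): in push mode the producer drives delivery; in pull mode the collector polls the source.

    import time

    buffer = []                       # stand-in for the collector's store

    def push(event):                  # push: producer calls the collector;
        buffer.append(event)          # the collector just receives events

    def pull(source, interval=5.0):   # pull: the collector drives delivery
        while True:
            buffer.extend(source.read_new())  # hypothetical polling API
            time.sleep(interval)              # poll the source periodically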
Data collection tools:
a) rsyslog
-> One of the oldest data collectors still in use.
-> Works in streaming mode.
-> Installed by default on most Linux systems and widely known (minimal forwarding rule below).
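A minimal forwarding rule in rsyslog's legacy syntax (the hostname and port are placeholders); @@ forwards over TCP, while a single @ would use UDP:

    # /etc/rsyslog.conf: forward every facility/severity to a central host
    *.*  @@central-logs.example.com:514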
b) Flume
-> Distributed data collection service.
-> General-purpose solution for collecting data in many formats.
-> Initially designed to move log data into HDFS frequently and reliably.
-> Originally written and maintained by Cloudera; now an Apache project.
-> Still popular for data collection in the Hadoop ecosystem (sample agent config below).
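A sketch of Flume's source -> channel -> sink model (the agent name, log path, and HDFS URL are made up for illustration):

    # flume.properties: tail an application log into HDFS
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app.log
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
    a1.sinks.k1.channel = c1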
c) Logstash
-> Pluggable architecture.
-> Popular choice in the ELK stack.
-> Written in JRuby.
-> Multiple inputs / multiple outputs.
-> Centralizes logs: collect, parse, and store/forward (pipeline sketch below).
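A sketch of a Logstash pipeline showing its input -> filter -> output stages (the port and Elasticsearch address are assumptions):

    # logstash.conf
    input  { syslog { port => 5514 } }
    filter { grok { match => { "message" => "%{GREEDYDATA:msg}" } } }
    output { elasticsearch { hosts => ["localhost:9200"] } }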
d) Fluentd
-> Plugin architecture.
-> Built-in HA architecture.
-> Lightweight multi-source, multi-destination log routing (routing sketch below).
-> Offered as a service inside Google Cloud (the Cloud Logging agent is Fluentd-based).
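A minimal Fluentd routing sketch (the port and tag pattern are arbitrary choices for illustration):

    # fluent.conf: accept forwarded events and route matching tags to stdout
    <source>
      @type forward
      port 24224
    </source>

    <match app.**>
      @type stdout
    </match>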
Data ingestion:
- Receives and stores data.
- Coupled with the input sources.
- Helps in routing data.
a) RabbitMQ
-> Written in Erlang.
-> Implements AMQP (Advanced Message Queuing Protocol).
-> Has a pluggable architecture, with extensions such as an HTTP-based management API.
-> Provides strong delivery guarantees for messages (publish sketch below).
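A minimal publish sketch using the pika Python client (the broker address and the queue name "events" are assumptions); durable queues plus persistent messages are what back the strong guarantees mentioned above:

    import pika

    # connect to a broker assumed to run on localhost
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    channel.queue_declare(queue="events", durable=True)   # queue survives broker restart
    channel.basic_publish(
        exchange="",                                      # default exchange: route by queue name
        routing_key="events",
        body=b"hello",
        properties=pika.BasicProperties(delivery_mode=2), # mark the message persistent
    )
    connection.close()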
Kafka overview:
Reference :
https://www.youtube.com/watch?v=-V5Fe2Xycao
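Tying the agenda items together, a minimal Structured Streaming sketch that reads from Kafka and writes back to Kafka with checkpointing (topic names, broker address, and the checkpoint path are made up; the job needs the spark-sql-kafka package on the classpath, and the Kafka sink gives at-least-once delivery):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

    # Kafka as source: subscribe to a topic and stream its records
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "input-topic")
              .load())

    # Kafka as sink: the sink expects string/binary key and value columns;
    # the checkpoint location records offsets so the query recovers after failure
    query = (events.selectExpr("CAST(key AS STRING) AS key",
                               "CAST(value AS STRING) AS value")
             .writeStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("topic", "output-topic")
             .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")
             .start())

    query.awaitTermination()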