Keep on Keeping on !!: Structured streaming with Kafka

Monday, May 14, 2018

Structured streaming with Kafka:

Data collection vs. data ingestion.
Why are they key?
Streaming data sources.
Kafka overview.
Integration of Kafka and Spark.
Checkpointing.
Kafka as a sink.
Delivery semantics.
What next?

Data Collection :

Happens where data is created.
Varies by type of workload: batch vs. streaming.
Different modes of data collection: pull vs. push.
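
To make the pull vs. push distinction concrete, here is a minimal Python sketch. The `LogSource` class and its `subscribe`, `emit`, and `poll` methods are hypothetical names invented for illustration; they do not belong to any of the tools listed below. In push mode the source calls the collector as each event is created, while in pull mode the collector asks for buffered events on its own schedule.

```python
from collections import deque
from typing import Callable, Deque, List


class LogSource:
    """Hypothetical in-memory source standing in for an application that creates data."""

    def __init__(self) -> None:
        self._buffer: Deque[str] = deque()
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        # Push mode: the collector registers once and is called on every new event.
        self._subscribers.append(callback)

    def emit(self, event: str) -> None:
        # The source buffers the event (for pullers) and notifies subscribers (push).
        self._buffer.append(event)
        for callback in self._subscribers:
            callback(event)

    def poll(self, max_events: int) -> List[str]:
        # Pull mode: the collector asks for up to max_events buffered events.
        batch: List[str] = []
        while self._buffer and len(batch) < max_events:
            batch.append(self._buffer.popleft())
        return batch


source = LogSource()

# Push collector: receives events as soon as they are emitted.
pushed: List[str] = []
source.subscribe(pushed.append)

source.emit("event-1")
source.emit("event-2")
source.emit("event-3")

# Pull collector: fetches a bounded batch whenever it decides to.
pulled = source.poll(max_events=2)

print(pushed)
print(pulled)
```

In this sketch the pull side controls its own batch size and pace, whereas the push side sees every event immediately, which is roughly the trade-off between the two modes.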

Data collection tools :

a) rsyslog

-> One of the oldest data collectors.
-> Streaming mode.
-> Installed by default on many Linux distributions and widely known.

b) Flume

-> Distributed data collection service.
-> Collects data in all formats.
-> Initially designed to transfer log data into HDFS frequently and reliably.
-> Originally developed at Cloudera; now an Apache project.
-> Still popular for data collection in the Hadoop ecosystem.

c) LogStash

-> Pluggable architecture.
-> Popular choice in the ELK stack.
-> Written in JRuby.
-> Multiple inputs / multiple outputs.
-> Centralizes logs: collect, parse, and store/forward.
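
The collect, parse, and store/forward flow can be sketched in a few lines of plain Python. This is not Logstash configuration or its API; the function names, the log line layout (`timestamp LEVEL message`), and the in-memory `sink` list are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional

# Assumed log layout: "<timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def collect(lines: Iterable[str]) -> Iterator[str]:
    """Collect stage: read raw lines and skip empty ones."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def parse(line: str) -> Optional[Dict[str, str]]:
    """Parse stage: turn a raw line into a structured event, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


def forward(events: Iterable[Dict[str, str]], sink: List[Dict[str, str]]) -> None:
    """Store/forward stage: here the destination is just an in-memory list."""
    for event in events:
        sink.append(event)


raw_lines = [
    "2018-05-14T10:00:00 INFO service started",
    "",
    "not a log line",
    "2018-05-14T10:00:05 ERROR disk full",
]

sink: List[Dict[str, str]] = []
parsed = (parse(line) for line in collect(raw_lines))
forward((event for event in parsed if event is not None), sink)

for event in sink:
    print(event["level"], event["message"])
```

Each stage only depends on the output of the previous one, which is what makes it possible to swap inputs and outputs independently in a pluggable design.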

d) Fluentd

-> Plugin architecture.
-> Built-in HA architecture.
-> Lightweight multi-source, multi-destination log routing.
-> It is offered as a service inside Google Cloud.

Data ingestion :

Receives and stores data.
Coupled with input sources.
Helps route data.

Data Ingestion tools :

a) RabbitMQ

-> Written in Erlang.
-> Implements AMQP (Advanced Message Queuing Protocol).
-> Has a pluggable architecture and provides an extension for HTTP.
-> Provides strong delivery guarantees for messages.
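
As a rough illustration of the AMQP routing model, where producers publish to an exchange and bindings decide which queues receive each message, here is a toy in-memory version of a direct exchange. It is not the RabbitMQ client API: the `DirectExchange` class, its methods, and the queue and routing key names are hypothetical, and the sketch leaves out acknowledgements, persistence, and networking.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Optional


class DirectExchange:
    """Toy model of an AMQP-style direct exchange (illustrative only)."""

    def __init__(self) -> None:
        self._bindings: Dict[str, List[str]] = defaultdict(list)  # routing key -> queue names
        self._queues: Dict[str, Deque[str]] = {}

    def declare_queue(self, name: str) -> None:
        # Create the queue if it does not exist yet.
        self._queues.setdefault(name, deque())

    def bind(self, queue: str, routing_key: str) -> None:
        # A binding tells the exchange which queue wants which routing key.
        self.declare_queue(queue)
        self._bindings[routing_key].append(queue)

    def publish(self, routing_key: str, body: str) -> int:
        # Copy the message into every queue bound to this exact routing key.
        targets = self._bindings.get(routing_key, [])
        for queue in targets:
            self._queues[queue].append(body)
        return len(targets)

    def consume(self, queue: str) -> Optional[str]:
        # Return the oldest message in the queue, or None if it is empty.
        messages = self._queues[queue]
        return messages.popleft() if messages else None


exchange = DirectExchange()
exchange.bind("errors", "log.error")
exchange.bind("all_logs", "log.error")
exchange.bind("all_logs", "log.info")

delivered_error = exchange.publish("log.error", "disk full")
delivered_info = exchange.publish("log.info", "service started")
delivered_debug = exchange.publish("log.debug", "unrouted message")

print(delivered_error, delivered_info, delivered_debug)
print(exchange.consume("errors"))
```

The producer never names a queue directly; it only supplies a routing key, and the bindings determine where the message ends up.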

Kafka overview :

Reference :

https://www.youtube.com/watch?v=-V5Fe2Xycao
