Franz Kafka (3 July 1883 – 3 June 1924) was a German-language writer of novels and short stories who is widely regarded as one of the major figures of 20th-century literature.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.
Every byte of data has a story to tell. The faster and easier we move it around, the more we can focus on the core business. Data pipelines are the epicenter of data-driven companies, and Apache Kafka is becoming the heart of them.
What is Kafka?
Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable.
- Kafka maintains feeds of messages in categories called topics.
- Producers write data to topics.
- Consumers read from topics.
- Kafka runs as a cluster of one or more servers, each of which is called a broker.
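To make these roles concrete, here is a minimal in-memory sketch of a topic, a producer, and a consumer. It is a toy model only: `ToyBroker`, `ToyProducer`, and `ToyConsumer` are hypothetical names invented for illustration, not the Kafka client API, and nothing here is distributed or durable.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a single broker (hypothetical, not Kafka's API)."""

    def __init__(self):
        # Each topic is modeled as an append-only list of messages.
        self.topics = defaultdict(list)

    def append(self, topic, message):
        """Append a message to a topic and return its offset in that topic."""
        log = self.topics[topic]
        log.append(message)
        return len(log) - 1

    def read(self, topic, offset):
        """Return every message in the topic from the given offset onward."""
        return self.topics[topic][offset:]


class ToyProducer:
    """Writes messages to topics on a broker."""

    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        return self.broker.append(topic, message)


class ToyConsumer:
    """Reads one topic and remembers its own position in it."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0  # index of the next message to read

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages
```

Because each consumer tracks its own offset, two consumers can read the same topic independently without removing messages from it. This loosely mirrors the log-inspired design mentioned above.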
What is Kafka Connect?
The recent post from my co-founder Kostas Pardalis does a great job explaining it.
Kafka Connect was introduced recently as a feature of Apache Kafka 0.9+ with the narrow (although very important) scope of copying streaming data from and to a Kafka cluster.
Kafka Connect is about interacting with other data systems and moving data between them and a Kafka cluster. Many of the available connectors focus on systems that are managed by the owner of the Kafka cluster, e.g. RDBMS systems that hold transactional data, and try to turn these systems into a stream of data.
So if you are exploring Kafka, check his detailed post, which includes:
- A cool intro to Kafka Connect.
- How he implemented a connector for getting data out of Mixpanel’s API.
- The actual source code.
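The idea of turning an RDBMS table into a stream can be sketched in a few lines. The example below is not the Kafka Connect API (which is Java-based) and is not taken from the post mentioned above. It is a toy illustration that assumes a hypothetical `orders` table with an incrementing `id` column and uses a plain Python list in place of a topic.

```python
import sqlite3


def poll_new_rows(conn, last_id):
    """Return rows of the (hypothetical) orders table with id > last_id, oldest first."""
    cursor = conn.execute(
        "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    )
    return cursor.fetchall()


def copy_to_topic(conn, topic, last_id=0):
    """Append unseen rows to `topic` (a plain list here) and return the new last id."""
    for row_id, customer, amount in poll_new_rows(conn, last_id):
        topic.append({"id": row_id, "customer": customer, "amount": amount})
        last_id = row_id  # remember how far we have read
    return last_id
```

Calling `copy_to_topic` repeatedly with the returned id copies each row exactly once, which is the essence of an incrementing-column source: the table stays where it is, and new rows show up downstream as a stream of records.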
Our favorite email marketing monkey. Mailchimp is the heart of our marketing campaigns. But it generates data, and sometimes that data needs to sit alongside data from services like billing or social. How can you get it and then load it into a data warehouse like Amazon Redshift for further analysis?
Mailchimp helps businesses observe their subscribers’ activities, send automated emails to them based on their behavior and preferences, optimize and target the appropriate audience for each campaign using specific tools, and monitor sales and website activity with revenue reports. Companies can add content and collaborate on campaigns that fit their brand using MailChimp’s Email Designer; edit campaigns and collaborate with their teams using MailChimp Editor; send one-to-one messages using Mandrill; collect signups from their tablet using Chimpadeedoo; and access all the services from mobile devices using MailChimp Mobile.
About Mailchimp API
[update 18/2] Mailchimp updated its API to v3, for more information visit MailChimp API v3.0 documentation.
MailChimp has always promoted APIs and encouraged integration with other systems. It has a rich API that exposes a large number of endpoints for interacting with the application's resources.
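As a rough illustration of what talking to one of these endpoints looks like under API v3, the sketch below builds (but does not send) an authenticated request. It assumes the usual v3 conventions: the data center is the suffix of the API key after the last dash, and the key is passed as the password in HTTP Basic auth. The helper name `build_mailchimp_request` and the key used in the usage note are made up for this example.

```python
import base64
import urllib.request


def build_mailchimp_request(api_key, path):
    """Build (without sending) a GET request for a Mailchimp API v3 resource."""
    # Assumption: the key looks like "<secret>-<datacenter>", e.g. "...-us6".
    _, separator, datacenter = api_key.rpartition("-")
    if not separator or not datacenter:
        raise ValueError("API key has no data center suffix")
    url = f"https://{datacenter}.api.mailchimp.com/3.0/{path.lstrip('/')}"
    # HTTP Basic auth: the user name can be any string, the key is the password.
    token = base64.b64encode(f"anystring:{api_key}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
```

For example, `build_mailchimp_request("0123abc-us6", "/lists")` prepares a request for the lists resource, which could then be sent with `urllib.request.urlopen`.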
Although I love Mixpanel, there are cases where you would like to extract data from it and then load it into a data warehouse like Amazon Redshift for further analysis. This post is a small overview of Mixpanel’s API and how to access and extract data from it. The purpose of this guide is to help you define a process or pipeline.
For those that do not know Mixpanel, it helps you make your product better by measuring actions, instead of page views. Mixpanel gives you the ability to measure what people are doing in your app on iOS, Android, and web.
Extract data from Mixpanel
Mixpanel is an analytics-as-a-service application. We usually think of it as a place to see our data and not as a place to get data from. Why extract it, then? Because we may need to perform analysis that involves data from other sources.
Mixpanel collects data related to how your customers use your product. If you need to combine it with more sources, you can:
- Enrich Mixpanel with data coming from other sources.
- Extract the data Mixpanel holds for you and load it into a data warehouse. This is what we are going to review here.
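For the second option, one small step of such a pipeline can be sketched as follows. The example assumes the events have already been downloaded as newline-delimited JSON, with each line holding an `event` name and a `properties` object, which is the shape Mixpanel's raw export is generally described as returning. The function name `flatten_events` is hypothetical.

```python
import json


def flatten_events(lines):
    """Turn newline-delimited JSON events into flat dicts, one per warehouse row."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        event = json.loads(line)
        # Lift every property to a top-level column, then add the event name.
        row = dict(event.get("properties", {}))
        row["event"] = event["event"]
        rows.append(row)
    return rows
```

Flat rows like these are easier to map onto table columns before loading into a warehouse such as Amazon Redshift. Events with different property sets will produce rows with different keys, so a real pipeline would still need to settle on a schema.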