

The Changing Face of ETL | Confluent
Let’s build it!
So far, we’ve just looked at the theory behind a streaming platform and why one should consider it as a core piece of architecture in a data system. To learn more, you can read my InfoQ article Democratizing Stream Processing with Apache Kafka and KSQL, which builds on the concepts in this blog post and provides a tutorial for building out the e-commerce example described above. You may also like to check out the following resources:

Watch the online talk ETL is Dead; Long Live Streams
Get an introduction to KSQL
Download Confluent Platform and follow the quick start to begin using KSQL
Check out the KSQL video tutorials and KSQL hands-on tutorials
Follow KSQL recipes for additional tutorials and recommended deployment scenarios
Ask questions in the #ksql channel of our community Slack group
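As a rough, library-free illustration of the record-at-a-time model that streaming ETL builds on (the event fields and the `clickstream`/`filter_bots` helpers are invented for this sketch, not part of KSQL or Kafka):

```python
def clickstream():
    """Toy stand-in for a Kafka topic: an unbounded stream of events."""
    yield {"user": "alice", "page": "/pricing", "bot": False}
    yield {"user": "crawler-7", "page": "/", "bot": True}
    yield {"user": "bob", "page": "/signup", "bot": False}

def filter_bots(events):
    """Continuous, record-at-a-time transform -- the streaming analogue
    of a batch ETL filter-and-project step."""
    for event in events:
        if not event["bot"]:
            yield {"user": event["user"], "page": event["page"]}

cleaned = list(filter_bots(clickstream()))
print(cleaned)
```

KSQL lets you express the same kind of continuous filter as a SQL statement over a stream instead of hand-written consumer code.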
etl  kafka  streamprocessing 
yesterday by euler
Stitch provides a Calm data pipeline | Stitch
Industry Mobile Company Size 21-100 employees Calm promotes mindfulness primarily through a mobile app that provides guided and unguided meditation on things…
IFTTT  Instapaper  ETL 
9 days ago by broderboy
Solved: How to deal with duplicate columns in case of full... - Dojo
Solved: When joining two tables using "full outer joins", the result will have duplicate columns. For example if the column matching is
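One common fix is to coalesce the two join-key columns into a single output column instead of selecting both. A minimal sketch using Python's built-in sqlite3 (table and column names are invented for the example; older SQLite versions lack FULL OUTER JOIN, so it is emulated here with two LEFT JOINs and a UNION):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders  (customer_id INTEGER, amount INTEGER);
CREATE TABLE refunds (customer_id INTEGER, refund INTEGER);
INSERT INTO orders  VALUES (1, 100), (2, 200);
INSERT INTO refunds VALUES (2, 50), (3, 75);
""")

# A plain "SELECT *" would return customer_id twice (once per table).
# COALESCE merges the two key columns into one; the UNION of the two
# LEFT JOINs emulates a FULL OUTER JOIN and dedupes the matched rows.
rows = conn.execute("""
SELECT COALESCE(o.customer_id, r.customer_id) AS customer_id,
       o.amount, r.refund
FROM orders o LEFT JOIN refunds r ON o.customer_id = r.customer_id
UNION
SELECT COALESCE(o.customer_id, r.customer_id), o.amount, r.refund
FROM refunds r LEFT JOIN orders o ON o.customer_id = r.customer_id
ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 100, None), (2, 200, 50), (3, None, 75)]
```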
full-outer-join  full  outer  join  sql  etl  tableau 
12 days ago by sphere2k
Convert CSVs to ORC Faster
Every analytical database I've used converts imported data into a form that is quicker to read. Often this means storing data in column form instead of row form. The taxi trip dataset I benchmark with is around 100 GB when gzip-compressed in row form, but the five columns that are queried can be stored in around 3.5 GB of space in columnar form when compressed using a mixture of dictionary encoding, run-length encoding and Snappy compression.

The process of converting rows into columns is time-consuming and compute-intensive. Most systems can take the better part of an hour to finish this conversion, even when using a cluster of machines. I once believed that compression was causing most of the overhead, but in researching this post I found Spark 2.4.0 showed only a ~7% difference in conversion time between using Snappy, zlib, lzo and not using any compression at all.
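As a toy sketch of why the columnar form compresses so well, dictionary encoding plus run-length encoding can be shown in a few lines of plain Python (the payment-type values and the `dict_rle_encode` helper are invented for this illustration):

```python
from itertools import groupby

def dict_rle_encode(column):
    """Dictionary-encode a column's values, then run-length encode the
    integer codes. Low-cardinality columns (vendor IDs, payment types)
    collapse to a small dictionary plus a few (code, run_length) pairs."""
    dictionary = {}
    codes = []
    for value in column:
        # Assign each distinct value the next integer code on first sight.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    # Collapse consecutive repeats of the same code into (code, count).
    runs = [(code, sum(1 for _ in group)) for code, group in groupby(codes)]
    return list(dictionary), runs

values, runs = dict_rle_encode(["cash", "cash", "cash", "card", "card", "cash"])
print(values)  # ['cash', 'card']
print(runs)    # [(0, 3), (1, 2), (0, 1)]
```

Formats like ORC layer a general-purpose codec such as Snappy on top of these per-column encodings, which is consistent with compression choice mattering far less than the row-to-column restructuring itself.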
csv  etl  analytics 
13 days ago by euler
Built to Scale: Running Highly-Concurrent ETL with Apache Airflow (part 1) · Bostata | Boston Data Engineering
Apache Airflow is a highly capable, DAG-based scheduling tool that can do some pretty amazing things. Like any other complex system, it should be set up with care. The following is an overview of my thought process when attempting to minimize development and deployment friction.
etl  airflow 
15 days ago by Tafkas
What does your Python ETL pipeline look like? : Python
r/Python: news about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
15 days ago by gotdan
Apache NiFi
open source data flow software suite created by the NSA
distributed  data  etl  nifi  software  apache 
16 days ago by plaxx
Airflow: a workflow management platform – Airbnb Engineering & Data Science – Medium
Airbnb is a fast-growing, data-informed company. Our data teams and data volume are growing quickly, and accordingly, so is the complexity of the challenges we take on. Our growing workforce of…
etl  datascience  workflow 
17 days ago by sphere2k
