Pretty cool! We use Airflow heavily here at Instacart. Some of our teams use a managed service from Google for deployment and orchestration: https://cloud.google.com/composer/ For companies wanting a standard structure for DAGs while self-hosting their Airflow deployments, your tool would be super helpful to get started. One suggestion - it would be cool to add separate deployments for the different components of Airflow: webserver, workers, scheduler, etc. Reading through the readme, it looks like you deploy the single image to the Qubole Cloud? Oftentimes, deploying code to Airflow just means updating the DAG files in Airflow's file system.
Thanks for the feedback.
The main motivation behind building this tool was to make onboarding onto Apache Airflow easier. There was no standard structure for an Airflow project, and setting one up locally can be a nightmare sometimes. The simple CLI tool makes it very easy to create and test your project locally before deploying it to your production or staging environment via your CI/CD.
Right now we are using a docker-compose file which brings up all the Airflow services, but we are also currently working on providing a command to control individual processes.
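For readers unfamiliar with the setup: splitting the Airflow components into separate services in a compose file generally looks something like the sketch below. This is illustrative only (the image tag and commands are assumptions, and a real deployment also needs the metadata database, broker, and shared DAG volume wired in):

```yaml
# Illustrative sketch: one compose service per Airflow component,
# so each can be restarted or scaled independently.
version: "3"
services:
  webserver:
    image: apache/airflow     # assumed image; pin a real tag in practice
    command: webserver
    ports:
      - "8080:8080"
  scheduler:
    image: apache/airflow
    command: scheduler
  worker:
    image: apache/airflow
    command: worker           # `celery worker` on newer Airflow versions
```

With a layout like this, `docker-compose up scheduler` or `docker-compose restart worker` gives you per-component control without touching the rest of the stack.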
Qubole is not a cloud but a managed data platform. Deploying on Qubole just means putting all the DAG files on the machine (AWS/GCP/Azure) where Airflow is running. Qubole provides out-of-the-box solutions for running Airflow on your cloud at the click of a button. We offer a bunch of different things (Spark, Presto, notebooks, etc.) and have a great ecosystem built around Airflow.
What's the simpler solution, analogous to Airflow, that I can deploy on the same EC2 instance as my webserver? I prefer to use the simpler tool until it becomes unwieldy, and my current stack is just an Elastic Beanstalk deployment that runs the webserver as well as a Celery worker. This makes it easy to keep the site highly available (scale instances up and down easily), with managed RDS and a Redis broker taking care of state. A scheduled task runner is the missing piece - it turns out Celery is just not designed for long-running tasks, and both Celery and Airflow seem to require that only one instance of their scheduler run at any given time. I'd much prefer a tool where the service is robust enough to handle multiple redundant schedulers, so I can go home and sleep without having to bring in Kubernetes just to deal with this.
Currently I'm using some custom code to do this with a database table, but I'm totally aware of how fickle and easy to screw up this method can be, and I'm open to suggestions.
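For anyone curious what the database-table approach looks like, here's a minimal sketch of a lease-based lock, assuming a single `scheduler_lock` row that a scheduler instance must hold before doing work. All names here are hypothetical, and it uses SQLite for self-containedness; a production version would want a real database and stronger transaction guarantees (e.g. `SELECT ... FOR UPDATE` or Postgres advisory locks), since the expired-lease takeover below is racy without them:

```python
import sqlite3
import time

LEASE_SECONDS = 30  # holder must re-acquire before the lease expires

def init(conn):
    # Single-row table: id is constrained to 1 so there is at most one lease.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scheduler_lock (
            id INTEGER PRIMARY KEY CHECK (id = 1),
            holder TEXT NOT NULL,
            expires_at REAL NOT NULL
        )
    """)
    conn.commit()

def try_acquire(conn, me):
    """Return True if this instance now holds (or renews) the lease."""
    now = time.time()
    # Take over the row only if the lease expired or we already hold it.
    cur = conn.execute(
        "UPDATE scheduler_lock SET holder = ?, expires_at = ? "
        "WHERE expires_at < ? OR holder = ?",
        (me, now + LEASE_SECONDS, now, me),
    )
    if cur.rowcount == 0:
        # No row matched: either someone else holds a live lease,
        # or the row doesn't exist yet. Try to create it; the PRIMARY KEY
        # makes a concurrent insert by another instance fail harmlessly.
        try:
            conn.execute(
                "INSERT INTO scheduler_lock (id, holder, expires_at) "
                "VALUES (1, ?, ?)",
                (me, now + LEASE_SECONDS),
            )
        except sqlite3.IntegrityError:
            conn.rollback()
            return False
    conn.commit()
    return True
```

Each scheduler calls `try_acquire` in its loop and only schedules work when it gets `True`; a crashed holder simply stops renewing and another instance takes over after the lease expires. It's exactly as fickle as described above, which is why tools like ZooKeeper or Postgres advisory locks exist for this.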
'Airflow' as in ....?
It does ring a bell, but "airflow" gives a lot of hits on Google. It would be helpful to add a link in your readme to 'your' Airflow.
Apache Airflow :)
As well as commenting here, it would make sense to update your readme.
Thank you. It was updated.
https://astronomer.io offers a managed platform and installation into your cloud or data center.
Great product experience, fully open source. Gives the people using Airflow a sweet UI so they don't need to go to the CLI.
It depends on your preference, really. Most devs I know would prefer a CLI over a UI, plus CLIs can easily be used to automate workflows like deployment and pushing pipelines to production.
Yes, devs prefer a CLI, but most users of Airflow are not devs. Just because something is OSS does not mean it's geared towards devs.
Deploying Airflow workers took me loads of sweat to get used to - will definitely try this.