Harvest websites using raw HTTP requests, browser automation tools like Selenium, and scraping frameworks like Scrapy, as well as all kinds of web-based APIs (REST, SOAP, …)
Implement batch jobs to retrieve data via common web protocols like HTTP, FTP, …
Extract data from various source formats by implementing extraction heuristics
Work with widely varying document formats, ranging from poorly formed HTML and PDF documents to standard file formats like XML, JSON, and CSV, using the standard tools (like XPath) to process them
Store extracted data in SQL databases (mainly PostgreSQL) in a generic format
Integrate your code into the data pipeline that drives our whole data processing infrastructure.
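As a minimal sketch of the extract-and-store workflow described above (the feed contents, source label, and record schema here are hypothetical, and the PostgreSQL step is only noted in a comment to keep the example self-contained):

```python
import xml.etree.ElementTree as ET

def extract_records(xml_text):
    """Pull (key, value) rows out of a hypothetical XML feed using
    XPath-style paths (ElementTree supports a limited XPath subset)."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall(".//item"):
        rows.append({
            "source": "example-feed",       # hypothetical source label
            "key": item.findtext("name"),
            "value": item.findtext("price"),
        })
    return rows

SAMPLE = """<catalog>
  <item><name>widget</name><price>9.99</price></item>
  <item><name>gadget</name><price>4.50</price></item>
</catalog>"""

records = extract_records(SAMPLE)
# Storing the rows in PostgreSQL (e.g. via psycopg2) could then be a
# simple INSERT into a generic (source, key, value) table; for messy
# real-world HTML, a tolerant parser such as lxml.html would replace
# ElementTree here.
```

The generic key/value shape is one common way to keep the database schema stable while source formats vary.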
We are looking for:
Relevant software development experience (preferably in Python)
Basic working experience with Linux and willingness to deepen it
Ability to work on intricate details without losing the big picture.
Experience with Amazon Web Services or eagerness to learn it
Nice to have: experience with application containers (preferably Docker)
Experience with distributed version control systems (Git)
Understanding of Agile methodologies
Must be a self-learner, possessing inherent inquisitiveness
Good problem-solving and analytical skills
Strong interpersonal, communication, and organizational skills
Minimum of a Bachelor's degree in Computer Science or a related field, or equivalent
What We Offer
Be part of an international team distributed all over the globe
A relaxed work environment that values innovation, initiative, and energy
On a rainy day you can choose to work remotely, and most communication happens via video calls using Google Hangouts