Collect BeautifulSoup data into your data warehouse or ours. The Matatika pipelines will take care of the data collection and preparation for your analytics and BI tools.
Python library for pulling data out of HTML and XML files.
Attempt to download all pages recursively into the output directory prior to parsing files. Set this to False if you've previously run wget -r -A.html https://sdk.meltano.com/en/latest/
List of tags to exclude before extracting text content of the page.
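A minimal sketch of what tag exclusion looks like in BeautifulSoup terms (the HTML snippet and tag list here are illustrative, not taken from the tap's source): excluded tags are removed from the parse tree before the page text is extracted.

```python
from bs4 import BeautifulSoup

# Hypothetical page with script and navigation tags alongside real content
html = "<body><script>var x = 1;</script><nav>Menu</nav><p>Keep this text</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Tags listed for exclusion (e.g. ["script", "nav"]) are stripped
# from the tree before text extraction
for tag in soup(["script", "nav"]):
    tag.decompose()

print(soup.get_text(strip=True))  # Keep this text
```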
This dict contains all the kwargs that should be passed to the find_all call in order to extract text from the pages.
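To show how such a kwargs dict behaves, here is a small sketch (the example kwargs and HTML are assumptions for illustration): the dict is unpacked directly into BeautifulSoup's find_all, and the matched elements' text is what gets extracted.

```python
from bs4 import BeautifulSoup

html = "<body><nav>Menu</nav><article class='doc'>Page text</article></body>"
soup = BeautifulSoup(html, "html.parser")

# A kwargs dict like this is forwarded straight to soup.find_all(**kwargs),
# so any argument find_all accepts (name, attrs, string, ...) can be used
find_all_kwargs = {"name": "article", "attrs": {"class": "doc"}}
matches = soup.find_all(**find_all_kwargs)

text = " ".join(el.get_text(strip=True) for el in matches)
print(text)  # Page text
```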
'True' to enable schema flattening and automatically expand nested properties.
The max depth to flatten schemas.
The file path where the intermediate downloaded HTML files are written.
The BeautifulSoup parser to use.
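As a quick illustration of the parser choice (a sketch, not part of the tap itself): "html.parser" ships with the Python standard library, while "lxml" and "html5lib" are faster or more lenient alternatives that require extra installs. All are passed as the second argument to BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Common parser choices: "html.parser" (stdlib, no extra dependency),
# "lxml" (fast, requires the lxml package), "html5lib" (most lenient)
soup = BeautifulSoup("<p>Unclosed paragraph", "html.parser")
print(soup.p.get_text())  # Unclosed paragraph
```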
The site you'd like to scrape. The tap will download all pages recursively into the output directory prior to parsing files.
The name of the source you're scraping. This will be used as the stream name.
User-defined config values to be used within map expressions.
Config object for stream maps capability. For more information check out Stream Maps.
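The settings above might be wired together in a meltano.yml fragment like the following. This is an illustrative sketch only: the setting names and stream name are assumptions based on Meltano SDK conventions, not confirmed by this page, so check them against the tap's own documentation.

```yaml
# Illustrative meltano.yml fragment (names assumed, verify before use)
config:
  stream_maps:
    page_content:            # hypothetical stream name
      raw_html: __NULL__     # drop a column from the output
      source_upper: source.upper()   # derive a new column via an expression
  stream_map_config:
    # user-defined values that map expressions can reference
    my_prefix: docs
```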