
lightblue/architecture_faqs

Source: Hugging Face. Updated: 2024-10-03. Indexed: 2025-04-12.
Download link:
https://hf-mirror.com/datasets/lightblue/architecture_faqs
Resource description:
---
dataset_info:
  features:
  - name: question
    dtype: string
  - name: answer
    dtype: string
  splits:
  - name: train
    num_bytes: 130703
    num_examples: 250
  download_size: 54948
  dataset_size: 130703
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
task_categories:
- question-answering
language:
- ja
---

Japanese construction-themed FAQs scraped from [https://www.city.yokohama.lg.jp/business/bunyabetsu/kenchiku/annai/faq/qa.html](https://www.city.yokohama.lg.jp/business/bunyabetsu/kenchiku/annai/faq/qa.html).

Downloaded using the following code:

```python
import requests
from lxml import html
import pandas as pd
from datasets import Dataset

# Relative paths of the FAQ pages to scrape.
hrefs = [
    "/business/bunyabetsu/kenchiku/annai/faq/ji-annnai.html",
    "/business/bunyabetsu/kenchiku/tetsuduki/kakunin/qa-kakunin.html",
    "/business/bunyabetsu/kenchiku/tetsuduki/teikihoukoku/seido/01.html",
    "/business/bunyabetsu/kenchiku/tetsuduki/teikihoukoku/seido/07.html",
    # This path is listed twice, so its Q&A pairs are scraped twice.
    "/business/bunyabetsu/kenchiku/tetsuduki/doro/qa-doro.html",
    "/business/bunyabetsu/kenchiku/tetsuduki/doro/qa-doro.html",
    "/business/bunyabetsu/kenchiku/bosai/kyoai/jigyou/qanda.html",
    "/business/bunyabetsu/kenchiku/tetsuduki/kyoka/43.html",
    "/business/bunyabetsu/kenchiku/takuchi/toiawase/keikakuho/tokeihou.html",
    "/business/bunyabetsu/kenchiku/takuchi/toiawase/kiseiho/takuzo.html",
    "/business/bunyabetsu/kenchiku/takuchi/toiawase/keikakuho/q4-1.html",
    "/business/bunyabetsu/kenchiku/kankyo-shoene/casbee/hairyo/qa.html",
    "/business/bunyabetsu/kenchiku/tetsuduki/jorei/machizukuri/fukumachiqa.html",
    "/business/bunyabetsu/kenchiku/kankyo-shoene/chouki/qa-chouki.html",
    "/business/bunyabetsu/kenchiku/kankyo-shoene/huuti/qa-huuchi.html",
    "/kurashi/machizukuri-kankyo/kotsu/toshikotsu/chushajo/jorei/qa.html",
]

url_stem = "https://www.city.yokohama.lg.jp"


def get_question_text(url):
    # Send a GET request to the webpage
    response = requests.get(url)

    # Parse the HTML content
    tree = html.fromstring(response.content)

    question_data = []

    # Use XPath to find the desired elements
    for qa_element in tree.xpath('//div[@class="contents-area"]/section'):
        question_data.append({
            "question": qa_element.xpath('.//div[@class="question-text"]/text()')[0],
            "answer": "\n".join(qa_element.xpath('.//div[@class="answer-text"]/div/p/text()'))
        })

    return question_data


# Collect the Q&A pairs from every page into one list.
qa_list = []
for href in hrefs:
    print(href)
    qa_list.extend(get_question_text(url_stem + href))

df = pd.DataFrame(qa_list)

# If a space occurs in the first 7 characters, drop the first
# whitespace-delimited token (presumably a label such as a question number).
df.question = df.question.apply(lambda x: x[len(x.split()[0]):] if " " in x[:7] or " " in x[:7] else x)
df.answer = df.answer.apply(lambda x: x[len(x.split()[0]):] if " " in x[:7] or " " in x[:7] else x)

df.question = df.question.str.strip()
df.answer = df.answer.str.strip()

# Cut each text at its last "<", removing that character and everything after it.
df.question = df.question.apply(lambda x: x[:-len(x.split("<")[-1])-1] if "<" in x else x)
df.answer = df.answer.apply(lambda x: x[:-len(x.split("<")[-1])-1] if "<" in x else x)

df.question = df.question.str.strip()
df.answer = df.answer.str.strip()

# Upload the cleaned table to the Hugging Face Hub.
Dataset.from_pandas(df).push_to_hub("lightblue/architecture_faqs")
```
Provider:
lightblue