do-me/SemanticFinder

Name: do-me/SemanticFinder
Creator: do-me
Published: 2024-05-03 22:28:17
License: no description available

Hugging Face. Updated 2024-05-03. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/do-me/SemanticFinder

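Resource summary:

This dataset contains indexes of multiple text files; each index holds the original text, the text chunks, and their embedding vectors. The files cover literary works, reports, and regulations in several languages, such as Das Kapital, the Divina Commedia, and Don Quijote. They can be imported into SemanticFinder for semantic search.

Provided by: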

do-me

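Summary of original information

Dataset overview

The dataset consists of indexed text data prepared for the SemanticFinder application, including the original text, the text chunks, and their embeddings. Details of the files in the dataset are listed below:

Dataset catalogue

File size	Text title	Author	Year	Language	URL	Model name	Quantized	Split parameter	Split type	Characters	Chunks	Words to avoid (all)	Export decimals	Lines	Text notes	Text source URL	Filename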
100.96	Collection of 100 books	Various Authors	1890	en	https://do-me.github.io/SemanticFinder/?hf=Collection_of_100_books_dd80b04b	Xenova/bge-small-en-v1.5	True	100	Words	55705582	158957		2	1085035	US Public Domain Books (English)	https://huggingface.co/datasets/storytracer/US-PD-Books/tree/main/data	Collection_of_100_books_dd80b04b.json.gz
4.78	Das Kapital	Karl Marx	1867	de	https://do-me.github.io/SemanticFinder/?hf=Das_Kapital_c1a84fba	Xenova/multilingual-e5-small	True	80	Words	2003807	3164		5	28673		https://ia601605.us.archive.org/13/items/KarlMarxDasKapitalpdf/KAPITAL1.pdf	Das_Kapital_c1a84fba.json.gz
2.58	Divina Commedia	Dante	1321	it	https://do-me.github.io/SemanticFinder/?hf=Divina_Commedia_d5a0fa67	Xenova/multilingual-e5-base	True	50	Words	383782	1179		5	6225		http://www.letteratura-italiana.com/pdf/divina%20commedia/08%20Inferno%20in%20versione%20italiana.pdf	Divina_Commedia_d5a0fa67.json.gz
11.92	Don Quijote	Miguel de Cervantes	1605	es	https://do-me.github.io/SemanticFinder/?hf=Don_Quijote_14a0b44	Xenova/multilingual-e5-base	True	25	Words	1047150	7186		4	12005		https://parnaseo.uv.es/lemir/revista/revista19/textos/quijote_1.pdf	Don_Quijote_14a0b44.json.gz
0.06	Hansel and Gretel	Brothers Grimm	1812	en	https://do-me.github.io/SemanticFinder/?hf=Hansel_and_Gretel_4de079eb	TaylorAI/gte-tiny	True	100	Chars	5304	55		5	9		https://www.grimmstories.com/en/grimm_fairy-tales/hansel_and_gretel	Hansel_and_Gretel_4de079eb.json.gz
13.52	Iliad	Homer	-750	gr	https://do-me.github.io/SemanticFinder/?hf=Iliad_8de5d1ea	Xenova/multilingual-e5-small	True	20	Words	1597139	11848		5	32659	Including modern interpretation	https://www.stipsi.gr/homer/iliada.pdf	Iliad_8de5d1ea.json.gz
1.74	IPCC Report 2023	IPCC	2023	en	https://do-me.github.io/SemanticFinder/?hf=IPCC_Report_2023_2b260928	Supabase/bge-small-en	True	200	Chars	307811	1566		5	3230	state of knowledge of climate change	https://report.ipcc.ch/ar6syr/pdf/IPCC_AR6_SYR_LongerReport.pdf	IPCC_Report_2023_2b260928.json.gz
25.56	King James Bible		None	en	https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_24f6dc4c	TaylorAI/gte-tiny	True	200	Chars	4556163	23056		5	80496		https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf	King_James_Bible_24f6dc4c.json.gz
11.45	King James Bible		None	en	https://do-me.github.io/SemanticFinder/?hf=King_James_Bible_6434a78d	TaylorAI/gte-tiny	True	200	Chars	4556163	23056		2	80496		https://www.holybooks.com/wp-content/uploads/2010/05/The-Holy-Bible-King-James-Version.pdf	King_James_Bible_6434a78d.json.gz
39.32	Les Misérables	Victor Hugo	1862	fr	https://do-me.github.io/SemanticFinder/?hf=Les_Misérables_2239df51	Xenova/multilingual-e5-base	True	25	Words	3236941	19463		5	74491	All five acts included	https://beq.ebooksgratuits.com/vents/Hugo-miserables-1.pdf	Les_Misérables_2239df51.json.gz
8.67	List of the Most Common English Words	Dolph	2012	en	https://do-me.github.io/SemanticFinder/?hf=List_of_the_Most_Common_English_Words_0d1e28dc	Xenova/bge-small-en-v1.5	True	\n	Regex	210518	25322		2	25323	GitHub Repo	https://raw.githubusercontent.com/dolph/dictionary/master/popular.txt	List_of_the_Most_Common_English_Words_0d1e28dc.json.gz
15.61	List of the Most Common English Words	Dolph	2012	en	https://do-me.github.io/SemanticFinder/?hf=List_of_the_Most_Common_English_Words_70320cde	Xenova/multilingual-e5-base	True	\n	Regex	210518	25322		2	25323	GitHub Repo	https://raw.githubusercontent.com/dolph/dictionary/master/popular.txt	List_of_the_Most_Common_English_Words_70320cde.json.gz
0.46	REGULATION (EU) 2023/138	European Commission	2022	en	https://do-me.github.io/SemanticFinder/?hf=REGULATION_(EU)_2023_138_c00e7ff6	Supabase/bge-small-en	True	25	Words	76809	424		5	1323		https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:32023R0138&qid=1704492501351	REGULATION_(EU)_2023_138_c00e7ff6.json.gz
0.07	Universal Declaration of Human Rights	United Nations	1948	en	https://do-me.github.io/SemanticFinder/?hf=Universal_Declaration_of_Human_Rights_0a7da79a	TaylorAI/gte-tiny	True	\nArticle	Regex	8623	63		5	109	30 articles	https://www.un.org/en/about-us/universal-declaration-of-human-rights	Universal_Declaration_of_Human_Rights_0a7da79a.json.gz
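
The "Split parameter" and "Split type" columns describe how each text was cut into chunks before embedding. The following is a minimal sketch of the three split types that appear in the catalogue, assuming that "Words" means fixed-size groups of whitespace-separated words, "Chars" means fixed-width character windows, and "Regex" means splitting on a pattern. The function name `split_text` is hypothetical, and the actual chunking logic in SemanticFinder may differ in details such as overlap or whitespace handling.

```python
import re


def split_text(text: str, split_type: str, split_param) -> list[str]:
    """Cut `text` into chunks according to a catalogue-style split setting."""
    if split_type == "Words":
        # Group every `split_param` whitespace-separated words into one chunk.
        words = text.split()
        size = int(split_param)
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    if split_type == "Chars":
        # Fixed-width character windows; the last window may be shorter.
        size = int(split_param)
        return [text[i:i + size] for i in range(0, len(text), size)]
    if split_type == "Regex":
        # Split wherever the pattern matches and drop empty pieces.
        return [part for part in re.split(str(split_param), text) if part.strip()]
    raise ValueError(f"unknown split type: {split_type!r}")


sample = "one two three four five"
word_chunks = split_text(sample, "Words", 2)
char_chunks = split_text(sample, "Chars", 10)
line_chunks = split_text("alpha\nbeta\ngamma", "Regex", r"\n")
```

Smaller split parameters produce more chunks and therefore more embeddings, which is consistent with the catalogue: the Iliad at 20 words per chunk has far more chunks than Das Kapital at 80 words per chunk despite having fewer characters.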

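Dataset example

Once an index is loaded in SemanticFinder, searching the entire Bible takes about 2 seconds. To try it:

1. Click one of the example URLs of your choice.
2. Once the index has loaded, enter what you want to search for and click "Find". The results appear almost instantly.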
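
Search over a loaded index is fast because the chunk embeddings are already computed; only the query has to be embedded and compared against them. The sketch below illustrates that ranking step, assuming cosine similarity as the metric. The toy 3-dimensional vectors stand in for real model output, and the names `cosine_similarity` and `search` are hypothetical, not SemanticFinder's own API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec, chunks, embeddings, top_k=3):
    """Return the `top_k` (score, chunk) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, emb), chunk)
        for chunk, emb in zip(chunks, embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors; a real index would hold one vector per chunk from the model.
chunks = ["In the beginning", "the waters", "let there be light"]
embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.7, 0.0, 0.7]]
query_vec = [1.0, 0.0, 0.0]

results = search(query_vec, chunks, embeddings, top_k=2)
```

With roughly 23k precomputed vectors, as in the Bible index, this step is a single pass over the embeddings, so no model inference is needed for the indexed text at query time.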

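Creating SemanticFinder files

1. Use SemanticFinder as usual and run at least one search so that the index is created. This can take a while for large inputs. For example, indexing the Bible in 200-character chunks yields about 23k embeddings and takes 15-30 minutes with the quantized gte-tiny model.
2. Add the metadata (so that others can find your index) and export the file. You are free to reduce the number of decimal places to shrink the file; 3 is usually enough, but some models may need more.
3. If you would like your index added to the official collection, create a PR here! Just make sure to run create_meta_data_csv_md.py once to update the csv/md files. For now, the readme.md table has to be updated manually from meta_data.md.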
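
The effect of the decimal-places setting can be seen in the catalogue: the two King James Bible entries differ only in export decimals (5 versus 2) and are listed at 25.56 and 11.45 respectively. The sketch below reproduces the idea on random data by rounding embeddings before writing gzip-compressed JSON. The `{"chunks": ..., "embeddings": ...}` layout and the function names are assumptions made for illustration; the real `.json.gz` files may be structured differently.

```python
import gzip
import json
import random


def round_embeddings(embeddings, decimals):
    """Round every vector component to the given number of decimal places."""
    return [[round(value, decimals) for value in vec] for vec in embeddings]


def export_index(chunks, embeddings, decimals):
    """Serialise an index to gzip-compressed JSON bytes (hypothetical layout)."""
    payload = {
        "chunks": chunks,
        "embeddings": round_embeddings(embeddings, decimals),
    }
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def load_index(blob):
    """Inverse of export_index: decompress and parse the JSON payload."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Random stand-in data: 50 chunks with 32-dimensional vectors.
random.seed(0)
chunks = [f"chunk {i}" for i in range(50)]
embeddings = [[random.uniform(-1, 1) for _ in range(32)] for _ in chunks]

size_5 = len(export_index(chunks, embeddings, decimals=5))
size_2 = len(export_index(chunks, embeddings, decimals=2))
restored = load_index(export_index(chunks, embeddings, decimals=2))
```

Fewer decimals trade file size against ranking precision, so it is worth checking that search results stay stable for the chosen model before publishing a heavily rounded index.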

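Privacy

This repository is public and shares documents of public interest or documents in the public domain.
If you have sensitive documents, you can still create the index with SemanticFinder and use it locally. Either load the index from disk each time, or host it on your local network and add its URL in SemanticFinder.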

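Use cases

Standard use case

Search any text for the most similar words, sentences, paragraphs, or pages. Imagine a CTRL+F that finds related words, not only the exact word you typed! If you work with the same text repeatedly, you can save the index and reuse it.

You can also summarize the results with generative AI in the browser (for example a Qwen model), or connect a heavier Llama2 instance running on Ollama.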

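Advanced use cases

Translate words with multilingual embeddings, or see which words from a given list are most similar to your input word. With an index of roughly 30k English words you can query in more than 100 input languages! Note that the expert settings are changed here so that only the first match is displayed.
English synonym finder, again using the index of roughly 30k English words but with slightly better (and smaller) English-only embeddings. The expert settings are the same here.
Universal index idea: use the 30k English word index and do not run inference on any new words. This lets you run instant semantic search on unknown, unseen, or unindexed text! Use this URL, then copy and paste any text of your choice into the text field. Inference for new words is turned off for speed.
Hybrid version of the universal index: use the 30k English words as the starting index, then "fill up" the index with all additional words it does not know yet. For this option, simply use this URL, where inference is turned on again. This yields the best results and is probably a good compromise, assuming new texts usually do not contain that many new words. Even with a few hundred (as in a domain-specific research paper), inference is quite fast.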
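
The difference between the universal index and its hybrid version can be sketched as a lookup with an optional fallback. This is one possible reading of the description above, not SemanticFinder's actual implementation: words already in the index reuse their stored vectors, and unknown words are either skipped (universal, inference off) or embedded on demand and added to the index (hybrid, inference on). The function `rank_text_words`, the `embed` callback, and the toy 2-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_text_words(query_vec, text, index, embed=None):
    """Rank the distinct words of `text` by similarity to `query_vec`.

    index: dict mapping word -> precomputed vector.
    embed: optional callable used for words missing from the index.
           None means "no inference": unknown words are skipped.
    """
    scored = {}
    for word in text.lower().split():
        if word in scored:
            continue
        vec = index.get(word)
        if vec is None:
            if embed is None:
                continue  # universal mode: skip words the index does not know
            vec = embed(word)  # hybrid mode: run inference once ...
            index[word] = vec  # ... and fill up the index for later reuse
        scored[word] = cosine_similarity(query_vec, vec)
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Toy index and query; a real index would hold about 30k word vectors.
base_index = {"river": [0.9, 0.1], "bank": [0.6, 0.6], "money": [0.1, 0.9]}
query_vec = [1.0, 0.0]


def toy_embed(word):
    # Stand-in for a real embedding model call.
    return [0.8, 0.2]


text = "river bank estuary money"
universal = rank_text_words(query_vec, text, dict(base_index))
hybrid = rank_text_words(query_vec, text, dict(base_index), embed=toy_embed)
```

In universal mode "estuary" is ignored because it is not in the index; in hybrid mode it is embedded once and then ranked alongside the known words, which is why the hybrid variant costs only as much inference as there are new words.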
