BEE-spoke-data/awesome-python-apps

Source: Hugging Face. Updated 2026-03-27; indexed 2026-04-05.
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/awesome-python-apps
Resource overview:
---
language:
- en
license: odc-by
size_categories:
- 10K<n<100K
task_categories:
- text-generation
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
  - split: validation
    path: all/validation-*
  - split: test
    path: all/test-*
- config_name: c_and_cpp
  data_files:
  - split: train
    path: c_and_cpp/train-*
  - split: validation
    path: c_and_cpp/validation-*
  - split: test
    path: c_and_cpp/test-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
- config_name: docs_and_configs
  data_files:
  - split: train
    path: docs_and_configs/train-*
  - split: validation
    path: docs_and_configs/validation-*
  - split: test
    path: docs_and_configs/test-*
- config_name: javascript
  data_files:
  - split: train
    path: javascript/train-*
  - split: validation
    path: javascript/validation-*
  - split: test
    path: javascript/test-*
dataset_info:
- config_name: all
  features:
  - name: section
    dtype: string
  - name: filename
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 498751093
    num_examples: 27498
  - name: validation
    num_bytes: 26638283
    num_examples: 1527
  - name: test
    num_bytes: 25999454
    num_examples: 1530
  download_size: 181559230
  dataset_size: 551388830
- config_name: c_and_cpp
  features:
  - name: section
    dtype: string
  - name: filename
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 77766288.48753309
    num_examples: 8157
  - name: validation
    num_bytes: 4318760.412511033
    num_examples: 453
  - name: test
    num_bytes: 4328294.09995587
    num_examples: 454
  download_size: 29920612
  dataset_size: 86413343
- config_name: default
  features:
  - name: section
    dtype: string
  - name: filename
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 187113233.13447574
    num_examples: 12394
  - name: validation
    num_bytes: 10401889.432762126
    num_examples: 689
  - name: test
    num_bytes: 10401889.432762126
    num_examples: 689
  download_size: 74676703
  dataset_size: 207917011.99999997
- config_name: docs_and_configs
  features:
  - name: section
    dtype: string
  - name: filename
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 27004737.076526225
    num_examples: 3140
  - name: validation
    num_bytes: 1496440.8443680138
    num_examples: 174
  - name: test
    num_bytes: 1505041.079105761
    num_examples: 175
  download_size: 14090699
  dataset_size: 30006219
- config_name: javascript
  features:
  - name: section
    dtype: string
  - name: filename
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 204347030.4
    num_examples: 3807
  - name: validation
    num_bytes: 11325774.471867612
    num_examples: 211
  - name: test
    num_bytes: 11379451.128132388
    num_examples: 212
  download_size: 60838679
  dataset_size: 227052256
tags:
- code
- python
- cpp
- javascript
---

# Dataset Card for "awesome-python-apps"

This contains `.py` files for the following repos taken from `awesome-python-applications` (on GitHub [here](https://github.com/mahmoud/awesome-python-applications)):

<pre>
abilian-sbe                  clone_repos.sh   invesalius3   photonix       sk1-wx
ambar                        CONTRIBUTING.md  isso          picard         soundconverter
apatite                      CTFd             kibitzr       pi-hole        soundgrain
ArchiveBox                   Cura             KindleEar     planet         stargate
archivematica                deluge           Lector        plover         streamlink
autokey                      dissemin         lucaschess    pol            Sunflower
awesome-python-applications  drawbot          mackup        posthog        supysonic
babybuddy                    exaile           meshroom      projects.yaml  syncserver
beancount                    flaskbb          mopidy        puddletag      templates
beets                        flowblade        music-player  PyBitmessage   term2048
bleachbit                    fofix            mylar         Pyfa           thumbor
bookwyrm                     formspree        mypaint       pyload         TODO.md
borg                         FreeCAD          neubot        pynocchio      tribler
buku                         frescobaldi      newspipe      pyvideo        vidcutter
Bup                          friture          nfoview       qis            whipper
BY_PLATFORM.md               gaphor           notebooks     quodlibet      you-get
Cactus                       -github          nyaa          qutebrowser    youtube-dl
canto-curses                 gnuradio         ocropy        README.md      ZeroNet
canto-next                   gpodder          OctoPrint     reddit
cartoonify                   hangups          oncall        revisit.yaml
CDDA-Game-Launcher           hosts            openshot-qt   sabnzbd
ckan                         httpie           PhotoCollage  searx
</pre>

## details

There are different configs for different languages and one for 'documentation' type stuff.
All of them should have the same column names, so if you want to use all of them for training it should be easy to aggregate them: just use the `concatenate_datasets` function and shuffle.

### by config

#### python (default)

- all of them have been formatted with `ruff`

counts:

<pre>
(ki) ➜ primerdata-for-LLMs python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>>
>>> # If the dataset is gated/private, make sure you have run huggingface-cli login
>>> dataset = load_dataset("BEE-spoke-data/awesome-python-apps")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['section', 'filename', 'text'],
        num_rows: 12394
    })
    validation: Dataset({
        features: ['section', 'filename', 'text'],
        num_rows: 689
    })
    test: Dataset({
        features: ['section', 'filename', 'text'],
        num_rows: 689
    })
})
</pre>

##### token counts

Llama (`use_fast=False`):

<pre>
           token_len
count        12394.0
mean     7863.521704
std    266330.944961
min            254.0
25%            727.0
50%           1417.0
75%           3083.5
max       27741612.0
</pre>

NeoX:

<pre>
           token_len
count        12394.0
mean     6615.093432
std    213090.741642
min            234.0
25%            677.0
50%           1326.0
75%          2883.75
max       22081563.0
</pre>

#### javascript

<pre>
INFO:__main__:Looking for files with extensions: ['js', 'ts', 'tsx']
Processing js files: 100%|██████████████████████████| 1441/1441 [00:00&lt;00:00, 2232.28it/s]
Processing ts files: 100%|██████████████████████████| 1717/1717 [00:01&lt;00:00, 1343.75it/s]
Processing tsx files: 100%|█████████████████████████| 1072/1072 [00:00&lt;00:00, 3929.08it/s]
INFO:__main__:Found 4230 text files.
INFO:__main__:Performing train-test split...
INFO:__main__:Performing validation-test split...
INFO:__main__:Train size: 3807
INFO:__main__:Validation size: 211
INFO:__main__:Test size: 212
INFO:__main__:Pushing dataset to Huggingface hub (BEE-spoke-data/awesome-python-apps)...
INFO:__main__:Using repo id: BEE-spoke-data/awesome-python-apps, config name: javascript
</pre>

#### c_and_cpp

<pre>
INFO:__main__:Looking for files with extensions: ['c', 'h', 'i', 'cc', 'cpp', 'cxx', 'hpp', 'hxx', 'inl', 'ino', 'ixx', 'jxx', 'lxx', 'yxx']
Processing c files: 100%|█████████████████████████████| 450/450 [00:00&lt;00:00, 2741.92it/s]
Processing h files: 100%|███████████████████████████| 3616/3616 [00:00&lt;00:00, 4250.46it/s]
Processing i files: 100%|██████████████████████████████████| 1/1 [00:00&lt;00:00, 197.47it/s]
Processing cc files: 100%|██████████████████████████| 1386/1386 [00:00&lt;00:00, 4209.33it/s]
Processing cpp files: 100%|█████████████████████████| 3015/3015 [00:00&lt;00:00, 3254.90it/s]
Processing cxx files: 100%|█████████████████████████████| 10/10 [00:00&lt;00:00, 1203.77it/s]
Processing hpp files: 100%|███████████████████████████| 268/268 [00:00&lt;00:00, 3523.76it/s]
Processing hxx files: 100%|███████████████████████████| 257/257 [00:00&lt;00:00, 3406.82it/s]
Processing inl files: 100%|█████████████████████████████| 56/56 [00:00&lt;00:00, 2770.02it/s]
Processing ino files: 100%|████████████████████████████████| 1/1 [00:00&lt;00:00, 195.83it/s]
Processing ixx files: 100%|████████████████████████████████| 1/1 [00:00&lt;00:00, 200.43it/s]
Processing jxx files: 100%|████████████████████████████████| 1/1 [00:00&lt;00:00, 187.72it/s]
Processing lxx files: 100%|████████████████████████████████| 1/1 [00:00&lt;00:00, 185.00it/s]
Processing yxx files: 100%|████████████████████████████████| 1/1 [00:00&lt;00:00, 192.69it/s]
INFO:__main__:Found 9064 text files.
INFO:__main__:Performing train-test split...
INFO:__main__:Performing validation-test split...
INFO:__main__:Train size: 8157
INFO:__main__:Validation size: 453
INFO:__main__:Test size: 454
INFO:__main__:Pushing dataset to Huggingface hub (BEE-spoke-data/awesome-python-apps)...
INFO:__main__:Using repo id: BEE-spoke-data/awesome-python-apps, config name: c_and_cpp
</pre>

#### docs and configs

<pre>
INFO:__main__:Looking for files with extensions: ['md', 'mdx', 'rst', 'txt', 'env', 'yml', 'yaml', 'toml', 'cfg', 'conf', 'config', 'gitignore', 'MD', 'mkd']
Processing md files: 100%|████████████████████████████| 741/741 [00:00&lt;00:00, 3664.75it/s]
Processing mdx files: 100%|███████████████████████████████| 8/8 [00:00&lt;00:00, 1168.29it/s]
Processing rst files: 100%|███████████████████████████| 699/699 [00:00&lt;00:00, 3901.65it/s]
Processing txt files: 100%|███████████████████████████| 646/646 [00:00&lt;00:00, 3348.51it/s]
Processing env files: 100%|████████████████████████████████| 1/1 [00:00&lt;00:00, 188.49it/s]
Processing yml files: 100%|███████████████████████████| 650/650 [00:00&lt;00:00, 4145.67it/s]
Processing yaml files: 100%|████████████████████████████| 62/62 [00:00&lt;00:00, 2146.06it/s]
Processing toml files: 100%|████████████████████████████| 13/13 [00:00&lt;00:00, 1934.71it/s]
Processing cfg files: 100%|███████████████████████████| 618/618 [00:00&lt;00:00, 3866.66it/s]
Processing conf files: 100%|████████████████████████████| 21/21 [00:00&lt;00:00, 2169.95it/s]
Processing config files: 100%|██████████████████████████| 20/20 [00:00&lt;00:00, 2146.14it/s]
Processing gitignore files: 100%|██████████████████████████| 6/6 [00:00&lt;00:00, 890.83it/s]
Processing MD files: 100%|█████████████████████████████████| 1/1 [00:00&lt;00:00, 197.70it/s]
Processing mkd files: 100%|███████████████████████████████| 3/3 [00:00&lt;00:00, 3669.56it/s]
INFO:__main__:Found 3489 text files.
INFO:__main__:Performing train-test split...
INFO:__main__:Performing validation-test split...
INFO:__main__:Train size: 3140
INFO:__main__:Validation size: 174
INFO:__main__:Test size: 175
INFO:__main__:Pushing dataset to Huggingface hub (BEE-spoke-data/awesome-python-apps)...
INFO:__main__:Using repo id: BEE-spoke-data/awesome-python-apps, config name: docs_and_configs
</pre>
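
The split sizes in the logs above are consistent with a two-stage split: hold out 10% of the files, then divide that held-out part in half for validation and test. The card does not state these fractions or the rounding rule, so both are assumptions inferred from the logged numbers; the sketch below only checks the arithmetic.

```python
import math


def split_sizes(n_files: int, holdout_pct: int = 10, test_pct: int = 50):
    """Return (train, validation, test) sizes for a two-stage split.

    Assumption: each stage rounds its held-out part up. The percentages are
    inferred from the logged sizes, not documented by the dataset card.
    """
    holdout = math.ceil(n_files * holdout_pct / 100)
    test = math.ceil(holdout * test_pct / 100)
    return n_files - holdout, holdout - test, test


# File totals and split sizes as reported in the processing logs.
reported = {
    "javascript": (4230, (3807, 211, 212)),
    "c_and_cpp": (9064, (8157, 453, 454)),
    "docs_and_configs": (3489, (3140, 174, 175)),
}
for name, (total, sizes) in reported.items():
    print(name, split_sizes(total) == sizes)  # prints True for each config

# Per-config (train, validation, test) row counts from the card metadata.
per_config = {
    "default": (12394, 689, 689),
    "javascript": (3807, 211, 212),
    "c_and_cpp": (8157, 453, 454),
    "docs_and_configs": (3140, 174, 175),
}
all_sizes = tuple(sum(column) for column in zip(*per_config.values()))
print(all_sizes)  # (27498, 1527, 1530), matching the `all` config
```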
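
The token-count tables above have the shape of a per-file length summary (count, mean, std, quartiles, max). A minimal standard-library sketch of that kind of summary follows; it is not the script that produced the tables. The default whitespace tokenizer is a stand-in assumption, and any callable that maps a string to a sequence of tokens could be passed instead.

```python
import statistics
from typing import Callable, Iterable, Sequence


def token_length_summary(
    texts: Iterable[str],
    tokenize: Callable[[str], Sequence] = str.split,
) -> dict[str, float]:
    """Summarise per-document token counts (needs at least two documents).

    Assumption: `tokenize` defaults to whitespace splitting purely for
    illustration; it is not the Llama or NeoX tokenizer used in the card.
    """
    lengths = sorted(len(tokenize(text)) for text in texts)
    # "inclusive" quantiles interpolate linearly between data points.
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "count": float(len(lengths)),
        "mean": statistics.fmean(lengths),
        "std": statistics.stdev(lengths),  # sample standard deviation
        "min": float(lengths[0]),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": float(lengths[-1]),
    }


summary = token_length_summary(["a", "a b", "a b c", "a b c d"])
print(summary["mean"], summary["50%"])  # 2.5 2.5
```

In both tables the maximum is several orders of magnitude above the 75th percentile, so the mean is dominated by a few very large files; the median is likely the more useful figure when choosing a context length.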
Provider:
BEE-spoke-data