Package Downloads Dataset
Dataset Overview
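Dataset Description
This dataset contains a large volume of records describing software package downloads from various sources. It is meant to be processed and analyzed with Apache Hadoop MapReduce to answer several key questions about download statistics.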
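Dataset Columns
- date: download date
- time: download time (UTC)
- size: size of the downloaded package, in bytes
- r_version: version of R used for the download
- r_arch: processor architecture (i386 = 32-bit, x86_64 = 64-bit)
- r_os: operating system (darwin9.8.0 = macOS, mingw32 = Windows)
- package: name of the downloaded package
- country: two-letter ISO country code
- ip_id: unique daily identifier for each IP address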
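The column list above can be turned into a small row parser. This is a minimal sketch, assuming the input is comma-separated, that fields appear in the order listed, and that a header row repeats the column names; the function name `parse_rows` and the sample rows are made up for illustration.

```python
import csv
from typing import Dict, Iterable, Iterator

# Column order as listed above; assumed to match the physical order in the file.
COLUMNS = ["date", "time", "size", "r_version", "r_arch",
           "r_os", "package", "country", "ip_id"]


def parse_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, skipping the header and malformed rows."""
    for fields in csv.reader(lines):
        if len(fields) != len(COLUMNS):
            continue  # malformed or truncated line
        if fields == COLUMNS:
            continue  # header row (assumed to repeat the column names)
        yield dict(zip(COLUMNS, fields))


# Illustrative rows only; these values are not taken from the dataset.
sample = [
    '"date","time","size","r_version","r_arch","r_os","package","country","ip_id"',
    '"2023-01-01","00:00:05",4087261,"4.2.2","x86_64","mingw32","ggplot2","IE",1',
    "broken,line",
]
rows = list(parse_rows(sample))
print(rows[0]["package"], rows[0]["country"])
```

All values come back as strings, so a numeric column such as size would still need an explicit `int(...)` conversion before any arithmetic.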
Objectives and MapReduce Tasks
Task 1: Number of downloads of the ggplot2 package
- Command: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/bdm/assignment/input -output /user/bdm/assignment/output -file /home/bdm/assignment/mapper.py -file /home/bdm/assignment/reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py"
- Output: number of downloads of the ggplot2 package: 22,360,632
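The counting job for a single package can be sketched as a mapper that emits one tab-separated pair per matching row and a reducer that sums them. The original mapper.py and reducer.py are not shown, so this is an illustrative sketch rather than the author's code; the function names and the `PACKAGE_INDEX` constant are assumptions based on the column order listed earlier.

```python
import csv
from typing import Iterable, Iterator

PACKAGE_INDEX = 6  # assumed position of `package` in each row


def map_lines(lines: Iterable[str], target: str = "ggplot2") -> Iterator[str]:
    """Mapper: emit 'target<TAB>1' for each row whose package equals target."""
    for fields in csv.reader(lines):
        if len(fields) > PACKAGE_INDEX and fields[PACKAGE_INDEX] == target:
            yield f"{target}\t1"


def reduce_lines(lines: Iterable[str]) -> int:
    """Reducer: sum the counts of all 'key<TAB>count' lines."""
    total = 0
    for line in lines:
        _key, _sep, count = line.rstrip("\n").partition("\t")
        if count.isdigit():
            total += int(count)
    return total


# Illustrative rows only; these values are not taken from the dataset.
sample = [
    '"2023-01-01","00:00:05",100,"4.2.2","x86_64","mingw32","ggplot2","IE",1',
    '"2023-01-01","00:00:06",200,"4.2.2","x86_64","linux-gnu","dplyr","US",2',
    '"2023-01-01","00:00:07",100,"4.1.0","x86_64","darwin17.0","ggplot2","GB",3',
]
mapped = list(map_lines(sample))
print(reduce_lines(mapped))
```

In a streaming job, each function would live in its own script and be fed `sys.stdin`, with the mapper printing its pairs and the reducer printing the final total.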
Task 2: Country with the most downloads
- Command: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/bdm/assignment/input -output /user/bdm/assignment/output2 -file /home/bdm/assignment/mapper.py -file /home/bdm/assignment/reducer2.py -mapper "python3 mapper.py" -reducer "python3 reducer2.py"
- Output: country with the most downloads: "NA", downloads: 3,225,550
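A reducer that picks the single most frequent key can rely on the fact that Hadoop Streaming delivers reducer input sorted by key. The sketch below is not the original reducer2.py; the names `split_pair` and `top_key` are hypothetical. The optional `skip` set is included on the assumption that "NA" may mark a missing value rather than a real country code, in which case a job might want to exclude it.

```python
from itertools import groupby
from typing import FrozenSet, Iterable, Optional, Tuple


def split_pair(line: str) -> Tuple[str, int]:
    """Split a 'key<TAB>count' line into its key and integer count."""
    key, _sep, count = line.rstrip("\n").partition("\t")
    return key, int(count)


def top_key(sorted_lines: Iterable[str],
            skip: FrozenSet[str] = frozenset()) -> Optional[Tuple[str, int]]:
    """Return (key, total) for the key with the largest total, or None if empty.

    Input must be sorted by key, as it is when a streaming reducer reads it.
    """
    best: Optional[Tuple[str, int]] = None
    pairs = (split_pair(line) for line in sorted_lines if "\t" in line)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(count for _key, count in group)
        if key in skip:
            continue
        if best is None or total > best[1]:
            best = (key, total)
    return best


# Illustrative, already sorted mapper output.
lines = ["IE\t1", "NA\t1", "NA\t1", "NA\t1", "US\t1", "US\t1"]
print(top_key(lines))
print(top_key(lines, skip=frozenset({"NA"})))
```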
Task 3: Top 10 most popular packages
- Command: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/bdm/assignment/input -output /user/bdm/assignment/output3 -file /home/bdm/assignment/mapper.py -file /home/bdm/assignment/reducer3.py -mapper "python3 mapper.py" -reducer "python3 reducer3.py"
- Output:
- "NA": 3,225,550 downloads
- "mingw32": 3,194,919 downloads
- "US": 3,061,236 downloads
- "linux-gnu": 778,523 downloads
- "darwin17.0": 648,165 downloads
- "GB": 569,535 downloads
- "darwin20": 328,304 downloads
- "CN": 282,214 downloads
- "KR": 254,392 downloads
- "DE": 236,903 downloads
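A top-N selection over key/count pairs can be sketched with a counter and a heap. This is an illustrative sketch, not the original reducer3.py; `top_n` is a hypothetical name, and the approach assumes all pairs reach a single reducer, since a top-N computed independently in several reducers would still need a final merge.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def top_n(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Aggregate 'key<TAB>count' lines and return the n keys with the largest totals."""
    totals: Counter = Counter()
    for line in lines:
        key, sep, count = line.rstrip("\n").partition("\t")
        if sep and count.isdigit():
            totals[key] += int(count)
    # Order by count descending, then key ascending, so ties are deterministic.
    return heapq.nsmallest(n, totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative pairs only; these are not values from the dataset.
pairs = ["a\t2", "b\t5", "c\t2", "b\t1", "d\t1"]
print(top_n(pairs, n=3))
```

Because the keys are aggregated in memory, this fits when the number of distinct keys is modest, which is usually the case for package names.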
Task 4: Most popular package in Ireland
- Command: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/bdm/assignment/input -output /user/bdm/assignment/output4 -file /home/bdm/assignment/mapper.py -file /home/bdm/assignment/reducer4.py -mapper "python3 mapper.py" -reducer "python3 reducer4.py"
- Output: most popular package in Ireland: "mingw32", downloads: 3,194,919
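A per-country package ranking needs a mapper that filters on one column and keys on another. The column descriptions list mingw32 as an r_os value, so one way to make sure only package names reach the reducer is to emit the package field for rows whose country field matches. The sketch below illustrates that idea under the assumed column order; it is not the original mapper.py or reducer4.py, and the function names and the use of "IE" as Ireland's code in this file are assumptions.

```python
import csv
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

PACKAGE_INDEX = 6  # assumed position of `package`
COUNTRY_INDEX = 7  # assumed position of `country`


def map_country_packages(lines: Iterable[str], country: str = "IE") -> Iterator[str]:
    """Mapper: emit 'package<TAB>1' only for rows downloaded from `country`."""
    for fields in csv.reader(lines):
        if len(fields) > COUNTRY_INDEX and fields[COUNTRY_INDEX] == country:
            yield f"{fields[PACKAGE_INDEX]}\t1"


def most_common_key(pairs: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Reducer: return the (key, total) with the highest total, or None if empty."""
    counts: Counter = Counter()
    for line in pairs:
        key, sep, count = line.partition("\t")
        if sep and count.strip().isdigit():
            counts[key] += int(count)
    return counts.most_common(1)[0] if counts else None


# Illustrative rows only; these values are not taken from the dataset.
sample = [
    '"2023-01-01","00:00:05",100,"4.2.2","x86_64","mingw32","ggplot2","IE",1',
    '"2023-01-01","00:00:06",200,"4.2.2","x86_64","mingw32","dplyr","IE",2',
    '"2023-01-01","00:00:07",100,"4.1.0","x86_64","linux-gnu","ggplot2","IE",3',
    '"2023-01-01","00:00:08",200,"4.1.0","x86_64","mingw32","dplyr","US",4',
]
print(most_common_key(map_country_packages(sample)))
```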
Task 5: Most popular operating system among R programmers
- Command: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /user/bdm/assignment/input -output /user/bdm/assignment/output5 -file /home/bdm/assignment/mapper.py -file /home/bdm/assignment/reducer5.py -mapper "python3 mapper.py" -reducer "python3 reducer5.py"
- Output: most popular operating system among R programmers: "mingw32", downloads: 3,194,919
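The r_os column carries version-specific identifiers, so macOS downloads are split across values such as darwin17.0 and darwin20. If the question is about operating system families rather than exact identifiers, a mapper could normalize the value before emitting it. The sketch below shows one such normalization; grouping by family is an assumption about intent and is not what the reported output does, and `os_family` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def os_family(r_os: str) -> str:
    """Map a raw r_os identifier to a coarse operating system family."""
    value = r_os.strip().strip('"').lower()
    if value.startswith("darwin"):
        return "macOS"
    if value.startswith("mingw"):
        return "Windows"
    if value.startswith("linux"):
        return "Linux"
    return "other"  # includes missing values such as NA


def count_families(r_os_values: Iterable[str]) -> Counter:
    """Count downloads per operating system family."""
    return Counter(os_family(value) for value in r_os_values)


# Illustrative values only; the counts are not taken from the dataset.
values = ["mingw32", "darwin17.0", "darwin20", "linux-gnu", "NA", "mingw32"]
print(count_families(values).most_common())
```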




