
CRAWDAD mit/reality

DataCite Commons | Updated 2022-12-16 | Indexed 2025-04-16
Download link:
https://ieee-dataport.org/open-access/crawdad-mitreality
Description:
The authors have captured communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004-2005 academic year. This data represents over 350,000 hours (~40 years) of continuous data on human behavior. Such rich data on complex social systems have implications for a variety of fields.

last modified: 2006-11-09
release date: 2005-07-01
date/time of measurement start: 2004-07-26
date/time of measurement end: 2005-05-05

collection environment: Our study consists of one hundred Nokia 6600 smart phones pre-installed with several pieces of software we have developed as well as a version of the Context application from the University of Helsinki. Seventy-five users are either students or faculty in the MIT Media Laboratory, while the remaining twenty-five are incoming students at the MIT Sloan business school adjacent to the laboratory. Of the seventy-five users at the lab, twenty are incoming masters students and five are incoming MIT freshmen.

network configuration: We exploit the fact that modern phones use both a short-range RF network (e.g., Bluetooth) and a long-range RF network (e.g., GSM), and that the two networks can augment each other for location and activity inference. We logged cell tower IDs to determine approximate location and at the same time we logged Bluetooth devices. Bluetooth is a wireless protocol in the 2.40-2.48 GHz range, developed by Ericsson in 1994 and released in 1998 as a serial-cable replacement to connect different devices.

data collection methodology: The information we are collecting includes call logs, Bluetooth devices in proximity, cell tower IDs, application usage, and phone status (such as charging and idle), which comes primarily from the Context application.
The study will generate data collected by one hundred human subjects over the course of nine months and represent approximately 500,000 hours of data on users' location, communication and device usage behavior.

Traceset: mit/reality/blueaware
Traceset of communication, proximity, location, and activity information.

description: The authors have captured communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004-2005 academic year. This data represents over 350,000 hours (~40 years) of continuous data on human behavior.

measurement purpose: Social Network Analysis, Human Behavior Modeling

methodology: Every Bluetooth device is capable of device discovery, which allows it to collect information on other Bluetooth devices within 5-10 meters. This information includes the Bluetooth MAC address (BTID), device name, and device type. The BTID is a hex number unique to the particular device. The device name can be set at the user's discretion; e.g., Tony's Nokia. Finally, the device type is a set of three integers that correspond to the device discovered; e.g., Nokia mobile phone, or IBM laptop.

To log BTIDs we designed a software application, BlueAware, that runs passively in the background on MIDP2-enabled mobile phones. Bluetooth was primarily designed to enable wireless headsets or laptops to connect to phones, but as a byproduct, devices are becoming aware of other Bluetooth devices carried by people nearby. Our application records and timestamps the BTIDs encountered in a proximity log and makes them available to other applications. BlueAware is automatically run in the background when the phone is turned on, making it essentially invisible to the user.

Bluedar was developed to be placed in a social setting and continuously scan for visible devices, wirelessly transmitting detected BTIDs to a server over an 802.11b network.
The heart of the device is a Bluetooth beacon designed by Mat Laibowitz incorporating a class 2 Bluetooth chipset that can be controlled by an XPort web server. We integrated this beacon with an 802.11b wireless bridge and packaged them in an unobtrusive box. An application was written to continuously telnet into multiple Bluedar systems, repeatedly scan for Bluetooth devices, and transmit the discovered proximate BTIDs to our server. Because the Bluetooth chipset is a class 2 device, it is able to detect any visible Bluetooth device within a working range of up to twenty-five meters.

last modified: 2006-10-17
dataname: mit/reality/blueaware
version: 20050701
change: the initial version
release date: 2005-06-01
date/time of measurement start: 2004-07-26
date/time of measurement end: 2005-05-05

limitation:
1. Continually scanning and logging BTIDs can expend an older mobile phone battery in about 18 hours. While continuous scans provide a rich depiction of a user's dynamic environment, most individuals expect phones to have standby times exceeding 48 hours. Therefore BlueAware was modified to scan the environment only once every five minutes, providing at least 36 hours of standby time.
2. While the custom logging application on the phone crashes occasionally (approximately once every week), these crashes fortunately do not result in significant data loss. An additional small application was written to start on boot and continually review the running processes on the phone, verifying that our logging application is always running. Should there be a time where this is not the case, the application is immediately restarted. This functionality also ensures that logging begins immediately once the phone is turned on. However, while this logging application is now fairly robust and can be assumed to be running anytime the phone is on, the dataset generated is certainly not without noise.
3. Because scans run only once every five minutes, shorter proximity events may be missed.

hole:
1. All the data from a phone are stored on a flash memory card, which has a finite number of read-write cycles. Initial versions of our application wrote over the same cells of the memory card. This led to failure of a new card after about a month of data collection, resulting in the complete loss of data. When the application was changed to store the incremental logs in RAM and subsequently write each complete log to the flash memory, our data corruption issues virtually vanished. However, ten cards were lost before this problem was identified, destroying portions of the data collected during the months of September and October for six Sloan students and four Media Lab students.
2. Another source of missing data is powered-off devices. On average we have logs accounting for approximately 85.3% of the time since the phones have been deployed. Less than 5% of this is due to data corruption, while the majority of the missing 14.7% is due to almost one fifth of the subjects turning off their phones at night.
3. There is a small probability (between 1-3% depending on the phone) that a proximate, visible device will not be discovered during a scan. Typically this is due to either a low-level Symbian crash of an application called the "BTServer", or a lapse in the device discovery protocol. The BTServer crashes and restarts approximately once every three days (at a 5 minute scanning interval) and accounts for a small fraction of the total error. However, to detect other subjects, we can leverage the redundancy implicit in the system. Because both of the subjects' phones are actually scanning, the probability of a simultaneous crash or device discovery error is less than 1 in 1000 scans.

error:
1. The ten meter range of Bluetooth, along with the fact that it can penetrate some types of walls, means that people not physically proximate may incorrectly be logged as such.
2. Another error comes from the phone being either explicitly turned off by the user or exhausting its battery. According to our collected survey data, users report exhausting the battery approximately 2.5 times each month. One fifth of our subjects manually turn the phone off on a regular basis during specific contexts such as classes, movies, and (most frequently) when sleeping. Immediately before the phone powers down, the event is timestamped and the most recent log is closed. A new log is created when the phone is restarted and again a timestamp is associated with the event.
3. A more critical source of error occurs when the phone is left on, but not carried by the user. From surveys, we have found that 30% of our subjects claim to never forget their phones, while 40% report forgetting it about once each month, and the remaining 30% state that they forget the phone approximately once each week. Identifying the times where the phone is on, but left at home or in the office, presents a significant challenge when working with the dataset. To grapple with the problem, we have created a 'forgotten phone' classifier. Features included staying in the same location for an extended period of time, charging, and remaining idle through missed phone calls, text messages and alarms. When applied to a subsection of the dataset which had corresponding diary text labels, the classifier was able to identify the day on which the phone was forgotten, but also mislabeled a day when the user stayed home sick. By ignoring both days, we risk throwing out data on outlying days, but have greater certainty that the phone is actually with the user. A significantly harder problem is to determine whether the user has temporarily moved beyond ten meters of his or her office without taking the phone. Empirically, this appears to happen with many subjects on a regular basis and there do not seem to be enough unique features of the event to accurately classify it.
However, this phenomenon does not diminish the extremely strong correlation between detected proximity and self-reported interactions. Lastly, while frequency of proximity within the workplace can be useful, the most salient data comes from detecting a proximity event outside MIT, where temporarily forgetting the phone is less likely to repeatedly occur.

note: In return for the use of the Nokia 6600 phones, students have been asked to fill out web-based surveys regarding their social activities and the people they interact with throughout the day. Comparison of the logs with survey data has given us insight into our dataset's ability to accurately map social network dynamics. Through surveys of approximately forty senior students, we have validated that the reported frequency of (self-reported) interaction is strongly correlated with the number of logged BTIDs (R = 0.78, p = 0.003), and that the dyadic self-report data has a similar correlation with the dyadic proximity data (R = 0.74, p ≈ 0.0001). Additionally, a subset of subjects kept detailed activity diaries over several months. Comparisons revealed no systematic errors with respect to proximity and location, except for omissions due to the phone being turned off.

mit/reality/blueaware Traces

activityscpan: Activity span logs.
configuration: activity span logs
format: oid, endtime, starttime, person_oid
description: Activity span logs.
last modified: 2006-10-17
dataname: mit/reality/blueaware/activityscpan
version: 20050701
change: The initial version
release date: 2005-07-01
date/time of measurement start: 2004-07-26
date/time of measurement end: 2005-05-05

callspan: Call span logs.
configuration: call span logs
format: oid, endtime, starttime, person_oid, phonenumber_oid, callid, contact, description, direction, duration, number, status, remote

"person_oid" refers to the person running the software on their phone, for which this call was logged. It is who this callspan is 'attached' to, and a callspan will always be attached to some person_oid.

"direction" refers to the direction of the call from the perspective of the particular person/cellphone that recorded this callspan (the same person referred to by person_oid). It can be Incoming, Missed Call, or Outgoing.

"phonenumber_oid" refers to the number 'on the other end' of the network, which may be a landline, a cell phone line, or even that phone network's voicemail. In other words, person_oid and phonenumber_oid represent the two ends of the phone call, with the direction of the phone call represented in the direction field.

If you want to utilize all 897921 callspan records, you might want to define these "calls" as between two phonenumbers, instead of as between two persons. The call would then exist between callspan.person_oid's phonenumber_oid and callspan.phonenumber_oid.

In addition, if the callspan records a call between two people who were both running the software as part of the study, a few additional properties hold for the callspan:

For some person src: src.oid = callspan.person_oid (for all calls)
For some person dst: dst.phonenumber_oid = callspan.phonenumber_oid (only for in-network calls)

There should also be a symmetric callspan going in the other direction. For some callspan Y:

Y.person_oid = dst.oid
Y.phonenumber_oid = src.phonenumber_oid

description: Call span logs.
last modified: 2006-10-17
dataname: mit/reality/blueaware/callspan
version: 20050701
change: The initial version
release date: 2005-07-01
date/time of measurement start: 2004-08-03
date/time of measurement end: 2004-12-25

cellspan: Cell span logs.
configuration: cell span logs
format: oid, endtime, starttime, person_oid, celltower_oid
description: Cell span logs.
last modified: 2006-10-17
dataname: mit/reality/blueaware/cellspan
version: 20050701
change: The initial version
release date: 2005-07-01
date/time of measurement start: 2004-07-26
date/time of measurement end: 2005-05-05

coverspan: Cover span logs.
configuration: cover span logs
format: oid, endtime, starttime, person_oid
description: Cover span logs.
last modified: 2006-10-17
dataname: mit/reality/blueaware/coverspan
version: 20050701
change: The initial version
release date: 2005-07-01
date/time of measurement start: 2004-07-27
date/time of measurement end: 2005-05-05

devicespan: Device span logs.
configuration: device span logs
format: oid, endtime, starttime, person_oid, device_oid
description: Device span logs.
last modified: 2006-10-17
dataname: mit/reality/blueaware/devicespan
version: 20050701
change: The initial version
release date: 2005-07-01
date/time of measurement start: 2004-07-26
date/time of measurement end: 2005-05-05
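
The src/dst relations given for callspan can be illustrated with a small sketch. This is not the dataset's own tooling: the `Person` and `CallSpan` classes, the helper names, and the sample records are hypothetical, and the sketch assumes a person table that exposes `oid` and `phonenumber_oid` as the description implies. It also ignores timestamps, so it only checks that a reverse-direction record exists, not that it covers the same call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical person table row: a study participant and their own number.
    oid: int
    phonenumber_oid: int


@dataclass(frozen=True)
class CallSpan:
    # Subset of the documented callspan fields that the relations use.
    oid: int
    person_oid: int        # participant whose phone logged the call
    phonenumber_oid: int   # number on the other end of the call
    direction: str         # "Incoming", "Missed Call", or "Outgoing"


def in_network_calls(callspans, persons):
    """Yield (callspan, src, dst) where the far end is also a participant."""
    by_oid = {p.oid: p for p in persons}
    by_number = {p.phonenumber_oid: p for p in persons}
    for call in callspans:
        src = by_oid.get(call.person_oid)          # src.oid = callspan.person_oid
        dst = by_number.get(call.phonenumber_oid)  # dst.phonenumber_oid = callspan.phonenumber_oid
        if src is not None and dst is not None:
            yield call, src, dst


def symmetric_records(call, src, dst, callspans):
    """Return callspans Y logged by dst whose far end is src's number."""
    return [
        y for y in callspans
        if y.person_oid == dst.oid and y.phonenumber_oid == src.phonenumber_oid
    ]


# Made-up sample data: participants 1 and 2 call each other; 999 is an outside number.
persons = [Person(oid=1, phonenumber_oid=101), Person(oid=2, phonenumber_oid=102)]
callspans = [
    CallSpan(oid=10, person_oid=1, phonenumber_oid=102, direction="Outgoing"),
    CallSpan(oid=11, person_oid=2, phonenumber_oid=101, direction="Incoming"),
    CallSpan(oid=12, person_oid=1, phonenumber_oid=999, direction="Outgoing"),
]

pairs = list(in_network_calls(callspans, persons))
for call, src, dst in pairs:
    mirrors = symmetric_records(call, src, dst, callspans)
    print(call.oid, "->", [y.oid for y in mirrors])
```

In a real analysis the reverse-direction match would likely also need overlapping starttime/endtime values, since two participants may call each other many times.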

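
Two of the sampling caveats in the blueaware notes (short proximity events missed by the five-minute scan, and the 1-3% per-scan miss rate offset by both phones scanning) can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: scans are treated as instantaneous, an event's start is uniformly random relative to the scan schedule, and the two phones fail independently. None of these assumptions comes from the dataset description, and the function names are hypothetical.

```python
SCAN_INTERVAL_S = 5 * 60  # BlueAware scans once every five minutes


def detection_probability(event_s: float, interval_s: float = SCAN_INTERVAL_S) -> float:
    """Chance that at least one scan falls inside a proximity event.

    Assumes instantaneous scans at a fixed interval and an event start
    that is uniformly random relative to the scan schedule.
    """
    if event_s < 0 or interval_s <= 0:
        raise ValueError("event_s must be >= 0 and interval_s must be > 0")
    return min(1.0, event_s / interval_s)


def joint_miss_probability(p_a: float, p_b: float) -> float:
    """Chance that both phones miss each other in the same scan.

    Assumes the two phones' failures are independent.
    """
    return p_a * p_b


for minutes in (1, 2.5, 5):
    p = detection_probability(minutes * 60)
    print(f"{minutes} min event: detected with probability {p:.2f}")

for p_miss in (0.01, 0.03):
    joint = joint_miss_probability(p_miss, p_miss)
    print(f"per-phone miss {p_miss:.0%}: joint miss {joint:.4%}")
```

Under the independence assumption, even the 3% upper bound gives a joint miss rate below 1 in 1000, which is consistent with the figure quoted in the hole notes.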
Provider:
IEEE DataPort
Created:
2022-12-16