Dataset: MNIST. Downloading the MNIST dataset (handwritten-digit image recognition, ubyte.gz files) automatically with a Python script

Source: Chongqing Software Legalization Service Center (重庆市软件正版化服务中心)    |    Date: 2022-09-19    |    Views: 64078


Contents

All code for downloading the dataset

1. Main file: mnist_download_main.py

2. mnist.py

3. dataset.py

4. cache.py

5. download.py


All code for downloading the dataset

Code archive: the complete code for downloading the MNIST dataset, mnist_download_main.rar

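1. Main file: mnist_download_main.py

    # 1. Load the data-set.
    # The MNIST data-set is about 12 MB and is downloaded automatically
    # if it is not found in the given path.
    from mnist import MNIST
    data = MNIST(data_dir="data/MNIST/")

    # The data-set consists of 70,000 images and their labels (the class of
    # each image). It is split into three mutually exclusive sub-sets.
    # This tutorial only uses the training-set and the test-set.
    print("Size of:")
    print("- Training-set:\t\t{}".format(data.num_train))
    print("- Validation-set:\t{}".format(data.num_val))
    print("- Test-set:\t\t{}".format(data.num_test))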
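The "download only if the file is missing" behaviour mentioned above can be sketched roughly as follows. This is not the author's download.py: the function name maybe_download, its return value, and the injectable fetch parameter are all assumptions made for illustration, and the only real library call used for fetching is urllib.request.urlretrieve.

```python
import os
import urllib.request


def maybe_download(base_url, filename, download_dir,
                   fetch=urllib.request.urlretrieve):
    """Fetch base_url + filename into download_dir unless it already exists.

    Hypothetical helper for illustration only. `fetch` is any callable
    taking (url, path); it defaults to urllib.request.urlretrieve and can
    be swapped out so the logic runs without network access.

    Returns (path, downloaded), where downloaded is True if a fetch happened.
    """
    path = os.path.join(download_dir, filename)

    # Skip the download when the file is already present locally.
    if os.path.exists(path):
        return path, False

    # Create the target directory on first use, then fetch the file.
    os.makedirs(download_dir, exist_ok=True)
    fetch(base_url + filename, path)
    return path, True
```

Calling maybe_download(base_url, "train-images-idx3-ubyte.gz", "data/MNIST/") twice with the base_url from mnist.py should, under these assumptions, hit the network only on the first call.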

2. mnist.py

    # Downloads the MNIST data-set for recognizing hand-written digits.
    #
    # Implemented in Python 3.6
    #
    # Usage:
    # 1) Create a new object instance: data = MNIST(data_dir="data/MNIST/")
    #    This automatically downloads the files to the given dir.
    # 2) Use the training-set as data.x_train, data.y_train and data.y_train_cls
    # 3) Get random batches of training data using data.random_batch()
    # 4) Use the test-set as data.x_test, data.y_test and data.y_test_cls
    #
    # This file is part of the TensorFlow Tutorials available at:
    # https://github.com/Hvass-Labs/TensorFlow-Tutorials
    #
    # Published under the MIT License. See the file LICENSE for details.
    #
    # Copyright 2016-18 by Magnus Erik Hvass Pedersen

    import numpy as np
    import gzip
    import os
    from dataset import one_hot_encoded
    from download import download

    # Base URL for downloading the data-files from the internet.
    base_url = "https://storage.googleapis.com/cvdf-datasets/mnist/"

    # Filenames for the data-set.
    filename_x_train = "train-images-idx3-ubyte.gz"
    filename_y_train = "train-labels-idx1-ubyte.gz"
    filename_x_test = "t10k-images-idx3-ubyte.gz"
    filename_y_test = "t10k-labels-idx1-ubyte.gz"


    class MNIST:
        """
        The MNIST data-set for recognizing hand-written digits.
        This automatically downloads the data-files if they do
        not already exist in the local data_dir.

        Note: Pixel-values are floats between 0.0 and 1.0.
        """

        # The images are 28 pixels in each dimension.
        img_size = 28

        # The images are stored in one-dimensional arrays of this length.
        img_size_flat = img_size * img_size

        # Tuple with height and width of images used to reshape arrays.
        img_shape = (img_size, img_size)

        # Number of colour channels for the images: 1 channel for gray-scale.
        num_channels = 1

        # Tuple with height, width and depth used to reshape arrays.
        # This is used for reshaping in Keras.
        img_shape_full = (img_size, img_size, num_channels)

        # Number of classes, one class for each of 10 digits.
        num_classes = 10

        def __init__(self, data_dir="data/MNIST/"):
            """
            Load the MNIST data-set. Automatically downloads the files
            if they do not already exist locally.

            :param data_dir: Base-directory for downloading files.
            """

            # Copy args to self.
            self.data_dir = data_dir

            # Number of images in each sub-set.
            self.num_train = 55000
            self.num_val = 5000
            self.num_test = 10000

            # Download / load the training-set.
            x_train = self._load_images(filename=filename_x_train)
            y_train_cls = self._load_cls(filename=filename_y_train)

            # Split the training-set into train / validation.
            # Pixel-values are converted from ints between 0 and 255
            # to floats between 0.0 and 1.0.
            self.x_train = x_train[0:self.num_train] / 255.0
            self.x_val = x_train[self.num_train:] / 255.0
            self.y_train_cls = y_train_cls[0:self.num_train]
            self.y_val_cls = y_train_cls[self.num_train:]

            # Download / load the test-set.
            self.x_test = self._load_images(filename=filename_x_test) / 255.0
            self.y_test_cls = self._load_cls(filename=filename_y_test)

            # Convert the class-numbers from bytes to ints as that is needed
            # some places in TensorFlow.
            # The builtin int is used here because the np.int alias was
            # removed in NumPy 1.24.
            self.y_train_cls = self.y_train_cls.astype(int)
            self.y_val_cls = self.y_val_cls.astype(int)
            self.y_test_cls = self.y_test_cls.astype(int)

            # Convert the integer class-numbers into one-hot encoded arrays.
            self.y_train = one_hot_encoded(class_numbers=self.y_train_cls,
                                           num_classes=self.num_classes)
            self.y_val = one_hot_encoded(class_numbers=self.y_val_cls,
                                         num_classes=self.num_classes)
            self.y_test = one_hot_encoded(class_numbers=self.y_test_cls,
                                          num_classes=self.num_classes)

        def _load_data(self, filename, offset):
            """
            Load the data in the given file. Automatically downloads the file
            if it does not already exist in the data_dir.

            :param filename: Name of the data-file.
            :param offset: Start offset in bytes when reading the data-file.
            :return: The data as a numpy array.
            """

            # Download the file from the internet if it does not exist locally.
            download(base_url=base_url, filename=filename,
                     download_dir=self.data_dir)

            # Read the data-file.
            path = os.path.join(self.data_dir, filename)
            with gzip.open(path, 'rb') as f:
                data = np.frombuffer(f.read(), np.uint8, offset=offset)

            return data

        def _load_images(self, filename):
            """
            Load image-data from the given file.
            Automatically downloads the file if it does not exist locally.

            :param filename: Name of the data-file.
            :return: Numpy array.
            """

            # Read the data as one long array of bytes.
            data = self._load_data(filename=filename, offset=16)

            # Reshape to 2-dim array with shape (num_images, img_size_flat).
            images_flat = data.reshape(-1, self.img_size_flat)

            return images_flat

        def _load_cls(self, filename):
            """
            Load class-numbers from the given file.
            Automatically downloads the file if it does not exist locally.

            :param filename: Name of the data-file.
            :return: Numpy array.
            """
            return self._load_data(filename=filename, offset=8)

        def random_batch(self, batch_size=32):
            """
            Create a random batch of training-data.

            :param batch_size: Number of images in the batch.
            :return: 3 numpy arrays (x, y, y_cls)
            """

            # Create a random index into the training-set.
            idx = np.random.randint(low=0, high=self.num_train, size=batch_size)

            # Use the index to lookup random training-data.
            x_batch = self.x_train[idx]
            y_batch = self.y_train[idx]
            y_batch_cls = self.y_train_cls[idx]

            return x_batch, y_batch, y_batch_cls
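The offsets 16 and 8 passed to _load_data follow from the IDX file layout: a 4-byte magic number, then one 4-byte big-endian integer per dimension, then the raw bytes. The sketch below is a rough illustration on synthetic in-memory data rather than the real files; read_idx and make_idx are hypothetical helper names that are not part of mnist.py.

```python
import gzip
import struct

import numpy as np


def read_idx(raw):
    """Parse uncompressed IDX bytes; return (uint8 array, header length)."""
    # Bytes 0-1 are zero, byte 2 is the data-type code (0x08 = unsigned
    # byte), byte 3 is the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")

    # Each dimension size is stored as a 4-byte big-endian integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])

    # Header length: 4 + 4 * 3 = 16 for images, 4 + 4 * 1 = 8 for labels.
    offset = 4 + 4 * ndim
    data = np.frombuffer(raw, dtype=np.uint8, offset=offset)
    return data.reshape(shape), offset


def make_idx(array):
    """Build IDX bytes from an array (only used to create sample input)."""
    header = struct.pack(">HBB", 0, 0x08, array.ndim)
    header += struct.pack(">" + "I" * array.ndim, *array.shape)
    return header + array.astype(np.uint8).tobytes()


# Synthetic stand-ins for the real files: three 28x28 "images", three labels.
images = np.arange(3 * 28 * 28, dtype=np.uint32).reshape(3, 28, 28) % 256
labels = np.array([7, 2, 1], dtype=np.uint8)

images_gz = gzip.compress(make_idx(images))
labels_gz = gzip.compress(make_idx(labels))

parsed_images, image_offset = read_idx(gzip.decompress(images_gz))
parsed_labels, label_offset = read_idx(gzip.decompress(labels_gz))

# Flatten each image to one row, as _load_images does with img_size_flat.
images_flat = parsed_images.reshape(-1, 28 * 28)

print(image_offset, parsed_images.shape, images_flat.shape)
print(label_offset, parsed_labels.shape)
```

Reading the header instead of hard-coding the offset costs a few lines but also yields the image count and dimensions, which could be used to sanity-check a downloaded file.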

3. dataset.py

    # Class for creating a data-set consisting of all files in a directory.
    #
    # Example usage is shown in the file knifey.py and Tutorial 09.
    #
    # Implemented in Python 3.5
    #
    # This file is part of the TensorFlow Tutorials available at:
    # https://github.com/Hvass-Labs/TensorFlow-Tutorials
    #
    # Published under the MIT License. See the file LICENSE for details.
    #
    # Copyright 2016 by Magnus Erik Hvass Pedersen

    import numpy as np
    import os
    import shutil
    from cache import cache


    def one_hot_encoded(class_numbers, num_classes=None):
        """
        Generate the One-Hot encoded class-labels from an array of integers.

        For example, if class_number=2 and num_classes=4 then
        the one-hot encoded label is the float array: [0. 0. 1. 0.]

        :param class_numbers:
            Array of integers with class-numbers.
            Assume the integers are from zero to num_classes-1 inclusive.

        :param num_classes:
            Number of classes. If None then use max(class_numbers)+1.

        :return:
            2-dim array of shape: [len(class_numbers), num_classes]
        """

        # Find the number of classes if None is provided.
        # Assumes the lowest class-number is zero.
        if num_classes is None:
            num_classes = np.max(class_numbers) + 1

        return np.eye(num_classes, dtype=float)[class_numbers]


    class DataSet:
        def __init__(self, in_dir, exts='.jpg'):
            """
            Create a data-set consisting of the filenames in the given directory
            and sub-dirs that match the given filename-extensions.

            For example, the knifey-spoony data-set (see knifey.py) has the
            following dir-structure:

            knifey-spoony/forky/
            knifey-spoony/knifey/
            knifey-spoony/spoony/
            knifey-spoony/forky/test/
            knifey-spoony/knifey/test/
            knifey-spoony/spoony/test/
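As a small check of the one-hot encoding used by one_hot_encoded in dataset.py, the sketch below indexes the rows of an identity matrix with the class numbers and then recovers the class numbers with argmax. The function one_hot is a standalone stand-in written for this example (so it runs without dataset.py), not the author's code, and the sample labels are made up.

```python
import numpy as np


def one_hot(class_numbers, num_classes=None):
    """Standalone stand-in for one_hot_encoded (illustration only)."""
    class_numbers = np.asarray(class_numbers, dtype=int)

    # Infer the number of classes, assuming the lowest class-number is zero.
    if num_classes is None:
        num_classes = int(class_numbers.max()) + 1

    # Row i of the identity matrix is the one-hot vector for class i, so
    # indexing with the class numbers selects one row per label.
    return np.eye(num_classes, dtype=float)[class_numbers]


labels = np.array([2, 0, 3])          # made-up class numbers
encoded = one_hot(labels, num_classes=4)

# argmax along each row inverts the encoding.
recovered = encoded.argmax(axis=1)

print(encoded)
print(recovered)
```

Passing num_classes explicitly, as mnist.py does, matters whenever a batch might not contain the highest class; otherwise the inferred width would be too small.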
