Published 30 Oct, 2022

Java - Parse delimited file and find column datatypes

Category Java
Modified : Dec 01, 2022
76

Is it possible to parse a delimited file and determine the data type of each column? For example:

Delimited file:

Email,FirstName,DOB,Age,CreateDate
[email protected],Test User1,20/01/2001,24,23/02/2015 14:06:45
[email protected],Test User2,14/02/2001,24,23/02/2015 14:06:45
[email protected],Test User3,15/01/2001,24,23/02/2015 14:06:45
[email protected],Test User4,23/05/2001,24,23/02/2015 14:06:45

Output:

Email datatype: email
FirstName datatype: Text
DOB datatype: date
Age datatype: int
CreateDate datatype: Timestamp

The purpose of this is to read a delimited file and construct a table creation query on the fly and insert data into that table.

I tried using Apache Commons Validator. I believe we need to parse the complete file in order to determine each column's data type.

EDIT: The code that I've tried:

CSVReader csvReader = new CSVReader(new FileReader(fileName),',');
String[] row = null;
int[] colLength = null;
int colCount = 0;
String[] colDataType = null;
String[] colHeaders = null;

String[] header = csvReader.readNext();
if (header != null) {
    colCount = header.length;
}

colLength = new int[colCount];
colDataType = new String[colCount];
colHeaders = new String[colCount];

for (int i=0;i<colCount;i++){
    colHeaders[i]=header[i];
}

int templength=0;
String tempType = null;
IntegerValidator intValidator = new IntegerValidator();
DateValidator dateValidator = new DateValidator();
TimeValidator timeValidator = new TimeValidator();

while ((row = csvReader.readNext()) != null) {
    for (int i = 0; i < colCount; i++) {
        // Track the longest value seen in each column.
        templength = row[i].length();
        colLength[i] = Math.max(templength, colLength[i]);

        // Classify the current cell; the first matching check wins.
        if (colHeaders[i].equalsIgnoreCase("email")) {
            logger.info("Col " + i + " is Email");
        } else if (intValidator.isValid(row[i])) {
            tempType = "Integer";
            logger.info("Col " + i + " is Integer");
        } else if (timeValidator.isValid(row[i])) {
            tempType = "Time";
            logger.info("Col " + i + " is Time");
        } else if (dateValidator.isValid(row[i])) {
            tempType = "Date";
            logger.info("Col " + i + " is Date");
        } else {
            tempType = "Text";
            logger.info("Col " + i + " is Text");
        }

        logger.info(row[i].length() + "");
    }
}

I'm not sure if this is the best way of doing this. Any pointers in the right direction would be helpful.

Answers


46

If you wish to write this yourself rather than use a third-party library, then probably the easiest mechanism is to define a regular expression for each data type and then check whether all fields in a column satisfy it. Here's some sample code to get you started (using Java 8).

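import java.util.Arrays;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public enum DataType {
    // Declaration order is the test order: most specific first, TEXT as the catch-all.
    DATETIME("^\\d\\d/\\d\\d/\\d{4} \\d\\d:\\d\\d:\\d\\d$"),
    DATE("^\\d\\d/\\d\\d/\\d{4}$"),
    // Simplified address shape, used here as a stand-in; adjust as needed.
    EMAIL("^[\\w.+-]+@[\\w.-]+\\.\\w+$"),
    TEXT(".*");

    private final Predicate<String> tester;

    DataType(String regexp) {
        // asPredicate() uses find(), so the specific patterns are anchored with ^ and $.
        tester = Pattern.compile(regexp).asPredicate();
    }

    // Returns the first type whose pattern matches every value in the column.
    public static Optional<DataType> getTypeOfField(String[] fieldValues) {
        return Arrays.stream(values())
            .filter(dt -> Arrays.stream(fieldValues).allMatch(dt.tester))
            .findFirst();
    }
}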

Note that this relies on the order of the enum values (e.g. testing for datetime before date).
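Since the stated goal is to build a table creation query, a per-column check like this could be wired up roughly as follows. This is only a sketch under assumptions: the class name ColumnTypeSketch, the SQL type names, and the VARCHAR(255) default are illustrative choices that do not come from the question, and email columns are simply treated as text because SQL has no email type. An ordered map is used so that the test order is explicit rather than implied.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; names and SQL types are assumptions.
final class ColumnTypeSketch {
    // Insertion order is the test order: most specific first, text last.
    private static final Map<String, Pattern> TYPES = new LinkedHashMap<>();
    static {
        TYPES.put("TIMESTAMP", Pattern.compile("\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}"));
        TYPES.put("DATE", Pattern.compile("\\d{2}/\\d{2}/\\d{4}"));
        TYPES.put("INTEGER", Pattern.compile("-?\\d+"));
        TYPES.put("VARCHAR(255)", Pattern.compile(".*"));
    }

    // Returns the first type whose pattern fully matches every value in the column.
    static String inferType(List<String> columnValues) {
        if (columnValues.isEmpty()) {
            return "VARCHAR(255)"; // no data to infer from
        }
        for (Map.Entry<String, Pattern> entry : TYPES.entrySet()) {
            Pattern pattern = entry.getValue();
            if (columnValues.stream().allMatch(v -> pattern.matcher(v).matches())) {
                return entry.getKey();
            }
        }
        return "VARCHAR(255)"; // fallback, e.g. for values containing line breaks
    }

    // rows.get(0) is the header; every row is assumed to have the same number of fields.
    static String createTable(String tableName, List<String[]> rows) {
        String[] header = rows.get(0);
        List<String> definitions = new ArrayList<>();
        for (int col = 0; col < header.length; col++) {
            List<String> values = new ArrayList<>();
            for (String[] row : rows.subList(1, rows.size())) {
                values.add(row[col]);
            }
            definitions.add(header[col] + " " + inferType(values));
        }
        return "CREATE TABLE " + tableName + " (" + String.join(", ", definitions) + ")";
    }
}
```

Calling ColumnTypeSketch.createTable with the header row followed by the data rows returns a single CREATE TABLE string. Header names are used as column names without escaping, so a real implementation would need to validate or quote them before running the statement.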


32

Yes, it is possible, and you do have to parse the entire file first. Define a set of rules for each data type, then iterate over every value in a column. Start off with every data type as a candidate for the column and eliminate a data type whenever a value violates one of its rules. After iterating over the column, check which data types are left. For example, say we have two data types, integer and text. The rule for integer is that the value must contain only the digits 0-9 and may begin with '-'. Text can be anything.

Our column:

345
-1ab
123

The integer data type would be eliminated by the second row, so the column would be text. If row two were just -1, you would be left with both integer and text, and the column would be integer: text is never eliminated, because our rule says text can be anything. In other words, you don't have to check for text at all; if no other data type is left, the answer is text. Hope this answers your question.
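The elimination idea can be sketched in a few lines of Java. This is a minimal illustration that models only the two types from the example; the names EliminationSketch, Candidate, and inferColumnType are made up for this sketch.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical names; only the two types from the example are modelled.
final class EliminationSketch {
    enum Candidate {
        // Digits 0-9 only, optionally preceded by '-'.
        INTEGER { boolean accepts(String value) { return value.matches("-?[0-9]+"); } },
        // Text can be anything, so it is never eliminated.
        TEXT { boolean accepts(String value) { return true; } };

        abstract boolean accepts(String value);
    }

    static Candidate inferColumnType(List<String> columnValues) {
        // Start with every type as a candidate.
        Set<Candidate> remaining = EnumSet.allOf(Candidate.class);
        for (String value : columnValues) {
            // Drop each candidate whose rule this value violates.
            remaining.removeIf(candidate -> !candidate.accepts(value));
        }
        // EnumSet iterates in declaration order, so the most specific survivor wins.
        return remaining.iterator().next();
    }
}
```

With the column above (345, -1ab, 123) this returns TEXT, and with -1 in place of -1ab it returns INTEGER. Adding another type would mean adding a constant in the right position, since declaration order decides which surviving candidate is preferred.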


23

I needed similar logic for my project. I searched a lot but did not find the right solution. I need to pass a String to a method that returns the data type of its value. I finally found the post from @sprinter, which is close to my logic, but I need to pass a single string instead of a string array.

I modified the code for my needs and posted it below.

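import java.util.Arrays;
import java.util.Optional;
import java.util.regex.Pattern;

public enum DataType {
    // Checked in declaration order; matches() requires the whole string to match.
    DATE("\\d\\d/\\d\\d/\\d{4}"),
    // Gmail addresses only; adjust the domain part for other providers.
    EMAIL(".+@gmail\\..+"),
    NUMBER("[0-9]+"),
    STRING("^[A-Za-z0-9? ,_-]+$");

    private final String regEx;

    public String getRegEx() {
        return regEx;
    }

    DataType(String regEx) {
        this.regEx = regEx;
    }

    // Returns the first type whose pattern matches the whole string, if any.
    public static Optional<DataType> getTypeOfField(String str) {
        return Arrays.stream(DataType.values())
            .filter(dt -> Pattern.compile(dt.getRegEx()).matcher(str).matches())
            .findFirst();
    }
}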

For example:

Optional<DataType> dataType = DataType.getTypeOfField("Bharathiraja");
System.out.println(dataType);
System.out.println(dataType.get());

Output:
Optional[STRING]
STRING

Please note that the regular expression patterns will vary based on your requirements, so modify them as needed rather than using them as they are.
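One requirement that often comes up: a regular expression for a date only checks the shape of the value, so a string such as 99/99/9999 would pass a digits-and-slashes pattern. If stricter validation is needed, one option is to try parsing the value with java.time instead. The sketch below assumes the dd/MM/yyyy layout from the question; the class name StrictDateCheck is made up for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

// Hypothetical helper; assumes day/month/year values such as 20/01/2001.
final class StrictDateCheck {
    // "uuuu" is used instead of "yyyy" because strict resolution of
    // year-of-era ("yyyy") would also require an era field.
    private static final DateTimeFormatter DAY_MONTH_YEAR =
        DateTimeFormatter.ofPattern("dd/MM/uuuu").withResolverStyle(ResolverStyle.STRICT);

    // Returns true only if the value is a real calendar date in dd/MM/yyyy form.
    static boolean isDate(String value) {
        try {
            LocalDate.parse(value, DAY_MONTH_YEAR);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

A check like this could replace the regular expression test for the date type, at the cost of a parse attempt (and a caught exception) for every non-date value.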

Happy coding!